Tuesday, February 22, 2011

Let's do Some Design & Code

There are many data types one can use when working with HDF5, but the base for all of them is the 'DataSet'. You can store your raw data in a dataset, in a table structure or in a packet table. We will look at them one by one. Before starting, I'll show you the structure we are going to create in our file.




Here a 'Circle' represents a group and a 'Rectangle' stands for a dataset, the container in which we are going to put our data. In this post I describe my experiences, my mistakes and how I dealt with them to arrive at better solutions for my problems.

The root node is our file's default node/group, under which all the branches we create while building the HDF5 file will hang. If you are familiar with B-trees, working with HDF5 will be easier. The file can be created using the following API:

hid_t fileID = H5Fcreate( "../../FileName.h5", H5F_ACC_TRUNC, H5P_DEFAULT, 
H5P_DEFAULT);

More APIs related to file properties can be found here.

For the sake of a naming convention, and to make this easier to follow, I have named the newly created group 'MyGroup'. Now we'll see how to create it with the HDF5 API. The HDF5 APIs to create, open and delete a group are collected under H5G.

/* Create group 'MyGroup' under 'root' node */
hid_t groupID = H5Gcreate(fileID, "MyGroup", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

H5D is the next collection; it holds all the APIs required to work with datasets (our data containers). What we need now is to create a dataset under the group we have already created, 'MyGroup'. Let's name our dataset 'DataSet1'.

/* Create the dataset under 'MyGroup' */
hid_t dset = H5Dcreate (groupID, "DataSet1", filetype, filespace, H5P_DEFAULT, cparms, H5P_DEFAULT);

If you look at the parameters here, they may seem confusing. For an initial understanding, just assume that I have defined a few properties for our dataset, such as the data type and whether chunking is enabled.

Remember, I started out by creating a new dataset for every new set of data. After testing this by running it in a thread that writes data to the file, I found that the number of datasets I could create under one group was limited: I could not create more than 65556 datasets in a single group. This was a very big problem for me. I had almost lost everything I had spent so much time on, and I could not backtrack.

To work around this problem I decided to try another approach: instead of creating a whole new dataset for every new set of data, use only one. For this I needed to understand more of the internals of datasets. Finally, I found that I can keep using the same dataset, provided I extend its dimensions each time I write new data to it. This was another reason why I chose C to program HDF5, as these APIs were only available in C/C++ and not in .NET or Java.

Before putting it all together, I should let you know that I am going to write variable-length data to the dataset.
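With a variable-length type (created with H5Tvlen_create), each element handed to HDF5 for writing is an hvl_t, a small struct with a length field (len) and a pointer field (p). The buffer I pass to the write call is not shown in this post, so here is only a minimal sketch of how one such record could be packaged. It assumes byte-sized elements; vl_record is a stand-in with the same two fields as hvl_t so that the sketch compiles without hdf5.h, and make_record and free_record are hypothetical helper names, not HDF5 functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in with the same two fields as HDF5's hvl_t (len, p); defined
 * here only so the sketch compiles without hdf5.h. */
typedef struct {
    size_t len;  /* number of elements in this record */
    void  *p;    /* pointer to the record's elements */
} vl_record;

/* Hypothetical helper: copy n bytes into a freshly allocated record.
 * For n == 0, or on allocation failure, the record has len 0 and p NULL. */
static vl_record make_record(const unsigned char *bytes, size_t n)
{
    vl_record rec = {0, NULL};
    assert(bytes != NULL || n == 0);  /* a non-empty record needs a source */
    if (n == 0)
        return rec;
    rec.p = malloc(n);
    if (rec.p != NULL) {
        memcpy(rec.p, bytes, n);
        rec.len = n;
    }
    return rec;
}

/* Hypothetical helper: release the memory owned by a record. */
static void free_record(vl_record *rec)
{
    free(rec->p);
    rec->p = NULL;
    rec->len = 0;
}
```

In real code the struct would be hvl_t itself, and the address of the record (or of an array of records) would be the data argument of the write call.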

/* 
 * Create a new file using the default properties.
 */
file = H5Fcreate (FILE, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

/* 
 * Modify dataset creation properties, i.e. enable chunking
 */
cparms = H5Pcreate (H5P_DATASET_CREATE);
status = H5Pset_chunk (cparms, 1, dims);

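/*
 * Create dataspace.  The dataset has to grow later, so the maximum
 * size is set to H5S_UNLIMITED.  Passing NULL instead would fix the
 * maximum size at the current size, and extending would then fail.
 */
hsize_t maxdims[1] = {H5S_UNLIMITED};
filespace = H5Screate_simple (1, dims, maxdims);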

memspace = H5Screate_simple (1, dims, NULL);


/*
 * Create file and memory datatypes. 
 */
filetype = H5Tvlen_create(H5T_NATIVE_UCHAR);

memtype = H5Tvlen_create(H5T_NATIVE_UCHAR);


/*
 * Create the dataset
 */
dset = H5Dcreate (file, "DataSet1", filetype, filespace, H5P_DEFAULT, cparms, H5P_DEFAULT);

At this point we have created the file and the dataset, and we are ready to write data into it.

Now, the next step was to write variable-length data in a thread/loop. This was actually an interesting part for me, as I wanted to see how the dimensions of the dataset change at runtime. After reading more, I understood that there is one more concept I needed to consider: the hyperslab, a selection of the region of the dataset that a write touches.

So, my 'Write' function was pretty simple:

extdSize[0] = recordNumber + 1;
/* 
 * Extend Dataset (H5Dset_extent supersedes the deprecated H5Dextend)
 */
status = H5Dset_extent(dset, extdSize);

/* 
 * Define memory space 
 */
memspace = H5Screate_simple (1, dims, NULL);

and finally, the long-awaited write:

/* The dataset is one-dimensional, so only offset[0] is needed */
offset[0] = recordNumber;

/* 
 * Select a hyperslab  
 */
filespace = H5Dget_space (dset);

status = H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL,
    dims, NULL);

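/* 
 * Write the variable-length data to the dataset.
 */
status = H5Dwrite (dset, memtype, memspace, filespace, H5P_DEFAULT, data);

/*
 * Release the dataspaces created for this write, so that identifiers
 * do not pile up when this runs in a loop.
 */
H5Sclose (filespace);
H5Sclose (memspace);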
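The bookkeeping behind this write can be easy to lose among the API calls: for record number n the dataset is extended to n + 1 elements, and the selection starts at offset n and covers one element. The following is a rough in-memory model of that sequence, not HDF5 code. model_dataset, model_extend and model_append are hypothetical names, and each record is a plain int to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical in-memory model of a one-dimensional, extendible
 * dataset; this is not an HDF5 type. */
typedef struct {
    size_t extent;   /* current number of records (the dimension) */
    int   *records;  /* one int per record, for simplicity */
} model_dataset;

/* Grow the model to new_extent records, like extending the dataset.
 * Returns 0 on success, -1 on failure. */
static int model_extend(model_dataset *ds, size_t new_extent)
{
    if (new_extent <= ds->extent)
        return 0;  /* already large enough */
    int *grown = realloc(ds->records, new_extent * sizeof *grown);
    if (grown == NULL)
        return -1;
    for (size_t i = ds->extent; i < new_extent; i++)
        grown[i] = 0;  /* new slots start out empty */
    ds->records = grown;
    ds->extent = new_extent;
    return 0;
}

/* Append record number recordNumber: extend to recordNumber + 1, then
 * write one element at offset recordNumber (the selected region). */
static int model_append(model_dataset *ds, size_t recordNumber, int value)
{
    if (model_extend(ds, recordNumber + 1) != 0)
        return -1;
    size_t offset = recordNumber;  /* start of the selection */
    size_t count  = 1;             /* one record per write */
    assert(offset + count <= ds->extent);  /* selection fits the extent */
    ds->records[offset] = value;
    return 0;
}
```

Starting from an empty model (extent 0, records NULL), calling model_append with record numbers 0, 1, 2 leaves an extent of 3 with each value in its own slot, which mirrors how the dataset's dimension grows by one with every write.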





