
I think Mathematica does not handle .mat files very well, while Python is pretty good at this kind of job. On this site there are plenty of questions about calling Mathematica from Python, or calling Python from Mathematica, but I believe the answers only work for very small data, on the order of hundreds of numbers. I am currently working with data somewhat larger than that; a typical size is about (40960, 256). So I want to ask: is there a way to transfer data of this size from one to the other? Or is there a better .mat file interface?

The following is a sample of what is inside the .mat file.

[Image: matfile]

Data transfer between R and Python can be done seamlessly; see rpy2 for more information. Maybe RLink can be used in this case.

Update

Handling .mat files in Mathematica is not as fast as in Python. However, thanks to the linked page suggested by @The Toad, the process is now much better.

Here are some tests:

In Python:

%time matlab.whosmat('test.mat')
CPU times: user 152 ms, sys: 10.2 ms, total: 162 ms
Wall time: 164 ms
Out[153]: 
[('exp03', (20480, 2), 'double'),
 ('exp22', (22528, 2), 'double'),
 ('exp09', (22528, 2), 'double'),
 ('exp31', (22528, 2), 'double'),
 ...

In Mathematica:

In[166]:= Timing@Import["~/test.mat", {"MAT", "Labels"}]

Out[166]= {0.851815, {"exp03", "exp22", "exp09", "exp31", "exp30", 
  "exp33", "exp32", "exp35", "exp34", "exp19", "exp18", "exp17", 
  "exp16", "exp15", "exp14", "exp13", "exp12", "exp11", "exp10", 
  "exp37", "exp36", "exp02", "Data", "exp08", "exp23", "exp20", 
  "exp21", "exp26", "exp27", "exp24", "exp25", "exp01", "exp28", 
  "exp29", "exp04", "exp05", "exp06", "exp07"}}

In Python:

%time data=matlab.loadmat('test.mat')
CPU times: user 199 ms, sys: 28.2 ms, total: 228 ms
Wall time: 236 ms

In Mathematica:

In[168]:= Timing@Import["~/test.mat", {"MAT", "LabeledData"}];

In[169]:= %[[1]]

Out[169]= 0.838842

For another file:

In Python:

%time data=matlab.loadmat('test2.mat')
CPU times: user 3.55 s, sys: 381 ms, total: 3.93 s
Wall time: 3.98 s

In Mathematica:

In[1]:= Timing@Import["~/test2.mat", {"MAT", "LabeledData"}];

In[2]:= First[%]

Out[2]= 12.558158
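
For reference, the Python side of the timings above can be sketched as follows. This is only a minimal sketch: it assumes that `matlab` in the session above refers to scipy.io (or scipy.io.matlab), and it builds a small stand-in file called demo.mat (a hypothetical name) instead of using the real test.mat. It also shows the `variable_names` option of loadmat, which loads a file variable by variable.

```python
import os
import tempfile

import numpy as np
from scipy.io import loadmat, savemat, whosmat

# Build a small stand-in for test.mat; names and shapes are made up for the demo.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "demo.mat")
savemat(path, {
    "exp01": np.random.rand(2048, 2),
    "exp02": np.random.rand(1024, 2),
    "Data": np.random.rand(2048, 37),
})

# whosmat lists (name, shape, class) for each variable in the file.
listing = whosmat(path)
for name, shape, cls in listing:
    print(name, shape, cls)

# variable_names restricts loadmat to the requested variables only.
subset = loadmat(path, variable_names=["exp01"])
exp01 = subset["exp01"]
print(exp01.shape)
```

Loading only the variables that are needed avoids holding the whole file in memory when it stores many experiments.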
Kattern
  • @Jens Yes, that is a possible solution. However, I would have to export the data and import it into the other program every time, and frequent exporting and importing is very slow. – Kattern Jun 21 '15 at 04:24
  • You can import MAT files in Mathematica. Have you tried that? – rm -rf Jun 21 '15 at 04:34
  • @TheToad Yes, but the user experience is terrible. The imported results are very hard to manipulate, especially for a .mat file that stores a lot of variables. Just looking at what is in it with st = SessionTime[]; Print@Short[test]; SessionTime[] - st takes 15 seconds. – Kattern Jun 21 '15 at 04:44
  • @TheToad Maybe I should Import the file variable by variable. – Kattern Jun 21 '15 at 04:51
  • Maybe I don't understand what you want, but it sounds like you could really benefit by using HDF5 format as I suggest in the answer linked above. It would just be a replacement for what you're already doing with MAT files, it seems. And I have yet to see any file format that is faster to import and export than HDF5. – Jens Jun 21 '15 at 05:23
  • @TheToad Thanks for the linked answer. MMA is still not as fast, but it is at least much better than what I got earlier. If you are interested in the performance of MMA, I have added some timing results to the question. – Kattern Jun 21 '15 at 05:41
  • @Jens I will check the performance of MMA with HDF5, since it is pretty easy to convert files from .mat to HDF5. – Kattern Jun 21 '15 at 05:43
  • @Jens HDF5 is impressive; I have added some timing results to the question. Maybe I should convert all the data files to .h5. – Kattern Jun 21 '15 at 06:32
  • @Jens Do you think I should move the timing results to an answer? HDF5 seems to be a solution to the problem. – Kattern Jun 21 '15 at 06:40
  • You could write your section about hdf5 as an answer? – chris Jun 21 '15 at 07:41
  • You might want to try FITS as well? – chris Jun 21 '15 at 07:42
  • @chris I never thought of FITS as a format for data storage. I will check this option out. – Kattern Jun 21 '15 at 08:24
  • @chris FITS seems to be used mostly with image data. Can you please clarify how to use it to store groups of matrix data? – Kattern Jun 21 '15 at 08:42
  • For best performance use no formatting at all. Write out raw binary and read it into Mathematica with ReadList – george2079 Jun 21 '15 at 12:53
  • @george2079 Saving raw binary is quite risky, because these files will be read on other computers and by other software. I think it is better to store metadata with the data. – Kattern Jun 21 '15 at 15:15

1 Answer


As suggested by @Jens, HDF5 files can be imported and manipulated quickly in Mathematica. Importing HDF5 in Mathematica is about as efficient as reading a MAT file in Python, and you can read only part of an HDF5 file into memory. According to the timings in the question, Mathematica is about 3 to 4 times slower than Python at reading MAT files, whereas its speed at reading HDF5 files is very close to the speed in Python.
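
To illustrate the point about reading only part of an HDF5 file, here is a minimal sketch on the Python side using h5py. The file name demo.h5, the dataset name Data, and the array size are made up for the demonstration; they are not the files timed in this answer.

```python
import os
import tempfile

import h5py
import numpy as np

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "demo.h5")

# Write a stand-in dataset; 4096 x 256 keeps the demo small.
full = np.arange(4096 * 256, dtype=np.float64).reshape(4096, 256)
with h5py.File(path, "w") as f:
    f.create_dataset("Data", data=full)

# Slicing an h5py dataset reads only the requested block from disk,
# not the whole array.
with h5py.File(path, "r") as f:
    dset = f["Data"]
    full_shape = dset.shape
    block = dset[100:110, :8]

print(full_shape, block.shape)
```

The dataset object itself is only a handle, so its shape can be inspected without loading any of the data.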

Performance of HDF5

As suggested by @Jens, I tested the performance of HDF5. I have to say it is impressive: the performance is close to that of reading .mat files in Python (because .mat files get more compression, the read speed is a little slower). Since I do not know much about how to convert a .mat file to a version 7.3 .mat file, I converted the .mat files directly to HDF5 files. Here are some timing results:

For the first file:

In[1]:= fname = "test.h5";
dat = Timing@Import[fname]

Out[2]= {0.109260, {"/Data", "/__globals__", "/exp01", "/exp02", 
  "/exp03", "/exp04", "/exp05", "/exp06", "/exp07", "/exp08", 
  "/exp09", "/exp10", "/exp11", "/exp12", "/exp13", "/exp14", 
  "/exp15", "/exp16", "/exp17", "/exp18", "/exp19", "/exp20", 
  "/exp21", "/exp22", "/exp23", "/exp24", "/exp25", "/exp26", 
  "/exp27", "/exp28", "/exp29", "/exp30", "/exp31", "/exp32", 
  "/exp33", "/exp34", "/exp35", "/exp36", "/exp37"}}

In[3]:= dat = Timing@Import[fname, "Data"];
First[dat]

Out[4]= 0.142046

In[5]:= dat = Timing@Import[fname, {"Datasets", "Data"}];

In[6]:= dat[[1]]

Out[6]= 0.036610

In[7]:= Dimensions@dat[[2]]

Out[7]= {20480, 37}

For the second file:

In[1]:= fname = "test2.h5";
dat = Timing@Import[fname]

Out[2]= {0.111127, {"/Data", "/Spec", "/__globals__", "/exp001", 
  "/exp002", "/exp003", "/exp004", "/exp005", "/exp006", "/exp007", 
  "/exp008", "/exp009", "/exp010", "/exp011", "/exp012", "/exp013", 
  "/exp014", "/exp015", "/exp016", "/exp017", "/exp018", "/exp...}}

In[3]:= dat = Timing@Import[fname, "Data"];
First[dat]

Out[4]= 2.565427

In[5]:= dat = Timing@Import[fname, {"Datasets", "Data"}];

In[6]:= dat[[1]]

Out[6]= 0.143031

In[7]:= Dimensions@dat[[2]]

Out[7]= {40960, 246}

Conversion of MAT files to HDF5

The remaining problem with this solution is how to convert MAT files to HDF5; compatibility with newer versions of Matlab is a plus.

Converting a MAT file to HDF5 can be done using h5py in Python. I would recommend hdf5storage, which is easy to use and has options very similar to those of savemat in scipy. The resulting HDF5 file can also be loaded in Matlab using the load command.

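import hdf5storage as h
from scipy.io import loadmat

# Read the original MAT file into a dict of NumPy arrays.
data = loadmat('test.mat')
# Write the dict back out as an HDF5-based MAT file.
h.savemat('t.mat', data)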
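
For the h5py route mentioned above, a conversion could look roughly like the following. This is only a sketch under some assumptions: the helper name mat_to_hdf5 is hypothetical, only plain numeric arrays are copied (structs and cell arrays are skipped), and the result is a plain HDF5 file, which is not necessarily what Matlab's load command expects.

```python
import os
import tempfile

import h5py
import numpy as np
from scipy.io import loadmat, savemat


def mat_to_hdf5(mat_path, h5_path, compression=None):
    """Copy every numeric array in a MAT file into an HDF5 file (hypothetical helper)."""
    data = loadmat(mat_path)
    written = []
    with h5py.File(h5_path, "w") as f:
        for name, value in data.items():
            # Skip loadmat bookkeeping entries such as __header__.
            if name.startswith("__"):
                continue
            # Only plain numeric arrays are handled in this sketch.
            if isinstance(value, np.ndarray) and value.dtype.kind in "biufc":
                f.create_dataset(name, data=value, compression=compression)
                written.append(name)
    return written


# Demo on a small stand-in file (names and shapes are made up).
tmpdir = tempfile.mkdtemp()
mat_path = os.path.join(tmpdir, "demo.mat")
h5_path = os.path.join(tmpdir, "demo.h5")
savemat(mat_path, {"exp01": np.random.rand(512, 2), "Data": np.random.rand(512, 37)})

names = mat_to_hdf5(mat_path, h5_path, compression="gzip")
with h5py.File(h5_path, "r") as f:
    shapes = {name: f[name].shape for name in names}
print(shapes)
```

Passing compression="gzip" here uses h5py's own dataset option, so this route does not depend on the hdf5storage defaults.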

Currently, hdf5storage does not support compression. The issue was reported by the author herself last year. Until it is implemented, here is a workaround: changing the default options in __init__.py (L192-L196) makes it always compress the file.

    # Set the h5py options to use for writing scalars and arrays to
    # blank for now.

    self.scalar_options = dict()
    self.array_options = {'compression':'gzip'}  

I will contact the author to fix this issue ASAP.
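
The 'compression' key in the workaround above is an h5py dataset option. Its effect can be checked with h5py directly, as in this minimal sketch; the file names are made up, and the data is deliberately repetitive so that the effect is visible, so the size ratio says nothing about real measurement data.

```python
import os
import tempfile

import h5py
import numpy as np

tmpdir = tempfile.mkdtemp()
raw_path = os.path.join(tmpdir, "raw.h5")
gz_path = os.path.join(tmpdir, "gzip.h5")

# Highly repetitive data, chosen so that compression has a visible effect.
data = np.tile(np.arange(256, dtype=np.float64), (2048, 1))

# Same array written once without and once with gzip compression.
with h5py.File(raw_path, "w") as f:
    f.create_dataset("Data", data=data)
with h5py.File(gz_path, "w") as f:
    f.create_dataset("Data", data=data, compression="gzip", compression_opts=4)

raw_size = os.path.getsize(raw_path)
gz_size = os.path.getsize(gz_path)
print(raw_size, gz_size)

# Decompression is transparent when reading the dataset back.
with h5py.File(gz_path, "r") as f:
    restored = f["Data"][...]
```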

Kattern