
I have a large file with irregularly spaced data in the format:

{{"2012/08/06 21:05:22", 29}, {"2012/08/06 21:10:14", 
  28}, {"2012/08/06 21:15:12", 29}, {"2012/08/06 21:20:14", 
  29}, {"2012/08/06 21:30:12", 28}, {"2012/08/06 21:35:12", 
  28}, {"2012/08/06 21:40:13", 30}, {"2012/08/06 21:45:13", 
  30}, {"2012/08/06 22:00:13", 29}, {"2012/08/06 22:05:08", 28}}

The whole file is about 100,000 lines long and can be downloaded here: 2.4 MB CSV file

I have converted the time strings with AbsoluteTime and removed comments and empty lines:

(* Import the CSV, drop the header row, strip "#..." comments from string
   cells, and average the remaining readings in each row. Cells that were
   pure comments become Null after ToExpression and are removed; rows with
   no surviving readings reduce to Mean[{}] and are dropped entirely. *)
imp = DeleteCases[
   {#[[1]],
      Mean[DeleteCases[
        ToExpression[
         If[Head[#] === String,
            StringReplace[#, "#" ~~ __ -> ""], #] & /@ #[[2 ;; -1]]],
        Null]]} & /@ Rest[Import["dht22.csv", "Data"]],
   {_, Mean[{}]}];

(* convert the timestamp strings to absolute times (seconds) *)
absTimp = {AbsoluteTime[#[[1]]], #[[2]]} & /@ imp;
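To illustrate the cleaning step on a single hypothetical row (the readings and comments here are made up):

(* made-up row: one reading carries a trailing "#" comment, one cell is a
   pure comment, which ToExpression[""] turns into Null *)
row = {"2012/08/06 21:05:22", "29 # spike", "28", "# sensor offline"};
Mean[DeleteCases[
  ToExpression[
   If[Head[#] === String, StringReplace[#, "#" ~~ __ -> ""], #] & /@
    Rest[row]], Null]]
(* -> 57/2 *)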

Now I want to calculate hourly means of this data. This is problematic because there can be hours or even days in which no data was recorded, and the readings do not fall exactly on the hour; they lag by some seconds or minutes. So I want to interpolate the data and calculate hourly means of the interpolated function. I have this code:

interpolation = Interpolation[absTimp, InterpolationOrder -> 1]
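The interpolant can now be evaluated at arbitrary times, including between samples, e.g.:

interpolation[absTimp[[1, 1]] + 150]  (* value 150 s after the first sample *)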

(* round the first timestamp up to the next full hour, e.g. 21:05:22 -> 22:00:00 *)
timeFirst = imp[[1, 1]];
listFirst = DateList[timeFirst];
start = AbsoluteTime[Join[listFirst[[1 ;; 3]], {listFirst[[4]] + 1, 0, 0}]];

(* round the last timestamp down and step back one more hour, so the final
   interval [end, end + 3600] still lies inside the data *)
timeLast = imp[[-1, 1]];
listLast = DateList[timeLast];
end = AbsoluteTime[Join[listLast[[1 ;; 3]], {listLast[[4]] - 1, 0, 0}]];

(* hourly means: integrate the interpolant over each hour, then divide by 3600 *)
AbsoluteTiming[
 Table[Integrate[interpolation[x], {x, t, t + 3600}],
   {t, start, Min[end, start + 3600*24] (* shorter range for testing *),
    3600}]/3600]

This calculation of 25 hourly means takes 2 seconds with Integrate and 3.5 seconds with NIntegrate on my computer, so it is clearly unusable for my full data set. How can I improve the speed of this calculation?
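For reference, with InterpolationOrder -> 1 the integral over a complete interval between adjacent samples is just the trapezoid area, which can be checked on the first two samples (the comments below build on this observation):

{{t1, v1}, {t2, v2}} = absTimp[[1 ;; 2]];
Chop[Integrate[interpolation[x], {x, t1, t2}] - (t2 - t1) (v1 + v2)/2]
(* -> 0 *)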

shrx

  • Probably the best way is to take the integral directly as I showed here (look also at Michael's answer). – Alexey Popkov Jan 06 '14 at 18:01
  • I have already looked at that question. If by direct integration you mean the MovingAverage/Differences method, I don't think it's usable, because the bounds of my integration intervals usually fall between data points. – shrx Jan 06 '14 at 18:05
  • It means only that you have to consider the first and last semi-intervals separately, which is not difficult. For the rest (i.e. the biggest part of the integration range) the MovingAverage/Differences approach will give you a huge speedup (you can also Compile it for even better performance). – Alexey Popkov Jan 06 '14 at 18:18
  • I'm afraid I don't understand what you mean, could you explain a bit better how I should do this? – shrx Jan 06 '14 at 18:27
  • You say that the integration range bounds lie between your data points. That means you have a little problem with the first and last intervals (by "interval" I mean exactly an interval between adjacent data points) but not with the other intervals in the range of integration. So you should dissect your integral into 3 integrals: one for the first interval (which, as you say, is not a complete interval, so I call it a semi-interval), one for the last, and one for all the other intervals (a sketch follows these comments). – Alexey Popkov Jan 06 '14 at 18:37
  • Oh, right, I was confused about the meaning of intervals. – shrx Jan 06 '14 at 18:38
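A minimal sketch of the dissection Alexey describes, assuming InterpolationOrder -> 1 so that every piece of the integral is an exact trapezoid area; hourlyMean and its argument layout are hypothetical, not code from the thread:

(* Mean of the order-1 interpolant f over [a, b]: trapezoid areas between
   the data points strictly inside [a, b], plus the two semi-intervals at
   the ends, whose endpoint values come from the interpolating function. *)
hourlyMean[f_, data_][a_, b_] := Module[{inner, ts, vs},
  inner = Select[data, a < First[#] < b &];  (* breakpoints inside [a, b] *)
  ts = Join[{a}, inner[[All, 1]], {b}];
  vs = Join[{f[a]}, inner[[All, 2]], {f[b]}];
  Total[MovingAverage[vs, 2] Differences[ts]]/(b - a)]

(* hourly means over the whole range *)
means = hourlyMean[interpolation, absTimp] @@@
   Partition[Range[start, end + 3600, 3600], 2, 1];

The Select call rescans the full data set for every hour; a single pass over the sorted data (or Compile, as suggested above) would be faster still, but even this form avoids the symbolic Integrate call that dominates the original timing.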
