
I would like to read multiple text files and build a sparse array by concatenating them row-wise, one after another. The text files are formatted as follows:

$$\begin{matrix} 0, & 0, & 0, & \dots & 0, \\ 0, & 1, & 3, & \dots & 0, \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0, & 1, & 0, & \dots & 0, \end{matrix}$$
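
For concreteness, here is a hypothetical toy file in that format (not from the original post), which some of the checks further down can use:

(* Write a small comma-separated 0/1 matrix to a temporary file; this stands
   in for one of the real text files, which are not available here. *)
toyFile = FileNameJoin[{$TemporaryDirectory, "textFILE0.txt"}];
Export[toyFile, RandomInteger[{0, 1}, {5, 10}], "CSV"];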

I read each text file as a sparse array using the functions from this question:

ClearAll[spart, getIC, getJR, getSparseData, getDefaultElement, makeSparseArray];
HoldPattern[spart[SparseArray[s___], p_]] := {s}[[p]];
getIC[s_SparseArray] := spart[s, 4][[2, 1]];
getJR[s_SparseArray] := Flatten@spart[s, 4][[2, 2]];
getSparseData[s_SparseArray] := spart[s, 4][[3]];
getDefaultElement[s_SparseArray] := spart[s, 3];
makeSparseArray[dims : {_, _}, jc : {__Integer}, ir : {__Integer}, 
data_List, defElem_: 0] := 
SparseArray @@ {Automatic, dims, defElem, {1, {jc, List /@ ir}, data}};
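
As a small illustration (not part of the original post), these helpers take a sparse matrix apart into its row pointers, column indices, and nonzero values, and makeSparseArray reassembles it:

s = SparseArray[{{0, 1, 0}, {2, 0, 3}}];
getIC[s]           (* row pointers: {0, 1, 3} *)
getJR[s]           (* column indices of the nonzeros: {2, 1, 3} *)
getSparseData[s]   (* nonzero values: {1, 2, 3} *)
Normal[makeSparseArray[Dimensions[s], getIC[s], getJR[s], getSparseData[s]]] === Normal[s]
(* True *)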

Clear[readSparseTable];
readSparseTable[file_String?FileExistsQ, chunkSize_: 100] :=
  Module[{stream, dataChunk, start, ic = {}, jr = {}, sparseData = {}, getDataChunkCode, dims},
   stream = StringToStream[Import[file, "String"]];
   getDataChunkCode := If[# === {}, {}, SparseArray[#]] &@
     ImportString[
      StringJoin[Riffle[ReadList[stream, "String", chunkSize], "\n"]],
      "Table", "FieldSeparators" -> ","];
   Internal`WithLocalSettings[Null,
    (* main code *)
    start = getDataChunkCode;
    ic = getIC[start];
    jr = getJR[start];
    sparseData = getSparseData[start];
    dims = Dimensions[start];
    While[True,
     dataChunk = getDataChunkCode;
     If[dataChunk === {}, Break[]];
     ic = Join[ic, Rest@getIC[dataChunk] + Last@ic];
     jr = Join[jr, getJR[dataChunk]];
     sparseData = Join[sparseData, getSparseData[dataChunk]];
     dims[[1]] += First[Dimensions[dataChunk]];
     ],
    (* clean-up *)
    Close[stream]];
   makeSparseArray[dims, ic, jr, sparseData]]
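
As a quick usage check (paths and data are hypothetical; toyFile is the throwaway file sketched above):

A0 = readSparseTable[toyFile];
Head[A0]         (* SparseArray *)
Dimensions[A0]   (* {5, 10} for the toy file above *)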

To build the complete sparse array (Mat) by row-wise concatenation of the imported sparse arrays, I use Table to iterate over the n text files:

n = 200;
MatTab = Table[readSparseTable["C:/drive/textFILE" <> ToString[i] <> ".txt"], {i, 0, n, 1}];
Mat = Flatten[MatTab, 1];

Using Table seems to turn the sparse arrays back into normal dense matrices, and as a result my system runs out of memory. Wrapping everything in SparseArray again does not help either:

n = 200;
MatTab = SparseArray[Table[readSparseTable["C:/drive/textFILE" <> ToString[i] <> ".txt"], {i, 0, n, 1}]];
Mat = Flatten[MatTab, 1];

How can I solve this problem?

UPDATE:

A possible solution, using code from Leonid Shifrin's answer, but it is slow:

Clear[accumulateSparseArray];
accumulateSparseArray[Hold[getDataChunkCode_]] :=
  Module[{start, ic, jr, sparseData, dims, dataChunk},
   start = getDataChunkCode;
   ic = getIC[start];
   jr = getJR[start];
   sparseData = getSparseData[start];
   dims = Dimensions[start];
   While[True,
    dataChunk = getDataChunkCode;
    If[dataChunk === {}, Break[]];
    ic = Join[ic, Rest@getIC[dataChunk] + Last@ic];
    jr = Join[jr, getJR[dataChunk]];
    sparseData = Join[sparseData, getSparseData[dataChunk]];
    dims[[1]] += First[Dimensions[dataChunk]];
    ];
   makeSparseArray[dims, ic, jr, sparseData]];
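
To see what accumulateSparseArray expects (a small illustration, not from the original post): it is given a held piece of code that yields the next sparse chunk on every evaluation, and {} once the chunks are exhausted:

chunks = {SparseArray[{{1, 0}, {0, 2}}], SparseArray[{{0, 3}}], {}};
ptr = 0;
Normal@accumulateSparseArray[Hold[chunks[[++ptr]]]]
(* {{1, 0}, {0, 2}, {0, 3}} *)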

Clear[sparseArrayFlatten];
sparseArrayFlatten[m_?(MatrixQ[#, MatchQ[#, _SparseArray] &] &)] :=
  Module[{joinRow, code},
   joinRow[row_List] :=
    Module[{i = 1},
     With[{l = Append[row, {{}}]}, code := l[[i++]]];
     accumulateSparseArray[Hold[Transpose[code]]]];
   joinRow[joinRow /@ m]]
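
A small smoke test (not from the original post) that stitches a 2x2 block matrix of sparse pieces together:

a = SparseArray[{{1, 0}, {0, 2}}]; b = SparseArray[{{3}, {0}}];
c = SparseArray[{{0, 4}}];          d = SparseArray[{{5}}];
Normal@sparseArrayFlatten[{{a, b}, {c, d}}]
(* {{1, 0, 3}, {0, 2, 0}, {0, 4, 5}} *)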

Remove[Mat, i, temp]
Mat = readSparseTable["C:/drive/textFILE" <> ToString[0] <> ".txt"];
For[i = 1, i <= n, i++,
  temp = Mat;
  Mat = sparseArrayFlatten[{{temp}, {readSparseTable["C:/drive/textFILE" <> ToString[i] <> ".txt"]}}];
  ]
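
For reference, a hedged sketch of an O(n) variant of this update (same placeholder paths as above): collect all imported blocks first, which stay sparse, and call sparseArrayFlatten only once on a one-column block matrix instead of rebuilding Mat in every iteration:

(* all blocks stay sparse; a single concatenation at the end *)
blocks = Table[readSparseTable["C:/drive/textFILE" <> ToString[i] <> ".txt"], {i, 0, n}];
Mat = sparseArrayFlatten[List /@ blocks];   (* (n+1) x 1 block matrix, joined row-wise *)
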
asked by dykes
  • Hm. I am afraid that the "child has already fallen into the well": Since the data is not stored sparsely, the main bottleneck is the file reading. Do you have access to the way the data is written to file? – Henrik Schumacher Dec 31 '18 at 09:13
  • There are one or two things that could be improved though. For instance ic = Join[ic, Rest@getIC[dataChunk] + Last@ic]; jr = Join[jr, getJR[dataChunk]]; involves a copy operation of all data that has already been read for each chunk being read. It is better to gather the chunks in an expandable data structure such as Internal`Bag, Association (or with Sow and Reap) and joining only once in the end. – Henrik Schumacher Dec 31 '18 at 09:17
  • It does not harm performance, but ic = getIC[A], jr = getJR[A], getSparseData[A], and getDefaultElement[A] can be replaced by the SparseArray properties A["ColumnIndices"], A["RowPointers"], A["NonzeroValues"], and A["Background"], respectively. – Henrik Schumacher Dec 31 '18 at 09:19
  • Oh, and also the For-loop is awkward, not only because it is a For-loop, but also because it involves a copy operation of all preceding data; this forces the computational complexity of the loop to be O(n^2) while O(n) would be sufficient (see also my comment two comments above). Btw.: the argument pattern for sparseArrayFlatten can be simplified to sparseArrayFlatten[m_SparseArray?MatrixQ] := ... – Henrik Schumacher Dec 31 '18 at 09:31
  • @HenrikSchumacher No. I do not have access to how the data is written. – dykes Dec 31 '18 at 13:46
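
For reference, the property-based replacements mentioned in the comments can be checked on a small example (not part of the original post):

s = SparseArray[{{0, 1, 0}, {2, 0, 3}}];
s["RowPointers"] === getIC[s]                  (* True *)
Flatten[s["ColumnIndices"]] === getJR[s]       (* True *)
s["NonzeroValues"] === getSparseData[s]        (* True *)
s["Background"] === getDefaultElement[s]       (* True *)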

1 Answer


This tries to circumvent several of the inefficiencies of OP's code. In particular, I avoid successive use of Join and operate as much as possible on the data that actually defines a sparse array (the column indices, row pointers, and nonzero values).

Only for convenience, I use Mathematica's "CSV" import/export routines (I know that they are far too slow, but the method presented by the OP isn't faster either). (It is really high time for Wolfram Research to react to the frequent user complaints about the inefficiencies of "CSV" import/export.)

First, create several files with matrices (this will take quite a while, because "CSV" export is even slower than import).

nfiles = 20;
filename = "a_";

m = 100;
n = 10000;
nnz = 50000;

Do[
  A = ConstantArray[0, {m, n}];
  i = RandomInteger[{1, m}, nnz];
  j = RandomInteger[{1, n}, nnz];
  a = RandomInteger[{-10, 10}, nnz];
  Do[A[[i[[k]], j[[k]]]] = a[[k]], {k, 1, nnz}];
  Export[filename <> IntegerString[k, 10, 4], A, "CSV"],
  {k, 1, nfiles}
  ];

A few helper routines, partially compiled. Exactly as Leonid Shifrin did, I will assume that each file can be loaded into memory in one whole chunk. The routine that does that and extracts the relevant data is readFile.

getNonzeroValues = Compile[{{row, _Integer, 1}},
   DeleteCases[row, 0],
   CompilationTarget -> "C",
   RuntimeAttributes -> {Listable},
   Parallelization -> True
   ];

getColumnIndices = Compile[{{row, _Integer, 1}},
   Position[Unitize[row], 1],
   RuntimeAttributes -> {Listable},
   Parallelization -> True
   ];

readFile[file_] := If[FileExistsQ[file],
   With[{A = Developer`ToPackedArray@Import[file, "CSV"]},
    {
     Dimensions[A],
     Join @@ getNonzeroValues[A],
     getColumnIndices[A]
     }
    ],
   {{0, 0}, {}, {}}
   ];
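
A quick way to inspect what readFile returns for a single file (assuming the files generated above exist): the result is the matrix dimensions, the nonzero values, and the per-row column positions:

{dims1, vals1, cols1} = readFile[filename <> IntegerString[1, 10, 4]];
dims1            (* {100, 10000} for the parameters above *)
Length[vals1]    (* number of nonzero entries found in that file *)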

Finally, the method readFile is mapped over all filenames in parallel, data is joined once (the row pointers are created only now), and the large SparseArray spA is created.

filenames = Table[filename <> IntegerString[i, 10, 4], {i, 1, nfiles}];
default = 0;
spA = Module[{m, n, a, ci, rp, data},
    data = Transpose@ParallelMap[readFile, filenames];
    m = Total[data[[1, All, 1]]];
    n = Max[data[[1, All, 2]]];
    a = Join @@ data[[2]];
    ci = Join @@ Join @@ data[[3]];
    rp = Accumulate[ Prepend[Flatten[Map[Length, data[[3]], {2}], 1], 0]];
    SparseArray @@ {Automatic, {m, n}, default, {1, {rp, ci}, a}}
    ]
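
A couple of hypothetical sanity checks on the assembled array (using the parameters from above):

Dimensions[spA]   (* {nfiles m, n} = {2000, 10000} here *)
spA["Density"]    (* fraction of explicitly stored entries, roughly nnz/(m n) per file *)
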
answered by Henrik Schumacher