20

I have a large table (7000 rows × 17 columns) of terse textual data. In many of the columns, empty entries have been replaced with "." as a marker. Working from the top of the columns downward, I want to replace each successive "." with the data value (non-".") from above until the next data value is found; at which time I want to replace the next sequence of "." with the new data value etc. E.g.

x     -->  x  
"."   -->  x  
y     -->  y  
"."   -->  y  
"."   -->  y  
z     -->  z  
"."   -->  z  

I need to do this for all rows of data. Any help would be greatly appreciated. -k-

m_goldberg
  • 107,779
  • 16
  • 103
  • 257
Keith
  • 203
  • 1
  • 5

8 Answers8

26

How about

f[_, y_] := y
f[x_, "."] := x

fill[data_] := Transpose[FoldList[f, First[#], Rest[#]] & /@ Transpose[data]]
Simon Woods
  • 84,945
  • 8
  • 175
  • 324
  • 1
    +1. I think this one is going to be my favorite. – Michael E2 Jun 09 '13 at 19:40
  • It is so fast :). +1 – Kuba Jun 09 '13 at 21:49
  • Fold is such a natural solution to this problem -- I'm embarrassed I didn't think to try it. It works row-by-row, too, if f is Listable and the data contain no lists: fill[data_] := FoldList[f, First[data], Rest[data]]. – Michael E2 Jun 09 '13 at 22:44
  • @MichaelE2, that's a neat alternative. – Simon Woods Jun 10 '13 at 09:41
  • Just happened across this. Quite nice. My final attempt similar, you might get a chuckle from a prior (impractical, but kind of pretty in a way): FixedPoint[ ReplacePart[#, {r_, c_} /; (#[[r, c]] == ".") :> #[[r - 1, c]]] &, data] ... +1, of course! – ciao May 02 '14 at 06:11
11

Map approach:

g["."] := last;    g[x_] := x;
Map[(last = g[#]) &, Transpose[data], {2}]\[Transpose]

Latest ideas, quite important remarks (in my opinion)

I have asked Keith about gaps in first row. ...and no one have paid atenttion (me too). It is crucial, this is what makes significant difference between this problem and problem linked by Mr. Wizard.

Why?, because this information allows us to effectively use ParallelMap.

First I thought, let Transpose->Map->Transpose, what a big deal. Parallel Map will also help each method, so result will be qualitatively the same but... it is not true.

Conclusions:

  1. I am not able to help Simon's ListFold method with ParralelMap (separate question)
  2. Methods's relative efficiency depends on "." density.
  3. Methods's relative efficiency depends on data type.

It should depend of number_of_columns/avalible_threads but I haven't checked that. I have 2 cores.

Each method contains now ParallelMap (exept red one). 0x4A4D's works faster without parallelization so I left it too. Maybe implementation could have been better, codes are below plots.

enter image description here enter image description here enter image description here

Codes with methods upgraded with ParallelMap:

(*Simon with and without parallelmap*)
f[_, y_] := y;f[x_, "."] := x;
fill[data_] :=  Transpose[FoldList[f, First[#], Rest[#]] & /@ Transpose[data]]

fillP[data_] := Transpose[ParallelMap[
        FoldList[f, First[#], Rest[#]] &,
        Transpose[data]]]
(*mine*)
g["."] := last;g[x_] := x;
ParallelMap[(last = g[#]) &, Transpose[data], {2}]\[Transpose];
(*0x4A4D's*)
fillblanks[list_, dummy_: "."] := Flatten[
  Map[
     ConstantArray[First[#], Length[#]] &,
     Split[list, MatchQ[#2, dummy] &]]]
(*Michael's*)
filldown2 = Transpose@ParallelMap[
 Module[{last = First[#]}, Function[{entry}, last = entry /. "." -> last] /@ #] &, 
 Transpose[#]] &;

Testing procedure:

time = {};
Do[(
k = 10^i;
data = Table[RandomChoice[{"1","."}], {k}, {4}]~Prepend~(Range@4);
AppendTo[time,
 {fill@data; // Timing // First,
  fillP@data; // Timing // First,
  ParallelMap[(last = g[#]) &, Transpose[data], {2}]\[Transpose]; //Timing // First,
  Transpose[fillblanks /@ Transpose[data]]; // Timing // First,
  filldown2@data; // Timing // First
}]
), {i, 3, 6, .5}]

In order to switch % of "." I have simply used RandomChoice[{"1","."}] (for 50%) and RandomChoice[{"1","1","1","1","1","1","1","1","1","."}] for 10% :).


First, single thread ideas

Old There was no MapIndexed way, so now it is:

f[d_, {x_, y_}] := If[d == ".", f[data[[x - 1, y]], {x - 1, y}], d]

MapIndexed[f, data, {2}]

New Map approach:

g["."] := last
g[x_] := x
Map[(last = g[#]) &, Transpose[data], {2}]\[Transpose]

And version which needs only [[1,1]] element to be different from ".":

Partition[
   (last = g[#]) & /@ Flatten[Transpose[data]]
, Length@data]\[Transpose]

33 sec for k=10^6, faster but not fast enough. I'll update plots later.

methods comparison:

tested on 7-column k-row matrix, where k = 10^1, 10^2... Elements are strings.

data=Table[RandomChoice[{"1", "2", "3", "."}], {k}, {7}]~Prepend~(ToString /@Range@7);

enter image description here

Kuba
  • 136,707
  • 13
  • 279
  • 740
5

Maybe there's a more efficient approach, but here's what I came up with:

data = {{"foo", "foo", "blah", "blah", "foo", "blah", "argh"},
        {"pfft", ".", ".", ".", ".", ".", "foo"},
        {"foo", ".", "foo", ".", "pfft", "blah", "."},
        {".", ".", "pfft", ".", ".", ".", "blah"},
        {".", "pfft", "blah", ".", ".", ".", "."},
        {"blah", ".", ".", ".", "argh", ".", "."},
        {"blah", "foo", "foo", "argh", ".", ".", "pfft"},
        {".", ".", "pfft", ".", "argh", ".", "."},
        {"foo", ".", "blah", ".", "argh", ".", "."},
        {"blah", "pfft", ".", "pfft", "foo", "argh", "."}};

fillblanks[list_, dummy_: "."] := 
    Flatten[ConstantArray[First[#], Length[#]] & /@ Split[list, StringMatchQ[#2, dummy] &]]

Transpose[fillblanks /@ Transpose[data]]

which yields

{{"foo", "foo", "blah", "blah", "foo", "blah", "argh"},
 {"pfft", "foo", "blah", "blah", "foo", "blah", "foo"},
 {"foo", "foo", "foo", "blah", "pfft", "blah", "foo"},
 {"foo", "foo", "pfft", "blah", "pfft", "blah", "blah"},
 {"foo", "pfft", "blah", "blah", "pfft", "blah", "blah"},
 {"blah", "pfft", "blah", "blah", "argh", "blah", "blah"},
 {"blah", "foo", "foo", "argh", "argh", "blah", "pfft"},
 {"blah", "foo", "pfft", "argh", "argh", "blah", "pfft"},
 {"foo", "foo", "blah", "argh", "argh", "blah", "pfft"},
 {"blah", "pfft", "blah", "pfft", "foo", "argh", "pfft"}}
J. M.'s missing motivation
  • 124,525
  • 11
  • 401
  • 574
5

This question is arguably a duplicate of: Fill out blanks with a upcoming number in a list?
There the filling is requested in reverse and for a single list, but the principle is the same.

Simon Woods already posted what is IMHO the most natural approach: FoldList, but as we are already restating earlier answers I'd like to share my own variations.

For each method I'll show application to a single list v; application to columns is easily accomplished with a double Transpose.

Heike's method using FoldList and /.:

v = {8, 3, ".", 1, ".", ".", 7, ".", 9, ".", ".", ".", ".", ".", ".", ".", 2, ".", 2, 6};

Rest @ FoldList[#2 /. "." -> # &, ".", v]
{8, 3, 3, 1, 1, 1, 7, 7, 9, 9, 9, 9, 9, 9, 9, 9, 2, 2, 2, 6}

Also of interest is Leonid's method with Split. Here in a different style:

Join @@ (Table[#, {Length@{##}}] & @@@ Split[v, #2 === "." &])
{8, 3, 3, 1, 1, 1, 7, 7, 9, 9, 9, 9, 9, 9, 9, 9, 2, 2, 2, 6}

Simon's code is faster than either of these on unpacked (mixed) data:

SeedRandom[3]
v = Clip[RandomInteger[{-7, 9}, 350000], {1, ∞}, {".", 1}];

Rest@FoldList[#2 /. "." -> # &, ".", v]                        // Timing // First
Join @@ (Table[#, {Length@{##}}] & @@@ Split[v, #2 === "." &]) // Timing // First
f[_, y_] := y; f[x_, "."] := x; FoldList[f, First[v], Rest[v]] // Timing // First

0.2872

0.327

0.0936

However if the problem can be recast to packable data then a form that complies will be faster. I'll use zero to represent fields that should be filled.

v = {8, 3, 0, 1, 0, 0, 7, 0, 9, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 6};

Rest @ FoldList[If[#2 == 0, ##] &, 0, v]
{8, 3, 3, 1, 1, 1, 7, 7, 9, 9, 9, 9, 9, 9, 9, 9, 2, 2, 2, 6}
SeedRandom[3]
v = Clip[RandomInteger[{-7, 9}, 350000], {1, ∞}, {0, 1}];

ClearAll[f];

f[_, y_] := y; f[x_, 0] := x; FoldList[f, First[v], Rest[v]]  // Timing // First
Rest @ FoldList[If[#2 == 0, ##] &, 0, v]                      // Timing // First
0.0936

0.0212

I like the compact and self-contained nature of this code as well.

Mr.Wizard
  • 271,378
  • 34
  • 587
  • 1,371
3

Row by row:

Clear[datum];
SeedRandom[1];
datum[] := RandomInteger[{0, 9}];
tab = {RandomInteger[{0, 9}, 5]} ~Join~ RandomChoice[{8, 1} -> {".", x}, {9, 5}] /. x :> datum[]

filldown@tab
{{ 1,   4,   0,   7,   0},
 {".", ".", ".",  8,  "."},
 {".", ".", ".", ".",  5},
 {".", ".", ".", ".", "."},
 {".",  7,   6,  ".", "."},
 {".",  7,  ".", ".", "."},
 {".", ".",  0,  ".",  1},
 {".", ".", ".", ".", "."},
 {".", ".",  0,  ".", "."},
 {".", ".", ".", ".", "."}}
filldown = Module[{last = First[#]},
   Function[{row}, 
     last = ReplacePart[last, # -> Extract[row, #] & /@ Position[row, Except["."]]]] /@ #] &
{{1, 4, 0, 7, 0},
 {1, 4, 0, 8, 0},
 {1, 4, 0, 8, 5},
 {1, 4, 0, 8, 5},
 {1, 7, 6, 8, 5},
 {1, 7, 6, 8, 5},
 {1, 7, 0, 8, 1},
 {1, 7, 0, 8, 1},
 {1, 7, 0, 8, 1},
 {1, 7, 0, 8, 1}}

Down each column:

filldown2 = Transpose@Map[Module[{last = First[#]},
      Function[{entry}, last = entry /. "." -> last] /@ #] &, Transpose[#]] &

filldown2@tab

Same as above.

Michael E2
  • 235,386
  • 17
  • 334
  • 747
2

Caution: This answer is just for fun. ReplaceRepeated is constantly slow.

Using 0x4A4D's data:

#//.{pre___,a_,b:Longest["."..],post___}:>{pre,a,Sequence@@({b}/."."->a),post}&/@
  (data\[Transpose])\[Transpose]

Or,

Flatten[#//.{pre___,a_,b:Longest["."..],post___}:>{pre,a,{b}/."."->a,post}]&/@
  (data\[Transpose])\[Transpose]
Silvia
  • 27,556
  • 3
  • 84
  • 164
2

Borrowing J.M's data

data =
  {{"foo", "foo", "blah", "blah", "foo", "blah", "argh"},
   {"pfft", ".", ".", ".", ".", ".", "foo"},
   {"foo", ".", "foo", ".", "pfft", "blah", "."},
   {".", ".", "pfft", ".", ".", ".", "blah"},
   {".", "pfft", "blah", ".", ".", ".", "."},
   {"blah", ".", ".", ".", "argh", ".", "."},
   {"blah", "foo", "foo", "argh", ".", ".", "pfft"},
   {".", ".", "pfft", ".", "argh", ".", "."},
   {"foo", ".", "blah", ".", "argh", ".", "."},
   {"blah", "pfft", ".", "pfft", "foo", "argh", "."}};

Using SequenceReplace (new in 11.3)

MatrixForm @ Transpose @ Map[
   Flatten @* SequenceReplace[{a_, b : "." ..} :> Array[a &, Length[{b}] + 1]],
   Transpose @ data]

enter image description here

eldo
  • 67,911
  • 5
  • 60
  • 168
1

Using Split, SequenceCases and ReplacePart:

f1[v_?VectorQ] := MapApply[Rule,
    Transpose[
            MapAt[Function[Position[Split @ v, Apply[Alternatives, #]]],
                Transpose[
                    SequenceCases[v,
                        {a_, b : Longest["." ..]} :> {{b}, 
         ConstantArray[a, Length[{b}]]}
                    ]
                ],
                {1}
            ]
        ]
   ];

f2[v_?VectorQ] := Flatten[ReplacePart[Split[#], f1@#]] &@v

Transpose@(f2 /@ Transpose@data) // MatrixForm

enter image description here

E. Chan-López
  • 23,117
  • 3
  • 21
  • 44