Correlation with missing data

Question

I have two lists, age and score, for 50 people, and I want to compute the correlation between them. For some people, the score is blank.

Example data:

age={60., 21., 24., 63., 66., 62., 56., 54., 62.,...}
score={720., 880., 980., 820., , 820., 970., 950., 170.,...}

If I simply use DeleteCases to remove the " " from the score list, I change the length and cannot correlate the two lists of different lengths.

Suggestions?

I think you can only compare entries that have both values, so: Pick[age, NumericQ /@ score], and then compare with score /. "" -> Sequence[], or use DeleteCases as in your approach. - Kuba, Jul 25 '13 at 15:29

score 6 · Answer 1 · edited Apr 13 '17 at 12:55

First, let's generate some sample data. In your list you actually have an implicit Null rather than the string " " which you also mention. I'll use Null in mine, but the method should be the same whichever it is: Null, " ", "", etc.

age = 1 / Range[10];
score = 10 / age;
score[[{2, 6, 7, 10}]] = Null;

You can pair and filter the lists as Anon showed. I would use Cases or DeleteCases myself, rather than Sequence[]:

DeleteCases[{age, score}\[Transpose], {_, Null}]

{{1, 10}, {1/3, 30}, {1/4, 40}, {1/5, 50}, {1/8, 80}, {1/9, 90}}

Cases[{age, score}\[Transpose], {_, _?NumberQ}]

{{1, 10}, {1/3, 30}, {1/4, 40}, {1/5, 50}, {1/8, 80}, {1/9, 90}}

Kuba recommended Pick and DeleteCases in a comment:

Pick[age, NumericQ /@ score]
DeleteCases[score, Null]

{1, 1/3, 1/4, 1/5, 1/8, 1/9}
{10, 30, 40, 50, 80, 90}

You could also make use of Position:

pos = Position[score, _?NumberQ];
Extract[#, pos] & /@ {age, score}

{{1, 1/3, 1/4, 1/5, 1/8, 1/9}, {10, 30, 40, 50, 80, 90}}

A different method using Pick:

Pick[{age, score}, {#, #}] &[NumericQ /@ score]

{{1, 1/3, 1/4, 1/5, 1/8, 1/9}, {10, 30, 40, 50, 80, 90}}
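Once the two lists are filtered to the same length, the correlation itself is a single call. A minimal sketch, assuming the sample data defined above; the names mask, ageClean and scoreClean are introduced here for illustration only:

mask = NumericQ /@ score;  (* True wherever a score is present *)
{ageClean, scoreClean} = Pick[{age, score}, {mask, mask}];  (* keep only the complete pairs *)
Correlation[ageClean, scoreClean] // N

N is only needed because the sample data is exact; with real-valued input such as the data in the question, Correlation already returns a machine number.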

Performance

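(* timeAvg: evaluate func 5^i times for growing i until one batch takes more than 0.3 s, then return the average time per evaluation *)
SetAttributes[timeAvg, HoldFirst]
timeAvg[func_] := Do[If[# > 0.3, Return[#/5^i]] & @@ Timing@Do[func, {5^i}], {i, 0, 15}]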

age = 1/Range[100000];
score = 10/age;
score[[ RandomSample[Range@100000, 10000] ]] = Null;

Transpose[{age, score}] /. {__, Null} -> Sequence[]                   // timeAvg
Cases[{age, score}\[Transpose], {_, _?NumberQ}]                       // timeAvg
DeleteCases[{age, score}\[Transpose], {_, Null}]                      // timeAvg
(pos = Position[score, _?NumberQ]; Extract[#, pos] & /@ {age, score}) // timeAvg
{Pick[age, NumericQ /@ score], DeleteCases[score, Null]}              // timeAvg
Pick[{age, score}, {#, #}] &[NumericQ /@ score]                       // timeAvg

Pick appears to be the fastest on this data; DeleteCases comes in second place.
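On version 10 or later, the built-in RepeatedTiming could stand in for the timeAvg helper. A sketch under that assumption, shown for one candidate only and using the large age and score lists defined above:

(* RepeatedTiming returns {seconds, result}; the trailing ; discards the result *)
RepeatedTiming[Pick[{age, score}, {#, #}] &[NumericQ /@ score];] // First

Absolute timings depend on the machine and the version, so only the ranking between methods on the same system is meaningful.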

score 4 · Answer 2 · C. E. · 2013-07-25T15:54:59.437

One way to drop the missing scores together with the corresponding ages is this:

tmp=Transpose[{age, score}] /. {__, " "} -> Sequence[]

Then to update age and score you could do this:

{age, score} = Transpose[tmp]

edited Jul 25 '13 at 15:54

answered Jul 25 '13 at 15:49
