How to perform fuzzy lookup between lists?

Question

I have two big lists with 10,000 lines each, with product names from different databases. They have a lot of common products (50%), but they are not typed in the same way. For example there is one sample of 20 items that match:

List One

Refresco em Pó TANG de Pêssego 30g
Refresco em Pó CLIGHT de Maracujá 9g
Refrigerante ANTARCTICA Soda Limão Garrafa 600ml
Refresco em Pó CLIGHT de Abacaxi com Hortelã 8g
Refrigerante ANTARCTICA Guaraná Lata 350ml
Refresco em Pó CLIGHT de Lima Limão 9g
Refrigerante ANTARCTICA Zero Guaraná Garrafa 600ml
Ração para Gatos Sabor Peixe e Frutos do Mar WHISKAS 500g
Refresco em pó TANG de Tangerina 30g
Refresco em Pó CLIGHT de Uva 9g

List Two

SUCO PO PESSEGO TANG 30G
SUCO PO DIET MARACUJA CLIGHT 8G
REFRIGERANTE SODA LIMONADA ANT 600ML
SUCO PO DIET ABACAXI C/HORTELA CLIGHT 8G
REFRIGERANTE GUARANA ANT LT 350ML
SUCO PO DIET LIMAO LIMAO CLIGHT 8G
REFRIGERANTE GUARANA ZERO ANT PET 600ML
RACAO PEIXE E FRUTOS DO MAR WHISKAS 500G
SUCO PO TANGERINA TANG 30G
SUCO PO DIET UVA CLIGHT 8G

In the original list they are out of order. In SQL Server Integration Services (SSIS) they have Fuzzy Lookup Transformations to do things like that.

Question: What is the best way to make the match as automatically as possible in Mathematica? I tried something very incipient with EditDistance, but it's not enough. I know that some human intervention will be needed in the end, but currently the whole process is done manually.

You can import a bigger list version with 400 lines each, direct in Mathematica using:

list1=StringSplit[Import["http://arquivo.rodrigomurta.com/Fuzzy_list1.txt"],"\n"]
list2=StringSplit[Import["http://arquivo.rodrigomurta.com/Fuzzy_list2.txt"],"\n"]

The lines 001 to 200 have the elements that are not equivalent, and don't match. The lines 201 to 400 have the elements that are equivalent, and are in the same order. So 201 in list 1 is equivalent in 201 in list 2.

Thanks in advance

Could you give an example of the "matches" between the two lists? — J. M.'s missing motivation, Nov 10 '12 at 17:05
List 1 and List 2 are these examples. Item n in L1 is the same of item n in L2. — Murta, Nov 10 '12 at 23:34
You could pose this as a machine learning problem if you have enough known samples of matches. Or if you have details of the algorithms employed by SSIS Fuzzy Lookup, you could replicate that approach. — image_doctor, Nov 11 '12 at 09:14
I notice that all the 9g examples in the first list are matched with 8G counterparts in the second list. Are you sure these are the same? — Sjoerd C. de Vries, Nov 11 '12 at 16:06
Perhaps you could post somewhere a longer lists (200 entries?) set to test some strategies. Also please specify if each product appears only once in each list. — Dr. belisarius, Nov 11 '12 at 16:48
@belisarius Yes. Each product appears only once in each list. I'll create two text files with more items and put in the question for download as suggested. Tks — Murta, Nov 11 '12 at 19:09

score 14 · Accepted Answer · edited May 23 '17 at 12:35

One major improvement to any matching algorithm in this specific case will be to get rid of the diacritical marks in the first list that don't appear in the second list. I'm not aware of any Mathematica function to do that and too lazy to roll a conversion table myself, but there is a simple Java routine to be found here that does this. I slightly modified it so that I can use Leonid's Java Reloader:

javaConvert = "

 import java.text.Normalizer;
 import java.util.regex.Pattern;

 public class convert{ 
   public static String deAccent(String str) {
     String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD); 
     Pattern pattern = Pattern.compile(\"\\\\p{InCombiningDiacriticalMarks}+\");
     return pattern.matcher(nfdNormalizedString).replaceAll(\"\");
   }
 }
 "

With Leonid's package loaded all you need to do is:

JCompileLoad[javaConvert]

It works as follows:

convert`deAccent["äeóã"]

"aeoa"

Now, with the two lists as follows:

l1 = {"Refresco em Pó TANG de Pêssego 30g",
   "Refresco em Pó CLIGHT de Maracujá 9g",
   "Refrigerante ANTARCTICA Soda Limão Garrafa 600ml",
   "Refresco em Pó CLIGHT de Abacaxi com Hortelã 8g",
   "Refrigerante ANTARCTICA Guaraná Lata 350ml",
   "Refresco em Pó CLIGHT de Lima Limão 9g",
   "Refrigerante ANTARCTICA Zero Guaraná Garrafa 600ml",
   "Ração para Gatos Sabor Peixe e Frutos do Mar WHISKAS 500g",
   "Refresco em pó TANG de Tangerina 30g",
   "Refresco em Pó CLIGHT de Uva 9g"};


l2 = {"SUCO PO PESSEGO TANG 30G",
   "SUCO PO DIET MARACUJA CLIGHT 8G",
   "REFRIGERANTE SODA LIMONADA ANT 600ML",
   "SUCO PO DIET ABACAXI C/HORTELA CLIGHT 8G",
   "REFRIGERANTE GUARANA ANT LT 350ML",
   "SUCO PO DIET LIMAO LIMAO CLIGHT 8G",
   "REFRIGERANTE GUARANA ZERO ANT PET 600ML",
   "RACAO PEIXE E FRUTOS DO MAR WHISKAS 500G",
   "SUCO PO TANGERINA TANG 30G",
   "SUCO PO DIET UVA CLIGHT 8G"};

and l1 converted:

l1 = convert`deAccent /@ ToUpperCase /@ l1;

a brute force approach will yield better results. I split the strings in words and for all combinations of list entries I count the number of words in their intersection. The following manipulate shows what pairs can be considered matches depending on a threshold you set:

Manipulate[
 Outer[
   Boole[
     Length@Intersection[StringSplit[#1], StringSplit[#2]] >= t] &, 
   l1, l2] // MatrixPlot,
 {t, 0, 10, 1}
 ]

Mathematica graphics

So, it generally works OK, but we do get some false alarms.

Just picking the maximum intersection length seems to work here:

Ordering[#, -1][[1]] & /@ Outer[Length@Intersection[StringSplit[#1], StringSplit[#2]] &, l1, l2]

{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}

but I can imagine that a global optimization routine working on the lengths matrix may be necessary to solve the general problem. This may be rather expensive in computational sense.

Very nice way @Sjoerd. I have made a post to solve your java string normalization code inside Mathematica here — Murta, Nov 11 '12 at 19:37
+1 for diacritics remover (I'm sure I'll find good uses for that in the future!) — Szabolcs, Nov 12 '12 at 15:08

score 10 · Answer 2 · edited Mar 15 '15 at 18:35

Nearest automatically uses EditDistance by default:

l1 = {"Refresco em Pó TANG de Pêssego 30g", 
   "Refresco em Pó CLIGHT de Maracujá 9g", 
   "Refrigerante ANTARCTICA Soda Limão Garrafa 600ml", 
   "Refresco em Pó CLIGHT de Abacaxi com Hortelã 8g", 
   "Refrigerante ANTARCTICA Guaraná Lata 350ml", 
   "Refresco em Pó CLIGHT de Lima Limão 9g", 
   "Refrigerante ANTARCTICA Zero Guaraná Garrafa 600ml", 
   "Ração para Gatos Sabor Peixe e Frutos do Mar WHISKAS 500g", 
   "Refresco em pó TANG de Tangerina 30g", 
   "Refresco em Pó CLIGHT de Uva 9g"};

l2 = {"SUCO PO PESSEGO TANG 30G", "SUCO PO DIET MARACUJA CLIGHT 8G", 
   "REFRIGERANTE SODA LIMONADA ANT 600ML", 
   "SUCO PO DIET ABACAXI C/HORTELA CLIGHT 8G", 
   "REFRIGERANTE GUARANA ANT LT 350ML", 
   "SUCO PO DIET LIMAO LIMAO CLIGHT 8G", 
   "REFRIGERANTE GUARANA ZERO ANT PET 600ML", 
   "RACAO PEIXE E FRUTOS DO MAR WHISKAS 500G", 
   "SUCO PO TANGERINA TANG 30G", "SUCO PO DIET UVA CLIGHT 8G"};

Find two nearest solutions (this works rather well with diacritical marks since the EditDistances are rather small for those, but even better when removing them beforehand)

nearest = {#, 
    Nearest[l1, #, 2, 
     DistanceFunction -> (EditDistance [#1, #2, 
         IgnoreCase -> True] &)]} & /@ l2;

Column[Grid[{{#[[1]]}, {Column@#[[2]]}}, Frame -> All] & /@ nearest]

Mathematica graphics

The above is not optimal for large lists, as the first list is processed for every item in the second list leading to at least quadratic complexity. To solve this we can use Nearest to generate a NearestFunction first.

nearestFun = Nearest[l1, DistanceFunction -> (EditDistance[#1, #2, IgnoreCase -> True] &)]

Mathematica graphics

nearest2 = {#, nearestFun[#]} & /@ l2;
Column[Grid[{{#[[1]]}, {Column@#[[2]]}}, Frame -> All] & /@ nearest]

This may serve as a starting point... You could reduce false matches by ranking the EditDistance of different pairs, starting with good matches and subsequent elimination of those from the original lists.

Probably you should break that down to individual words and search for strong correlations on a word-to-word basis (simply dropping weakly correlated fill-words).

It's probably better to use Nearest here to make a NearestFunction first and to use that instead. This will be very much faster for large data sets. — Sjoerd C. de Vries, Mar 15 '15 at 12:52
OK, done. BTW I was wondering about the 2 as third argument in your Nearest. In the NearestFunction version I can't use that, but is it really necessary? — Sjoerd C. de Vries, Mar 15 '15 at 18:39

How to perform fuzzy lookup between lists?

2 Answers2

Linked

Related