19

I have two big lists with 10,000 lines each, with product names from different databases. They have a lot of common products (50%), but they are not typed in the same way. For example there is one sample of 20 items that match:

List One

Refresco em Pó TANG de Pêssego 30g
Refresco em Pó CLIGHT de Maracujá 9g
Refrigerante ANTARCTICA Soda Limão Garrafa 600ml
Refresco em Pó CLIGHT de Abacaxi com Hortelã 8g
Refrigerante ANTARCTICA Guaraná Lata 350ml
Refresco em Pó CLIGHT de Lima Limão 9g
Refrigerante ANTARCTICA Zero Guaraná Garrafa 600ml
Ração para Gatos Sabor Peixe e Frutos do Mar WHISKAS 500g
Refresco em pó TANG de Tangerina 30g
Refresco em Pó CLIGHT de Uva 9g

List Two

SUCO PO PESSEGO TANG 30G
SUCO PO DIET MARACUJA CLIGHT 8G
REFRIGERANTE SODA LIMONADA ANT 600ML
SUCO PO DIET ABACAXI C/HORTELA CLIGHT 8G
REFRIGERANTE GUARANA ANT LT 350ML
SUCO PO DIET LIMAO LIMAO CLIGHT 8G
REFRIGERANTE GUARANA ZERO ANT PET 600ML
RACAO PEIXE E FRUTOS DO MAR WHISKAS 500G
SUCO PO TANGERINA TANG 30G
SUCO PO DIET UVA CLIGHT 8G

In the original list they are out of order. In SQL Server Integration Services (SSIS) they have Fuzzy Lookup Transformations to do things like that.

Question: What is the best way to make the match as automatically as possible in Mathematica? I tried something very incipient with EditDistance, but it's not enough. I know that some human intervention will be needed in the end, but currently the whole process is done manually.

You can import a bigger list version with 400 lines each, direct in Mathematica using:

list1=StringSplit[Import["http://arquivo.rodrigomurta.com/Fuzzy_list1.txt"],"\n"]
list2=StringSplit[Import["http://arquivo.rodrigomurta.com/Fuzzy_list2.txt"],"\n"]

The lines 001 to 200 have the elements that are not equivalent, and don't match. The lines 201 to 400 have the elements that are equivalent, and are in the same order. So 201 in list 1 is equivalent in 201 in list 2.

Thanks in advance

Murta
  • 26,275
  • 6
  • 76
  • 166

2 Answers2

14

One major improvement to any matching algorithm in this specific case will be to get rid of the diacritical marks in the first list that don't appear in the second list. I'm not aware of any Mathematica function to do that and too lazy to roll a conversion table myself, but there is a simple Java routine to be found here that does this. I slightly modified it so that I can use Leonid's Java Reloader:

javaConvert = "

 import java.text.Normalizer;
 import java.util.regex.Pattern;

 public class convert{ 
   public static String deAccent(String str) {
     String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD); 
     Pattern pattern = Pattern.compile(\"\\\\p{InCombiningDiacriticalMarks}+\");
     return pattern.matcher(nfdNormalizedString).replaceAll(\"\");
   }
 }
 "

With Leonid's package loaded all you need to do is:

JCompileLoad[javaConvert]

It works as follows:

convert`deAccent["äeóã"]

"aeoa"

Now, with the two lists as follows:

l1 = {"Refresco em Pó TANG de Pêssego 30g",
   "Refresco em Pó CLIGHT de Maracujá 9g",
   "Refrigerante ANTARCTICA Soda Limão Garrafa 600ml",
   "Refresco em Pó CLIGHT de Abacaxi com Hortelã 8g",
   "Refrigerante ANTARCTICA Guaraná Lata 350ml",
   "Refresco em Pó CLIGHT de Lima Limão 9g",
   "Refrigerante ANTARCTICA Zero Guaraná Garrafa 600ml",
   "Ração para Gatos Sabor Peixe e Frutos do Mar WHISKAS 500g",
   "Refresco em pó TANG de Tangerina 30g",
   "Refresco em Pó CLIGHT de Uva 9g"};


l2 = {"SUCO PO PESSEGO TANG 30G",
   "SUCO PO DIET MARACUJA CLIGHT 8G",
   "REFRIGERANTE SODA LIMONADA ANT 600ML",
   "SUCO PO DIET ABACAXI C/HORTELA CLIGHT 8G",
   "REFRIGERANTE GUARANA ANT LT 350ML",
   "SUCO PO DIET LIMAO LIMAO CLIGHT 8G",
   "REFRIGERANTE GUARANA ZERO ANT PET 600ML",
   "RACAO PEIXE E FRUTOS DO MAR WHISKAS 500G",
   "SUCO PO TANGERINA TANG 30G",
   "SUCO PO DIET UVA CLIGHT 8G"};

and l1 converted:

l1 = convert`deAccent /@ ToUpperCase /@ l1;

a brute force approach will yield better results. I split the strings in words and for all combinations of list entries I count the number of words in their intersection. The following manipulate shows what pairs can be considered matches depending on a threshold you set:

Manipulate[
 Outer[
   Boole[
     Length@Intersection[StringSplit[#1], StringSplit[#2]] >= t] &, 
   l1, l2] // MatrixPlot,
 {t, 0, 10, 1}
 ]

Mathematica graphics

So, it generally works OK, but we do get some false alarms.

Just picking the maximum intersection length seems to work here:

Ordering[#, -1][[1]] & /@ Outer[Length@Intersection[StringSplit[#1], StringSplit[#2]] &, l1, l2]

{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}

but I can imagine that a global optimization routine working on the lengths matrix may be necessary to solve the general problem. This may be rather expensive in computational sense.

Sjoerd C. de Vries
  • 65,815
  • 14
  • 188
  • 323
10

Nearest automatically uses EditDistance by default:

l1 = {"Refresco em Pó TANG de Pêssego 30g", 
   "Refresco em Pó CLIGHT de Maracujá 9g", 
   "Refrigerante ANTARCTICA Soda Limão Garrafa 600ml", 
   "Refresco em Pó CLIGHT de Abacaxi com Hortelã 8g", 
   "Refrigerante ANTARCTICA Guaraná Lata 350ml", 
   "Refresco em Pó CLIGHT de Lima Limão 9g", 
   "Refrigerante ANTARCTICA Zero Guaraná Garrafa 600ml", 
   "Ração para Gatos Sabor Peixe e Frutos do Mar WHISKAS 500g", 
   "Refresco em pó TANG de Tangerina 30g", 
   "Refresco em Pó CLIGHT de Uva 9g"};

l2 = {"SUCO PO PESSEGO TANG 30G", "SUCO PO DIET MARACUJA CLIGHT 8G", 
   "REFRIGERANTE SODA LIMONADA ANT 600ML", 
   "SUCO PO DIET ABACAXI C/HORTELA CLIGHT 8G", 
   "REFRIGERANTE GUARANA ANT LT 350ML", 
   "SUCO PO DIET LIMAO LIMAO CLIGHT 8G", 
   "REFRIGERANTE GUARANA ZERO ANT PET 600ML", 
   "RACAO PEIXE E FRUTOS DO MAR WHISKAS 500G", 
   "SUCO PO TANGERINA TANG 30G", "SUCO PO DIET UVA CLIGHT 8G"};

Find two nearest solutions (this works rather well with diacritical marks since the EditDistances are rather small for those, but even better when removing them beforehand)

nearest = {#, 
    Nearest[l1, #, 2, 
     DistanceFunction -> (EditDistance [#1, #2, 
         IgnoreCase -> True] &)]} & /@ l2;

Column[Grid[{{#[[1]]}, {Column@#[[2]]}}, Frame -> All] & /@ nearest]

Mathematica graphics

The above is not optimal for large lists, as the first list is processed for every item in the second list leading to at least quadratic complexity. To solve this we can use Nearest to generate a NearestFunction first.

nearestFun = Nearest[l1, DistanceFunction -> (EditDistance[#1, #2, IgnoreCase -> True] &)]

Mathematica graphics

nearest2 = {#, nearestFun[#]} & /@ l2;
Column[Grid[{{#[[1]]}, {Column@#[[2]]}}, Frame -> All] & /@ nearest]

This may serve as a starting point... You could reduce false matches by ranking the EditDistance of different pairs, starting with good matches and subsequent elimination of those from the original lists.

Probably you should break that down to individual words and search for strong correlations on a word-to-word basis (simply dropping weakly correlated fill-words).

Sjoerd C. de Vries
  • 65,815
  • 14
  • 188
  • 323
Yves Klett
  • 15,383
  • 5
  • 57
  • 124