I have two big lists with 10,000 lines each, with product names from different databases. They have a lot of common products (50%), but they are not typed in the same way. For example there is one sample of 20 items that match:
List One
Refresco em Pó TANG de Pêssego 30g
Refresco em Pó CLIGHT de Maracujá 9g
Refrigerante ANTARCTICA Soda Limão Garrafa 600ml
Refresco em Pó CLIGHT de Abacaxi com Hortelã 8g
Refrigerante ANTARCTICA Guaraná Lata 350ml
Refresco em Pó CLIGHT de Lima Limão 9g
Refrigerante ANTARCTICA Zero Guaraná Garrafa 600ml
Ração para Gatos Sabor Peixe e Frutos do Mar WHISKAS 500g
Refresco em pó TANG de Tangerina 30g
Refresco em Pó CLIGHT de Uva 9g
List Two
SUCO PO PESSEGO TANG 30G
SUCO PO DIET MARACUJA CLIGHT 8G
REFRIGERANTE SODA LIMONADA ANT 600ML
SUCO PO DIET ABACAXI C/HORTELA CLIGHT 8G
REFRIGERANTE GUARANA ANT LT 350ML
SUCO PO DIET LIMAO LIMAO CLIGHT 8G
REFRIGERANTE GUARANA ZERO ANT PET 600ML
RACAO PEIXE E FRUTOS DO MAR WHISKAS 500G
SUCO PO TANGERINA TANG 30G
SUCO PO DIET UVA CLIGHT 8G
In the original list they are out of order. In SQL Server Integration Services (SSIS) they have Fuzzy Lookup Transformations to do things like that.
Question:
What is the best way to make the match as automatically as possible in Mathematica?
I tried something very incipient with EditDistance, but it's not enough.
I know that some human intervention will be needed in the end, but currently the whole process is done manually.
You can import a bigger list version with 400 lines each, direct in Mathematica using:
list1=StringSplit[Import["http://arquivo.rodrigomurta.com/Fuzzy_list1.txt"],"\n"]
list2=StringSplit[Import["http://arquivo.rodrigomurta.com/Fuzzy_list2.txt"],"\n"]
The lines 001 to 200 have the elements that are not equivalent, and don't match. The lines 201 to 400 have the elements that are equivalent, and are in the same order. So 201 in list 1 is equivalent in 201 in list 2.
Thanks in advance


