data = Import[
"http://www.deeplook.ir/wp-content/uploads/2021/11/data.txt",
"Text"]
Benefitting from this discussion (and don't ask me about the pattern):
tx = StringCases[data,
RegularExpression[
"(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+\
)*|\"(?:[\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f\\x21\\x23-\\x5b\\x5d-\\x7f]|\
\\\\[\\x01-\\x09\\x0b\\x0c\\x0e-\\x7f])*\")@(?:(?:[a-z0-9](?:[a-z0-9-]\
*[a-z0-9])?\\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\\[(?:(?:(2(5[0-5]|[0-\
4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-\
9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\\x01-\\x08\\x0b\\x0c\\\
x0e-\\x1f\\x21-\\x5a\\x53-\\x7f]|\\\\[\\x01-\\x09\\x0b\\x0c\\x0e-\\\
x7f])+)\\])
"]];
list1 = Union@(StringTrim @tx)
This still has multiple email entries and some website strings that have not been filtered.
For Mathematica users, a more native (and better) way would be:
temail = StringCases[data,
Whitespace | StartOfLine ~~
Shortest[("_" | "-" | WordCharacter | ".") ..] ~~ "@" ~~
Shortest[(WordCharacter | ".") ..] ~~ (Whitespace | "," |
EndOfLine | "&")];
list2 = Union@(StringTrim @temail);
and it is quite readable on its own.
StringTrim and Union simply do the required cleanup.
For comparison:
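One thing to keep in mind: StringCases returns everything the pattern matched, so a trailing "," or "&" stays attached to the address, and StringTrim only strips whitespace. Below is a minimal sketch of one way around this: name the address part and return only that. The sample string and the names sample, addr and addresses are made up for illustration and are not part of the original data.

```mathematica
(* Hypothetical sample text; the addresses are invented for illustration *)
sample = "Write to alice@example.com, or to Bob_1@example.org & also alice@example.com today";

(* Same pattern as above, but the address part is named addr and only addr is
   returned, so the surrounding whitespace, "," or "&" is dropped at match time *)
addresses = StringCases[sample,
   (Whitespace | StartOfLine) ~~
    (addr : (Shortest[("_" | "-" | WordCharacter | ".") ..] ~~ "@" ~~
        Shortest[(WordCharacter | ".") ..])) ~~
    (Whitespace | "," | EndOfLine | "&") :> addr];

(* Union removes the repeated address and sorts the result *)
Union[addresses]
```

With this variant StringTrim is no longer needed, because the delimiters are never part of the returned strings. Counts obtained this way may differ from the ones quoted for list2.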
{Length@list1, Length@list2} (*{652, 785} *)
As an exercise, do the following to verify the differences:
Complement[list1, list2]; (* 25 entries, apparently due to uppercase letters etc. *)
Complement[list2, list1];
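If letter case turns out to be one source of the differences, comparing lowercased copies is a quick way to check. A small sketch with toy lists (a and b are made-up stand-ins for list1 and list2):

```mathematica
(* Toy stand-ins for list1 and list2; the addresses are invented *)
a = {"Alice@Example.com", "bob@example.org"};
b = {"alice@example.com", "carol@example.net"};

(* Case-sensitive comparison: the Alice entry is reported as a difference *)
Complement[a, b]

(* ToLowerCase is Listable, so it maps over each list; after normalizing,
   only the genuinely different entry remains *)
Complement[ToLowerCase[a], ToLowerCase[b]]  (* {"bob@example.org"} *)
```

Whatever is still left after normalizing the case is then down to the patterns themselves rather than capitalization.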
databyres and get 33788 empty lists! – Wisdom Nov 17 '21 at 13:46
F1 and try examples in the docs. Experiment with your own file step by step. Please let me know, which emails this pattern left out? – Syed Nov 17 '21 at 14:44
DeleteDuplicate command. I know the 652 is close to correct results. – Wisdom Nov 17 '21 at 14:48
list3 = TextCases[data, "EmailAddress"] // DeleteDuplicates; is also a built-in way of doing all this in Mma, but I will let you explore this further and see the differences yourself. I took the liberty of editing the title/post twice. If you don't like the changes, you can revert them. See you later. – Syed Nov 17 '21 at 15:45