
I have a text file, which I have uploaded here, and I want to find all the email addresses in it. All I know is that some of them come after the expression "Subscription Date" (I don't know whether there is a better way to find all the email addresses in a text file). I have seen several similar questions, but their answers don't work for me. I also tried this:

res = Import["data.txt", "Lines"];
tar = Position[res, "\"Subscription Date\""][[1, 1]];

but this does not give the answer. Any ideas?

Syed
Wisdom

2 Answers

data = Import[
  "http://www.deeplook.ir/wp-content/uploads/2021/11/data.txt", 
  "Text"]

Benefitting from this discussion (and don't ask me about the pattern):

tx = StringCases[data, 
   RegularExpression[
    "(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+\
)*|\"(?:[\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f\\x21\\x23-\\x5b\\x5d-\\x7f]|\
\\\\[\\x01-\\x09\\x0b\\x0c\\x0e-\\x7f])*\")@(?:(?:[a-z0-9](?:[a-z0-9-]\
*[a-z0-9])?\\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\\[(?:(?:(2(5[0-5]|[0-\
4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-\
9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\\x01-\\x08\\x0b\\x0c\\\
x0e-\\x1f\\x21-\\x5a\\x53-\\x7f]|\\\\[\\x01-\\x09\\x0b\\x0c\\x0e-\\\
x7f])+)\\])
"]];

list1 = Union@(StringTrim @tx)

This still has multiple email entries and some website strings that have not been filtered.


For Mathematica users, a more native (and better) way would be:

temail = StringCases[data, 
   Whitespace | StartOfLine ~~ 
    Shortest[("_" | "-" | WordCharacter | ".") ..] ~~ "@" ~~ 
    Shortest[(WordCharacter | ".") ..] ~~ (Whitespace | "," | 
      EndOfLine | "&")];

list2 = Union@(StringTrim @temail);

and it is quite readable on its own.

StringTrim and Union simply do the required cleanup.
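To see what each piece of the string pattern above does, here is a minimal annotated sketch on a made-up string. The names `sample`, `emailPattern` and `addr` are hypothetical and not part of the answer, and the sample text is not taken from data.txt. As a variation on the answer's code, the address is captured with a named pattern so that the boundary characters are left out of the result.

```wolfram
(* Hypothetical sample text, not taken from data.txt *)
sample = "alice@example.com, Bob.Smith@mail.example.org\nsee http://example.com/page";

emailPattern =
  (Whitespace | StartOfLine) ~~                          (* left boundary: whitespace or start of a line *)
   (addr : (
      Shortest[("_" | "-" | WordCharacter | ".") ..] ~~  (* local part: letters, digits, "_", "-", "." *)
      "@" ~~
      Shortest[(WordCharacter | ".") ..])) ~~            (* domain: letters, digits, "." *)
   (Whitespace | "," | EndOfLine | "&");                 (* right boundary, not part of the address *)

(* Return only the named part addr, then remove duplicates and sort *)
Union[StringCases[sample, emailPattern :> addr]]
```

Without the named pattern, each match also contains its boundary characters. StringTrim removes surrounding whitespace only, so a trailing "," or "&" would remain in the matched string.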


For comparison

{Length@list1, Length@list2} (*{652, 785} *)

As an exercise, do the following to verify the differences:

Complement[list1, list2]; (*25 ?entries due to uppercase etc*)
Complement[list2, list1];

Syed
  • Thanks, but how does this work? I replaced data with res and got 33788 empty lists! – Wisdom Nov 17 '21 at 13:46
  • Many thanks, your pattern is very clean, but it doesn't give all the emails. Anyway, I would be grateful if you added some explanation of your pattern. – Wisdom Nov 17 '21 at 14:43
  • To learn about these reserved words, you can press F1 and try the examples in the docs. Experiment with your own file step by step. Please let me know which emails this pattern left out. – Syed Nov 17 '21 at 14:44
  • I didn't compare, but your pattern gives 428 results while the first one gives 652 after the DeleteDuplicates command. I know that 652 is close to the correct result. – Wisdom Nov 17 '21 at 14:48
  • @Wisdom I have updated the answer. – Syed Nov 17 '21 at 15:21
  • Now your pattern is perfect. Great job! – Wisdom Nov 17 '21 at 15:38
  • list3 = TextCases[data, "EmailAddress"] // DeleteDuplicates; is also a built-in way of doing all this in Mma, but I will let you explore this further and see the differences yourself. I took the liberty of editing the title/post twice. If you don't like the changes, you can revert them. See you later. – Syed Nov 17 '21 at 15:45
  • OMG! Such a simple way! Thanks a lot, and feel free to edit everything. – Wisdom Nov 17 '21 at 18:46

The built-in function TextCases gets you most of the way there as it has a built-in "EmailAddress" text content type, but it also includes email addresses with some preceding text (like /my/subscriber?=foo@bar.com), so we also filter that out with Select, and then DeleteDuplicates:

DeleteDuplicates@
 Select[TextCases[
   Import["http://www.deeplook.ir/wp-content/uploads/2021/11/data.txt", "String"], 
 "EmailAddress"], Not@StringContainsQ[#, "=" | "/"] &]

which returns 777 email addresses.
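The filtering step can be tried in isolation on a small hand-made list. This is only a sketch: `candidates` is a hypothetical list standing in for the output of TextCases, built from the foo@bar.com example mentioned above.

```wolfram
(* Hypothetical stand-in for the TextCases output, including one entry with preceding text *)
candidates = {"foo@bar.com", "/my/subscriber?=foo@bar.com", "foo@bar.com"};

(* Keep only the strings that contain neither "=" nor "/", then drop repeats *)
DeleteDuplicates@Select[candidates, Not@StringContainsQ[#, "=" | "/"] &]
(* {"foo@bar.com"} *)
```

Note that Select discards an entry with preceding text entirely instead of trimming it down to the address, so an address that appears only in that form would be lost.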

Carl Lange