I am using Mathematica to scrape information from webpages. To get file information, I am gathering the text enclosed by <tr> and </tr> tags. I have built a list in which each element is the full markup of one table row, and I want to extract the human-readable text from each element.

By way of example, here are two elements from my list:

   "<tr>
         <td class=\"td_data\" valign=\"top\">1.</td>
         <td class=\"td_data\" valign=\"top\">1841-1869 (Province of \
    Canada), number 195, 21 June 1845, page 15</a></td>
    <td class=\"td_data\" valign=\"top\"><a \
    href=\"093/001060-119.01-e.php?image_id_nbr=2592&document_id_nbr=1857&\
    f=g&PHPSESSID=kq2k6i3u6qodbjdp1ardk0ca96\">GIF</a> | <a \
    href=\"093/001060-119.01-e.php?image_id_nbr=2592&document_id_nbr=1857&\
    f=p&PHPSESSID=kq2k6i3u6qodbjdp1ardk0ca96\">PDF</a></td>
    </tr>", "<tr>
         <td class=\"td_data\" valign=\"top\">2.</td>
         <td class=\"td_data\" valign=\"top\">1841-1869 (Province of \
    Canada), number 402, Extra, 16 May 1849, page 4</a></td>
    <td class=\"td_data\" valign=\"top\"><a \
    href=\"093/001060-119.01-e.php?image_id_nbr=6979&document_id_nbr=2061&\
    f=g&PHPSESSID=kq2k6i3u6qodbjdp1ardk0ca96\">GIF</a> | <a \
    href=\"093/001060-119.01-e.php?image_id_nbr=6979&document_id_nbr=2061&\
    f=p&PHPSESSID=kq2k6i3u6qodbjdp1ardk0ca96\">PDF</a></td>
    </tr>"

I've been playing around with Shortest[] and StringCases[] to try to find a way, but is there a function I could quickly map over this list that would result in the following:

"1841-1869 (Province of Canada), number 195, 21 June 1845, page 15",
"1841-1869 (Province of Canada), number 402, Extra, 16 May 1849, page 4"

The actual text changes, i.e. it doesn't always say 1841-1869 etc., but the general format of the <tr> portion remains consistent. I don't mind if the GIF | PDF part remains either. Would there be a quick way to just render the actually visible part of the HTML file in this list?


If somebody wants to reproduce the list I am using, here is my code (I include this only to make things simpler, but this is probably more of a theoretical question):

baseurl = 
  "http://www.collectionscanada.gc.ca/databases/canada-gazette/001060-\
110.01-e.php?q1=youth&q3=&interval=199&sk=";
pagelist = 
  Import[baseurl <> ToString[#], "Source"] & /@ Range[1, 1889, 199];
pagetext = Apply[StringJoin, pagelist];
trlist = StringCases[pagetext, Shortest["<tr>" ~~ ___ ~~ "</tr>"]];
canadian_scholar
  • Have you tried importing the html file as "Data"? It might be easier to extract your data from that. – Heike Feb 02 '12 at 00:26
  • HTML is a hell of a complicated file format (Example: <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"><><title//<p ltr<span></span</p></> is a valid document). Can't you get your hands on some XML version of the page? (I didn't even consider what horrible code the average page has; catchword: quirks mode.) – David Feb 02 '12 at 01:10

2 Answers

I don't know how robust this is, but this function seems to do what you want:

ImportString[
    ExportString[Delete[ImportString[#, "Table"], {{2}, {-2}}], "Table"], 
    "HTML"
] &
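
As a usage sketch, the function can be given a name and mapped over the list. The name extractText is hypothetical, and trlist is assumed to be the list built in the question. Note that deleting positions 2 and -2 relies on every row string having the same five-line layout as the examples (opening tag, index cell, description cell, links cell, closing tag), so rows with a different layout would need different positions.

```mathematica
(* extractText is a hypothetical name for the pure function above *)
extractText = ImportString[
     ExportString[Delete[ImportString[#, "Table"], {{2}, {-2}}], "Table"],
     "HTML"] &;

(* trlist is assumed to be the list of "<tr>...</tr>" strings from the question *)
extractText /@ trlist
```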
Leonid Shifrin

For the two strings in your first example, this seems to work

ImportString[string, "HTML"]
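
To apply this to every row at once, the call can be wrapped in a pure function and mapped over the list. This is only a sketch: trlist is assumed to be the list from the question, visible is a hypothetical name, and the StringTrim call is an addition on the assumption that the imported text may carry leading or trailing whitespace.

```mathematica
(* trlist is assumed to be the list of "<tr>...</tr>" strings from the question *)
(* visible is a hypothetical name for the resulting list of plain-text rows *)
visible = StringTrim[ImportString[#, "HTML"]] & /@ trlist;
```

Each result should still contain the row index and the "GIF | PDF" text, which the question says is acceptable.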

For the baseurl in the original post, importing the page as "Data" and looking at the first few rows,

data = Import[baseurl, "Data"]
data[[2, ;; 4]]

gives something like

{{"Item", "View Options"}, {
  1., "1841-1869 (Province of Canada), number 195, 21 June 1845, page \
15", "GIF | PDF"}, {
  2., "1841-1869 (Province of Canada), number 402, Extra, 16 May \
1849, page 4", "GIF | PDF"}, {
  3., "1841-1869 (Province of Canada), number 405, 26 May 1849, page \
15", "GIF | PDF"}}

so it looks like data[[2, 2 ;;, 2]] gives you the list you're after (the 2 ;; skips the header row).
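
To keep only the descriptions and leave out the header row, here is a sketch on a toy copy of the table above. The name table and its contents are stand-ins with the same assumed shape as data[[2]], not the live import.

```mathematica
(* Toy stand-in for data[[2]]: a header row followed by {index, description, links} rows *)
table = {
   {"Item", "View Options"},
   {1., "1841-1869 (Province of Canada), number 195, 21 June 1845, page 15", "GIF | PDF"},
   {2., "1841-1869 (Province of Canada), number 402, Extra, 16 May 1849, page 4", "GIF | PDF"}
   };

(* Rest drops the header row; [[All, 2]] then takes the second column of what remains *)
Rest[table][[All, 2]]
```

On the toy table this returns the two description strings and nothing else.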

Heike
  • Not quite. It produces extra pieces. – Leonid Shifrin Feb 02 '12 at 00:33
  • @LeonidShifrin do you mean the indices and the "GIF | PDF" at the beginning and end? I guess importing as "Table" as in your solution gets rid of those, but it's probably easier to just import the file as "Data" in the first place in this case. – Heike Feb 02 '12 at 00:46
  • Yes, I meant that. When I was posting a comment, only the first part of your answer was available. Using "Data" is a good alternative, I agree. – Leonid Shifrin Feb 02 '12 at 00:48
  • Great, thank you. Let me play with this as soon as my internet connection works! – canadian_scholar Feb 02 '12 at 01:02