Cleaning up a List of HTML Data to Render Usable Information

Question

I am using Mathematica to scrape information from webpages. To get file information, I gather the text enclosed by <tr> and </tr> tags. This gives me a list in which each element is the complete markup of one table row. I want to extract the human-readable plain text from each element.

By way of example, here are two elements from my list:

   "<tr>
         <td class=\"td_data\" valign=\"top\">1.</td>
         <td class=\"td_data\" valign=\"top\">1841-1869 (Province of \
    Canada), number 195, 21 June 1845, page 15</a></td>
    <td class=\"td_data\" valign=\"top\"><a \
    href=\"093/001060-119.01-e.php?image_id_nbr=2592&document_id_nbr=1857&\
    f=g&PHPSESSID=kq2k6i3u6qodbjdp1ardk0ca96\">GIF</a> | <a \
    href=\"093/001060-119.01-e.php?image_id_nbr=2592&document_id_nbr=1857&\
    f=p&PHPSESSID=kq2k6i3u6qodbjdp1ardk0ca96\">PDF</a></td>
    </tr>", "<tr>
         <td class=\"td_data\" valign=\"top\">2.</td>
         <td class=\"td_data\" valign=\"top\">1841-1869 (Province of \
    Canada), number 402, Extra, 16 May 1849, page 4</a></td>
    <td class=\"td_data\" valign=\"top\"><a \
    href=\"093/001060-119.01-e.php?image_id_nbr=6979&document_id_nbr=2061&\
    f=g&PHPSESSID=kq2k6i3u6qodbjdp1ardk0ca96\">GIF</a> | <a \
    href=\"093/001060-119.01-e.php?image_id_nbr=6979&document_id_nbr=2061&\
    f=p&PHPSESSID=kq2k6i3u6qodbjdp1ardk0ca96\">PDF</a></td>
    </tr>"

I've been playing around with Shortest[] and StringCases[] to try to find a way, but is there a function I could quickly map over this list to produce the following?

"1841-1869 (Province of Canada), number 195, 21 June 1845, page 15",
"1841-1869 (Province of Canada), number 402, Extra, 16 May 1849, page 4"

The actual text changes, i.e. it doesn't always say 1841-1869 etc., but the general format of the <tr> portion remains consistent. I don't mind if the GIF | PDF part remains either. Would there be a quick way to just render the actually visible part of the HTML file in this list?

If somebody wants to reproduce the list I am using, here is my code (I include this only to make things simpler, but this is probably more of a theoretical question):

baseurl = 
  "http://www.collectionscanada.gc.ca/databases/canada-gazette/001060-\
110.01-e.php?q1=youth&q3=&interval=199&sk=";
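(* Fetch the raw HTML source of each results page *)
pagelist = 
  Import[baseurl <> ToString[#], "Source"] & /@ Range[1, 1889, 199];
(* Concatenate all pages into a single string *)
pagetext = Apply[StringJoin, pagelist];
(* Extract every <tr>...</tr> block, taking the shortest match each time *)
trlist = StringCases[pagetext, Shortest["<tr>" ~~ ___ ~~ "</tr>"]];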

Have you tried importing the HTML file as "Data"? It might be easier to extract your data from that. – Heike Feb 02 '12 at 00:26
HTML is a hell of a complicated file format (example: <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"><><title//<p ltr<span></span</p></> is a valid document). Can't you get your hands on some XML version of the page? (And that is before even considering the horrible code the average page has; keyword: quirks mode.) – David Feb 02 '12 at 01:10

score 5 · Answer 1 · answered Feb 02 '12 at 00:31

I don't know how robust this is, but this function seems to do what you want:

ImportString[
    ExportString[Delete[ImportString[#, "Table"], {{2}, {-2}}], "Table"], 
    "HTML"
] &
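
As far as I can tell, the function works like this: ImportString[#, "Table"] splits the row into lines, Delete[..., {{2}, {-2}}] drops the second and second-to-last lines (the index cell and the links cell), and the remainder is rendered as HTML. To apply it to the whole list it can be named and mapped. This is a minimal sketch: cleanRow is a hypothetical name, and it assumes every element spans five lines like the two examples in the question.

```mathematica
(* cleanRow is a hypothetical name for the pure function above *)
cleanRow = ImportString[
    ExportString[Delete[ImportString[#, "Table"], {{2}, {-2}}], "Table"],
    "HTML"] &;

(* trlist is the list of "<tr>...</tr>" strings built in the question *)
cleaned = cleanRow /@ trlist;
```

If a row is laid out differently (for instance, all on one line), positions 2 and -2 will not point at the index and link cells, so it is worth spot-checking a few results.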

Leonid Shifrin


Great, thank you. For some reason, my internet connection is having trouble with Mathematica so it may take me a bit of time to verify this! – canadian_scholar Feb 02 '12 at 01:02
@Ian No problem, was happy to help. – Leonid Shifrin Feb 02 '12 at 01:15

Heike · Accepted Answer · 2012-02-02T00:49:45.747

4

For the two strings in your first example, this seems to work

ImportString[string, "HTML"]
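
Mapped over the whole list, this could look like the sketch below. The helper names toText and stripExtras are hypothetical. The second helper assumes the rendered text has the form "1. description GIF | PDF" once whitespace is collapsed; I have not verified that for every row, so strings that do not match are simply returned unchanged.

```mathematica
(* toText and stripExtras are hypothetical helper names *)

(* Render one "<tr>...</tr>" string as plain text and collapse whitespace runs *)
toText[s_String] := StringTrim[
   StringReplace[ImportString[s, {"HTML", "Plaintext"}], Whitespace -> " "]];

(* Assumed layout: "1. <description> GIF | PDF".
   Strings that do not match are returned unchanged. *)
stripExtras[s_String] := StringReplace[s, {
    StartOfString ~~ (DigitCharacter ..) ~~ ". " -> "",
    " GIF | PDF" ~~ EndOfString -> ""}];

(* trlist is the list built in the question *)
titles = stripExtras /@ toText /@ trlist;
```

Since the question says the "GIF | PDF" part may stay, the stripExtras step is optional.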

For the baseurl as in the original post, Import[baseurl, "Data"] gives something like

data = Import[baseurl, "Data"]
data[[2, ;; 4]]

{{"Item", "View Options"}, {
  1., "1841-1869 (Province of Canada), number 195, 21 June 1845, page \
15", "GIF | PDF"}, {
  2., "1841-1869 (Province of Canada), number 402, Extra, 16 May \
1849, page 4", "GIF | PDF"}, {
  3., "1841-1869 (Province of Canada), number 405, 26 May 1849, page \
15", "GIF | PDF"}}

so it looks like data[[2, 2 ;;, 2]] gives you the list you're after (the 2 ;; skips the header row, whose second entry is "View Options").
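
To collect the descriptions from every results page, the same extraction can be wrapped in a function and mapped over the page offsets used in the question. This is only a sketch: pageRows and allTitles are hypothetical names, and it assumes that on every page the results table is the second element of the "Data" import with a header in its first row, which I have only seen for the output above.

```mathematica
(* baseurl as defined in the question; it ends in "sk=" *)
offsets = Range[1, 1889, 199];

(* pageRows is a hypothetical helper: second column of the results table,
   with 2 ;; skipping the header row *)
pageRows[k_Integer] := Import[baseurl <> ToString[k], "Data"][[2, 2 ;;, 2]];

(* One flat list of descriptions across all pages *)
allTitles = Join @@ (pageRows /@ offsets);
```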

edited Feb 02 '12 at 00:49

answered Feb 02 '12 at 00:32


Not quite. It produces extra pieces. – Leonid Shifrin Feb 02 '12 at 00:33
@LeonidShifrin do you mean the indices and the "GIF | PDF" at the beginning and end? I guess importing as "Table" as in your solution gets rid of those, but it's probably easier to just import the file as "Data" in the first place in this case. – Heike Feb 02 '12 at 00:46
Yes, I meant that. When I was posting a comment, only the first part of your answer was available. Using "Data" is a good alternative, I agree. – Leonid Shifrin Feb 02 '12 at 00:48
Great, thank you. Let me play with this as soon as my internet connection works! – canadian_scholar Feb 02 '12 at 01:02
