15

How can I scrape headlines from the New York Times and Wall Street Journal main pages to create datasets similar to this service?

Importing HTML from nyt.com (HTML4) results in a String, and the markup is not preserved. Is there a workaround? The wsj.com page (XHTML) is either not valid (is there a W3C XHTML validator?) or is a problem for XML`Parser`. Any clues?

In[362]:= StringQ@Import["http://nyt.com","HTML"]
Out[362]= True

Import["http://wsj.com","XML"]
During evaluation of In[361]:= XML`Parser`XMLGet::nfprserr: Attribute 'property' is not declared for element 'meta' at Line: 11 Character: 71 in /tmp/m00009067531/wsj.
During evaluation of In[361]:= XML`Parser`XMLGet::prserr: Expected an attribute name at Line: 50 Character: 45 in /tmp/m00009067531/wsj.
During evaluation of In[361]:= Import::fmterr: Cannot import data as XML format. >>
Out[361]= $Failed
rm -rf
alancalvitti

3 Answers

18

You can always do Import["http://wsj.com","XMLObject"]. That has the side effect of producing some irregular XML whenever the underlying HTML doesn't quite map cleanly to XML, but it mostly produces an XMLObject[] expression tree that you can match over and extract data from, and I've never seen a web page for which it won't return something.
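
As a rough sketch of the "match over and extract" step: the small tree below is a made-up stand-in for what Import returns, and anchorTexts is a hypothetical helper name, not a built-in. A real front page will almost certainly need a more specific pattern (for example, one that also tests the attributes of the enclosing element).

```mathematica
(* Hypothetical miniature tree standing in for Import[url, "XMLObject"] *)
doc = XMLObject["Document"][{},
   XMLElement["html", {}, {
     XMLElement["body", {}, {
       XMLElement["h2", {"class" -> "headline"},
        {XMLElement["a", {"href" -> "/story-1"}, {"First headline"}]}],
       XMLElement["h2", {"class" -> "headline"},
        {XMLElement["a", {"href" -> "/story-2"}, {"Second headline"}]}]}]}],
   {}];

(* Illustrative helper: collect the text of every <a> element
   whose only child is a plain string, at any depth in the tree *)
anchorTexts[xml_] :=
  Cases[xml, XMLElement["a", _, {text_String}] :> text, Infinity];

anchorTexts[doc]
```

Searching with Cases at level Infinity means you don't need to know how deeply the links are nested, which helps when the HTML maps irregularly onto XML.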

sblom
  • thanks... If only pattern matching was straightforward without a document model. – alancalvitti Apr 21 '12 at 02:21
  • This page available also in the help is useful for extracting info from an XMLObject: http://reference.wolfram.com/mathematica/XML/tutorial/TransformingXML.html – faysou May 03 '12 at 16:07
16

I agree wholeheartedly with the comment of celtschk to the OP. Both journals have RSS feeds (with pointers at the bottom of their main pages) that are designed exactly for the purpose that you describe. I doubt that either journal wants you to "scrape" their content; scraping is specifically forbidden by the WSJ Terms of Use.

I don't know how much easier the RSS feed could be:

NotebookPut[Import["http://blogs.wsj.com/digits/feed/", "RSS"]];

You can also import explicitly as XML, if you prefer to massage the result into some other form. Since RSS is ultimately expressed as genuine XML, you won't risk the dangers inherent in importing HTML as XML.
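
Here is a minimal sketch of that kind of massaging. To keep it self-contained it parses a tiny made-up feed with ImportString rather than hitting the network; for a live feed you would use Import[feedURL, "XML"] instead. The helper name itemTitles is hypothetical, and I'm assuming the usual RSS 2.0 layout of item elements that each contain a title.

```mathematica
(* Made-up miniature RSS 2.0 feed; a real one would come from Import[feedURL, "XML"] *)
rss = "<rss version=\"2.0\"><channel><title>Demo feed</title>" <>
   "<item><title>First headline</title><link>http://example.com/1</link></item>" <>
   "<item><title>Second headline</title><link>http://example.com/2</link></item>" <>
   "</channel></rss>";

feed = ImportString[rss, "XML"];

(* Illustrative helper: for each <item>, take the text of its first <title> child.
   The channel-level <title> is skipped because it is not inside an <item>. *)
itemTitles[xml_] :=
  Cases[xml,
   XMLElement["item", _, children_] :>
    FirstCase[children, XMLElement["title", _, {t_String}] :> t],
   Infinity];

itemTitles[feed]
```

The same pattern can be extended to pull the link or pubDate children alongside each title if you want rows for a dataset.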

Mark McClure
  • This answer is so obvious now that I think about it. Great job. – Searke May 03 '12 at 17:07
  • @MarkMcClure Fantastic! I didn't know this Mathematica functionality! – Rod Jun 08 '13 at 19:57
  • @RodLm Yes, that is pretty awesome, isn't it. Mathematica's import/export is generally outstanding in two ways - the broad range of filetypes it works with and the way it consistently represents things in its own language. You might also be interested in this question, which deals with importing HTML as XML. – Mark McClure Jun 08 '13 at 21:13
  • @MarkMcClure Yes, I've seen this question before... I actually developed a similar code to count "unanswered" questions at Yahoo! Answers according to given keywords... It's also amazing! – Rod Jun 08 '13 at 21:19
5

Just import the source of the page instead of its rendered content:

Import["http://nyt.com", "Source"]
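
Since the source comes back as one string, you can then pick pieces out with string patterns. A crude sketch, using a made-up snippet in place of the real page; linkTexts is a hypothetical helper, and the pattern is deliberately naive (it would also fire on any other tag whose name starts with "a", and it ignores nested markup inside links):

```mathematica
(* Made-up snippet standing in for the string returned by Import[url, "Source"] *)
html = "<h2><a href=\"/story-1\">First headline</a></h2>" <>
   "<h2><a href=\"/story-2\">Second headline</a></h2>";

(* Illustrative helper: grab the text between each opening <a ...> tag
   and the nearest following </a> *)
linkTexts[src_String] :=
  StringCases[src,
   "<a" ~~ Shortest[___] ~~ ">" ~~ Shortest[t__] ~~ "</a>" :> t];

linkTexts[html]
```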
F'x