How to scrape the headlines from New York Times and Wall Street Journal?

Question

How can I scrape headlines from the main pages of the New York Times and the Wall Street Journal to create datasets similar to this service?

Importing HTML from nyt.com (HTML4) results in a String, and the markup is not preserved. Is there a workaround? The wsj.com page (XHTML) is either not valid (is there a W3C XHTML validator?) or is a problem for XML`Parser. Any clues?

In[362]:= StringQ@Import["http://nyt.com","HTML"]
Out[362]= True

Import["http://wsj.com","XML"]
During evaluation of In[361]:= XML`Parser`XMLGet::nfprserr: Attribute 'property' is not declared for element 'meta' at Line: 11 Character: 71 in /tmp/m00009067531/wsj.
During evaluation of In[361]:= XML`Parser`XMLGet::prserr: Expected an attribute name at Line: 50 Character: 45 in /tmp/m00009067531/wsj.
During evaluation of In[361]:= Import::fmterr: Cannot import data as XML format. >>
Out[361]= $Failed

I don't think scraping the nyt site will be much fun. It's mostly javascript. — Sjoerd C. de Vries, Apr 20 '12 at 19:19
If you are after the headlines, http://feeds.nytimes.com/nyt/rss/HomePage is probably a better link to use. — celtschk, May 03 '12 at 15:50

score 18 · Accepted Answer · answered Apr 20 '12 at 19:13

You can always do Import["http://wsj.com","XMLObject"]. That has the side effect of producing some irregular XML whenever the underlying HTML doesn't quite map cleanly to XML, but it mostly produces an XMLObject[] expression tree that you can match over and extract data from, and I've never seen a web page for which it won't return something.
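
As a minimal sketch of what "match over and extract data from" can look like, the snippet below builds a tiny XMLObject from a literal string and pulls out the text of every h2 element with Cases. The markup and the choice of h2 as the headline tag are assumptions for illustration only; the actual structure of a news page has to be inspected first and will differ.

```mathematica
(* Hypothetical, well-formed sample markup; not taken from wsj.com or nyt.com *)
sample = "<html><body><div><h2>First headline</h2><p>Some text</p><h2>Second headline</h2></div></body></html>";

(* Parse the string into a symbolic XMLObject expression tree *)
xml = ImportString[sample, "XML"];

(* Find every h2 element at any depth whose only child is a string,
   and return that string *)
headlines = Cases[xml, XMLElement["h2", _, {text_String}] :> text, Infinity]
```

The same Cases pattern can be applied to the result of Import[url, "XMLObject"]. Elements with nested markup inside the heading will not match the {text_String} pattern and would need a looser pattern.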

answered Apr 20 '12 at 19:13

sblom

thanks... If only pattern matching was straightforward without a document model. – alancalvitti Apr 21 '12 at 02:21
This page, also available in the help, is useful for extracting info from an XMLObject: http://reference.wolfram.com/mathematica/XML/tutorial/TransformingXML.html – faysou May 03 '12 at 16:07

score 16 · Answer 2 · answered May 03 '12 at 16:27

I agree wholeheartedly with the comment of celtschk to the OP. Both journals have RSS feeds (with pointers at the bottom of their main pages) that are designed exactly for the purpose that you describe. I doubt that either journal wants you to "scrape" their content; scraping is specifically forbidden by the WSJ Terms of Use.

I don't know how much easier the RSS feed could be:

NotebookPut[Import["http://blogs.wsj.com/digits/feed/", "RSS"]];

You can also import explicitly as XML, if you prefer to massage the result into some other form. Since RSS is ultimately expressed as genuine XML, you won't risk the dangers inherent in importing HTML as XML.
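
For example, a rough sketch of importing the same feed as XML and collecting the item titles might look like the following. It assumes the feed URL above is still live and that each item's title element holds a single plain string; neither is guaranteed.

```mathematica
(* Import the RSS feed as a symbolic XMLObject instead of a formatted notebook *)
feed = Import["http://blogs.wsj.com/digits/feed/", "XML"];

(* For each item element, collect the string content of its title children.
   Restricting the search to item elements skips the channel-level title. *)
titles = Flatten @ Cases[
   feed,
   XMLElement["item", _, children_List] :>
    Cases[children, XMLElement["title", _, {t_String}] :> t],
   Infinity
   ]
```

Other fields such as link or pubDate can be collected the same way by changing the inner tag name.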

answered May 03 '12 at 16:27

Mark McClure

This answer is so obvious now that I think about it. Great job. – Searke May 03 '12 at 17:07
@MarkMcClure Fantastic! I didn't know this Mathematica functionality! – Rod Jun 08 '13 at 19:57
@RodLm Yes, that is pretty awesome, isn't it. Mathematica's import/export is generally outstanding in two ways - the broad range of filetypes it works with and the way it consistently represents things in its own language. You might also be interested in this question, which deals with importing HTML as XML. – Mark McClure Jun 08 '13 at 21:13
@MarkMcClure Yes, I've seen this question before... I actually developed a similar code to count "unanswered" questions at Yahoo! Answers according to given keywords... It's also amazing! – Rod Jun 08 '13 at 21:19

score 5 · Answer 3 · answered Apr 20 '12 at 20:34

Just import the source of the page instead of its rendered content:

Import["http://nyt.com", "Source"]

answered Apr 20 '12 at 20:34

F'x

Import using "Source" works, but "XMLObject" yields more structured results. It comes down to being able to match XML vs. text expressions. Either way, most of the work is in modeling the schema. – alancalvitti Apr 21 '12 at 02:20
Yes… also, as we all know, you can’t parse HTML/XML with regexps – F'x Apr 21 '12 at 06:07
maybe you can't parse general HTML (why not XML?) but that doesn't imply you can't parse NYT or WSJ, modulo legal restrictions as pointed out by McClure. – alancalvitti May 04 '12 at 03:52
