I've linked phrases in texts to entities in Wikipedia:

Going over the bridge, coming from Aliante Casino, you can't miss the nice 
view of the <a href="http://en.wikipedia.org/wiki/Waterfall">waterfall</a> 
that is at the forefront

Now, I would like to get raw text of each linked Wikipedia page. Since the number of Wikipedia pages to process is around 20k, I would like to perform the process offline.

I've downloaded the latest Wikipedia dump in XML and extracted raw text using Wikiextractor; here's one line from its output:

{
    "id": "54551", 
    "url": "https://en.wikipedia.org/wiki?curid=54551", 
    "title": "Preston Tucker", 
    "text": "Preston Tucker\n\nPreston Thomas Tucker (September 21, 1903 – December 26, 1956) was an American automobile entrepreneur...."
}

It looks like the Wikipedia dump uses page IDs in its URLs instead of page titles:

http://en.wikipedia.org/wiki/Waterfall
https://en.wikipedia.org/wiki?curid=54551

How can I convert from one to the other?
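For context, the Wikiextractor output shown above can be indexed by title and by ID for offline lookup. This is a sketch assuming the dump was extracted with `--json` (one JSON object per line); the `lines` list below stands in for a real extracted file:

```python
import json

# Hypothetical sample standing in for lines read from a Wikiextractor file.
lines = [
    '{"id": "54551", "url": "https://en.wikipedia.org/wiki?curid=54551", '
    '"title": "Preston Tucker", "text": "Preston Tucker\\n\\n..."}',
]

title_by_id = {}   # curid -> title
text_by_title = {}  # title -> raw page text

for line in lines:
    page = json.loads(line)
    title_by_id[page["id"]] = page["title"]
    text_by_title[page["title"]] = page["text"]
```

With such a map, a link like `/wiki/Preston_Tucker` can be resolved by replacing underscores with spaces and looking the title up directly, with no conversion through curids at all.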

dzieciou

2 Answers


The title field in that data dump appears to be the page title itself, and can therefore be used to build the page URL - standard Wikipedia URLs are of the form .../wiki/Page_title

 {
     "id": "54551", 
     "url": "https://en.wikipedia.org/wiki?curid=54551", 
     "title": "Preston Tucker", 
     "text": "Preston Tucker\n\nPreston Thomas Tucker"
 }

So rather than using the provided URL, https://en.wikipedia.org/wiki?curid=54551, you could construct one of the form https://en.wikipedia.org/wiki/Preston_Tucker

The only change needed is to replace any spaces with underscores. Otherwise (I believe) the title should work straightforwardly as a URL.
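As a sketch, the conversion is a one-liner in Python; `quote` percent-encodes any characters that are not URL-safe (the function name and `safe` set are my choice, not anything from the dump format):

```python
from urllib.parse import quote

def title_to_url(title):
    """Build a standard Wikipedia URL from a page title.

    Spaces become underscores; remaining special characters are
    percent-encoded ("/" and ":" may legitimately appear in titles).
    """
    return "https://en.wikipedia.org/wiki/" + quote(title.replace(" ", "_"), safe="/:")

print(title_to_url("Preston Tucker"))
# https://en.wikipedia.org/wiki/Preston_Tucker
```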

Andrew is gone

Why not use a database with full-text search (FTS)? This is what databases are for.
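For example, a minimal sketch using SQLite's built-in FTS5 module (table name and sample row are illustrative); 20k pages fit comfortably in a single local database file:

```python
import sqlite3

# In-memory database for illustration; use a file path for a persistent store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE pages USING fts5(title, body)")

# Insert one extracted page (hypothetical sample text).
conn.execute(
    "INSERT INTO pages VALUES (?, ?)",
    ("Waterfall", "A waterfall is a point where water flows over a vertical drop..."),
)

# Full-text query across title and body; matching is case-insensitive.
row = conn.execute(
    "SELECT body FROM pages WHERE pages MATCH ?", ("waterfall",)
).fetchone()
```

Once the pages are loaded, looking up text by title or keyword is a single indexed query rather than a scan over 20k JSON lines.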

DuckDuckGo's Instant Answer API already does a lot of this as well: https://duckduckgo.com/api

BioMap