1

I want to create and train a model that classifies new text content into categories such as finance, programming, analytics, design, etc. Where can I get enough data to train my models?

TIA.

Abhishek
  • Try downloading posts from http://data.stackexchange.com/ . StackOverflow posts are programming texts, money.SE and quant.SE finance, and so on. – Anton Tarasenko Jun 04 '15 at 12:35
  • That's actually a great idea. :D I am thinking of crawling Wikipedia according to its categories, as in: http://en.wikipedia.org/wiki/Portal:Contents/Categories Also, I am thinking of making my results open source, to help anyone else looking for similar results. Could you suggest where I can upload them? – Abhishek Jun 04 '15 at 12:54
  • As for Wikipedia, you may try this: https://cloud.google.com/bigquery/docs/dataset-wikipedia . – Anton Tarasenko Jun 04 '15 at 16:34
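
The idea in the comments above (use the source site as the topic label) can be sketched roughly as follows. This is only a minimal illustration using the Python standard library: the `SITE_TO_LABEL` mapping, the function names, and the toy posts are all assumptions made up for the example, not part of any Stack Exchange tool, and real exports would need HTML stripping and deduplication first.

```python
import csv
import io

# Hypothetical mapping from source site to topic label. This is an
# assumption for illustration: each site is treated as covering one topic.
SITE_TO_LABEL = {
    "stackoverflow": "programming",
    "money": "finance",
    "quant": "finance",
}


def label_posts(posts_by_site, site_to_label=SITE_TO_LABEL):
    """Return (text, label) pairs, skipping sites with no known label."""
    rows = []
    for site, posts in posts_by_site.items():
        label = site_to_label.get(site)
        if label is None:
            continue  # no topic assigned to this site
        for text in posts:
            text = " ".join(text.split())  # collapse newlines and extra spaces
            if text:
                rows.append((text, label))
    return rows


def to_csv(rows):
    """Serialize (text, label) pairs as CSV with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["text", "label"])
    writer.writerows(rows)
    return buf.getvalue()


# Toy stand-ins for downloaded posts (made up for this example).
sample = {
    "stackoverflow": ["How do I reverse a list\nin Python?"],
    "money": ["Should I pay off my mortgage early?"],
    "unknown": ["This site has no label and is skipped."],
}

rows = label_posts(sample)
csv_text = to_csv(rows)
print(csv_text)
```

The resulting `text,label` file is a common input shape for text classifiers, and the same labeling step would apply to Wikipedia articles if the category is used in place of the site name.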

1 Answer

2

There are plenty of resources out there but you'll have to do a bit of poking around to verify that they contain the specific information you're looking for:

1) HathiTrust makes available a dataset derived from 4.8 million public domain volumes totaling 1.8 billion pages. The dataset includes over 734 billion words, dozens of languages, and spans multiple centuries. Features are informative, quantified characteristics of a text, and include:

* Volume-level metadata
* Page-level features
  * Part-of-speech-tagged token counts
  * Header and footer identification
  * Sentence and line count
  * Algorithmic language detection
* Line-level features
  * Beginning and end line character count
  * Maximum length of the sequence of capital characters starting a line

sharc [dot] hathitrust [dot] org/features

2) The GDELT Project (supported by Google Jigsaw) is ambitious in scope. From their website: "GDELT monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, counts, themes, sources, emotions, counts, quotes and events driving our global society every second of every day, creating a free open platform for computing on the entire world."

gdeltproject [dot] org/#intro

3) Quandl [dot] com has thousands of financial and economic datasets available for download

4) You didn't say whether or not you had a budget. Here's a source that is worth looking into but isn't free...

www [dot] wallstreethorizon [dot] com/historical-events-data

Hope this helps...

DJohnson