17

I am building a spoken-language classifier for speech/audio samples. I have been trying to find a dataset with a considerable number of speech samples in various languages. The audio files may be in any standard format (WAV, MP3, etc.) and should contain human voice or conversation with as little background noise or music as possible.

I have been unable to find any such dataset. Can someone share a link to a speech dataset that would be good for this research?

Magus

6 Answers

9

You can use the Tatoeba website, which offers full sentences in both text and audio as downloads.

Sentences with audio

Download: http://downloads.tatoeba.org/exports/sentences_with_audio.tar.bz2

Fields and structure: Sentence id

File description: Contains the ids of the sentences, in all languages, for which audio is available.
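
As a minimal sketch of how that export could be consumed, the Python below collects the sentence ids from the lines of the extracted file. It assumes the file is plain text with one record per line and the sentence id as the first tab-separated field; the function name and the sample lines are hypothetical, not taken from Tatoeba.

```python
def read_audio_sentence_ids(lines):
    """Collect sentence ids from an iterable of text lines.

    Assumption: each record is one line and the sentence id is the
    first tab-separated field. Lines without a numeric first field
    (blank lines, headers) are skipped.
    """
    ids = []
    for line in lines:
        first_field = line.rstrip("\n").split("\t")[0].strip()
        if first_field.isdigit():
            ids.append(int(first_field))
    return ids


# Hypothetical sample lines, only to show the expected shape.
sample = ["1276\n", "1277\tsome other column\n", "\n"]
print(read_audio_sentence_ids(sample))  # [1276, 1277]
```

In practice you would extract the archive first and pass the opened file object to the function, since a text file iterates line by line.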

Thanks to Nicolas Raoul in this answer.

philshem
8

I suggest looking at the SpokenLanguages2 dataset, which is very extensive, containing a bit more than one hour of speech for each of the 172 languages: https://community.topcoder.com/longcontest/?module=ViewProblemStatement&rd=16555&pm=13978

The audio is very high quality. They do not provide access to the test set, however, so you have to do your own split.
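
One way to do such a split, sketched in Python under the assumption that you have already built a list of (file path, language) pairs yourself: shuffle within each language and hold out a fixed fraction, so every language appears in both sets. The function name, the 20% default, and the file names are illustrative choices, not part of the dataset.

```python
import random
from collections import defaultdict


def split_by_language(samples, test_fraction=0.2, seed=0):
    """Split (path, language) pairs into train and test lists.

    Files are shuffled within each language so that both sets keep
    roughly the same language proportions. A language with a single
    file goes entirely to the training set.
    """
    by_language = defaultdict(list)
    for path, language in samples:
        by_language[language].append(path)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for language in sorted(by_language):
        paths = sorted(by_language[language])
        rng.shuffle(paths)
        n_test = max(1, round(len(paths) * test_fraction)) if len(paths) > 1 else 0
        test += [(p, language) for p in paths[:n_test]]
        train += [(p, language) for p in paths[n_test:]]
    return train, test


# Hypothetical file names, just to exercise the function.
samples = [(f"{lang}_{i}.mp3", lang) for lang in ("fr", "de") for i in range(10)]
train, test = split_by_language(samples)
print(len(train), len(test))  # 16 4
```

If several clips come from the same speaker or recording, it may be safer to group by recording before splitting, so the classifier is not tested on voices it has already heard.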

Ninn
7

Maybe you can do something with RhinoSpike (https://rhinospike.com/language/). It's a bit like Tatoeba, mentioned above.

Edit: Also take a look at LibriVox (public-domain audiobooks). It comes with an API, so you could send a request like https://librivox.org/api/feed/audiobooks?fields={url_zip_file,language}&offset=XXX&limit=YYY to extract the files' IDs.
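
A rough Python sketch of paging through that endpoint: it only builds the request URLs in the form quoted above, filling in offset and limit. The helper names and the page size are hypothetical, and the total number of records is something you would have to determine yourself; fetching and parsing the responses is left out.

```python
from urllib.parse import urlencode

BASE_URL = "https://librivox.org/api/feed/audiobooks"


def build_page_url(offset, limit, fields=("url_zip_file", "language")):
    """Build one request URL in the form quoted in the answer above."""
    query = {
        "fields": "{" + ",".join(fields) + "}",
        "offset": offset,
        "limit": limit,
    }
    # Keep braces and commas unescaped so the URL matches the quoted form.
    return BASE_URL + "?" + urlencode(query, safe="{},")


def page_urls(total, limit):
    """List the URLs needed to cover `total` records, `limit` per page."""
    return [build_page_url(offset, limit) for offset in range(0, total, limit)]


for url in page_urls(total=120, limit=50):
    print(url)
```

Each URL can then be fetched with any HTTP client, and the records filtered by the language field before downloading the zip files.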

fdeloche
7

The UCLA Phonetics Lab Archive offers many audio files of spoken words in many languages, although you'll have to dig through them and mix and match to your liking:
http://archive.phonetics.ucla.edu/
The IPA Handbook downloads actually seem to be exactly what you are asking for; however, they are for personal use only, and any other use requires contacting them for permission, so they are not 100% open data:
https://www.internationalphoneticassociation.org/content/ipa-handbook-downloads

albert
6

I've found something else: the Wide Language Index (see it on GitHub). It consists of a listing of radio podcasts, and the number of languages it covers is pretty impressive.

Note: It was made for the game The Great Language Game, which is fun ;).

fdeloche
2

Maybe you can check this website; it collects many audio-related datasets.

https://www.audiocontentanalysis.org/data-sets/

foxer lee