Is it possible to extract the bibliography from a PDF file as a .bibtex?

Question

There is a PDF report with a good list of references for my field. Can I extract them as a BibTeX file so that I can reuse them?

Welcome to TeX.SX! Very probably that is not possible. A tool cannot know in which format the bibliography was printed, and some information was probably lost when the PDF was produced. Even if you automated it, you would need serious time to check that everything was extracted correctly; at that point you might as well copy the entries yourself. — Juri Robl, Feb 03 '15 at 19:28
If the DOIs are present, you can try http://www.doi2bib.org/#/doi — Clément, Feb 03 '15 at 19:58
See also Convert .bbl file to .bib file. That posting starts with the assumption that the .bbl file is available (in addition to the pdf file). However, the discussion applies to pdf files as well. — Mico, Feb 03 '15 at 20:57
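
As a rough sketch of the DOI route suggested in the comments: once the text of the PDF has been extracted, DOIs can be picked out with a regular expression and resolved to BibTeX through DOI content negotiation (an `Accept: application/x-bibtex` header sent to https://doi.org/). The function names and the DOI pattern below are illustrative assumptions, the pattern is a heuristic rather than a full DOI grammar, and whether BibTeX comes back depends on the registration agency behind each DOI.

```python
import re
import urllib.request

# Heuristic pattern for DOIs of the form 10.NNNN/suffix (an assumption,
# not a complete grammar; unusual DOIs may be missed or truncated).
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def find_dois(text):
    """Return the distinct DOIs found in text, in order of appearance."""
    seen = []
    for match in DOI_RE.findall(text):
        doi = match.rstrip(".,;)")  # drop trailing sentence punctuation
        if doi not in seen:
            seen.append(doi)
    return seen


def bibtex_request(doi):
    """Build a request asking the DOI resolver for a BibTeX record."""
    return urllib.request.Request(
        "https://doi.org/" + doi,
        headers={"Accept": "application/x-bibtex"},
    )


def fetch_bibtex(doi, timeout=10):
    """Fetch the BibTeX entry for one DOI (needs network access)."""
    with urllib.request.urlopen(bibtex_request(doi), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling `fetch_bibtex(doi)` for each result of `find_dois(text)` and concatenating the answers gives a .bib file for the references that carry a DOI; the rest still need one of the parsers discussed in the answers.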

score 15 · Answer 1 · edited Nov 26 '20 at 21:54


If someone is still looking for a solution, anystyle is a good one-stop-shop:

$ anystyle find <your pdf>
# returns a json-formatted list of all the references in the paper

Or, for BibTeX output:

$ anystyle -f bib find main.pdf
# returns BibTeX formatted list of all the references in the paper

answered Apr 23 '20 at 20:46 by skadge · edited Nov 26 '20 at 21:54 by Joe Corneli

It seems this tool also supports BibTeX output (which is what the question asked about). Maybe you can add a bit of explanation to your answer to show how this is done? – Marijn Apr 23 '20 at 20:56

score 2 · Answer 2 · answered Oct 26 '15 at 12:24


ParsCit should be what you're looking for. It is capable of extracting header metadata (title, authors, etc. of the document itself), logical document structure and citation metadata (individual fields of reference strings and citation contexts).

The web demo can parse either whole documents or individual reference strings. Unfortunately, it doesn't support PDF files at the moment, so you'll have to copy the text contents of your PDF file.

answered Oct 26 '15 at 12:24 by t3c

Welcome to TeX.SX! You can have a look at our starter guide to familiarize yourself further with our format. – Martin Schröder Oct 26 '15 at 12:30

Phil Gooch · Answer 3 · 2018-07-06T15:12:35.177

Full disclosure: I developed the tool referred to below and am the founder of Scholarcy.

If the PDF is at a public URL and the host doesn't block remote downloads, then

https://www.scholarcy.com/bookmarklets

will do this. Otherwise, you can upload the PDF to

https://ref.scholarcy.com/api/

and download the references as RIS or BibTeX.

It's not open source right now, but here's the basic approach:

1. Get the PDF from the current URL (the Python requests library is handy for this).
2. Extract the text using one of the many libraries available for this purpose (poppler, pdfminer, xpdf, etc.).
3. Look for a heading called References, Bibliography or similar. Start reading from there.
4. Keep reading until you hit another heading.
5. Try to get each reference on a single line. This can be tricky, but removing line breaks between lower-case characters will get you most of the way there.
6. Now you have a list of reference strings. Feed them to a parser such as https://github.com/opensourceware/Neural-ParsCit (mentioned above) or Anystyle.
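
The heading search and line-joining steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Scholarcy implementation: it starts from text that has already been extracted (for example with poppler's pdftotext), the heading names in START and STOP are guesses to adapt to your documents, and the function names are made up for this example.

```python
import re

# Assumed heading names; extend the alternatives for your own documents.
START = re.compile(
    r"^\s*(?:\d+\.?\s+)?(?:references|bibliography|works cited)\s*$", re.I
)
STOP = re.compile(
    r"^\s*(?:appendix|appendices|acknowledge?ments|supplementary)\b", re.I
)


def reference_block(text):
    """Return the non-empty lines between the last reference heading
    and the next heading (or the end of the text)."""
    lines = text.splitlines()
    starts = [i for i, line in enumerate(lines) if START.match(line)]
    if not starts:
        return []
    block = []
    for line in lines[starts[-1] + 1:]:
        if STOP.match(line):
            break
        if line.strip():
            block.append(line.strip())
    return block


def join_wrapped(block):
    """Remove line breaks that sit between two lower-case letters,
    then return one string per remaining line."""
    joined = re.sub(r"(?<=[a-z])\n(?=[a-z])", " ", "\n".join(block))
    return [line for line in joined.split("\n") if line]


SAMPLE = """Introduction
Some body text.
References
[1] A. Author. A very long title that
continues on the next line. Journal, 2010.
[2] B. Writer. Another title. Press, 2012.
Appendix A
Extra material.
"""

if __name__ == "__main__":
    for reference in join_wrapped(reference_block(SAMPLE)):
        print(reference)
```

The lower-case rule misses wrapped lines that begin with a capital letter or a digit, so real documents usually need a further check, such as looking for the numbering pattern at the start of each entry, before the strings go to the parser.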

Apologies, disclosure now added – Phil Gooch Jul 04 '18 at 08:03

score 2 · Answer 4 · answered May 12 '20 at 13:10


Use the refextract Python library. With a few lines of code, you can extract bibliographic information from multiple PDFs.
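
A minimal sketch of what that could look like, assuming refextract is installed and that each reference comes back as a dict of lists with keys such as author, journal_title, journal_volume, journal_page, year and texkey, as in the example output of the project's README. The helper names and the mapping to BibTeX fields are illustrative assumptions; check the keys against the version you install.

```python
def first(ref, key):
    """Return the first value stored under key, or an empty string."""
    values = ref.get(key) or []
    return values[0] if values else ""


def to_bibtex(ref, index):
    """Format one refextract-style dict as a BibTeX entry.
    The field mapping is an assumption made for this sketch."""
    fields = {
        "author": first(ref, "author"),
        "journal": first(ref, "journal_title"),
        "volume": first(ref, "journal_volume"),
        "pages": first(ref, "journal_page"),
        "year": first(ref, "year"),
    }
    kind = "article" if fields["journal"] else "misc"
    key = first(ref, "texkey") or f"ref{index}"
    body = "".join(
        f"  {name} = {{{value}}},\n" for name, value in fields.items() if value
    )
    return f"@{kind}{{{key},\n{body}}}\n"


def pdfs_to_bibtex(paths):
    """Extract references from several PDFs and return one BibTeX string."""
    # Imported here so the formatting helpers work without refextract.
    from refextract import extract_references_from_file

    entries = []
    for path in paths:
        for ref in extract_references_from_file(path):
            entries.append(to_bibtex(ref, len(entries) + 1))
    return "\n".join(entries)
```

Titles are not among the fields mapped here, and values are written out without escaping, so the result is a starting point that still needs to be checked by hand.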

answered May 12 '20 at 13:10 by Ali Erkan

Welcome to TeX.SE! Can you please explain what one has to do to solve the given issue? – Mensch May 12 '20 at 15:10

The GitHub page of refextract explains the usage: https://github.com/inspirehep/refextract – Ali Erkan Sep 03 '20 at 10:26
