9

There is a PDF report that has a good list of references for my field. Can I extract them as a BibTeX file so that I can reuse them?

eee
  • 91
  • 4
    Welcome to TeX.SX! Very probably that is not possible. A tool doesn't know in which format the bibliography is printed, and information was probably lost when the PDF was produced. If you automated it, you would probably need considerable time to check whether everything was extracted correctly. At that point you can just copy it yourself. – Juri Robl Feb 03 '15 at 19:28
  • If the DOIs are present, you can try http://www.doi2bib.org/#/doi – Clément Feb 03 '15 at 19:58
  • 2
    See also Convert .bbl file to .bib file. That posting starts with the assumption that the .bbl file is available (in addition to the pdf file). However, the discussion applies to pdf files as well. – Mico Feb 03 '15 at 20:57

4 Answers

15

If someone is still looking for a solution, anystyle is a good one-stop-shop:

$ anystyle find <your pdf>
# returns a json-formatted list of all the references in the paper

Or, for BibTeX output:

$ anystyle -f bib find main.pdf
# returns BibTeX formatted list of all the references in the paper
Joe Corneli
  • 4,340
skadge
  • 151
  • 1
  • 2
  • 2
    It seems this tool also supports Bibtex output (which is what was asked about in this question), maybe you can add a bit of explanation to your answer to show how this is done? – Marijn Apr 23 '20 at 20:56
2

ParsCit should be what you're looking for. It is capable of extracting header metadata (title, authors, etc. of the document itself), logical document structure and citation metadata (individual fields of reference strings and citation contexts).

The Web demo offers both parsing of whole documents and parsing of individual reference strings. Unfortunately, it doesn't support PDF files at the moment, so you'll have to copy the text contents of your PDF file.

t3c
  • 21
  • 3
2

Full disclosure: I developed the tool referred to below and am the founder of Scholarcy.

If the PDF is at a public URL and the host doesn't block remote downloads, then

https://www.scholarcy.com/bookmarklets

will do this. Otherwise, you can upload the PDF to

https://ref.scholarcy.com/api/

and download the references as .RIS or BibTeX.

It's not open source right now, but here's the basic approach:

  1. Get the PDF from the current URL (the Python requests library is handy for this).
  2. Extract the text using one of the many libraries available for this purpose (poppler, pdfminer, xpdf, etc.).
  3. Look for a heading called References, Bibliography or similar. Start reading from there.
  4. Keep reading until you hit another heading.
  5. Try to get each reference on a single line. This can be tricky, but removing line breaks between lowercase characters will get you most of the way there.
  6. Now you have a list of reference strings. Feed them to a parser such as https://github.com/opensourceware/Neural-ParsCit (ParsCit is mentioned above) or Anystyle.
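
Steps 3 to 5 can be sketched in a few lines of Python. This is only a rough illustration, not Scholarcy's actual code: it assumes the PDF text has already been extracted into a string, and the function name, the heading patterns and the list of "closing" headings (Appendix, Acknowledgements) are my own guesses that you would need to adapt to your documents.

```python
import re

# Headings that usually open a reference list (an assumption; extend as needed).
# An optional section number such as "7" or "7." is allowed in front.
REF_HEADING = re.compile(
    r"^\s*(?:\d+\.?\s+)?(references|bibliography|works cited)\s*$",
    re.IGNORECASE,
)
# Headings that usually follow a reference list (also an assumption).
NEXT_HEADING = re.compile(
    r"^\s*(appendix|appendices|acknowledge?ments?)\b", re.IGNORECASE
)


def extract_reference_strings(text):
    """Return the reference strings found after a References-style heading."""
    lines = text.splitlines()

    # Step 3: find the heading and start reading on the line after it.
    start = next(
        (i + 1 for i, line in enumerate(lines) if REF_HEADING.match(line)), None
    )
    if start is None:
        return []

    # Step 4: keep reading until another heading shows up.
    section = []
    for line in lines[start:]:
        if NEXT_HEADING.match(line):
            break
        section.append(line.strip())

    # Step 5: remove line breaks that sit between two lowercase characters,
    # which rejoins most references that were wrapped over several lines.
    joined = re.sub(r"(?<=[a-z])[ \t]*\n[ \t]*(?=[a-z])", " ", "\n".join(section))
    return [line for line in joined.splitlines() if line.strip()]
```

The lowercase heuristic will miss breaks that fall before a capitalised word or a number, so expect to clean up some entries by hand before passing the strings to a parser.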
2

Use the refextract Python library. With a small piece of code, you can extract bibliographic information from multiple PDFs.
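
A minimal sketch of what that could look like, assuming refextract's `extract_references_from_file` function as described in its README. The helper name `extract_from_many`, its `extractor` parameter and the file names are hypothetical and only there for illustration. Note that refextract gives you parsed references as Python dictionaries, so you would still need to convert them to BibTeX yourself.

```python
def extract_from_many(pdf_paths, extractor=None):
    """Map each PDF path to the list of references extracted from it."""
    if extractor is None:
        # Imported lazily, so refextract is only required when no other
        # extractor function is supplied.
        from refextract import extract_references_from_file as extractor

    results = {}
    for path in pdf_paths:
        # Assumption: the extractor returns a list with one dict per reference.
        results[path] = extractor(path)
    return results


if __name__ == "__main__":
    # Hypothetical file names; replace them with your own PDFs.
    all_refs = extract_from_many(["report1.pdf", "report2.pdf"])
    for path, refs in all_refs.items():
        print(path, len(refs), "references found")
```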