There is a PDF report that has a good list of references for my field. Can I extract them as a BibTeX file so that I can reuse them?
Welcome to TeX.SX! Very probably that is not possible. A tool can't know in which format the bibliography was printed, and information was probably lost in the process. Even if you automated it, you would probably need serious time to check that everything was extracted correctly. At that point you might as well copy it yourself. – Juri Robl Feb 03 '15 at 19:28
If the DOIs are present, you can try http://www.doi2bib.org/#/doi – Clément Feb 03 '15 at 19:58
See also "Convert .bbl file to .bib file". That posting starts with the assumption that the .bbl file is available (in addition to the PDF file). However, the discussion applies to PDF files as well. – Mico Feb 03 '15 at 20:57
4 Answers
If someone is still looking for a solution, anystyle is a good one-stop shop:
$ anystyle find <your pdf>
# returns a JSON-formatted list of all the references in the paper
Or, for BibTeX output:
$ anystyle -f bib find <your pdf>
# returns a BibTeX-formatted list of all the references in the paper
It seems this tool also supports BibTeX output (which is what this question asks about). Maybe you can add a bit of explanation to your answer to show how this is done? – Marijn Apr 23 '20 at 20:56
ParsCit should be what you're looking for. It is capable of extracting header metadata (title, authors, etc. of the document itself), logical document structure and citation metadata (individual fields of reference strings and citation contexts).
The web demo can parse both whole documents and individual reference strings. Unfortunately, it doesn't support PDF files at the moment, so you'll have to copy the text contents of your PDF file.
Welcome to TeX.SX! You can have a look at our starter guide to familiarize yourself further with our format. – Martin Schröder Oct 26 '15 at 12:30
Full disclosure: I developed the tool referred to below and am the founder of Scholarcy.
If the PDF is at a public URL and the host doesn't block remote downloads, then
https://www.scholarcy.com/bookmarklets
will do this. Otherwise, you can upload the PDF to
https://ref.scholarcy.com/api/
and download the references as .RIS or BibTeX.
It's not open source right now, but here's the basic approach:
- Get the PDF from the current url (the Python requests library is handy for this)
- Extract the text using one of the many libraries available for this purpose (poppler, pdfminer, xpdf, etc.)
- Look for a heading called References, Bibliography or similar. Start reading from there.
- Keep reading until you hit another heading
- Try to get each reference on a single line. This can be tricky, but removing line breaks in between lower case characters will get you most of the way there
- Now that you have a list of reference strings, feed them to a parser such as https://github.com/opensourceware/Neural-ParsCit (mentioned above) or Anystyle
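The text-processing steps in this list (finding the references heading, reading until the next heading, and getting each reference on a single line) could look roughly like the Python sketch below. It is not Scholarcy's code. The heading patterns, the function name `extract_reference_strings`, and the sample text are assumptions made up for illustration, and the download and PDF-to-text steps are left out, so the input is assumed to be already-extracted plain text.

```python
import re

# Assumed heading names; real documents vary a lot.
START_HEADING = re.compile(
    r"^\s*(?:\d+\.?\s+)?(?:references|bibliography|works cited)\s*$",
    re.IGNORECASE,
)
# Assumed shape of a heading that ends the reference section.
END_HEADING = re.compile(
    r"^\s*(?:appendix|acknowledge?ments?|supplementary)\b",
    re.IGNORECASE,
)


def extract_reference_strings(text):
    """Return one string per reference found in plain text (heuristic)."""
    lines = text.splitlines()
    # Look for the first line that consists only of a references heading.
    start = next(
        (i + 1 for i, line in enumerate(lines) if START_HEADING.match(line)),
        None,
    )
    if start is None:
        return []
    # Keep reading until another heading is hit.
    section = []
    for line in lines[start:]:
        if END_HEADING.match(line):
            break
        section.append(line.strip())
    body = "\n".join(section)
    # Join wrapped lines: drop a line break that sits between two
    # lower-case characters (replaced by a space to keep words apart).
    body = re.sub(r"(?<=[a-z])\n(?=[a-z])", " ", body)
    return [line for line in body.splitlines() if line]


if __name__ == "__main__":
    sample = (
        "1 Introduction\n"
        "Some body text.\n"
        "References\n"
        "[1] A. Author. A title that is\n"
        "wrapped across lines. Journal, 2010.\n"
        "[2] B. Writer. Another title. 2012.\n"
        "Appendix A\n"
        "More text.\n"
    )
    for reference in extract_reference_strings(sample):
        print(reference)
```

The lower-case rule misses wraps that end in punctuation, a capital letter or a hyphen, so the resulting strings still need a manual check before they go to a parser.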
Use the refextract Python library. With a small piece of code, you can extract the bibliographic information from multiple PDFs.
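As a rough sketch of what such a piece of code might look like: `extract_references_from_file` is the function shown in the refextract README, and it returns one dictionary per reference with list-valued fields. Everything else here is an assumption for illustration. refextract does not write BibTeX itself, so `to_bibtex` and `pdfs_to_bibtex` are hypothetical helpers, and the field names in the mapping (`author`, `title`, `journal_title`, `journal_volume`, `journal_page`, `year`, `raw_ref`) should be checked against the output you actually get.

```python
def first(ref, field):
    """Return the first value of a list-valued field, or None if absent."""
    values = ref.get(field) or []
    return values[0] if values else None


def to_bibtex(ref, key):
    """Format one reference dict as a rough @misc entry (hypothetical helper).

    Values are copied as-is; special characters are not escaped.
    """
    mapping = {  # BibTeX field -> assumed refextract key
        "author": "author",
        "title": "title",
        "journal": "journal_title",
        "volume": "journal_volume",
        "pages": "journal_page",
        "year": "year",
        "note": "raw_ref",
    }
    fields = []
    for bib_field, ref_key in mapping.items():
        value = first(ref, ref_key)
        if value:
            fields.append(f"  {bib_field} = {{{value}}}")
    return "@misc{" + key + ",\n" + ",\n".join(fields) + "\n}"


def pdfs_to_bibtex(paths):
    """Extract references from several PDFs and return one BibTeX string."""
    # Imported here so the helpers above work without refextract installed.
    from refextract import extract_references_from_file

    entries = []
    for path in paths:
        for ref in extract_references_from_file(path):
            entries.append(to_bibtex(ref, f"ref{len(entries) + 1}"))
    return "\n\n".join(entries)
```

Calling `print(pdfs_to_bibtex(["report.pdf"]))`, with your own file names in place of the placeholder `report.pdf`, would then print the entries. The `@misc` type and the `ref1`, `ref2`, ... keys are placeholders, and the entries need to be reviewed by hand.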
Welcome to TeX.SE! Can you please explain what one has to do to solve the given issue? – Mensch May 12 '20 at 15:10
The GitHub page of refextract explains the usage: https://github.com/inspirehep/refextract. – Ali Erkan Sep 03 '20 at 10:26