I have a LaTeX project (a research paper) that uses a .bib file (old.bib). I received a new .bib file (new.bib), which includes the BibTeX entries present in old.bib with more complete information, but sometimes under a different entry name.

How can I merge new.bib with old.bib so that new.bib uses the entry names from old.bib for the entries present in both files? Note that I don't want to add any entries to new.bib, only to change the entry names of entries that are also present in old.bib.


Example (in practice I'll have much longer .bib files):

Input: new.bib:

@inproceedings{lesterpaper,
  title={The Power of Scale for Parameter-Efficient Prompt Tuning},
  author={Lester, Brian and Al-Rfou, Rami and Constant, Noah},
  booktitle={Empirical Methods in Natural Language Processing},
  pages={3045--3059},
  publisher = {Association for Computational Linguistics},
  year={2021}
}

@article{bommasani2023holistic, title={Holistic Evaluation of Language Models}, author={Bommasani, Rishi and Liang, Percy and Lee, Tony}, journal={Annals of the New York Academy of Sciences}, year={2023}, publisher={Wiley Online Library} }

old.bib:

@inproceedings{lester,
  title={The Power of Scale for Parameter-Efficient Prompt Tuning},
  author={Lester, Brian and Al-Rfou, Rami and Constant, Noah},
  booktitle={EMNLP},
  publisher = {Association for Computational Linguistics},
  year={2021}
}

@inproceedings{tokpo2022text, title={Text Style Transfer for Bias Mitigation using Masked Language Modeling}, author={Tokpo, Ewoenam Kwaku and Calders, Toon}, booktitle={NAACL: HLT-SRW}, pages={163--171}, publisher = {Association for Computational Linguistics}, year={2022} }

Output new new.bib:

@inproceedings{lester,
  title={The Power of Scale for Parameter-Efficient Prompt Tuning},
  author={Lester, Brian and Al-Rfou, Rami and Constant, Noah},
  booktitle={Empirical Methods in Natural Language Processing},
  pages={3045--3059},
  publisher = {Association for Computational Linguistics},
  year={2021}
}

@article{bommasani2023holistic, title={Holistic Evaluation of Language Models}, author={Bommasani, Rishi and Liang, Percy and Lee, Tony}, journal={Annals of the New York Academy of Sciences}, year={2023}, publisher={Wiley Online Library} }

The only change in the new new.bib in this example is that the BibTeX entry name lesterpaper was changed to lester.

  • I don't see how any automated tool could be expected to do this reliably. lester and lesterpaper are not the same entry as booktitle differs. What is to distinguish a case where the same authors have published under the same title in two different books issued by the same publisher in the same year versus a case where you've entered different titles for the same book in different places? If you had a unique identifier, I could see how it might work - at least in principle - but that would need something like ISBN for books. – cfr Mar 28 '24 at 03:31
  • @cfr Thanks, let's assume the title is enough to merge. – Franck Dernoncourt Mar 28 '24 at 03:33
  • Maybe look at scripting biber in tool mode? Not sure if you could do it just with biber or if you'd need something else. Thinking about it, if your bib files are reasonably consistently formatted, you could probably just use awk. You could even just cut old.bib and new.bib into pieces, pull title from each piece of new, grep old and get the key if there's a match. Then just sed/awk to replace the key. Of course, it probably wouldn't work on Windows, but it seems doable anywhere else. – cfr Mar 28 '24 at 04:19
  • As @cfr, I see no completely automated way to do this (at least, if you expect it to be "safe"). But I don't see why some kind of interactive tool could not assist with the task. I'm not a JabRef user myself, but I recall it having some features of the sort. See, for example: https://docs.jabref.org/finding-sorting-and-cleaning-entries/findduplicates, https://docs.jabref.org/finding-sorting-and-cleaning-entries/mergeentries. – gusbrs Mar 28 '24 at 11:39
  • @gusbrs Thanks, "edit distance algorithm. Extra weighting is put on the fields author, editor, title. and journal." is pretty much exactly how I would code it. – Franck Dernoncourt Mar 28 '24 at 15:25
  • @gusbrs I wondered about that, but I don't know JabRef well enough to make any suggestion. So it may come down to how consistent the formatting is and how much work is involved in GUI-assisted versus scripting. Does JabRef overwrite keys, though? I seem to remember that being a problem with many GUI bib managers, but I've not looked at any for years. – cfr Mar 28 '24 at 19:29
  • @cfr As I've said, I'm also not a JabRef user, so I'm not well acquainted with the process. But I'd expect it to interactively identify potential duplicates and let the user decide what to keep, overwriting if necessary, of course. So I think it is a supervised process rather than an automatic one. It only helps find the duplicates, which is better than trying to "eyeball" them, and guides one through conflict resolution. Well, that's what I'd expect, I've never actually used it. ;-) Anyway, for this particular task, for the reasons you mentioned, I'd go with a GUI rather than a script. – gusbrs Mar 28 '24 at 19:48
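
The weighted similarity idea quoted in the comments above could be sketched roughly as follows. This is only an illustration: the weights, the field list and the function name `similarity` are assumptions, not JabRef's actual algorithm, and `difflib.SequenceMatcher` computes a similarity ratio rather than a true edit distance.

```python
from difflib import SequenceMatcher

# Hypothetical weights chosen for illustration; these are not JabRef's values.
WEIGHTS = {"title": 3.0, "author": 2.0, "year": 1.0}


def similarity(a, b):
    """Weighted average of per-field similarity ratios, in the range [0, 1]."""
    total = 0.0
    weight_sum = 0.0
    for field, weight in WEIGHTS.items():
        x, y = a.get(field), b.get(field)
        if x is None or y is None:
            continue  # ignore fields missing from either entry
        total += weight * SequenceMatcher(None, x.lower(), y.lower()).ratio()
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


lester_old = {
    "title": "The Power of Scale for Parameter-Efficient Prompt Tuning",
    "author": "Lester, Brian and Al-Rfou, Rami and Constant, Noah",
    "year": "2021",
}
lester_new = dict(lester_old)  # same title, author and year as in the question
bommasani = {
    "title": "Holistic Evaluation of Language Models",
    "author": "Bommasani, Rishi and Liang, Percy and Lee, Tony",
    "year": "2023",
}

print(similarity(lester_old, lester_new))
print(similarity(lester_old, bommasani))
```

A pair would then count as the same entry only above some threshold, which would have to be tuned by hand on the actual .bib files.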

1 Answer

Reddit user CommanderCoo wrote this Python script, based on the Python library bibtexparser, to merge the two BibTeX files based on title, authors and year:

import bibtexparser

# Define bibtex files
old_bib = "old.bib"
new_bib = "new.bib"

# Open bibtex files
with open(old_bib) as bibtex_file:
    old_bib_database = bibtexparser.load(bibtex_file)
    old_entries = {entry['ID']: entry for entry in old_bib_database.entries}

with open(new_bib) as bibtex_file:
    new_bib_database = bibtexparser.load(bibtex_file)
    new_entries = {entry['ID']: entry for entry in new_bib_database.entries}

# Compare bibtex files
for new_id, new_entry in new_entries.items():
    for old_id, old_entry in old_entries.items():
        # Parameters for bibtex entry comparison
        if (old_entry.get('title') == new_entry.get('title') and
                old_entry.get('author') == new_entry.get('author') and
                old_entry.get('year') == new_entry.get('year')):
            new_entries[new_id]['ID'] = old_id

# Aggregate merged bibtex information
new_bib_database.entries = list(new_entries.values())

# Throw the bibtex into a new file
with open('merged.bib', 'w') as bibtex_file:
    bibtexparser.dump(new_bib_database, bibtex_file)

That works.
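
One caveat: the script compares the fields as exact strings, so a difference in case, protective braces or spacing between the two files would prevent a match. Below is a minimal sketch of a more tolerant comparison that uses the title only, as assumed in the comments. The helper names `normalize` and `rename_keys` are hypothetical, and the entries are assumed to be plain dicts with 'ID' and 'title' keys, like those the script gets from bibtexparser.

```python
import re


def normalize(title):
    """Lowercase, drop braces, and collapse each non-alphanumeric run to one space."""
    title = title.replace("{", "").replace("}", "")
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def rename_keys(new_entries, old_entries):
    """Return copies of new_entries whose IDs come from old_entries on a title match."""
    # Index the old keys by normalized title so that each lookup is a dict access.
    old_ids = {normalize(e["title"]): e["ID"] for e in old_entries if "title" in e}
    merged = []
    for entry in new_entries:
        entry = dict(entry)  # copy, so the caller's entries are left untouched
        key = normalize(entry.get("title", ""))
        if key and key in old_ids:
            entry["ID"] = old_ids[key]
        merged.append(entry)
    return merged


new_entries = [
    {"ID": "lesterpaper",
     "title": "{The Power of Scale} for Parameter-Efficient Prompt Tuning"},
    {"ID": "bommasani2023holistic",
     "title": "Holistic Evaluation of Language Models"},
]
old_entries = [
    {"ID": "lester",
     "title": "The power of scale for parameter-efficient prompt tuning"},
    {"ID": "tokpo2022text",
     "title": "Text Style Transfer for Bias Mitigation using Masked Language Modeling"},
]

merged = rename_keys(new_entries, old_entries)
print([e["ID"] for e in merged])
```

Entries that exist only in old.bib are never added, which matches the requirement in the question. Matching on the title alone can still confuse two distinct works that share a title, so the result is worth checking by hand.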