Seamless invocation of bib de-duplication script from document source preamble?

Question

Unfortunately, for reasons beyond my control, I am stuck with Mendeley as my citation manager. It provides an option to export its library to a bib file.

My Mendeley library is organised into folders by topic. A few entries appear in two folders because the technical areas those folders cover overlap heavily, e.g. paper "x" may be in topic A (folder A) but belongs equally to topic B (folder B). This results in duplicate entries in the exported bib file.

Mendeley provides a point-and-click deduplication facility, akin to similar functionality in JabRef, EndNote and other such tools. However, I do not wish to use it, for two reasons.

1. My citation manager (in conjunction with its cloud interface) is the only way I organise my reference literature for later reading. It gives me a portable way to manage a large library across multiple devices. I wish to retain paper "x" in both folder A and folder B, since at times I want to read on topic A and at other times on topic B, e.g. I sometimes use the sort-by-year option within a certain topic, and I don't want to miss paper "x" just because it has been de-duplicated.
2. Point-and-click interfaces are not conducive to a seamless workflow. I am currently writing a thesis. As I collect more references on the fly (a few go to topic A and topic B simultaneously), Mendeley continuously exports them to a bib file, resulting in a mess of duplicates. This necessitates a manual point-and-click de-duplication procedure, which defeats #1 above and is also tedious.

I understand that de-duplication of BibTeX entries is hard in general, but I am only talking about entries that are absolutely identical. Given its utility, is there any script (shell, Perl, Python or other) that can handle this de-duplication gracefully? Invoking such a script by hand from the command line is tedious, so an automated way to call it from the main.tex document (perhaps a line of code in the preamble resembling \bibtexdedup{}) would be advantageous. I am thinking of something like makeindex or makeglossaries, which use the shell-escape mechanism to do their job.

Is there any solution available that will achieve these goals?

I suppose bibtool could help you if the keys of duplicate entries could be guaranteed to coincide (see also https://tex.stackexchange.com/q/76420/35864 and https://tex.stackexchange.com/q/20027/35864), but I assume this is not the case. Then it of course gets much harder to de-duplicate entries. I believe JabRef has a command line interface, but I don't know if it can be used to de-duplicate. — moewe, Jun 28 '18 at 20:54
@moewe Yes. They are identical, since I deliberately save duplicates to different folders using the Mendeley web importer plugin. I looked at the JabRef CLI, but its documentation page is sparse. How would one call it from the source document? Is my case valid enough to somehow entice the elder geeks here? Using another full-blown citation manager doesn't sound elegant. Shouldn't we look for a do-one-thing-well solution? We have a bib file in plain text. We have one task: dedup it. I thought sed, awk, or their elder brother perl might be up to the task with some seriously heavy regex-fu. — Dr Krishnakumar Gopalakrishnan, Jun 28 '18 at 21:02
@moewe Thank you. I had looked at bibtool. Its manual does not mention the word duplicate anywhere. Could you point me to the specific section that could potentially help? How could I call bibtool from within the preamble every time I compile my document? I am looking for something like \makeglossaries{}. — Dr Krishnakumar Gopalakrishnan, Jun 28 '18 at 21:09
§A.13.1. Finding Double Entries, see also https://tex.stackexchange.com/q/20027/35864 — moewe, Jun 28 '18 at 21:10
You can call arbitrary programmes from TeX if shell escape/write 18 is enabled: https://tex.stackexchange.com/q/20444/35864, https://tex.stackexchange.com/q/5433/35864 — moewe, Jun 28 '18 at 21:11
Why do you care about duplicates? Imho biber will simply ignore a duplicate entry, and perhaps issue a warning but not more. — Ulrike Fischer, Jun 28 '18 at 22:07
@UlrikeFischer But is the OP using Biber? But I completely agree, in this situation, if using Biber, I think just ignoring duplicates makes most sense. — cfr, Jun 28 '18 at 22:22
I can use biber, but most journals prefer bibtex+natbib. I also want to share the library with my supervisors and co-authors and would like to give them a small, clean file without 300 duplicate entries. — Dr Krishnakumar Gopalakrishnan, Jun 28 '18 at 22:25
@cfr I thought I remember Krishna would use it, but bibtex skips repeated entries too. — Ulrike Fischer, Jun 28 '18 at 22:42
@UlrikeFischer BibTeX seems to belong to another world. I no longer remember what it does or doesn't do. :( — cfr, Jun 28 '18 at 23:08
@cfr I didn't remember either. I had to run a small test ;-). — Ulrike Fischer, Jun 29 '18 at 07:24

score 4 · Accepted Answer · answered Jun 28 '18 at 22:43

You can use

biber --tool duplicates.bib

This will create a duplicates_bibertool.bib without duplicates.
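To run this from the document itself, as the question asks, the command can be issued through shell escape. The following is only a minimal sketch under several assumptions: the export is named duplicates.bib and sits next to main.tex, the cleaned file name clean.bib and the citation key somekey are placeholders, your Biber version accepts --output-file in tool mode, and the document is compiled with unrestricted shell escape (e.g. pdflatex --shell-escape main). The shellesc package is used so that the same command works across engines.

```latex
% main.tex: compile with unrestricted shell escape, e.g.
%   pdflatex --shell-escape main
% Assumption: the Mendeley export is duplicates.bib, in the same directory.
\documentclass{article}
\usepackage{natbib}
\usepackage{shellesc} % provides \ShellEscape for pdfTeX, XeTeX and LuaTeX

% Rewrite the exported file with Biber's tool mode on every compilation.
% clean.bib is a placeholder name for the de-duplicated output.
% If shell escape is disabled, the command is not executed.
\ShellEscape{biber --tool --output-file=clean.bib duplicates.bib}

\begin{document}
Some text citing \citet{somekey}. % somekey is a placeholder citation key

\bibliographystyle{plainnat}
\bibliography{clean} % point BibTeX at the rewritten file, not the raw export
\end{document}
```

Because the command runs while the preamble is read, clean.bib already exists by the time BibTeX is run after the LaTeX pass, so the usual pdflatex, bibtex, pdflatex, pdflatex cycle is unchanged. The rewritten file can also be handed to co-authors in place of the raw export.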

Ulrike Fischer


Great. Thank you. That will work. The solution for bibtex+natbib will be bibtool I guess. I shall call these accordingly with shell-escape. – Dr Krishnakumar Gopalakrishnan Jun 28 '18 at 22:48
@Krishna You can use Biber for this. All it is doing is rewriting the .bib. – cfr Jun 28 '18 at 23:08
As cfr wrote, you can always use biber --tool. It is a generic tool for rewriting bib files. Check the documentation. – Ulrike Fischer Jun 29 '18 at 07:25
