I've got a couple of text files (a.txt and b.txt) containing a bunch of URLs, each on a separate line. Think of these files as blacklists. I want to sanitize my c.txt file, scrubbing it of any of the strings in a.txt and b.txt. My approach is to rename c.txt to c_old.txt, and then build a new c.txt by grepping out the strings in a.txt and b.txt.
type c_old.txt | grep -f a.txt -v | grep -f b.txt -v > c.txt
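For reference, the batch file presumably boils down to something like this two-step sketch (the real script isn't shown here, so the exact layout is an assumption):

rem Rotate the current list aside, then rebuild c.txt without blacklisted URLs.
ren c.txt c_old.txt
type c_old.txt | grep -f a.txt -v | grep -f b.txt -v > c.txt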
For a long while, it seemed like my system was working just fine. However, lately, I've lost nearly everything that was in c.txt, and new additions are being removed despite not occurring in a.txt or b.txt. I have no idea why.
P.S. I'm on Windows 7, so grep has been installed separately. I'd appreciate solutions that don't require installing additional Linux tools.
Update: I've discovered one mistake in my batch file. I used ren c.txt c_old.txt without realising that ren refuses to overwrite the target file if it exists. Thus, the type c_old.txt | ... always used the same data. This explains why new additions to c.txt were being wiped out, but it does not explain why so many entries that were in c.txt have gone missing.
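One way around that mistake, keeping the same filenames (a sketch; move /Y replaces an existing target, which ren will not do):

rem Unlike ren, move /Y overwrites an existing c_old.txt.
move /Y c.txt c_old.txt
type c_old.txt | grep -f a.txt -v | grep -f b.txt -v > c.txt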
Comments:

> causes the text file to be overwritten each time; you would use >> to append to an existing file. – Ƭᴇcʜιᴇ007 Jul 31 '14 at 18:15

Had you tried something like echo sdfsd > c.txt, you'd have seen that > overwrites, and that it's thus not a grep problem. As techie said, use >>. – barlop Jul 31 '14 at 19:09

The > is intentional. Appending would never remove any entries from c.txt, thus failing to eliminate entries in c.txt that also exist in a.txt or b.txt. – gibson Jul 31 '14 at 19:31

(1) Try grep -f a.txt -v c_old.txt | grep -f b.txt -v > c.txt, because type … | looks like cat … |. (2) Try grep -f a.txt -f b.txt -v c_old.txt > c.txt. (Neither of these should make a difference in the result, but they are stylistically simpler.) … – Scott - Слава Україні Jul 31 '14 at 21:19

(3) Use -F (--fixed-strings) in case you're getting any weird results where a . in a.txt or b.txt matches some other character in c.txt. (4) Check a.txt and b.txt to verify that neither of them has acquired a very short line (not a full URL) that's matching lots of things. (5) Try to find a URL that's getting stripped out, and find the line in a.txt or b.txt that's causing that to happen. – Scott - Слава Україні Jul 31 '14 at 21:20
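Putting Scott's suggestions together, the filter step could collapse to a single fixed-string grep along these lines (a sketch using my filenames; nobody in the thread posted this exact command):

rem -F treats each blacklist line as a literal string rather than a regular expression.
grep -F -v -f a.txt -f b.txt c_old.txt > c.txt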