I've got a couple of text files (a.txt and b.txt) containing a bunch of URLs, each on a separate line. Think of these files as blacklists. I want to sanitize my c.txt file, scrubbing it of any of the strings in a.txt and b.txt. My approach is to rename c.txt to c_old.txt, and then build a new c.txt by grepping out the strings in a.txt and b.txt.
type c_old.txt | grep -f a.txt -v | grep -f b.txt -v > c.txt
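For reference, the batch file presumably boils down to something like this two-step sketch (the real script isn't shown here, so the exact layout is an assumption):

rem Rotate the current list aside, then rebuild c.txt without blacklisted URLs.
ren c.txt c_old.txt
type c_old.txt | grep -f a.txt -v | grep -f b.txt -v > c.txt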
For a long while, it seemed like my system was working just fine. However, lately, I've lost nearly everything that was in c.txt, and new additions are being removed despite not occurring in a.txt or b.txt. I have no idea why.
P.S. I'm on Windows 7, so grep has been installed separately. I'd appreciate solutions that don't require installing additional Linux tools.
Update: I've discovered one mistake in my batch file. I used ren c.txt c_old.txt without realising that ren refuses to overwrite the target file if it exists. Thus, the type c_old.txt | ... always used the same data. This explains why new additions to c.txt were being wiped out, but it does not explain why so many entries that were in c.txt have gone missing.
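One way around that mistake, keeping the same filenames (a sketch; move /Y replaces an existing target, which ren will not do):

rem Unlike ren, move /Y overwrites an existing c_old.txt.
move /Y c.txt c_old.txt
type c_old.txt | grep -f a.txt -v | grep -f b.txt -v > c.txt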
Comments:

> causes the text file to be overwritten each time; you would use >> to append to an existing file. – Ƭᴇcʜιᴇ007 Jul 31 '14 at 18:15

Had you tried something like echo sdfsd > c.txt, you'd have seen that > overwrites, and that it's thus not a grep problem. As techie said, use >>. – barlop Jul 31 '14 at 19:09

The > is intentional. Appending would never remove any entries from c.txt, thus failing to eliminate entries in c.txt that also exist in a.txt or b.txt. – gibson Jul 31 '14 at 19:31

(1) Try grep -f a.txt -v c_old.txt | grep -f b.txt -v > c.txt, because type … | looks like cat … |. (2) Try grep -f a.txt -f b.txt -v c_old.txt > c.txt. (Neither of these should make a difference in the result, but they are stylistically simpler.) … – Scott - Слава Україні Jul 31 '14 at 21:19

(3) Use -F (--fixed-strings) in case you're getting any weird results where a . in a.txt or b.txt matches some other character in c.txt. (4) Check a.txt and b.txt to verify that neither of them has acquired a very short line (not a full URL) that's matching lots of things. (5) Try to find a URL that's getting stripped out, and find the line in a.txt or b.txt that's causing that to happen. – Scott - Слава Україні Jul 31 '14 at 21:20
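Putting Scott's suggestions together, the filter step could collapse to a single fixed-string grep along these lines (a sketch using my filenames; nobody in the thread posted this exact command):

rem -F treats each blacklist line as a literal string rather than a regular expression.
grep -F -v -f a.txt -f b.txt c_old.txt > c.txt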