I have a .tsv file (tab-separated values) with four values per line. So each line should have exactly three tabs, with some text around each tab, like this:
value value2 value3 value4
But it looks like some lines are broken (they contain more than three tabs). I need to find these lines.
I came up with the following grep pattern:
grep -v "^[^\t]+\t[^\t]+\t[^\t]+\t[^\t]+$"
My thinking:
- the first ^ matches the beginning of the line
- [^\t]+ matches one or more non-tab characters
- \t matches a single tab character
- $ matches the end of the line
Then I just put these together in the right order, the right number of times. That should match the correct lines, so I inverted it with the -v option to get the wrong lines.
But with the -v option it matches every line in the file, and also some random text I tried that doesn't have any tabs in it.
What is my mistake please?
EDIT: I am using Debian and bash.
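For reference, here is a minimal sketch of how the pattern described above could be made to behave as intended, assuming GNU grep and bash. Two things in it are assumptions that are not part of the original command: -E is added so that + means "one or more" (in basic regex mode a bare + is a literal plus sign), and the tab is supplied through bash's $'\t' quoting, because grep itself does not interpret \t as a tab (inside a bracket expression it is just a backslash and the letter t). The temporary sample file and its contents are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sample data: two well-formed lines and two broken ones.
tmp=$(mktemp)
printf 'a\tb\tc\td\n'      >  "$tmp"   # good: three tabs, four fields
printf 'a\tb\tc\td\te\n'   >> "$tmp"   # bad: four tabs
printf 'no tabs at all\n'  >> "$tmp"   # bad: no tabs
printf 'w\tx\ty\tz\n'      >> "$tmp"   # good

# Let bash produce a real tab character; grep never sees "\t".
tab=$'\t'

# Same shape as the pattern in the question, with real tabs substituted.
pattern="^[^${tab}]+${tab}[^${tab}]+${tab}[^${tab}]+${tab}[^${tab}]+\$"

# -E: extended regex so that + is a quantifier; -v: print non-matching lines.
bad_lines=$(grep -Ev "$pattern" "$tmp")
printf '%s\n' "$bad_lines"

rm -f "$tmp"
```

With this sample, only the four-tab line and the tab-free line should be printed. Note that [^<tab>]+ also rejects lines with an empty field; using * instead of + would accept them.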
grep for lines with at least four tabs: grep '\t.*\t.*\t.*\t'? – Philippos Aug 16 '22 at 11:53
.* ends immediately before a tab, so it need never backtrack into the middle of one, only to a different tab-position, which is either unnecessary (in the case that the pattern matches) or proven impossible by a fixed-string search (in the case that it doesn't). – hobbs Aug 17 '22 at 03:40
.*? is also a Perl-ism, and won't work in standard grep. The awk solution terdon posted half a day earlier runs in one tenth of the time. – ilkkachu Aug 17 '22 at 07:48
grep -P made it even faster than the awk, with grep -E at around 3 s, awk at 0.3 s and grep -P at less than 0.1 s, even without changing the regexes. – ilkkachu Aug 17 '22 at 08:12
$'^([^\t]*\t){24}' and 20 to 30 tab input lines I got about a 10x speed difference between Busybox and GNU grep's -E engine, and another ~10x between Busybox and GNU grep -P. With -E being slowest, Busybox in the middle and grep -P being fastest. And awk slightly slower than grep -P. – ilkkachu Aug 17 '22 at 11:46
grep -Pv "^(.+\t){29}.+$" test.csv fails with grep: exceeded PCRE's backtracking limit. grep -Pv "^([^\t]+\t){29}[^\t]+$" test.csv and grep -Pv "^(.+?\t){29}.+$" test.csv both return instantly. With more lines but "only" 19 tabs, the .+ version returns, but is ~10000 times slower than with .+? or [^\t]+. – Eric Duminil Aug 20 '22 at 11:32
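The comments mention two alternatives: grepping for lines with at least four tabs, and an awk solution that is not reproduced here. The sketch below shows one way each idea might look in bash. It is not the referenced awk answer, only an illustration of a field-count check. The sample file is hypothetical, and $'...' quoting is used so that grep receives real tab characters.

```shell
#!/usr/bin/env bash
# Hypothetical sample data.
tmp=$(mktemp)
printf 'a\tb\tc\td\n'      >  "$tmp"   # three tabs: fine
printf 'a\tb\tc\td\te\n'   >> "$tmp"   # four tabs: too many
printf 'a\tb\n'            >> "$tmp"   # one tab: too few

# Lines containing at least four tabs (bash expands \t inside $'...').
too_many=$(grep $'\t.*\t.*\t.*\t' "$tmp")

# Lines whose tab-separated field count is anything other than four.
wrong_count=$(awk -F'\t' 'NF != 4' "$tmp")

printf 'at least four tabs:\n%s\n' "$too_many"
printf 'field count other than four:\n%s\n' "$wrong_count"

rm -f "$tmp"
```

The grep variant reports only lines with too many tabs, while the awk variant also catches lines with too few. Unlike the pattern in the question, neither checks that every field is non-empty.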