I want to find the number of lines that contain both of the words/patterns "gene" and "+". Is it possible to do this with grep?
1 Answer
Yes, you can do this with grep:
grep -c 'gene.*+' file
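That will look for lines where gene appears first and then, later on the same line, there is a +. In a basic regular expression the + is a literal character, so it needs no escaping here. Note that this pattern has no word-boundary markers, so gene is also matched as a substring of longer words. The -c flag tells grep to print the number of matching lines. If you also need to find cases where the + comes before gene, you can use an extended regular expression (-E), where the + has to be escaped: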
grep -Ec '(gene.*\+)|(\+.*gene)' file
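To make the difference between the two commands concrete, here is a small sketch run against made-up data. The sample lines and the temporary file are assumptions for illustration only; they are not from the question.

```shell
# Hypothetical sample data, written to a temporary file for illustration.
sample=$(mktemp)
printf '%s\n' \
  'gene on the + strand' \
  '+ strand gene' \
  'gene on the - strand' \
  'no match here' > "$sample"

# gene first, then +: only the first line qualifies.
ordered=$(grep -c 'gene.*+' "$sample")

# Either order: the first and second lines qualify.
either=$(grep -Ec '(gene.*\+)|(\+.*gene)' "$sample")

echo "ordered=$ordered either=$either"
rm -f "$sample"
```

On this sample the first count should be 1 and the second 2, since only the alternation catches the line where + precedes gene.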
Both of these commands, however, will also match things like Eugene+Mary came for dinner, which is probably not what you want. Given the words you are looking for, I am guessing that you are looking at gff/gtf files. If so, you might want to do something more sophisticated: only look for gene in the third field of each line and + in the seventh, and skip lines that start with a # (the gff headers). If this is indeed what you need, you can do:
awk -F"\t" '!/^#/ && $3=="gene" && $7=="+"{c++}END{print c+0}' file
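As a rough illustration of why the field-based test is stricter, the sketch below builds a tiny tab-separated file shaped like a GFF and counts it both ways. The records, IDs and the temporary file are invented for the example and are not real annotation data.

```shell
# Hypothetical GFF-like records (tab-separated), for illustration only.
gff=$(mktemp)
printf '##gff-version 3\n' > "$gff"
printf 'chr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=g1\n' >> "$gff"
printf 'chr1\tsrc\tgene\t200\t300\t.\t-\t.\tID=g2\n' >> "$gff"
printf 'chr1\tsrc\tmRNA\t1\t100\t.\t+\t.\tID=m1;Parent=gene1\n' >> "$gff"
printf 'chr1\tsrc\tgene\t400\t500\t.\t+\t.\tID=g3\n' >> "$gff"

# Field-based count: feature type (field 3) is gene, strand (field 7) is +.
by_field=$(awk -F'\t' '!/^#/ && $3=="gene" && $7=="+"{c++} END{print c+0}' "$gff")

# Pattern-based count: also picks up the mRNA line via Parent=gene1.
by_pattern=$(grep -Ec '(gene.*\+)|(\+.*gene)' "$gff")

echo "by_field=$by_field by_pattern=$by_pattern"
rm -f "$gff"
```

Here the awk count should be 2 while the grep count is 3, because the mRNA record mentions gene1 in its attributes column.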
For the Eugene case and grep, we can use word boundary markers:
grep -Ec '(\<gene\>.*\+)|(\+.*\<gene\>)' file
– glenn jackman Oct 15 '20 at 17:55
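A quick way to see the effect of the word-boundary markers is to run both patterns on a few made-up lines. This is only a sketch: the sample text is invented, and it assumes a grep that supports the \< and \> markers (GNU grep does; they are not part of POSIX).

```shell
# Hypothetical sample lines, for illustration only.
sample=$(mktemp)
printf '%s\n' \
  'Eugene+Mary came for dinner' \
  'the gene is on the + strand' \
  '+ strand, genes only' > "$sample"

# Without boundaries, gene matches inside Eugene and genes too.
loose=$(grep -Ec '(gene.*\+)|(\+.*gene)' "$sample")

# With \< and \>, gene must stand alone as a word (assumes GNU grep).
strict=$(grep -Ec '(\<gene\>.*\+)|(\+.*\<gene\>)' "$sample")

echo "loose=$loose strict=$strict"
rm -f "$sample"
```

On this sample the loose pattern should count all 3 lines and the strict one only 1.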
Does gene always occur before + on the lines that you are interested in? Would the basic regular expression gene.*+ be enough? Do you need to filter out lines that contain words like genes or thegene (i.e. where gene is just a substring and not its own word)? Can you show some example data? – Kusalananda Oct 15 '20 at 14:41
grep gene | grep +. That is a kind of and operator. You also need to consider all the questions Kusalananda is asking. – nobody Oct 15 '20 at 15:20
wc should be used at the end to count the lines: grep gene | grep + | wc -l. – nobody Oct 15 '20 at 15:28
grep -c + to count matching lines – steeldriver Oct 15 '20 at 15:36
wc, you can use grep -c. – terdon Oct 15 '20 at 15:48
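The chained approach from these comments can be sketched as follows. The sample lines are made up, and -F is added here as a precaution so that + is always treated as a fixed string; the comments themselves use a bare grep +.

```shell
# Hypothetical sample lines, for illustration only.
sample=$(mktemp)
printf '%s\n' \
  'gene on the + strand' \
  '+ strand gene' \
  'gene on the - strand' > "$sample"

# The first grep keeps lines containing gene; the second counts those
# that also contain a literal +, in either order.
piped=$(grep 'gene' "$sample" | grep -cF '+')

echo "$piped"
rm -f "$sample"
```

Because each grep only tests for the presence of its own pattern, the order of gene and + on the line does not matter, and the final -c makes a separate wc -l unnecessary.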