Is it possible to find any lines in a file that exceed 79 characters?
2 Answers
In order of decreasing speed (on a GNU system in a UTF-8 locale and on ASCII input) according to my tests:
grep '.\{80\}' file
perl -nle 'print if length$_>79' file
awk 'length>79' file
sed -n '/.\{80\}/p' file
With the exception of the perl¹ solution (and of awk/grep/sed implementations, such as mawk or busybox, that don't support multi-byte characters), these count the length in characters (according to the LC_CTYPE setting of the locale) rather than in bytes.
If the input contains bytes that don't form part of valid characters (which sometimes happens when the locale's character set is UTF-8 and the input is in a different encoding), then depending on the solution and the tool implementation, each of those bytes will count as 1 character, count as 0, or fail to match.
For instance, a line that consists of 30 a's, a 0x80 byte, 30 b's, a 0x81 byte and 30 UTF-8 é's (each encoded as 0xc3 0xa9) would, in a UTF-8 locale, not match .\{80\} with GNU grep/sed (as the standalone 0x80 byte doesn't match .), would have a length of 30+1+30+1+2*30=122 with perl or mawk, and a length of 3*30=90 with gawk.
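The arithmetic above can be checked with a short Python 3 sketch. This is only an illustration: Python's decoder and its `errors=` handlers are not what grep, awk or sed use internally, and the variable names are made up for the example. It builds the same sample line and counts it three ways.

```python
# Illustration only: these are Python's decoding rules, not those of grep/awk/sed.
# The sample line from the text: 30 "a", byte 0x80, 30 "b", byte 0x81, 30 "é".
line = b"a" * 30 + b"\x80" + b"b" * 30 + b"\x81" + "é".encode("utf-8") * 30

# Every byte counts, valid or not (the perl/mawk figure).
byte_length = len(line)

# Bytes that are not part of a valid UTF-8 character are dropped (count as 0).
ignored_length = len(line.decode("utf-8", errors="ignore"))

# Each invalid byte is replaced by U+FFFD (counts as 1 character).
replaced_length = len(line.decode("utf-8", errors="replace"))

print(byte_length, ignored_length, replaced_length)  # prints: 122 90 92
```

The 90 matches the gawk figure and the 122 matches the perl/mawk figure; the 92 shows the "counts as 1 character" outcome for the two stray bytes.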
If you want to count in terms of bytes, fix the locale to C with LC_ALL=C grep/awk/sed....
That would have all 4 solutions consider that the line above contains 122 characters. Except with perl and GNU tools, you'd still have potential issues with lines that contain NUL characters (0x0 byte).
¹ the perl behaviour can be affected by the PERL_UNICODE environment variable though
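For comparison, byte counting that does not depend on the locale can be sketched in Python 3 by reading the file in binary mode. The function name `long_lines_bytes` and the `limit` parameter are hypothetical, chosen for this example. In binary mode nothing is decoded, so invalid sequences and NUL bytes are just ordinary bytes.

```python
def long_lines_bytes(path, limit=79):
    """Yield each line longer than `limit` bytes, with the newline removed."""
    # Binary mode: no decoding happens, so the locale plays no role
    # and NUL bytes are treated like any other byte.
    with open(path, "rb") as f:
        for raw in f:
            body = raw.rstrip(b"\n")  # do not count the line terminator
            if len(body) > limit:
                yield body
```

Calling `list(long_lines_bytes("input.txt"))` would return the matching lines as `bytes` objects. A line of 40 é's would be reported here, since it is 80 bytes long even though it has only 40 characters.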
Shell approach:
while IFS= read -r line || [ -n "$line" ]; do
    [ "${#line}" -gt 79 ] && printf "%s\n" "$line"
done < input.txt
Python approach:
python -c 'import sys;f=open(sys.argv[1]);print "\n".join([ l.strip() for l in f if len(l) >79 ]);f.close()' input.txt
Or as a short script for readability:
#!/usr/bin/env python
import sys
with open(sys.argv[1]) as f:
    for line in f:
        if len(line) > 79:
            print line.strip()
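To exclude the newline character \n from the count, change if len(line) > 79 to if len(line.strip()) > 79. Note that strip() also removes leading and trailing spaces and tabs, so those stop counting as well; line.rstrip('\n') removes only the newline.
Side note: this is Python 2.7 syntax. Use print() for Python 3.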
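A possible Python 3 version of the script above, as a sketch rather than a drop-in replacement. The function name `long_lines` and the `limit` parameter are hypothetical, and reading the file as UTF-8 with `errors="replace"` is an assumption made so that undecodable bytes do not raise an exception (each one is counted as one character). The newline is removed with `rstrip("\n")` so that leading and trailing spaces still count.

```python
def long_lines(path, limit=79):
    """Yield each line longer than `limit` characters, newline excluded."""
    # Assumption: the input is UTF-8; undecodable bytes become U+FFFD
    # and therefore count as one character each.
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            text = line.rstrip("\n")  # strip only the newline, keep other whitespace
            if len(text) > limit:
                yield text
```

To use it as a script, add `import sys` at the top and loop over `long_lines(sys.argv[1])`, printing each result with `print()`.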
awk can come closer if you drop ($0), which is implicit anyway ;). – Thor Jul 12 '12 at 18:36
grep had a surprise for me: it beat awk. So I had to edit it. – manatwork Jul 12 '12 at 18:38
sed seems to be doing something wrong. – Thor Jul 13 '12 at 11:23
^, it's slightly faster: e.g. grep '^.\{80\}' file. – cas Jul 29 '12 at 09:32
grep '^.\{1000\}' file returns grep: invalid repetition count(s), while awk 'length>1000' file succeeds.) – mdahlman Dec 18 '14 at 21:00
grep -n '.\{80\}' file | cut -f1 -d: – Anthony Hatzopoulos Sep 23 '15 at 16:07