Find any lines exceeding a certain length

Question

Is it possible to find any lines in a file that exceed 79 characters?

score 127 · Accepted Answer · edited Sep 17 '17 at 06:34

In order of decreasing speed (on a GNU system in a UTF-8 locale and on ASCII input) according to my tests:

grep '.\{80\}' file

perl -nle 'print if length$_>79' file

awk 'length>79' file

sed -n '/.\{80\}/p' file

Except for the perl¹ solution (and for awk/grep/sed implementations, such as mawk or busybox, that don't support multi-byte characters), these count the length in characters (according to the locale's LC_CTYPE setting) rather than in bytes.

If the input contains bytes that don't form part of valid characters (which sometimes happens when the locale's character set is UTF-8 and the input is in a different encoding), then depending on the solution and the tool implementation, each of those bytes will count as 1 character, count as 0 characters, or not match at all.

For instance, in a UTF-8 locale, a line that consists of 30 a's, a 0x80 byte, 30 b's, a 0x81 byte and 30 UTF-8 é's (each encoded as 0xc3 0xa9) would not match .\{80\} with GNU grep/sed (as the standalone 0x80 byte doesn't match .), would have a length of 30+1+30+1+2*30=122 with perl or mawk, and a length of 3*30=90 with gawk.
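
To make the arithmetic above concrete, here is a minimal Python 3 sketch (not part of the original answer) that builds the same example line and counts it three ways. The two errors= modes are only an analogy for how different tools treat invalid bytes, not a description of how any of them is implemented, and the variable names are made up for illustration.

```python
# The example line from the answer: 30 "a"s, a stray 0x80 byte, 30 "b"s,
# a stray 0x81 byte, and 30 "é"s (each encoded as 0xc3 0xa9 in UTF-8).
line = b"a" * 30 + b"\x80" + b"b" * 30 + b"\x81" + "é".encode("utf-8") * 30

# Every byte counts once, as it would under LC_ALL=C.
byte_count = len(line)

# Drop the two invalid bytes, so only the valid characters are counted.
chars_ignoring_invalid = len(line.decode("utf-8", errors="ignore"))

# Turn each invalid byte into one U+FFFD replacement character instead.
chars_replacing_invalid = len(line.decode("utf-8", errors="replace"))

print(byte_count, chars_ignoring_invalid, chars_replacing_invalid)
# prints: 122 90 92
```

The 122 and 90 figures line up with the perl/mawk and gawk results quoted above; 92 is what you get if each invalid byte counts as one character.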

If you want to count in bytes instead, set the locale to C, as in LC_ALL=C grep/awk/sed....

That would make all 4 solutions treat the line above as containing 122 characters. Except with perl and GNU tools, you could still run into problems with lines that contain NUL characters (0x0 bytes).
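
For comparison, a rough Python 3 sketch of byte-based counting, similar in spirit to running the tools under LC_ALL=C. It is an illustration rather than a replacement for the commands above, and long_lines_bytes is a hypothetical helper name. Because the data is read as bytes, a NUL is just another byte.

```python
import io


def long_lines_bytes(stream, limit=79):
    """Yield (line_number, line) for lines longer than `limit` bytes.

    `stream` is any iterable of bytes lines, e.g. a file opened with "rb".
    """
    for number, raw in enumerate(stream, start=1):
        body = raw.rstrip(b"\n")  # the line terminator is not counted
        if len(body) > limit:
            yield number, body


# Demo on in-memory data; a real file would be open(path, "rb").
sample = io.BytesIO(b"short\n" + b"x" * 40 + b"\x00" + b"y" * 40 + b"\n")
for number, body in long_lines_bytes(sample):
    print(number, len(body))
# prints: 2 81
```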

¹ The perl behaviour can be affected by the PERL_UNICODE environment variable, though.

edited Sep 17 '17 at 06:34 by Stéphane Chazelas · answered Jul 12 '12 at 18:27 by manatwork

What do you mean by "efficient"? – rowantran Jul 12 '12 at 18:30
I think manatwork means typing efficiency. awk can come closer if you drop ($0), which is implicit anyway ;). – Thor Jul 12 '12 at 18:36
Oh, thanks for the clarification. I don't particularly care about speed though. – rowantran Jul 12 '12 at 18:37
Sorry, I was thinking of performance. But GNU grep had a surprise for me: it beat awk. So I had to edit it. – manatwork Jul 12 '12 at 18:38
@manatwork how often did you run each test, and how did you measure performance? – Thor Jul 12 '12 at 21:48
@Thor, I put an output with some details on pastebin: http://pastebin.com/CLtP2iSH – manatwork Jul 13 '12 at 06:46
@manatwork, interesting, I ran similar tests with mostly similar results, although sed seems to be doing something wrong. – Thor Jul 13 '12 at 11:23
BTW, if you anchor the regexp to the beginning of the line with ^, it's slightly faster: e.g. grep '^.\{80\}' file. – cas Jul 29 '12 at 09:32
The perl solution does not account for variable size encoding such as UTF-8, unlike all the other solutions. – BatchyX Jan 30 '13 at 16:19
mawk runs the length function quite a bit faster than the gnu awk, might try that. – Marcin Jan 30 '13 at 17:09
Sufficiently large values of N fail with grep but succeed with awk. (e.g. grep '^.\{1000\}' file returns grep: invalid repetition count(s), while awk 'length>1000' file succeeds.) – mdahlman Dec 18 '14 at 21:00
return only the line numbers: grep -n '.\{80\}' file | cut -f1 -d: – Anthony Hatzopoulos Sep 23 '15 at 16:07
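
The last two comments can be combined in a small Python 3 sketch (an illustration, not something posted in the thread): a plain length comparison, like the awk one, involves no repetition count, and getting only the line numbers is a matter of enumerating the lines. long_line_numbers is a hypothetical name.

```python
def long_line_numbers(lines, limit=79):
    """Return the 1-based numbers of lines longer than `limit` characters."""
    return [
        number
        for number, line in enumerate(lines, start=1)
        if len(line.rstrip("\n")) > limit
    ]


# A limit of 1000 is handled like any other number.
sample = ["x" * 1000 + "\n", "x" * 1001 + "\n", "short\n"]
print(long_line_numbers(sample, limit=1000))
# prints: [2]
```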

score 1 · Answer 2 · answered Dec 03 '17 at 03:54

Shell approach:

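# The || test also handles a last line that has no trailing newline
while IFS= read -r line || [ -n "$line" ]; do
    # ${#line} expands to the length of $line
    [ "${#line}" -gt 79 ] && printf "%s\n" "$line"
done < input.txt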

Python approach:

python -c 'import sys;f=open(sys.argv[1]);print "\n".join([ l.strip() for l in f if len(l) >79 ]);f.close()' input.txt

Or as a short script for readability:

#!/usr/bin/env python
import sys

with open(sys.argv[1]) as f:
    for line in f:
        if len(line) > 79:
            print line.strip()

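As written, len(line) includes the trailing newline character \n, so a line of exactly 79 characters is reported too. To exclude the newline from the count, change if len(line) > 79 to if len(line.rstrip("\n")) > 79. Using len(line.strip()) also works, but it additionally leaves leading and trailing whitespace out of the count.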

Side note: this is Python 2.7 syntax. Use print() for Python 3.
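
A possible Python 3 version of the script, as a sketch rather than the answerer's own code. It assumes you want to leave the newline out of the count and keep any leading or trailing whitespace in it, so it uses rstrip("\n") instead of strip(). long_lines is a hypothetical helper name.

```python
def long_lines(lines, limit=79):
    """Yield lines longer than `limit` characters, without the newline."""
    for line in lines:
        text = line.rstrip("\n")  # leading/trailing spaces stay in the count
        if len(text) > limit:
            yield text


# In-memory demo; for a file, pass an open file object instead,
# e.g. `with open("input.txt") as f: ...`.
sample = ["a" * 79 + "\n", "  " + "b" * 78 + "\n"]
for text in long_lines(sample):
    print(len(text))
# prints: 80
```

The second sample line is 80 characters only because of its two leading spaces, which is the case strip() would miss.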
