How can I extract a URL from a directory full of garbage using a regexp?

Question

I have thousands of downloaded text files that all follow the same pattern. The pattern seemed to work in a parser (and in Notepad++), but when I try it on the console, with the goal of ultimately piping the result to wget for downloading, I get grep: Invalid range end

grep -E "\(https://foo.domain.com/([A-z])\w+.pdf\)" * > wget

I am unfamiliar with proper wildcarding; I tried .* or similar, and escaping the forward slashes, all to no avail. I am sure it is something stupid.

Essentially everything in the URL is fixed except for a random string of text between .com/ and .pdf

Can you provide an example of the source text? Are the escaped parentheses required? – Gedweb, Mar 31 '19 at 07:33
@sparse Can you post that as an answer? I had to do some additional steps in vi (it added prefixed items and duplicate lines), but that was trivial enough to fix and allowed me to use wget -i from a file rather than piping. Thank you! – Jonathan, Mar 31 '19 at 22:36

anx · Answer 1 · 2019-04-04T21:36:27.847

0

By default, grep matches case-sensitively, so a range must end with a character that sorts after the character it starts with.

This is invalid: [A-z] (because lower case z comes before upper case A in your locale's collation order)
This is valid: [A-Z] (because upper case Z comes after upper case A)
This is valid: [a-z] (because lower case z comes after lower case a)

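I suspect you meant to write the third one (meaning all your matched URLs start with lower case).

The pattern may have worked in a different environment because that was configured to match case-insensitively or, more likely, because it used a different collation order (try LC_COLLATE=C grep '[A-z]'). Note that in the C locale [A-z] is accepted but also matches the six ASCII characters between Z and a ([, \, ], ^, _ and `), so [A-Za-z] is the safer way to cover both cases.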
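
To see the range behave the same way regardless of locale, a small sketch like the following may help. It assumes GNU grep (for -o). The scratch directory, the file name sample.txt and its contents are made up for illustration. The locale is pinned with LC_ALL, which takes precedence over LC_COLLATE, and the range is spelled out as [A-Za-z0-9_] so it no longer depends on where z sorts relative to A.

```shell
# Hypothetical sample data in a scratch directory: one URL wrapped in
# parentheses and one line of noise.
scratch=$(mktemp -d)
printf '%s\n' 'see (https://foo.domain.com/AzRRjkL.pdf) here' 'no link on this line' > "$scratch/sample.txt"

# LC_ALL=C makes ranges follow byte order; both cases are listed explicitly
# instead of relying on A-z. -o prints only the matched part of the line.
matches=$(LC_ALL=C grep -oE 'https://foo\.domain\.com/[A-Za-z0-9_]+\.pdf' "$scratch/sample.txt")
echo "$matches"
```

The parentheses around the URL are left out of the pattern on purpose, so only the URL itself is printed.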

edited Apr 04 '19 at 21:36

answered Mar 31 '19 at 11:53


The file names are all random cased, so it may be AzRRjkL.pdf for example. The length of the file name isn't fixed. So I just need to redirect what grep finds to wget for download. – Jonathan Mar 31 '19 at 21:23

score 0 · Answer 2 · answered Apr 01 '19 at 07:05


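grep -ohP 'https://foo\.domain\.com/[A-Za-z]\w+\.pdf' * | sort -u | wget -i -

Here -o prints only the matched URL, -h suppresses the file name prefix when several files are searched, and sort -u drops duplicate lines before wget -i - reads the list from standard input.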
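
The comments above mention prefixed items and duplicate lines in the output, and using wget -i from a file rather than a pipe. A rough sketch of that variant, assuming GNU grep, could look like this. The scratch directory, the file names a.txt and b.txt, and the second URL are invented for the demonstration.

```shell
# Hypothetical sample files in a scratch directory; names and contents are made up.
workdir=$(mktemp -d)
printf '%s\n' 'junk (https://foo.domain.com/AzRRjkL.pdf) junk' > "$workdir/a.txt"
printf '%s\n' '(https://foo.domain.com/AzRRjkL.pdf)' '(https://foo.domain.com/qwErty.pdf)' > "$workdir/b.txt"

# -o: print only the match, -h: no file name prefix, -E: extended regex.
# sort -u removes URLs that occur more than once across the files.
grep -ohE 'https://foo\.domain\.com/[A-Za-z0-9_]+\.pdf' "$workdir"/*.txt | sort -u > "$workdir/urls.txt"

# Inspect the list before downloading anything.
cat "$workdir/urls.txt"

# Once the list looks right, it can be handed to wget:
# wget -i "$workdir/urls.txt"
```

Writing the list to a file first makes it easy to review or edit by hand before any download starts.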


sparse
