How should I write a regex to match a specific word?

Question

I've been trying to get a specific regex working but I can't get it to do what I need.

Basically, I want it to look for ROCKET. The regex should match ROCKET in upper or lower case, with or without punctuation, but not when it is part of another word. So, the regex would trigger on any of these:

rocket
RoCKEt
hi Rocket
This is a rocket.
ROCKET's engine

but NOT trigger on ROCKET when it is found in something like

Rocketeer
Sprocket

I've been trying to get it right using a regex generator online but I can't get it to match exactly.

This is one of those [infrequent] situations where the question might be better suited for Stack Overflow. Be sure to provide a language and/or platform, as each language has its own peculiarities. For example, Windows, .Net and the Regex class. (Usually, it's the other way around. Stack Overflow gets hundreds of off-topic questions from developers that are better suited for Super User). — jww, Apr 18 '15 at 22:14

Xaser · Accepted Answer · 2015-04-18T21:41:56.367

31

I suggest bookmarking the MSDN Regular Expression Quick Reference

You want to achieve a case-insensitive match for the word "rocket" surrounded by non-alphanumeric characters. A regex that would work is:

\W*((?i)rocket(?-i))\W*

What it will do is look for zero or more (*) non-alphanumeric (\W) characters, followed by a case-insensitive version of rocket ( (?i)rocket(?-i) ), followed again by zero or more (*) non-alphanumeric characters (\W). The extra parentheses around the rocket-matching term assign the match to a separate group. The word rocket will thus be in match group 1.

UPDATE 1: Matt said in the comment that this regex is to be used in Python. Python has a slightly different syntax. To achieve the same result in Python, use this regex and pass the re.IGNORECASE option to the compile or match function:

\W*(rocket)\W*

On Regex101 this can be simulated by entering "i" in the textbox next to the regex input.

UPDATE 2: Ismael has mentioned that the regex is not quite correct, as it might match "1rocket1". He posted a much better solution, namely:

(?:^|\W)rocket(?:$|\W)
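To make the difference between the two patterns concrete, here is a minimal Python sketch. It assumes the standard re module and the sample strings from the question; the helper name matches is made up for illustration and is not part of any library.

```python
import re

# Sample strings taken from the question.
SHOULD_MATCH = ["rocket", "RoCKEt", "hi Rocket", "This is a rocket.", "ROCKET's engine"]
SHOULD_NOT_MATCH = ["Rocketeer", "Sprocket"]

# First attempt: \W* may match zero characters, so neighbouring letters are not excluded.
loose = re.compile(r"\W*(rocket)\W*", re.IGNORECASE)
# UPDATE 2: require start/end of string or one non-word character on each side.
strict = re.compile(r"(?:^|\W)rocket(?:$|\W)", re.IGNORECASE)


def matches(pattern, text):
    """Hypothetical helper: True if the pattern is found anywhere in text."""
    return pattern.search(text) is not None


for text in SHOULD_MATCH + SHOULD_NOT_MATCH:
    print(f"{text!r}: loose={matches(loose, text)}, strict={matches(strict, text)}")
```

The loose pattern reports a match for Rocketeer and Sprocket, while the strict one rejects both and still accepts the five wanted examples.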

edited Apr 18 '15 at 21:41

answered Apr 18 '15 at 17:32

Xaser

856

1

Testing this out with regex testers online (https://regex101.com for example) shows it as invalid and not matching example strings that I enter.
This is intended to be used as part of a python script. Does that make any difference in how it should be written?
– Kefka Apr 18 '15 at 19:55
1

yes it does. you can see on regex101.com that you can choose a regex "flavour" on the top left, python is slightly different. I'll update my answer with the python equivalent. – Xaser Apr 18 '15 at 20:31
1

Thanks. I thought regexes were basically language independent. – Kefka Apr 18 '15 at 20:33
1

They ought to be, but minor implementation differences exist. – Xaser Apr 18 '15 at 20:36
1

@Xaser \W*(rocket)\W* is an invalid regex for all those 3 flavors. It should be /\W*(rocket)\W*/i. An ugly alternative: /[^_a-zA-Z]([rR][oO][cC][kK][eE][tT])[^_a-zA-Z]/ which works in every single regex engine (for apache's 'engine' using RewriteEngine you need to remove the slashes around it). – Ismael Miguel Apr 18 '15 at 21:25
2

And \W*(rocket)\W* matches lrocketl. It should be (?:^|\W)(rocket)(?:$|\W) (without the * and you have to check if it matches the start and/or end of the string). – Ismael Miguel Apr 18 '15 at 21:33
1

@IsmaelMiguel technically yes, but as the implementation differs, this is not universal. In python for example, your regex string won't work. You're right, the *'s are not needed. and the use of non-capturing groups is elegant as well. – Xaser Apr 18 '15 at 21:35
1

@Xaser - you really should test your regex's before providing them as an answer. There's no reason to provide a broken regex (and worse, have it accepted as the answer). – jww Apr 18 '15 at 22:13
1

The regex in UPDATE 2 is not better because it won't match the example "Rocket's". – laurent Dec 29 '18 at 16:53
@IsmaelMiguel - It would be valuable if you could describe (?:^|\W) and (?:$|\W) – Motivated Feb 03 '19 at 06:23
@Motivated (?: ) is a non-capturing group (aka: there's nothing in $1, $2, ...); | means that there's 2 options (or) and \W is just a fancier way to write [^a-zA-Z0-9_]. – Ismael Miguel Feb 03 '19 at 07:24
@IsmaelMiguel - Why not just write it as (^|\W)(rocket)($|\W)? Also i noticed that it doesn't match RoCKEt unless there is an alphanumeric character before it (https://www.regexpal.com/?fam=107435) – Motivated Feb 03 '19 at 16:47
1

@Motivated You forgot to turn on "multiline", in the "flags" (that's flag m). Also, writing (...) makes a capturing group, which has (slightly) reduced performance, as the result of the match will store "stuff" in $1 and $2, which you don't need. Also, you should use regex101.com, as it explains everything and much more, including all the options for a specific regex engine. – Ismael Miguel Feb 04 '19 at 13:34
@IsmaelMiguel - Thanks. I didn't realize that. Interestingly enough, the other selections are highlighted without the use of multiple lines. That makes sense, although I would have thought the performance implications would be minute. – Motivated Feb 04 '19 at 16:40
1

@Motivated You are right, the performance difference is minimal... if you plan to run it for a few lines. The O.P. didn't specify if it is for a single line of text or (using an absurd example) a 3GB file with over 100 million lines. This tiny tiny tiny tiny tiny performance difference adds up in the long run. – Ismael Miguel Feb 05 '19 at 09:45
@IsmaelMiguel I just tried your solution and it seems to also include the spaces before and after the word. Any way to make it match just the word itself? Thank you. I am using it with JavaScripts replace method if it helps. https://regexr.com/4o8be – Tekeste Kidanu Nov 05 '19 at 21:20
@TekesteKidanu If you want to simply know if the word exists, you don't need to do anything. If you want to extract the word, you need to just add a capturing group around the word, like how Motivated wrote above. The word will be on $1. If you want to replace the word, you can just make the other non-capturing groups into capturing groups and replace with $1<new word>$2, depending on if you need to capture the word as well, for processing with a function. For a complete answer, please write a question in StackOverflow, with exactly what you want to do. – Ismael Miguel Nov 06 '19 at 08:16
@IsmaelMiguel I have posted a question on SO. https://stackoverflow.com/questions/58781626/javascript-flavored-regex-to-replace-a-word-in-a-string-that-works-for-both-engl
Let me know if it is not okay to post links to SO in comments. :)
– Tekeste Kidanu Nov 09 '19 at 17:54
not perfect for close words: "... rocket rocket ..." – Yitzchak May 10 '21 at 07:37
Still it won't match rocket after or before an underscore, like in "_rocket", because underscore counts as \w, not \W. – Mehrdad Mirreza Feb 19 '24 at 21:16
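The replacement advice and the "rocket rocket" remark in the comments above can be tried out with a short Python sketch. It uses re.sub with \1 and \2 where the comments use $1 and $2; the sample text and the replacement word "missile" are invented for illustration.

```python
import re

text = "hi Rocket, rocket rocket"

# Capturing groups keep whatever surrounded the word; \1 and \2 put it back.
grouped = re.compile(r"(^|\W)rocket($|\W)", re.IGNORECASE)
grouped_result = grouped.sub(r"\1missile\2", text)
print(grouped_result)  # hi missile, missile rocket

# \b is zero-width, so nothing around the word is consumed.
bounded = re.compile(r"\brocket\b", re.IGNORECASE)
bounded_result = bounded.sub("missile", text)
print(bounded_result)  # hi missile, missile missile
```

The last word survives the first substitution because the space before it was already consumed by the previous match, which is the limitation pointed out for adjacent words.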

score 18 · Answer 2 · answered Apr 19 '15 at 06:17

I think the look-aheads are overkill in this case, and you would be better off using word boundaries with the ignorecase option:

\brocket\b

In other words, in python:

>>> import re
>>> x="rocket's"
>>> y="rocket1."
>>> c=re.compile(r"\brocket\b",re.I)  # with the ignorecase option
>>> c.findall(y)
[]
>>> c.findall(x)
['rocket']
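As a quick check against every example in the question, the same pattern can be run in a small script. This is only a sketch; the list names should_match and should_not_match are made up here.

```python
import re

word = re.compile(r"\brocket\b", re.IGNORECASE)

# Examples copied from the question.
should_match = ["rocket", "RoCKEt", "hi Rocket", "This is a rocket.", "ROCKET's engine"]
should_not_match = ["Rocketeer", "Sprocket"]

for text in should_match:
    # findall returns the matched word with its original casing.
    print(f"{text!r} -> {word.findall(text)}")

for text in should_not_match:
    # An empty list means the word was only found inside a longer word.
    print(f"{text!r} -> {word.findall(text)}")
```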

beroe

1,157

Technically, non-capturing groups are not lookarounds; however, the \b option yields the exact same result as Ismael's solution, but may be a little more elegant. – Xaser Apr 19 '15 at 10:28
perfect solution.. covers all edge cases – Yitzchak May 10 '21 at 07:41
This won't match rocket after or before an underscore, like in "_rocket", because underscore counts as \w, not \W. – Mehrdad Mirreza Feb 19 '24 at 21:15
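If the underscore case matters, one possible workaround (not proposed anywhere in this thread) is to replace \b with explicit lookarounds. The sketch below assumes that only ASCII letters and digits should block a match; that character class is an assumption and would need adjusting for other alphabets.

```python
import re

# \b treats "_" as a word character, so "_rocket" has no boundary before the "r".
with_boundary = re.compile(r"\brocket\b", re.IGNORECASE)

# Assumption: only ASCII letters and digits next to the word should prevent a match.
with_lookaround = re.compile(r"(?<![A-Za-z0-9])rocket(?![A-Za-z0-9])", re.IGNORECASE)

for text in ["_rocket", "rocket_fuel", "Sprocket", "rocket1."]:
    print(text, bool(with_boundary.search(text)), bool(with_lookaround.search(text)))
```

Both lookarounds are zero-width, so like \b they do not consume the neighbouring characters.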

score 1 · Answer 3 · answered Apr 19 '15 at 04:00

With grep and sed, you can use \<rocket\>. With grep, the -i option will make it case-insensitive (ignore case):

grep -i '\<rocket\>'

I don't know any way to make all sed regexes case-insensitive, but there's always the caveman way:

sed -n '/\<[Rr][Oo][Cc][Kk][Ee][Tt]\>/p'

Rex Schweiss · Answer 4 · 2019-11-23T11:04:45.683

0

Use the Search for whole words only option.

As for punctuation, you can't answer that until you know the regex flavour/flavor.

It's a very old thread, so posted for someone who might visit with a need, later. Ones who originated the thread might have moved to something else... No?

edited Nov 23 '19 at 11:04

answered Nov 23 '19 at 10:06

Rex Schweiss

1

What is whole words only option using grep or php? Sorry, but your answer doesn't give any added value compared with other answers. – Toto Nov 23 '19 at 11:22

score 0 · Answer 5 · answered Mar 04 '21 at 10:14

I think you can use something like this to specify the word that you want: /^(rocket|RoCKEt)$/g

Techit Kakaew

1

2

What about ROCKET? – Toto Mar 04 '21 at 11:22

score 0 · Answer 6 · answered May 25 '21 at 11:37

For online regex generators (if the text is constant):

/\brocket\b/gi

And if you need to use a variable in a regular expression, for example:

let inputStr = "I need to check the following text: rocket RoCKEt hi Rocket This is a rocket. ROCKET's engine Rocketeer Sprocket";
let replaceThis = "ROCKET";
let re = new RegExp(`\\b${replaceThis}\\b`, 'gi'); // template literal: the backticks are required
console.log(inputStr.replace(re, "****")); // "I need to check the following text: **** **** hi **** This is a ****. ****'s engine Rocketeer Sprocket"
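Since the question is about a Python script, a rough Python counterpart of the snippet above might look like this. The variable names mirror the JavaScript ones; the call to re.escape is an addition that is not in the original answer, included in case the variable contains regex metacharacters.

```python
import re

input_str = ("I need to check the following text: rocket RoCKEt hi Rocket "
             "This is a rocket. ROCKET's engine Rocketeer Sprocket")
replace_this = "ROCKET"

# re.escape guards against regex metacharacters inside the variable.
pattern = re.compile(rf"\b{re.escape(replace_this)}\b", re.IGNORECASE)

result = pattern.sub("****", input_str)
print(result)
```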

score 0 · Answer 7 · answered May 13 '22 at 04:16

I don't have enough reputation to comment, so I have to make a post to share why I think the user beroe's solution is the best way to do this problem. Take for example this string of text from the codewars challenge Most frequently used words in a text:

a a a b c c d d d d e e e e e

The goal of this challenge is to count the occurrences of words in the text. If we go with the most popular solution:

(?:^|\W)rocket(?:$|\W)

In our string of text, if we search for 'a' instead of 'rocket' using re.findall in Python, it returns only two matches (the first and last a): the \W that ends the first match consumes the space before the middle a, so the middle a can no longer match. Using \b for the word boundary, on the other hand, returns all 3 a's as matches:

\brocket\b

Again, credit to user beroe's solution above.
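The counting difference described in this answer can be reproduced with a few lines of Python, using the sample text from the answer and 'a' in place of 'rocket'. The variable names are illustrative only.

```python
import re

text = "a a a b c c d d d d e e e e e"

# The \W on each side is part of the match, so adjacent words compete for the same space.
consuming = re.findall(r"(?:^|\W)a(?:$|\W)", text)
# \b matches a position, not a character, so every standalone "a" is found.
zero_width = re.findall(r"\ba\b", text)

print(consuming)   # ['a ', ' a ']
print(zero_width)  # ['a', 'a', 'a']
```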

Rob R

1

1

You don't get around easily met standards by breaking the site rules. – music2myear May 13 '22 at 05:00
1

This does not provide an answer to the question. Once you have sufficient reputation you will be able to comment on any post; instead, provide answers that don't require clarification from the asker. - From Review – music2myear May 13 '22 at 05:00
(?:^|\W)rocket(?:$|\W) would be the answer to the question, which is provided in my comment :) – Rob R Sep 14 '22 at 11:45
