How can I find repeated words in a file using grep/egrep?

regex bash unix grep word-boundary

Mouse · Oct 28, 2015 · Viewed 11.6k times

I need to find repeated words in a file using egrep (or grep -E) in Unix (bash).

I tried:

egrep "(\<[a-zA-Z]+\>) \1" file.txt

and

egrep "(\b[a-zA-Z]+\b) \1" file.txt

but both of these report matches that aren't repeated words. For example, the string "word words" matches, despite the word-boundary conditions \> and \b.

Answer

\1 matches whatever string was matched by the first capture. That is not the same as matching the same pattern as was matched by the first capture. So the fact that the first capture matched on a word boundary is no longer relevant, even though the \b is inside the capture parentheses.
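To see this concretely, grep's -o option (print only the matched text) shows what the original pattern actually matches. This is a small sketch that assumes GNU grep, where \b and backreferences in -E patterns are available as extensions; the input is just the example string from the question.

```shell
# Print only the part of the line that the original pattern matched.
# Assumption: GNU grep (\b and \1 in -E patterns are extensions).
matched=$(printf 'word words\n' | grep -Eo '(\b[a-zA-Z]+\b) \1')
printf '%s\n' "$matched"
```

The match is "word word": the backreference reproduces the captured text, and nothing in the pattern stops the match from ending in the middle of "words".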

If you want the second instance to also be on a word boundary, you need to say so:

egrep "(\b[a-zA-Z]+) \1\b" file.txt

That is no different from:

egrep "\b([a-zA-Z]+) \1\b" file.txt

The space in the pattern already forces a word boundary on either side of it, so I left out the \bs that would be redundant there. If you want to be more explicit, you can put them in:

egrep "\<([a-zA-Z]+)\> \<\1\>" file.txt
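As a rough check, the original and the corrected patterns can be run side by side on a few sample lines. The sample text and the variable names below are made up for illustration, and the sketch again assumes GNU grep; grep -E is used in place of egrep, which recent GNU grep releases flag as obsolescent.

```shell
# Hypothetical sample input: only the second line has a true repeated word.
sample=$(printf '%s\n' 'word words' 'the the cat' 'this is fine')

# Original pattern: \1 repeats the text "word", so "word words" matches too.
loose=$(printf '%s\n' "$sample" | grep -E '(\b[a-zA-Z]+\b) \1')

# Corrected pattern: the trailing \b makes the repeat end at a word boundary.
strict=$(printf '%s\n' "$sample" | grep -E '\b([a-zA-Z]+) \1\b')

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"
```

The leading \b matters as well: without it, "this is fine" would match, because the capture could start at the "is" inside "this".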
