Unix & Linux
bash grep
Updated Tue, 13 Sep 2022 10:24:32 GMT

Grep repeating patterns with a loop


I have two files:

file1:

ABA
FFR
HHI
HAB

file2:

ABAABAABAABAABAABAABAABAABATRCFUJIKHRTHVFHJJHVHJJKKHGCC
FFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFHJKGHKKBVDTHJNJ
HHIHHIHHIHHIHHIDEDRJFKOLGCUOUUKJGLNJKKKKJKJKJGGHHBCFDII
HABHABHABHABHABHABHABHABGTHFOOLLLHHHUUJCIICXXTKCIABAGGC

Each line in file1 is a pattern that is repeating in the beginning of the corresponding lines in file2. I would like to get the parts of each line from file2 that are not the repeating patterns from file1.

desired output:

TRCFUJIKHRTHVFHJJHVHJJKKHGCC
FHJKGHKKBVDTHJNJ
DEDRJFKOLGCUOUUKJGLNJKKKKJKJKJGGHHBCFDII
GTHFOOLLLHHHUUJCIICXXTKCIABAGGC

I tried to use this loop:

while read -r line
do
grep -v "$line{1,}"   file2.txt 
done < file1.txt

But I got this output:

ABAABAABAABAABAABAABAABAABATRCFUJIKHRTHVFHJJHVHJJKKHGCC
FFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFHJKGHKKBVDTHJNJ
HHIHHIHHIHHIHHIDEDRJFKOLGCUOUUKJGLNJKKKKJKJKJGGHHBCFDII
HABHABHABHABHABHABHABHABGTHFOOLLLHHHUUJCIICXXTKCIABAGGC
ABAABAABAABAABAABAABAABAABATRCFUJIKHRTHVFHJJHVHJJKKHGCC
FFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFHJKGHKKBVDTHJNJ
HHIHHIHHIHHIHHIDEDRJFKOLGCUOUUKJGLNJKKKKJKJKJGGHHBCFDII
HABHABHABHABHABHABHABHABGTHFOOLLLHHHUUJCIICXXTKCIABAGGC
ABAABAABAABAABAABAABAABAABATRCFUJIKHRTHVFHJJHVHJJKKHGCC
FFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFHJKGHKKBVDTHJNJ
HHIHHIHHIHHIHHIDEDRJFKOLGCUOUUKJGLNJKKKKJKJKJGGHHBCFDII
HABHABHABHABHABHABHABHABGTHFOOLLLHHHUUJCIICXXTKCIABAGGC



Solution

With e.g. ABA in the variable, grep -v "$line{1,}" would give grep the pattern ABA{1,}, meaning it'd look for a single A, a single B and then at least one A. The last repetition doesn't matter, though, as there's nothing after that, so even a single ABA would match that.

Well, except that by default, grep uses basic regular expressions (BRE), where counted repetition must be written with backslashes, as \{n,m\}. In extended regular expressions (ERE), {1,} would mean one or more repeats (and so would +); but in BRE, it's just four literal characters (and + is also a regular character).
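To see the difference between the two syntaxes, here is a small sketch. The two sample lines are made up for illustration (the second one contains the literal text ABA{1,}); -E is the standard switch for ERE.

```shell
#!/bin/sh
# Two made-up sample lines: one with repeats, one containing the literal text ABA{1,}.
sample=$(printf 'ABAABATRC\nABA{1,}')

# BRE (default): {1,} is four literal characters, so only the second line matches.
bre_matches=$(printf '%s\n' "$sample" | grep 'ABA{1,}')

# BRE with backslashes: an interval applied to the last A, so both lines match.
bre_interval_matches=$(printf '%s\n' "$sample" | grep 'ABA\{1,\}')

# ERE: the same interval without backslashes, so both lines match here as well.
ere_matches=$(printf '%s\n' "$sample" | grep -E 'ABA{1,}')

printf 'BRE, no backslashes:\n%s\n' "$bre_matches"
printf 'BRE, backslashes:\n%s\n' "$bre_interval_matches"
printf 'ERE:\n%s\n' "$ere_matches"
```

In both interval forms the repetition applies only to the final A. To repeat the whole string it would need grouping, e.g. (ABA){1,} in ERE.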

But grep prints full lines that match, or with -v, don't match; it doesn't remove parts of the line. (Except with grep -o where it only prints the matching part, but I don't think that would work with -v.) Also, with that loop, grep would look at all the lines for each pattern, which is why you get the contents of file2 repeated multiple times.


We'd need a loop that reads one line from each input on each iteration. It could be done in shell, but it would be slow. Something like AWK would be better, e.g.:

$ awk '{getline pat < "file1"; sub("^(" pat ")*", ""); print}' file2
TRCFUJIKHRTHVFHJJHVHJJKKHGCC
FHJKGHKKBVDTHJNJ
DEDRJFKOLGCUOUUKJGLNJKKKKJKJKJGGHHBCFDII
GTHFOOLLLHHHUUJCIICXXTKCIABAGGC

The AWK program implicitly loops over the lines of file2 (and other files given on the command line), and here, we explicitly read one line from file1 each iteration. Then "^(" pat ")*" constructs a pattern like ^(ABA)*, which is matched against the current line, and substituted with the empty string.
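For comparison, the plain-shell version of that pairing might look roughly like the sketch below. It's only an illustration: strip_repeats is a made-up function name, the scratch files hold shortened sample data, and unlike the AWK version it treats each pattern as a fixed string rather than a regex.

```shell
#!/bin/sh
# Shortened sample data written to a scratch directory (hypothetical paths).
tmp=$(mktemp -d)
printf '%s\n' ABA FFR > "$tmp/file1"
printf '%s\n' ABAABATRCF FFRFFRFHJK > "$tmp/file2"

# strip_repeats PATTERNS DATA: hypothetical helper, not a standard tool.
strip_repeats() {
    # Read one pattern from fd 3 and one data line from fd 4 on each iteration.
    while IFS= read -r pat <&3 && IFS= read -r line <&4; do
        # Remove one leading copy of the pattern at a time until none is left.
        # Quoting "$pat" makes the shell treat it literally, not as a glob.
        while [ -n "$pat" ] && [ "${line#"$pat"}" != "$line" ]; do
            line=${line#"$pat"}
        done
        printf '%s\n' "$line"
    done 3< "$1" 4< "$2"
}

result=$(strip_repeats "$tmp/file1" "$tmp/file2")
printf '%s\n' "$result"
rm -r "$tmp"
```

The loop stops as soon as either file runs out of lines. For large inputs, a read loop like this one would likely be noticeably slower than the AWK one-liner.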

The sub() call only removes the run at the start of the line, so e.g. ABAABAFOOABABAR would turn into FOOABABAR. If you want to remove instances further along the line too, change it to gsub("(" pat ")*", ""), which would turn that example into FOOBAR.
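The two behaviours can be compared side by side on that example line. In this sketch the pattern is passed in with -v pat=ABA instead of being read from file1, purely to keep the example self-contained.

```shell
#!/bin/sh
line='ABAABAFOOABABAR'

# Anchored sub(): only the run of ABA at the start of the line is removed.
prefix_only=$(printf '%s\n' "$line" |
    awk -v pat='ABA' '{ sub("^(" pat ")*", ""); print }')

# Unanchored gsub(): every run of ABA anywhere in the line is removed.
everywhere=$(printf '%s\n' "$line" |
    awk -v pat='ABA' '{ gsub("(" pat ")*", ""); print }')

printf '%s\n' "$prefix_only"   # FOOABABAR
printf '%s\n' "$everywhere"    # FOOBAR
```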





Comments (3)

  • +0 – I seem to recall having used grep -vo at least once, and I'm pretty sure it did work as expected. — Jul 22, 2022 at 01:34  
  • +0 – @Kevin, well, I'm not even sure what to expect from the combination, really. -v says to print lines without matches, and -o says to print the matching parts. Their intersection is empty. GNU grep seems to print nothing, but return an exit status based on -v; Busybox either printed nothing or segfaulted; and the BSD grep on my mac printed the non-matching lines. So, they basically did the same as -vq, or just the plain -v. — Jul 22, 2022 at 08:50  
  • +0 – IIRC it prints "the part of the line that didn't match," at least on GNU. — Jul 22, 2022 at 17:47