Let me start off by saying I don't want to print only the duplicate lines nor do I want to remove them.
I am trying to use grep with a pattern file to parse a large data file.
The Pattern file for example may look like this:
1243
1234
1234
1234
1354
1356
1356
1677
etc. with more single and duplicate entries.
The Input data file might look like this:
aatta 1243 qqqqqq
yyyyy 1234 vvvvvv
ttttt 1555 bbbbbb
ppppp 1354 pppppp
yyyyy 3333 zzzzzz
qqqqq 1677 eeeeee
iiiii 4444 iiiiii
etc. for 27000 lines.
When I use
grep -f 'Patternfile.txt' 'Inputfile.txt' > 'Outputfile.txt'
I get an output file that resembles this:
aatta 1243 qqqqqq
yyyyy 1234 vvvvvv
ppppp 1354 pppppp
How can I get it to also report the duplicates, so that I end up with something like this?
aatta 1243 qqqqqq
yyyyy 1234 vvvvvv
yyyyy 1234 vvvvvv
yyyyy 1234 vvvvvv
ppppp 1354 pppppp
qqqqq 1677 eeeeee
Additionally, I would like to print a blank line whenever a query in the pattern file does not match a substring in the input file.
Thank you!
One solution, not with grep, but with perl:
With patternfile.txt and inputfile.txt containing the data from your original post, the following script.pl should do the job. (I assume that the string to match is the second column; otherwise the script would have to be modified to use a regexp instead. Looking the column up in a hash, as done here, is faster.)
use warnings;
use strict;
## Check arguments.
die qq[Usage: perl $0 <pattern-file> <input-file>\n] unless @ARGV == 2;
## Open input files.
open my $pattern_fh, qq[<], shift @ARGV or die qq[Cannot open pattern file\n];
open my $input_fh, qq[<], shift @ARGV or die qq[Cannot open input file\n];
## Hashes to save the patterns and the lines of the data file.
my (%pattern, %input);
## Read each pattern and save the line number of its first appearance and
## how many times it appears in the file.
while ( <$pattern_fh> ) {
    chomp;
    if ( exists $pattern{ $_ } ) {
        $pattern{ $_ }->[1]++;
    }
    else {
        $pattern{ $_ } = [ $., 1 ];
    }
}
## Read the data file and save each line in another hash, keyed by its
## second column.
while ( <$input_fh> ) {
    chomp;
    my @f = split;
    next unless @f > 1;    # Skip lines that have no second column.
    $input{ $f[1] } = $_;
}
## For each pattern, in order of first appearance, look it up in the data.
## If it is found, print the line as many times as the pattern was seen;
## otherwise print blank lines.
for my $p ( sort { $pattern{ $a }->[0] <=> $pattern{ $b }->[0] } keys %pattern ) {
    if ( exists $input{ $p } ) {
        printf qq[%s\n], $input{ $p } for ( 1 .. $pattern{ $p }->[1] );
    }
    else {
        # Old behaviour.
        # printf qq[\n];
        # New requirement.
        printf qq[\n] for ( 1 .. $pattern{ $p }->[1] );
    }
}
Run it like:
perl script.pl patternfile.txt inputfile.txt
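It gives the following output (the two blank lines come from the two 1356 entries in the pattern file, which match nothing in the input):
aatta 1243 qqqqqq
yyyyy 1234 vvvvvv
yyyyy 1234 vvvvvv
yyyyy 1234 vvvvvv
ppppp 1354 pppppp


qqqqq 1677 eeeeee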
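If perl is not an option, the same idea can probably be sketched with awk as well. This is only a sketch under the same assumption as the script above (the string to match is the second whitespace-separated column, compared exactly rather than as a substring). The function name lookup_patterns is made up for illustration, and, like the perl script, it keeps only the last data line for a given key:

```shell
# Hypothetical helper (name chosen for illustration): for every line of the
# pattern file, print the data line whose second column equals that pattern,
# or a blank line when no data line has that value.
lookup_patterns() {
    # $1 = pattern file, $2 = data file
    awk '
        # First file on the command line (the data file, assumed non-empty):
        # remember each line, keyed by its second column.
        NR == FNR { line[$2] = $0; next }
        # Second file (the patterns): print the stored line, which is an
        # empty string (so a blank line) when the key was never seen.
        { print line[$1] }
    ' "$2" "$1"
}
```

Called as `lookup_patterns Patternfile.txt Inputfile.txt > Outputfile.txt`, it walks the pattern file top to bottom, so duplicates and blank lines come out in the order of the pattern file rather than grouped by first appearance.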