
Grep with large pattern files

When I investigated the run times of grep -f with large pattern files (hundreds of lines and more), it became apparent that one can speed up grep considerably by splitting the pattern file into smaller chunks.

I tested this with GNU grep 2.5.3 on a machine with an Intel Xeon 2400 MHz CPU running Debian Lenny. grep -f with a pattern file of 1900 lines1) takes at least 72 seconds (with no match found).

$ TIMEFORMAT="%R seconds"
$ time grep -f pattern-file testmail_for_grep.txt
73.852 seconds

To speed things up, one can, for instance, use split to break pattern-file into smaller chunks and run grep with each of them. Here is an example using chunks of 50 lines:

split -l 50 pattern-file pattern-file.split.
for CHUNK in pattern-file.split.* ; do
        grep -f "$CHUNK" testmail_for_grep.txt
done
rm pattern-file.split.*

Using the same pattern file as above (1900 lines), grep -f over all the chunks produced by split -l 50 takes only about 1.1 seconds!

Optimal chunks

For the fun of it I tried to find the optimal chunk size. I assume that it depends on a multitude of factors (such as the speed of your file system), but in this particular case I got the best results with chunks of about 20 lines; the run time for chunks of 20 lines was 0.70 seconds. I did the same tests with a similar pattern file of 900 lines. There, the optimal chunk size was about 20 lines, too.
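Such a comparison can be scripted along the following lines (just a sketch, not the script I actually used; it assumes the pattern-file and testmail_for_grep.txt from above, and the chunk sizes are arbitrary examples):

TIMEFORMAT="%R seconds"
# time the chunked grep for several chunk sizes
for SIZE in 10 20 50 100 200 ; do
        split -l "$SIZE" pattern-file pattern-file.split.
        echo -n "chunk size $SIZE: "
        time for CHUNK in pattern-file.split.* ; do
                grep -f "$CHUNK" testmail_for_grep.txt
        done >/dev/null
        rm pattern-file.split.*
done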

Pattern by pattern

In the related bug report bug #16305: grep much less efficient ..., Levi Waldron suggested using

for line in `cat patterns.txt`;do grep $line data.txt >> matches.txt;done

This won't work in general (it breaks, e.g., when patterns contain spaces). However, a while loop using bash's read might do better (with testmail_for_grep.txt standing in for data.txt):

while read -r line ; do grep "$line" testmail_for_grep.txt ; done < pattern-file

The run time of this method was a bit more than 4 seconds for 1900 patterns. This is much slower than using split, but in cases where one does not want to write temporary files it might come in handy. Note also that this method always scales linearly (in contrast to grep -f), which at least makes it easier to estimate the run times of arbitrarily large pattern files.

Duplicates and nothing

BTW, grep does not check whether there are duplicate lines in your pattern file. If there are, you might want to run it through sort -u or uniq first. Also, it does not check whether your input file is empty. Running my pattern file of 1900 lines over an empty file still took nearly 10 seconds ;-)

$ rm testmail_for_grep.txt ; touch testmail_for_grep.txt
$ time grep -f pattern-file testmail_for_grep.txt
9.879 seconds
1) The same as before.

Discussion

Milan, 2011/05/05 13:05

I am using this suggestion and it works great. But it is not usable for the -v option. Any ideas?

Andreas Schamanek, 2011/05/06 20:28

Thanks for the great question :-) I haven't tried it myself with large pattern files but you might want to try comm.

$ cat data.txt
111
AAA
222
BBB
333
CCC
$ cat patterns.txt
A
33
$ # we simulate the approach of splitting up patterns.txt here:
$ ( grep 'A' data.txt ; grep '33' data.txt ) >inverse.txt
$ cat inverse.txt
AAA
333
$ comm -23 <(sort data.txt) <(sort inverse.txt)
111
222
BBB
CCC
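Scaled up to the chunked grep from the article, the same idea would look roughly like this (an untested sketch; matched.txt is just a temporary name, and note that, unlike grep -v, the output comes out sorted):

split -l 50 pattern-file pattern-file.split.
for CHUNK in pattern-file.split.* ; do
        grep -f "$CHUNK" testmail_for_grep.txt
done | sort -u > matched.txt
comm -23 <(sort testmail_for_grep.txt) matched.txt
rm pattern-file.split.* matched.txt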
Simon Heath, 2014/03/07 13:13

Useful info, but I wanted to add my own observations - I found this page when having grep performance trouble myself. I've found the line-by-line version using cat a lot quicker. This is on a production box where I was looking to pull circa 13,000 accounts in my grep list file out of some batch files containing 1.2 million records. Although the box has 2x12-core Xeons, it makes no difference since grep is single-threaded ;-|

I tried the split method with 20 rows per split file, and after an hour it had produced about 90 accounts in the output file :-( I'd have been quicker finding, cutting and pasting from vi :-)

Reading a line at a time with cat and a for loop got me all 12,918 accounts in 25 minutes. Note this is Red Hat Enterprise Linux Server release 5.8 (Tikanga).

Script I knocked up to make it reusable:

cat greplots.sh

#Usage : greplots.sh <greplistfile> <filetogrep>
#Reads 1 line at a time from the $1 search string file and greps the file.
#Used as grep gets exponentially slower with large input files.
for searchstring in `cat "$1"`
do
  grep -Fa "$searchstring" "$2"
done

Note the -a is there because the files I'm looking through have lots of control characters, making grep think they are binary files. The -F says these are fixed strings, not expressions, which speeds things up a bit too. Given a bit longer I'd have split the file being searched into 12 streams and fired off multiple processes, each in a backgrounded subshell, with a wait at the end, and then concatenated the results back together.
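Roughly, that would have looked something like this (just an untested sketch; bigfile and greplist.txt are made-up names, and greplots.sh is the script above):

# Sketch: grep 12 streams of the input file in parallel using greplots.sh
LINES=$(( ( $(wc -l < bigfile) + 11 ) / 12 ))   # lines per stream
split -l "$LINES" bigfile bigfile.part.
for PART in bigfile.part.* ; do
  sh greplots.sh greplist.txt "$PART" > "$PART.out" &
done
wait
cat bigfile.part.*.out > all_matches.txt
rm bigfile.part.*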

Andreas Schamanek, 2014/03/10 22:19

@all: Please keep in mind that locales can have a huge impact on the performance of text operations. If possible, avoid UTF-8 locales and set e.g. LC_ALL=C before grepping.
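For instance, to run one of the tests from the article in the C locale (just an illustration):

$ time LC_ALL=C grep -f pattern-file testmail_for_grep.txt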

@Simon: Thanks for your data and feedback.
