Table of Contents

Grep with large pattern files

When I investigated the run times of grep -f related to large pattern files (with hundreds of lines and more) it became apparent that one can speed up grep by splitting up the pattern files into smaller chunks.

I tested this with GNU grep 2.5.3 on a machine with an Intel Xeon 2400 MHz CPU running Debian Lenny. grep -f with a pattern file of 1900 lines1) takes at least 72 seconds (with no match found).

$ TIMEFORMAT="%R seconds"
$ time grep -f pattern-file testmail_for_grep.txt
73.852 seconds

To speed things up, for instance, one can use split to split pattern-file into smaller chunks and run grep with each of them. Here is an example which uses chunks of 50 lines:

split -l 50 pattern-file pattern-file.split.
for CHUNK in pattern-file.split.* ; do
        grep -f "$CHUNK" testmail_for_grep.txt
rm pattern-file.split.*

Using the same pattern file as above (1900 lines) grep -f of all pattern files splitted with split -l 50 takes only about 1.1 second!

Optimal chunks

For the fun of it I tried to find the optimal size of chunks. I assume that it depends on a multitude of factors (such as the speed of your file system) but in this particular case I gained best results for chunks of about 20 lines. The run time for chunks of 20 lines was 0.70 seconds. I did the same tests with a similar pattern file of 900 lines. There, the optimal chunk size was about 20 lines, too.

Pattern by pattern

In the related bug report bug #16305: grep much less efficient ..., Levi Waldron suggested to use

for line in `cat patterns.txt`;do grep $line data.txt >> matches.txt;done

This won't work in general (it e.g. breaks when patterns contain spaces). However, a while loop using bash's read might do better (testmail_for_grep.txt being data.txt).

while read line ; do grep "$line" testmail_for_grep.txt ; done < pattern-file

The run time of this method was a bit more than 4 seconds for 1900 patterns. This is much slower than using split but in cases where one does not want to write temporary files it might come in handy. Note also that this method always scales linearly (in contrast to grep -f) which at least makes it easier to estimate run times of arbitrarily large pattern files.

Duplicates and nothing

BTW, grep does not check whether there are duplicate lines in your pattern file. If there are you might want to run it through sort -u or uniq first. Also, it does not check whether your input file is empty. Running my pattern file of 1900 lines over an empty file still took nearly 10 seconds ;-)

$ rm testmail_for_grep.txt ; touch testmail_for_grep.txt
$ time grep -f pattern-file testmail_for_grep.txt
9.879 seconds
The same as before.