When I investigated the run times of grep -f with large pattern files (hundreds of lines and more), it became apparent that one can speed up
grep considerably by splitting the pattern file into smaller chunks.
I tested this with GNU grep 2.5.3 on a machine with an Intel Xeon 2400 MHz CPU running Debian Lenny.
grep -f with a pattern file of 1900 lines takes at least 72 seconds (with no match found).
$ TIMEFORMAT="%R seconds"
$ time grep -f pattern-file testmail_for_grep.txt
73.852 seconds
To speed things up one can, for instance, use split to break pattern-file into smaller chunks and run grep with each of them. Here is an example which uses chunks of 50 lines:
split -l 50 pattern-file pattern-file.split.
for CHUNK in pattern-file.split.* ; do
    grep -f "$CHUNK" testmail_for_grep.txt
done
rm pattern-file.split.*
Using the same pattern file as above (1900 lines), grep -f over all the chunks produced by split -l 50 takes only about 1.1 seconds!
For the fun of it I tried to find the optimal chunk size. I assume it depends on a multitude of factors (such as the speed of your file system), but in this particular case I got the best results with chunks of about 20 lines; the run time then was 0.70 seconds. I ran the same tests with a similar pattern file of 900 lines, and there the optimal chunk size was about 20 lines as well.
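Measurements like these are easy to script. The following is a rough sketch that times the split approach for a few candidate chunk sizes; it uses tiny stand-in files (the 1900-line pattern file from above is not reproduced here), so the file names and contents are illustrative only:

```shell
# Hypothetical stand-ins for the article's pattern-file and test file.
printf 'alpha\nbeta\ngamma\ndelta\n' > pattern-file
printf 'some text with gamma inside\n' > testmail_for_grep.txt

# Time the chunked grep -f run for each candidate chunk size.
for SIZE in 2 4 ; do
    split -l "$SIZE" pattern-file pattern-file.split.
    START=$(date +%s)
    for CHUNK in pattern-file.split.* ; do
        grep -f "$CHUNK" testmail_for_grep.txt
    done
    END=$(date +%s)
    echo "chunk size $SIZE: $((END - START)) seconds"
    rm pattern-file.split.*
done
```

With realistic inputs you would replace the stand-in files with your own and widen the list of sizes (e.g. 10 20 50 100).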
In the related bug report bug #16305: grep much less efficient ..., Levi Waldron suggested using
for line in `cat patterns.txt`;do grep $line data.txt >> matches.txt;done
This won't work in general (it breaks, e.g., when patterns contain spaces). However, a while loop using read might do better:

while read line ; do grep "$line" testmail_for_grep.txt ; done < pattern-file
The run time of this method was a bit more than 4 seconds for 1900 patterns. That is much slower than using split, but in cases where one does not want to write temporary files it might come in handy. Note also that this method always scales linearly in the number of patterns (in contrast to grep -f), which at least makes it easier to estimate run times for arbitrarily large pattern files.
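One difference from a single grep -f call is worth noting: since the while loop runs grep once per pattern, an input line that matches several patterns is printed once per match. If that matters, piping the loop's output through sort -u collapses the duplicates. A small sketch with hypothetical demo files:

```shell
# Hypothetical demo files; 'foo' and 'bar' both match the first line.
printf 'foo\nbar\n' > pattern-file
printf 'a line with foo and bar\nno match here\n' > testmail_for_grep.txt

# Each pattern is grepped separately, so the matching line would be
# printed twice; sort -u keeps a single copy (at the cost of ordering).
while read line ; do
    grep "$line" testmail_for_grep.txt
done < pattern-file | sort -u
```

Note that sort -u changes the output order to sorted order, which may or may not be acceptable for your use case.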
grep does not check whether there are duplicate lines in your pattern file. If there are, you might want to run it through sort -u or uniq first. Also, grep does not check whether your input file is empty. Running my pattern file of 1900 lines over an empty file still took nearly 10 seconds ;-)
$ rm testmail_for_grep.txt ; touch testmail_for_grep.txt
$ time grep -f pattern-file testmail_for_grep.txt
9.879 seconds
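For completeness, the deduplication step mentioned above is a one-liner. A minimal sketch, using a hypothetical pattern file that contains a repeated line (note that plain uniq only removes adjacent duplicates, so unsorted files need sort -u):

```shell
# Hypothetical pattern file with a duplicate entry.
printf 'foo\nbar\nfoo\n' > pattern-file

# sort -u keeps one copy of each pattern, so grep -f does no
# redundant work on duplicates.
sort -u pattern-file > pattern-file.uniq
mv pattern-file.uniq pattern-file
```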