One of my servers is powered by a P-II CPU running at 266 MHz. This is still fast enough to serve well as a firewall, and as a mail and file server. Though, it is not fast enough for a
grep -f if the file with the patterns has at least a few hundred lines including the match-all regex
On this particular server the pattern file had about 750 lines of which 99 % looked like
^From: .*firstname.lastname@example.org ^From: .*email@example.com ^From: .*firstname.lastname@example.org ...
grep -f pattern_file mail_message takes about 1 minute.
|grep speeds compared||266 MHz||2400 MHz|
| 700 patterns with ||55 s||5.4 s|
| 700 patterns without ||21 s||1.2 s|
In this particular case where I wanted to match on specific mail addresses only it is of course much faster to remove the leading
^From: .* from the pattern file, and to use something like
grep address_from_message file_with_addresses
In order to be able to use this method from within a
.procmailrc I wrote mail_lookuplist. So, now you know why ;)
When I was evaluating the run times of various forms and uses of
grep I found an explanation why these longer run times probably never occurred to me. According to my tests using different sizes of pattern files I found that run times increase polynomially or exponentially.
It needs sufficiently large pattern files or ancient CPU speeds to get into the ranges where one's (at least my) alarm bells fire off ;)
By the way, I would like to ask to bear with me that these graphs and numbers all lack proper statistical procedures. Everyone interested is kindly invited to reproduce the tests.
As far as I can say the main factor is the size of the pattern files. I have also compared times of case sensitive (see "msg w/ 1 partial match") versus case insensitive
grep ("same but case insensitive"). The differences here are small.
Rather interesting is that the difference in run times were also rather small when I reduced the message file being tested to the 1 single "From:" line it contained (which in fact would be the way to go since all patterns started with
^From: anyway), see "1 line partial match" in the graph.
Whether there is a matching line or not could make a difference, though. In all my tests I used a real mail messages (RFC 822 style, 6 kB, 114 lines). Of course, there was the "From:" line which always partially matched with the patterns. Run times were much shorter when I changed it by e.g. adding a
> in front of the
From: , see "no partial match".
Update of 2008-07-27: Edgar Holleis kindly pointed out1) that the graphs probably do not show exponential but quadratic growth. Indeed, further analysis suggests that the underlying mechanisms exhibit some combination. The most detailed data I could gather (of up to 1900 patterns) was best fitted by 3rd and 4th degree polynomial functions.
Update of 2008-07-28: For a workaround to speed up
grep -f see
Grep with large pattern files.