V

Expensive grep

One of my servers is powered by a P-II CPU running at 266 MHz. This is still fast enough to serve well as a firewall, and as a mail and file server. Though, it is not fast enough for a grep -f if the file with the patterns has at least a few hundred lines including the match-all regex .*.

On this particular server the pattern file had about 750 lines of which 99 % looked like

^From: .*mail1@address.sth
^From: .*mail2@domain.this
^From: .*mail1@sub.domain.that
...

Here, grep -f pattern_file mail_message takes about 1 minute.

grep speeds compared 266 MHz 2400 MHz
700 patterns with .* 55 s 5.4 s
700 patterns without .* 21 s 1.2 s

In this particular case where I wanted to match on specific mail addresses only it is of course much faster to remove the leading ^From: .* from the pattern file, and to use something like

grep address_from_message file_with_addresses

In order to be able to use this method from within a .procmailrc I wrote mail_lookuplist. So, now you know why ;)

Exponential or polynomial run times

Run times of grep feature polynomial or exponential growth depending on the size of the pattern files

When I was evaluating the run times of various forms and uses of grep I found an explanation why these longer run times probably never occurred to me. According to my tests using different sizes of pattern files I found that run times increase polynomially or exponentially.

It needs sufficiently large pattern files or ancient CPU speeds to get into the ranges where one's (at least my) alarm bells fire off ;)

By the way, I would like to ask to bear with me that these graphs and numbers all lack proper statistical procedures. Everyone interested is kindly invited to reproduce the tests.

As far as I can say the main factor is the size of the pattern files. I have also compared times of case sensitive (see "msg w/ 1 partial match") versus case insensitive grep ("same but case insensitive"). The differences here are small.

Comparing grep run times

Rather interesting is that the difference in run times were also rather small when I reduced the message file being tested to the 1 single "From:" line it contained (which in fact would be the way to go since all patterns started with ^From: anyway), see "1 line partial match" in the graph.

Whether there is a matching line or not could make a difference, though. In all my tests I used a real mail messages (RFC 822 style, 6 kB, 114 lines). Of course, there was the "From:" line which always partially matched with the patterns. Run times were much shorter when I changed it by e.g. adding a > in front of the From: , see "no partial match".

Update of 2008-07-27: Edgar Holleis kindly pointed out1) that the graphs probably do not show exponential but quadratic growth. Indeed, further analysis suggests that the underlying mechanisms exhibit some combination. The most detailed data I could gather (of up to 1900 patterns) was best fitted by 3rd and 4th degree polynomial functions.

Update of 2008-07-28: For a workaround to speed up grep -f see Grep with large pattern files.

Discussion

Andreas Schamanek, 2008-07-20 21:27

All tests were done with GNU grep 2.5.1 (on the slower machine) and grep 2.5.3 on the faster one, both running Debian 4.x. Tests for the graph "GREP TIMES (2)" were of course done on the faster machine.

Andreas Schamanek, 2008-07-27 21:18

Eventually, I found some related threads on the web:

On March 20, 2006, Narfi Stefansson reported that grep -f scales extremely poorly with number of lines in pattern file. There, Julian Foad later pointed out that a bug report for this was created by Levi Waldron on April 8, 2006: grep much less efficient when matching multiple patterns than when matching each pattern sequentially. It's still open as of today.

Enter your comment. Wiki syntax is allowed:
J B Y S X
 
 
blog/080720_expensive_grep.txt · Last modified: 2008-07-28 00:18 by andreas