Benchmark of Regex Libraries

Introduction

Regular expressions (regex) is often the most convinient approach for parsing text files. The performance of regex matching may be critical when the input files are huge, which is quite common in analyzing biological data. Most scripting languages support regex and there are various C/C++ libraries, too, but surprisingly few have evaluated their performance. This page aims to provide a preliminary benchmark on the performance of several regex libraries as well as scripting languages.

Design

Input text

The text file for this benchmark is a version of concatenated Linux howto document. This file has 39,422,105 bytes, 5,330,321 words and 1,086,077 lines (according to wc). The maximum line length is 533 and the median is 33.

Patterns

URI (protocol://server/path): ([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/[^ ]*)?
Email (name@server): ([^ @]+)@([^ @]+)
Date (month/day/year): ([0-9][0-9]?)/([0-9][0-9]?)/([0-9][0-9]([0-9][0-9])?)
URI|Email: ([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/[^ ]*)?|([^ @]+)@([^ @]+)

It should be note that the above regular expressions may not be strictly correct or the optimal. We just need something to test the performance.

It is also worth noting that matching word boundaries is not required by POSIX regex. For the patterns above, we may simply check the character before and after the matching region to test if the entire match is a word.

Matching

The benchmarking program reads the file and tests matching line by line. The tailing new line character is trimmed as different libraries may treat it differently.

Results

All the benchmark programs or scripts are available at this page. The following table shows the performance on a 64-bit Macbook Pro clocked at 2.53GHz with Mac OS X 10.6.4 and gcc-4.2.1 installed.

Library/program	Algo	UTF-8	Perl	Light	POSIX	URI	Email	Date	Sum3	URI\|Email
boost::regex (1.42)	BT	?	Yes	No	No	5.44	1.82	1.15	8.41	9.84
boost::xpressive (1.42)	?	?	Yes	No	No	3.72	1.51	1.22	6.45	7.80
Glib (2.24.1)	?	?	Yes	No	No	1.08	0.84	0.79	2.35	18.02
onig-posix (5.9.2)	BT	Yes	Yes	No	Yes	0.34	0.44	0.41	1.19	10.89
PCRE-posix (8.10)	BT	Yes	Yes	No	Yes	0.67	0.39	0.54	1.60	13.48
RE2 (2010-07-19)	FSA	Yes	Yes	No	No	0.57	0.56	0.55	1.68	0.58
Regex (Mac, 10.6.4)	?	?	No	No	Yes	1.18	0.54	1.28	3.00	19.66
Regex (NCBI, ?)	BT	No	No	Yes	Yes	7.21	9.83	3.45	20.49	22.35
regexp9 (20100715)	FSA	Yes	No	Yes	Similar	1.90	2.12	0.95	4.97	4.55
Tcl (8.5)	?	Yes	No	No	Similar	2.17	1.73	1.52	5.42	3.66
TRE (0.8.0)	FSA	?	No	No	Similar	4.34	2.42	1.73	8.49	8.57
T-Rex (1.0)	?	No	No	Yes	No	4.39	6.45	2.45	13.29	9.35
egrep (GNU, 2.5.1)	?	Yes	No	NA	NA	0.07	0.08	0.10	0.25	0.16
gawk (3.1.8)	?	Yes	No	NA	NA	1.43	1.40	1.38	4.21	1.43
Javascript V8 (2.3.1)	?	?	?	NA	NA	2.30	2.57	1.17	6.04	3.70
nawk (20070501)	FSA	No	No	NA	NA	1.40	1.40	1.40	4.20	1.40
Perl (5.8.9)	BT	Yes	Yes	NA	NA	0.63	0.59	0.52	1.74	10.61
Python (2.6.1)	?	Yes	No	NA	NA	6.37	10.19	2.18	18.74	16.03

Column `Algo' indicates the algorithm for regex. `BT'=backtracking, and `FSA'=finite state machine.
Column `UTF-8' indicates if the library explicitly supports UTF-8 encoding. See also the Unicode section on wiki.
Column `Perl' denotes if the library supports advanced perl-compatible syntax, such as word boundaries and back reference.
Column `Light' shows whether the library is implemented in two files: a header and a source file. Such two-file libraries can be easily included in a 3rd-party program at the source-code level, which makes the 3rd-party program less dependent on external APIs.
Column `URI', `Email' and `Date' give the CPU seconds for finding all matching lines given the three regular expressions above, respectively. Column `Sum3' gives the sum of CPU seconds for `URI', `Email' and `Date'.
T-Rex gives additional matching lines for `URI', which is incorrect.

Discussions

Note: boost::xpressive, Glib, Tcl and V8 were added later. The discussions below did not consider these implementations.

Backtracking vs. state machines

Regex libraries are typically implemented with backtracking (BT) or finite state machine (FSA). The former approach is more flexible but can be exponential in time given some regex. The latter guarantees polynomial time in searching, but usually less flexible. RE2 is believed to be the only library that uses FSA and supports Perl-like syntax at the same time.

In practice, implementations based on BT and FSA seem to have similar performance on short regex. However, FSA may greatly outperform BT given the `|' operator in regex. This means for BT-based implementations, matching with multiple regex is preferred.

Speed of C/C++ regex libraries

For `|'-free regex, onig is the winner. PCRE and RE2 are similar in performance. The regex library from Mac OS X is comes in the next place. Of the three light-weight library, only regexp9 is close to the performance of matured libraries. TRE and boost::regex are nearly 8X slower than onig on `|'-free regex.. The T-Rex and NCBI port of regex are the slowest implementations.

For `|'-contained regex, the results are changed a lot. RE2 clearly beats all the rest. Regexp9 and TRE, the other two FSA based libraries come in the second and the third places, respectively. Several BT-based algorithms such as onig and PCRE are 10X slower than using two regex separately, although this is not true for all BT-based algorithms.

Speed of standalone programs and scripting languages

GNU's egrep is fast mainly because it hybridizes regex matching with Boyer-Moore search. I do not know how exactly this is done, but I do hope other regex may adopt this heuristics. On all the regex I have tried, egrep is the fastest. Another possible reason that egrep is fast may be because it does not keep track of grouping.

Perl is surprisingly fast as a scripting language. Python, on the contrary, is surprisingly slow. I heavily rely on regex for parsing huge text files. This benchmark tells me that python, although faster than perl in other applications, is not the right language for me.

C vs. C++

I used PCRE's POSIX APIs in the table above. If I use its C++ APIs, PCRE becomes 10% slower. C++ interfaces incur minor overheads.

For boost::regex, I implemented two versions: one uses the fgets() libc API and the other uses std::getline() to read into a C++ string. The already-slow boost::regex becomes 50% slower, which means 50% CPU time goes to a function as simple as getline()! Why not implement getline() by calling fgets()?

Alternative benchmarks

Russ Cox, the developer of the RE2 library, evaluated the performance of RE2 and PCRE. He seems to be using long texts, while my programs match line by line. The results may not be comparable.

Google search will bring us to this page. The conclusion broadly agrees with mine: PCRE is significantly faster than boost::regex. Nonetheless, that benchmark is not performed in a realistic context.

This page evaluates the performance of regex matching of several scripting languages. It also points out that python is slower in regex matching. Another interesting conclusion from that page is perl is efficient given `/FOO/ || /BAR/' but very inefficient given `/FOO|BAR/'. So I also added this type of regex to my benchmark.

The regex-dna benchmark from The Computer Language Benchmarks Game shows that Python is faster than Perl. This may be related to Perl's inefficiency given `/FOO|BAR/'. I have not tested this, though.

Conclusions

Before I did this benchmark, I thought all regex implementations are similar in performance, like what happens to the hash table libraries. To my big surprise, the fact is the contrary. The performance varies a lot between implementations and between regex. Most widely used implementations such as boost::regex, Python, Perl and gawk all have striking weakness.

In this benchmark, RE2 seems to be the overall winner. It is feature-rich and among the fastest for all 4 regex. More importantly, it does not suffer from the worst-case problem. RE2 is implemented in C++. For C programmers, onig or PCRE is a good choice. However, to use these two libraries, we should be aware that regex containing the `|' operator may greatly degrade the performance. If this is a concern, one may consider regexp9. Regexp9 is featured as its small code size (<1300 lines of code) which makes it the ideal library when we want to include the source code in a project to avoid an extra library dependency.

Appendix: Performance on x64-linux

Compiled on Debian etch with gcc-4.1.2 (CPU: Xeon E5450 @3GHz). This is a server. Timing may have large variance due to other active processes running on the same machine.

Library/Program	URI	Email	Date	URI\|Email
re2 (20100719)	0.54	0.56	0.50	0.64
regex (glibc 2.3.6)	3.12	3.96	0.39	3.91
regexp9 (20100715)	1.74	2.09	0.97	4.25
tcl (8.4)	1.56	1.26	1.09	2.72
gawk (3.1.5)	0.47	0.43	0.30	0.36
egrep (GNU 2.5.1)	0.09	0.05	0.10	0.16
nawk (20070501)	0.94	0.92	0.92	0.92
perl (5.8.8)	0.62	0.56	0.62	9.93
python (2.5)	4.26	7.00	1.58	10.66