Regular expressions (regex) is often the most convinient approach for parsing text files. The performance of regex matching may be critical when the input files are huge, which is quite common in analyzing biological data. Most scripting languages support regex and there are various C/C++ libraries, too, but surprisingly few have evaluated their performance. This page aims to provide a preliminary benchmark on the performance of several regex libraries as well as scripting languages.
The text file for this benchmark is a version of concatenated Linux howto document. This file has 39,422,105 bytes, 5,330,321 words and 1,086,077 lines (according to wc). The maximum line length is 533 and the median is 33.
It should be note that the above regular expressions may not be strictly correct or the optimal. We just need something to test the performance.
It is also worth noting that matching word boundaries is not required by POSIX regex. For the patterns above, we may simply check the character before and after the matching region to test if the entire match is a word.
The benchmarking program reads the file and tests matching line by line. The tailing new line character is trimmed as different libraries may treat it differently.
All the benchmark programs or scripts are available at this page. The following table shows the performance on a 64-bit Macbook Pro clocked at 2.53GHz with Mac OS X 10.6.4 and gcc-4.2.1 installed.
|Regex (Mac, 10.6.4)||?||?||No||No||Yes||1.18||0.54||1.28||3.00||19.66|
|Regex (NCBI, ?)||BT||No||No||Yes||Yes||7.21||9.83||3.45||20.49||22.35|
|egrep (GNU, 2.5.1)||?||Yes||No||NA||NA||0.07||0.08||0.10||0.25||0.16|
Note: boost::xpressive, Glib, Tcl and V8 were added later. The discussions below did not consider these implementations.
Regex libraries are typically implemented with backtracking (BT) or finite state machine (FSA). The former approach is more flexible but can be exponential in time given some regex. The latter guarantees polynomial time in searching, but usually less flexible. RE2 is believed to be the only library that uses FSA and supports Perl-like syntax at the same time.
In practice, implementations based on BT and FSA seem to have similar performance on short regex. However, FSA may greatly outperform BT given the `|' operator in regex. This means for BT-based implementations, matching with multiple regex is preferred.
For `|'-free regex, onig is the winner. PCRE and RE2 are similar in performance. The regex library from Mac OS X is comes in the next place. Of the three light-weight library, only regexp9 is close to the performance of matured libraries. TRE and boost::regex are nearly 8X slower than onig on `|'-free regex.. The T-Rex and NCBI port of regex are the slowest implementations.
For `|'-contained regex, the results are changed a lot. RE2 clearly beats all the rest. Regexp9 and TRE, the other two FSA based libraries come in the second and the third places, respectively. Several BT-based algorithms such as onig and PCRE are 10X slower than using two regex separately, although this is not true for all BT-based algorithms.
GNU's egrep is fast mainly because it hybridizes regex matching with Boyer-Moore search. I do not know how exactly this is done, but I do hope other regex may adopt this heuristics. On all the regex I have tried, egrep is the fastest. Another possible reason that egrep is fast may be because it does not keep track of grouping.
Perl is surprisingly fast as a scripting language. Python, on the contrary, is surprisingly slow. I heavily rely on regex for parsing huge text files. This benchmark tells me that python, although faster than perl in other applications, is not the right language for me.
I used PCRE's POSIX APIs in the table above. If I use its C++ APIs, PCRE becomes 10% slower. C++ interfaces incur minor overheads.
For boost::regex, I implemented two versions: one uses the fgets() libc API and the other uses std::getline() to read into a C++ string. The already-slow boost::regex becomes 50% slower, which means 50% CPU time goes to a function as simple as getline()! Why not implement getline() by calling fgets()?
Google search will bring us to this page. The conclusion broadly agrees with mine: PCRE is significantly faster than boost::regex. Nonetheless, that benchmark is not performed in a realistic context.
This page evaluates the performance of regex matching of several scripting languages. It also points out that python is slower in regex matching. Another interesting conclusion from that page is perl is efficient given `/FOO/ || /BAR/' but very inefficient given `/FOO|BAR/'. So I also added this type of regex to my benchmark.
Before I did this benchmark, I thought all regex implementations are similar in performance, like what happens to the hash table libraries. To my big surprise, the fact is the contrary. The performance varies a lot between implementations and between regex. Most widely used implementations such as boost::regex, Python, Perl and gawk all have striking weakness.
In this benchmark, RE2 seems to be the overall winner. It is feature-rich and among the fastest for all 4 regex. More importantly, it does not suffer from the worst-case problem. RE2 is implemented in C++. For C programmers, onig or PCRE is a good choice. However, to use these two libraries, we should be aware that regex containing the `|' operator may greatly degrade the performance. If this is a concern, one may consider regexp9. Regexp9 is featured as its small code size (<1300 lines of code) which makes it the ideal library when we want to include the source code in a project to avoid an extra library dependency.
Compiled on Debian etch with gcc-4.1.2 (CPU: Xeon E5450 @3GHz). This is a server. Timing may have large variance due to other active processes running on the same machine.
|regex (glibc 2.3.6)||3.12||3.96||0.39||3.91|
|egrep (GNU 2.5.1)||0.09||0.05||0.10||0.16|