Fast String Matching With Wildcards, Globs, and Gitignore-Style Globs - How Not To Blow It Up - CodeProject
Classic globbing and modern gitignore-style globbing algorithms can be fast, whereas recursive implementations are known to
blow up exponentially; why some freely available source code should not be used.
Introduction
Jack Handy’s wildcard string compare efficiently compares a string to a pattern containing * and ? wildcards. The algorithm is fast
and safe, but does not support gitignore-style globbing rules. In this article, I will illustrate that classic globbing and modern
gitignore-style globbing algorithms can be fast too. I will also explain what’s wrong with recursive implementations that blow up
exponentially and why some freely available source code should not be used.
Background
Wildcard string matching and globbing are not as trivial as they may seem at first. In fact, past mistakes resulted in serious
vulnerabilities such as denial of service. Simple patches, such as limiting the number of matches to limit the CPU time, have been
applied to fix implementations that suffer exponential blow-up in execution time. More disconcerting is that buggy globbing
implementations can be easily located on the Web: just search for wildmat.c to find copies of implementations that may crash when
a character class range is incomplete, as in a pattern ending with [a- and no closing bracket.
The simplest wildcard matching algorithm is recursive and needs little explanation: recursive calls are made for stars until a match
is found or the end of the text is reached. A ? matches any character. Otherwise, the current pattern character is matched against
the current character in the text. We advance in the text and in the pattern by one character, and repeat until the end of the text
is reached. Note that any trailing stars in the pattern can be ignored when we get to the end of the text.
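The recursive approach described above can be sketched roughly as follows. This is a minimal illustration rather than any particular published implementation, and the function name match_recursive is an assumption made for this example.

```c
// sketch of naive recursive wildcard matching (hypothetical name match_recursive)
// returns 1 if text matches pattern with * and ?, otherwise 0
int match_recursive(const char *text, const char *pattern)
{
  while (*text != '\0')
  {
    if (*pattern == '*')
    {
      // star: recursively try the rest of the pattern at every remaining
      // text position, including the empty remainder at the end of the text
      pattern++;
      for (;; text++)
      {
        if (match_recursive(text, pattern))
          return 1;
        if (*text == '\0')
          return 0;
      }
    }
    // ? matches any character, otherwise the characters must be equal
    if (*pattern != '?' && *pattern != *text)
      return 0;
    text++;
    pattern++;
  }
  // ignore trailing stars
  while (*pattern == '*')
    pattern++;
  return *pattern == '\0';
}
```

Every star starts its own loop of recursive calls, and each of those calls may start further loops for later stars, which is where the backtracking cost comes from.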
This approach works well for short strings to match and patterns with few stars. However, it is easy to find examples that result in an
explosive execution time blow-up due to excessive backtracking on stars. Cases such as a*a*a*a*a*a*a*a*b and a string
aaa…aaa of 100 a’s take minutes to terminate with a failure to match. Just try the following two shell commands:
It takes bash-3.2 about 8 minutes to terminate on my 2.9GHz i7 machine with decent performance. It takes tcsh-6.18 over
an hour to terminate. The explosion to 2^n states for n stars is to blame for this behavior.
Using ABORT
One of the first globbing implementations in history, wildmat.c (original) written by Rich Salz, had this undesirable behavior, but
was updated in the early 1990s with an ABORT state contributed by Lars Mathiesen. The comments in the updated wildmat.c source
code remain there to this day and are quite insightful:
Quote:
Once the control of one instance of DoMatch enters the star-loop, that instance will return either TRUE or ABORT, and any
calling instance will therefore return immediately after (without calling recursively again). In effect, only one star-loop is ever
active. It would be possible to modify the code to maintain this context explicitly, eliminating all recursive calls at the cost of
some complication and loss of clarity (and the ABORT stuff seems to be unclear enough by itself).
Apparently, the exponential blow-up problem was already well known at that time, as we find the following comment in wildmat.c
demonstrating the issue:
pattern: -*-*-*-*-*-*-12-*-*-*-m-*-*-*
text 1: -adobe-courier-bold-o-normal--12-120-75-75-m-70-iso8859-1
text 2: -adobe-courier-bold-o-normal--12-120-75-75-X-70-iso8859-1
Text 1 matches with 51 calls, while text 2 fails with 54 calls.
Without the ABORT, then it takes 22310 calls to fail. Ugh.
The powerful insight here is that only the last star-loop should be active, because advancing any of the previous star-loops is not
productive when the last star-loop fails. This clever change made globbing execute in linear time for the typical case. Only in the
worst case does the algorithm take quadratic time in the length of the pattern and string.
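To make the ABORT idea concrete, here is a minimal sketch modeled on the structure of DoMatch in wildmat.c, reduced to * and ? only. The names do_match, match_abort, and the MATCH_* constants are assumptions for this illustration and do not appear in wildmat.c.

```c
#define MATCH_FALSE 0
#define MATCH_TRUE 1
#define MATCH_ABORT (-1)

// hypothetical helper: returns MATCH_TRUE, MATCH_FALSE, or MATCH_ABORT when
// retrying at a later text position cannot succeed either
static int do_match(const char *text, const char *pattern)
{
  for (; *pattern != '\0'; text++, pattern++)
  {
    // text is exhausted but the pattern still needs a character: give up
    if (*text == '\0' && *pattern != '*')
      return MATCH_ABORT;
    switch (*pattern)
    {
      case '*':
        // collapse consecutive stars
        while (*++pattern == '*')
          continue;
        // a trailing star matches the rest of the text
        if (*pattern == '\0')
          return MATCH_TRUE;
        // star-loop: try each remaining text position
        while (*text != '\0')
        {
          int matched = do_match(text++, pattern);
          // TRUE and ABORT both end the search, only FALSE retries
          if (matched != MATCH_FALSE)
            return matched;
        }
        return MATCH_ABORT;
      case '?':
        continue;
      default:
        if (*text != *pattern)
          return MATCH_FALSE;
        continue;
    }
  }
  return *text == '\0' ? MATCH_TRUE : MATCH_FALSE;
}

// returns 1 if text matches pattern with * and ?, otherwise 0
int match_abort(const char *text, const char *pattern)
{
  return do_match(text, pattern) == MATCH_TRUE;
}
```

Because a call that enters a star-loop can only return MATCH_TRUE or MATCH_ABORT, the calling star-loop never iterates again after it, so only the innermost star-loop does any real work.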
This improvement avoids exponential blow-up, but still makes a recursive call for each star in the pattern. This is not necessary. We
can save that state of the matcher so that we can restore it from our backup to execute another iteration of the last star-loop until
we are done.
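A minimal sketch of this non-recursive idea, assuming only * and ? wildcards, could look like the following. The function name match_iterative and the variable names are illustrative choices for this example.

```c
#include <stddef.h>

// sketch of non-recursive wildcard matching with a single backup state
// returns 1 if text matches pattern with * and ?, otherwise 0
int match_iterative(const char *text, const char *pattern)
{
  const char *text_backup = NULL;
  const char *pattern_backup = NULL;
  while (*text != '\0')
  {
    if (*pattern == '*')
    {
      // new star-loop: save the positions in the text and after the star
      text_backup = text;
      pattern_backup = ++pattern;
    }
    else if (*pattern == '?' || *pattern == *text)
    {
      // match one character
      text++;
      pattern++;
    }
    else
    {
      // mismatch: fail if no star was seen, otherwise restore the saved
      // state and let the last star consume one more character
      if (pattern_backup == NULL)
        return 0;
      text = ++text_backup;
      pattern = pattern_backup;
    }
  }
  // ignore trailing stars
  while (*pattern == '*')
    pattern++;
  return *pattern == '\0';
}
```

A later star simply overwrites the backup of an earlier one, which is the explicit form of "only the last star-loop is active".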
The execution time of this algorithm is linear in the length of the pattern and string for typical cases. In the worst case, this
algorithm takes quadratic time. To see why, consider pattern *aaa…aaab with ½n a’s and string aaa…aaa with n a’s, which
requires about ¼n² comparisons before failing.
In case you're asking if there is a linear worst-case algorithm, the answer is yes: by constructing a deterministic finite automaton
(DFA) from the pattern for matching. This is beyond the scope of this article and requires taking into account the (high) cost of DFA
construction.
To adjust our previous match algorithm to perform glob matching, we add some checks for the / pathname separator such that
* and ? cannot match it:
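A minimal sketch of such a matcher, handling only * and ?, might look as follows. The name glob_match_basic is an assumption made for this example.

```c
#include <stddef.h>

#define PATHSEP '/' // pathname separator

// sketch: returns 1 if text matches glob with * and ?, where neither
// wildcard is allowed to match the pathname separator
int glob_match_basic(const char *text, const char *glob)
{
  const char *text_backup = NULL;
  const char *glob_backup = NULL;
  while (*text != '\0')
  {
    if (*glob == '*')
    {
      // new star-loop: backup positions in pattern and text
      text_backup = text;
      glob_backup = ++glob;
      continue;
    }
    if ((*glob == '?' && *text != PATHSEP) || *glob == *text)
    {
      // ? matches any character except /, otherwise match literally
      text++;
      glob++;
      continue;
    }
    // mismatch: fail if there is no star or if the star would have to
    // consume a pathname separator
    if (glob_backup == NULL || *text_backup == PATHSEP)
      return 0;
    // star-loop: backtrack to the last * but do not jump over /
    text = ++text_backup;
    glob = glob_backup;
  }
  // ignore trailing stars
  while (*glob == '*')
    glob++;
  return *glob == '\0';
}
```

With this sketch, *.h matches foo.h but not a/b.h, because the star-loop stops as soon as its backup position reaches a separator.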
For practical glob matching applications however, this glob-like ("globly") algorithm is not yet complete. It lacks support for
character classes, such as [a-zA-Z] to match letters and [^0-9] to match any character except a digit. It also lacks a means to
escape the special meaning of the *, ?, and [ meta characters with a backslash. Adding these new features requires a redesign of
the glob meta character tests in the main loop: a switch selects the glob meta character, and continue resumes the main loop
after a successful match. Breaking from the switch moves control to the star-loop:
#include <stddef.h> // NULL

#ifndef TRUE
#define TRUE 1
#define FALSE 0
#endif

#define PATHSEP '/' // pathname separator, define as '\\' for Windows instead

// returns TRUE if text string matches glob pattern with *, ?, [...], and \-escapes
int glob_match(const char *text, const char *glob)
{
  const char *text_backup = NULL;
  const char *glob_backup = NULL;
  while (*text != '\0')
  {
    switch (*glob)
    {
      case '*':
        // new star-loop: backup positions in pattern and text
        text_backup = text;
        glob_backup = ++glob;
        continue;
      case '?':
        // match any character except /
        if (*text == PATHSEP)
          break;
        text++;
        glob++;
        continue;
      case '[':
      {
        int lastchr;
        int matched = FALSE;
        int reverse = glob[1] == '^' || glob[1] == '!' ? TRUE : FALSE;
        // match any character in [...] except /
        if (*text == PATHSEP)
          break;
        // inverted character class
        if (reverse)
          glob++;
        // match character class; a - directly before ] or the end of the glob
        // is taken literally, so the range never reads past the closing ]
        for (lastchr = 256; *++glob != '\0' && *glob != ']'; lastchr = *glob)
          if (lastchr < 256 && *glob == '-' && glob[1] != ']' && glob[1] != '\0' ?
              *text <= *++glob && *text >= lastchr :
              *text == *glob)
            matched = TRUE;
        if (matched == reverse)
          break;
        text++;
        if (*glob != '\0')
          glob++;
        continue;
      }
      case '\\':
        // literal match \-escaped character
        glob++;
        // FALLTHROUGH
      default:
        // match the current non-NUL character
        if (*glob != *text && !(*glob == '/' && *text == PATHSEP))
          break;
        text++;
        glob++;
        continue;
    }
    if (glob_backup == NULL || *text_backup == PATHSEP)
      return FALSE;
    // star-loop: backtrack to the last * but do not jump over /
    text = ++text_backup;
    glob = glob_backup;
  }
  // ignore trailing stars
  while (*glob == '*')
    glob++;
  // at end of text means success if nothing else is left to match
  return *glob == '\0' ? TRUE : FALSE;
}
The PATHSEP character is either the conventional / or the Windows \ used in the string text to separate pathnames. Note that
traditional Unix uses ! for character class inversion, for example [!0-9]. Here, we offer a choice of ! and the more conventional
^ to invert a character class, for example [^0-9].
The execution time of this algorithm is linear in the length of the pattern and string for typical cases and takes quadratic time in the
worst case.
Gitignore-style globbing applies the following rules to determine file and directory pathname matches:
We also make the following assumption: when a glob contains a path separator /, the full pathname is matched. Otherwise, the
basename of a file or directory is matched. For example, *.h matches foo.h and bar/foo.h, bar/*.h matches bar/foo.h but not foo.h
and not bar/bar/foo.h. A leading / may be used to force /*.h to match foo.h but not bar/foo.h.
Examples:
a[a-z]b matches aab, abb, acb, azb but not a, b, a3b, aAb, aZb
a[^xy]b matches aab, abb, acb, azb but not a, b, axb, ayb
a[^a-z]b matches a3b, aAb, aZb but not a, b, aab, abb, acb, azb
To implement gitignore-style globbing, we need two star-loops: one for single star, the “*-loop”, and another for double star, the
“**-loop”. The **-loop overrules the *-loop because there is no point in backtracking on a single star when we encounter a double
star after it in the glob pattern. The converse is not true: we should backtrack on a double star when a single star that comes after it
in the glob does not match:
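The two-loop structure can be sketched as follows. This is a reduced illustration under stated assumptions: it handles literals, ?, *, **/ and a trailing ** only (no character classes or escapes), and on a **-loop retry it resumes at the start of the next pathname component. The function name gitignore_match_sketch is hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define PATHSEP '/' // pathname separator

// sketch: returns 1 if the pathname text matches the gitignore-style glob
int gitignore_match_sketch(const char *text, const char *glob)
{
  const char *text1_backup = NULL; // *-loop backup positions
  const char *glob1_backup = NULL;
  const char *text2_backup = NULL; // **-loop backup positions
  const char *glob2_backup = NULL;
  if (*glob == '/')
  {
    // a leading / anchors the glob at the start of the pathname
    glob++;
  }
  else if (strchr(glob, '/') == NULL)
  {
    // no / in the glob: match the basename only
    const char *sep = strrchr(text, PATHSEP);
    if (sep != NULL)
      text = sep + 1;
  }
  while (*text != '\0')
  {
    switch (*glob)
    {
      case '*':
        if (glob[1] == '*')
        {
          glob += 2;
          // a trailing ** matches everything that is left
          if (*glob == '\0')
            return 1;
          // this sketch only accepts ** when it is followed by /
          if (*glob != '/')
            return 0;
          // new **-loop overrules any active *-loop
          text1_backup = NULL;
          glob1_backup = NULL;
          text2_backup = text;
          glob2_backup = ++glob;
          continue;
        }
        // new *-loop
        text1_backup = text;
        glob1_backup = ++glob;
        continue;
      case '?':
        // match any character except /
        if (*text == PATHSEP)
          break;
        text++;
        glob++;
        continue;
      default:
        // literal match
        if (*glob != *text)
          break;
        text++;
        glob++;
        continue;
    }
    if (glob1_backup != NULL && *text1_backup != PATHSEP)
    {
      // *-loop: retry one character further, never across /
      text = ++text1_backup;
      glob = glob1_backup;
      continue;
    }
    if (glob2_backup != NULL)
    {
      // **-loop: retry at the start of the next pathname component
      const char *sep = strchr(text2_backup, PATHSEP);
      if (sep == NULL)
        return 0;
      text = text2_backup = sep + 1;
      glob = glob2_backup;
      text1_backup = NULL;
      glob1_backup = NULL;
      continue;
    }
    return 0;
  }
  // ignore trailing stars
  while (*glob == '*')
    glob++;
  return *glob == '\0';
}
```

For example, under these assumptions a/**/b matches both a/b and a/x/y/b, while bar/*.h matches bar/foo.h but not bar/bar/foo.h.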
The execution time of this algorithm is linear in the length of the pattern and string for typical cases and cubic in the worst case,
when both a **-loop and a *-loop are active.
Refinements
Files with names starting with a dot (.) are treated differently by shell globs, meaning that a dot beginning a name must be
matched explicitly and cannot be matched by a wildcard. To replicate this behavior, we add the following:
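One way to sketch this refinement is with a flag that is set at the start of each name and cleared once a character of that name has been matched. The version below is an illustration limited to * and ?, and the names glob_match_nodot and nodot are assumptions made for this example.

```c
#include <stddef.h>

#define PATHSEP '/' // pathname separator

// sketch: like a basic glob match with * and ?, except that a leading dot
// of a name must be matched by a literal dot in the glob
int glob_match_nodot(const char *text, const char *glob)
{
  const char *text_backup = NULL;
  const char *glob_backup = NULL;
  int nodot = 1; // nonzero at the start of a name, where wildcards reject '.'
  while (*text != '\0')
  {
    switch (*glob)
    {
      case '*':
        // a star at the start of a name does not match a leading dot
        if (nodot && *text == '.')
          break;
        text_backup = text;
        glob_backup = ++glob;
        nodot = 0;
        continue;
      case '?':
        // match any character except / and except a leading dot
        if ((nodot && *text == '.') || *text == PATHSEP)
          break;
        text++;
        glob++;
        nodot = 0;
        continue;
      default:
        if (*glob != *text)
          break;
        // a matched / starts a new name
        nodot = *text == PATHSEP;
        text++;
        glob++;
        continue;
    }
    if (glob_backup == NULL || *text_backup == PATHSEP)
      return 0;
    // star-loop: the retry position is never the start of a name
    text = ++text_backup;
    glob = glob_backup;
    nodot = 0;
  }
  // ignore trailing stars
  while (*glob == '*')
    glob++;
  return *glob == '\0';
}
```

With this sketch, * does not match .bashrc while .* does, and a/* does not match a/.x.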
Unicode matching can be supported in two ways: by using wide strings, i.e., wchar_t or std::wstring or by using UTF-8
encoded strings. The wide string option requires no changes to the algorithms. The UTF-8 version requires a few changes for ? and
character classes, with everything else staying the same:
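      case '?':
        // match anything except /
        if (*text == PATHSEP)
          break;
        utf8(&text);
        glob++;
        continue;
      case '[':
      {
        int chr;
        …
        // decode the text character once, before scanning the class
        chr = utf8(&text);
        // match character class, using a sentinel above the largest code point
        lastchr = 0x110000;
        glob++;
        while (*glob != '\0' && *glob != ']')
        {
          if (lastchr <= 0x10ffff && *glob == '-' && glob[1] != ']' && glob[1] != '\0')
          {
            // range: skip the - and decode the upper bound
            glob++;
            if (chr >= lastchr && chr <= (lastchr = utf8(&glob)))
              matched = TRUE;
          }
          else if (chr == (lastchr = utf8(&glob)))
          {
            matched = TRUE;
          }
        }
        if (matched == reverse)
          break;
        if (*glob)
          glob++;
        continue;
      }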
where utf8() returns the decoded wide character (code point) and advances the given string pointer by one UTF-8 character:
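A minimal sketch of such a utf8() helper is shown below. It is an assumption of how the function could be written, not necessarily the article's version: it decodes one to four byte sequences, returns U+FFFD for malformed input, and does not reject overlong encodings or surrogates.

```c
// sketch: decode one UTF-8 character from *s, advance *s past it, and
// return its code point; returns 0 at the end of the string without
// advancing, and 0xFFFD for an invalid lead or continuation byte
int utf8(const char **s)
{
  const unsigned char *p = (const unsigned char *)*s;
  int c = *p;
  int extra;
  int i;
  if (c == '\0')
    return 0;
  p++;
  if (c < 0x80)
  {
    extra = 0; // ASCII
  }
  else if ((c & 0xE0) == 0xC0)
  {
    c &= 0x1F; // two-byte sequence
    extra = 1;
  }
  else if ((c & 0xF0) == 0xE0)
  {
    c &= 0x0F; // three-byte sequence
    extra = 2;
  }
  else if ((c & 0xF8) == 0xF0)
  {
    c &= 0x07; // four-byte sequence
    extra = 3;
  }
  else
  {
    // stray continuation byte or invalid lead byte
    *s = (const char *)p;
    return 0xFFFD;
  }
  for (i = 0; i < extra; i++)
  {
    // each continuation byte must have the form 10xxxxxx
    if ((*p & 0xC0) != 0x80)
    {
      *s = (const char *)p;
      return 0xFFFD;
    }
    c = (c << 6) | (*p++ & 0x3F);
  }
  *s = (const char *)p;
  return c;
}
```

Because the helper never advances past the terminating NUL, it is safe to call on a truncated sequence at the end of a string.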
Conclusions
Wildcard matching, classic globbing, and modern gitignore-style globbing can be fast when implemented with star-loop iterations
instead of recursion. These algorithms typically run in linear time in the length of the string and pattern and in the worst case, may
run in quadratic or cubic time in the length of the pattern and string, without blowing up CPU usage exponentially.
Extended globbing features such as shell brace expansion also use recursion with backtracking, which may make the exponential
blow-up problem worse. Consider for example a*{b,a*{b,a*{b,a*{b,a*{b,a*{b,a*{b,a*b}}}}}}}. There is no
easy way to limit the execution time for brace expansion except perhaps by limiting the number of braces or by converting an
extended glob to a regex for matching.
I wrote this article after implementing gitignore-style globbing for ugrep, due to the lack of available and usable open source code.
The versions of wildmat.c that I found in official repositories all had a nasty bug that is actually documented in the source code:
"Might not be robust in face of malformed patterns; e.g., "foo[a-" could cause a segmentation violation." Ugh.
History
3rd August, 2019: First draft
5th August, 2019: First publication
8th August, 2019: Updated
12th September, 2019: Added Java download sources
19th September, 2019: Added shell execution example; [] does not match /; dotglob refinement; new and updated
download sources
License
This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)