UrlPatternIndex overview
The UrlPatternIndex component can be used to build an index over a set of URL rules, and speed up matching network requests against these rules.
A URL rule (see the flat::UrlRule structure) describes a subset of network requests that it targets. The essential element of the rule is its URL pattern, which is a simplified regular expression (a string with wildcards). UrlPatternIndex is mainly based on text fragments extracted from the patterns.
The component uses the FlatBuffers serialization library to represent the rules and the index. The key advantage of the format is that it does not require deserialization. Once built, the data structure can be stored on disk or transferred, then copied/loaded/memory-mapped and used directly.
UrlPatterns
The component is built around the underlying concept of a URL pattern, defined in the class UrlPattern. These patterns are largely inspired by patterns in EasyList / Adblock Plus filters and are documented in more detail in the declarativeNetRequest documentation.
The underlying goal of the index format is to efficiently check whether a URL matches any URL pattern contained in the index. The data structure used here is an N-gram filter. An N-gram is a string consisting of N (up to 8) bytes. Currently, the component uses kNGramSize = 5.
The strategy used in this component is to build a data structure which maps NGram -> vector<UrlRule>, by finding all N-grams associated with a given URL pattern and picking one of them (the most distinctive one, see UrlPatternIndexBuilder::GetMostDistinctiveNGram). The URL pattern is then inserted into the map entry associated with that N-gram.
Note: URL patterns have special characters like * and ^ which implement special wildcard matching. N-grams are built only between these special characters.
For example, the URL pattern foo.com/*abc* will generate the following 5-grams:
foo.c oo.co o.com .com/
The fragment abc is shorter than 5 bytes, so it contributes no N-grams.
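The extraction described above can be sketched as follows. This is a minimal illustration, not the component's actual code: the helper names IsSpecialChar and PatternNGrams are hypothetical, N-grams are kept as strings for readability, and only * and ^ are treated as special characters (the real component may treat more characters specially).

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Assumed to mirror the kNGramSize = 5 mentioned in the text.
constexpr std::size_t kNGramSize = 5;

// Hypothetical helper: the characters that an N-gram must not span.
bool IsSpecialChar(char c) {
  return c == '*' || c == '^';
}

// Returns every N-gram that lies entirely within a run of literal
// (non-special) characters of `pattern`.
std::vector<std::string> PatternNGrams(std::string_view pattern) {
  std::vector<std::string> ngrams;
  std::size_t run_start = 0;
  for (std::size_t i = 0; i <= pattern.size(); ++i) {
    if (i < pattern.size() && !IsSpecialChar(pattern[i]))
      continue;
    // [run_start, i) is a literal run; slide a window of N bytes over it.
    for (std::size_t j = run_start; j + kNGramSize <= i; ++j)
      ngrams.emplace_back(pattern.substr(j, kNGramSize));
    run_start = i + 1;
  }
  return ngrams;
}
```

Calling PatternNGrams("foo.com/*abc*") yields the four 5-grams listed above. The run abc is only three bytes long, so the inner loop never executes for it.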
See url_pattern_index.fbs for the raw underlying FlatBuffers format, which builds the N-gram filter using a custom hash table implementation.
Querying a built index is very similar to building the index in the first place. Given a URL, it is broken into all of its component N-grams, just like the URL pattern was above. For example, the URL https://foo.com/?q=abcdef would generate the following 5-grams:
https ttps: tps:/ ps:// s://f ://fo //foo /foo. foo.c oo.co o.com .com/ com/? om/?q m/?q= /?q=a ?q=ab q=abc =abcd abcde bcdef
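On the query side the extraction is a plain sliding window over the URL. The sketch below assumes N-grams are represented as strings; UrlNGrams is a hypothetical name, not a function of the component.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Assumed to mirror the kNGramSize = 5 mentioned in the text.
constexpr std::size_t kNGramSize = 5;

// Returns every run of kNGramSize consecutive bytes of `url`, in order.
// A URL shorter than kNGramSize produces no N-grams at all.
std::vector<std::string> UrlNGrams(std::string_view url) {
  std::vector<std::string> ngrams;
  for (std::size_t i = 0; i + kNGramSize <= url.size(); ++i)
    ngrams.emplace_back(url.substr(i, kNGramSize));
  return ngrams;
}
```

A URL of length L produces L - 4 such 5-grams, which gives the 21 entries listed above for the 25-byte example URL.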
With these N-grams extracted, we only need to consider the UrlPatterns which are associated with those N-grams. See FindMatchInFlatUrlPatternIndex and FindMatchAmongCandidates for this logic.
Many of these N-grams match ones that are also present in the foo.com/*abc* example above, so we can be sure that that URL pattern will be considered during pattern evaluation.
You might be thinking “what about URLs whose length is less than N, or patterns that generate no N-grams?” All rules whose patterns generate no N-grams are put into a special list called fallback_rules, which is applied to every URL unconditionally.
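Putting indexing, querying, and the fallback list together, a toy in-memory version might look like the sketch below. Everything here is hypothetical (ToyIndex, NGrams, AddPattern, Candidates): the real index is a FlatBuffers structure holding UrlRule entries rather than pattern strings, and the sketch assumes that "most distinctive" means the N-gram with the fewest patterns already stored under it.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

constexpr std::size_t kNGramSize = 5;

// Toy stand-in for the FlatBuffers-backed index.
struct ToyIndex {
  std::unordered_map<std::string, std::vector<std::string>> ngram_to_patterns;
  std::vector<std::string> fallback_rules;
};

// N-grams of `text` that do not span '*' or '^'. No pattern N-gram ever
// contains those characters, so skipping them for URLs loses no candidates.
std::vector<std::string> NGrams(std::string_view text) {
  std::vector<std::string> out;
  for (std::size_t i = 0; i + kNGramSize <= text.size(); ++i) {
    std::string_view gram = text.substr(i, kNGramSize);
    if (gram.find_first_of("*^") == std::string_view::npos)
      out.emplace_back(gram);
  }
  return out;
}

// Stores `pattern` under exactly one of its N-grams, or in the fallback list.
void AddPattern(ToyIndex& index, const std::string& pattern) {
  const std::vector<std::string> grams = NGrams(pattern);
  const std::string* best = nullptr;
  std::size_t best_size = 0;
  for (const std::string& gram : grams) {
    auto it = index.ngram_to_patterns.find(gram);
    const std::size_t size =
        it == index.ngram_to_patterns.end() ? 0 : it->second.size();
    // Assumption: the least-populated N-gram is the most distinctive one.
    if (best == nullptr || size < best_size) {
      best = &gram;
      best_size = size;
    }
  }
  if (best == nullptr)
    index.fallback_rules.push_back(pattern);  // pattern has no N-grams
  else
    index.ngram_to_patterns[*best].push_back(pattern);
}

// Patterns worth evaluating against `url`; fallback rules always qualify.
// A pattern may appear more than once if the URL repeats an N-gram.
std::vector<std::string> Candidates(const ToyIndex& index,
                                    std::string_view url) {
  std::vector<std::string> result;
  for (const std::string& gram : NGrams(url)) {
    auto it = index.ngram_to_patterns.find(gram);
    if (it != index.ngram_to_patterns.end())
      result.insert(result.end(), it->second.begin(), it->second.end());
  }
  result.insert(result.end(), index.fallback_rules.begin(),
                index.fallback_rules.end());
  return result;
}
```

The candidates are only a superset of the matching patterns. Each one still has to be checked against the full URL, because sharing a single N-gram does not imply a match.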
UrlPattern
This logic is encapsulated in UrlPattern::MatchesUrl. It essentially consists of splitting a URL pattern by the * wildcard and considering each subpattern in between the *s.
There is some complexity here to deal with:
- ^ separator matching, which matches any ASCII symbol except letters, digits, and the following: '_', '-', '.', '%'. See fuzzy_pattern_matching.
- | left/right anchors, which specify the beginning or end of a URL.
- || domain anchors, which specify the start of a (sub-)domain of a URL.
After all this complexity is dealt with, the bulk of the subpattern logic is simply StringPiece::find / std::search! This component used to use something much more complicated (the Knuth-Morris-Pratt algorithm), but benchmarking on real URLs showed that the simple solution was faster (and removed the need for a preprocessing step at indexing time), so the more complicated approach was removed.
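The ^ separator rule from the list above can be written as a small predicate. This is a sketch based only on the wording in this document; IsSeparator here is a hypothetical standalone function and may differ from whatever fuzzy_pattern_matching actually implements.

```cpp
// Returns true if `c` is a character that the '^' placeholder may match:
// any ASCII symbol except letters, digits, and '_', '-', '.', '%'.
bool IsSeparator(char c) {
  const unsigned char uc = static_cast<unsigned char>(c);
  if (uc > 0x7F)
    return false;  // Non-ASCII bytes are not separators.
  const bool is_letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
  const bool is_digit = c >= '0' && c <= '9';
  return !is_letter && !is_digit && c != '_' && c != '-' && c != '.' &&
         c != '%';
}
```

Under this rule, typical URL delimiters such as /, ?, :, and = count as separators, while the characters that commonly appear inside host names and percent-escapes do not.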
For example, in checking if https://foo.com/?q=abcdef matches foo.com/*abc*, the component will:
1. Split the pattern into the subpatterns foo.com/ and abc.
2. Find foo.com/ in https://foo.com/?q=abcdef, which is a match!
3. Find abc in the remainder ?q=abcdef, which is a match! This is the last subpattern, so return true.
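The walk-through above can be sketched with std::string_view::find. This is a simplified illustration that handles only the * wildcard; it ignores ^ separators and the | and || anchors, and the function names are hypothetical rather than the real UrlPattern::MatchesUrl.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Splits `pattern` on '*' and drops empty pieces.
std::vector<std::string_view> SplitByWildcard(std::string_view pattern) {
  std::vector<std::string_view> pieces;
  std::size_t start = 0;
  while (start <= pattern.size()) {
    std::size_t star = pattern.find('*', start);
    if (star == std::string_view::npos)
      star = pattern.size();
    if (star > start)
      pieces.push_back(pattern.substr(start, star - start));
    start = star + 1;
  }
  return pieces;
}

// Returns true if every subpattern occurs in `url`, in order and without
// overlapping. Assumes an unanchored pattern containing only '*' wildcards.
bool MatchesUrl(std::string_view pattern, std::string_view url) {
  std::size_t pos = 0;
  for (std::string_view piece : SplitByWildcard(pattern)) {
    const std::size_t found = url.find(piece, pos);
    if (found == std::string_view::npos)
      return false;
    pos = found + piece.size();  // Continue searching after this piece.
  }
  return true;
}
```

Taking the leftmost occurrence of each subpattern is sufficient here: matching a piece as early as possible leaves the largest remainder for the pieces that follow, so no backtracking is needed.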