Description
Grok.match(bytes, offset, length, extracter)
API seems to have problems when using a positive offset together with a pattern that matches the input as a suffix (ie. not the beginning, but discarding a prefix).
Eg. with a pattern like %{WORD:a} %{WORD:b} %{NUMBER:c:int}
and an input string like x1 a1 b1 12
, the pattern is supposed to discard x1
and match the rest of the input.
Everything works fine as long as the string is stored in a byte[]
alone, and the offset used is 0
.
If the input of match()
is a bigger array, where the string is in the middle and an offset > 0 is used, then the match fails.
Here is a code snippet that reproduces the problem.
Grok grok = new Grok(GrokBuiltinPatterns.get(true), "%{WORD:a} %{WORD:b} %{NUMBER:c:int}", logger::warn);
byte[] utf8 = "x1 a1 b1 12 x2 a2 b2 13 ".getBytes(StandardCharsets.UTF_8);
// this succeeds
assertThat(captureBytes(grok, utf8, 0, 12), equalTo(Map.of("a", "a1", "b", "b1", "c", 12)));
// this fails, see offset 12
assertThat(captureBytes(grok, utf8, 12, 12), equalTo(Map.of("a", "a2", "b", "b2", "c", 13)));
private Map<String, Object> captureBytes(Grok grok, byte[] utf8, int offset, int length) {
GrokCaptureExtracter.MapExtracter extracter = new GrokCaptureExtracter.MapExtracter(grok.captureConfig());
if (grok.match(utf8, offset, length, extracter)) {
return extracter.result();
}
return null;
}
[edit] added the same test in #95003
Checking the library code, the problem could depend on the usage of Matcher.search(), where we are passing matcher.search(offset, length, Option.DEFAULT)
Since for Regex.matcher() we are passing offset
and offset + length
(instead of offset
, length
), it's possible that also matcher.search()
should accept offset + length
.
Joni API documentation is very basic and it's absolutely not clear about this specific API and the problem never happened at higher level because we are always using that API with offset = 0.
The fix proposed above solves the problem and passes all the tests, but it would be good to have a confirmation from someone with more expertise on this lib.
PR following with the proposed fix