Terms Splunk
TSTATS and PREFIX
How to get the most out of your lexicon,
with walklex, tstats, indexed fields,
PREFIX, TERM and CASE
Richard Morgan
Principal Architect | Splunk
© 2020 SPLUNK INC.
Forward-Looking Statements
During the course of this presentation, we may make forward-looking statements regarding
future events or plans of the company. We caution you that such statements reflect our
current expectations and estimates based on factors currently known to us and that actual
events or results may differ materially. The forward-looking statements made in this
presentation are being made as of the time and date of its live presentation. If reviewed after
its live presentation, it may not contain current or accurate information. We do not assume
any obligation to update any forward-looking statements made herein.
In addition, any information about our roadmap outlines our general product direction and is
subject to change at any time without notice. It is for informational purposes only, and shall
not be incorporated into any contract or other commitment. Splunk undertakes no obligation
either to develop the features or functionalities described or to include any such feature or
functionality in a future release.
Splunk, Splunk>, Data-to-Everything, D2E and Turn Data Into Doing are trademarks and registered trademarks of Splunk Inc. in the United States
and other countries. All other brand names, product names or trademarks belong to their respective owners. © 2020 Splunk Inc. All rights reserved
Reduction in HW costs
accelerated
considered_buckets
| <everything else>
Scan Count Vs. Event Count
During execution you see the ratio between scan count and event count
😱😱 Horror Show
😍😍 🥰🥰 scan_count = event_count
TERM is used in less than 1% of all customer searches executed on Splunk Cloud
MAJOR TERM
Notice how all fields other than series= are tokenized into useful TERMS
SIDE NOTE: Over-precision in numbers generates many unique TERMS and bloats the tsidx file
Input event
01-27-2020 20:29:22.922 +0000 INFO Metrics - group=per_sourcetype_thruput,
ingest_pipe=0, series="splunkd", kbps=258.6201534528208, eps=1367.6474892210738,
kb=8050.1142578125, ev=42571, avg_age=145747.7853938127, max_age=2116525
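To see why series= is the odd one out, the sketch below splits the input event on a small, assumed subset of major breakers (whitespace, comma and double quote, all of which appear in the MAJOR lists of segmenters.conf). It is a simplification for illustration, not Splunk's segmenter: the other major breakers and all minor breakers are ignored, and `major_tokens` is a hypothetical helper name.

```python
import re

# Assumption: only whitespace, comma and double quote are treated as major breakers.
MAJOR_BREAKERS = re.compile(r'[\s,"]+')


def major_tokens(raw_event):
    """Split an event on the assumed major breakers, dropping empty strings."""
    return [tok for tok in MAJOR_BREAKERS.split(raw_event) if tok]


event = (
    '01-27-2020 20:29:22.922 +0000 INFO Metrics - group=per_sourcetype_thruput, '
    'ingest_pipe=0, series="splunkd", kbps=258.6201534528208, eps=1367.6474892210738, '
    'kb=8050.1142578125, ev=42571, avg_age=145747.7853938127, max_age=2116525'
)

tokens = major_tokens(event)
for tok in tokens:
    print(tok)
```

In this sketch `ingest_pipe=0` and `kbps=258.6201534528208` each survive as a single token, while the quotes split series="splunkd" into `series=` and `splunkd`. It also hints at the side note on precision: every distinct high-precision value becomes its own token.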
TSIDX
TERM      Events containing TERM
tom       1
tom.      1
rich      1, 4
harry     1, 4
susan     3, 5, 6
bob       2
fred      2, 3
karen     5, 6
loves     2, 3, 4, 5, 6
loves.    6

journal (slices): 7, 6, 5, 4, 3, 2, 1

The posting lists tell us that we have two slices that contain all the terms we need.
We extract these slices from the bucket, decompress and run through schema on the fly
to see if they match.
TSIDX
TERM      Events with TERM
tom       1
tom.      1
rich      1, 4
harry     1, 4
susan     3, 5, 6
bob       2
fred      2, 3
karen     5, 6
loves     2, 3, 4, 5, 6
loves.    6

journal (slices): 7, 6, 5, 4, 3, 2, 1

By excluding "loves." (with the period) we have stopped the need to open and parse slice 6.
This means only a single event is parsed by schema on the fly.
The false positive ratio is now 0%, doubling performance 🥳🥳
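The lookup described on these two slides can be sketched as set operations over the posting lists. This is a minimal illustration, assuming the search requires the terms karen, loves and susan (the slides do not spell out the search string); `candidate_slices` is a hypothetical helper, not a Splunk API.

```python
# Posting lists copied from the table above: term -> IDs of events containing it.
POSTINGS = {
    "tom": {1}, "tom.": {1}, "rich": {1, 4}, "harry": {1, 4},
    "susan": {3, 5, 6}, "bob": {2}, "fred": {2, 3}, "karen": {5, 6},
    "loves": {2, 3, 4, 5, 6}, "loves.": {6},
}


def candidate_slices(required, excluded=()):
    """Intersect the posting lists of the required terms, then drop
    any event that also appears in the posting list of an excluded term."""
    result = set.intersection(*(POSTINGS[term] for term in required))
    for term in excluded:
        result -= POSTINGS[term]
    return sorted(result)


without_exclusion = candidate_slices(["karen", "loves", "susan"])
with_exclusion = candidate_slices(["karen", "loves", "susan"], excluded=["loves."])

print(without_exclusion)  # [5, 6]
print(with_exclusion)     # [5]
```

Two candidates have to be decompressed in the first case and only one in the second, which is the halving of work the slide refers to.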
Fast versatile
Raw search: 134 secs
Adoption
The prerequisite of indexed fields means its application is limited
We can use TERM on any of the tokens highlighted in yellow, but notice the one in RED
Some simple searches can be expressed with TERM

🚀🚀 This version is 48x faster
index=itsi_summary TERM(alert_severity=*)
| timechart span=1sec count by alert_severity

🚀🚀 prefix version is 3x faster again!
| tstats count where index=itsi_summary TERM(alert_severity=*)
    by PREFIX(alert_severity=) _time span=1sec
| rename alert_severity= as alert_severity
| xyseries _time alert_severity count
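As a rough mental model of why the prefix version can avoid raw events, the sketch below counts events per value using only lexicon terms that start with a given prefix. The lexicon contents and the helper name `count_by_prefix` are hypothetical, the `_time span=1sec` grouping is ignored, and nothing here claims to be how tstats is actually implemented.

```python
# Hypothetical lexicon: indexed term -> IDs of events containing that term.
lexicon = {
    "alert_severity=high": [1, 4, 7],
    "alert_severity=low": [2, 3],
    "host=web01": [1, 2, 3, 4, 7],
}


def count_by_prefix(lexicon, prefix):
    """Count events per value for terms that start with the prefix,
    reading only the lexicon and never the raw events."""
    counts = {}
    for term, postings in lexicon.items():
        if term.startswith(prefix) and len(term) > len(prefix):
            value = term[len(prefix):]
            counts[value] = len(postings)
    return counts


severity_counts = count_by_prefix(lexicon, "alert_severity=")
print(severity_counts)  # {'high': 3, 'low': 2}
```

The grouping key is the remainder of each term after the prefix, which is why the SPL above renames `alert_severity=` to `alert_severity` afterwards.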
prefix version is
30x faster!
09-21-2020 12:10:41.051 +0000 INFO Metrics - group=cachemgr_bucket, open=4557, close=4561, cache_hit=4557, open_buckets=4
09-21-2020 12:10:44.330 +0000 INFO Metrics - group=cachemgr_bucket, open=3550, close=3550, cache_hit=3550, open_buckets=4
09-21-2020 12:10:39.985 +0000 INFO Metrics - group=cachemgr_bucket, open=3412, close=3415, cache_hit=3412, open_buckets=4
09-21-2020 12:10:44.102 +0000 INFO Metrics - group=cachemgr_bucket, register_start=1, open=4096, close=4100, cache_hit=4096, open_buckets=6
09-21-2020 12:10:45.709 +0000 INFO Metrics - group=cachemgr_bucket, register_start=1, register_end=1, open=3162, close=3164, cache_hit=3162, open_buckets=5
09-21-2020 12:10:41.229 +0000 INFO Metrics - group=cachemgr_bucket, register_cancel=1, open=4794, close=4796, cache_hit=4794, open_buckets=7
09-21-2020 12:10:10.012 +0000 INFO Metrics - group=cachemgr_bucket, open=4783, close=4779, cache_hit=4783, open_buckets=8
09-21-2020 12:10:23.227 +0000 INFO Metrics - group=cachemgr_bucket, register_start=1, open=2896, close=2896, cache_hit=2896, open_buckets=4
[full]
[indexing]
# change INTERMEDIATE_MAJORS to "true" if you want an ip address to appear in typeahead as a, a.b, a.b.c, a.b.c.d
# the typical performance hit by setting to "true" is 30%
INTERMEDIATE_MAJORS = false
[search]
MAJOR = [ ] < > ( ) { } | ! ; , ' " \n \r \s \t & ? + %21 %26 %2526 %3B %7C %20 %2B %3D -- %2520 %5D %5B %3A %0A %2C %28 %29 / : = @ . - $ # % \\ _
MINOR =
[standard]
MAJOR = [ ] < > ( ) { } | ! ; , ' " * \n \r \s \t / : = @ . ? - & $ # + % _ \\ %21 %26 %2526 %3B %7C %20 %2B %3D -- %2520
MINOR =
[inner]
MAJOR = [ ] < > ( ) { } | ! ; , ' " * \n \r \s \t / : = @ . ? - & $ # + % _ \\ %21 %26 %2526 %3B %7C %20 %2B %3D -- %2520
MINOR =
[outer]
MAJOR = [ ] < > ( ) { } | ! ; , ' " * \n \r \s \t & ? + %21 %26 %2526 %3B %7C %20 %2B %3D -- %2520
MINOR =
When developing complex searches on large data sets, avoid repeatedly reloading event
data from indexers as you iterate towards your solution
Make
faster!
SPL to LISPY
Make smaller
Bigger is better!
[Diagram: buckets, slices, events]
All searches are executed with an index and a time range. This defines our list
of buckets to consider. The first performance tip is to make this as tight as
possible.
1. Index and time define the considered buckets
3. LISPY queries the tsidx to identify slices to decompress
5. Schema on the fly extracts and eliminates events
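Step 1 can be pictured as a simple overlap test between each bucket's time span and the search time range. The sketch below is illustrative only: the `Bucket` class, its field names and the epoch values are hypothetical, and a single index is assumed.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    name: str
    earliest: int  # epoch seconds of the oldest event in the bucket
    latest: int    # epoch seconds of the newest event in the bucket


def considered_buckets(buckets, search_earliest, search_latest):
    """Keep only the buckets whose time span overlaps the search time range."""
    return [
        b for b in buckets
        if b.earliest <= search_latest and b.latest >= search_earliest
    ]


buckets = [Bucket("b1", 0, 99), Bucket("b2", 100, 199), Bucket("b3", 200, 299)]

wide = considered_buckets(buckets, 0, 299)
tight = considered_buckets(buckets, 120, 150)

print([b.name for b in wide])   # ['b1', 'b2', 'b3']
print([b.name for b in tight])  # ['b2']
```

A tighter time range leaves fewer buckets for the later stages to open.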
What happened?
[Diagram: buckets, slices, events]
By introducing TERM to our search we were able to improve elimination earlier in the pipeline.
Doing so saves downloading journal files from SmartStore, and reduces CPU required for
decompression and parsing.
Minimize filtering during schema on the fly stage
Agenda
1. Introduction
   What this presentation is all about
4. Bloomfilter elimination
   How bloomfilters accelerate _raw search
6. Introducing tstats
   How tstats delivers further performance improvements
decompress 6: “Karen loves Susan”
scan_count=2, event_count=1
Implies a 50% event elimination during schema on the fly
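The arithmetic behind that percentage is shown below as a small sketch; `elimination_ratio` is a hypothetical helper name, not a Splunk field or function.

```python
def elimination_ratio(scan_count, event_count):
    """Fraction of scanned events that schema on the fly discarded."""
    if scan_count == 0:
        return 0.0
    return (scan_count - event_count) / scan_count


print(elimination_ratio(2, 1))  # 0.5
print(elimination_ratio(1, 1))  # 0.0
```

When scan_count equals event_count the ratio is zero, meaning every event that was decompressed and parsed was actually needed.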
https://www.jasondavies.com/bloomfilter/
A “Bucket” is a Directory
A bucket is a collection of files held in a directory structure; notable files highlighted
Eliminated buckets
Bloomfilters and metadata allow us to eliminate buckets early, avoiding work
considered_buckets vs eliminated_buckets
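The idea of eliminating buckets with a bloomfilter can be illustrated with a toy filter. This is a generic textbook Bloom filter built on SHA-256, with an arbitrarily chosen size and hash count; it is not Splunk's on-disk format, and `BloomFilter` and `split_buckets` are hypothetical names.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, term):
        # Derive num_hashes bit positions by salting the term with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{term}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, term):
        for pos in self._positions(term):
            self.bits[pos] = True

    def might_contain(self, term):
        return all(self.bits[pos] for pos in self._positions(term))


def split_buckets(bucket_filters, term):
    """Partition bucket names into (considered, eliminated) for one term."""
    considered, eliminated = [], []
    for name, bloom in bucket_filters.items():
        (considered if bloom.might_contain(term) else eliminated).append(name)
    return considered, eliminated


with_term = BloomFilter()
with_term.add("karen")
with_term.add("susan")
empty = BloomFilter()

considered, eliminated = split_buckets(
    {"bucket_a": with_term, "bucket_b": empty}, "karen"
)
print(considered, eliminated)
```

A bucket whose filter says "definitely not present" can be skipped without opening its tsidx or journal; a "maybe present" answer still has to be checked against the lexicon.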
😱😱 😱😱 🤮🤮 🤮🤮
enableTsidxReduction = <boolean>
* Whether or not the tsidx reduction capability is enabled.
* By enabling this setting, you turn on the tsidx reduction capability.
  This causes the indexer to reduce the tsidx files of buckets when the
  buckets reach the age specified by 'timePeriodInSecBeforeTsidxReduction'.
* CAUTION: Do not set this setting to "true" on indexes that have been
  configured to use remote storage with the "remotePath" setting.
* Default: false