Splunk 6.4.0 Troubleshooting
Troubleshooting Manual
Generated: 5/12/2016 5:48 am
Table of Contents

First steps
    Intro to troubleshooting Splunk Enterprise
    Determine which version of Splunk Enterprise you're running
    Use btool to troubleshoot configurations
    Splunk on Splunk app
Platform Instrumentation
    About Splunk Enterprise platform instrumentation
    What gets logged?
    Configure platform instrumentation
    Sample platform instrumentation searches
Common back end scenarios
    What do I do with buckets?
    I get errors about ulimit in splunkd.log
    Intermittent authentication timeouts on search peers
    Event indexing delay
    Common issues with Splunk and WMI
    Troubleshooting Windows event log collection
    I need advanced help troubleshooting Splunk for Windows
    SuSE Linux search error
    Garbled events
    Binary file error
    Performance degraded in a search head pooling environment
    HTTP thread limit issues
Welcome to the Troubleshooting Manual
As with all Splunk docs, use the comment box feedback link at the bottom of
each page to make any suggestions.
Here's a brief description of each chapter you see in the left navigation bar:
First steps
Get oriented here. Find some tips about where to start with your troubleshooting.
Splunk Enterprise logs all sorts of things about itself. Find out what, where, and
how in this section.
We're here to help! If you're stuck, do contact us! Details and tips in this section.
This section includes some of the most common scenarios we see in Splunk
Support, with suggestions about what to do. Much more material is in the works
here!
First steps
For example, if the error occurs in a dashboard or alert, check the underlying
search first to see whether the error appears there. When troubleshooting
searches, it's almost always best to remove the dashboard layer as soon as
possible.
For another example, does the problem exist in one app but not the other? With
one user but not admins?
Did the error start occurring after the product was functioning
normally?
Yes! So what has changed? Remember to think of both Splunk and non-Splunk
factors. Was there a server outage? Network problems? Has any configuration or
topology changed?
Configurations
Splunk has configuration files in several locations, with rules about which files
take precedence over each other. Use btool to check which settings your Splunk
instance is using. Read about btool in this manual.
The *.conf files are case-sensitive. Check settings and values against the spec
and example configuration files in the Admin manual.
There are also a lot of settings in the .conf files that aren't exposed in Splunk
Web. It's best to leave these alone unless you know what changing these
settings might do.
Splunk has various internal log files that can help you diagnose problems. Read
about the log files in this manual.
The Distributed Deployment Manual has a high-level overview of the Splunk data
pipeline, breaking it into input, parsing, indexing, and search segments.
For more detail on each segment, see this Community Wiki article about how
indexing works.
Check the (continuously growing) chapter in this manual on some of the most
common symptoms and solutions.
If you need additional help or opinions, ask the Splunk community! The
Community Wiki, Splunk Answers, and the #splunk IRC channel on efnet are
available to everyone and provide a great resource.
Once you've found a way to fix the problem, test it! Test any noninvasive
changes first. Then, test any changes that would create minor interruptions.
Make sure no new issues arise from your tested solution.
Stuck?
If you get stuck at any point, contact Splunk Support. Don't forget to send a diag!
Read about making a diag in this manual.
Determine which version of Splunk Enterprise
you're running
In Splunk Web
Click the About link at the bottom left of most pages in Splunk Web to view a
JavaScript overlay with the version and build numbers.
or
In Splunk Search
Splunk Enterprise indexes the splunk.version file into the _internal index, and
forwarders send it along to their indexers.
Here's a search that shows you how many installs you have of each Splunk
Enterprise version:
index=_internal sourcetype=splunk_version | dedup host | top VERSION
To help you out, Splunk provides btool. This is a command line tool that can
help you troubleshoot configuration file issues or just see what values are being
used by your Splunk Enterprise installation.
Note: btool is not tested by Splunk and is not officially supported or guaranteed.
That said, it's what our Support team uses when trying to troubleshoot your
issues.
You can run btool to see all the configuration values in use by your Splunk
instance.
You probably want to send the results of btool into a text file that you can peruse
then delete, like this:
./splunk cmd btool transforms list > /tmp/transformsconfigs.txt
Piping to a file is handy for all use cases of btool, but for simplicity we'll only
explicitly mention it this once.
You can also run btool for a specific app in your Splunk installation. It will list all
the configuration values in use by that app for a given configuration file.
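A typical invocation (assuming the standard btool form) looks like this:

./splunk btool <conf_file_prefix> list --app=<app_name>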
where <app_name> is the name of the app you want to see the configurations for.
For example, if you want to know what configuration options are being used in
props.conf by the Search app, type:
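(Assuming the Search app's directory name, search:)

./splunk btool props list --app=search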
This returns a list of the props.conf settings currently being used for the Search
app.
The app name is not required. In fact, it's often a good idea not to specify the
app when using btool. In the case of btool, insight into all of your configurations
can be helpful.
Another thing you can do with btool is find out from which specific app Splunk is
pulling its configuration parameters for a given configuration file. To do this, add
the --debug flag to btool like in this example for props.conf:
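(A representative invocation:)

./splunk btool props list --debug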
Read about btool syntax in "Command line tools for use with Support".
Additional resources
Have questions? Visit Splunk Answers and see what questions and answers the
Splunk community has using btool.
Splunk on Splunk app

For information about installing and configuring the Splunk on Splunk app, see
the Splunk on Splunk documentation.
Each view offers help to explain the significance of the different charts and
panels shown, as well as the searches that populate them.
Errors: Parses Splunk Enterprise internal logs to help expose errors and
abnormal behavior. Contains dedicated search controls to help you locate the
source of problems.
Warnings: Detects known problems that may exist on your Splunk Enterprise
instance.
Crash Log Viewer: Detects and displays recent crash logs and correlates them
with Splunk Enterprise log files.
Search Detail Activity: Displays CPU utilization for all searches, providing
various ways to analyze and compare searches.
Scheduler Activity: Shows a variety of performance and usage metrics for the
search scheduler. Also, displays statistics on alert actions associated with
scheduled searches.
Splunk Enterprise log files
The Splunk Enterprise internal log files are rolled based on size. You can change
the default log rotation size by editing $SPLUNK_HOME/etc/log.cfg.

These log files are indexed into the _internal index, so you can search them in
Splunk Web with:

index=_internal
Internal logs
Note that some log files are not created until your Splunk instance uses them, for
example crawl.log.

crawl.log: Log of crawl activity. Read about crawl in the Getting Data In Manual. Crawl is now deprecated.

django_access.log: Django HTTP request log (equivalent to web_access.log) for the Django Bindings component of the Splunk Web Framework.

django_error.log: Raw Django error output from the Splunk Web Framework (not really meant to be human readable). Used with the link on error screens to see the full error in Splunk Web.

django_service.log: General Django-related messages from the Splunk Web Framework (equivalent to web_service.log).

export_metrics.log: Log of metrics related to exporting data with Hadoop Connect.

first_install.log: Shows the version number.

inputs.log: Inputs found by crawl. This log file will be empty unless you use the crawl command.

intentions.log: Beginning with Splunk 5, no longer used. Read about intentions in the Developing Views and Apps for Splunk Web Manual.

license_audit.log: No longer used.

license_usage.log: Indexed volume in bytes per pool, index, source, sourcetype, and host. Available only on a Splunk license master.

metrics.log: Contains periodic snapshots of Splunk performance and system data, including information about CPU usage by internal processors and queue usage in Splunk's data processing. The metrics.log file is a sampling of the top ten items in each category in 30-second intervals, based on the size of _raw. It can be used for limited analysis of volume trends for data inputs. For more information about metrics.log, see About metrics.log and Work with metrics.log in this manual.

migration.log: A log of events during install and migration. Specifies which files were altered during upgrade.

mongod.log: Contains runtime messages from the Splunk Enterprise app key value store.

python.log: Python events within Splunk. Useful for debugging REST endpoints, communication with splunkd, the PDF Report Server App, Splunk Web display issues, sendmail (email alerts), and scripted inputs. With web_service.log, one of the few Splunk logs that uses "WARNING" instead of "WARN" for the second most verbose logging level.

remote_searches.log: Messages from the StreamedSearch channel. This code is executed on the search peers when a search head makes a search request, so this file contains useful information on indexers regarding searches they're participating in.

scheduler.log: All actions (successful or unsuccessful) performed by the splunkd search and alert scheduler. Typically, this shows scheduled search activity.

searches.log: Beginning with Splunk 5, no longer used. Instead, use the following search syntax: | history. This shows all the searches that have been run, plus stats for the searches.

searchhistory.log: No longer used.

splunkd.log: The primary log written to by the Splunk server. May be requested by Splunk Support for troubleshooting purposes. Any stderr messages generated by scripted inputs, scripted search commands, and so on, are logged here.

splunkd_access.log: Any action done from splunkd through the UI is logged here, including splunkweb, the CLI, all POST/GET actions, deleted saved searches, and other programs accessing the REST endpoints. Also logs the time taken to respond to the requests. Search job artifacts logged here include the size of data returned with the search. sourcetype="splunkd_access"

splunkd_stderr.log: The Unix standard error device for the server. Typically this contains (for *nix) times of healthy start and stop events, as well as various errors like exceptions, assertions, and errors generated by libraries and the operating system.

splunkd_stdout.log: The Unix standard output device for the server.

splunkd_ui_access.log: Starting in 6.2, contains a significant portion of the types of events that used to be logged in web_access.log.

splunkd-utility.log: This log is written to by the prereq-checking utilities splunkd clone-prep-clear-config, splunkd validatedb, splunkd check-license, splunkd check-transforms-keys, and splunkd rest (for offline CLI). Each utility logs the Splunk version, some basic config, and current OS limits like the maximum number of threads, and then messages specific to the utility. Consult this log file when splunkd didn't start.

web_access.log: Requests made of Splunk Web, in an Apache access_log format. Much of the types of events logged here are logged in splunkd_ui_access.log starting in 6.2.

web_service.log: Primary log written by splunkweb. Records actions made by splunkweb. This and python.log are the only logs that, in the second most verbose logging level, write messages with "WARNING" instead of Splunk log files' usual "WARN."
Introspection logs

The introspection log files, disk_objects.log and resource_usage.log, populate the
_introspection index. See the Platform Instrumentation chapter in this manual.

Search logs

Splunk also creates search logs. Note that these are not indexed to _internal.
Each search has its own directory for all information specific to the search,
including its search logs. The search's directory is named with (among other
parameters) the search_id. (Match a search to its search_id in audit.log.) You'll
find the search directory in $SPLUNK_HOME/var/run/splunk/dispatch/.
If you have any long-running real-time searches, you might want to adjust the
maximum size of your search logs. These logs are rotated when they reach a
default maximum size of 25 MB. Splunk keeps up to five of them for each search,
so the total log size for a search can conceivably grow as large as 125 MB.
Debug mode
Splunk has a debugging parameter. Read about enabling debug logging in this
manual.
Except where noted above, Splunk's internal logging levels are DEBUG INFO WARN
ERROR FATAL (from most to least verbose).
1. Navigate to Settings > System settings > System logging. This generates a
list of log channels and their status.
2. To change the logging level for a particular log channel, click on that channel.
This brings up a page specific to that channel.
3. On the log channel's page, you can change its logging level.
Settings > System settings > System logging is meant only for dynamic and
temporary changes to Splunk log files. For permanent changes, use
$SPLUNK_HOME/etc/log.cfg instead.
Included data models
Splunk comes with two sample data models. These data models are constructed
from Splunk's internal logs. By interacting with them, you can learn about
Splunk's log files and about data models in one fell swoop.
To access the internal log data models, click Pivot. By default, you should see
two data models, "Splunk's Internal Audit Logs - SAMPLE" and "Splunk's Internal
Server Logs - SAMPLE."
Be warned, Splunk's debug mode is extremely verbose. All the extra chatter
might obscure something that might have helped you diagnose your problem.
And running Splunk in debug mode for any length of time will make your internal
log files really pretty unwieldy. Running debug mode is not recommended on
production systems.
Splunk has a debugging parameter (--debug) that you can use when starting
Splunk from the CLI in *nix. This command outputs logs to
$SPLUNK_HOME/var/log/splunk/splunkd.log. To enable debug logging from the
command line:
1. Navigate to $SPLUNK_HOME/bin.
2. Stop Splunk, if it is running.
3. Save your existing splunkd.log file by moving it to a new filename, like splunkd.log.old.
4. Restart Splunk in debug mode with splunk start --debug.
5. When you notice the problem, stop Splunk.
6. Move the new splunkd.log file elsewhere and restore your old one.
7. Stop or restart Splunk normally (without the --debug flag) to disable debug logging.
Specific areas can be enabled to collect debugging details over a longer period
with minimal performance impact. See the category settings in the file
$SPLUNK_HOME/etc/log.cfg to set specific log levels without enabling a large
number of categories as with --debug. Restart Splunk after changing this file.
Note: Not all messages marked WARN or ERROR indicate actual problems with
Splunk; some indicate that a feature is not being used.
Note also that this option is not available on Windows. To enable debugging on
Splunk running on Windows, enable debugging on a specific processor in Splunk
Web or using log.cfg.
In Splunk Web
You can enable these DEBUG settings via Splunk Web if you have admin
privileges. Navigate to Settings > System settings > System logging. Search
for the processor names using the text box. Click on the processor name to
change the logging level to DEBUG. You do not need to restart Splunk. In fact,
these changes will not persist if you restart the Splunk instance.
In log.cfg
For example, to see how often Splunk is checking on a particular file, put
'FileInputTracker' in DEBUG. Update the existing entry to read
category.FileInputTracker=DEBUG
Restart Splunk. Now every time Splunk checks the inputs file, it will be recorded
in $SPLUNK_HOME/var/log/splunk/splunkd.log. Remember to change these
settings back when you are finished investigating.
If a default level is not specified for a category, the logging level defaults to your
rootCategory setting.
Note: Leave category.loader at INFO. This is what gives us our build and system
info.
To change the maximum size of a log file before it rolls, change the maxFileSize
value (in bytes) for the desired file:
appender.A1=RollingFileAppender
appender.A1.fileName=${SPLUNK_HOME}/var/log/splunk/splunkd.log
appender.A1.maxFileSize=250000000
appender.A1.maxBackupIndex=5
appender.A1.layout=PatternLayout
appender.A1.layout.ConversionPattern=%d{%m-%d-%Y %H:%M:%S.%l} %-5p %c - %m%n
About precedence
If you have duplicate lines in log.cfg, the last line takes precedence. For example,
category.databasePartitionPolicy=INFO
category.databasePartitionPolicy=DEBUG
will give you DEBUG, but in the other order it will not.
The other log-*.cfg files behave similarly when you add categories. To set only
some things in a search.log into debug mode, then in log-searchprocess.cfg just
add a new category line after the rootCategory:
rootCategory=INFO,searchprocessAppender
category.whatever=DEBUG
appender.searchprocessAppender=RollingFileAppender
This leaves everything else as it was, which means only the debug messages
you want are generated. Putting rootCategory into DEBUG mode makes the
dispatch directories huge, so it's not a good choice for long-running debug.
log-local.cfg

You can put log.cfg settings into a local file, log-local.cfg, residing in the
same directory as log.cfg. The settings in log-local.cfg take precedence. And
unlike log.cfg, the log-local.cfg file does not get overwritten on upgrade.
With endpoints
In Splunk 4.1 and later, you can access a debugging endpoint that shows status
information about monitored files:
https://your-splunk-server:8089/services/admin/inputstatus/TailingProcessor:FileStatus
Enable debug messages from the CLI (4.1.4 and later versions)
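One way to do this (shown here for the TailingProcessor log channel as an example) is with the _internal call CLI command:

./splunk _internal call /services/server/logger/TailingProcessor -post:level DEBUG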
Note: The command returns the message "HTTP Status: 200". This is not an
error; it indicates success.

To put search processes into debug mode, add categories like these to
log-searchprocess.cfg:

rootCategory=DEBUG,searchprocessAppender
category.UnifiedSearch=DEBUG
category.IndexScopedSearch=DEBUG

This change takes effect immediately for all searches started after the change.
Debug Splunk Web
Change the logging level for the splunkweb process by editing the file:
$SPLUNK_HOME/etc/log.cfg
or if you have created your own $SPLUNK_HOME/etc/log-local.cfg
[python]
splunk = DEBUG
# other lines should be removed
The logging component names are hierarchical so setting the top level splunk
component will affect all loggers unless a more specific setting is provided, like
splunk.search = INFO.
Restart the splunkweb process with the command ./splunk restart splunkweb.
The additional messages are output in the file
$SPLUNK_HOME/var/log/splunk/web_service.log.
About metrics.log
This topic is an overview of metrics.log.
To learn about other log files, read "What Splunk logs about itself." For an
example using metrics.log, read "Troubleshoot inputs with metrics.log."
By default, metrics.log reports the top 10 results for each type. You can change
that number of series from the default by editing the value of maxseries in the
[metrics] stanza in limits.conf.
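For example, to report the top 20 series instead (20 here is just an illustration):

[metrics]
maxseries = 20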
Structure of the lines
First, boiler plate: the timestamp, the "severity," which is always INFO for metrics
events, and then the kind of event, "Metrics."
The next field is the group. This indicates what kind of metrics data it is. There
are a few groups in the file, including:
pipeline
queue
thruput
tcpout_connections
udpin_connections
mpool
Pipeline messages
Pipeline messages are reports on the Splunk pipelines, which are the
strung-together pieces of "machinery" that process and manipulate events
flowing into and out of the Splunk system. You can see how many times data
reached a given machine in the Splunk system (executes), and you can see how
much cpu time each machine used (cpu_seconds).
Plotting totals of cpu seconds by processor can show you where the cpu time is
going in indexing activity. Looking at numbers for executes can give you an idea
of data flow. For example if you see:
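(The lines below are illustrative only; the exact processors and values in your metrics.log will differ.)

group=pipeline, name=merging, processor=aggregator, cpu_seconds=0.000000, executes=998
group=pipeline, name=typing, processor=regexreplacement, cpu_seconds=0.000000, executes=103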
then it's pretty clear that a large portion of your items aren't making it past the
aggregator. This might indicate that many of your events are multiline and are
being combined in the aggregator before being passed along.
Read more about Splunk's data pipeline in "How data moves through Splunk" in
the Distributed Deployment Manual.
Queue messages
Most of these values are not interesting. But current_size, especially considered
in aggregate, across events, can tell you which portions of Splunk indexing are
the bottlenecks. If current_size remains near zero, then probably the indexing
system is not being taxed in any way. If the queues remain near 1000, then more
data is being fed into the system (at the time) than it can process in total.
A blocked queue message contains the string blocked=true, indicating that the
queue was full and an attempt to add more data to it failed. A queue becomes
unblocked as soon as the code pulling items out of it pulls an item. Many blocked
queue messages in a sequence indicate that data is not flowing at all for some
reason. A few scattered blocked messages indicate that flow control is operating,
which is normal for a busy indexer.
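Queue lines look something like this (illustrative values; the second line shows a blocked queue):

group=queue, name=indexqueue, max_size_kb=500, current_size_kb=0, current_size=0, largest_size=2, smallest_size=0
group=queue, name=parsingqueue, blocked=true, max_size_kb=6144, current_size_kb=6143, current_size=981, largest_size=981, smallest_size=0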
If you want to look at the queue data in aggregate, graphing the average of
current_size is probably a good starting point.
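A search along these lines (a sketch; narrow it to the queues you care about) does that:

index=_internal source=*metrics.log group=queue | timechart avg(current_size) by name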
There are queues in place for data going into the parsing pipeline, and for data
between parsing and indexing. Each networking output also has its own queue,
which can be useful to determine whether the data is able to be sent promptly, or
alternatively whether there's some network or receiving system limitation.
Thruput messages
Thruput is measured in the indexing pipeline. If your data is not reaching this
pipeline for some reason, it will not appear in this data. Thruput numbers relate to
the size of "raw" of the items flowing through the system, which is typically the
chunks of the original text from the log sources. This differs from the tcpout
measurements, which measure the total byte count written to sockets, including
protocol overhead as well as descriptive information like host, source, source
type.
This is the best line to look at when tuning performance or evaluating indexing
load. It tries to capture the total indexing data load.
Note: In thruput lingo, kbps does not mean kilobits per second, it means kilobytes
per second. The industry standard term would be to write this something like
KBps.
Following the catchall, there can be variety of breakouts of the indexing thruput,
including lines like:
... group=per_sourcetype_thruput, series="splunkd", kbps=0.261530, eps=1.774194, kb=8.107422, ev=2606, avg_age=420232.710668, max_age=420241
In thruput messages the data load is broken out by host, index, source, and
source type. This can be useful for answering questions such as which hosts,
indexes, sources, or source types are contributing the most data, and when.
The series value identifies the host or index, etc. The kb value indicates the
number of kilobytes processed since the last sample. Graphing kb in aggregate
can be informative. The summary indexing status dashboard uses this data, for
example.
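A simple example of that kind of aggregation (a sketch only) is:

index=_internal source=*metrics.log group=per_sourcetype_thruput | timechart span=1h sum(kb) by series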
The avg_age and max_age refer to the difference between the time that the event
was seen by the thruput processor in the indexing queue, and the time when the
event occurred (or more accurately, the time that Splunk decided the event
occurred).
max_age is the largest difference between the current time and the
perceived time of the events coming through the thruput processor.
avg_age is the average difference between the current time and the
perceived time of the events coming through the thruput processor.
tcpout_connections messages

These lines look like this:
... group=tcpout_connections,
name=undiag_indexers:10.159.4.67:9997:0, sourcePort=11089,
destIp=10.159.4.67, destPort=9997, _tcp_Bps=28339066.17,
_tcp_KBps=27674.87,
_tcp_avg_thruput=27674.87, _tcp_Kprocessed=802571,
_tcp_eps=33161.10, kb=802571.21
name is a combination of conf stanza and entries that fully define the target
system to the software.
sourcePort is the dynamically assigned port by the operating system for
this socket.
destIP and destPort define the destination of the socket. destPort is
typically statically configured, while destIP may be configured or returned
by name resolution from the network.
All of the size-related fields are based on the number of bytes that Splunk has
successfully written to the socket, or SSL-provided socket proxy. When SSL is
not enabled for forwarding (the default), these numbers represent the number of
bytes written to the socket, so effectively the number of bytes conveyed by the
tcp transport (irrespective of overhead issues such as keepalive, tcp headers, ip
headers, ethernet frames and so on). When SSL is enabled for forwarding, this
number represents the number of bytes Splunk handed off to the OpenSSL layer.
_tcp_Bps is the bytes transmitted during the metrics interval divided by the
duration of the interval (in seconds)
_tcp_KBps is the same value divided by 1024
_tcp_avg_thruput is an average rate of bytes sent since the last time the
tcp output processor was reinitialized/reconfigured. Typically this means
an average since Splunk started.
_tcp_KProcessed is the total number of bytes written since the processor
was reinitialized/reconfigured, divided by 1024.
_tcp_eps is the number of items transmitted during the interval divided by
the duration of the interval (in seconds). Note that for universal/light
forwarders the items will frequently not be events, but data chunks.
kb is the bytes transmitted during the metrics interval divided by 1024.
udpin messages
group=udpin_connections, 2514, sourcePort=2514, _udp_bps=0.00, _udp_kbps=0.00, _udp_avg_thruput=0.00, _udp_kprocessed=0.00, _udp_eps=0.00
Be aware that it's quite possible to max out the ability of the operating system,
let alone Splunk, to handle UDP packets at high rates. This data might be useful
to determine if any data is coming in at all, and at what times it rises. There is no
guarantee that all packets sent to this port will be received and thus metered.
mpool messages
The mpool lines represent memory used by the Splunk indexer code only (not
any other pipeline components). This information is probably not useful to
anyone other than Splunk developers.
In a typical sample, you might see that some memory is sometimes in use,
although at the time of the sample none is in use, and that generally the use is low.
map, pipelineinputchannel name messages
These messages are primarily debugging information over the Splunk internal
cache of processing state and configuration data for a given data stream (host,
source, or source type).
subtask_seconds messages
... rebuild_metadata=0.000300, update_bktManifest=0.000000, service_volumes=0.000105, service_maxSizes=0.000000, service_externProc=0.000645
Troubleshoot inputs with metrics.log

You might want to identify a data input that has suddenly begun to generate
uncharacteristically large numbers of events. If this input is hidden in a large
quantity of similar data, it can be difficult to determine which one is actually the
problem. You can find it by searching the internal index (add index=_internal to
your search) or just look in metrics.log itself in $SPLUNK_HOME/var/log/splunk.
There's a lot more in metrics.log than just volume data, but for now let's focus on
investigating data inputs.
For incoming events, the amount of data processed is in the thruput group, as in
per_host_thruput. In this example, you're only indexing data from one host, so
per_host_thruput actually can tell us something useful: that right now host
"grumpy" indexes around 8k in a 30-second period. Since there is only one host,
you can add it all up and get a good picture of what you're indexing, but if you
had more than 10 hosts you would only get a sample.
For example, you might know that access_common is a popular source type for
events on this Web server, so it would give you a good idea of what was
happening:
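A per_sourcetype_thruput line for that source type would look something like this (the values here are illustrative, not from a real system):

... group=per_sourcetype_thruput, series="access_common", kbps=4.302011, eps=12.354839, kb=133.362305, ev=383, avg_age=0.258462, max_age=1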
But you probably have more than 10 source types, so at any particular time some
other one could spike and access_common wouldn't be reported.
per_index_thruput and per_source_thruput work similarly.
With this in mind, let's examine the standard saved search "KB indexed per hour
last 24 hours".
This means: look in the internal index for metrics data of group
per_index_thruput, ignore some internal stuff and make a report showing the
sum of the kb values. For cleverness, we'll also rename the output to something
meaningful, "totalKB". The result looks like this:
Those totalKB values just come from the sum of kb over a one hour interval. If
you like, you can change the search and get just the ones from grumpy:
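One way to do that (again an approximation) is to switch the breakout to per_host_thruput and filter on the host:

index=_internal source=*metrics.log group=per_host_thruput series=grumpy | timechart span=1h sum(kb) AS totalKB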
We see that grumpy was unusually active in the 2 pm time bin. With this
knowledge, we can start to hunt down the culprit by, for example, source type or
host.
Answers
Have questions? Visit Splunk Answers and see what questions and answers the
Splunk community has about working with metrics.log.
Apache formats are described briefly in the Apache HTTP Server documentation.
For example, see Apache 2.4 log file documentation.
splunkd_access.log
This file records HTTP requests served by splunkd on its management port. Here
is a typical line in splunkd_access.log:
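A representative line looks like the following (this example is illustrative rather than copied from a real instance):

127.0.0.1 - admin [12/May/2016:05:48:07.123 -0700] "GET /services/server/info HTTP/1.1" 200 1542 - - - 4ms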
address: The IP address from which the HTTP client socket appears to
originate. Typically these requests originate from splunkweb and come
over the localhost/loopback address.
The second field is a placeholder for the unused identd field.
user: The splunk user, if any, making the request. System accesses on
behalf of no particular user appear as "-".
timestamp: This is the time that splunkd finished reading in the request.
However, the log event is written out when the http server finishes writing
the response, so these timestamps can be out of order.
request: The HTTP request made by the client consisting of an action, a
URL, and a protocol version.
status: The HTTP status returned as part of the response.
response_size: The size of the body of the response in bytes
Three additional placeholders. (If you know what these stand in for, send
docs feedback below!)
duration: The time it took from the completion of reading the request to
completely writing out the response. This value is logged explicitly in
milliseconds.
Between the definitions for timestamp and duration, you can infer the response
completion time by adding duration to the timestamp.
web_access.log
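Lines in this file follow roughly this layout (a sketch, not an exact template):

address - user [time] "request" status response_size "referer" "user agent" - session_id duration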
where address, user, time, request, status, response_size, and duration are
the same as in splunkd_access.log. The new components here are:
referer: referer [sic] is the URL that the client told us provided the link to
the URL that was accessed.
user agent: The string the http client used to identify itself.
session_id: This represents the splunkweb session. Can be used to follow
a stream of requests from a particular client. These sessions are transient
starting in Splunk Enterprise 6.2.0.
splunkd_ui_access.log
Starting in Splunk Enterprise 6.2.0, splunkd handles requests from the browser
that splunkweb handled pre-6.2.0. This file records HTTP requests served by
splunkd on the Splunk Web (UI) port. The format is identical to web_access.log.
Platform Instrumentation
Supported platforms
Windows
x86-64: Server 2008, Server 2008 R2, Server 2012
x86-32: Server 2008, Server 2008 R2
Linux
x86-64: RHEL with 2.6+ kernel
x86-32: RHEL with 2.6+ kernel
Solaris
x86-64: 10, 11
SPARC: 10, 11
Where is this data written?
The two log files are disk_objects.log and resource_usage.log. See "What gets
logged" for a breakdown of what data goes into which file.
These log files populate the _introspection index. To search them in Splunk Web:

index=_introspection
To find introspection data from a forwarder or another instance in your
deployment, qualify your search with the remote host name.
If you are upgrading from a Splunk Enterprise version pre-6.1, expect the new
log files to use a bit of disk space (an estimated 300 MB). The _introspection
index's disk usage, on the other hand, varies from deployment to deployment.
Each log file has a maximum size of 25 MB. You can change this limit in log.cfg.
You can have up to six instances of each file, according to your log rotation
policy. That is, resource_usage.log, resource_usage.log.1, ...
resource_usage.log.5, and the same for disk_objects.log. Thus, the introspection
log files by default can take up to 300 MB of disk space.
See the upgrade docs in the Installation Manual for upgrade information.
What gets logged?
This topic describes the contents of log files that are tailed to populate the
_introspection index. For the log files that populate _internal, see "What
Splunk logs about itself" in this manual.
These log files comply with the Common Information Model (CIM). See the CIM
add-on documentation for more information.
"Extra field" indicates a field that is not logged by default. Read more about
configuring polling intervals and enabling this feature on a universal forwarder in
"Configure platform instrumentation."
Splunk Enterprise can log all the above data for search processes (except args).
In addition, it logs some additional information about search processes, in a
subsection called search_props.
See the list of output fields at
system/server/status/resource-usage/splunk-processes in the REST API
Reference Manual. The search process fields are embedded within the larger
process table, at the search_props entry.
I/O statistics
Disk input/output statistics. The Splunk Enterprise iostats endpoint displays the
most recent data. Historical data is logged to resource_usage.log.
server/introspection/search/dispatch
Disk object data
server/info
See the list of output fields at system/server/info in the REST API Reference
Manual.
data/index-volumes
See the list of output fields at data/index-volumes in the REST API Reference
Manual.
data/index-volumes/{Name}
data/indexes-extended
data/indexes-extended/{Name}
server/status/dispatch-artifacts
server/status/fishbucket
server/status/limits/search-concurrency
server/status/partitions-space
Helps track disk usage. These results show only partitions with Splunk disk
objects (indexes, volumes, logs, fishbucket, search process artifacts) on them.
There is a partitions event for each file system, and each event gives the
respective file system type.
A partition is a physical concept, simply a chunk of hard drive (or solid state
drive). All we know about a partition is its size. A file system can reside on
multiple partitions. Splunk Enterprise does not report at the partition level.
Configure platform instrumentation
This topic is about log files that are tailed to populate the _introspection index.
Read about this feature in "About Splunk Enterprise platform instrumentation."
This topic helps you configure the default logging interval and enable or disable
logging.
[install]
state = enabled
Prerequisites
[install]
state = enabled
7. Save the changes. Review the changes to the app.conf file and the path as a
validation step.
Update the serverclass.conf file, adding the app to a serverclass for
deployment
1. Find the primary copy of the serverclass.conf file. The location and contents
will vary between deployments, but some common locations are:
$SPLUNK_HOME/etc/system/local/ and $SPLUNK_HOME/etc/apps/*/local/. To
use btool to find all serverclass.conf files referenced on the deployment server,
run ./splunk btool --debug serverclass list and review the output.
2. Create a new app definition for deploying the changes to the introspection
generator add-on. This task is dependent upon the local environment and how
the Splunk administrator has chosen to assign and manage apps deployed to
forwarders. Many deployments use one serverclass definition to deploy and
manage the most common apps for forwarders. For the purposes of this
procedure, all universal forwarders are included under one encompassing
serverclass named PrimaryForwarders.
3. Add a stanza for the add-on, for example:

[serverClass:PrimaryForwarders:app:introspection_generator_addon]
excludeFromUpdate = $app_root$/default, $app_root$/bin
restartSplunkd = True
4. Save the changes. Review the changes to the serverclass.conf file and the
path as a validation step.
1. Utilize your enterprise change control system to file the requirements and
changes for this procedure.
Use the search head to validate the introspection logs are being forwarded.
Example: index=_introspection host=<forwarder_host> | stats count by
source, component
Populate "Extra" fields
Four fields (in per-process resource usage data) are not populated by default but
can be turned on. See "What gets logged" for information.
In server.conf you can tell Splunk Enterprise to acquire the "Extra" fields by
setting acquireExtra_i_data to true. For example:
[introspection:generator:disk_objects]
disabled = false
acquireExtra_i_data = true
collectionPeriodInSecs = 600
In server.conf you can increase the polling period by collection type (that is,
resource usage data or disk object data).
The default settings (for anything other than a universal forwarder) are:
[introspection:generator:disk_objects]
disabled = false
acquireExtra_i_data = false
collectionPeriodInSecs = 600
[introspection:generator:resource_usage]
disabled = false
acquireExtra_i_data = false
collectionPeriodInSecs = 10
Disable logging
You can turn off all introspection collection (and subsequent logging) by disabling
the Introspection Generator Add-On.
In the $SPLUNK_HOME/etc/apps/introspection_generator_addon/local/app.conf
file, set
[install]
state = disabled
You can also disable one or both collection types in server.conf by setting
disabled = true in the corresponding stanza. The defaults are:

[introspection:generator:disk_objects]
disabled = false
acquireExtra_i_data = false
collectionPeriodInSecs = 600
[introspection:generator:resource_usage]
disabled = false
acquireExtra_i_data = false
collectionPeriodInSecs = 10
If you've disabled this logging on your instance, you can still invoke the CLI
command. To invoke, at the command line:
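The invocation looks roughly like this (assuming the instrument-resource-usage CLI command; run splunk help on your instance to confirm the name and flags):

./splunk instrument-resource-usage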
--debug: Set logging level to DEBUG (this can also be done via
log-cmdline.cfg)
--extra: This has the same effect as setting acquireExtra_i_data to true in the
server.conf [introspection:generator:resource_usage] stanza. See "What
gets logged" for which fields are not logged by default and require this flag.
In indexes.conf you can specify the _introspection index. The default location is
in $SPLUNK_DB:
[_introspection]
homePath = $SPLUNK_DB/_introspection/db
coldPath = $SPLUNK_DB/_introspection/colddb
thawedPath = $SPLUNK_DB/_introspection/thaweddb
maxDataSize = 1024
frozenTimePeriodInSecs = 1209600
Use this search to find the median total physical memory used, per search type
(ad hoc, scheduled, report acceleration, data model acceleration, or summary
indexing) for one host over the last hour:
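A sketch of such a search, using the _introspection per-process fields (the shipped sample splits the search types somewhat more finely):

index=_introspection host=<hostname> component=PerProcess data.search_props.sid=* earliest=-1h | timechart median(data.mem_used) by data.search_props.type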
You can visualize the results as a stacked column chart.
Current disk usage per partition in use by Splunk Enterprise
Use this search to find the latest value of Splunk Enterprise disk usage per
partition and instance:
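One way to sketch this, assuming the Partitions component and its data.mount_point, data.capacity, and data.available fields:

index=_introspection component=Partitions | stats latest(data.capacity) AS capacity, latest(data.available) AS available by host, data.mount_point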
Median CPU usage for the main splunkd process for one host
Use this search to find the median CPU usage of the main splunkd process for
one host over the last hour:
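A rough approximation, assuming the PerProcess component and the data.pct_cpu field (this excludes search processes but not other splunkd helpers):

index=_introspection host=<hostname> component=PerProcess data.process=splunkd NOT data.search_props.sid=* earliest=-1h | timechart median(data.pct_cpu)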
Fill in "<hostname>" with the "host" metadata field associated with your instance,
as recorded in inputs.conf's "host" property. As an area chart, this search
produces something like this:
Use this search to find the median number of searches running at any given time,
split by mode (historical, historical batch, real-time, or real-time indexed):
index=_introspection data.search_props.sid=* earliest=-1h | bin _time
span=10s|stats dc(data.search_props.sid) as search_count by
data.search_props.mode, _time | timechart median(search_count) by
data.search_props.mode
Fill in "<hostname>" with the "host" metadata field associated with your instance,
as recorded in inputs.conf's "host" property.
Contact Splunk Support
Contact Support
For contact information, see the main Support contact page.
Note: Before you send any files or information to Splunk Support, verify that you
are comfortable with sending it to us. We try to ensure that no sensitive
information is included in any output from the commands below and in
"Anonymize data samples to send to Support" in this manual, but we cannot
guarantee compliance with your particular security policy.
Note: Before you upload a diag, make sure the user who uploads the file has
read permissions to the diag*.tar.gz file.
Diagnostic files
The diag command collects basic info about your Splunk server, including
Splunk's configuration details (such as the contents of $SPLUNK_HOME/etc and
general details about your index, like the host and source names). It does not
include any event data or private information.
Be sure to run diag as a user with appropriate access to read Splunk files.
On *nix, this is typically the user you run the splunk service under, such as
'splunk'; on Windows, it is typically the domain user you run Splunk as, or some
kind of local administrator if you run as "LocalSystem".
See "Generate a diag" in this manual for instructions on the diag command.
Core Files
To collect a core file if Support asks you for one, use ulimit to remove any
maximum file size setting before starting Splunk.
# ulimit -c unlimited
# splunk restart
This setting only affects the processes you start from the shell where you ran the
ulimit command. To find out where core files land in your particular UNIX flavor
and version, consult the system documentation. The below text includes some
general rules that may or may not apply.
On UNIX, if you start Splunk with the --nodaemon option (splunk start
--nodaemon), it may write the core file to the current directory. Without the flag the
expected location is / (the root of the filesystem tree). However, various platforms
have various rules about where core files go with or without this setting. Consult
your system documentation. If you do start splunk with --nodaemon, you will
need to, in another shell, start the web interface manually with splunk start
splunkweb.
Depending on your system, the core may be named something like core.1234,
where '1234' is the process ID of the crashing program.
LDAP configurations
If you are having trouble setting up LDAP, Support will typically need information
about your LDAP configuration, such as your authentication.conf settings.
Here are some ideas to get you started.
What elements are present for the issue? What's the timeline leading to the
error? What processes are running when the error appears?
What behavior do you observe, compared to what you expect? Be specific: for
example, how late is "late"?
Most Support cases are for functional problems: the software has been
configured to do something, but it is behaving in an unexpected way. Splunk
Support needs both the context of the problem and insight into the instance that
is not performing as expected. That insight comes in the form of a "diag,"
essentially a snapshot of the configuration of the Splunk platform instance and
the recent logs from that instance.
You can make a diag on any instance type: forwarder, indexer, search head, or
deployment server. If you have a forwarder and a receiver that are not working
together correctly, send us diags of both. Label the diags so it's clear which
instance each is from. If you have many forwarders, send only one
representative forwarder diag.
The diag tarball does not contain any of your indexed data, but you can examine
its contents before sending it. Read about what you can include or exclude from
diags in Generate a diag in this manual.
As you work a case, Support might ask you to make a change and then send an
updated diag so they can examine its effect. It is not unusual to have multiple
updated diags for a single case. If you send multiple diags, label each one clearly.
Generate a diag
To help diagnose a problem, Splunk Support might request a diagnostic file from
you. Diag files give Support insight into how an instance is configured and how it
has been operating up to the point that the diag command was issued.
About diag
The diag command collects basic information about your Splunk platform
instance, including Splunk's configuration details. It gathers information from the
machine such as server specs, OS version, file system, and current open
connections. From the Splunk platform instance it collects the contents of
$SPLUNK_HOME/etc such as app configurations, internal Splunk log files, and index
metadata.
Diag does not collect any of your indexed data and we strongly encourage you to
examine the tarball to ensure that no proprietary data is included. In some
environments, custom app objects, like lookup tables, could potentially contain
sensitive data. Exclude a file or directory from the diag collection by using the
--exclude flag. Read on for more details.
Note: Before you send any files or information to Splunk Support, verify that you
are comfortable sending it to us. We try to ensure that no sensitive information is
included in any output from the commands below and in "Anonymize data
samples to send to Support" in this manual, but we cannot guarantee compliance
with your particular security policy.
Be sure to run diag as a user with appropriate access to read Splunk files.
On *nix, from $SPLUNK_HOME/bin:

./splunk diag

On Windows, from %SPLUNK_HOME%\bin:

splunk diag
If you have difficulty running diag in your environment, you can also run the
python script directly from the bin directory using cmd.
On *nix:
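On a default install, that looks something like this (the Python path can vary by version):

./splunk cmd python $SPLUNK_HOME/lib/python2.7/site-packages/splunk/clilib/info_gather.py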
On Windows:
splunk cmd python %SPLUNK_HOME%\Python-2.7\Lib\site-packages\splunk\clilib\info_gather.py
Note: The python version number may differ in future versions of Splunk
Enterprise, affecting this path.
You can tell diag to leave some files out of the archive. One way to do this is with
path exclusions, using the --exclude switch at the command line. For
example:
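(A representative pattern; substitute whatever glob you want to exclude:)

./splunk diag --exclude "*/lookups/*"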
This is repeatable:
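(Again illustrative; each additional --exclude adds another pattern:)

./splunk diag --exclude "*/lookups/*" --exclude "*.csv"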
The components available (for the --collect and --disable switches) are:
index_files, index_listing, dispatch, etc, log, and pool.
The following switches control the thoroughness with which diag gathers
categories of data:
If you have an app installed which extends diag, it may offer additional
app-specific flags, in the form --app_name:setting. Apps do not currently offer
defaulting of their settings in server.conf
Components
dispatch: The search dispatch directories. See "What Splunk Enterprise logs
about itself."
pool: If search head pooling is enabled, the contents of the pool dir.
rest: splunkd httpd REST endpoint gathering. Collects output of various splunkd
urls into xml files to capture system state. (Off by default due to fragility concerns
for initial 6.2 shipment.)
Run diag on a remote node
If you are not able to SSH into every machine in your deployment, you can still
gather diags.
First, make sure you have the "get-diag" capability. Admin users have this
capability. If admin users want to delegate this responsibility, they can give power
users the get-diag capability.
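A remote diag request looks roughly like this (substitute the remote host and management port, and authenticate as prompted):

./splunk diag -uri https://<remote_host>:8089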
Examples
These two examples exclude content on the file level. A lookup table can be
one of several formats, like .csv, .dat, or text.
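For instance (illustrative patterns; adjust to the file names you actually use):

$SPLUNK_HOME/bin/splunk diag --exclude "*.csv"
$SPLUNK_HOME/bin/splunk diag --exclude "*.dat"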
Note: These examples will exclude all files of that type, not only lookup tables. If
you have .csv or .dat files that will be helpful for Support in troubleshooting your
issue, exclude only your lookup tables. That is, write out the files instead of using
an asterisk.
This example excludes content on the component level. Exclude the dispatch
directory to avoid gathering search artifacts (which can be very costly on a
pooled search head):
$SPLUNK_HOME/bin/splunk diag --disable=dispatch
To exclude multiple components, use the --disable flag once for each
component.
Exclude the dispatch directory and all files in the shared search head pool:
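For example, assuming the dispatch and pool components named above:

$SPLUNK_HOME/bin/splunk diag --disable=dispatch --disable=pool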
Note: This does not gather a full set of the configuration files in use by that
instance. Such a diag is useful only for the logs gathered from
$SPLUNK_HOME/var/log/splunk. See "What Splunk Enterprise logs about itself" in
this manual.
Our recommended steps for the moment for generating a diag on a Splunk data
cluster are:
$SPLUNK_HOME/bin/splunk login
...enter username and password here...
$SPLUNK_HOME/bin/splunk diag --collect all
You can update the default settings for diag in the [diag] stanza of server.conf.
[diag]
Diag contents
_raft/...
Files containing the state of the consensus protocol produced by search
head clustering from var/run/splunk/_raft
composite.xml
The generated file that splunkd uses at runtime to control its component
system (pipelines & processors), from var/run/splunk/composite.xml
diag.log
A copy of all the messages diag produces to the screen when running,
including progress indicators, timing, messages about files excluded by
heuristic rules (eg if size heuristic, the setting and the size of the file),
errors, exceptions, etc.
dispatch/...
A copy of some of the data from the search dispatch directory. Results
files (the output of searches) are not included, nor other similar files
(events/*)
etc/...
A copy of the contents of the configuration files. All files and directories
under $SPLUNK_HOME/etc/auth are excluded by default.
excluded_filelist.txt
A list of files which diag would have included, but did not because of some
restriction (exclude rule, size restriction). This is primarily to confirm the
behavior of exclusion rules for customers, and to enable Splunk technical
support to understand why they can't see data they are looking for.
introspection/...
The log files from $SPLUNK_HOME/var/log/introspection
log/...
The log files from $SPLUNK_HOME/var/log/splunk
rest-collection/...
Output of several splunkd http endpoints that contain information not
available in logs. File input/monitor/tailing status information, server-level
admin banners, clustering status info if on a cluster.
scripts/...
A single utility script may exist here for support reasons. It is identical for
every diag.
systeminfo.txt
Generated output of various system commands to determine things like
available memory, open splunk sockets, size of disk/filesystems, operating
system version, ulimits.
Also contained in systeminfo.txt are listings of filenames/sizes etc from a
few locations.
Some of the splunk index directories (or all of the index directories,
if full listing is requested.)
The searchpeers directory (replicated files from search heads)
Search Head Clustering -- The summary files used in
synchronization from var/run/splunk/snapshot
Typically var/...
The paths to the indexes are a little 'clever', attempting to resemble the
paths actually in use. (For example, on Windows, if an index is in
e:\someother\largedrive, that index's files will be in e/someother/largedrive
inside the diag.) By default only the .bucketManifest for each index is
collected.
app_ext/<app_name>/...
If you have an app installed which extends diag, the content it adds to the
produced tar.gz file will be stored here.
Behavior on failure
"/opt/splunk/lib/python2.7/site-packages/splunk/clilib/info_gather.py",
line 1862, in create_diag
copy_etc(options)
File
"/opt/splunk/lib/python2.7/site-packages/splunk/clilib/info_gather.py",
line 1626, in copy_etc
raise Exception("OMG!")
Exception: OMG!
For most real errors, diag tries to guess at the original problem, but it also writes
out a file for use in bugfixing diag. Please do send it along, and at least a
workaround can often be provided quickly.
Additional resources
Watch a video on making a diag and using the anonymize command by a Splunk
Support engineer.
Have questions? Visit Splunk Answers and see what questions and answers the
Splunk community has about diags.
Anonymize data samples to send to Support

The anonymized file is written to the same directory as the source file, with ANON-
prepended to its filename. For example, /tmp/messages will be anonymized as
/tmp/ANON-messages.
You can anonymize files from Splunk's CLI. To use Splunk's CLI, navigate to the
$SPLUNK_HOME/bin/ directory and use the ./splunk command.
Simple method
The easiest way to anonymize a file is with the anonymizer tool's defaults, as
shown in the session below. Note that you currently need to have
$SPLUNK_HOME/bin as your current working directory.
From the CLI while you are in $SPLUNK_HOME, type the following:
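(A sketch of the default invocation, assuming the file -source form of the anonymize command:)

./splunk anonymize file -source /path/to/your/logfile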
Of course it is always good practice to move the file somewhere safe (like /tmp)
before doing this sort of thing. So, for example:
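(Again a sketch:)

cp /var/log/messages /tmp
./splunk anonymize file -source /tmp/messages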
Advanced method
You can customize the anonymizer by telling it what terms to anonymize, what
terms to leave alone, and what terms to use as replacements. The advanced
form of the command is:
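(An approximation of the full form, using the parameters described below:)

./splunk anonymize file -source <filename> [-public_terms <file>] [-private_terms <file>] [-name_terms <file>] [-dictionary <file>] [-timestamp_config <file>]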
filename
Default: None
Path and name of the file to anonymize.
public_terms
Default: $SPLUNK_HOME/etc/anonymizer/public-terms.txt
A list of locally-used words that will not be anonymized if they are in
the file. It serves as an appendix to the dictionary file.
Here is a sample entry:
2003 2004 2005 2006 abort aborted am apr april aug august auth
authorize authorized authorizing bea certificate class com complete
private_terms
Default: $SPLUNK_HOME/etc/anonymizer/private-terms.txt
A list of words that will be anonymized if found in the file, because
they may denote confidential information.
Here is a sample entry:
401-51-6244
passw0rd
name_terms
Default: $SPLUNK_HOME/etc/anonymizer/names.txt
A global list of common English personal names that Splunk uses
to replace anonymized words.
Splunk always replaces a word with a name of the exact same
length, to keep each event's data pattern the same.
Splunk uses each name in name_terms once to replace a character
string of equal length throughout the file. After it runs out of names,
it begins using randomized character strings, but still mapping each
replaced pattern to one anonymized string.
Here is a sample entry:
charlie
claire
desmond
jack
dictionary
Default: $SPLUNK_HOME/etc/anonymizer/dictionary.txt
A global list of common words that will not be anonymized, unless
overridden by entries in the private_terms file.
Here is a sample entry:
algol
ansi
arco
arpa
arpanet
ascii
timestamp_config
Default: $SPLUNK_HOME/etc/anonymizer/anonymizer-time.ini
Splunk's built-in file that determines how timestamps are parsed.
Output Files
Splunk's anonymizer function will create three new files in the same directory as
the source file.
ANON-filename
The anonymized version of the source file.
INFO-mapping.txt
This file contains a list of which terms were anonymized into which
strings.
Here is a sample entry:
Replacement Mappings
--------------------
kb900485 --> LO200231
1718 --> 1608
transitions --> tstymnbkxno
reboot --> SPLUNK
cdrom --> pqyvi
INFO-suggestions.txt
A report of terms found in the file that, based on their appearance
and frequency, you may want to add to private-terms.txt or to
public-terms.txt for more accurate anonymization of your local data.
Here is a sample entry:
Linux tip: Anonymize all log files from a diag at once
Here are the steps to generate a diagnostic (diag file) and then anonymize the
logs of that diag.
cd $SPLUNK_HOME/bin
./splunk diag --exclude "*/passwd"
cd pathtomyuncompresseddiag/
tar xfz my-diag-hostname.tar.gz
3. Run anonymize on each file of the diag. If you run this command for all *.log,
then make note of the log files that now have a prefix of ANON*.log. For
example:
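One way to script that step (a bash sketch; the extraction path is hypothetical):

cd $SPLUNK_HOME/bin
for f in /tmp/my-diag-hostname/log/splunk/*.log; do
    ./splunk anonymize file -source "$f"
done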
4. Keep all the files that now have a prefix of ANON*.log while deleting the
non-anonymized versions in the diag directory.
6. Upload the diag, adding it to the Support case, with the ADD FILE button in the
case.
Collect pstacks
Support might ask you to gather thread call stacks with pstack, for example if
your deployment experiences:
unexplained high CPU, along with identified threads using high CPU,
frozen Splunk that's not doing anything, when it obviously should, or
unexplainably slow behavior in splunkd (that is, not limited by disk or
CPU).
On *nix
Pstack is available by default on Red Hat Enterprise Linux, CentOS, and Solaris. Pstack
can be installed on several other flavors of Linux. To check whether you have it:
which pstack
/usr/bin/pstack
If you get an error message instead of a location, you might still be able to install
pstack. On RHEL and its derivatives (CentOS, Oracle Linux, etc), pstack is part
of the gdb package.
On Linux flavors that aren't based on RHEL, pstack might be useless for
troubleshooting, in that it does not support threads. In that case, you probably have
the x86-64-specific pstack binary, which is less capable than the Red Hat gdb-based
one, as it does not understand POSIX threaded applications. Ensure that the gdb
package is installed, and try the gstack command as a substitute for pstack. gstack is
available on Ubuntu, for example. If gstack is not available, a very barebones gstack
is provided here:
pid=$1
echo 'thread apply all bt' | gdb --quiet -nx /proc/$pid/exe $pid
Run pstack
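The collection script itself is not reproduced here; a minimal sketch that samples the splunkd stacks repeatedly might look like this (the sample count, interval, and output file naming are assumptions, so adjust them as Support advises):

#!/bin/bash
# Usage: ./pstacks.sh <splunkd pid>
pid=$1
count=10
interval=2
while [ $count -gt 0 ]; do
    # Capture one stack sample per iteration, named by timestamp
    pstack $pid > pstack.$pid.$(date +%s).out
    sleep $interval
    let count=count-1
done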
Note that this script requires bash (let is not a portable expression).
On Windows
http://wiki.splunk.com/Community:GatherWindowsStacks
Command line tools for use with Support
This topic contains information on CLI tools to help with troubleshooting Splunk
Enterprise. Most of these tools are invoked using the Splunk CLI command
"cmd". You should not use these tools without first consulting with Splunk
Support.
For general information about using the CLI in Splunk, see "Get help with the
CLI" in the Admin Manual.
cmd
Examples:
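splunk cmd runs the named utility with the Splunk environment set up for it. For example (illustrative):

./splunk cmd btool inputs list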
Objects: None
btool
View or validate Splunk configuration files, taking into account configuration file
layering and user/app context.
Syntax:
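The general shape of the command is as follows (bracketed parts are optional; the options shown are the ones btool commonly accepts):

./splunk cmd btool <conf_file_prefix> list [stanza_prefix] [--app=app_name] [--user=user_name] [--debug]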
Objects: None
Required Parameters: None
Optional Parameters:
Examples:
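For example (the app name is just an illustration):

./splunk cmd btool inputs list --debug
./splunk cmd btool savedsearches list --app=search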
btprobe
Queries the fishbucket for checkpoints stored by monitor input. For up-to-date
usage, run btprobe --help.
This method queries the specified BTree for the given key or file.
-d Directory that contains the btree to query. (Required)
-k Hex crc key, or ALL to get all the keys.
--file File to compute the crc from.
-r Rebuild the btree .dat files (for example, in
var/lib/splunk/fishbucket/splunk_private_db/).
(One of -k and --file must be specified.)
This method computes a crc from the specified file, using the given salt if any.
Example: ./btprobe -d
/opt/splunk/var/lib/splunk/fishbucket/splunk_private_db -k
0xe8d117ddba85e714 --validate
Example: ./btprobe -d
/opt/splunk/var/lib/splunk/fishbucket/splunk_private_db --file
/var/log/inputfile --salt SOME_SALT
Example: ./btprobe --compute-crc /var/log/inputfile --salt
SOME_SALT
classify
The "splunk train sourcetype" CLI command calls classify. To call it directly use:
check-rawdata-format
Unpacks and verifies the 'rawdata' component of one or more buckets. 'rawdata' is
the record of truth from which Splunk can rebuild the other components of a
bucket. This tool can be useful if you suspect data integrity problems in a set of
buckets or an index. You can also use it to check journal integrity prior to issuing a
rebuild, if you want to know whether the rebuild can complete successfully before
running it.
fsck
Diagnoses the health of your buckets and can rebuild search data as necessary.
Note: ./splunk fsck repair only works with buckets created by Splunk versions later
than 4.2.
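Run ./splunk fsck --help for the exact modes available in your version. Typical invocations look something like the following (the flag spellings and bucket path are my understanding; confirm with Support before running a repair):

./splunk fsck scan --all-buckets-all-indexes
./splunk fsck repair --one-bucket --bucket-path=/opt/splunk/var/lib/splunk/defaultdb/db/<bucket>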
For more information, read "How Splunk stores indexes" in the Managing
Indexers and Clusters Manual.
locktest
If you run Splunk Enterprise on a file system that is not listed in the system
requirements, the software might run a startup utility named `locktest` to test the
viability of the file system. `Locktest` runs as part of the startup process and tests
whether the file system supports the locking behavior that Splunk Enterprise
requires. If `locktest` fails, then the file system is not suitable for running Splunk
Enterprise. See System Requirements for details.
locktool
Usage :
Acquires and releases locks in the same manner as splunkd. If you were to write
an external script to copy db buckets in and out of indexes, you should acquire
locks on the db, colddb, and thaweddb directories as you are modifying them, and
release the locks when you are done.
parsetest
Usage:
parsetest "<string>"
["<sourcetype>|source::<filename>|host::<hostname>"]
parsetest file <filename> ["<sourcetype>|host::<hostname>"]
Example:
parsetest "10/11/2009 12:11:13" "syslog"
parsetest file "foo.log" "syslog"
pcregextest
pcregextest is a simple utility for testing modular regular expressions: define the
modular regex in the 'mregex' parameter, then define all the subregexes referenced
in 'mregex'. Finally, you can provide a sample string to test the resulting regex
against in 'test_str'.
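A sketch of an invocation (the modular regex syntax shown here, [[subregex:prefix_]], is my reading of the tool; confirm it with Support):

./splunk cmd pcregextest mregex="[[ip:src_]] [[ip:dst_]]" ip="(?<ip>\d+\.\d+\.\d+\.\d+)" test_str="10.1.1.1 10.2.2.2"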
regextest
searchtest
signtool
Sign
Verify
Allows verification and signing of Splunk index buckets. If you have signing set up in
a cold-to-frozen script, signtool lets you verify the signatures of your archives.
tsidxprobe
This will take a look at your time-series index files (or "tsidx files"; they are
appended with .tsidx) and verify that they meet the necessary format
requirements. It should also identify any files that are potentially causing a
problem
Then use tsidxprobe to look at each of your index files with this little script you
can run from your shell (this works with bash):
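The script is not reproduced here; a sketch along these lines should work (it assumes the default datastore path and that tsidxprobe takes a single .tsidx path as its argument; confirm the arguments with Support):

cd $SPLUNK_HOME/bin
find /opt/splunk/var/lib/splunk -name "*.tsidx" | while read f; do
    ./tsidxprobe "$f" >> /tmp/tsidxprobeout.txt 2>&1
done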
(If you've changed the default datastore path, then this should be in the new
location.)
The file tsidxprobeout.txt will contain the results from your index files. You should
be able to gzip this and attach it to an email and send it to Splunk Support.
tsidx_scan.py
(4.2.2+) This utility script searches for tsidx files at a specified starting location,
runs tsidxprobe for each one, and outputs the results to a file.
Example:
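For example (the starting path is a placeholder for your datastore location):

./splunk cmd python tsidx_scan.py /opt/splunk/var/lib/splunk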
walklex
This tool "walks the lexicon" to tell you which terms exist in a given index. For
example, with some search commands (like tstat), the field is in the index; for
other terms it is not. Walklex can be useful for debugging.
Usage:
It recognizes wildcards:
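As I understand the invocation, walklex takes a tsidx file and a term pattern, where the pattern may be empty quotes, an asterisk, or a key::value expression with wildcards (confirm with Support):

./splunk cmd walklex <tsidx_file> ""
./splunk cmd walklex <tsidx_file> "*"
./splunk cmd walklex <tsidx_file> "host::*"
./splunk cmd walklex <tsidx_file> "*::*"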
Empty quotes return all results, and asterisks return all keys or all values (or
both, as in the example above).
Example:
Common front end scenarios
Splunk Free does not support multiple user accounts, distributed searching, or
alerting.
Saved searches that were previously scheduled by other users are still available,
and you can run them manually as required. You can also view, move, or modify
them in Splunk Web or in savedsearches.conf.
Review this topic about object ownership and this topic about configuration file
precedence in the Admin Manual for information about where Splunk writes
knowledge objects such as scheduled searches.
Some apps, like the *nix and Windows apps, write input data to a specific index
(in the case of *nix and Windows, that is the "os" index). If you're not finding data
that you're certain is in Splunk, be sure that you're looking at the right index. You
may want to add the "os" index to the list of default indexes for the role you're
using. For more information about roles, refer to the topic about roles in the
Securing Splunk Enterprise manual. For information about troubleshooting data
input issues, see "Troubleshoot the input process" in the Getting Data In manual.
Your permissions can vary depending on the index privileges or search filters.
Read more about adding and editing roles in Securing Splunk.
Double check the time range that you're searching. Are you sure the events exist
in that time window? Try increasing the time window for your search.
You might also want to try a real-time search over all time for some part of your
data, like a source type or string.
If you are running a report, check the time zone of the user who created the
report.
The indexer might be incorrectly timestamping for some reason. Read about
timestamping in the Getting Data In Manual.
Check that your data is in fact being forwarded. Here are some searches to get
you started. You can run all of these searches, except for the last one, from the
default Search app. You run the last one from the CLI on the forwarder, because a
forwarder does not have a user interface:
Where is Splunk trying to forward data to? From the Splunk CLI issue the
following command:
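For example, run this from $SPLUNK_HOME/bin on the forwarder:

./splunk list forward-server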
If you need to see whether the socket is getting established, look at the forwarder's
splunkd.log for messages like "Connected to idx=<ip>:<port>". On the receiving
side, if you set the log category TcpInputConn to INFO or lower, you can see
messages like "Connection in cooked mode from src=<ip>:<port>".
Check that your search heads are searching the indexers that contain the data
you're looking for. Read about distributed search in the Distributed Search
Manual.
If you have several (3 for Splunk Free or 5 for Enterprise) license violations within
a rolling 30 day window, Splunk will prevent you from searching your data.
Note, however, that Splunk will continue to index your data, and no data will be
lost. You will also still be able to search the _internal index to troubleshoot your
problem. Read about license violations in the Admin Manual.
Are you SURE your time range is correct? (You wouldn't be the first!) Search
over all time to double check.
Are you sure the incoming data is indexed when you expect and not lagging? To
determine whether there is a lag between the event's time stamp and the index
time, manually run the scheduled search with the following syntax added:
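A sketch of the kind of syntax to append (the _time and _indextime fields are standard; the statistics you compute are up to you):

| eval lag_seconds = _indextime - _time | stats avg(lag_seconds) max(lag_seconds)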
Other common problems with scheduled searches are searches getting rewritten,
saved, run incorrectly, or run not as expected. Investigate scheduled searches in
audit.log and the search's dispatch directory: read about these tools in "What
Splunk logs about itself" in this manual.
Check your regex. One way to test regexes interactively is in Splunk using
the rex command.
Do you have privileges for extracting and sharing fields? Read about
sharing fields in the Knowledge Manager Manual.
Are your extractions applied for the correct source, sourcetype, and host?
Additional resources
Have questions? Visit Splunk Answers and see what questions and answers the
Splunk community has.
If you get stuck at any point, contact Splunk Support. Don't forget to send a diag!
Read about making a diag in this manual.
Symptom
Splunk Web displays a yellow banner warning of too many search jobs in the
dispatch directory.
Remedies
First, check that for any real-time all-time scheduled searches, you've configured
alert throttling. Configure throttling in Settings > Searches and Reports. See
Throttle alerts in the Alerting Manual.
Already throttled alerts and still getting the warning? A second step you can take
is to make alert expiration shorter than the default of 24 hours. If you can, change
"alert expiration time" from 24 hours to 1 hour (or less, if you need your alert
triggered very frequently).
In version 6.2 you cannot generate PDFs from dashboards or forms that were
built using advanced XML.
Determine the search string that powers the view panel that is
not showing the expected results
Macros and event-types are very handy knowledge objects, but unless you know
exactly what they do they tend to obscure the way a given search works. For that
reason, I find it easier to expand them manually so that you know *exactly* what
your search is doing.
The question we are now going to try to answer is: Can we reproduce this
manually, outside of the view it was reported in?
This is incorrect. We know that plenty of different users have been running
searches on this server over the past 24 hours.
The next step is simple: Let's compare the results generated by the search and
its multiple evals against the source events. The first thing we notice is that
looking at the last command of the search ("chart ... by user") and at the values
of the "user" field from the field picker, we expect 11 different rows (as many as
there are distinct values for the "user" field).
We should see where the "user" field is referenced in the search and possibly
modified. This really only happens twice:
The first command is not a good suspect as it couldn't possibly result in the
squashing of the user field down to "splunk-system-user". The second command
is quite interesting though: With "first(user) AS user ... by search_id", we
essentially squash the value of "user" for each search (uniquely referenced by
the search_id field) to the *most recent* value of the field (that is what first()
does: it looks for the first value of the field encountered while searching => the
most recent).
Dig deeper
In order to drill down to the source of the problem, let's pick *one* example. A
good one if possible: A search that we know was run by an actual user. I'm going
to go with SID=1338858385.644, which was run by Ed at 6:06pm today.
This search returns one result, with an inadequate value for user as we expect:
Event #1:
source",index,index_name,is_configured,lastExceedDate,license_size,maxColdDBSizeGB,maxDa
apiStartTime='ZERO_TIME', apiEndTime='ZERO_TIME',
savedsearch_name=""][n/a]
Event #1 is the oldest event and is logged at the time that the search is launched.
Note that the user field is correct => esastri
Event #2 is the newest event and is logged at the time that the search completes,
hence reporting things such as "total_run_time" or "event_count". Note that the
user field is *incorrect* => splunk-system-user
Conclusion
The bug is: Audit events reporting that a search has finished are *all* logged with
"user=splunk-system-user". This seems incorrect and is a deviation from
previous behavior.
The workaround is: replace "stats first(user) AS user ... by search_id" with "stats
last(user) AS user ... by search_id".
Field search
Why does Splunk add junk to the front of a search when a field search is defined
before the first pipe?
litsearch index=checkpoint ( ( ( sourcetype=opsec_audit ) AND ( ( ( ( (
( sourcetype=WinRegistry ) AND ( ( registry_type=accept ) ) ) OR ( (
sourcetype=fs_notification ) AND ( ( action=accept ) ) ) ) OR (
vendor_action=accept ) ) ) ) ) ) OR ( ( ( ( sourcetype=fe_json ) AND (
( "alert.action"=accept ) ) ) OR ( ( sourcetype=fe_xml ) AND ( (
"alerts.alert.action"=accept ) ) ) OR ( (
source="/nsm/bro/logs/current/notice.log" ) AND ( (
EXTRA_FIELD_18=accept ) ) ) ) OR ( action=accept ) ) | litsearch
index=checkpoint action=accept | fields keepcolorder=t "*" "_bkt" "_cd"
"_si" "host" "index" "linecount" "source" "sourcetype" "splunk_server" |
prehead limit=1 null=false keeplast=false
A slight modification of the search to put the field search after the first pipe
makes the junk go away:
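The modified search is not reproduced here; based on the normalized output above, it presumably looks something like this:

index=checkpoint | search action=accept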
This is true even when he value "accept" is not before the first pipe.
Why does Splunk insert junk into the normalized search when a field search
appears before the first pipe? The junk increases search time, and in some cases
involving "NOT" or "!" it can return no results.
Resolution
Suppose sourcetype bob has an automatic lookup defined in props.conf:

[bob]
LOOKUP-actions = boblookup someinputfield OUTPUT action

And suppose the lookup's CSV file contains:

someinputfield,action
potato,deny
tomato,accept
blueberry,accept
If you search on action=accept, then Splunk can look through all of its config files
and reason out something like this:

Sourcetype bob has a lookup that outputs a field named action based on this
CSV file. I see here in the CSV file that action=accept is returned whenever
someinputfield=blueberry or someinputfield=tomato. So there is an equivalency
here: action=accept is effectively (someinputfield=tomato OR someinputfield=blueberry).
This is the fundamental step of a reverse lookup - the goal is to attempt to make
automatic lookup fields searchable. This is a necessary evil for CIM-compliant
apps like Enterprise Security because of how often they use automatic lookups to
normalize field names and values.
There's a whole longer discussion to be had about the performance impacts of
this. While it made your example situation slower, there are many other
counterexamples where this approach (up to a point) speeds things up.
Common back end scenarios
Here's a Community Wiki article about bucket rotation and retention with specific
recommendations and examples.
If your Splunk instance will not start, a possible cause is that one or more of your
index buckets is corrupt in some way. Contact Support; they will help you
determine if this is indeed the case and if so, which bucket(s) are affected. Then,
run this command:
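The exact command depends on what Support finds; a typical single-bucket repair looks something like this (the flags and bucket path are placeholders to confirm with Support first):

./splunk fsck repair --one-bucket --bucket-path=/opt/splunk/var/lib/splunk/defaultdb/db/<affected_bucket>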
Recovering and rebuilding buckets
If so, you might need to adjust your server's ulimit settings. Ulimit controls the
resources available to a Linux shell and to the processes that the shell has started.
A machine running Splunk Enterprise needs higher limits than are provided by default.
ulimit -a
Or restart Splunk Enterprise and look in splunkd.log for events mentioning ulimit:
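For example, you can pull those events out of the log with grep (the path assumes a default installation):

grep -i ulimit $SPLUNK_HOME/var/log/splunk/splunkd.log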
You probably want your new values to stay set even after you reboot. To
persistently modify the values, edit settings in /etc/security/limits.conf
The file size (ulimit -f). The size of an uncompressed bucket file can be
very high.
The data segment size (ulimit -d). Increase the value to at least 1 GB =
1073741824 bytes.
The number of open files (ulimit -n), sometimes called the number of
file descriptors. Increase the value to at least 8192 (depending on your
server capacity).
The max user processes (ulimit -u). Increase to match the file
descriptors. This limit is important for the number of http threads.
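For example, a minimal sketch of /etc/security/limits.conf entries (this assumes splunkd runs as a user named splunk; adjust the values to your server capacity):

splunk soft nofile 8192
splunk hard nofile 8192
splunk soft nproc 8192
splunk hard nproc 8192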
Another value that you might need to modify on an older system (but not on most
modern systems) is the system-wide file size, fs.file-max, in /etc/sysctl.conf.
Why must you increase ulimit to run Splunk software? Well, you might
concurrently need file descriptors for every forwarder socket and every
deployment client socket. Each bucket can use 10 to 100 files, every search
consumes up to 3, and then consider every file to be indexed and every user
connected.
A group of search heads can schedule more concurrent searches than some
peers are capable of handling with their CPU core count.
Symptoms
On the search head, you might see yellow banners in quick succession warning
that a peer or peers are 'Down' due to Authentication Failed and/or Replication
Status Failed. Typically this can happen a few times a day, but the banners
appear and disappear seemingly randomly.
On the search head, splunkd.log will have messages like:
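The exact wording varies by version; the warnings to look for are along these lines (approximations, not verbatim log lines):

WARN DistributedPeerManager - Unable to distribute to peer named <peer> at uri https://<peer>:8089 because authentication failed.
WARN DistributedPeerManager - Unable to distribute to peer named <peer> at uri https://<peer>:8089 because replication was unsuccessful. replicationStatus Failed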
The symptoms can appear with or without other Splunk features such as search
head pooling and index replication being enabled. The symptoms are more
common in environments with two or more search heads.
Diagnosis
To properly diagnose this issue and proceed with its resolution, you must deploy
and run the SoS technology add-on (TA) on all indexers/search peers. In
addition, install the SoS app itself on a search head. Once the TA has been
enabled and has begun collecting data, the next time the issue occurs, you will
have performance data to validate the diagnosis.
1. To find an auth timeout on the peer named in the search head banner, run this search:
index=_internal source=*splunkd_access*
splunk_server="search_peer_name" auth | timechart max(spent)
2. Examine the load average just before the auth timeout and check for a
dramatic increase.
Now that you've established the time frame in step 1, examine metrics.log's load
average over the time frame to determine whether the load increased
significantly just before the timeouts were triggered. Typically the total time frame
is about 2 minutes.
Use the SoS view for CPU/Memory Usage (SOS > Resource Usage > Splunk
CPU/Memory Usage) to review the peak resource usage on the search peer
during the time scoped above. Look at the Average CPU Usage panel. If you
have too many concurrent searches, you will see that the peer uses more than
the available percentage of CPU per core. For example: A healthy 8 core box will
show no more than 100% x 8 cores = 800% average CPU usage. In contrast, a
box overtaxed with searches typically shows 1000% or more average CPU
usage during the time frame where the timeouts appear.
For more information about your CPU and memory usage, you can run the useful
search described below.
Remedies
Examine the concurrent search load. Typically there are searches with dubious
scheduling choices and/or searches that are scoped in inefficient ways.
Use the SoS Dispatch Inspector view to learn about the dispatched search
objects, the app they were triggered from, and their default schedule. Or you can
find this information using the useful search provided below.
Once you've identified your pileup of concurrent searches, get started on this list
of things you should do. All of them are good practices.
If a real-time search exists only because it triggers a script, configure this task
instead as a scheduled search set to run 10 minutes in the past (to address
potential source latency) over a 5-minute window, combined with a cron offset.
This offers the same effect without tying down a CPU core across all peers, all the
time. Read more about expected performance and known limitations of real-time
searches and reports in the Search Manual.
Re-scope the search time for actual information needs. For example:
Scheduled searches that run every 15 minutes over a 4 hour time frame
are a waste of limited resources. Unless you have a very good reason why
a search should look back an additional 3 hours and 45 minutes on every
search (such as extreme forwarder latency), it's a waste of shared
resources. Read more about alerts in the Alerting Manual.
Additionally, there's the option to use limits.conf to lower the search
concurrency of all the search heads. Note that if you do only this step, you
will get a different set of banners (about reaching the max number of
concurrent searches) and you will still not be able to run concurrent
searches. But if you do some of the other steps, too, you might want to
configure the search concurrency like this:
[search]
base_max_searches = 2
# Defaults to 6
max_searches_per_cpu = 1
# Defaults to 1
max_rt_search_multiplier = 1
# Defaults to 1 in 6.0, in 5.x defaults to 3
[scheduler]
max_searches_perc = 20
# Defaults to 50
auto_summary_perc = 10
# Defaults to 50
[distributedSearch]
statusTimeout = 30
# Defaults to 10
authTokenConnectionTimeout = 30
# Default is 5
authTokenSendTimeout = 60
# Default is 10
authTokenReceiveTimeout = 60
# Default is 10
Useful searches
Search concurrency
If you have SoS installed on your search head, you can use this search to
examine search concurrency.
If you have the SoS App installed on the search head, you can find CPU and
memory usage for all search processes at one point based on the intersection of
the "ps" run interval and maximum load:
Event indexing delay
Symptoms
Events collected from a forwarder or from a log file are not yet searchable on
Splunk. Even though the time stamps of the events are within the search time
range, a search does not return the events. Later, a search over the same time
range returns the events.
Diagnosis
Quantify the problem by measuring how long your Splunk deployment is taking to
make your data searchable.
To measure the delay between the time stamp of the events and the indexing
time (the time that the indexer receives and processes the events), use the
following method:
3. Look at the delay per host for the Splunk internal logs.
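A sketch of that kind of measurement for the internal logs (the time range and statistics are up to you); run the same calculation against your own indexes to compare:

index=_internal source=*splunkd.log* | eval delay = _indextime - _time | timechart max(delay) by host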
Run these searches in real-time (all time) mode for a little while to see the
events that are being received right now. In addition to real-time searches, you can
run historical searches to compare a day this week to a day from a previous week.
Compare the delay from your events with the delay from the internal Splunk logs.
If all the logs are delayed, including the internal logs, then the delay is a
forwarding issue.
If some sources are delayed but not others, this indicates a problem with
the input.
As you implement each fix below, you can measure how well it's working by
running these searches again.
Root causes
There are several possible root causes. Some might not be applicable to your
situation.
To check the forwarder default thruput limit, on the command line in the splunk
folder type:
cd $SPLUNK_HOME/bin
./splunk cmd btool limits list thruput --debug
/opt/splunkforwarder/etc/apps/SplunkUniversalForwarder/default/limits.conf
[thruput]
/opt/splunkforwarder/etc/apps/SplunkUniversalForwarder/default/limits.conf
maxKBps = 256
/opt/splunk/etc/system/default/limits.conf [thruput]
/opt/splunk/etc/system/default/limits.conf maxKBps = 0
To verify in the forwarder: When the thruput limit is reached, monitoring pauses
and the following events are recorded in splunkd.log:
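The exact text varies by version, but the message comes from the thruput processor and reads roughly like this (an approximation, not a verbatim log line):

INFO ThruputProcessor - Current data throughput (256 kb/s) has reached maxKBps. As a result, data forwarding may be throttled. Consider increasing the value of maxKBps in limits.conf.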
To verify how often the forwarder is hitting this limit, check the forwarder's
metrics.log. (Look for this on the forwarder because metrics.log is not forwarded
by default on universal and light forwarders.)
cd $SPLUNK_HOME/var/log/splunk
grep "name=thruput" metrics.log
Remedy
Create a custom limits.conf with a higher limit or no limit. The configuration can
go in system/local, or in an app that takes precedence over the existing limit.
[thruput]
maxKBps = 512
[thruput]
maxKBps = 0
Notes:
Unlimited speed can cause higher resource usage on the forwarder. Keep
a limit if you need to control the monitoring and network usage.
Restart to apply.
Verify the result of the configuration with btool.
Later, verify in metrics.log that the forwarder is not reaching the new limit
constantly.
Once the thruput limit is removed, if the events are still slow, use the metrics
method to check if the forwarders are hitting a network limit. Compare with other
forwarders on different networks or different VPN tunnels.
Compressed files (like .gz and .zip) are handled by the Archive processor, and
are serialized. Therefore if you index a large set of compressed files, they will
come through the indexer one after the other. The second file will only come
through after the first one has been indexed.
Remedy
Use this search to verify the source type, the time stamp detected (_time), the
time of the user on the search head (now), and the time zone applied
(date_zone).
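The original search is not reproduced here; a sketch that surfaces the same fields (replace the index and source type with your own):

index=<your_index> sourcetype=<your_sourcetype> | eval now=now() | convert ctime(now) | table _time now date_zone sourcetype source host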
Notes:
The _time is converted to the user profile time zone configured on the
search head at search time.
The date_zone is applied at index time on the indexer.
Remedy
Fix the time zone and time stamp extraction. Take a sample of the data and test
it with data preview.
If the only events delayed are Windows event logs, and the forwarder is on a busy
domain controller with a high number of events per second, you might be
encountering the Windows event log collection performance limit in Splunk 5.x and earlier.
Or, if the forwarder was recently started, it might still be collecting the older events
first.
Remedy
Splunk can't get data from remote machines
When Splunk can index events on the local machine, but can't get data from
remote machines using WMI, authentication or network connectivity is often the
reason. Splunk requires a user account with valid credentials for the Active
Directory (AD) domain or forest in which it's installed in order to collect data
remotely. It also requires a clear network path to the machine from which it gets
data, unblocked by firewalls on either the source or target machines.
The first thing to do is to make sure that Splunk is installed as a domain user. If
this requirement isn't met, Splunk won't be able to get data remotely even if the
network is functioning.
2. Run the SC command to query the Service Control Manager about the
splunkd and splunkweb services.
C:\> sc qc splunkd
[SC] QueryServiceConfig SUCCESS
SERVICE_NAME: splunkd
TYPE : 10 WIN32_OWN_PROCESS
START_TYPE : 2 AUTO_START
ERROR_CONTROL : 1 NORMAL
BINARY_PATH_NAME : "C:\Program Files\Splunk\bin\splunkd.exe"
service
LOAD_ORDER_GROUP :
TAG : 0
DISPLAY_NAME : Splunkd
DEPENDENCIES :
SERVICE_START_NAME : LocalSystem
The SERVICE_START_NAME field tells you the user that Splunk is configured to run
as. If this field shows LocalSystem, then Splunk is not configured to run as a
domain user. Uninstall Splunk, then reinstall it and make sure to specify "Other
user" during the setup process.
Note: You can also determine which user Splunk is configured to run as by using
the Services control panel.
Review the splunkd.log file
The following table shows the most common errors encountered when
connecting to WMI providers:
You can get even more detailed information about what is causing the errors by
enabling debug logging in Splunk's logging engine.
Note: After you have confirmed the cause of the error, be sure to turn debug
logging off.
To enable debugging for WMI-based inputs, you must set two parameters:
1. Edit log.cfg in %SPLUNK_HOME%\etc. In the [splunkd] stanza, add the following parameter:

category.ExecProcessor=DEBUG

2. Edit log-cmdline.cfg in %SPLUNK_HOME%\etc. Add the following parameter:

category.WMI=DEBUG

Note: You can place this attribute/value pair anywhere in log-cmdline.cfg, as long as it is
on its own line. log-cmdline.cfg does not use stanzas.
3. Restart Splunk:
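For example, from a command prompt (the path assumes a default installation):

cd "C:\Program Files\Splunk\bin"
splunk restart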
4. Once Splunk has restarted, let it run for a few minutes until you see debug log
events coming into Splunk.
5. Once Splunk has collected enough debug log data, send a diag to Splunk
Support:
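For example, from the same bin directory (see the diag information earlier in this manual for options):

splunk diag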
Important: Once you finish troubleshooting, revert back to the default settings:

In log.cfg, under [splunkd]:

category.ExecProcessor=WARN

In log-cmdline.cfg:

category.WMI=ERROR

Note: You can also simply remove these entries from the files.
Note: Any changes made to log.cfg are overwritten when you upgrade Splunk.
Create a log-local.cfg in %SPLUNK_HOME%\etc to avoid this problem.
If you see HRESULT error entries in the splunkd.log, use the WBEMTEST utility to
confirm the error outside of Splunk.
5. In the Namespace field of the Connect window, type in the namespace of the
server that is experiencing errors.
Note: You must type in the full path of the namespace. For example, if the server
you are attempting to connect to is called ADLDBS01, you must type in
\\ADLDBS01\root\cimv2 (including the backslashes).
6. Click Connect.
Note: You should be able to connect to the server without needing to supply
credentials. If you are prompted for credentials, then the Splunk user is not
correctly configured to access WMI.
7. Once you are connected to the server, set your WMI connection mode by
selecting one of the radio buttons under Method Invocation Options, in the lower
right corner of the WBEMTEST window:
8. Click Query.
Following is a WQL statement that you can test WMI connections with:
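For example, this queries the physical disk performance counters (any WMI class you can access works equally well as a test):

SELECT * FROM Win32_PerfFormattedData_PerfDisk_PhysicalDisk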
Check Windows Firewall
If Windows Firewall (or any other firewall software) is running on either the
source or target machine, Splunk might be blocked from getting data through
WMI providers. Make sure that you explicitly allow WMI through on the firewalls
on both machines. You can also disable Windows Firewall, but this is not
recommended by Splunk or Microsoft.
When Splunk is unable to get data from the local machine through WMI
providers, this might be because WMI is experiencing issues under load. When
this happens, try restarting the Windows Management Instrumentation (winmgmt)
service from within the Services control panel, or by using the sc command-line
utility.
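For example, from an elevated command prompt (Windows might first ask you to stop services that depend on WMI):

sc stop winmgmt
sc start winmgmt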
WMI can occasionally cause the splunk-wmi.exe process to crash. Splunk will
spawn a new process when this happens (you can tell by the changed process
ID).
While there is no guaranteed fix for this issue, you can reduce the number
of crashes by reducing the number of servers you are monitoring through
WMI with any given Splunk instance. Limit the number of WMI-based
inputs per instance to 80 or fewer.
If you monitor the same subset of WMI providers on large numbers of
machines, you can run into WMI memory constraints on the monitoring
server. This can also cause crashes. Limit the number of WMI-based data
inputs per server monitored through WMI. It's best to reduce the total
number of WMI connections per instance to 120 or fewer on 32-bit
Windows servers, and 240 or fewer on 64-bit Windows servers.
Consider using universal forwarders to get your data. You can either
install universal forwarders on a few machines and get data from other
machines through WMI, or you can put universal forwarders on all remote
machines.
Splunk makes what are known as semisynchronous calls to WMI providers. This
means that when Splunk makes a call to WMI, it continues running while WMI
deals with the request.
Semisynchronous mode offers the best balance of resource usage and security
on the computer making the request. It differs from the faster asynchronous
mode, but is more secure due to the way that the system handles retrieval of the
WMI objects. Both of these modes are faster than synchronous mode, which
forces programs making that kind of WMI request to wait until WMI returns the
data.
When WMI is dealing with a large number of requests, you might notice a slower
response because memory usage on the system increases until the retrieved
WMI objects are no longer needed by Splunk (after indexing).
More information about how WMI calls are made is available at "Calling a
Method", http://msdn.microsoft.com/en-us/library/aa384832(VS.85).aspx on
MSDN.
To test WMI, you can run the splunk-wmi.exe command manually with a desired
query and/or namespace to see the output that it produces.
Caution: When running this command, be sure to temporarily change Splunk's
data store directory (the location that SPLUNK_DB points to), so that you do not
miss any WMI events. To change Splunk's database store, refer to "Test access
to WMI providers" in the Getting Data In Manual.
The following output shows a failure to connect to the desired WMI provider:
ERROR WMI - Error occurred while trying to retrieve results from a WMI
query (error="Specified class is not valid." HRESULT=80041010) (.:
select * FROM Win32_PerfFormattedData_PerfDisk_PhysicalDisk_typo)
DiskWriteBytesPersec=0
DiskWritesPersec=0
Frequency_Object=NULL
Frequency_PerfTime=NULL
Frequency_Sys100NS=NULL
Name=0 D: C:
PercentDiskReadTime=0
PercentDiskTime=0
PercentDiskWriteTime=0
PercentIdleTime=98
SplitIOPerSec=0
Timestamp_Object=NULL
Timestamp_PerfTime=NULL
Timestamp_Sys100NS=NULL
wmi_type=unspecified
---splunk-wmi-end-of-event---
20090904144105.000000
AvgDiskBytesPerRead=0
AvgDiskBytesPerTransfer=0
AvgDiskBytesPerWrite=0
AvgDiskQueueLength=0
AvgDiskReadQueueLength=0
AvgDiskWriteQueueLength=0
AvgDisksecPerRead=0
AvgDisksecPerTransfer=0
AvgDisksecPerWrite=0
Caption=NULL
CurrentDiskQueueLength=0
Description=NULL
DiskBytesPersec=0
DiskReadBytesPersec=0
DiskReadsPersec=0
DiskTransfersPersec=0
DiskWriteBytesPersec=0
DiskWritesPersec=0
Frequency_Object=NULL
Frequency_PerfTime=NULL
Frequency_Sys100NS=NULL
Name=Total
PercentDiskReadTime=0
PercentDiskTime=0
PercentDiskWriteTime=0
PercentIdleTime=98
SplitIOPerSec=0
Timestamp_Object=NULL
Timestamp_PerfTime=NULL
Timestamp_Sys100NS=NULL
wmi_type=unspecified
---splunk-wmi-end-of-event---
Clean shutdown completed.
See the Admin Manual for information on getting started for Windows admins.
Problems with collection and indexing of Windows event logs generally fall into
two categories:
Event logs are not collected from the server. This is usually due to
either a local configuration problem or, in the case of remote event log
collection, a network, permissions, or authentication issue.
Event logs are collected from the server, but information within the
event log is either missing or incorrect. This is usually due to problems
associated with a particular event log channel, or because of the methods
used to collect data from those channels.
When you have problems getting data into your local Splunk instance, try these
tips to fix the problem:
Make sure that the desired event log channels are selected in Splunk Web
or properly configured in inputs.conf.
Make sure to select fewer than 64 event log channels per event log input.
Make sure that you are not attempting to index exported event logs that
are incompatible with the indexing system (for example, attempting to
index event logs exported from a Windows Server 2008 computer on a
Windows XP computer will result in missing log data).
Make sure that, if you are monitoring non-standard event log channels,
that you have the appropriate dynamic linked libraries (DLLs) that are
associated with that event log channel. This is particularly important when
indexing exported log files from a different computer.
Troubleshooting issues with event logs collected remotely
When you experience issues getting event logs from remote Windows servers,
try these solutions to fix the problem:
Make sure that your Splunk user is configured correctly for WMI.
Make sure that your Splunk user is valid, and does not have an expired
password.
Make sure that the Event Log service is running on both the source and
target machines.
Make sure that your Active Directory (AD) is functioning correctly.
Make sure that your computers are configured to allow WMI data between
them.
Make sure that your event logs are properly configured for remote access.
See the Admin Manual for information on getting started for Windows admins.
This topic provides solutions to common issues encountered when working with
the Windows version of Splunk. It's divided into several subtopics:
General issues
WMI issues
Forwarder issues
General issues
Splunk fails to start
There are several factors that might prevent Splunk from starting properly.
Whether it didn't start automatically, or you are having problems manually
starting it, here are some solutions to try:
Make sure that your system meets the Splunk system requirements.
These requirements differ depending on the type of Splunk you're trying to
run (full instance versus forwarder).
Make sure that the Splunk services are enabled. Go into Control Panel
and check that the splunkd and splunkweb services have their Startup
type set to "Automatic."
Check file and security permissions. When you install Splunk as a user
other than Local System, Splunk does not have full permissions to run on
the system by default. Try these solutions to get Splunk back up and
running:
Make sure the Splunk user is in the local Administrators group on
the machine.
Make sure that the Splunk user has Full Control permissions for the
entire %SPLUNK_HOME% directory, and is also the owner of all files and
subdirectories in %SPLUNK_HOME%. You must explicitly define this in
the Security properties of the %SPLUNK_HOME% directory.
Be sure to read the "Considerations for deciding how to monitor
remote Windows data" for additional information about permissions
required to run Splunk as a domain user.
No data is received
Splunk for Windows operates similarly to Splunk for other operating systems. If
you're not getting data and it's not because of a permissions or network
connectivity issue, then there is likely something happening within Splunk, such
as an incorrectly configured input.
If you're having trouble collecting Windows event logs, review "Troubleshooting
Windows event logs."
WMI issues
This section contains information about problems encountered when using WMI
providers to gather data from remote machines.
of that checklist follows:
You can also see additional information about Splunk's WMI operations by
turning on debug logging. To turn on debug logging, follow the instructions in
"Troubleshooting WMI Logging" in the Getting Data In Manual.
WMI can sometimes cause the Splunk WMI process (splunk-wmi.exe) to crash.
If that happens, Splunk will start another WMI process immediately, but you
might see crash files in your %SPLUNK_HOME%\var\log\splunk directory.
Reduce the amount of WMI inputs on each Splunk instance. For best
results, limit the number of WMI connections per instance to 120 or fewer
on 32-bit Windows systems, or 240 or fewer for 64-bit systems. Note that
each server monitored can use more than one WMI connection,
depending on the amount of inputs configured for each server.
Use a universal forwarder to get data. Splunk recommends that you use
a universal forwarder to send data from remote machines to an indexer.
Universal forwarders are more scalable and reliable than WMI in nearly all
cases, and require far less security management than WMI does.
Splunk makes what are known as semisynchronous calls to WMI providers. This
means that when Splunk makes a call to WMI, it continues running while WMI
deals with the request.
Semisynchronous mode offers the best balance of resource usage and security.
It differs from the faster asynchronous mode, but is more secure due to the way
that the system handles retrieval of the WMI objects. Both of these modes are
faster than synchronous mode, which forces programs making that kind of WMI
request to wait until WMI returns the data.
When WMI is dealing with a large number of requests, you might notice a slower
response because memory usage on the system increases until the retrieved
WMI objects are no longer needed by Splunk (after indexing).
More help
If you are still having issues, read "Troubleshooting common issues with Splunk
and WMI".
Forwarder Issues
This section provides help for users who use Splunk's forwarding and receiving
capabilities, including the new universal forwarder included with Version 4.2 and
later.
If you're using a forwarder to send data to a receiver and the receiver isn't getting
any data, there are a number of things you can try to fix the problem:
Make sure the configuration files on your forwarder are properly
formatted.
Review your configuration files carefully, and check for spelling and
syntax errors.
Stanza names must always be bracketed with square brackets ([
]). Don't use curly braces or parentheses.
The syntax for remote performance monitoring differs significantly
from local performance monitoring. Be sure to review "Monitor
Windows performance" in the Getting Data In Manual for specific
information.
Once you have confirmed any or all of these, restart the universal forwarder to
ensure it gets a new authentication token from a domain controller.
Note: When assigning access, it's best practice to use the least permissive
security paradigm. This entails denying all access to a resource initially, and only
then granting access for specific users as necessary.
See the Admin Manual for information on getting started for Windows admins.
Have additional questions or need more help? Be sure to visit Splunk Answers
and see what questions and answers the Splunk community has around
troubleshooting Splunk on Windows.
SuSE Linux search error
Users running Splunk on a SuSE server may see an error message when executing
a search. Alternatively, the dashboard just won't display properly.
To resolve this issue, edit the /etc/mime.types file. Delete (or comment out)
these two lines:

text/x-xsl xsl
text/x-xslt xslt xsl

Then change the line:

text/xml xml

to:

text/xml xml xsl
With these changes in place, restart Splunk and clear your browser cache.
Note: If you are using a proxy, you will need to flush that as well.
Garbled events
Symptom
Explanation
Many files that are not in a properly encoded format are still human readable. Many
applications auto-trim text or special characters, including nulls, so it is important to
know what is actually in the log file, not just what the application displays.
Solution
To correct this, set the charset in props.conf for this input to the appropriate
character set (using the CHARSET attribute).
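For example, a sketch of a props.conf stanza (the source type name is a placeholder):

[my_sourcetype]
CHARSET = ISO-8859-1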
If you don't know the encoding of your source file, and have access to a *nix
machine, you can use the "file" command:
file sample.log
sample.log: UTF-8 Unicode English text
In this example, the encoding is UTF-8. Note, though, that Splunk accepts many
other encodings. Find a list of supported character sets, and instructions on
specifying a charset, in "Configure character set encoding" in the Getting Data In
Manual.
Explanation
Sometimes non-UTF-8 logs are not processed because they are seen as binary
in the binary check process.
Solution
Set the charset in props.conf for this input to the appropriate charset. This error
shows up in splunkd.log on the instance where props.conf needs to be specified. So, if
you're using a forwarder, the forwarder's splunkd.log is where you'll find the error,
and the forwarder is also where you need to configure props.conf.
Performance degraded in a search head pooling
environment
In a pool environment, you're noticing that searches are taking longer than they
used to. How do you figure out where your performance degradation is coming
from? This topic suggests a few tests you can run.
On the search head, in the pooled location, at the *nix command line,
measure the time it takes to find the files in .../dir and then count them.
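For example (the directory is a placeholder for your pooled storage location):

time find /path/to/pooled/storage -type f | wc -l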
If you don't have shell access, other tests you can run include examining the
search statistics (searchstats) in splunkd.log.
As a general guideline, any search taking over 30 seconds to return is a slow search.
If the only slow things are searches (but not, for example, bundle
replication), then your problem might be with your mount point. Run some
commands outside of Splunk Enterprise to validate that your mount point
is healthy.
If accessing knowledge objects takes a long time, search in metrics.log for
the load_average. Look in metrics.log for 2-5 minutes before and after the duration
of the slow-running search.
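A sketch of such a search (load_average is reported in metrics.log events):

index=_internal source=*metrics.log* load_average=* | timechart max(load_average)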
If you see that this is high, and you have SoS installed, refer to the same period of
time and look at the CPU graphs in SoS to make sure you're not seeing a
system-wide load problem.
If it's a search load problem, the CPU usage will be high for the duration of the
slow search.
If you have the Splunk on Splunk app, check the search load view. If you have
the Distributed Management Console, check the Search Activity views.
Consider search scheduling. Have you scheduled many searches to run at the
same time? Use the Distributed Management Console Search Activity view to
identify search scheduling issues. If you've identified issues, move some of your
scheduled searches to different minutes past the hour.
HTTP thread limit issues
When you run Splunk Enterprise in a fashion that uses lots of HTTP connections
for Representational State Transfer (REST) operations (for example, a
deployment server in a large distributed environment), you might encounter
undesirable behavior, including but not limited to logging of errors in splunkd.log
like the following:
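The exact wording varies, but the errors look roughly like this (an approximation, not a verbatim log line):

WARN HttpListener - Can't handle request for <REST endpoint>, max thread limit for REST HTTP server is <N>, threads already in use is <N>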
If Splunk Enterprise runs out of HTTP sockets or threads, it can't complete REST
calls to its backend and any such calls fail.
Override automatic socket and thread configuration
1. Edit server.conf in $SPLUNK_HOME/etc/system/local (or the appropriate app location).
2. In the [httpServer] stanza, set the maxThreads attribute to specify the number
of threads for REST HTTP operations that Splunk Enterprise should use.
3. Set the maxSockets attribute to specify the number of sockets that should be
available for REST HTTP operations.
The following example sets the number of HTTP threads to 100000 and the
number of sockets to 50000:
[httpServer]
maxThreads=100000
maxSockets=50000