Anomaly Detection of Web-Based Attacks
puts. This phase is divided into two steps. During the first step, the system creates profiles for each server-side program and its attributes. During the second step, suitable thresholds are established. This is done by evaluating queries and their attributes using the profiles created during the previous step. For each program and its attributes, the highest anomaly score is stored and then, the threshold is set to a value that is a certain, adjustable percentage higher than this maximum. The default setting for this percentage (also used for our experiments) is 10%. By modifying this value, the user can adjust the sensitivity of the system and perform a trade-off between the number of false positives and the expected detection accuracy. The length of the training phase (i.e., the number of queries and attributes that are utilized to establish the profiles and the thresholds) is determined by an adjustable parameter.
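A minimal sketch of this threshold-establishment step is shown below (Python; all function and variable names are illustrative assumptions, not those of the actual system, and the anomaly scores are assumed to be higher for more anomalous inputs):

```python
# Sketch of the threshold step described above: the threshold is the
# maximum anomaly score observed during the second training step,
# increased by an adjustable percentage (10% by default).

def establish_threshold(scores, margin=0.10):
    """Return a detection threshold from training anomaly scores."""
    return max(scores) * (1.0 + margin)

# Example: anomaly scores produced for one program's attribute during training.
training_scores = [0.2, 0.5, 1.3, 0.9]
print(establish_threshold(training_scores))  # 1.43; higher scores are reported
```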
Once the profiles have been created – that is, the models have learned the characteristics of normal events and suitable thresholds have been derived – the system switches to detection mode. In this mode, anomaly scores are calculated and anomalous queries are reported.

The following sections describe the algorithms that analyze the features that are considered relevant for detecting malicious activity. For each algorithm, an explanation of the model creation process (i.e., the learning phase) is included. In addition, the mechanism to derive a probability value p for a new input element (i.e., the detection phase) is discussed.

4.1 Attribute Length

In many cases, the length of a query attribute can be used to detect anomalous requests. Usually, parameters are either fixed-size tokens (such as session identifiers) or short strings derived from human input (such as fields in an HTML form). Therefore, the length of the parameter values does not vary much between requests associated with a certain program. The situation may look different when malicious input is passed to the program. For example, to overflow a buffer in a target application, it is necessary to ship the shell code and additional padding, depending on the length of the target buffer. As a consequence, the attribute contains up to several hundred bytes.

The goal of this model is to approximate the actual but unknown distribution of the parameter lengths and detect instances that significantly deviate from the observed normal behavior. Clearly, we cannot expect that the probability density function of the underlying real distribution will follow a smooth curve. We also have to assume that the distribution has a large variance. Nevertheless, the model should be able to identify significant deviations.

4.1.1 Learning

We approximate the mean µ̂ and the variance σ̂² of the real attribute length distribution by calculating the sample mean µ and the sample variance σ² for the lengths l1, l2, . . . , ln of the parameters processed during the learning phase (assuming that n queries with this attribute were processed).
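This estimation step can be sketched as follows (illustrative code, not the system's actual implementation):

```python
# Sketch of the length model's learning phase: estimate the sample mean
# and sample variance of the observed attribute lengths l1, ..., ln.

def learn_length_model(lengths):
    n = len(lengths)
    mean = sum(lengths) / n
    var = sum((l - mean) ** 2 for l in lengths) / n
    return mean, var

# Lengths of one attribute observed across training queries.
mu, sigma2 = learn_length_model([6, 7, 6, 8, 7])
print(mu, sigma2)  # 6.8 0.56
```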
4.1.2 Detection

Given the estimated query attribute length distribution with parameters µ and σ² as determined by the previous learning phase, it is the task of the detection phase to assess the regularity of a parameter with length l. The probability of l can be calculated using the Chebyshev inequality shown below.

p(|x − µ| > t) < σ² / t²    (2)

The Chebyshev inequality puts an upper bound on the probability that the difference between the value of a random variable x and µ exceeds a certain threshold t, for an arbitrary distribution with variance σ² and mean µ. This upper bound is strict and has the advantage that it does not assume a certain underlying distribution. We substitute the threshold t with the distance between the attribute length l and the mean µ of the attribute length distribution (i.e., |l − µ|). This allows us to obtain an upper bound on the probability that the length of the parameter deviates more from the mean than the current instance. The resulting probability value p(l) for an attribute with length l is calculated as shown below.

p(|x − µ| > |l − µ|) < p(l) = σ² / (l − µ)²    (3)

This is the value returned by the model when operating in detection mode. The Chebyshev inequality is independent of the underlying distribution and its computed bound is, in general, very weak. Applied to our model, this weak bound results in a high degree of tolerance to deviations of attribute lengths given an empirical mean and variance. Although such a property is undesirable in many situations, by using this technique only obvious outliers are flagged as suspicious, leading to a reduced number of false alarms.
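A sketch of this detection step is given below; capping the returned value at 1 is our assumption, since the Chebyshev bound in Equation 3 exceeds 1 whenever the deviation |l − µ| is smaller than σ:

```python
# Sketch of Equation 3: p(l) = sigma^2 / (l - mu)^2, capped at 1.

def length_probability(l, mu, sigma2):
    d = (l - mu) ** 2
    if d <= sigma2:          # deviation within the variance: treat as normal
        return 1.0
    return sigma2 / d

mu, sigma2 = 6.8, 0.56       # values estimated during the learning phase
print(length_probability(7, mu, sigma2))    # close to the mean -> 1.0
print(length_probability(500, mu, sigma2))  # buffer-overflow-sized -> ~0
```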
4.2 Attribute Character Distribution

The attribute character distribution model captures the concept of a 'normal' or 'regular' query parameter by looking at its character distribution. The approach is based on the observation that attributes have a regular structure, are mostly human-readable, and almost always contain only printable characters.

A large percentage of characters in such attributes are drawn from a small subset of the 256 possible 8-bit values (mainly from letters, numbers, and a few special characters). As in English text, the characters are not uniformly distributed, but occur with different frequencies. Obviously, it cannot be expected that the frequency distribution is identical to a standard English text. Even the frequency of a certain character (e.g., the frequency of the letter 'e') varies considerably between different attributes. Nevertheless, there are similarities between the character frequencies of query parameters. This becomes apparent when the relative frequencies of all possible 256 characters are sorted in descending order.
The algorithm is based only on the frequency values themselves and does not rely on the distributions of particular characters. That is, it does not matter whether the character with the most occurrences is an 'a' or a '/'. In the following, the sorted, relative character frequencies of an attribute are called its character distribution.

For example, consider the parameter string 'passwd' with the corresponding ASCII values of '112 97 115 115 119 100'. The absolute frequency distribution is 2 for 115 and 1 for the four others. When these absolute counts are transformed into sorted, relative frequencies (i.e., the character distribution), the resulting values are 0.33, 0.17, 0.17, 0.17, 0.17 followed by 0 occurring 251 times.
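The 'passwd' example can be reproduced with the short sketch below (illustrative code):

```python
# Sketch: turn absolute character counts into the sorted, relative
# frequencies that form an attribute's character distribution.

from collections import Counter

def character_distribution(attribute):
    counts = Counter(attribute.encode())        # occurrences per byte value
    freqs = [0.0] * 256
    for i, (_, c) in enumerate(counts.most_common()):
        freqs[i] = c / len(attribute)           # sorted, relative frequencies
    return freqs

dist = character_distribution('passwd')
print([round(f, 2) for f in dist[:6]])  # [0.33, 0.17, 0.17, 0.17, 0.17, 0.0]
```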
For an attribute of a legitimate query, one can expect that the relative frequencies slowly decrease in value. In case of malicious input, however, the frequencies can drop extremely fast (because of a peak caused by a single character with a very high frequency) or nearly not at all (in case of random values).

The character distribution of an attribute that is perfectly normal (i.e., non-anomalous) is called the attribute's idealized character distribution (ICD). The idealized character distribution is a discrete distribution with:

ICD : D → P with D = {n ∈ N | 0 ≤ n ≤ 255}, P = {p ∈ R | 0 ≤ p ≤ 1} and Σ(i=0..255) ICD(i) = 1.0.

The relative frequency of the character that occurs n-most often (0-most denoting the maximum) is given as ICD(n). When the character distribution of the sample parameter 'passwd' is interpreted as the idealized character distribution, then ICD(0) = 0.33 and ICD(1) to ICD(4) are equal to 0.17.
In contrast to signature-based approaches, this model has the advantage that it cannot be evaded by some well-known attempts to hide malicious code inside a string. In fact, signature-based systems often contain rules that raise an alarm when long sequences of 0x90 bytes (the nop operation in Intel x86-based architectures) are detected in a packet. An intruder may substitute these sequences with instructions that have a similar behavior (e.g., add rA,rA,0, which adds 0 to the value in register A and stores the result back to A). By doing this, it is possible to prevent signature-based systems from detecting the attack. Such sequences, nonetheless, cause a distortion of the attribute's character distribution, and, therefore, the character distribution analysis still yields a high anomaly score. In addition, characters in malicious input are sometimes disguised by xor'ing them with constants or shifting them by a fixed value (e.g., using the ROT-13 code). In this case, the payload only contains a small routine in clear text that has the task of decrypting and launching the primary attack code. These evasion attempts do not change the resulting character distribution and the anomaly score of the analyzed query parameter is unaffected.
4.2.1 Learning

The idealized character distribution is determined during the training phase. For each observed query attribute, its character distribution is stored. The idealized character distribution is then approximated by calculating the average of all stored character distributions. This is done by setting ICD(n) to the mean of the nth entry of the stored character distributions ∀n : 0 ≤ n ≤ 255. Because all individual character distributions sum up to unity, their average will do so as well, and the idealized character distribution is well-defined.
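A sketch of this averaging step is shown below (illustrative code; each stored distribution is assumed to be a list of 256 sorted, relative frequencies as in the earlier example):

```python
# Sketch of ICD learning: the idealized character distribution is the
# entry-wise mean of the stored per-attribute character distributions.

def learn_icd(distributions):
    n = len(distributions)
    return [sum(d[i] for d in distributions) / n for i in range(256)]

d1 = [2/6, 1/6, 1/6, 1/6, 1/6] + [0.0] * 251   # 'passwd' from the example
d2 = [2/6, 2/6, 1/6, 1/6] + [0.0] * 252        # another observed attribute
icd = learn_icd([d1, d2])
print(round(icd[0], 2), round(sum(icd), 4))    # 0.33 1.0 (still sums to 1)
```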
4.2.2 Detection

Given an idealized character distribution ICD, the task of the detection phase is to determine the probability that the character distribution of a query attribute is an actual sample drawn from its ICD. This probability, or more precisely, the confidence in the hypothesis that the character distribution is a sample from the idealized character distribution, is calculated by a statistical test.

This test should yield a high confidence in the correctness of the hypothesis for normal (i.e., non-anomalous) attributes while it should reject anomalous ones. The detection algorithm uses a variant of the Pearson χ²-test as a 'goodness-of-fit' test [4].

For the intended statistical calculations, it is not necessary to operate on all values of ICD directly. Instead, it is enough to consider a small number of intervals, or bins. For example, assume that the domain of ICD is divided into six segments as shown in Table 1. Although the choice of six bins is somewhat arbitrary¹, it has no significant impact on the results.

Segment    0   1     2     3      4       5
x-Values   0   1-3   4-6   7-11   12-15   16-255

Table 1: Bins for the χ²-test

The expected relative frequency of characters in a segment can be easily determined by adding the values of ICD for the corresponding x-values. Because the relative frequencies are sorted in descending order, it can be expected that the values of ICD(x) are more significant for the anomaly score when x is small. This fact is clearly reflected in the division of ICD's domain.

When a new query attribute is analyzed, the number of occurrences of each character in the string is determined. Afterward, the values are sorted in descending order and combined according to Table 1 by aggregating values that belong to the same segment. The χ²-test is then used to calculate the probability that the given sample has been drawn from the idealized character distribution. The standard test requires the following steps to be performed.

1. Calculate the observed and expected frequencies - The observed values Oi (one for each bin) are already given. The expected number of occurrences Ei are calculated by multiplying the relative frequencies of each of the six bins as determined by the ICD times the length of the attribute (i.e., the length of the string).

2. Compute the χ²-value as χ² = Σ(i=0..5) (Oi − Ei)² / Ei; note that i ranges over all six bins.

3. Determine the degrees of freedom and obtain the significance - The degrees of freedom for the χ²-test are identical to the number of addends in the formula above minus one, which yields five for the six bins used. The actual probability p that the sample is derived from the idealized character distribution (that is, its significance) is read from a predefined table using the χ²-value as index.

¹The number six seems to have a particular relevance to the field of anomaly detection [32].
The derived value p is used as the return value for this model. When the probability that the sample is drawn from the idealized character distribution increases, p increases as well.
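The three steps above can be sketched as follows; the final significance lookup is abbreviated to a few tabulated χ² values for five degrees of freedom, where a real implementation would use a complete table (bins with zero expected frequency are simply skipped in this sketch):

```python
# Sketch of the chi-square detection step with the binning of Table 1.

from collections import Counter

BINS = [(0, 0), (1, 3), (4, 6), (7, 11), (12, 15), (16, 255)]

def chi_square_probability(attribute, icd):
    n = len(attribute)
    counts = sorted(Counter(attribute.encode()).values(), reverse=True)
    counts += [0] * (256 - len(counts))
    observed = [sum(counts[lo:hi + 1]) for lo, hi in BINS]
    expected = [sum(icd[lo:hi + 1]) * n for lo, hi in BINS]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Abbreviated (chi2 value, significance) table, 5 degrees of freedom.
    for limit, p in [(1.15, 0.95), (4.35, 0.50), (9.24, 0.10), (11.07, 0.05)]:
        if chi2 <= limit:
            return p
    return 0.0

icd = [0.33, 0.17, 0.17, 0.17, 0.17] + [0.0] * 251
print(chi_square_probability('passwd', icd))   # 0.95: high significance
```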
4.3 Structural Inference

4.3.1 Learning

The probability of a single path is the product of the probabilities of the emitted symbols pSi(oi) and the taken transitions p(ti). The probabilities of all possible output words w sum up to 1.

The a posteriori probability of a model is proportional to the product that is obtained when the probability of the model and the likelihood of the training data given that model are multiplied. Models that are too complex have a high likelihood of producing the training data (up to 1 when the model only contains the training input without any abstractions), but the probability of the model itself is very low. By maximizing the product, the Bayesian model induction approach creates automatons that generalize enough to reflect the general structure of the input without discarding too much information.

The model building process starts with an automaton that exactly reflects the input data and then gradually merges states. This state merging is continued until the a posteriori probability no longer increases. There are a number of optimizations such as the Viterbi path approximation and the path prefix compression that need to be applied to make that process effective. The interested reader is referred to [30] and [31] for details. Alternative applications of Markov models for intrusion detection have been presented in [3] and in [35].
4.3.2 Detection

Once the Markov model has been built, it can be used by the detection phase to evaluate query attributes by determining their probability. The probability of an attribute is calculated in a way similar to the likelihood of a training item as shown in Equation 4. The problem is that even legitimate input that has been regularly seen during the training phase may receive a very small probability value because the probability values of all possible input words sum up to 1. Therefore, we chose to have the model return a probability value of 1 if the word is a valid output from the Markov model and a value of 0 when the value cannot be derived from the given grammar.
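The sketch below illustrates this detection step on a toy model; the dict-based automaton encoding and the single merged probability per edge are our simplifying assumptions, whereas the actual model is induced by Bayesian model merging as described above:

```python
# Toy (hypothetical) learned model: state -> {symbol: (next_state, prob)}.
MODEL = {
    'start': {'a': ('s1', 1.0)},
    's1':    {'b': ('s1', 0.5), 'c': ('end', 0.5)},
}

def path_probability(word):
    """Product of the probabilities along the path emitting the word."""
    state, prob = 'start', 1.0
    for symbol in word:
        if state not in MODEL or symbol not in MODEL[state]:
            return 0.0                      # word not derivable from grammar
        state, p = MODEL[state][symbol]
        prob *= p
    return prob if state == 'end' else 0.0

def structure_score(word):
    """Detection-phase return value: 1 if the word is valid output, else 0."""
    return 1 if path_probability(word) > 0 else 0

print(structure_score('abbc'), structure_score('ax'))  # 1 0
```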
4.4 Token Finder

The purpose of the token finder model is to determine whether the values of a certain query attribute are drawn from a limited set of possible alternatives (i.e., they are tokens or elements of an enumeration). Web applications often require one out of a few possible values for certain query attributes, such as flags or indices. When a malicious user attempts to use these attributes to pass illegal values to the application, the attack can be detected. When no enumeration can be identified, it is assumed that the attribute values are random.

4.4.1 Learning

The classification of an argument as an enumeration or as a random value is based on the observation that the number of different occurrences of parameter values is bound by some threshold in the case of an enumeration, while it is unrestricted in the case of random values. This observation is captured by two functions f and g that are evaluated over the sequence of analyzed parameter values: f(x) = x simply grows with the number of analyzed values, while g is defined as shown below.

g(x) = { g(x − 1) + 1,  if the x-th value is new
       { g(x − 1) − 1,  if the x value was seen before    (9)
       { 0,             if x = 0

The correlation parameter ρ is derived after the training data has been processed. It is calculated from f and g with their respective variances Var(f), Var(g) and the covariance Covar(f, g) as shown below.

ρ = Covar(f, g) / √(Var(f) ∗ Var(g))    (10)

If ρ is less than 0, then f and g are negatively correlated and an enumeration is assumed. This is motivated by the fact that, in this case, increasing function values of f (reflecting the increasing number of analyzed parameters) correlate with decreasing values of g(x) (reflecting the fact that many argument values for a have previously occurred). In the opposite case, where ρ is greater than 0, the values of a have shown sufficient variation to support the hypothesis that they are not drawn from a small set of predefined tokens.

When an enumeration is assumed, the complete set of identifiers is stored for use in the detection phase.
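A sketch of this learning phase, following Equations 9 and 10, is shown below (illustrative code):

```python
# Sketch of the token finder's learning phase: build f and g over the
# observed values, compute rho, and store the identifiers on enumeration.

def learn_token_finder(values):
    f, g, seen = [], [], set()
    for x, value in enumerate(values, start=1):
        f.append(x)
        prev = g[-1] if g else 0
        if value in seen:
            g.append(prev - 1)          # value seen before: g decreases
        else:
            g.append(prev + 1)          # new value: g increases
            seen.add(value)
    n = len(values)
    mean_f, mean_g = sum(f) / n, sum(g) / n
    covar = sum((a - mean_f) * (b - mean_g) for a, b in zip(f, g)) / n
    var_f = sum((a - mean_f) ** 2 for a in f) / n
    var_g = sum((b - mean_g) ** 2 for b in g) / n
    rho = covar / (var_f * var_g) ** 0.5
    return set(values) if rho < 0 else None   # enumeration: keep identifiers

print(learn_token_finder(['on', 'off', 'on', 'off', 'on', 'off']))  # enum
print(learn_token_finder(['u1', 'u2', 'u3', 'u4', 'u5', 'u6']))     # None
```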
4.4.2 Detection

Once it has been determined that the values of a query attribute are tokens drawn from an enumeration, any new value is expected to appear in the set of known values. When this happens, 1 is returned, 0 otherwise. If it has been determined that the parameter values are random, the model always returns 1.

4.5 Attribute Presence or Absence

Most of the time, server-side programs are not directly invoked by users typing the input parameters into the URIs themselves. Instead, client-side programs, scripts, or HTML forms pre-process the data and transform it into a suitable request. This processing step usually results in a high regularity in the number, name, and order of parameters. Empirical evidence shows that hand-crafted attacks focus on exploiting a vulnerability in the code that processes a certain parameter value, and little attention is paid to the order or completeness of the parameters.

The analysis takes advantage of this fact and detects requests that deviate from the way parameters are presented by legitimate client-side scripts or programs. This type of anomaly is detected using two different algorithms. The first one, described in this section, deals with the presence and absence of attributes ai in a query q. The second one is based on the relative order of parameters and is further discussed in Section 4.6. Note that the two models differ from the previous ones because the analysis is performed on the query as a whole, and not individually on each parameter.

The algorithm discussed hereinafter assumes that the absence or abnormal presence of one or more parameters in a query might indicate malicious behavior. In particular, if an argument needed by a server-side program is missing, or if mutually exclusive arguments appear together, then the request is considered anomalous.

4.5.1 Learning

The test for presence and absence of parameters creates a model of acceptable subsets of attributes that appear simultaneously in a query. This is done by recording each distinct subset Sq = {ai, . . . , ak} of attributes that is seen during the training phase.

4.5.2 Detection

During the detection phase, the algorithm performs for each query a lookup of the current attribute set. When the set of parameters has been encountered during the training phase, 1 is returned, otherwise 0.
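Both phases of this model amount to simple set bookkeeping, as the sketch below illustrates (illustrative code):

```python
# Sketch of the presence/absence model: record each distinct attribute
# subset during training, then test set membership during detection.

def learn_attribute_sets(training_queries):
    return {frozenset(q) for q in training_queries}

def presence_score(query, known_sets):
    return 1 if frozenset(query) in known_sets else 0

known = learn_attribute_sets([['id', 'action'], ['id', 'action', 'sort']])
print(presence_score(['action', 'id'], known))  # 1: subset seen in training
print(presence_score(['id'], known))            # 0: 'action' is missing
```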
4.6 Attribute Order

As discussed in the previous section, legitimate invocations of server-side programs often contain the same parameters in the same order. Program logic is usually sequential, and, therefore, the relative order of attributes is preserved even when parameters are omitted in certain queries. This is not the case for hand-crafted requests, as the order chosen by a human can be arbitrary and has no influence on the execution of the program.

The test for parameter order in a query determines whether the given order of attributes is consistent with the model deduced during the learning phase.

4.6.1 Learning

The order constraints between all k attributes (ai : ∀i = 1 . . . k) of a query are gathered during the training phase. An attribute as of a program precedes another attribute at when as and at appear together in the parameter list of at least one query and as comes before at in the ordered list of attributes of all queries where they appear together.

This definition allows one to introduce the order constraints as a set of attribute pairs O such that:

O = {(ai, aj) : ai precedes aj and ai, aj ∈ (Sqj : ∀j = 1 . . . n)}    (11)

The set of attribute pairs O is determined as follows. Consider a directed graph G that has a number of vertices equal to the number of distinct attributes. Each vertex vi in G is associated with the corresponding attribute ai. For every query qj, with j = 1 . . . n, that is analyzed during the training period, the ordered list of its attributes a1, a2, . . . , ai is processed. For each attribute pair (as, at) in this list, with s ≠ t and 1 ≤ s, t ≤ i, a directed edge is inserted into the graph from vs to vt.

At the end of the learning process, graph G contains all order constraints imposed by queries in the training data. The order dependencies between two attributes are represented either by a direct edge connecting their corresponding vertices, or by a path over a series of directed edges. At this point, however, the graph could potentially contain cycles as a result of precedence relationships between attributes derived from different queries. As such relationships are impossible, they have to be removed before the final order constraints can be determined. This is done with the help of Tarjan's algorithm [33], which identifies all strongly connected components (SCCs) of G. For each component, all edges connecting vertices of the same SCC are removed. The resulting graph is acyclic and can be utilized to determine the set of attribute pairs O which are in a 'precedes' relationship. This is obtained by enumerating for each vertex vi all its reachable nodes vg, . . . , vh in G, and adding the pairs (ai, ag) . . . (ai, ah) to O.

4.6.2 Detection

The detection process checks whether the attributes of a query satisfy the order constraints deduced during the learning phase. Given a query with attributes a1, a2, . . . , ai and the set of order constraints O, all the parameter pairs (aj, ak) with j ≠ k and 1 ≤ j, k ≤ i are analyzed to detect potential violations. A violation occurs when for any single pair (aj, ak), the corresponding pair with swapped elements (ak, aj) is an element of O. In such a case, the algorithm returns an anomaly score of 0, otherwise it returns 1.
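The complete learning and detection procedure can be sketched as follows (illustrative code; attributes unseen during training are simply ignored):

```python
# Sketch of the attribute order model: build the precedence graph, break
# cycles by removing edges inside strongly connected components (found
# with Tarjan's algorithm), derive O by reachability, and check queries.

from collections import defaultdict

def tarjan_scc(nodes, edges):
    """Map each node to a representative of its SCC (Tarjan's algorithm)."""
    index, low, comp, stack, on_stack = {}, {}, {}, [], set()
    counter = [0]

    def strongconnect(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in edges[v]:
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of an SCC
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp[w] = v
                if w == v:
                    break

    for v in nodes:
        if v not in index:
            strongconnect(v)
    return comp

def learn_order(training_queries):
    nodes, edges = set(), defaultdict(set)
    for q in training_queries:
        nodes.update(q)
        for i, a in enumerate(q):
            for b in q[i + 1:]:
                edges[a].add(b)         # a appeared before b in some query
    comp = tarjan_scc(nodes, edges)
    acyclic = {v: {w for w in edges[v] if comp[w] != comp[v]} for v in nodes}
    order = set()                        # O: all transitive 'precedes' pairs
    for v in nodes:
        stack, seen = list(acyclic[v]), set()
        while stack:
            w = stack.pop()
            if w not in seen:
                seen.add(w)
                order.add((v, w))
                stack.extend(acyclic[w])
    return order

def order_score(query, order):
    pairs = ((a, b) for i, a in enumerate(query) for b in query[i + 1:])
    return 0 if any((b, a) in order for a, b in pairs) else 1

O = learn_order([['id', 'action', 'sort'], ['id', 'sort']])
print(order_score(['id', 'action'], O), order_score(['sort', 'id'], O))  # 1 0
```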
5. EVALUATION

This section discusses our approach to validate the proposed models and to evaluate the detection effectiveness of our system. That is, we assess the capability of the models to accurately capture the properties of the analyzed attributes and their ability to reliably detect potentially malicious deviations.

The evaluation was performed using three data sets. These data sets were Apache log files from a production web server at Google, Inc. and from two Computer Science Department web servers located at the University of California, Santa Barbara (UCSB) and the Technical University, Vienna (TU Vienna).

We had full access to the log files of the two universities. However, the access to the log file from Google was restricted because of privacy issues. To obtain results for this data set, our tool was run on our behalf locally at Google and the results were mailed to us.

Table 2 provides information about important properties of the data sets. The table shows the time interval during which the data was recorded and the log file size. It also lists the total number of HTTP queries in the log file, the number of requests that invoke server-side programs (such as CGI requests), the total number of their attributes, and the number of different server-side programs.

5.1 Model Validation

This section shows the validity of the claim that our proposed models are able to accurately describe properties of query attributes. For this purpose, our detection tool was run on the three data sets to determine the distribution of the probability values for the different models. The length of the training phase was set to 1,000 for this and all following experiments. This means that our system used the first thousand queries that invoked a certain server-side program to establish its profiles and to determine suitable detection thresholds.
Data Set    Time Interval   Size (MByte)   HTTP Queries   Program Requests   Attributes   Programs
Google      1 hour          236            640,506        490,704            1,611,254    206
UCSB        297 days        1,001          9,951,174      7,993              4,617        395
TU Vienna   80 days         251            2,061,396      713,500            765,399      84

Table 2: Characteristics of the analyzed data sets
[Figure 3: Distribution of the probability values assigned by the attribute length model. The y-axis shows the relative number of attribute values on a logarithmic scale (1 down to 0.0001), the x-axis the probability values (0 to 1); one curve each for Google, UCSB, and TU Vienna.]
Figures 3 and 4 show the distribution of the probability values that have been assigned to the query attributes by the length and the character distribution models, respectively. The y-axis shows the percentage of attribute values that appeared with a specific probability. For the figures, we aggregated the probability values (which are real numbers in the interval between 0.0 and 1.0) into ten bins, each bin covering an interval of 0.1. That is, all probabilities in the interval [0.0, 0.1[ are added to the first bin, values in the interval [0.1, 0.2[ are added to the second bin, and so forth. Note that a probability of 1 indicates a completely normal event. The relative number of occurrences is shown on a logarithmic scale.

Table 3 shows the number of attributes that have been rated as normal (with a probability of 1) or as anomalous (with a probability of 0) by the structural model and the token finder model. The table also provides the number of queries that have been classified as normal or as anomalous by the presence/absence model and the attribute order model. The number of queries is less than the number of attributes, as each query can contain multiple attributes.

The distributions of the anomaly scores in Figure 3, Figure 4 and Table 3 show that all models are capable of capturing the normality of their corresponding features. The vast majority of the analyzed attributes are classified as normal (reflected by an anomaly score close to one in the figures) and only few instances deviate from the established profiles. The graphs in Figures 3 and 4 quickly drop from above 90% of 'most normal' instances in the last bin to values below 1%. It can be seen that the data collected by the Google server shows the highest variability (especially in the case of the attribute length model). This is due to the fact that the Google search string is included in the distribution. Naturally, this string, which is provided by users via their web browsers to issue Google search requests, varies to a great extent.

5.2 Detection Effectiveness

This section analyzes the number of hits and false positives raised during the operation of our tool.

To assess the number of false positives that can be expected when our system is deployed, the intrusion detection system was run on our three data sets. For this experiment, we assumed that the training data contained no real attacks. Although the original log files showed a significant number of entries from Nimda or Code Red worm attacks, these queries were excluded both from the model building and detection process. Note, however, that this is due to the fact that all three sites use the Apache HTTP server. This web server fails to locate the targeted vulnerable program and thus fails to execute it. As we only include queries that result from the invocation of existing programs into the training and detection process, these worm attacks were ignored.

The false positive rate can be easily calculated by dividing the number of reported anomalous queries by the total number of analyzed queries. It is shown for each data set in Table 4.
            Structure (Attribute)    Token (Attribute)     Presence (Query)    Order (Query)
Data Set    normal      anomalous    normal     anomalous  normal    anomalous normal    anomalous
Google      1,595,516   15,738       1,603,989  7,265      490,704   0         490,704   0
UCSB        7,992       1            7,974      19         4,616     1         4,617     0
TU Vienna   765,311     98           765,039    370        713,425   75        713,500   0

Table 3: Number of attributes (structural and token finder models) and queries (presence/absence and order models) classified as normal or anomalous
[Figure 4: Distribution of the probability values assigned by the attribute character distribution model. The y-axis shows the relative number of attribute values on a logarithmic scale (1 down to 0.0001), the x-axis the probability values (0 to 1); one curve each for Google, UCSB, and TU Vienna.]
The relative numbers of false positives are very similar for all three sites, but the absolute numbers differ tremendously, reflecting the different web server loads. Although almost five thousand alerts per day for the Google server appears to be a very high number at first glance, one has to take into account that this is an initial result. The alerts are the raw output produced by our system after a training phase with parameters chosen for the university log files. One approach to reduce the number of false positives is to modify the training and detection thresholds to account for the higher variability in the Google traffic. Nearly half of the false positives are caused by anomalous search strings that contain instances of non-printable characters (probably requests issued by users with incompatible character sets) or extremely long strings (such as URLs directly pasted into the search field). Another approach is to perform post-processing of the output, maybe using a signature-based intrusion detection system to discard anomalous queries with known deviations. In addition, it is not completely impossible to deal with this amount of alerts manually. One or two full-time employees could browse the list of alerts, quickly discarding obviously incorrect instances and concentrating on the few suspicious ones.

When analyzing the output for the two university log files, we encountered several anomalous queries with attributes that were not malicious, even though they could not be interpreted as correct in any way. For example, our tool reported a character string in a field used by the application to transmit an index. By discussing these queries with the administrators of the corresponding sites, it was concluded that some of the mistakes may have been introduced by users that were testing the system for purposes other than security.

After estimating the false alarm rates, the detection capabilities of our tool were analyzed. For this experiment, a number of attacks were introduced into the data set of TU Vienna. We have chosen this data set to insert attacks for two reasons. First, we had access to the log file and could inject queries; something that was impossible for the Google data set. Second, the vulnerable programs that were attacked had already been installed at this site and were regularly used. This allowed us to base the evaluation on real-world training data.

We used eleven real-world exploits downloaded from popular security sites [6, 27, 29] for our experiment. The set of attacks consisted of a buffer overflow against phorum [26], a php message board, and three directory traversal attacks against htmlscript [24]. Two XSS (cross-site scripting) exploits were launched against imp [15], a web-based email client, and two XSS exploits against csSearch [8], a search utility. Webwho [9], a web-based directory service, was compromised using three variations of input validation errors. We also wanted to assess the ability of our system to detect worms such as Nimda or Code Red. However, as mentioned above, all log files were created by Apache web servers. Apache is not vulnerable to these attacks, as both worms exploit vulnerabilities in Microsoft's Internet Information Server (IIS). We solved the problem by installing a Microsoft IIS server and, after manually creating training data for the vulnerable program, injecting the signature of a Code Red attack [5]. Then, we transformed the log file into Apache format and ran our system on it.

All eleven attacks and the Code Red worm have been reliably detected by our anomaly detection system, using the same thresholds and training data that were used to evaluate the false alarm rate for this data set. Although the attacks were known to us, all are based on existing code that was used unmodified. In addition, the malicious queries were injected into the log files for this experiment after the model algorithms were designed and the false alarm rate was assessed. No manual tuning or adjustment was necessary.
Data Set    Number of Alerts   Number of Queries   False Positive Rate   Alarms per Day
Google      206                490,704             0.000419              4,944
UCSB        3                  4,617               0.000650              0.01
TU Vienna   151                713,500             0.000212              1.89

Table 4: Number of alerts and false positive rates for the three data sets
Table 5 shows the models that reported an anomalous query or an anomalous attribute for each class of attacks. It is evident that there is no model that raises an alert for all attacks. This underlines the importance of choosing and combining different properties of queries and attributes to cover a large number of possible attack venues.

The length model, the character distribution model, and the structural model are very effective against a broad range of attacks that inject a substantial amount of malicious payload into an attribute string. Attacks such as buffer overflow exploits (including the Code Red worm, which bases its spreading mechanism on a buffer overflow in Microsoft's IIS) and cross-site scripting attempts require a substantial amount of characters, thereby increasing the attribute length noticeably. Also, a human operator can easily tell that a maliciously modified attribute does not 'look right'. This observation is reflected in its anomalous character distribution and a structure that differs from the previously established profile.

Input validation errors, including directory traversal attempts, are harder to detect. The required number of characters is smaller than the number needed for buffer overflow or XSS exploits, often in the range of the legitimate attribute. Directory traversal attempts stand out because of the unusual structure of the attribute string (repetitions of slashes and dots). Unfortunately, this is not true for input validation attacks in general. The three attacks that exploit an error in Webwho did not result in an anomalous attribute for the character distribution model or the structural model. In this particular case, however, the token finder raised an alert, because only a few different values of the involved attribute were encountered during the training phase.

The presence/absence and the parameter order model can be evaded without much effort by an adversary that has sufficient knowledge of the structure of a legitimate query. Note, however, that the available exploits used in our experiments resulted in reported anomalies from at least one of the two models in 8 out of 11 cases (one buffer overflow, four directory traversal, and three input validation attacks). We therefore decided to include these models into our IDS, especially because of the low number of false alarms they produce.

The results presented in this section show that our system is able to detect a high percentage of attacks with a very limited number of false positives (all attacks, with less than 0.2% false alarms in our experiments). Some of the attacks are also detectable by signature-based intrusion detection systems such as Snort, because they represent variations of known attacks (e.g., Code Red, buffer overflows). Other attacks use malicious manipulation of the query parameters, which signature-based systems do not notice. These attacks are correctly flagged by our anomaly detection system.

A limitation of the system is its reliance on web access logs. Attacks that compromise the security of a web server before the logging is performed may not be detected. The approach described in [1] advocates the direct instrumentation of web servers in order to perform timely detection of attacks, even before a query is processed. This approach may introduce some unwanted delay in certain cases, but if this delay is acceptable then the system described here could be easily modified to fit that model.

6. CONCLUSIONS

Web-based attacks should be addressed by tools and techniques that combine the precision of signature-based detection with the flexibility of anomaly-based intrusion detection systems.

This paper introduces a novel approach to perform anomaly detection, using as input HTTP queries containing parameters. The work presented here is novel in several ways. First of all, to the best of our knowledge, this is the first anomaly detection system specifically tailored to the detection of web-based attacks. Second, the system takes advantage of application-specific correlation between server-side programs and parameters used in their invocation. Third, the parameter characteristics (e.g., length and structure) are learned from input data. Ideally, the system will not require any installation-specific configuration, even though the level of sensitivity to anomalous data can be configured via thresholds to suit different site policies.

The system has been tested on data gathered at Google, Inc. and two universities in the United States and Europe. Future work will focus on further decreasing the number of false positives by refining the algorithms developed so far, and by looking at additional features. The ultimate goal is to be able to perform anomaly detection in real-time for web sites that process millions of queries per day with virtually no false alarms.
Acknowledgments

We would like to thank Urs Hoelzle from Google, Inc. who made it possible to test our system on log files from one of the world's most popular web sites.

This research was supported by the Army Research Office, under agreement DAAD19-01-1-0484. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Army Research Office, or the U.S. Government.
7. REFERENCES

[1] M. Almgren and U. Lindqvist. Application-Integrated Data Collection for Security Monitoring. In Proceedings of Recent Advances in Intrusion Detection (RAID), LNCS, pages 22–36, Davis, CA, October 2001. Springer.
[2] Apache 2.0 Documentation, 2002. http://www.apache.org/.
[3] D. Barbara, R. Goel, and S. Jajodia. Mining Malicious Data Corruption with Hidden Markov Models. In 16th Annual IFIP WG 11.3 Working Conference on Data and Application Security, Cambridge, England, July 2002.
[4] P. Billingsley. Probability and Measure. Wiley-Interscience, 3rd edition, April 1995.
[5] CERT/CC. "Code Red Worm" Exploiting Buffer Overflow In IIS Indexing Service DLL. Advisory CA-2001-19, July 2001.
[6] CGI Security Homepage. http://www.cgisecurity.com/, 2002.
[7] K. Coar and D. Robinson. The WWW Common Gateway Interface, Version 1.1. Internet Draft, June 1999.
[8] csSearch. http://www.cgiscript.net/.
[9] Cyberstrider WebWho. http://www.webwho.co.uk/.
[10] D.E. Denning. An Intrusion Detection Model. IEEE Transactions on Software Engineering, 13(2):222–232, February 1987.
[11] R. Fielding et al. Hypertext Transfer Protocol – HTTP/1.1. RFC 2616, June 1999.
[12] S. Forrest. A Sense of Self for UNIX Processes. In Proceedings of the IEEE Symposium on Security and Privacy, pages 120–128, Oakland, CA, May 1996.
[13] A.K. Ghosh, J. Wanken, and F. Charron. Detecting Anomalous and Unknown Intrusions Against Programs. In Proceedings of the Annual Computer Security Applications Conference (ACSAC'98), pages 259–267, Scottsdale, AZ, December 1998.
[14] K. Ilgun, R.A. Kemmerer, and P.A. Porras. State Transition Analysis: A Rule-Based Intrusion Detection System. IEEE Transactions on Software Engineering, 21(3):181–199, March 1995.
[15] IMP Webmail Client. http://www.horde.org/imp/.
[16] H.S. Javitz and A. Valdes. The SRI IDES Statistical Anomaly Detector. In Proceedings of the IEEE Symposium on Security and Privacy, May 1991.
[17] C. Ko, M. Ruschitzka, and K. Levitt. Execution Monitoring of Security-Critical Programs in Distributed Systems: A Specification-based Approach. In Proceedings of the 1997 IEEE Symposium on Security and Privacy, pages 175–187, May 1997.
[18] C. Kruegel, T. Toth, and E. Kirda. Service Specific Anomaly Detection for Network Intrusion Detection. In Symposium on Applied Computing (SAC). ACM Scientific Press, March 2002.
[19] T. Lane and C.E. Brodley. Temporal sequence learning and data reduction for anomaly detection. In Proceedings of the 5th ACM Conference on Computer and Communications Security, pages 150–158. ACM Press, 1998.
[20] W. Lee and S. Stolfo. A Framework for Constructing Features and Models for Intrusion Detection Systems. ACM Transactions on Information and System Security, 3(4), November 2000.
[21] W. Lee, S. Stolfo, and K. Mok. Mining in a Data-flow Environment: Experience in Network Intrusion Detection. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '99), San Diego, CA, August 1999.
[22] J. Liberty and D. Hurwitz. Programming ASP.NET. O'Reilly, February 2002.
[23] U. Lindqvist and P.A. Porras. Detecting Computer and Network Misuse with the Production-Based Expert System Toolset (P-BEST). In IEEE Symposium on Security and Privacy, pages 146–161, Oakland, California, May 1999.
[24] Miva HtmlScript. http://www.htmlscript.com/.
[25] V. Paxson. Bro: A System for Detecting Network Intruders in Real-Time. In Proceedings of the 7th USENIX Security Symposium, San Antonio, TX, January 1998.
[26] Phorum: PHP Message Board. http://www.phorum.org/.
[27] PHP Advisory Homepage. http://www.phpadvisory.com/, 2002.
[28] M. Roesch. Snort - Lightweight Intrusion Detection for Networks. In Proceedings of the USENIX LISA '99 Conference, November 1999.
[29] Security Focus Homepage. http://www.securityfocus.com/, 2002.
[30] A. Stolcke and S. Omohundro. Hidden Markov Model Induction by Bayesian Model Merging. In Advances in Neural Information Processing Systems, 1993.
[31] A. Stolcke and S. Omohundro. Inducing Probabilistic Grammars by Bayesian Model Merging. In Conference on Grammatical Inference, 1994.
[32] K. Tan and R. Maxion. "Why 6?" Defining the Operational Limits of Stide, an Anomaly-Based Intrusion Detector. In Proceedings of the IEEE Symposium on Security and Privacy, pages 188–202, Oakland, CA, May 2002.
[33] R. Tarjan. Depth-First Search and Linear Graph Algorithms. SIAM Journal of Computing, 1(2):10–20, June 1972.
[34] Security Tracker. Vulnerability statistics April 2001–March 2002. http://www.securitytracker.com/learn/statistics.html, April 2002.
[35] N. Ye, Y. Zhang, and C.M. Borror. Robustness of the Markov chain model for cyber attack detection. IEEE Transactions on Reliability, 52(3), September 2003.