bef(s) = aft(s). The default skip transition is modified to be predicated upon not matching s.
Sequencing. If s = s1; s2, then bef(s) = bef(s1), aft(s) = aft(s2), and aft(s1) = bef(s2).
Alternation. If s = s1|s2, then bef(s) provides transitions to bef(s1) and bef(s2); similarly, aft(s1) and aft(s2) each have a transition to aft(s).
Partial order. Partial orders resemble alternation statements: if s = s1, s2, then bef(s) provides transitions to bef(s1) and bef(s2); similarly, aft(s1) and aft(s2) each have a transition to aft(s). The primary difference is that the aft(s) state is a join point.
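The sequencing and alternation rules above can be sketched as a small automaton builder. This is a minimal illustration only: the tuple representation (bef, aft, transitions), the function names, and the use of None as an unlabelled transition are assumptions of the sketch, and skip transitions, join points, and variable bindings are deliberately left out.

```python
from itertools import count

_ids = count()

def fresh():
    """Return a fresh state name (hypothetical naming scheme)."""
    return f"q{next(_ids)}"

def event(label):
    """Primitive event: a single labelled transition from bef to aft."""
    bef, aft = fresh(), fresh()
    return bef, aft, [(bef, label, aft)]

def seq(m1, m2):
    """s = s1; s2: bef(s) = bef(s1), aft(s) = aft(s2), aft(s1) = bef(s2)."""
    b1, a1, t1 = m1
    b2, a2, t2 = m2
    # Identify aft(s1) with bef(s2) by renaming bef(s2) to aft(s1).
    ren = lambda q: a1 if q == b2 else q
    return b1, ren(a2), t1 + [(ren(p), l, ren(q)) for p, l, q in t2]

def alt(m1, m2):
    """s = s1 | s2: unlabelled edges from bef(s) into each branch and out to aft(s)."""
    b1, a1, t1 = m1
    b2, a2, t2 = m2
    bef, aft = fresh(), fresh()
    eps = [(bef, None, b1), (bef, None, b2), (a1, None, aft), (a2, None, aft)]
    return bef, aft, t1 + t2 + eps

def accepts(machine, trace):
    """Simulate the machine on a trace that contains no skipped events."""
    bef, aft, trans = machine

    def closure(states):
        todo, seen = list(states), set(states)
        while todo:
            p = todo.pop()
            for a, l, b in trans:
                if a == p and l is None and b not in seen:
                    seen.add(b)
                    todo.append(b)
        return seen

    cur = closure({bef})
    for ev in trace:
        cur = closure({b for a, l, b in trans if a in cur and l == ev})
    return aft in cur
```

For example, seq(event("a"), alt(event("b"), event("c"))) accepts the traces a b and a c but not a alone.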
Method invocation. If s is a method invocation statement, we must match the call and return events for that method, as well as all events between them. To do this, we create a fresh state t and a new event variable v. We create a transition from bef(s) to t that matches the call event, and bind v to the ID of the event. We create another transition from t to aft(s) that matches a return event with ID v. The skip transition from t back to itself is modified to exclude the match of the return event. Call and return events are unified in a manner analogous to array and field operations.
Creation points. Object creation is handled in Java by invoking the method <init>, and is translated like any other method invocation.
Context. The within clause is represented by nesting the automaton representing the body between a pair of matching call and return events. The skip transitions are modified to not match the return, forcing the failure of any match that does not complete within the call.
Unification statements. A unification statement is represented by a predicated transition that requires that the two variables on the left and right have the same value. If one is unbound, it will acquire the value of the other.
Subquery invocation. Subquery invocations are treated as if the subquery match were a primitive event in its own right. The recognizer handles subquery calls and returns on its own. More details are discussed in Section 3.3.
3.2 Instrumenting the Application
The system instruments all instructions in the target application
that match any primitive event or any exclusion event in the query.
At an instrumentation point, the pending event and all relevant ob-
jects are marshalled and sent to the query recognizer. The recog-
nizer will update the state of all pending matches and then return
control to the application.
The recognizer does not interfere with the behavior of the application except via completed matches; therefore, any instrumentation point that can be statically proven to not contribute to any match need not be instrumented. In particular, we can optimize away instrumentation where the referenced objects have statically declared types that conflict with the query. More sophisticated optimization techniques are discussed in Section 4.
3.3 The Query Recognizer
The recognizer begins with a single partial match at the beginning of the main query, with no values for any variables. It receives events from the instrumented application and updates all currently active partial matches. For each partial match, each transition from its current state that can unify with the event produces a new possible partial match where that transition is taken. A single event may be unifiable with multiple transitions from a state, so multiple new partial matches are possible. If a skip transition is present and its predicates pass, the match will persist unchanged. If the skip transition is present but a predicate fails, the match transitions to the fail state. If the skip transition is present but a predicate's value is unknown because the variables it refers to are as yet unbound, then the variable is bound to a value representing any object that does not violate the predicate. Predicates accumulate if two such objects are unified; unification with any object that satisfies all such predicates replaces the predicates with that object.
If the new state has transitions, they are processed immediately. If a transition representing a subquery call is available from the new state, a new partial match based on the subquery's state machine is generated. This partial match begins in the subquery's start state and has initial bindings corresponding to the arguments the subquery was invoked with. A unique subquery ID is generated for the subquery call and associated with the subquery caller's partial match, with the subquery callee's partial match, and with any partial match that results from taking transitions within the subquery callee.
Join points are handled by finding the latest state that dominates the join point and treating it as the split point. Each incoming transition to the join point has a submachine representing all paths from the split point to that transition. When a split point is reached, each of these submachines is matched independently in a manner similar to subqueries. The join point then collects and combines matches as they complete, and propagates combined matches once all of them have completed.
Once a partial match transitions into an accept state, it begins
to wait for events named in replaces clauses. When a targeted
event is encountered, the instruction is skipped and the substituted
method is run instead. An executes clause runs immediately once
the accept state is reached.
When a subquery invocation completes, the subquery ID is used to locate the transition that triggered the subquery invocation. The variables assigned by the query invocation are then unified with the return values, and the subquery invocation transition is completed. The original calling partial match remains active to accept any additional subquery matches that may occur later.
In order for this matcher to scale over long input traces, it is critical to be able to quickly acquire all partial matches relevant to an event. We use a hash map to quickly access the partial matches affected by each kind of event. This map is keyed not only on the specific transition, but also on all variables known to have values at that point in the query. For queries whose partial matches consist of at most one variable-value binding, our implementation is very efficient, as it needs to perform only a single hash lookup.
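The single-binding case described above might be sketched as follows. The class name MatchIndex, its methods, and the string used to identify a transition are all hypothetical; the sketch only illustrates keying a hash map on the transition together with one bound variable and its value, and does not reproduce the actual implementation.

```python
from collections import defaultdict

class MatchIndex:
    """Index partial matches by (transition, bound variable, value).

    Assumption of this sketch: each partial match carries at most one
    variable-value binding, so one hash lookup finds every match that
    an incoming event can affect.
    """

    def __init__(self):
        self._index = defaultdict(list)

    def add(self, transition, var, value, match):
        # Register a partial match waiting on `transition` with var bound to value.
        self._index[(transition, var, value)].append(match)

    def lookup(self, transition, var, value):
        # Return the partial matches affected by an event on `transition`.
        return self._index.get((transition, var, value), [])
```

An event that calls execute on object 42 would then retrieve its candidates with a single call such as lookup("call:execute", "stmt", 42).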
4. STATIC CHECKER & OPTIMIZER
PQL makes it easy for developers to take advantage of context-sensitive points-to analysis results. We have developed an algorithm to automatically translate PQL queries into queries on a pointer analysis result, shielding the user from the need to directly operate on the program representation or the context-sensitive results. This translation approach is very flexible: even though our checkers are currently flow-insensitive, flow sensitivity can be added in the future to improve precision without needing to modify the queries themselves.
Accurate interprocedural pointer alias analysis is critical to the
precision of PQL static checkers, because events relevant for a par-
ticular query may be widely separated in the program. The an-
alysis for PQL must be sound, because false negatives mean the
results are unusable for optimization. This is in contrast to many
recently developed practical static checkers [10, 18, 62] which use
unsound analyses and thus produce both false-negative and false-
positive warnings.
Our checkers use pointer information from a sound cloning-based context-sensitive inclusion-based pointer alias analysis due to Whaley and Lam [59]. This analysis computes the points-to relations for each distinct call path for programs without recursion. Call paths in recursive programs are reduced by treating each strongly connected component as a single node. The points-to information is stored in a deductive database called bddbddb. The data are compactly represented with binary decision diagrams (BDDs), and can be accessed efficiently with queries written in the logic programming language Datalog.
4.1 The bddbddb Program Database
All inputs and results for the static analyzer are stored as rela-
tions in the bddbddb database. The domains in the database in-
clude bytecodes B, variables V , methods M, contexts C, heap ob-
jects named by their allocation site H, and integers Z. The context
domain represents the various call chains that can occur in the pro-
gram, and is used to qualify pointer information. Two pointer re-
lations that are true in the same calling context are associated with
the same value in C. For a further treatment of this, see [59].
The source program is represented as a number of input relations: actual, ret, fldld, fldst, arrayld, arrayst represent parameter passing, method returns, field loads, field stores, array loads, and array stores, respectively. There is a one-to-one correspondence between attributes of primitive statements in the query language and those in the relations.
In the following, we say that predicate A(x1, . . . , xn) is true if tuple (x1, . . . , xn) is in relation A. Below we show the definitions of three of the relations; the remaining ones are defined similarly.
fldld: B × V × M × V. fldld(b, v1, m, v2) means that bytecode b executes v1 = v2.m.
actual: B × Z × V. actual(b, z, v) means that variable v is the zth argument of the method call at bytecode b.
ret: B × V. ret(b, v) means that variable v is the return result of the method call at bytecode b.
The context-sensitive points-to analysis produces a numbering of the calling contexts, the invocation graph of the context-sensitive call graph, and finally the points-to results:
IE: C × B × C × M is the context-sensitive invocation relation. IE(c1, i, c2, m) means that invocation site i in context c1 may invoke method m in context c2.
vP: C × V × H is the variable points-to relation. vP(c, v, h) means that variable v in context c may point to heap object h.
A Datalog query consists of a set of rules, written in a Prolog-style notation, where a predicate is defined as a conjunction of other predicates. For example, the Datalog rule
D(w, z) :- A(w, x), B(x, y), C(y, z).
says that D(w, z) is true if A(w, x), B(x, y), and C(y, z) are all true.
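To make the reading of such a rule concrete, the following sketch evaluates it as a join over relations stored as Python sets of tuples. The function name rule_d is hypothetical, and the nested-loop evaluation is chosen only for clarity; bddbddb itself represents relations as BDDs and does not evaluate rules this way.

```python
def rule_d(A, B, C):
    """Evaluate D(w, z) :- A(w, x), B(x, y), C(y, z) by naive nested loops."""
    return {
        (w, z)
        for (w, x) in A
        for (x2, y) in B if x2 == x   # shared variable x joins A and B
        for (y2, z) in C if y2 == y   # shared variable y joins B and C
    }
```

With A = {(1, 2)}, B = {(2, 3), (9, 9)} and C = {(3, 4)}, the only derivable fact is D(1, 4).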
Example 4. Statically detecting basic SQL injections.
We can express a flow-insensitive approximation of the basic SQL injection query in Figure 1 as follows:
simpleSQLInjection(b1, b2, h) :-
    IE(c1, b1, _, getParameter),
    ret(b1, v1), vP(c1, v1, h),
    IE(c2, b2, _, execute),
    actual(b2, 1, v2), vP(c2, v2, h).
The Datalog rule says that an object h is a cause of an injection if b1 is a call to getParameter, b2 is a call to execute, and the return result of getParameter v1 in some context c1 points to the same heap object h as v2, the first parameter of the call to execute in some context c2. □
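The rule of Example 4 can be tried out on toy relations with the sketch below. The relation contents, the identifiers such as "b1" and "h1", and the function name are invented for illustration, and the nested loops stand in for the BDD-based evaluation that bddbddb actually performs.

```python
def simple_sql_injection(IE, ret, actual, vP):
    """Return all (b1, b2, h) satisfying the simpleSQLInjection rule.

    IE: set of (c1, b, c2, m); ret: set of (b, v);
    actual: set of (b, z, v); vP: set of (c, v, h).
    """
    results = set()
    for (c1, b1, _, m1) in IE:
        if m1 != "getParameter":
            continue
        for (rb, v1) in ret:
            if rb != b1:
                continue
            for (pc, pv, h) in vP:
                if (pc, pv) != (c1, v1):
                    continue
                # h is returned by getParameter; look for an execute call using it.
                for (c2, b2, _, m2) in IE:
                    if m2 != "execute":
                        continue
                    for (ab, z, v2) in actual:
                        if ab == b2 and z == 1 and (c2, v2, h) in vP:
                            results.add((b1, b2, h))
    return results
```

If the return value of getParameter at b1 and the first argument of execute at b2 both point to h1, the result is {("b1", "b2", "h1")}; if they point to different heap objects, the result is empty.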
4.2 Translation from PQL to Datalog
We perform static analysis by translating PQL queries into Datalog and using bddbddb to resolve the queries. Datalog is a highly expressive language, including the ability to recursively specify properties, meaning that PQL queries may be translated to a Datalog approximation using a simple syntax-directed approach.
In the beginning of the translation process, we normalize the input PQL queries so that the matches part of each query is an alternation of sequence statements; in other words, the topmost statement of the matches clause is an altStmt, each clause of which is a seqStmt, and the altStmts mentioned in Figure 3 are used only at the top level. Any event affected by a replaces clause is treated by this process as being a possible final event in the query. This is equivalent to appending an alternation of all such statements to the end of the matches clause before normalization.
Each PQL query becomes a Datalog relation defined over bytecodes, field/method names, and heap variables; one bytecode for every program point in the longest possible sequence of events through the query, one field or method name for each member variable in the PQL query, and one heap variable for each object variable in the PQL query. Literals and wildcards are translated from PQL into Datalog without change. We summarize the handling of individual PQL constructs below:
Primitive statements. Each primitive statement in the query is translated into one or more Datalog predicates. A syntax-directed translation of PQL queries into Datalog is shown in Figure 7. The left side of the table lists a PQL primitive statement and the right hand side shows its Datalog translation. All of these translations have the same basic form. The PQL statement refers to some heap object h_i. The bddbddb system, however, represents instructions in terms of actual program variables. We must therefore first extract the program variables into some fresh Datalog variable v_hi and then query the vP relation to determine the possible values for h_i. If a field or method name refers to a PQL member variable, it may be referenced directly in the statement.
Alternation. Since the input queries are normalized so that alter-
nation statements are used only at the top level, each clause in an
alternative is represented by a separate Datalog rule with the same
head goal.
Sequencing. Because the static analysis is flow-insensitive, we do not track sequencing directly, and instead merely demand that all events in the sequence occur at some point. This can be done by simply replacing the sequence operator ";" with the Datalog conjunction operator ",". Since there is no guarantee that the same program variables are used in each event in the sequence, the vx, i, and c Datalog variables must be fresh for each event. The hx and m variables correspond to PQL constructs and so keep the same name. Each event includes a reference to a bytecode where the event occurs, and this bytecode is bound to one of the bytecode attributes of the subquery relation. If bytecode parameters are left unbound (for example, in the base case for the derivedStream query in Figure 5), the unused bytecode parameters are set to NULL, representing no program location.
Partial order. Similarly, because the static analysis is flow-insensitive, the translation of a partial-order statement can simply treat all its clauses as part of a sequence.
Exclusion. With flow-insensitivity, no guarantees about ordering can be made. This means that we cannot deduce that an excluded event (denoted with a ~) occurs between two points in a sequence; as a result, all excluded events are ignored. This is a source of imprecision in our current analysis, but it is a conservative approximation that maintains soundness.
Primitive Statement (primStmt)      Datalog translation
h1.m = h2                           fldst(_, v1, m, v2), vP(c, v1, h1), vP(c, v2, h2)
h1 = h2.m                           fldld(_, v1, m, v2), vP(c, v2, h1), vP(c, v1, h2)
h1[ ] = h2                          arrayst(_, v1, v2), vP(c, v1, h1), vP(c, v2, h2)
h1 = h2[ ]                          arrayld(_, v1, v2), vP(c, v2, h1), vP(c, v1, h2)
h0 = m(h1, ..., hn)                 ret(i, h0), IE(c, i, _, m),
                                    actual(i, 1, v1), vP(c, v1, h1), ...,
                                    actual(i, n, vn), vP(c, vn, hn)
h0 = new typeName(h1, ..., hn)      ret(i, h0), IE(c, i, _, typeName.<init>),
                                    actual(i, 1, v1), vP(c, v1, h1), ...,
                                    actual(i, n, vn), vP(c, vn, hn)
Figure 7: Translation of primitive statements primStmt from PQL (left) into Datalog (right) for static analysis and optimization.
Within. The within m construct is handled by requiring that
matching bytecodes be found in methods transitively called from
m. This involves querying a call graph; such a call graph is avail-
able as part of the pointer analysis.
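The call-graph query behind within can be illustrated by a simple reachability computation. In this sketch the call graph is a plain set of (caller, callee) pairs and the function name is hypothetical; in the actual system the call graph comes from the pointer analysis and the query is phrased in Datalog. Following the wording above, the sketch returns only methods transitively called from m; whether the body of m itself should also count is left open here.

```python
def transitively_called(calls, root):
    """Return the set of methods transitively called from root.

    calls is a set of (caller, callee) pairs describing a call graph.
    """
    seen, todo = set(), [root]
    while todo:
        m = todo.pop()
        for caller, callee in calls:
            if caller == m and callee not in seen:
                seen.add(callee)
                todo.append(callee)
    return seen
```

A bytecode would then be accepted for within m only if its enclosing method is in transitively_called(calls, "m").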
Unification. Unification of objects is translated into equality of heap allocation sites.
Subqueries. Invocations of PQL subqueries are represented by
referring to the equivalent Datalog relation. The program points
and any variables that are not parameters in the PQL subquery are
matched to wildcards and projected away.
Figure 8 shows a full translation of the query from Figure 5 into Datalog. The first rule is a translation of the main query, which has only one path. It also involves three events, one member variable, and four object variables. Each of the four PQL statements is translated in turn, and combined they form the core of the main query. The derivedStream query is somewhat more interesting, as it has two possible sequences, each with a different length. The base case is given first, which simply asserts the equality of its arguments (the unification statement) and then returns immediately. As there is no event in this path, the bytecode argument for derivedStream is set to NULL. The second rule handles the recursive case and is similar to main's translation.
The mainRelevant and derivedRelevant relations express which bytecodes are actually part of the final solution, and will be explained in detail at the end of the next section.
4.3 Extracting the Relevant Bytecodes
The bddbddb system resolves each query more or less independently; as a result, each subquery finds program points and heap variables for any set of arguments, regardless of whether or not the subquery can be invoked with those arguments. It is thus necessary, when extracting the list of relevant bytecodes, to extract only those bytecodes that actually participate in a match of the full query. This is a two-step process. In the first step, we determine which subquery invocations contribute to the final result; in the second we project the relevant subqueries onto the bytecode domain.
Finding relevant subquery matches. Relevant subqueries are determined inductively: all members of the main relation are relevant, and any member of any query relation that appears as a clause in a relevant relation as the result of translating a subquery invocation is relevant. This translates to one rule for each invocation statement and an additional rule to express that all results of main are relevant. In Figure 8, the single rule for mainRelevant declares any solution to main to be relevant. There are two invocations of the derivedStream subquery, and each gets its own rule. The first handles the recursive subquery inside derivedStream, and the second deals with the call from main.
Extracting relevant program locations. Gathering the relevant program locations is straightforward once the previous step is performed; any program location that occurs in a relevant solution to any query is relevant. For the special case of the main query, we need not check for relevance because all solutions to the main query are relevant. Figure 8 uses the relevant relation to express this; the first rule says that a bytecode is relevant if it is part of a derivedStream relation that has been proven relevant, and the final three project any bytecode involved in main into the set of relevant bytecodes.
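The two-step process can be sketched as a small fixed-point computation over toy relations. This is an illustration under several simplifying assumptions: main tuples are reduced to (b0, b1, b2, hx, hy), derivedStream tuples are (b, hx, hd, ht), NULL and wildcard positions are represented by None, None is dropped from the result, and the function name is hypothetical. It mirrors the shape of the relevance rules for the derivedStream example rather than the general translation.

```python
def relevant_bytecodes(main, derived):
    """main: set of (b0, b1, b2, hx, hy); derived: set of (b, hx, hd, ht)."""
    # Step 1a: derivedStream solutions invoked directly from a solution of main.
    rel = {
        (hx, hd, ht)
        for (_, hx, hd, ht) in derived
        if any((hx, hd) == (mx, my) for (*_, mx, my) in main)
    }
    # Step 1b: solutions invoked recursively from an already relevant solution,
    # iterated until no new relevant solution is found.
    changed = True
    while changed:
        changed = False
        for (_, hx, hd, ht) in derived:
            called = any(pd == hd and pt == hx for (_, pd, pt) in rel)
            if called and (hx, hd, ht) not in rel:
                rel.add((hx, hd, ht))
                changed = True
    # Step 2: project relevant solutions onto the bytecode domain.
    out = {b for (b0, b1, b2, _, _) in main for b in (b0, b1, b2)}
    out |= {b for (b, hx, hd, ht) in derived if (hx, hd, ht) in rel}
    out.discard(None)  # None stands for the NULL program location
    return out
```

In a toy program where main wraps hx once at bytecode b9 to obtain hy, an unrelated wrapping at some other bytecode is a solution of derivedStream but is not reported as relevant.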
5. EXPERIENCES WITH PQL
Many of the error patterns found in the literature can be ex-
pressed easily in PQL. We have selected four important and rep-
resentative error patterns to illustrate the use of PQL.
1. Serialization errors: a data corruption bug in web servers
that can be exploited to mount denial-of-service attacks.
main(b0, b1, b2, mm, hs, hv, hx, hy) :-
    IE(c0, b0, _, getInputStream),
    actual(b0, 0, vs0), ret(b0, vx),
    vP(c0, vx, hx), vP(c0, vs0, hs),
    derivedStream(_, hx, hy, _),
    IE(c1, b1, _, readObject),
    actual(b1, 0, vy1), ret(b1, vv1),
    vP(c1, vv1, hv), vP(c1, vy1, hy),
    IE(c2, b2, _, mm), actual(b2, 0, vv2),
    vP(c2, vv2, hv).
derivedStream(b, hx, hd, _) :-
    hx = hd, b = NULL.
derivedStream(b, hx, hd, ht) :-
    IE(c, b, _, InputStream.<init>),
    actual(b, 1, vx), ret(b, vt),
    vP(c, vx, hx), vP(c, vt, ht),
    derivedStream(_, ht, hd, _).
mainRelevant(mm, hs, hv, hx, hy) :-
    main(_, _, _, mm, hs, hv, hx, hy).
derivedRelevant(ht, hd, h3) :-
    derivedRelevant(_, hd, ht),
    derivedStream(_, ht, hd, h3).
derivedRelevant(hx, hy, h3) :-
    mainRelevant(_, _, _, hx, hy),
    derivedStream(_, hx, hy, h3).
relevant(b) :- derivedStream(b, hx, hd, ht),
    derivedRelevant(hx, hd, ht).
relevant(b) :- main(_, _, b, _, _, _, _, _).
relevant(b) :- main(_, b, _, _, _, _, _, _).
relevant(b) :- main(b, _, _, _, _, _, _, _).
Figure 8: Datalog translation of Figure 5.
These errors are instances of the simple pattern "do not store object of type X in Y" [45].
2. SQL injections: a major threat to the security of database
servers, as discussed in Section 1. This is an instance of
taint analysis where the use of data obtained in some man-
ner is restricted.
3. Mismatched method pairs: some APIs require that methods be invoked in a certain order. Matching pairs of methods that follow the pattern "a call to method A must always be followed by a call to method B" (such as install always followed by uninstall) are common in large systems. Failing to properly match method calls leads to resource leaks and data structure inconsistencies. Patterns of this kind are simple to specify, but are often difficult to check statically in large applications.
4. Lapsed listeners: a common memory leakage pattern in
Java that may lead to resource exhaustion and crashes in
long-running applications. Listeners follow a more complex
pattern where event A invoked on an object is required to
be followed by event B invoked on a related, but different
object.
These examples considered together show the complementary na-
ture of static and dynamic analysis. Static analysis can solve sim-
ple problems like the serialization error query precisely, whereas
dynamic analysis becomes more useful for more complex queries
like matched method pairs and lapsed listeners.
5.1 Experimental Setup
For our experiments, we use several large open-source Java applications whose characteristics are summarized in Figure 9. webgoat is a test application designed to demonstrate potential security flaws in Java. road2hibernate is a test program that exercises the rather large hibernate object persistence library, which is now a major component of the JBoss suite. snipsnap, roller, and personalblog are widely deployed weblog and wiki applications. Eclipse is the current premier open-source Java IDE; all Eclipse experiments in this paper were run on version 3.0.0.
All of our static analyses were done on an AMD Opteron 150
machine with 4GB of memory running Linux. Dynamic tests were
performed on a 2 GHz AMD Athlon XP with 256MB of memory
running Linux. First, we apply the context-sensitive pointer an-
alysis on our benchmarks. As shown in Figure 10, it takes up to
34 minutes to represent the program as BDD relations and com-
pute the points-to results. Fortunately, this preprocessing step only
needs to be performed once for all queries. Note that even though
road2hibernate consists of only 137 lines of code, the prepro-
cessing time is dominated by the analysis of large libraries it uses.
In Figure 13 we show characteristics of the checkers for the three queries in our experiment. For static analysis, we show the time taken just to resolve the Datalog query and the total time taken, which is often considerably higher, as it includes loading and saving of large relations. Dynamic analysis is run only if warnings from the static analysis are not immediately obvious as errors. The times for the Web applications reflect the average amount of time required to serve a single page, as measured by the standard profiling tool JMeter. road2hibernate is a command-line program and its time is a simple start-to-finish timing.
Our performance numbers indicate that our approach on real applications is quite efficient. Unoptimized dynamic overhead is generally noticeable, but not crippling; after optimization it often becomes no longer measurable, though it may still be as high as 37% in heavily instrumented code. Likewise, our static analysis times are in line with expectations for a context-sensitive pointer analysis run over tens of thousands of classes.
5.2 Serialization Errors
In a three-year study of production software, Reimer et al. found that a large class of high-impact coding errors violate design rules of the form "only store objects of type X in objects of type Y" [45]. Such rules can be easily expressed in PQL. The serialization error we study is an instance of such a pattern. Specifically, HttpSession, a runtime representation of a Web session, is supposed to be a persistent object to allow the Web server to save and restore sessions when the load is too high. As a consequence, only objects implementing the interface Serializable can be stored within an HttpSession, via the setAttribute method. The PQL query corresponding to this design rule is shown in Figure 11.
Violations of this rule will cause the persistence operation to fail,
either with exceptions or via data corruption. The former may be
exploited by a malicious user to mount a denial-of-service attack;
the latter may cause intermittent problems that are hard to test be-
cause session objects are written out only under high load. One
such problem in enterprise Java code reportedly took a team of en-
gineers close to two weeks to detect [45].
Benchmark       Description                                                    Source LOC  Source classes  Library classes  Total classes
webgoat         Sample Web application with known security flaws                   19,440              35              986          1,021
personalblog    Blogging application based on J2EE                                  5,591              59            5,177          5,236
road2hibernate  Test application for Hibernate, an object persistence library         138               2            7,060          7,062
snipsnap        Blogging application based on J2EE                                 57,350             804           10,047         10,851
roller          Blogging application based on J2EE                                 52,089             247           16,112         16,359
Eclipse         Open-source Java IDE (GUI application)                          2,834,133          19,439                          19,439
Figure 9: Summary of information about benchmark Java programs.

Benchmark       Program relation generation  Pointer analysis  Total time
webgoat                                  65                13          78
personalblog                            213               218         431
road2hibernate                          767               512       1,279
snipsnap                                170               151         321
roller                                  978             1,011       2,029
Figure 10: Static preprocessing time, in seconds.
As shown in Figure 12, a total of 61 calls to method HttpSession.setAttribute are found in four benchmarks. After the optimizer was run, only 12 remain as potential matches to our query. This shows how pointer analysis is useful in suppressing false warnings: the static checker is able to deduce that the concrete types of the instances stored implement Serializable in some cases, even though their declared type is not. Eight of the remaining calls to setAttribute are obvious errors that can immediately be seen to not be correct on any run. When our dynamic checker is applied to snipsnap, which contains the 4 unconfirmed warnings, a runtime match is found for one of these suspicious sites, confirming that it is indeed an error.
5.3 Finding Security Flaws: SECURIFLY
Shown in Figure 14 is a more realistic example of the SQL injection vulnerability first mentioned in Section 1.1. Having control over the username and pwd variables, the user can cause arbitrary SQL code to be run or bypass access restrictions. SQL injection is an instance of taint analysis, which requires tracking the flow of data from a set of sources to a set of sinks.
For applications written in the J2EE framework, we have examined the J2EE APIs to identify the sources and sinks for the case of SQL injections. Sources, listed in query UserSource in Figure 15, include return results of HttpServletRequest's methods such as getParameter. Sinks, enumerated in the replaces clause, include arguments of method java.sql.Statement.execute(String sql), java.sql.Connection.prepareStatement(String sql), and so forth.
Because a user-controlled string may be incorporated into other
strings, the main query asks if a user-controlled string (subquery
query main()
returns
    object !java.io.Serializable obj;
    object javax.servlet.http.HttpSession session;
matches {
    session.setAttribute(_, obj);
}
Figure 11: Query for finding serialization errors.

Benchmark      Total calls  Static warnings  Statically confirmed errors  Dynamically confirmed errors
webgoat                  5                1                            1                             -
personalblog             2                0                            0                             -
snipsnap                29               10                            6                             1
roller                  25                1                            1                             -
Total                   61               12                            8                             1
Figure 12: Results for the serialization error query. Calls refer to invocations of HttpSession.setAttribute. "-" indicates that dynamic checking is unnecessary.
UserSource) can be propagated one or more times (subquery StringPropStar) to create a string used in an SQL query (the actions in the replaces clauses of the main query). Unsafe database accesses are replaced with routines that first quote every metacharacter in every instance of the user string in the SQL command, thus transforming possible attacks into legitimate commands.
Note that the string propagation query StringPropStar is not specific to SQL injection, and can be used for a variety of taint queries that involve propagation of Strings. It invokes the StringProp query, which handles all the ways in which one string can be derived from another.
Using PQL we have developed a runtime security protection system for Web applications called SECURIFLY¹. The system presented here can address the problem of SQL injection as well as other vulnerabilities such as cross-site scripting and path traversal attacks described in [33]. However, we have only performed a detailed experimental study of runtime overhead for SQL injections.
Commonly used dynamic techniques such as application firewalls [37] that rely on pattern-matching and monitor traffic flowing in and out of the application are a poor solution for SQL injection [35]. In contrast, SECURIFLY can detect attacks because it observes how data flows through the application. Moreover, SECURIFLY can gracefully recover from vulnerabilities before they can do any harm by sanitizing tainted input whenever necessary. There are some inherent advantages the dynamic approach has over the static one.
• SECURIFLY can be integrated with the server so that whenever a new Web application is added, it is instrumented automatically. This removes the apprehension related to deploying unfamiliar, potentially insecure Web applications. This obviates the issue present with static tools of the code being changed without the tool being rerun. This is particularly important because analyzing Web applications statically can prove to be difficult because of issues such as handling reflection.
1 The name SECURIFLY comes from the idea of providing security on the fly.
                Static analysis time       Instrumentation points     Runtime                                    Overhead
Benchmark       Query resolution  Total    Unoptimized  Optimized     Uninstrumented  Unoptimized  Optimized     Unoptimized  Optimized
BAD STORES
webgoat                        5     12              1          0                  -            -          -               -          -
personalblog                  23     34              1          0                  -            -          -               -          -
snipsnap                      48     67             18          3               .073         .074       .073              1%       < 1%
roller                        61     84             12          0                  -            -          -               -          -
SQL INJECTIONS
webgoat                        1     46            604         69               .024         .054       .033            125%        37%
personalblog                   2     74          3,209         36               .040         .069       .049             72%        22%
road2hibernate                 4    113          4,146        779              2.224        2.443      2.362              9%         3%
snipsnap                       3     79          3,305        542               .073         .096       .080             31%         9%
roller                         4    147          2,960         96               .008         .012       .008             50%       < 1%
Figure 13: Summary of static analysis times, runtimes, dynamic overhead, and the number of instrumentation points with and without optimizations. "-" is used to indicate that no dynamic run was necessary because a static solution was sufficient. All times are in seconds.
• SECURIFLY does not require changes to the original program and does not need access to anything other than the final bytecode. This can be especially advantageous when dealing with applications that rely on libraries whose source is unavailable.
The dynamic checker for the SQL injection query will match whenever a user-controlled string flows in some way to a suspected sink, regardless of whether a user input is harmful in a particular execution. It will then react to replace the potentially dangerous string with a safe one.
The errors located with our tool involved the applications build-
ing SQL strings out of data either sent in from the command line
or generated as parameters of an HTTP request. The former can be
exploited if the program can be executed by the malicious user. The
latter are vulnerable to the more common crafted-HTTP-request at-
tacks.
5.3.1 Importance of Static Optimization
Without static optimization, many program locations need to be instrumented. This is because routines that cause one String to be derived from another are very common. Heavily processed user inputs that do not ever reach the database will also be carefully tracked at runtime, introducing significant overhead to the analysis.
Fortunately, the static optimizer effectively removes instrumen-
tation on calls to string processing routines that are not on a path
from user input to database access. Exploiting pointer information
dramatically reduces both the number of instrumentation points and
the overhead of the system, as shown in Figure 13. The reduction
in the number of instrumentation points due to static optimization
can be as high as 97% in roller and 99% in personalblog. As
shown in Figure 13, reduction in the number of instrumentation
points results in a smaller overhead. For instance, in webgoat, the
overhead is cut almost in half in the optimized version.
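The percentages quoted above can be recomputed from the raw entries of Figure 13. The helper names below are hypothetical, and the assumption that overhead is measured relative to the uninstrumented runtime is ours; it is consistent with the webgoat and roller rows.

```python
def overhead(uninstrumented, instrumented):
    """Relative slowdown of an instrumented run over the uninstrumented one."""
    return (instrumented - uninstrumented) / uninstrumented

def reduction(unoptimized_points, optimized_points):
    """Fraction of instrumentation points removed by static optimization."""
    return 1 - optimized_points / unoptimized_points

# webgoat, SQL injections: .024 s uninstrumented, .054 s unoptimized, .033 s optimized
webgoat_unopt = overhead(0.024, 0.054)   # about 1.25, i.e. 125%
webgoat_opt = overhead(0.024, 0.033)     # about 0.375

# instrumentation points: roller 2,960 -> 96, personalblog 3,209 -> 36
roller_cut = reduction(2960, 96)         # about 0.968, reported as 97%
personalblog_cut = reduction(3209, 36)   # about 0.989, reported as 99%
```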
public void authenticate(HttpServletRequest request) {
    String username = request.getParameter("user");
    java.sql.Statement stmt = con.createStatement();
    String query =
        "select * from users where username = '" +
        username + "' and password = '" + pwd + "'";
    stmt.execute(query);
    ... // process the result of SELECT
}
Figure 14: A classic example of SQL injection.
Note that the query does no direct checking of the value that
has been provided by the user, so if harmless data is passed along a
feasible injection vector, it will still trigger a match to the query. As
a result of this, drastic responses such as aborting the application
are not suitable outside of a debugging context.
5.3.2 Applying Input Sanitization
As seen in Figure 15, each operation that can unsafely use tainted data receives a replaces clause in the query main. When a possibly relevant sink is reached, any matches that have completed and which are consistent with the instruction are gathered, and if such matches are present, the replacing method is executed instead.
The SafePrepare and SafeExecute methods themselves find all substrings in the sink variable that match any of the possible values for source. They then produce a new SQL query string identical to the old, but it quotes all the SQL metacharacters such as