Detection of Attack-Targeted Scans From The Apache HTTP Server Access Logs
Article history: Received 8 January 2017; Revised 25 April 2017; Accepted 26 April 2017; Available online 28 April 2017

Keywords: Rule-based model; Log analysis; Scan detection; Web application security; XSS detection; SQLI detection

Abstract

A web application can be visited for different purposes: a web site may be visited by a regular user as a normal (natural) visit, viewed by crawlers, bots, spiders, etc. for indexing purposes, or scanned exploratively by malicious users prior to an attack. An attack-targeted web scan can be viewed as a phase of a potential attack and can lead to earlier attack detection compared to traditional detection methods. In this work, we propose a method to detect attack-oriented scans and to distinguish them from other types of visits. In this context, we use the access log files of Apache (or IIS) web servers and try to determine attack situations through examination of past data. In addition to web scan detection, we add a rule set to detect SQL Injection and XSS attacks. Our approach has been applied to sample data sets, and the results have been analyzed in terms of performance measures to compare our method with other commonly used detection techniques. Furthermore, various tests have been made on log samples from real systems. Lastly, several suggestions for further development are discussed.

© 2017 The Authors. Production and hosting by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Contents

1. Introduction — 29
2. Related work — 29
3. System model — 30
   3.1. Assumptions — 30
   3.2. Data and log generation — 30
      3.2.1. Web servers — 30
      3.2.2. Damn Vulnerable Web Application (DVWA) — 30
      3.2.3. Web vulnerability scanners — 30
   3.3. Rules and methodology — 31
4. Results — 35
   4.1. Experimental setup — 35
   4.2. Model evaluation — 35
   4.3. Scan detection on live data — 35
5. Conclusion — 36
Appendix A. Supplementary material — 36
References — 36
* Corresponding author.
E-mail addresses: [email protected] (M. Baş Seyyar), [email protected] (F. Özgür Çatak), [email protected] (E. Gül).
Peer review under responsibility of King Saud University.
http://dx.doi.org/10.1016/j.aci.2017.04.002
2210-8327/© 2017 The Authors. Production and hosting by Elsevier B.V. on behalf of King Saud University.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
M. Baş Seyyar et al. / Applied Computing and Informatics 14 (2018) 28–36
One of the interesting results they obtain is that the paths requested in HTTP scans are also requested in brute-force attacks. However, not only HTTP requests but also HTTP responses should be analyzed to get more effective results.

After a learning period on non-malicious HTTP logs, Zolotukhin et al. [10] analyze HTTP requests in an on-line mode to detect network intrusions. Normal user behavior, anomaly-related features and intrusions are extracted from web resources, query attributes and user-agent values respectively. The authors compare five different anomaly-detection methods, namely Support Vector Data Description (SVDD), K-means, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Self-Organizing Map (SOM) and Local Outlier Factor (LOF), according to their accuracy rates in detecting intrusions. It is asserted that the simulation results show higher accuracy rates compared to other data-mining techniques.

Session Anomaly Detection (SAD) is a method developed by Cho and Cha [11] as a Bayesian estimation technique. In this model, web sessions are extracted from web logs and labelled as "normal" or "abnormal" depending on whether they fall below or above an assigned threshold value. In addition, two parameters, page sequences and their frequencies, are investigated in the training data. To test their results, the authors use Whisker v1.4 as a tool for generating anomalous web requests, and it is asserted that the Bayesian estimation technique successfully detects 91% of all anomalous requests. Two points that make this article different from the others are that SAD can be customized by choosing site-dependent parameters, and that the false positive rate gets lower with web topology information.

Singh et al. [12] have presented an analysis of two web-based attacks, i-frame injection attacks and buffer overflow attacks. For the analysis, log files created after attacks are used. They compare the size of the transferred data and the length of input parameters for normal and malicious HTTP requests. As a result, they have only carried out descriptive statistics and have not mentioned any detection techniques.

In their work, Stevanovic et al. [13] use SOM and Modified Adaptive Resonance Theory 2 (Modified ART2) algorithms for training, and 10 features related to web sessions for clustering. The authors then label these sessions as human visitors, well-behaved web crawlers, malicious crawlers and unknown visitors. In addition to classifying web sessions, similarities among the browsing styles of Google, MSN, and Yahoo are also analyzed in this article. The authors obtain many interesting results, one of which is that 52% of malicious web crawlers and human visitors are similar in their browsing strategies, which means that they are hard to distinguish from each other.

Another, completely different work proposes a semantic model named the ontological model [14]. The authors assert that attack signatures are not independent of programming languages and platforms; as a result, signatures may become invalid after changes in the business logic. In contrast, their model is extendible and reusable, and can detect malicious scripts in HTTP requests and responses. Also, thanks to the ontological model, zero-day attacks can be effectively detected. Their paper also includes a comparison between the proposed semantic model and ModSecurity.

There are several differences between our work and the above-mentioned works. Firstly, as most of the related works show, checking only the user-agent header field against a list is not enough to detect web crawlers correctly; accordingly, we add extra fields to check in order to make web crawler detection more accurate. Additionally, unlike machine learning and data-mining approaches, rule-based detection is used in the proposed model. Finally, in contrast to other works, we prefer to use the combined log format in order to enlarge the number of features and to get more consistent results.

3. System model

In this section, we describe how we construct and design the proposed model in detail. We also present our rules with their underlying reasons.

3.1. Assumptions

In access logs, POST data cannot be logged; thus, the proposed method cannot capture this sort of data.

Browsers or application servers may support other encodings. Since only two of them are in the context of this work, our script cannot capture data encoded in other styles.

Our model is designed for the detection of two well-known web application attacks and of malicious web vulnerability scans, not for prevention. Thus, working in on-line mode is not included in the context of our research.

3.2. Data and log generation

In this section, the tools, applications and virtual environment used throughout this work, together with their installation and configuration settings, are explained.

3.2.1. Web servers
3.2.1.1. HTTP Server. As mentioned earlier, Apache/2.4.7 (Ubuntu) Server is chosen as the web server. Apache is known to be the most commonly used web server: according to W3Techs (Web Technology Surveys) [15], as of December 1, 2016, Apache is used by 51.2 percent of all web servers. In addition, it is open source, highly scalable and has a dynamically loadable module system. Apache is installed via the apt-get command-line package manager; no extra configuration is necessary for the scope of this work.

3.2.1.2. Apache Tomcat. Apache Tomcat, an implementation of the Java Servlet, JavaServer Pages, Java Expression Language and Java WebSocket technologies, is open source software [16]. In this work, Apache Tomcat Version 8.0.33 is used, with Atlassian JIRA Standalone Edition (Jira 3.19.0-25-generic #26) as the web application. The access log configuration of Tomcat is set to produce entries similar to the access log entries in Apache.

3.2.2. Damn Vulnerable Web Application (DVWA)
DVWA is a vulnerable PHP/MySQL web application. It is designed to help web developers discover critical web application security vulnerabilities through hands-on activity. Unlike illegal website hacking, it offers a totally legal environment for security people to exploit. With DVWA, Brute Force, Cross Site Request Forgery (CSRF), Command Execution, XSS (reflected) and SQL Injection vulnerabilities can be tested at three security levels: low, medium and high.

In this work, DVWA version 1.0.8 (release date: 11/01/2011) is used. To install this web application, a Linux Apache MySQL PHP (LAMP) server, including MySQL, PHP5 and phpMyAdmin, has been set up. The reasons for studying DVWA are to better understand XSS and SQL Injection attacks and to find the related payloads substituted in the query string part of URIs. In this way, the rule selection for detecting these attacks from access logs could be correctly determined. The web vulnerability scanners used in this work have also scanned this web application for data collection purposes.

3.2.3. Web vulnerability scanners
3.2.3.1. Acunetix. Acunetix is one of the most commonly used commercial web vulnerability scanners. Acunetix scans a web site according to the determined configurations, produces a report about the existing vulnerabilities, groups them as high, medium,
low and informational; and identifies the threat level of the web application together with related mitigation recommendations. In the context of this work, Acunetix Web Vulnerability Scanner (WVS) Reporter v7.0 has been used with default scanning configurations in addition to site login information.

3.2.3.2. Netsparker. Netsparker is also a commercial web application security scanner. Netsparker detects the security vulnerabilities of a web application and produces a report including mitigation solutions. In addition, detected vulnerabilities can be exploited to confirm the report results. In the context of this work, Netsparker Microsoft Software Library (MSL) Internal Build 4.6.1.0 along with Vulnerability Database 2016.10.27.1533 has been used with special scanning configurations including custom cookie information.

3.2.3.3. Web Application Attack and Audit Framework (W3AF). W3AF is an open source web application security scanner, developed in Python and licensed under the General Public License (GPL) v2.0. The framework is designed to help web administrators secure web applications and can detect more than 200 vulnerabilities [17]. W3AF has several plug-ins for different operations such as crawling, brute forcing, and firewall bypassing. W3AF comes by default in Kali Linux and can be found under "Applications/Web Application Analysis/Web Vulnerability Scanners". W3AF version 1.6.54 has been used with the "fast-scan" profile through the audit, crawl, grep and output plugins.

3.3. Rules and methodology

As mentioned earlier, our script runs on access log files. The main reason for this choice is the opportunity for a detailed analysis of users' actions. By examining past data, information security policies for web applications can be correctly created and implemented. Additionally, further exploitation can be prevented in advance. Unlike the proposed model, a Network Intrusion Detection System (NIDS) may not detect attacks when HTTPS is used [4]. However, working with logs has some disadvantages. Since log files do not contain all the data of an HTTP request and response, some important data cannot be analyzed; for example, POST parameters that are vulnerable to injection attacks are not logged by web servers. Other negative aspects are the size of the logs and the difficulty of parsing them. To address the size problem, we separate the access log files on a daily basis; therefore, web administrators might run our script every day to check for an attack. Lastly, real-time detection and prevention are not possible with the proposed method, which runs off-line, and we do not guarantee that it can run on-line. In fact, this off-line approach is conceptually sufficient for the scope of this work. Differently from the test environment, an extra module that directly accesses the logs, or a script that analyses logs faster, could be developed to use our approach in a live environment.
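Since the script operates on entries in Apache's combined log format, each line must first be split into its fields. The sketch below is our own illustration of such parsing (the regular expression and field names are not the authors' actual implementation):

```python
import re

# Apache "combined" log format:
# %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one combined-format entry, or None."""
    m = COMBINED.match(line)
    return m.groupdict() if m else None

line = ('203.0.113.7 - - [10/Mar/2016:10:12:01 +0000] '
        '"GET /index.php?id=1%27%20OR%201=1-- HTTP/1.1" 200 2326 '
        '"-" "Mozilla/5.0"')
entry = parse_line(line)
print(entry["status"], entry["agent"])
```

Note that the referrer and user-agent fields, which several of the rules below depend on, are only present in the combined format, not in Apache's shorter common format.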
Our method can be described as rule-based detection. Unlike anomaly-based detection, our rules are static, including both blacklist and whitelist approaches. In detail, the XSS and SQL injection detection part of our method is a positive security model, while the rest is a negative security model. Thus, data evasion is kept to a minimum level. In order to classify the IP addresses in the access log file, we identify three different visitor types as follows:

Table 1
HTTP methods in Acunetix.

  HTTP method   Number
  Connect       2
  Get           2758
  Options       2
  Post          668
  Trace         2
  Track         2
  Total         3434

Table 6
Details of classified data sets. [Table values were not recoverable from the source.]
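Distributions like the one in Table 1 can be obtained by tallying the request method of each parsed entry. A small sketch, where `entries` stands for a hypothetical list of parsed log dictionaries as produced by a combined-log parser:

```python
from collections import Counter

def method_counts(entries):
    """Tally HTTP methods from parsed request lines ("GET /x HTTP/1.1")."""
    return Counter(e["request"].split(" ")[0].capitalize() for e in entries)

# Illustrative entries, not taken from the paper's data set.
entries = [{"request": "GET / HTTP/1.1"}, {"request": "GET /a HTTP/1.1"},
           {"request": "POST /login HTTP/1.1"}, {"request": "TRACK / HTTP/1.1"}]
print(method_counts(entries))  # Counter({'Get': 2, 'Post': 1, 'Track': 1})
```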
[Table 4. HTTP status codes in W3AF — values not recoverable from the source.]
[Table 5 — caption and values not recoverable from the source.]
[Table 10. Data sets test results — values not recoverable from the source.]
Type 1: Regular (normal) users with a normal (natural) visit.
Type 2: Crawlers, bots, spiders or robots.
Type 3: Malicious users using automated web vulnerability scanners.

As shown in Fig. 1, in Phase 1 our first step is to detect SQL injection and XSS attacks. Although different parts of HTTP (the HTTP body, the URI) can be used to exploit a vulnerability [4], we analyze the path and query parts of the requested URI for detection.

In detail, for XSS we use regular expressions to recognize patterns such as HTML tags, the 'src' parameter of the 'img' tag and some JavaScript event handlers. Likewise, for SQL injection we check for the existence of the single quote, the double dash, '#', the exec() function and some SQL keywords. In addition, since there is a possibility of URL obfuscation, Hex and UTF-8 encodings of these patterns are also taken into consideration.

Afterwards, in Phase 2 we continue by separating the IP addresses of Type 2 from the rest of the access log file. To do this, two different approaches are used. Firstly, the user-agent part of all log entries is compared with the user-agent list from the robots database that is publicly available in [18]. However, since this list may not be up-to-date, additional bot detection rules are added. In order to identify these rules, we use the following observations about web robots:

1. Most web robots make a request for the "/robots.txt" file [19].
2. Web robots have a higher rate of "4xx" requests, since they usually request unavailable pages [20–23].
3. Web robots have higher unassigned referrer ("-") rates [23–25].
[Fig. 3. Data Set 1 test results — plot not reproduced; x-axis: Days (10–14 Mar. 04), y-axis: Type 3 (%).]
[Fig. 4. Data Set 2 test results — plot not reproduced; x-axis: Months (Apr–Oct), y-axis: Type 3 (%).]
4. According to the access logs that we analyzed, the user-agent header field of web robots may contain keywords such as bot, crawler, spider, wanderer, and robot.

As a result of the above-mentioned observations, we add some extra rules to correctly distinguish Type 2 from other visitors.

For the rest of our rule set, as indicated at Phase 3 in Fig. 1, we continue by investigating the access log files formed as a result of the vulnerability scanning mentioned in the previous section. As shown in Tables 1 and 2, our first immediate observation is that, compared to Type 2 and Type 1, Type 3's requests include different HTTP methods, such as Track, Trace, Netsparker, Pri, Propfind and Quit. Secondly, as shown in Tables 3, 4 and 5, we deduce that the status codes of Type 3 differ from those of Type 2 and Type 1. In fact, Type 3 has a higher rate of "404" requests, the average of which for Acunetix, Netsparker and W3AF is 31% in our data set. Thus, we generate a rule to check the presence of these HTTP methods and the percentage of "404" requests. The user-agent header fields of Type 3 can generally be modified and obfuscated manually at the configuration phase before a vulnerability scan. Even so, we made a list of well-known automated web vulnerability scanners and compare it against the user-agent header fields. Finally, since we notice that these scanners make more than 100 HTTP requests in a certain time, we select this value as a threshold for Type 3 detection.

The pseudo code of the proposed model is shown in Algorithm 1.
Algorithm 1. Pseudo-Code for Proposed Model. [The typeset pseudo-code did not survive text extraction.]
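As a rough, simplified illustration of the three phases described above — not the authors' actual algorithm — the per-IP classification could be sketched as follows. All patterns, keyword lists and the 31%/100-request thresholds are taken from the prose; the helper names and the exact regexes are our own assumptions:

```python
import re
from urllib.parse import unquote

# Hypothetical, simplified patterns inspired by the rules in Section 3.3.
XSS_PAT = re.compile(r"<\s*script|<\s*img[^>]+src|on(error|load|click)\s*=", re.I)
SQLI_PAT = re.compile(r"('|--|#|\bexec\b|\bunion\b|\bselect\b)", re.I)
BOT_WORDS = ("bot", "crawler", "spider", "wanderer", "robot")
SCANNER_WORDS = ("acunetix", "netsparker", "w3af")

def classify_ip(entries):
    """entries: parsed log dicts for one IP address. Returns a visitor label."""
    total = len(entries)
    n404 = sum(1 for e in entries if e["status"] == "404")
    agents = " ".join(e["agent"].lower() for e in entries)
    # Phase 1: attack payloads in the decoded path/query part of the URI.
    attack = any(XSS_PAT.search(unquote(e["request"])) or
                 SQLI_PAT.search(unquote(e["request"])) for e in entries)
    # Phase 2: crawler heuristics (robots.txt request, user-agent keywords).
    if (any(e["request"].split(" ")[1:2] == ["/robots.txt"] for e in entries)
            or any(w in agents for w in BOT_WORDS)):
        return "Type 2 (crawler)"
    # Phase 3: scanner heuristics (known scanner UA, 404 rate over the
    # 31% average at >100 requests, or attack payloads observed).
    if (any(w in agents for w in SCANNER_WORDS)
            or (total > 100 and n404 / total > 0.31)
            or attack):
        return "Type 3 (scanner)"
    return "Type 1 (regular user)"
```

The real model also checks the unusual HTTP methods, referrer rates and encoded payload variants described in the text; this sketch folds Phase 1 hits into the Type 3 verdict purely for brevity.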
4. Results

This section is based on the evaluation of our model against some important metrics. Moreover, test results of attack detection on live data are also included.

4.1. Experimental setup

Firstly, DVWA is scanned with Acunetix; secondly, DVWA is scanned for 19 min and 56 s with Netsparker; lastly, DVWA is scanned for 2 min and 6 s with W3AF. The details of the related access log files are summarized as Type 3 in Table 6. For the evaluation of the proposed model, we combine all mentioned access log files into one file that constitutes our general data set. Then, we run our Python script on this data set.

4.2. Model evaluation

Initially, to evaluate the proposed model, we compute the confusion matrix, where TP, FN, FP, and TN denote true positives, false negatives, false positives, and true negatives respectively, as shown in Table 7. We then evaluate the following measures:

  accuracy (acc)   = (TN + TP) / (TN + FN + FP + TP)
  precision (prec) = TP / (TP + FP)
  recall (rec)     = TP / (TP + FN)
  F1 score         = 2TP / (2TP + FP + FN)          (1)

As a result, our model has 99.38% accuracy, 100.00% precision, 75.00% recall and finally an 85.71% F1 score, as we can see in Table 8.

Fig. 2 illustrates the relation between the number of lines in the log files and the running time. It is clear that the running time rises steadily as the number of lines increases.

[Fig. 5. Data Set 3 test results — plot not reproduced; x-axis: Months (Jun–Feb), y-axis: Type 3 (%).]
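The measures in Eq. (1) are straightforward to compute from the confusion-matrix counts. The counts below (TP=3, FN=1, FP=0, TN=157) are our own back-calculation that happens to reproduce the reported 99.38%/100.00%/75.00%/85.71%; they are not the paper's actual Table 7:

```python
def metrics(tp, fn, fp, tn):
    """Compute the four measures from Eq. (1) given confusion-matrix counts."""
    acc = (tn + tp) / (tn + fn + fp + tp)
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return acc, prec, rec, f1

# Hypothetical counts chosen to match the rates reported in Table 8.
acc, prec, rec, f1 = metrics(tp=3, fn=1, fp=0, tn=157)
print(f"{acc:.2%} {prec:.2%} {rec:.2%} {f1:.2%}")
# prints: 99.38% 100.00% 75.00% 85.71%
```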
4.3. Scan detection on live data

[…] monthly, 15-day and 15-day periods respectively. Related details are expressed in Table 10. The Type 3 percentage of each data set is shown in Figs. 3–7.

[Fig. 7. Data Set 5 test results — plot not reproduced; x-axis: 15-Day Period (1–5), y-axis: Type 3 (%).]

5. Conclusion

In this work, we studied the detection of web vulnerability scans through the access log files of web servers, in addition to the detection of XSS and SQLI attacks. In accordance with this purpose, we used a rule-based methodology. Firstly, we examined the behavior of automated vulnerability scanners. Moreover, we implemented our model as a Python script. Afterwards, our model was evaluated on the data we collected. Finally, we tested our model on log samples from real systems.

It is clear that our method has a very high probability of detection and a low probability of false alarm. More specifically, the accuracy and precision rates of our model are 99.38% and 100.00% respectively. More importantly, malicious scans can be captured more precisely because different types of scanning tools, both open source and commercial, were examined. Therefore, our results indicate that static rules can successfully detect web vulnerability scans. Besides, we have observed that our model functions properly with larger and live data sets and correctly detects Type 3 IP addresses.

As shown in Fig. 2, the relation between the number of lines of the log files and the running time is linear. As a result, how long it would take to analyze a log file can be predicted in advance.

The results presented in this work may enhance research on malicious web scans and may support the development of attack detection studies. Also, if security analysts or administrators execute the proposed Python script several times within the same day, they could prevent most web-related attacks.

Future work considerations related to this work are twofold. In the first place, one could extend our model to analyze other log files such as the audit log and the error log. Secondly, beyond the scope of this work, other well-known web application attacks besides SQLI and XSS, such as CSRF, could be addressed too.

Appendix A. Supplementary material

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.aci.2017.04.002.

References

[1] E.M. Hutchins, M.J. Cloppert, R.M. Amin, Intelligence-driven Computer Network Defense Informed by Analysis of Adversary Campaigns and
[3] … URL <http://www.sciencedirect.com/science/article/pii/S2210832714000386>.
[4] R. Meyer, Detecting Attacks on Web Applications from Log Files. URL <https://www.sans.org/reading-room/whitepapers/logging/detecting-attacks-web-applications-log-files-2074>, 2008 (accessed December 12, 2016).
[5] D.B. Cid, Log Analysis using OSSEC. URL <http://www.academia.edu/8343225/Log_Analysis_using_OSSEC>, 2007 (accessed November 29, 2016).
[6] Wikipedia, Overfitting. URL <https://en.wikipedia.org/wiki/Overfitting>, 2016 (accessed December 27, 2016).
[7] M. Auxilia, D. Tamilselvan, Anomaly detection using negative security model in web application, in: 2010 International Conference on Computer Information Systems and Industrial Management Applications (CISIM), 2010, pp. 481–486, http://dx.doi.org/10.1109/CISIM.2010.5643461.
[8] K. Goseva-Popstojanova, G. Anastasovski, R. Pantev, Classification of malicious web sessions, in: 2012 21st International Conference on Computer Communications and Networks (ICCCN), 2012, pp. 1–9, http://dx.doi.org/10.1109/ICCCN.2012.6289291.
[9] M. Husák, P. Velan, J. Vykopal, Security monitoring of http traffic using extended flows, in: 2015 10th International Conference on Availability, Reliability and Security, 2015, pp. 258–265, http://dx.doi.org/10.1109/ARES.2015.42.
[10] M. Zolotukhin, T. Hämäläinen, T. Kokkonen, J. Siltanen, Analysis of http requests for anomaly detection of web attacks, in: 2014 IEEE 12th International Conference on Dependable, Autonomic and Secure Computing, 2014, pp. 406–411, http://dx.doi.org/10.1109/DASC.2014.79.
[11] S. Cho, S. Cha, Sad: web session anomaly detection based on parameter estimation, Comput. Secur. 23 (4) (2004) 312–319, http://dx.doi.org/10.1016/j.cose.2004.01.006.
[12] N. Singh, A. Jain, R.S. Raw, R. Raman, Detection of Web-Based Attacks by Analyzing Web Server Log Files, Springer India, New Delhi, 2014, pp. 101–109, http://dx.doi.org/10.1007/978-81-322-1665-0_10.
[13] D. Stevanovic, N. Vlajic, A. An, Detection of malicious and non-malicious website visitors using unsupervised neural network learning, Appl. Soft Comput. 13 (1) (2013) 698–708, http://dx.doi.org/10.1016/j.asoc.2012.08.028.
[14] A. Razzaq, Z. Anwar, H.F. Ahmad, K. Latif, F. Munir, Ontology for attack detection: an intelligent approach to web application security, Comput. Secur. 45 (2014) 124–146, http://dx.doi.org/10.1016/j.cose.2014.05.005.
[15] W3Techs (Q-Success DI Gelbmann GmbH), Usage Statistics and Market Share of Apache for Websites. URL <https://w3techs.com/technologies/details/ws-apache/all/all>, 2009–2017 (accessed December 12, 2016).
[16] The Apache Software Foundation, Apache Tomcat. URL <http://tomcat.apache.org> (accessed December 24, 2016).
[17] w3af.org, w3af. URL <http://w3af.org>, 2013 (accessed December 12, 2016).
[18] The Web Robots Pages, Robots Database. URL <http://www.robotstxt.org/db.html> (accessed September 4, 2016).
[19] M.C. Calzarossa, L. Massari, D. Tessera, An extensive study of web robots traffic, in: Proceedings of International Conference on Information Integration and Web-based Applications & Services, IIWAS '13, ACM, New York, NY, USA, 2013, pp. 410:410–410:417, http://dx.doi.org/10.1145/2539150.2539161.
[20] M.D. Dikaiakos, A. Stassopoulou, L. Papageorgiou, An investigation of web crawler behavior: characterization and metrics, Comput. Commun. 28 (8) (2005) 880–897, http://dx.doi.org/10.1016/j.comcom.2005.01.003.
[21] M. Dikaiakos, A. Stassopoulou, L. Papageorgiou, Characterizing Crawler Behavior from Web Server Access Logs, Springer Berlin Heidelberg, Berlin, Heidelberg, 2003, http://dx.doi.org/10.1007/978-3-540-45229-4_36.
[22] M.C. Calzarossa, L. Massari, Analysis of Web Logs: Challenges and Findings, Springer Berlin Heidelberg, Berlin, Heidelberg, 2011, pp. 227–239, http://dx.doi.org/10.1007/978-3-642-25575-5_19.
[23] D. Stevanovic, A. An, N. Vlajic, Feature evaluation for web crawler detection with data mining techniques, Expert Syst. Appl. 39 (10) (2012) 8707–8717, http://dx.doi.org/10.1016/j.eswa.2012.01.210.
[24] A.G. Lourenço, O.O. Belo, Catching web crawlers in the act, in: Proceedings of the 6th International Conference on Web Engineering, ICWE '06, ACM, New York, NY, USA, 2006, pp. 265–272, http://dx.doi.org/10.1145/1145581.1145634.
[25] D. Stevanovic, A. An, N. Vlajic, Detecting Web Crawlers from Web Server Access Logs with Data Mining Classifiers, Springer Berlin Heidelberg, Berlin, Heidelberg, 2011, pp. 483–489, http://dx.doi.org/10.1007/978-3-642-21916-0_52.
[26] A. Chuvakin, Public Security Log Sharing Site. URL <http://log-sharing.dreamhosters.com>, 2009 (accessed December 15, 2015).