Detection of Attack-Targeted Scans From The Apache HTTP Server Access Logs
Article history: Received 8 January 2017; Revised 25 April 2017; Accepted 26 April 2017; Available online 28 April 2017

Keywords: Rule-based model; Log analysis; Scan detection; Web application security; XSS detection; SQLI detection

Abstract

A web application can be visited for different purposes: a web site may be visited by a regular user as a normal (natural) visit, viewed by crawlers, bots, spiders, etc. for indexing purposes, or scanned exploratively by malicious users prior to an attack. An attack-targeted web scan can be viewed as a phase of a potential attack and can lead to earlier attack detection compared to traditional detection methods. In this work, we propose a method to detect attack-oriented scans and to distinguish them from other types of visits. In this context, we use the access log files of Apache (or IIS) web servers and try to determine attack situations through examination of past data. In addition to web scan detection, we add a rule set to detect SQL Injection and XSS attacks. Our approach has been applied to sample data sets, and the results have been analyzed in terms of performance measures to compare our method with other commonly used detection techniques. Furthermore, various tests have been made on log samples from real systems. Lastly, several suggestions for further development are discussed.

© 2017 The Authors. Production and hosting by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Contents

1. Introduction — 29
2. Related work — 29
3. System model — 30
   3.1. Assumptions — 30
   3.2. Data and log generation — 30
      3.2.1. Web servers — 30
      3.2.2. Damn Vulnerable Web Application (DVWA) — 30
      3.2.3. Web vulnerability scanners — 30
   3.3. Rules and methodology — 31
4. Results — 35
   4.1. Experimental setup — 35
   4.2. Model evaluation — 35
   4.3. Scan detection on live data — 35
5. Conclusion — 36
Appendix A. Supplementary material — 36
References — 36
* Corresponding author.
E-mail addresses: [email protected] (M. Baş Seyyar), [email protected] (F. Özgür Çatak), [email protected] (E. Gül).
Peer review under responsibility of King Saud University.
http://dx.doi.org/10.1016/j.aci.2017.04.002
2210-8327/© 2017 The Authors. Production and hosting by Elsevier B.V. on behalf of King Saud University.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
M. Baş Seyyar et al. / Applied Computing and Informatics 14 (2018) 28–36
One of the interesting results they obtain is that the paths requested in HTTP scans are also requested in brute-force attacks. However, not only HTTP requests but also HTTP responses should be analyzed to get more effective results.

After a learning period on non-malicious HTTP logs, Zolotukhin et al. [10] analyze HTTP requests in an on-line mode to detect network intrusions. Normal user behavior, anomaly-related features and intrusions are extracted from web resources, query attributes and user-agent values respectively. The authors compare five different anomaly-detection methods, namely Support Vector Data Description (SVDD), K-means, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Self-Organizing Map (SOM) and Local Outlier Factor (LOF), according to their accuracy rates in detecting intrusions. It is asserted that the simulation results show higher accuracy rates compared to other data-mining techniques.

Session Anomaly Detection (SAD) is a method developed by Cho and Cha [11] as a Bayesian estimation technique. In this model, web sessions are extracted from web logs and labelled as "normal" or "abnormal" depending on whether they fall below or above an assigned threshold value. In addition, two parameters, page sequences and their frequencies, are investigated in the training data. To test their results, the authors use Whisker v1.4 as a tool for generating anomalous web requests, and it is asserted that the Bayesian estimation technique successfully detects 91% of all anomalous requests. Two points that make this article different from the others are that SAD can be customized by choosing site-dependent parameters, and that the false positive rate gets lower with web topology information.

Singh et al. [12] have presented an analysis of two web-based attacks, i-frame injection attacks and buffer overflow attacks. For the analysis, log files created after attacks are used. They compare the size of the transferred data and the length of input parameters for normal and malicious HTTP requests. As a result, they have only carried out descriptive statistics and have not mentioned any detection techniques.

In their work, Stevanovic et al. [13] use SOM and Modified Adaptive Resonance Theory 2 (Modified ART2) algorithms for training, and 10 features related to web sessions for clustering. The authors then label these sessions as human visitors, well-behaved web crawlers, malicious crawlers and unknown visitors. In addition to classifying web sessions, similarities among the browsing styles of Google, MSN, and Yahoo are also analyzed in this article. The authors obtain many interesting results, one of which is that 52% of malicious web crawlers and human visitors are similar in their browsing strategies, which means that they are hard to distinguish from each other.

Another, completely different work proposes a semantic model named the ontological model [14]. The authors assert that attack signatures are not independent of programming languages and platforms; as a result, signatures may become invalid after changes in the business logic. In contrast, their model is extendible and reusable, and can detect malicious scripts in HTTP requests and responses. Also, thanks to the ontological model, zero-day attacks can be effectively detected. Their paper also includes a comparison between the proposed semantic model and ModSecurity.

There are several differences between our work and the above-mentioned works. Firstly, as most of the related works show, checking only the user-agent header field against a list is not enough to detect web crawlers correctly; accordingly, we add extra fields to check in order to make web crawler detection more accurate. Additionally, unlike machine learning and data-mining approaches, rule-based detection is used in the proposed model. Finally, in contrast to other works, we prefer to use the combined log format in order to enlarge the number of features and to get more consistent results.

3. System model

In this section, we describe how we construct and design the proposed model in detail. We also present our rules with their underlying reasons.

3.1. Assumptions

In access logs, POST data cannot be logged; thus, the proposed method cannot capture this sort of data.

Browsers or application servers may support other encodings. Since only two of them are in the context of this work, our script cannot capture data encoded in other styles.

Our model is designed for the detection of two well-known web application attacks and of malicious web vulnerability scans, not for prevention. Thus, working in on-line mode is not included in the context of our research.

3.2. Data and log generation

In this section, the tools, applications and virtual environment used throughout this work, together with their installation and configuration settings, are explained.

3.2.1. Web servers
3.2.1.1. HTTP Server. As mentioned earlier, Apache/2.4.7 (Ubuntu) Server is chosen as the web server. Apache is known to be the most commonly used web server: according to W3Techs (Web Technology Surveys) [15], as of December 1, 2016, Apache is used by 51.2 percent of all web servers. In addition, it is open source, highly scalable and has a dynamically loadable module system. Apache is installed via the apt-get command-line package manager; no extra configuration is necessary for the scope of this work.

3.2.1.2. Apache Tomcat. Apache Tomcat, an implementation of the Java Servlet, JavaServer Pages, Java Expression Language and Java WebSocket technologies, is open source software [16]. In this work, Apache Tomcat Version 8.0.33 is used, with Atlassian JIRA Standalone Edition (Jira 3.19.0-25-generic #26) as the web application. The access log configuration of Tomcat is set to produce entries similar to the access log entries in Apache.

3.2.2. Damn Vulnerable Web Application (DVWA)
DVWA is a vulnerable PHP/MySQL web application. It is designed to help web developers discover critical web application security vulnerabilities through hands-on activity. Unlike illegal website hacking, it offers a totally legal environment for security people to exploit. With DVWA, Brute Force, Cross Site Request Forgery (CSRF), Command Execution, XSS (reflected) and SQL Injection vulnerabilities can be tested at three security levels: low, medium and high.

In this work, DVWA version 1.0.8 (release date: 11/01/2011) is used. To install this web application, a Linux Apache MySQL PHP (LAMP) server, including MySQL, PHP5 and phpMyAdmin, has been set up. The reasons for studying DVWA are to better understand XSS and SQL Injection attacks and to find the related payloads substituted in the query string part of URIs. In this way, the rule selection for detecting these attacks from access logs could be correctly determined. The web vulnerability scanners used in this work have also scanned this web application for data collection purposes.

3.2.3. Web vulnerability scanners
3.2.3.1. Acunetix. Acunetix is one of the most commonly used commercial web vulnerability scanners. Acunetix scans a web site according to the determined configurations, produces a report about the existing vulnerabilities, groups them as high, medium,
low and informational; and identifies the threat level of the web application together with related mitigation recommendations. In the context of this work, Acunetix Web Vulnerability Scanner (WVS) Reporter v7.0 has been used with default scanning configurations in addition to site login information.

3.2.3.2. Netsparker. Netsparker is also a commercial web application security scanner. Netsparker detects the security vulnerabilities of a web application and produces a report including mitigation solutions. In addition, detected vulnerabilities can be exploited to confirm the report results. In the context of this work, Netsparker Microsoft Software Library (MSL) Internal Build 4.6.1.0 along with Vulnerability Database 2016.10.27.1533 has been used with special scanning configurations including custom cookie information.

3.2.3.3. Web Application Attack and Audit Framework (W3AF). W3AF is an open source web application security scanner, developed in Python and licensed under the General Public License (GPL) v2.0. The framework is designed to help web administrators secure web applications and can detect more than 200 vulnerabilities [17]. W3AF has several plug-ins for different operations such as crawling, brute forcing, and firewall bypassing. W3AF comes by default in Kali Linux and can be found under "Applications/Web Application Analysis/Web Vulnerability Scanners". W3AF version 1.6.54 has been used with the "fast-scan" profile through the audit, crawl, grep and output plugins.

3.3. Rules and methodology

As mentioned earlier, our script runs on access log files. The main reason for this choice is the opportunity for a detailed analysis of users' actions. By examining past data, information security policies for web applications can be correctly created and implemented. Additionally, further exploitation can be prevented in advance. Unlike the proposed model, a Network Intrusion Detection System (NIDS) may not detect attacks when HTTPS is used [4]. However, working with logs has some disadvantages. Since log files do not contain all the data of an HTTP request and response, some important data cannot be analyzed; for example, POST parameters that are vulnerable to injection attacks are not logged by web servers. Other negative aspects are the size of the logs and the difficulty of parsing them. To address the size problem, we separate the access log files on a daily basis; therefore, web administrators might run our script every day to check for an attack. Lastly, real-time detection and prevention are not possible with the proposed method, which runs off-line, and we do not guarantee that it can run on-line. In fact, this off-line approach is conceptually sufficient for the scope of this work. Differently from the test environment, an extra module that directly accesses the logs, or a script that analyses logs faster, could be developed to use our approach in a live environment.
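Since the script operates on entries in Apache's combined log format, each line must first be split into its fields. The sketch below is our own illustration of such parsing (the regular expression and field names are not the authors' actual implementation):

```python
import re

# Apache "combined" log format:
# %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one combined-format entry, or None."""
    m = COMBINED.match(line)
    return m.groupdict() if m else None

line = ('203.0.113.7 - - [10/Mar/2016:10:12:01 +0000] '
        '"GET /index.php?id=1%27%20OR%201=1-- HTTP/1.1" 200 2326 '
        '"-" "Mozilla/5.0"')
entry = parse_line(line)
print(entry["status"], entry["agent"])
```

Note that the referrer and user-agent fields, which several of the rules below depend on, are only present in the combined format, not in Apache's shorter common format.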
Our method can be described as rule-based detection. Unlike anomaly-based detection, our rules are static, including both blacklist and whitelist approaches. In detail, the XSS and SQL injection detection part of our method is a positive security model, while the rest is a negative security model. Thus, data evasion is kept to a minimum level. In order to classify the IP addresses in the access log file, we identify three different visitor types as follows:

Table 1
HTTP methods in Acunetix.

  HTTP method   Number
  Connect       2
  Get           2758
  Options       2
  Post          668
  Trace         2
  Track         2
  Total         3434

Table 6
Details of classified data sets. [Table values were not recoverable from the source.]
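Distributions like the one in Table 1 can be obtained by tallying the request method of each parsed entry. A small sketch, where `entries` stands for a hypothetical list of parsed log dictionaries as produced by a combined-log parser:

```python
from collections import Counter

def method_counts(entries):
    """Tally HTTP methods from parsed request lines ("GET /x HTTP/1.1")."""
    return Counter(e["request"].split(" ")[0].capitalize() for e in entries)

# Illustrative entries, not taken from the paper's data set.
entries = [{"request": "GET / HTTP/1.1"}, {"request": "GET /a HTTP/1.1"},
           {"request": "POST /login HTTP/1.1"}, {"request": "TRACK / HTTP/1.1"}]
print(method_counts(entries))  # Counter({'Get': 2, 'Post': 1, 'Track': 1})
```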
[Table 4. HTTP status codes in W3AF — values not recoverable from the source.]
[Table 5 — caption and values not recoverable from the source.]
[Table 10. Data sets test results — values not recoverable from the source.]
Type 1: Regular (normal) users with a normal (natural) visit.
Type 2: Crawlers, bots, spiders or robots.
Type 3: Malicious users using automated web vulnerability scanners.

As shown in Fig. 1, in Phase 1 our first step is to detect SQL injection and XSS attacks. Although different parts of HTTP (the HTTP body, the URI) can be used to exploit a vulnerability [4], we analyze the path and query parts of the requested URI for detection.

In detail, for XSS we use regular expressions to recognize patterns such as HTML tags, the 'src' parameter of the 'img' tag and some JavaScript event handlers. Likewise, for SQL injection we check for the existence of the single quote, the double dash, '#', the exec() function and some SQL keywords. In addition, since there is a possibility of URL obfuscation, Hex and UTF-8 encodings of these patterns are also taken into consideration.

Afterwards, in Phase 2 we continue by separating the IP addresses of Type 2 from the rest of the access log file. To do this, two different approaches are used. Firstly, the user-agent part of all log entries is compared with the user-agent list from the robots database that is publicly available in [18]. However, since this list may not be up-to-date, additional bot detection rules are added. In order to identify these rules, we use the following observations about web robots:

1. Most web robots make a request for the "/robots.txt" file [19].
2. Web robots have a higher rate of "4xx" requests, since they usually request unavailable pages [20–23].
3. Web robots have higher unassigned referrer ("-") rates [23–25].
[Fig. 3. Data Set 1 test results — plot not reproduced; x-axis: Days (10–14 Mar. 04), y-axis: Type 3 (%).]
[Fig. 4. Data Set 2 test results — plot not reproduced; x-axis: Months (Apr–Oct), y-axis: Type 3 (%).]
4. According to the access logs that we analyzed, the user-agent header field of web robots may contain keywords such as bot, crawler, spider, wanderer, and robot.

As a result of the above-mentioned observations, we add some extra rules to correctly distinguish Type 2 from other visitors.

For the rest of our rule set, as indicated at Phase 3 in Fig. 1, we continue by investigating the access log files formed as a result of the vulnerability scanning mentioned in the previous section. As shown in Tables 1 and 2, our first immediate observation is that, compared to Type 2 and Type 1, Type 3's requests include different HTTP methods, such as Track, Trace, Netsparker, Pri, Propfind and Quit. Secondly, as shown in Tables 3, 4 and 5, we deduce that the status codes of Type 3 differ from those of Type 2 and Type 1. In fact, Type 3 has a higher rate of "404" requests, the average of which for Acunetix, Netsparker and W3AF is 31% in our data set. Thus, we generate a rule to check the presence of these HTTP methods and the percentage of "404" requests. The user-agent header fields of Type 3 can generally be modified and obfuscated manually at the configuration phase before a vulnerability scan. Even so, we made a list of well-known automated web vulnerability scanners and compare it against the user-agent header fields. Finally, since we notice that these scanners make more than 100 HTTP requests in a certain time, we select this value as a threshold for Type 3 detection.

The pseudo code of the proposed model is shown in Algorithm 1.
Algorithm 1. Pseudo-Code for Proposed Model. [The typeset pseudo-code did not survive text extraction.]
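As a rough, simplified illustration of the three phases described above — not the authors' actual algorithm — the per-IP classification could be sketched as follows. All patterns, keyword lists and the 31%/100-request thresholds are taken from the prose; the helper names and the exact regexes are our own assumptions:

```python
import re
from urllib.parse import unquote

# Hypothetical, simplified patterns inspired by the rules in Section 3.3.
XSS_PAT = re.compile(r"<\s*script|<\s*img[^>]+src|on(error|load|click)\s*=", re.I)
SQLI_PAT = re.compile(r"('|--|#|\bexec\b|\bunion\b|\bselect\b)", re.I)
BOT_WORDS = ("bot", "crawler", "spider", "wanderer", "robot")
SCANNER_WORDS = ("acunetix", "netsparker", "w3af")

def classify_ip(entries):
    """entries: parsed log dicts for one IP address. Returns a visitor label."""
    total = len(entries)
    n404 = sum(1 for e in entries if e["status"] == "404")
    agents = " ".join(e["agent"].lower() for e in entries)
    # Phase 1: attack payloads in the decoded path/query part of the URI.
    attack = any(XSS_PAT.search(unquote(e["request"])) or
                 SQLI_PAT.search(unquote(e["request"])) for e in entries)
    # Phase 2: crawler heuristics (robots.txt request, user-agent keywords).
    if (any(e["request"].split(" ")[1:2] == ["/robots.txt"] for e in entries)
            or any(w in agents for w in BOT_WORDS)):
        return "Type 2 (crawler)"
    # Phase 3: scanner heuristics (known scanner UA, 404 rate over the
    # 31% average at >100 requests, or attack payloads observed).
    if (any(w in agents for w in SCANNER_WORDS)
            or (total > 100 and n404 / total > 0.31)
            or attack):
        return "Type 3 (scanner)"
    return "Type 1 (regular user)"
```

The real model also checks the unusual HTTP methods, referrer rates and encoded payload variants described in the text; this sketch folds Phase 1 hits into the Type 3 verdict purely for brevity.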
4. Results

This section is based on the evaluation of our model against some important metrics. Moreover, test results of attack detection on live data are also included.

4.1. Experimental setup

Firstly, DVWA is scanned with Acunetix; secondly, DVWA is scanned for 19 min and 56 s with Netsparker; lastly, DVWA is scanned for 2 min and 6 s with W3AF. The details of the related access log files are summarized as Type 3 in Table 6. For the evaluation of the proposed model, we combine all mentioned access log files into one file that constitutes our general data set. Then, we run our Python script on this data set.

4.2. Model evaluation

Initially, to evaluate the proposed model, we compute the confusion matrix, where TP, FN, FP, and TN denote true positives, false negatives, false positives, and true negatives respectively, as shown in Table 7. We then evaluate the following measures:

  accuracy (acc)   = (TN + TP) / (TN + FN + FP + TP)
  precision (prec) = TP / (TP + FP)
  recall (rec)     = TP / (TP + FN)
  F1 score         = 2TP / (2TP + FP + FN)          (1)

As a result, our model has 99.38% accuracy, 100.00% precision, 75.00% recall and finally an 85.71% F1 score, as we can see in Table 8.

Fig. 2 illustrates the relation between the number of lines in the log files and the running time. It is clear that the running time rises steadily as the number of lines increases.

[Fig. 5. Data Set 3 test results — plot not reproduced; x-axis: Months (Jun–Feb), y-axis: Type 3 (%).]
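The measures in Eq. (1) are straightforward to compute from the confusion-matrix counts. The counts below (TP=3, FN=1, FP=0, TN=157) are our own back-calculation that happens to reproduce the reported 99.38%/100.00%/75.00%/85.71%; they are not the paper's actual Table 7:

```python
def metrics(tp, fn, fp, tn):
    """Compute the four measures from Eq. (1) given confusion-matrix counts."""
    acc = (tn + tp) / (tn + fn + fp + tp)
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return acc, prec, rec, f1

# Hypothetical counts chosen to match the rates reported in Table 8.
acc, prec, rec, f1 = metrics(tp=3, fn=1, fp=0, tn=157)
print(f"{acc:.2%} {prec:.2%} {rec:.2%} {f1:.2%}")
# prints: 99.38% 100.00% 75.00% 85.71%
```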
4.3. Scan detection on live data

[…] monthly, 15-day and 15-day periods respectively. Related details are expressed in Table 10. The Type 3 percentage of each data set is shown in Figs. 3–7.

[Fig. 7. Data Set 5 test results — plot not reproduced; x-axis: 15-Day Period (1–5), y-axis: Type 3 (%).]

5. Conclusion

In this work, we studied the detection of web vulnerability scans through the access log files of web servers, in addition to the detection of XSS and SQLI attacks. In accordance with this purpose, we used a rule-based methodology. Firstly, we examined the behavior of automated vulnerability scanners. Moreover, we implemented our model as a Python script. Afterwards, our model was evaluated on the data we collected. Finally, we tested our model on log samples from real systems.

It is clear that our method has a very high probability of detection and a low probability of false alarm. More specifically, the accuracy and precision rates of our model are 99.38% and 100.00% respectively. More importantly, malicious scans can be captured more precisely because different types of scanning tools, both open source and commercial, were examined. Therefore, our results indicate that static rules can successfully detect web vulnerability scans. Besides, we have observed that our model functions properly with larger and live data sets and correctly detects Type 3 IP addresses.

As shown in Fig. 2, the relation between the number of lines of the log files and the running time is linear. As a result, how long it would take to analyze a log file can be predicted in advance.

The results presented in this work may enhance research on malicious web scans and may support the development of attack detection studies. Also, if security analysts or administrators execute the proposed Python script several times within the same day, they could prevent most web-related attacks.

Future work considerations related to this work are twofold. In the first place, one could extend our model to analyze other log files such as the audit log and the error log. Secondly, beyond the scope of this work, other well-known web application attacks besides SQLI and XSS, such as CSRF, could be addressed too.

Appendix A. Supplementary material

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.aci.2017.04.002.

References

[1] E.M. Hutchins, M.J. Cloppert, R.M. Amin, Intelligence-driven Computer Network Defense Informed by Analysis of Adversary Campaigns and
[3] … URL <http://www.sciencedirect.com/science/article/pii/S2210832714000386>.
[4] R. Meyer, Detecting Attacks on Web Applications from Log Files. URL <https://www.sans.org/reading-room/whitepapers/logging/detecting-attacks-web-applications-log-files-2074>, 2008 (accessed December 12, 2016).
[5] D.B. Cid, Log Analysis using OSSEC. URL <http://www.academia.edu/8343225/Log_Analysis_using_OSSEC>, 2007 (accessed November 29, 2016).
[6] Wikipedia, Overfitting. URL <https://en.wikipedia.org/wiki/Overfitting>, 2016 (accessed December 27, 2016).
[7] M. Auxilia, D. Tamilselvan, Anomaly detection using negative security model in web application, in: 2010 International Conference on Computer Information Systems and Industrial Management Applications (CISIM), 2010, pp. 481–486, http://dx.doi.org/10.1109/CISIM.2010.5643461.
[8] K. Goseva-Popstojanova, G. Anastasovski, R. Pantev, Classification of malicious web sessions, in: 2012 21st International Conference on Computer Communications and Networks (ICCCN), 2012, pp. 1–9, http://dx.doi.org/10.1109/ICCCN.2012.6289291.
[9] M. Husák, P. Velan, J. Vykopal, Security monitoring of http traffic using extended flows, in: 2015 10th International Conference on Availability, Reliability and Security, 2015, pp. 258–265, http://dx.doi.org/10.1109/ARES.2015.42.
[10] M. Zolotukhin, T. Hämäläinen, T. Kokkonen, J. Siltanen, Analysis of http requests for anomaly detection of web attacks, in: 2014 IEEE 12th International Conference on Dependable, Autonomic and Secure Computing, 2014, pp. 406–411, http://dx.doi.org/10.1109/DASC.2014.79.
[11] S. Cho, S. Cha, Sad: web session anomaly detection based on parameter estimation, Comput. Secur. 23 (4) (2004) 312–319, http://dx.doi.org/10.1016/j.cose.2004.01.006.
[12] N. Singh, A. Jain, R.S. Raw, R. Raman, Detection of Web-Based Attacks by Analyzing Web Server Log Files, Springer India, New Delhi, 2014, pp. 101–109, http://dx.doi.org/10.1007/978-81-322-1665-0_10.
[13] D. Stevanovic, N. Vlajic, A. An, Detection of malicious and non-malicious website visitors using unsupervised neural network learning, Appl. Soft Comput. 13 (1) (2013) 698–708, http://dx.doi.org/10.1016/j.asoc.2012.08.028.
[14] A. Razzaq, Z. Anwar, H.F. Ahmad, K. Latif, F. Munir, Ontology for attack detection: an intelligent approach to web application security, Comput. Secur. 45 (2014) 124–146, http://dx.doi.org/10.1016/j.cose.2014.05.005.
[15] W3Techs (Q-Success DI Gelbmann GmbH), Usage Statistics and Market Share of Apache for Websites. URL <https://w3techs.com/technologies/details/ws-apache/all/all>, 2009–2017 (accessed December 12, 2016).
[16] The Apache Software Foundation, Apache Tomcat. URL <http://tomcat.apache.org> (accessed December 24, 2016).
[17] w3af.org, w3af. URL <http://w3af.org>, 2013 (accessed December 12, 2016).
[18] The Web Robots Pages, Robots Database. URL <http://www.robotstxt.org/db.html> (accessed September 4, 2016).
[19] M.C. Calzarossa, L. Massari, D. Tessera, An extensive study of web robots traffic, in: Proceedings of International Conference on Information Integration and Web-based Applications & Services, IIWAS '13, ACM, New York, NY, USA, 2013, pp. 410:410–410:417, http://dx.doi.org/10.1145/2539150.2539161.
[20] M.D. Dikaiakos, A. Stassopoulou, L. Papageorgiou, An investigation of web crawler behavior: characterization and metrics, Comput. Commun. 28 (8) (2005) 880–897, http://dx.doi.org/10.1016/j.comcom.2005.01.003.
[21] M. Dikaiakos, A. Stassopoulou, L. Papageorgiou, Characterizing Crawler Behavior from Web Server Access Logs, Springer Berlin Heidelberg, Berlin, Heidelberg, 2003, http://dx.doi.org/10.1007/978-3-540-45229-4_36.
[22] M.C. Calzarossa, L. Massari, Analysis of Web Logs: Challenges and Findings, Springer Berlin Heidelberg, Berlin, Heidelberg, 2011, pp. 227–239, http://dx.doi.org/10.1007/978-3-642-25575-5_19.
[23] D. Stevanovic, A. An, N. Vlajic, Feature evaluation for web crawler detection with data mining techniques, Expert Syst. Appl. 39 (10) (2012) 8707–8717, http://dx.doi.org/10.1016/j.eswa.2012.01.210.
[24] A.G. Lourenço, O.O. Belo, Catching web crawlers in the act, in: Proceedings of the 6th International Conference on Web Engineering, ICWE '06, ACM, New York, NY, USA, 2006, pp. 265–272, http://dx.doi.org/10.1145/1145581.1145634.
[25] D. Stevanovic, A. An, N. Vlajic, Detecting Web Crawlers from Web Server Access Logs with Data Mining Classifiers, Springer Berlin Heidelberg, Berlin, Heidelberg, 2011, pp. 483–489, http://dx.doi.org/10.1007/978-3-642-21916-0_52.
[26] A. Chuvakin, Public Security Log Sharing Site. URL <http://log-sharing.dreamhosters.com>, 2009 (accessed December 15, 2015).