
International Journal of Computer Applications (0975 – 8887)

Volume 5– No.11, August 2010

An Approach to Understand the End User Behavior through Log Analysis

Nikhil Kumar Singh, Deepak Singh Tomar, Bhola Nath Roy
Department of Computer Science and Engineering
Maulana Azad National Institute of Technology
Bhopal, India

ABSTRACT
Categorizing the end user in the web environment is a mind-numbing task. A huge amount of operational data is generated when the end user interacts with the web environment. This operational data is stored in various logs and may be a useful source for capturing end user activities. Pointing out a suspicious user in a web environment is a challenging task, and to conduct an efficient investigation in cyberspace the available logs should be correlated. In this paper a prototype system based on relational algebra is developed and implemented to build the chain of evidence. The prototype system preprocesses the real data generated in the logs and classifies suspicious users with a decision tree. Finally, various challenges in log management are presented.

Keywords
Cyber forensics; log file; correlation; decision tree; chain of evidence; cyber crime.

1. INTRODUCTION
Log files are like the black box on an aeroplane: they record the events that occur within an organization's systems and networks. Logs are composed of log entries that play a very important role in evidence gathering; each entry contains information related to a specific event that has occurred within a system or network. Log files help the cyber forensic process in probing and seizing computers, obtaining electronic evidence for criminal investigations, and maintaining computer records for the federal rules of evidence.

Cyber forensics is an analytical method for extracting information and data from a victim's computer storage media. It follows a systematic approach to finding, collecting, and preserving data that guarantees accuracy and reliability, and to presenting all evidence in a manner acceptable in a court of law. The primary goal of cyber forensics is to reduce investigation time and complexity; it is not designed to solve a crime but to narrow the investigation. Cyber forensics is applied to cyber crimes, password breaking, spamming, data recovery and analysis, tracking of user activity, forensic imaging and verification, viruses, file types (extensions), encryption, hacking, etc. [1, 2, 3].

The establishment of cyber forensics as both a science and an art is still at an early stage, and its methods of auditing, security, and law enforcement change at a rapid pace as technology evolves. Almost daily, new strategies, procedures and models are developed to help forensic professionals find electronic evidence, collect it, preserve it, and present it in a better way so that it can potentially be used in the prosecution of cyber criminals [4].

Some logs are more likely than others to record information useful in particular situations, such as attack detection, fraud and misuse. For each type of situation, some records are more likely to contain detailed information about the activities in question; other records typically contain less detail and are often only useful to corroborate the events recorded in the primary log types. For example, an intrusion detection system can record the malicious commands issued to a server from an external host; this would be a primary source of attack information. An incident handler might then examine the firewall logs for other connection attempts from the same source IP address, a secondary source of attack information [5].

The research described in this paper focuses on the nature of the event information provided in commonly available computer and other logs, and on the extent to which such event information can be correlated despite its heterogeneous nature and origins.

2. LOG FILES
Log files are excellent sources for determining the health status of a system and are used to capture the events that happen within an organization's systems and networks. Logs are a collection of log entries, and each entry contains information related to a specific event that has taken place within a system or network. Many logs within an organization contain records associated with computer security. These are generated by many sources, including operating systems on servers and workstations, networking equipment, and security software such as antivirus software, firewalls, and intrusion detection and prevention systems, as well as many other applications. Routine log analysis is beneficial for identifying security incidents, policy violations, fraudulent activity, and operational problems. Logs are also useful for performing auditing and forensic analysis, supporting internal investigations, establishing baselines, and identifying operational trends and long-term problems.

Initially, logs were used for troubleshooting problems, but nowadays they serve many functions within most organizations, such as optimizing system and network performance, recording the actions of users, and providing data useful for investigating malicious activity. Logs have evolved to contain information related to many different types of events occurring within networks and systems. Within an organization, many logs contain records related to computer security; common examples of these computer security logs are


audit logs that track user authentication attempts and security device logs that record possible attacks.

With the worldwide deployment of network servers, workstations and other computing devices, threats against networks and systems have greatly increased, and so have the number, volume, and variety of computer security logs; this growth has made computer security log management necessary. Log management is essential to ensure that computer security records are stored in sufficient detail for an appropriate period of time. Log management is the process of generating, transmitting, storing, analyzing, and disposing of computer security log data. The fundamental problem with log management is effectively balancing a limited quantity of log management resources against a continuous supply of log data. Log generation and storage can be complicated by several factors, including a high number of log sources; inconsistent log content, formats, and timestamps among sources; and increasingly large volumes of log data [5, 6]. Log management also involves protecting the confidentiality, integrity, and availability of logs. Another problem with log management is ensuring that security, system, and network administrators regularly perform effective analysis of log data.

3. TROUBLES IN LOG MANAGEMENT
In an organization, many operating systems, security programs, and other applications generate and maintain their own independent log files. This complicates log management in the following ways [5, 7].

3.1 Multiple Log Sources
Logs are found on many hosts throughout the organization, so log management has to be conducted organization-wide. In addition, a single log source can generate multiple logs; for example, an application may store authentication attempts in one log and network activity in another.

3.2 Heterogeneous Log Content
A log file captures certain pieces of information in each entry, such as client and server IP addresses, ports, date and time. For efficiency, log sources often record only the pieces of information they consider most important. This makes it difficult to relate event records across different log sources, because they may not share any common attribute (e.g., source 1 records the source IP address but not the username, and source 2 records the username but not the source IP address). Even the representation of a logged value varies with the log source; these differences may be slight, such as one date being in YYYYMMDD format and another in MMDDYYYY format, or they may be much more complex.

3.3 Inconsistent Timestamps
Usually every application that generates logs uses local timestamps, i.e. the timestamps of the host's internal clock. If the host's clock is inaccurate or not synchronized, log file analysis becomes more difficult, especially when the environment has multiple hosts. For example, timestamps may indicate that event "X" happened 2 minutes after event "Y", whereas event "X" actually happened 55 seconds before event "Y".

3.4 Multiple Log Formats
Each application that creates logs may use its own format, e.g. XML, SNMP, comma-separated or tab-separated format, or another binary format. Some logs are designed to be human-readable, whereas others are not; some use standard formats, whereas others use proprietary formats. Some logs are not stored on a single host at all but are transmitted to other hosts for processing; a common example is SNMP traps.

3.5 Log Confidentiality and Integrity
Protecting log records to maintain their integrity and confidentiality is essential and challenging. For example, logs might intentionally or unintentionally capture sensitive information such as users' passwords and the content of e-mails. This raises security and privacy concerns involving both the individuals who examine the logs and others who might be able to access the logs through authorized or unauthorized means. Logs that are secured improperly in storage or in transit may also be susceptible to intentional or inadvertent alteration and destruction. This can cause a variety of impacts, including allowing malicious activity to go unnoticed and allowing evidence to be manipulated to conceal the identity of a malicious party.

Protecting log availability is also a significant issue. Many logs have a size limit; when this limit is reached, the log might overwrite old data with new data or stop logging altogether, either of which causes a loss of log data availability. To meet data retention requirements, it is necessary to establish log archival, i.e. keeping copies of log files for a longer period of time than the original log sources can support. Because of the volume of logs, it might be appropriate in some cases to reduce the logs by filtering out log entries that do not need to be archived. The confidentiality and integrity of the archived logs also need to be protected.

4. ROLE OF EVENT LOG DATA IN EVIDENCE GATHERING
Logs are composed of log entries; each entry contains information related to a specific event that has occurred within a system or network. If a suspicious end user exploits a web form as an access point for input attacks such as cross-site scripting, SQL injection or a buffer overflow attack on a web application, this may be detected using the log file [5]. An interesting question is why event data should be logged on a given system. Essentially there are four categories of reasons.

4.1 Accountability
Log file data can be used to identify which accounts are associated with certain events, and that information can be used to identify where training and/or disciplinary actions are needed.

4.2 Rebuilding
What was happening before and during an event can be reviewed chronologically by using log file data. For this, it should be ensured that the clocks are regularly synchronized to a central source so that the date/time stamps remain in synchronization.

4.3 Intrusion Detection
Log data can be reviewed to detect unusual or unauthorized events, assuming that the correct data is being logged and reviewed. The main problem is the variation among unusual activities, e.g. login attempts outside of designated schedules, failed login attempts, port sweeps, locked accounts, network activity levels, memory utilization, key file/data access, etc.
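The format and timestamp inconsistencies described in Sections 3.2 and 3.3 are usually handled in a preprocessing pass that converts every source to one canonical representation. The following is a minimal sketch in Python; the two input formats and the per-host clock offsets are hypothetical illustrations, not formats taken from the paper's data set:

```python
from datetime import datetime, timedelta

# Hypothetical per-source date formats (Section 3.2: YYYYMMDD vs. MMDDYYYY).
SOURCE_FORMATS = {
    "firewall": "%Y%m%d %H:%M:%S",
    "webserver": "%m%d%Y %H:%M:%S",
}

# Hypothetical clock skew per host (Section 3.3), measured against a reference clock.
CLOCK_OFFSETS = {
    "firewall": timedelta(seconds=0),
    "webserver": timedelta(seconds=-55),
}

def normalize(source: str, raw_timestamp: str) -> datetime:
    """Parse a source-specific timestamp and correct its known clock skew."""
    parsed = datetime.strptime(raw_timestamp, SOURCE_FORMATS[source])
    return parsed - CLOCK_OFFSETS[source]

a = normalize("firewall", "20100815 10:30:00")
b = normalize("webserver", "08152010 10:29:05")  # skewed clock, same real instant
```

After normalization the two events compare correctly, which is a precondition for the time-based correlation used later in the paper.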


4.4 Problem Recognition
Log data can be used for problem recognition and to identify security events, for example resource utilization, investigating the causal factors of failures, and so on.

5. PROPOSED FRAMEWORK
The details of the proposed framework are as follows.

Fig. 1 Proposed Framework

5.1 Centralization of Log Files
In this step, the log files maintained by the web server and the firewall are extracted and stored in a central location. The data are transformed into a suitable format for conducting effective analysis.

5.2 Chain of Evidence Analyzer
The evidence analyzer takes the firewall log and the web log from the centralized log store. It applies rule-based correlation by URL and time, as shown in Section 6, and creates the training data set.

5.3 Decision Tree Construction
In this step a decision tree is constructed from the resultant training data set by applying a decision tree algorithm.

6. METHODOLOGY
An individual log file records activities related to a particular application and, although useful in many contexts, contains a lot of data that may not be particularly useful in evidence gathering. However, comparative analysis of different types of log files, coming from different applications running on the same host or on different hosts, can reveal useful interrelations that can be used in evidence gathering. This work deals with the comparative analysis of a firewall log file and a web server log file. Firewall log events fall into three broad categories: critical system issues (hardware failures and the like), significant authorized administrative events (rule set changes, administrator account changes), and network connection logs. The interesting information in the firewall log required for the proposed framework includes changes to the firewall policy; additions, deletions or changes to administrative accounts; and the network connection logs of the compromised system, which include dropped and rejected connections, the time/protocol/IP addresses/usernames of allowed connections, the amount of data transferred, etc. The web server log records every request made to the web server by users, along with important information about each request. For example, every time a browser requests a page, the web server automatically makes an entry in this log containing information such as the address of the computer on which the browser was running, the time at which the access was made, the transfer time of the page, the accessed page, etc. This information is very useful.

This work concentrates on correlating firewall logs and web logs coming from different applications running on the same host, as well as logs coming from different (or the same) applications running on different hosts during the same period of time. A decision tree is then built on the basis of the correlated information and helps in reaching a proper decision. During the initial preprocessing, the firewall log and the web log from the web server are read, and the client IP addresses in the firewall log whose probes cross a threshold limit within a fixed time period are determined. If such an IP address is also present in the web log and accesses the restricted area, it is suspicious to the server.

7. MODEL FOR WORK

7.1 Algebraic Representation

Symbol  Meaning
l1      Firewall log
l2      Web log
fIp     Set of client IP addresses of the firewall log
fDp     Set of destination ports of the firewall log


nIp     Normal user IP addresses
sIp     Suspicious user IP addresses
swIp    Suspicious web user IP addresses
atIp    Attacker IP addresses

* There is a many-to-many relationship between fIp and fDp.

7.2 Relational Algebra Representation

Symbol  Meaning
L1      Firewall log
L2      Web log
G       Aggregate function

7.3 Decision Tree
A decision tree is a tool used to support decisions. It uses a tree-like graph or model of decisions and their possible consequences. Decision trees are generally used where decision analysis is needed, such as in operations research, and they help in identifying a strategy for reaching a goal. Decision trees can also be used to calculate conditional probabilities.

7.4 Decision Tree Induction
A decision tree is a flow-chart-like tree structure, where each internal node denotes a test or decision on an attribute, each branch represents an outcome of the test, and each leaf node represents a class distribution. The selection of the test attribute at each node in the tree is based on the information gain of that attribute; the attribute with the highest information gain is selected as the test attribute. To calculate the information gain of an attribute, assume a set D having d data samples, and suppose the class label attribute has n distinct values defining n distinct classes Ci (for i = 1, ..., n). Then the expected information needed to classify a given sample is

    Info(D) = - SUM_{i=1..n} p_i * log2(p_i)

where p_i is the probability that an arbitrary sample belongs to class Ci. Let us assume attribute X has m distinct values, {x1, x2, x3, ..., xm}. Attribute X can be used to partition D into m subsets, {D1, D2, D3, ..., Dm}, where Dj contains those


Fig. 2 End User probing in Sample Windows Firewall log

Fig. 3 Sample Web Server Log
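The correlation described in Sections 6 and 8 — keeping every firewall client IP, then filling in its web-log attributes with left-outer-join semantics — can be sketched as follows. The log records, the field names, the restricted-zone marker, and the probe threshold of 200 (taken from the rules in Section 9.1) are illustrative assumptions, not the paper's actual data:

```python
# Hypothetical, simplified records: probe counts per client IP from the
# firewall log, and (client IP, requested URL) pairs from the web log.
firewall_probes = {"10.0.0.5": 340, "10.0.0.7": 12, "10.0.0.9": 250}
web_log = [("10.0.0.5", "/admin/config"), ("10.0.0.7", "/index.html")]

RESTRICTED_PREFIX = "/admin"   # assumed marker of the restricted zone
PROBE_THRESHOLD = 200          # threshold used in the Section 9.1 rules

def build_training_rows():
    """Left-outer-join style correlation (Section 8): every firewall client IP
    is kept; web-log attributes are 'no' when the IP has no web entry."""
    rows = []
    for ip, probe_count in firewall_probes.items():
        pages = [url for (client, url) in web_log if client == ip]
        rows.append({
            "Ip": ip,
            "probes": "yes" if probe_count >= PROBE_THRESHOLD else "no",
            "web_entry": "yes" if pages else "no",
            "restricted_zone": "yes"
                if any(u.startswith(RESTRICTED_PREFIX) for u in pages) else "no",
        })
    return rows
```

The output rows have exactly the shape of the training data set of Figure 4, minus the decision attribute that the tree later assigns.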


samples in D that have the value xj of X. The entropy, based on the partitioning by attribute X, is given by

    Entropy_X(D) = SUM_{j=1..m} (|Dj| / |D|) * Info(Dj)

where |Dj| / |D| is the ratio of the number of samples in subset Dj to the total number of samples in D. The entropy value and the purity of the subset partition are inversely proportional. The information gain is then calculated by

    Gain(X) = Info(D) - Entropy_X(D)

This procedure computes the information gain of each and every attribute, and the attribute with the highest information gain is selected as the test attribute for the given set D.

8. IMPLEMENTATION DETAILS
A tool has been created and implemented which takes the firewall log and the web log of the same time period as input and generates a decision regarding client behavior: whether the client is a normal user, a suspicious user, a suspicious web user or an attacker. There were 70,563 entries in the firewall log, covering 30 clients that accessed the server. The probe details of these 30 clients were determined and the number of different ports probed by each was calculated. Then a left outer join of these 30 firewall client IP addresses with the client IP addresses of the web log was taken, followed by a left outer join of this resultant set with the IP addresses of clients having restricted-area entries in the web log file. The final resultant set is described in Figure 4.

Figure 4 Sample Training Data Set Generated by Integrating Web Log & Firewall

The resultant set is a training data set containing 5 attributes (Ip, probes, web_entry, restricted_zone, decision), where Ip is the distinct client IP address from the firewall log; probes describes whether the respective IP crossed the probe threshold or not; web_entry describes whether the respective IP has a web entry or not; restricted_zone describes whether the respective IP has a restricted web entry or not; and decision shows the behavior of the client (NU - normal user, SU - suspicious user, SUW - suspicious user for web, AT - attacker).

9. CONSTRUCTION OF DECISION TREE
The training data set (D) consists of 4 data samples (d = 4) and 4 attributes. Initially, the equation for Info(D) is used to compute the expected information needed to classify these 4 samples. Next, the information gain of each attribute is computed, starting with the attribute probes, which has two distinct values ("yes", "no"). The expected information needed to classify the samples with probes = "yes" and with probes = "no" is computed, and from these two quantities the entropy of probes is obtained. Hence the information gain is Gain(probes) = Info(D) - Entropy_probes(D).
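The information gain computation of Section 9 can be sketched in Python. The 4-sample training set below is a hypothetical illustration shaped like Figure 4; the paper's own gains (e.g. Gain(web_entry) = 0.563) come from its actual data, which is not reproduced here:

```python
from collections import Counter
from math import log2

def info(labels):
    """Info(D) = -sum(p_i * log2(p_i)) over the class labels of D."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gain(rows, attribute, target="decision"):
    """Gain(X) = Info(D) - Entropy_X(D), splitting rows on one attribute."""
    labels = [r[target] for r in rows]
    entropy = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        entropy += (len(subset) / len(rows)) * info(subset)
    return info(labels) - entropy

# Hypothetical training rows in the shape of Figure 4.
D = [
    {"probes": "yes", "web_entry": "yes", "restricted_zone": "yes", "decision": "AT"},
    {"probes": "yes", "web_entry": "yes", "restricted_zone": "no",  "decision": "SUW"},
    {"probes": "yes", "web_entry": "no",  "restricted_zone": "no",  "decision": "SU"},
    {"probes": "no",  "web_entry": "yes", "restricted_zone": "no",  "decision": "NU"},
]
```

With four distinct classes over four samples, Info(D) is exactly 2 bits here; as it happens, the three attributes tie on gain in this toy set, whereas the paper's larger data set yields the distinct values quoted in Section 9.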


Figure 6 Sample Trained Data Set

Figure 7 Decision tree generated from Trained Data Set
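The decision tree of Figure 7 encodes the IF-THEN rules listed in Section 9.1; traced from root to leaf, they can be sketched as a small classifier (treating probes >= 200 as crossing the threshold; the function itself is an illustrative assumption, the attribute names are the paper's):

```python
def classify(probes: int, web_entry: bool, restricted_zone: bool) -> str:
    """Apply the Section 9.1 rules: probe count first, then web-log attributes."""
    if probes >= 200:
        if web_entry and restricted_zone:
            return "Attacker"
        if web_entry:
            return "Suspicious User for web"
        return "Suspicious User"
    return "Normal User"
```

Re-running such a classifier over the trained data set is exactly the validity check the paper performs against Figure 7.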


Similarly, Gain(web_entry) = 0.563 and Gain(restricted_zone) = 0.4671. Since probes has the highest information gain among all the attributes, it is selected as the test attribute. The decision tree, shown in Fig. 5, describes the client behavior. To check the validity of the decision tree, a trained data set is taken and the generated rules are executed again on it, resulting in the decision tree shown in Figure 7. This shows that the generated rules and the decision tree remain correct for all cases.

9.1 Extracting Classification Rules Based on the Decision Tree
The decision tree of Fig. 5 can be converted into classification IF-THEN rules by tracing the path from the root node to each leaf node in the tree:

IF probes >= 200, web_entry = "yes" and restricted_zone = "yes" THEN decision = "Attacker"

IF probes >= 200, web_entry = "yes" and restricted_zone = "no" THEN decision = "Suspicious User for web"

IF probes >= 200, web_entry = "no" and restricted_zone = "no" THEN decision = "Suspicious User"

IF probes < 200, web_entry = "yes" and restricted_zone = "no" THEN decision = "Normal User"

10. CONCLUSIONS
In this work the implemented system extracts evidence from different sources, relates the generated logs on the basis of relational algebra, and classifies suspicious users based on a decision tree. The implemented system encourages the web administrator to study the navigation behavior of suspicious users and assists in enforcing an effective security policy.

Future work will cover issues related to log consistency, log integrity and log rotation.

11. ACKNOWLEDGMENTS
The research presented in this paper would not have been possible without our college, MANIT, Bhopal. We wish to express our gratitude to all the people who helped turn the World-Wide Web into the useful and popular distributed hypertext it is. We also wish to thank the anonymous reviewers for their valuable suggestions.

12. REFERENCES
[1] http://www.all-about-forensic-science.com/cyber-forensics.html
[2] Gary L. Palmer, "A Road Map for Digital Forensic Research", Technical Report DTR-T0010-01, Report for the First Digital Forensic Research Workshop (DFRWS), 2001.
[3] Tamas Abraham, "Event Sequence Mining to Develop Profiles for Computer Forensic Investigation Purposes", Information Networks Division, Defence Science and Technology Organisation, Australia.
[4] http://www.cyberforensics.com
[5] Robert Rinnan, "Benefits of Centralized Log File Correlation", Master's Thesis, Master of Science in Information Security, Department of Computer Science and Media Technology, Gjøvik University College, 2005.
[6] Deepak Singh Tomar, J. L. Rana and S. C. Shrivastava, "Evidence Gathering System for Input Attacks", (IJCNS) International Journal of Computer and Network Security, Vol. 1, No. 1, October 2009.
[7] Muhammad Kamran Ahmed, Mukhtar Hussain and Asad Raza, "An Automated User Transparent Approach to log Web URLs for Forensic Analysis", Fifth International Conference on IT Security Incident Management and IT Forensics, 2009.
[8] Pavel Gladyshev, "Formalising Event Reconstruction in Digital Investigations", Ph.D. dissertation, Department of Computer Science, University College Dublin, 2004.
[9] Nabil Hammoud, "Decentralized Log Event Correlation Architecture", MEDES, Lyon, France, 2009.
[10] Tamas Abraham and Olivier de Vel, "Investigative Profiling with Computer Forensic Log Data and Association Rules", IEEE, 2002.
[11] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques.

Nikhil Kumar Singh: M.Tech (final year) in Computer Science & Engg., B.E. in Computer Science, and research scholar at Maulana Azad National Institute of Technology (MANIT), Bhopal.

Mr. Deepak Singh Tomar: M.Tech and B.E. in Computer Science & Engg.; Assistant Professor in the Computer Science & Engg. Department, MANIT, Bhopal. Fourteen years of teaching experience (PG & UG); has guided 16 M.Tech theses.

Mr. Bhola Nath Roy: M.Tech and B.E. in Computer Science & Engg.; Assistant Professor in the Computer Science & Engg. Department, MANIT, Bhopal.
