K.V.R.Swamy et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 3 (5) , 2012,4971 - 4975

Abstract – Intrusion detection involves many tools that are used to identify different types of attacks against computer systems and networks. With the development of network technologies and applications, network attacks are increasing greatly in both number and severity. Open source and commercial network intrusion detection tools are not able to predict new types of attacks based on previous attacks. Data mining is therefore one of the methods used in an IDS (Intrusion Detection System). In recent years, data mining based network intrusion detection systems have given high accuracy and good detection on different types of attacks. In this paper, the data mining algorithms C4.5 and improved C4.5 are used to detect different types of attacks with high accuracy and fewer errors.

Keywords- C4.5 Decision Tree; Improved C4.5 Decision Tree; Intrusion detection system.

I. INTRODUCTION
Nowadays, many organizations and companies use Internet services as their communication channel and marketplace to do business, as on the eBay and Amazon.com websites. Together with the growth of computer network activities, the rate of network attacks has been growing, affecting the availability, confidentiality, and integrity of critical information. Therefore a network system must use one or more security tools such as a firewall, antivirus, IDS and honeypot to protect important data from criminal enterprises.

A network system using only a firewall is not enough to protect networks from all attack types. The firewall cannot defend the network against intrusion attempts through open ports. Hence a Real-Time Intrusion Detection System (RT-IDS), shown in Fig 1, is a prevention tool that gives an alarm signal to the computer user or network administrator about antagonistic activity in an open session, by inspecting hazardous network activities [1].

Fig 1: Intrusion detection system environment

IDSs have gained acceptance as a necessary addition to every organization’s security infrastructure. Despite the documented contributions intrusion detection technologies make to system security, in many organizations one must still justify the acquisition of IDSs. We may use IDSs to prevent problem behaviors by increasing the perceived risk of discovery for those who would attack or abuse the system.

There are two general categories of attacks which intrusion detection technologies attempt to identify: anomaly detection and misuse detection. Anomaly detection identifies activities that vary from established patterns for users, or groups of users. Anomaly detection typically involves the creation of knowledge bases that contain the profiles of the monitored activities. The second general approach to intrusion detection is misuse detection. This technique involves the comparison of a user's activities with the known behaviors of attackers attempting to penetrate a system. While anomaly detection typically utilizes threshold monitoring to indicate when a certain established metric has been reached, misuse detection techniques frequently utilize a rule-based approach. When applied to misuse detection, the rules become scenarios for network attacks. The intrusion detection mechanism identifies a potential attack if a user's activities are found to be consistent with the established rules. The use of comprehensive rules is critical in the application of expert systems for intrusion detection [2].

Many methods are applied to intrusion detection, such as methods based on statistics, methods based on data mining, and methods based on machine learning. In recent years, data mining technology has been developing rapidly and has become increasingly mature. It is now gradually being applied to the intrusion detection field, and has made a number of important achievements at home and abroad. The basic principles of intrusion detection based on data mining are as follows: first, intelligently analyze and process security audit data from different data sources (such as host-based, network-based, and alarm-based), which helps the system generate intrusion rules and establish anomaly detection models by extracting regularities in the data; then, use this knowledge to discriminate new network behaviors. The main methods are classification analysis, clustering analysis, genetic algorithms, neural networks, association rule mining, sequential pattern mining, and outlier detection. Decision tree technology is an intuitive and straightforward classification method. It has a great advantage in extracting features and rules. Therefore applying decision tree technology to intrusion
detection is of great significance [3].

Locations of Intrusion Detection Systems in Networks: Usually an intrusion detection system captures data from the network and applies its rules to that data or detects anomalies in it. Depending upon the network topology, the type of intrusion activity (i.e. internal, external or both), and our security policy (what we want to protect from hackers), IDSs can be positioned at one or more places in the network. For example, if we want to detect only external intrusion activities, and we have only one router connecting to the Internet, the best place for an intrusion detection system may be just inside the router or a firewall. On the other hand, if we have multiple paths to the Internet, and we want to detect internal threats as well, we should place one IDS box in every network segment. The figure shows typical locations where an intrusion detection system can be placed.

II. RELATED WORK
In addition to the information gain measure and its improved versions, Lopez de Mantaras [4] presented a distance-based attribute selection measure. His experimental study shows that the distance-based measure is not biased toward attributes with large numbers of values, and avoids the practical issues of the gain ratio measure. Mingers [5] provides an experimental study of the relative accuracy of different attribute selection measures in the decision tree in order to overcome the bias in the tuples. Nageswara Rao, D. Rajya Lakshmi and T. Venkateswara Rao [6] proposed a robust statistical preprocessor in order to improve accuracy. The limitation of that work is that the existing C4.5 does not cope well when the dataset is large.

An expert system consists of a set of rules that encode the knowledge of a human "expert". These rules are used by the system to make conclusions about the security-related data from the intrusion detection system. Expert systems permit the incorporation of an extensive amount of human experience into a computer application that then utilizes that knowledge to identify activities that match the defined characteristics of misuse and attack. Unfortunately, expert systems require frequent updates to remain current. While expert systems offer an enhanced ability to review audit data, the required updates may be ignored or performed infrequently by the administrator. At a minimum, this leads to an expert system with reduced capabilities. At worst, this lack of maintenance will degrade the security of the entire system by causing the system's users to be misled into believing that the system is secure, even as one of the key components becomes increasingly ineffective over time.

Rule-based systems suffer from an inability to detect attack scenarios that may occur over an extended period of time. While the individual instances of suspicious activity may be detected by the system, they may not be reported if they appear to occur in isolation. Intrusion scenarios in which multiple attackers operate in concert are also difficult for these methods to detect because they do not focus on the state transitions in an attack, but instead concentrate on the occurrence of individual elements. Any division of an attack either over time or among several seemingly unrelated attackers is difficult for these methods to detect. Rule-based systems also lack flexibility in the rule-to-audit record representation. Slight variations in an attack sequence can affect the activity-rule comparison to a degree that the intrusion is not detected by the intrusion detection mechanism. While increasing the level of abstraction of the rule-base does provide a partial solution to this weakness, it also reduces the granularity of the intrusion detection device [11].

III. PROPOSED ARCHITECTURE
The following framework gives an overall description of the proposed approach. In this framework, the KDD dataset [7] is used as training data for classification. The proposed framework has the following algorithms.
1) Min Max Normalization
2) Decision Tree Algorithms.

Fig 2: Proposed Framework

A. KDD Dataset
The KDD Cup 1999 dataset was derived from the 1998 DARPA Intrusion Detection Evaluation Program prepared and managed by MIT Lincoln Laboratory. The dataset was a collection of simulated raw TCP dump data over a period of nine weeks. There are 4,898,430 labeled and 311,029 unlabeled connection records in the dataset [8]. The labeled connection records consist of 41 attributes: 7 symbolic and 34 numeric. The complete listing of the set of features in the dataset is given in Table I.
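A connection record with 41 attributes and a class label, as described for the KDD dataset, could be read along the following lines. This is only a minimal sketch: it assumes the comma-separated layout of the publicly distributed KDD Cup 1999 files (41 values followed by a label that ends in a period), assumes that columns 1 to 3 hold the string-valued protocol, service and flag attributes, and parses every other attribute as a number. The function `parse_record`, the constants and the sample line are hypothetical and are not part of the proposed framework.

```python
# Assumed 0-based positions of the string-valued attributes
# (protocol_type, service, flag); all other attributes are parsed as numbers.
SYMBOLIC_COLUMNS = {1, 2, 3}
NUM_ATTRIBUTES = 41


def parse_record(line):
    """Split one comma-separated record into (attributes, target class)."""
    parts = line.strip().rstrip(".").split(",")
    if len(parts) != NUM_ATTRIBUTES + 1:
        raise ValueError(f"expected {NUM_ATTRIBUTES + 1} fields, got {len(parts)}")
    *raw, label = parts
    attributes = [
        value if index in SYMBOLIC_COLUMNS else float(value)
        for index, value in enumerate(raw)
    ]
    # Collapse every attack label into a single 'anomaly' target class.
    target = "normal" if label == "normal" else "anomaly"
    return attributes, target


# Synthetic record in the assumed layout (not taken from the dataset).
sample = ",".join(["0", "tcp", "http", "SF"] + ["0"] * 37 + ["smurf."])
attributes, target = parse_record(sample)
print(len(attributes), attributes[1:4], target)
```

Mapping all attack labels to one class mirrors the two target classes, anomaly and normal, used by the information measure later in the paper.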
TABLE I: List of attributes in KDD dataset

B. Data Transformation
In data transformation, the data are transformed or consolidated into forms appropriate for mining. Min-max normalization performs a linear transformation on the original data. Suppose that min_A and max_A are the minimum and maximum values of an attribute A. Min-max normalization maps a value, v, of A to v' in the range [new_min_A, new_max_A] by computing

$$v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A$$

Min-max normalization preserves the relationships among the original data values. It will encounter an “out-of-bounds” error if a future input case for normalization falls outside of the original data range for A.

C. Data Preprocessing
Incomplete, noisy, and inconsistent data are commonplace properties of large real world databases and data warehouses. Incomplete data can occur for a number of reasons. Attributes of interest may not always be available. Other data may not be included simply because it was not considered important at the time of entry. Relevant data may not be recorded due to a misunderstanding, or because of equipment malfunctions. Data that were inconsistent with other recorded data may have been deleted. Furthermore, the recording of the history or modifications to the data may have been overlooked. Missing data, particularly for tuples with missing values for some attributes, may need to be inferred. There are many possible reasons for noisy data (having incorrect attribute values). Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.
Handling Missing Values: The attribute mean or standard deviation is used to fill in the missing value.

D. C4.5 Algorithm
Algorithm: Generate_decision_tree
Input: Data partition, D, which is a set of training tuples and their associated class labels. Attribute_list, the set of candidate attributes. Attribute_selection_method, a procedure to determine the splitting criterion that “best” partitions the data tuples into individual classes. This criterion consists of a splitting_attribute and, possibly, either a split point or splitting subset.
Output: a decision tree
Method:
(1) create a node N;
(2) if tuples in D are all of the same class, C then
(3) return N as a leaf node labeled with the class C;
(4) if attribute_list is empty then
(5) return N as a leaf node labeled with the majority class in D; // majority voting
(6) apply Attribute_selection_method(D, attribute_list) to find the “best” splitting_criterion;
(7) label node N with splitting_criterion;
(8) if splitting_attribute is discrete-valued and multiway splits allowed then // not restricted to binary trees
(9) attribute_list ← attribute_list − splitting_attribute; // remove splitting_attribute
(10) for each outcome j of splitting_criterion // partition the tuples and grow subtrees for each partition
(11) let Dj be the set of data tuples in D satisfying outcome j; // a partition
(12) if Dj is empty then
(13) attach a leaf labeled with the majority class in D to node N;
(14) else attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
(15) return N;

E. Improved C4.5
(1) create a node N;
(2) if tuples in D are all of the same class, C then
(3) return N as a leaf node labeled with the class C;
(4) if attribute list is empty then
(5) return N as a leaf node labeled with the majority class in D; // majority voting
(6) apply attribute selection to each attribute (L, attribute list) to find the “best” splitting criterion;
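As an illustration of the recursive tree-growing procedure used by the C4.5 listings in this section, the following Python sketch handles discrete-valued attributes with multiway splits. It is not the implementation used in this work. The function names, the nested-dict tree representation, the stored default class (which stands in for the empty-partition leaf) and the placeholder selection function `first_attribute` are assumptions made for illustration; any attribute selection method can be passed in its place.

```python
from collections import Counter


def majority_class(rows):
    """Most frequent class label among (attributes, label) pairs."""
    return Counter(label for _, label in rows).most_common(1)[0][0]


def generate_decision_tree(rows, attribute_list, select_attribute):
    """Return a class label (leaf) or a dict describing an internal node."""
    labels = {label for _, label in rows}
    if len(labels) == 1:                 # all tuples belong to one class
        return labels.pop()
    if not attribute_list:               # no attributes left: majority voting
        return majority_class(rows)
    best = select_attribute(rows, attribute_list)
    remaining = [a for a in attribute_list if a != best]  # remove splitting attribute
    node = {"attribute": best, "branches": {}, "default": majority_class(rows)}
    for value in sorted({attrs[best] for attrs, _ in rows}):
        partition = [(attrs, label) for attrs, label in rows if attrs[best] == value]
        node["branches"][value] = generate_decision_tree(
            partition, remaining, select_attribute
        )
    return node


def classify(tree, attrs):
    """Follow branches until a leaf; unseen values fall back to the node default."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(attrs[tree["attribute"]], tree["default"])
    return tree


def first_attribute(rows, attribute_list):
    """Placeholder selection method: simply takes the first remaining attribute."""
    return attribute_list[0]


# Toy training tuples, invented for this example.
rows = [
    ({"protocol_type": "tcp", "flag": "SF"}, "normal"),
    ({"protocol_type": "tcp", "flag": "S0"}, "anomaly"),
    ({"protocol_type": "udp", "flag": "SF"}, "normal"),
    ({"protocol_type": "icmp", "flag": "SF"}, "anomaly"),
]
tree = generate_decision_tree(rows, ["protocol_type", "flag"], first_attribute)
print(classify(tree, {"protocol_type": "tcp", "flag": "S0"}))
```

Keeping the selection method as a parameter means the existing and improved variants can share the same recursion and differ only in how the splitting attribute is chosen.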
Gain measures how well a given attribute separates training examples into targeted classes. The attribute with the highest information gain is selected. Given a collection D of m outcomes (classes), the expected information needed to classify a tuple in D is given by

$$Info(D) = -\sum_{i=1}^{m} p_i \log_2 p_i$$

where p_i is the proportion of tuples in D that belong to class i.

Modified information or entropy is given as

$$ModInfo(D) = -\sum_{i=1}^{m} S_i \log S_i$$

for m different classes. With two classes,

$$ModInfo(D) = -\sum_{i=1}^{2} S_i \log S_i = -S_1 \log S_1 - S_2 \log S_2$$

where S_1 indicates the set of samples which belong to target class ‘anomaly’ and S_2 indicates the set of samples which belong to target class ‘normal’.

The information or entropy of each attribute is calculated using

$$Info_A(D) = \sum_{i=1}^{v} \frac{|D_i|}{|D|} \times ModInfo(D_i)$$

The common practice in intrusion detection of claiming good performance with “live data” makes it difficult to verify and improve previous research results, as the traffic is never quantified or released for privacy concerns. As our test dataset, the KDD99 dataset contains one type of normal data and 24 different types of attacks. NetBeans is used for the implementation. The input is the KDD dataset, of which about 10% is used.
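The information measures defined above can be sketched in Python as follows. This is a hedged illustration rather than the implementation used in the experiments: it assumes that S_i is the fraction of samples in class i and that the logarithm is taken to base 2, neither of which is stated explicitly, and the names `mod_info` and `info_attribute` as well as the toy data are hypothetical.

```python
import math
from collections import Counter


def mod_info(labels):
    """ModInfo(D) = -sum_i S_i * log2(S_i), with S_i the class fraction (assumed)."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def info_attribute(values, labels):
    """Info_A(D) = sum_i |D_i| / |D| * ModInfo(D_i) over the partitions induced by A."""
    partitions = {}
    for value, label in zip(values, labels):
        partitions.setdefault(value, []).append(label)
    total = len(labels)
    return sum(len(part) / total * mod_info(part) for part in partitions.values())


# Toy data, invented for this example: four tuples, two per class.
labels = ["normal", "normal", "anomaly", "anomaly"]
flag = ["SF", "SF", "S0", "S0"]               # separates the classes perfectly
protocol = ["tcp", "udp", "tcp", "udp"]       # carries no class information

print(mod_info(labels))
print(info_attribute(flag, labels), info_attribute(protocol, labels))
```

Under these assumptions the attribute with the smallest Info_A(D), here `flag`, leaves the least uncertainty after the split and is therefore the one with the highest gain.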
Fig 5: Improved C4.5 decision tree result

The following results give the performance of the existing and improved C4.5 on the 10% KDD dataset with 5291 instances:

TABLE II: Improved C4.5 performance on 10% KDD dataset

PROPERTY                            EXISTING C4.5     IMPROVED C4.5
Correctly Classified Instances      5067 (95.76%)     5119 (96.75%)
Incorrectly Classified Instances    224 (4.23%)       172 (3.25%)

CONCLUSION
Experimental results show that the existing C4.5 decision tree gives 95.7 percent accuracy for detecting attacks, while the proposed decision tree gives better attack classification results than the existing C4.5 technique. The proposed algorithm gives 96.9 percent accuracy for detecting attacks, with lower false positive and true negative rates. Data mining algorithms require an offline training phase, but the testing phase requires much less time, and future work could investigate how well the approach can be adapted to online operation.

REFERENCES
[1] P. Sangkatsanee, N. Wattanapongsakorn, and C. Charnsripinyo, “Real-time Intrusion Detection and Classification”.
[2] A. O. Ali, A. I. Saleh, and T. R. Badawy, “Intelligent Adaptive Intrusion Detection Systems Using Neural Networks (Comparative Study)”.
[3] J. Wang, Q. Yang, and D. Ren, “An intrusion detection algorithm based on decision tree technology”.
[4] R. L. de Mantaras, “A distance-based attribute selection measure for decision tree induction”, Machine Learning, 6:81–92, 1991.
[5] J. Mingers, “An empirical comparison of selection measures for decision-tree induction”, Machine Learning, 3:319–342, 1989.
[6] Nageswararao, D. Rajya Lakshmi, and T. Venkateswara Rao, “Robust Statistical Outlier based Feature Selection Technique for Network Intrusion Detection”, IJSCE, 2012.
[7] M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani, “A Detailed Analysis of the KDD CUP 99 Data Set”, IEEE, 2009.
[8] www.cs.waikato.ac.nz/ml/weka
[9] J. R. Quinlan, “C4.5: Programs for Machine Learning”, Morgan Kaufmann Publishers, 1993.
[10] Z.-S. Pan, S.-C. Chen, G.-B. Hu, and D.-Q. Zhang, “Hybrid Neural Network and C4.5 for Misuse Detection”, Proceedings of the Second International Conference on Machine Learning and Cybernetics, Xi'an, 2-5 November 2003.
[11] C. Kruegel, D. Mutz, W. Robertson, and F. Valeur, “Bayesian event classification for intrusion detection”, in Proc. of the 19th Annual Computer Security Applications Conference, Las Vegas, NV, 2003.