Overview of Data Mining
Volume 4 Issue 4, June 2020 Available Online: www.ijtsrd.com e-ISSN: 2456 – 6470
@ IJTSRD | Unique Paper ID – IJTSRD31368 | Volume – 4 | Issue – 4 | May-June 2020 Page 1333
International Journal of Trend in Scientific Research and Development (IJTSRD) @ www.ijtsrd.com eISSN: 2456-6470
and implementing customer-focused strategies. To maintain a proper relationship with a customer, a business needs to collect data and information. This is where data mining plays its part. With data mining technologies, the collected data can be used for analysis. Instead of being unsure where to focus to retain customers, those seeking a solution get filtered results.

E. Fraud Detection
Billions of dollars have been lost to fraud. Traditional methods of fraud detection are complex. Data mining aids in providing meaningful patterns and turning data into information. Any information that is valid and useful is knowledge. A perfect fraud detection system should protect the information of all users. A supervised method starts with the collection of sample records, which are classified as fraudulent or non-fraudulent. A model is built from this data, and the algorithm is then used to identify whether a record is fraudulent or not.

F. Intrusion Detection
Any action that compromises the integrity and confidentiality of a resource is an intrusion. The defensive measures to avoid an intrusion include user authentication, avoidance of programming errors, and information protection. Data mining can help improve intrusion detection by adding a level of focus to anomaly detection. It helps an analyst to distinguish an anomalous activity from common everyday network activity. Data mining also helps extract data that is more relevant to the problem.

A. Data Cleaning
Data cleaning is the first step in data mining. It is important because dirty data, if used directly in mining, can cause confusion in procedures and produce inaccurate results. This step involves the removal of noisy or incomplete data from the collection. Many methods that clean data automatically are available, but they are not robust.

B. Data Integration
When multiple heterogeneous data sources such as databases, data cubes or files are combined for analysis, the process is called data integration. This can help improve the accuracy and speed of the data mining process. Different databases have different naming conventions for variables, thereby causing redundancies in the databases. Additional data cleaning can be performed to remove the redundancies and inconsistencies introduced by data integration without affecting the reliability of the data.

C. Data Reduction
This technique is applied to obtain the data relevant for analysis from the collection of data. The reduced representation is much smaller in volume while maintaining integrity. Data reduction is performed using methods such as Naive Bayes, Decision Trees, Neural Networks, etc.

D. Data Transformation
In this process, data is transformed into a form suitable for the data mining process. Data is consolidated so that the mining process is more efficient and the patterns are easier to understand. Data transformation involves data mapping and a code generation process.

E. Data Mining
Data mining is a process to identify interesting patterns and knowledge from a large amount of data. In this step, intelligent methods are applied to extract the data patterns. The data is represented in the form of patterns, and models are structured using classification and clustering techniques.

F. Pattern Evaluation
This step involves identifying the interesting patterns that represent knowledge, based on interestingness measures. Visualization methods are used to make the data understandable to the user.

G. Knowledge Representation
Knowledge representation is the step where data visualization and knowledge representation tools are used to present the mined data. Data is visualized in the form of reports, tables, etc.

III. TYPES OF DATA MINED
A. Flat files:
Flat files are data files in text or binary form with a structure that can be easily extracted by data mining algorithms. Data stored in flat files has no relationships or paths within it; for example, if a relational database is stored in flat files, there will be no relations between the tables. Flat files are described by a data dictionary.

B. Relational Database:
A relational database is a collection of data organized in tables with rows and columns. The physical schema of a relational database defines the structure of the tables. The logical schema defines the relationships among the tables.

C. Data Warehouses:
A data warehouse is a collection of data integrated from multiple sources. There are three types of data warehouse: Enterprise, Data Mart and Virtual Warehouse. Two approaches can be used to update data in a data warehouse: the Query-driven Approach and the Update-driven Approach.
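As a rough illustration of how data from several sources can be cleaned, integrated and loaded into a single relational store, the following minimal Python sketch may help. It is not taken from any particular warehouse product; the two source systems, their field names and the records are hypothetical, and SQLite (from the Python standard library) merely stands in for a real warehouse.

```python
import sqlite3

# Two hypothetical source systems that name the same fields differently.
CRM_ROWS = [
    {"cust_id": 1, "full_name": "Ann Lee", "city": "Pune"},
    {"cust_id": 2, "full_name": "Raj Rao", "city": None},   # incomplete record
]
BILLING_ROWS = [
    {"customer": 1, "name": "Ann Lee", "town": "Pune"},     # duplicate of the CRM record
    {"customer": 3, "name": "Mei Chen", "town": "Mumbai"},
]

def integrate(crm_rows, billing_rows):
    """Map both sources onto one schema, drop incomplete rows, remove duplicates."""
    unified = [(r["cust_id"], r["full_name"], r["city"]) for r in crm_rows]
    unified += [(r["customer"], r["name"], r["town"]) for r in billing_rows]
    # Cleaning: discard any record with a missing value.
    complete = [row for row in unified if all(v is not None for v in row)]
    # Integration: identical records from different sources collapse into one.
    return sorted(set(complete))

def load_warehouse(rows):
    """Load the integrated rows into an in-memory relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn

rows = integrate(CRM_ROWS, BILLING_ROWS)
conn = load_warehouse(rows)
count = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
print(rows, count)
```

Dropping incomplete records is only one possible cleaning policy; a real pipeline might instead fill in missing values or flag them for review.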
D. Transactional Databases:
A transactional database is a collection of data organized by time stamps, dates, etc. to represent transactions. This type of database has the capability to roll back or undo its operation when a transaction is not completed or committed. It is a highly flexible system where users can modify information without changing any sensitive information.

E. Multimedia databases:
Multimedia databases consist of audio, video, image and text media. They can be stored in object-oriented databases. They hold complex information.

F. Spatial Databases:
Spatial databases store geographical information. They can store data in the form of coordinates, topology, lines, polygons, etc.

G. Time Series Databases:
Time series databases contain data such as stock exchange data and user-logged activities. They handle arrays of numbers indexed by time, date, etc. Such data requires real-time analysis.

H. WWW:
WWW refers to the World Wide Web, which is a collection of documents and resources such as audio, video, text, etc. that are identified by Uniform Resource Locators (URLs), linked by HTML pages, and accessible through web browsers via the Internet. It is the most heterogeneous repository, as it collects data from multiple resources. It is dynamic in nature, as the volume of data is continuously increasing and changing.

IV. DATA MINING TECHNIQUES
Data mining is highly effective, and some of the techniques used for data mining are as follows:

A. CLASSIFICATION ANALYSIS
This analysis is used to retrieve important and relevant information about data and metadata. It is used to classify different data into different classes. Classification is similar to clustering in that it also segments data records into different segments, called classes. Unlike in clustering, however, the data analysts know the different classes in advance. So, in classification analysis you apply algorithms to decide how new data should be classified.

B. ASSOCIATION RULE LEARNING
This refers to the method that can help you identify interesting relations (dependency modeling) between different variables in large databases. This technique can help you unpack hidden patterns in the data that can be used to identify variables within the data and the co-occurrence of different variables that appear very frequently in the data. Association rules are useful for examining and forecasting customer behavior. The technique is highly recommended for retail industry analysis. It is used for shopping basket data analysis, product clustering, catalog design and store layout. In IT, programmers use association rules to build programs capable of machine learning.

C. ANOMALY OR OUTLIER DETECTION
This refers to the observation of data items in a data set that do not match an expected pattern or an expected behavior. Anomalies are also known as outliers, novelties, noise, deviations and exceptions. Often they provide critical and actionable information. An anomaly is an item that deviates considerably from the common average within a data set or a combination of data. Such items are statistically distant from the rest of the data and hence indicate that something out of the ordinary has happened that requires additional attention. This technique can be used in a variety of domains, such as intrusion detection, system health monitoring, fraud detection, fault detection, event detection in sensor networks, and detecting disturbances. Analysts often remove the anomalous data from the data set to discover results with increased accuracy.

D. CLUSTERING ANALYSIS
A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar or unrelated to the objects in other groups or clusters. Clustering analysis is the process of discovering groups and clusters in the data in such a way that the degree of association between two objects is highest if they belong to the same group and lowest otherwise. The result of this analysis can be used to create customer profiles.

E. REGRESSION ANALYSIS
In statistical terms, regression analysis is the process of identifying and analyzing the relationship among variables. It can help you understand how the value of the dependent variable changes when any one of the independent variables is varied. This means that one variable is dependent on another, but not vice versa. Regression analysis is generally used for prediction and forecasting.

V. BENEFITS AND DISADVANTAGES OF DATA MINING
There are several benefits and advantages of data mining systems. Some of them are as follows:
• One of the common benefits of data mining systems is that they can be helpful in predicting future trends. This is possible with the help of technology and the behavioral changes adopted by people.
• Data mining helps organizations to make profitable adjustments in operation and production.
• Data mining is a cost-effective and efficient solution compared to other statistical data applications.
• Much of the data mining process relies on information gathered with the help of marketing analysis. With the help of such marketing analysis, one can also find fraudulent acts and products available in the market. Moreover, it helps one understand the importance of accurate information.
• It can be implemented in new systems as well as on existing platforms. It is a speedy process that makes it easy for users to analyze huge amounts of data in less time.

Data mining is a process in which all the factors of mining are involved precisely. While working with these mining systems, one can come across several disadvantages of data mining. They are as follows:
• There is a chance that companies may sell useful information about their customers to other companies for money.
• Much data mining analytics software is difficult to operate and requires advanced training to work with.
• Different data mining tools work in different manners due to the different algorithms employed in their design. Therefore, the selection of the correct data mining tool is a very difficult task.
• Data mining techniques are not perfectly accurate, and so they can cause serious consequences in certain conditions.

VI. CONCLUSION
Data mining is an iterative process where the mining process can be refined and new data can be integrated to get more efficient results. Data mining meets the requirement of effective and flexible data analysis. It can be considered a natural evolution of information technology. As a knowledge discovery process, data preparation and data mining tasks complete the data mining process. Data mining processes can be performed on any kind of data discussed in the sections above. Finally, the bottom line is that all the techniques help in the discovery of new creative things. At the end of this paper about data mining, one can clearly understand the areas of application, types of source data, process, techniques, and benefits, along with the limitations. Therefore, after reading all the above-mentioned information about data mining, one can determine its credibility and feasibility even better.
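As a small illustration of two of the techniques surveyed in Section IV, anomaly detection and regression analysis, consider the following minimal Python sketch. It is only one simple way to realize each idea, under stated assumptions: anomalies are flagged with a z-score rule (the threshold of 2.0 is an arbitrary choice), the regression is a one-variable ordinary least squares fit, and the function names and sample figures are hypothetical.

```python
from statistics import mean, pstdev

def find_anomalies(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical daily login counts; the last day deviates sharply from the rest.
daily_logins = [10, 12, 11, 9, 10, 11, 95]
print(find_anomalies(daily_logins))  # [95]

# Hypothetical points lying exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```

The fitted line can then be used for prediction by evaluating slope * x + intercept for a new value of the independent variable x.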