UNIT 1 - Lecture 1 - Introduction To Data Mining
(8 credits)
Lecture: Tuesday 12:00 – 13:55
Instructor: Dr. Kennedy
E-mail: [email protected]
Room: B307
Tutorial: Monday 15:00 – 15:55
General Course Information
Course Units
Statistics
Statistics provides a number of methods for analyzing large
quantities of numerical data.
The statistical tools used in data mining include regression
analysis, cluster analysis, correlation analysis and Bayesian
networks. Statistical models are usually built from a training data set.
Correlation analysis identifies how strongly variables are related to
each other. A Bayesian network is a directed graph that represents
causal relationships among the data, derived using Bayes'
theorem.
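As a minimal sketch of correlation analysis, the Pearson correlation coefficient of two variables can be computed as below. The function name and the sample data (advertising spend and units sold) are assumptions made for illustration only.

```python
from math import sqrt

def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, without the common 1/n factor (it cancels out)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)

# Hypothetical data: advertising spend vs. units sold
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
units_sold = [2.0, 4.0, 6.0, 8.0, 10.0]
r = pearson_correlation(ad_spend, units_sold)
print(f"correlation = {r:.2f}")
```

A value close to +1 or -1 indicates a strong linear relationship; a value close to 0 indicates little or no linear relationship.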
Data Mining Techniques
Machine Learning
Machine learning investigates how computers can learn (or
improve their performance) based on data.
A computer program needs to automatically learn to recognize
complex patterns and make intelligent decisions based on data.
A typical machine learning problem is to program a computer so that
it can automatically recognize handwritten postal codes on mail after
learning from a set of examples.
Machine learning is used to build new models and to search for the
model that best matches the test data.
Data mining uses a number of machine learning methods including
inductive concept learning, conceptual clustering and decision tree
induction.
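The idea of learning from a set of examples can be sketched with a nearest-neighbour classifier. This is a swapped-in technique chosen for brevity, not one of the methods listed above, and the 3x3 "digit" bitmaps are hypothetical toy data rather than real handwriting.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length bitmaps differ."""
    return sum(pixel_a != pixel_b for pixel_a, pixel_b in zip(a, b))

def classify(examples, bitmap):
    """Return the label of the labelled example closest to the bitmap."""
    nearest_bitmap, nearest_label = min(
        examples, key=lambda example: hamming_distance(example[0], bitmap)
    )
    return nearest_label

# Hypothetical training examples: flattened 3x3 bitmaps with their labels
examples = [
    ([1, 1, 1,
      1, 0, 1,
      1, 1, 1], "0"),
    ([0, 1, 0,
      0, 1, 0,
      0, 1, 0], "1"),
]

# Unseen inputs: each differs from a training example by one pixel
noisy_one = [0, 1, 0, 0, 1, 0, 0, 1, 1]
noisy_zero = [1, 1, 1, 1, 0, 1, 1, 1, 0]
print(classify(examples, noisy_one), classify(examples, noisy_zero))
```

The program is never told the rule that separates the two digits; it generalizes from the stored examples.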
Neural Networks
A neural network is a set of connected nodes called neurons. A
neuron is a computing device that computes some function of
its inputs, and those inputs can themselves be the outputs of other
neurons.
A neural network can be trained to find the relationship between
the input attributes and an output attribute by adjusting the
connections and the parameters of the nodes.
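A minimal sketch of this training idea uses a single neuron (a perceptron) that learns the logical AND of two inputs by adjusting its weights and bias. The function names, the step activation and the integer learning rate are assumptions chosen to keep the example small; real networks have many neurons and use other training rules.

```python
def predict(weights, bias, inputs):
    """A single neuron: weighted sum of the inputs followed by a step function."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

def train(examples, epochs=20, learning_rate=1):
    """Adjust weights and bias whenever the neuron's output is wrong."""
    weights = [0] * len(examples[0][0])
    bias = 0
    for _ in range(epochs):
        for inputs, target in examples:
            error = target - predict(weights, bias, inputs)
            # Move each weight in the direction that reduces the error
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias

# Training data: input attributes and the expected output (logical AND)
training_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train(training_data)
predictions = [predict(weights, bias, inputs) for inputs, _ in training_data]
print(predictions)
```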
Sales/Marketing
Data mining is used for market basket analysis to provide
information on which product combinations were purchased together,
when they were bought and in what sequence.
This information helps businesses promote their most profitable
products and maximize profit.
In addition, it encourages customers to purchase related products
that they may have missed or overlooked.
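Market basket analysis can be sketched by counting how often pairs of products appear in the same transaction. The transactions and function names below are hypothetical; support and confidence are the standard measures used for such pairs.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: the set of products in each basket
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of products."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

def confidence(baskets, antecedent, consequent):
    """Fraction of baskets containing the antecedent that also contain the consequent."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)

counts = pair_counts(transactions)
# Support: share of all baskets that contain both bread and milk
support_bread_milk = counts[("bread", "milk")] / len(transactions)
# Confidence of the rule "butter -> bread"
conf_butter_bread = confidence(transactions, "butter", "bread")
print(support_bread_milk, conf_butter_bread)
```

Pairs with high support and confidence are candidates for joint promotion or for recommending one product to buyers of the other.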
Applications
Banking / Finance
Data mining is used to identify customer loyalty by analysing
data on customers' purchasing activities, such as the frequency
of purchases in a period of time, the total monetary value of
all purchases and the date of the last purchase.
After analysing those dimensions, a relative measure is
generated for each customer. The higher the score, the more
loyal the customer is relative to others.
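The three dimensions above (last purchase, frequency and monetary value) can be combined into a relative score in many ways. The sketch below assumes one simple scheme: rank the customers on each dimension and add up the ranks. The customer data, the reference date and the scoring rule are all hypothetical.

```python
from datetime import date

# Hypothetical purchase histories: (purchase date, amount) per customer
purchases = {
    "alice": [(date(2024, 1, 5), 40.0), (date(2024, 2, 10), 60.0),
              (date(2024, 3, 1), 50.0)],
    "bob": [(date(2023, 11, 20), 200.0)],
}
today = date(2024, 3, 31)  # assumed reference date

def rfm(history, as_of):
    """Return (days since last purchase, number of purchases, total spent)."""
    recency_days = (as_of - max(day for day, _ in history)).days
    frequency = len(history)
    monetary = sum(amount for _, amount in history)
    return recency_days, frequency, monetary

def loyalty_scores(all_rfm):
    """Sum of per-dimension ranks; the best customer on a dimension gets the most points."""
    scores = {name: 0 for name in all_rfm}
    # (dimension index, whether a larger value is better)
    for index, larger_is_better in ((0, False), (1, True), (2, True)):
        # Order from worst to best; ties keep their original order
        ordered = sorted(all_rfm, key=lambda name: all_rfm[name][index],
                         reverse=not larger_is_better)
        for points, name in enumerate(ordered, start=1):
            scores[name] += points
    return scores

all_rfm = {name: rfm(history, today) for name, history in purchases.items()}
scores = loyalty_scores(all_rfm)
print(all_rfm, scores)
```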
Data mining is also applied to help banks retain credit card
customers. By analysing past data, data mining can help banks
predict which customers are likely to change their credit card
affiliation, so that they can plan and launch special offers to retain
those customers.
Transportation
Data mining helps determine the distribution schedules among
warehouses and outlets and analyze loading patterns.
Medicine
Data mining makes it possible to characterize patient activities in
order to anticipate upcoming office visits.
Data mining helps identify the patterns of successful medical
therapies for different illnesses.
Challenges in Data Mining
Distributed Data
Real-world data is usually stored on different platforms in distributed
computing environments.
It could be in databases, on individual systems, or even on the Internet.
In practice it is very difficult to bring all the data into a centralized
data repository, mainly for organizational and technical reasons.
For example, different regional offices might have their own
servers to store their data, and it would not be feasible to store all
the data (millions of terabytes) from all the offices on a central server.
Data mining therefore demands the development of tools and
algorithms that enable the mining of distributed data.
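One common way to mine distributed data without centralizing it is to compute a small summary at each site and merge only the summaries. The sketch below assumes two hypothetical regional offices and uses item counts as the summary; the function names are illustrative.

```python
from collections import Counter

def local_summary(records):
    """Runs at one site: reduce the raw records to item counts."""
    return Counter(records)

def merge_summaries(summaries):
    """Runs centrally: combine the per-site counts into global counts."""
    total = Counter()
    for summary in summaries:
        total.update(summary)
    return total

# Hypothetical raw data that stays on each regional server
site_a = ["laptop", "phone", "phone"]
site_b = ["phone", "tablet"]

# Only the compact summaries travel over the network
global_counts = merge_summaries([local_summary(site_a), local_summary(site_b)])
print(global_counts.most_common(1))
```

The raw records never leave their site; only the much smaller count tables are transferred.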
Complex Data
Real-world data is highly heterogeneous. It may include multimedia
data such as images, audio and video, as well as temporal data,
spatial data, time series, natural language text and so on.
It is difficult to handle these different kinds of data and extract
the required information.
Often, new tools and methodologies have to be developed to
extract the relevant information.
Performance
The performance of the data mining system mainly depends on the
efficiency of algorithms and techniques used.
If the algorithms and techniques are poorly designed, the
performance of the data mining process will suffer.
Data Visualization
Data visualization is a very important process in data mining
because it is the main process that presents the output to the user.
The visualization should convey exactly the meaning of the
information that was extracted.
In many cases, however, it is difficult to represent the information to
the end user in a way that is both accurate and easy to understand.
Because both the input data and the output information are complex,
very effective data visualization techniques need to be applied.
Trends that Affect Data Mining
Data Trends
Perhaps the most fundamental external trend is the explosion of
digital data during the past two decades.
During this period the amount of data has probably grown
enormously.
Much of this data is accessible via networks.
On the other hand, during the same period the number of scientists,
engineers, and other analysts available to analyze this data has
remained relatively constant.
Only one conclusion is possible: either most of the data is destined
to be write-only; or techniques, such as data mining, must be
developed, which can automate, in part, the analysis of this data,
filter irrelevant information, and extract meaningful knowledge.
Hardware Trends
Data mining requires numerically and statistically intensive
computations on large datasets.
The increasing memory and processing speed of workstations
enable current algorithms and techniques to mine datasets
that were too large to be mined just a few years ago.
In addition, the commoditization of high-performance computing
through SMP workstations and high-performance workstation
clusters makes it possible to attack data mining problems that, a few
years ago, were accessible only with the largest supercomputers.
Network Trends
The next generation Internet (NGI) will connect sites at OC-3
(155 Mbit/s) speeds and higher.
Business Trends
Today businesses must be more profitable, react quicker, and offer
higher quality services than ever before, and do it all using fewer
people and at lower cost.
Data Mining:
Analysing databases or data warehouses to discover patterns
in the data and gain knowledge.
Knowledge is power.
Architecture of Data Mining
2. Loose Coupling:
In this architecture, the data mining system uses the database or
data warehouse for data retrieval. The data mining system retrieves
data from the database or data warehouse, processes it using data
mining algorithms and stores the results back in those systems.
This architecture is mainly suited to memory-based data mining
systems that do not require high scalability and high performance.
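The retrieve, process and store cycle of loose coupling can be sketched with Python's built-in sqlite3 module standing in for the database. The table names, columns and sample rows are hypothetical, and the "mining" step is reduced to a simple in-memory total per product.

```python
import sqlite3

# An in-memory database stands in for the database or data warehouse
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("tea", 3.0), ("coffee", 5.0), ("tea", 4.0)],
)

# 1. Retrieve: the database is used only to fetch the raw rows
rows = conn.execute("SELECT product, amount FROM sales").fetchall()

# 2. Process: all computation happens in the mining system's own memory
totals = {}
for product, amount in rows:
    totals[product] = totals.get(product, 0.0) + amount

# 3. Store: the results are written back to the database
conn.execute("CREATE TABLE mining_results (product TEXT PRIMARY KEY, total REAL)")
conn.executemany("INSERT INTO mining_results VALUES (?, ?)", totals.items())
conn.commit()

stored = dict(conn.execute("SELECT product, total FROM mining_results").fetchall())
print(stored)
```

Because every row is pulled into memory before processing, this approach works for small data sets but does not scale well.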
3. Semi-tight Coupling:
In the semi-tight coupling architecture, besides linking to the
database or data warehouse system, the data mining system uses
several features of the database or data warehouse system to
perform some data mining tasks, including sorting, indexing and
aggregation.
In this architecture, some intermediate results can be stored in
the database or data warehouse system for better performance.
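In contrast to retrieving raw rows, a semi-tightly coupled system lets the database do the sorting, indexing and aggregation. The sketch below again uses sqlite3 as a stand-in; the table, index and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("tea", 3.0), ("coffee", 5.0), ("tea", 4.0)],
)

# Indexing: performed by the database system
conn.execute("CREATE INDEX idx_sales_product ON sales (product)")

# Aggregation: performed by the database, and the intermediate result
# is stored in the database for later mining steps
conn.execute(
    "CREATE TABLE product_totals AS "
    "SELECT product, SUM(amount) AS total, COUNT(*) AS purchases "
    "FROM sales GROUP BY product"
)

# Sorting: also performed by the database
ranked = conn.execute(
    "SELECT product, total FROM product_totals ORDER BY total DESC"
).fetchall()
best_seller = ranked[0][0]
print(ranked)
```

The mining system receives only the small, already aggregated and sorted result instead of every raw row.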
4. Tight Coupling:
In the tight coupling architecture, the database or data
warehouse is treated as an information retrieval component of
the data mining system, using integration.
All the features of the database or data warehouse are used to
perform data mining tasks. This architecture provides system
scalability, high performance, and integrated information.
The front-end layer provides an intuitive and friendly user interface
through which the end user interacts with the data mining system.
Data mining results are presented to the user in visual form in this layer.
User interface
This module communicates between users and the data
mining system. It allows the user to interact with the system by
specifying a data mining query or task, providing information to
help focus the search, and performing exploratory data mining
based on the intermediate data mining results.
In addition, this component allows the user to browse
database and data warehouse schemas or data structures,
evaluate mined patterns, and visualize the patterns in different
forms.
How is a data warehouse different from a database?
How are they similar?
Data mining: on what kind of data?
Relational databases
Data warehouses
Transactional databases
Advanced database systems
Object-oriented databases
Object-relational databases
Flat files and the World Wide Web