UNIT 1 - Lecture 1 - Introduction To Data Mining

This course covers data mining and warehousing concepts over 8 credits. It is taught on Tuesdays and has tutorials on Mondays. Students will have two tests worth 40% and a university exam worth 60%.

Uploaded by

Nyenesya Mwakilasa


IS 368: Data Mining and Warehousing
(8 credits)

Instructor: Dr. Kennedy
E-mail: [email protected]
Lecture: Tuesday 12:00 – 13:55, Room: B307
Tutorial: Monday 15:00 – 15:55
General Course Information

• This course has a total of 8 credits (2 hrs lecture and 1 hr tutorial)
• Lecture time: Tuesday (09:00 – 10:55)
• Tutorial: Monday (15:00 – 15:55)
• Coursework carries 40% of the marks
  • Test one: 20% (7th week)
  • Test two: 20% (12th week)
• University Examination carries 60%

Course Units

Unit I: Data mining

Unit II: Knowledge representation and discovery

Unit III: Data mining methods such as rule-based learning

Unit IV: Data mining methods

Unit V: Decision tree

Unit VI: Association rules and sequence mining

Unit VII: Scientific and industrial applications of knowledge discovery in databases
Unit I: Data mining

• What is data mining?

• Data mining refers to extracting or "mining" knowledge from large amounts of data.
• It is the computational process of discovering useful patterns in large data sets, using methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.
• The data sources can include databases, data warehouses, the Web, other information repositories, or data that are streamed into the system dynamically.
• Some related terms:
  • Knowledge mining from databases
  • Knowledge extraction
  • Data/pattern analysis
Unit I: Data mining

• What is data mining?

• The key properties of data mining are:
  • Automatic discovery of patterns
  • Prediction of likely outcomes
  • Creation of actionable information
  • Focus on large datasets and databases
Unit I: Data mining

• Data mining is the core part of the Knowledge Discovery in Databases (KDD) process.
• Knowledge discovery as a process is depicted in the following figure and consists of an iterative sequence of the following steps:
  • Data cleaning: noise and inconsistent data are removed.
  • Data integration: multiple data sources may be combined.
    • A popular trend in the information industry is to perform data cleaning and data integration as a preprocessing step, where the resulting data are stored in a data warehouse.
  • Data selection: data relevant to the analysis task are retrieved from the database.
  • Data transformation: data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.
Unit I: Data mining

  • Data mining: an essential process where intelligent methods are applied in order to extract data patterns.
  • Pattern evaluation: identify the truly interesting patterns representing knowledge, based on some interestingness measures.
    • A pattern is considered interesting if it is potentially useful, easily understood by humans, validates a hypothesis that someone wants to confirm, or is valid on new data with some degree of certainty.
  • Knowledge presentation: the information mined from the data needs to be presented to the user in an appealing way. Different knowledge representation and visualization techniques are applied to present the output of data mining to users.
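The KDD steps can be sketched as a small pipeline. This is only an illustrative sketch: the two sources, the field names and the threshold-based "mining" step are hypothetical stand-ins, not part of the course material.

```python
# Hypothetical toy data: two sources with noisy and missing amounts.
store_a = [{"customer": "ann", "amount": "120"}, {"customer": "bob", "amount": ""}]
store_b = [{"customer": "ann", "amount": "80"}, {"customer": "cy", "amount": "abc"}]


def clean(records):
    """Data cleaning: drop records whose amount is missing or not numeric."""
    return [r for r in records if r["amount"].isdigit()]


def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]


def select(records, customers):
    """Data selection: keep only records relevant to the analysis task."""
    return [r for r in records if r["customer"] in customers]


def transform(records):
    """Data transformation: aggregate the total amount per customer."""
    totals = {}
    for r in records:
        totals[r["customer"]] = totals.get(r["customer"], 0) + int(r["amount"])
    return totals


def mine(totals, threshold):
    """Data mining (stand-in): keep customers whose total meets a threshold."""
    return {c: t for c, t in totals.items() if t >= threshold}


cleaned = integrate(clean(store_a), clean(store_b))
selected = select(cleaned, {"ann", "bob", "cy"})
totals = transform(selected)
patterns = mine(totals, threshold=150)
print(patterns)  # knowledge presentation, in its simplest form
```

In a real system each step would be far richer, and the process is iterative: results from pattern evaluation often send the analyst back to an earlier step.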
Unit I: Data mining

Data mining as a process of knowledge discovery

Data Mining Techniques

• Data mining is a multi-disciplinary field.
• It integrates approaches and techniques from various disciplines such as machine learning, statistics, artificial intelligence, neural networks, database management, data warehousing, etc.
Data Mining Techniques

Statistics
• Statistics includes a number of methods to analyze numerical data in large quantities.
• Statistical tools used in data mining include regression analysis, cluster analysis, correlation analysis and Bayesian networks. Statistical models are usually built from a training data set.
• Correlation analysis identifies how variables are correlated with each other.
• A Bayesian network is a directed graph that represents causal relationships among data, found using Bayes' theorem.
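As a minimal sketch of correlation analysis, the Pearson correlation coefficient of two variables can be computed as follows. The variable names and values are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: sales are exactly twice the advertising spend.
ad_spend = [1, 2, 3, 4, 5]
sales = [2, 4, 6, 8, 10]
r = pearson(ad_spend, sales)
print(round(r, 3))  # 1.0: a perfect positive linear relationship
```

A value near +1 or -1 indicates a strong linear relationship; a value near 0 indicates none.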
Data Mining Techniques

• Given below is a simple Bayesian network where the nodes represent variables and the edges represent the relationships between the nodes.
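To make the idea concrete, assume a hypothetical two-node network Rain → WetGrass. The probabilities below are invented for illustration. Bayes' theorem then gives the probability of rain once wet grass has been observed.

```python
# Hypothetical network: Rain -> WetGrass (all numbers are assumptions).
p_rain = 0.2                # prior P(Rain)
p_wet_given_rain = 0.9      # P(WetGrass | Rain)
p_wet_given_no_rain = 0.1   # P(WetGrass | no Rain)

# Total probability of observing wet grass.
p_wet = p_wet_given_rain * p_rain + p_wet_given_no_rain * (1 - p_rain)

# Bayes' theorem: P(Rain | WetGrass) = P(WetGrass | Rain) * P(Rain) / P(WetGrass)
p_rain_given_wet = p_wet_given_rain * p_rain / p_wet
print(round(p_rain_given_wet, 3))  # 0.692
```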
Data Mining Techniques

Machine Learning
• Machine learning investigates how computers can learn (or improve their performance) based on data.
• A computer program needs to learn automatically to recognize complex patterns and make intelligent decisions based on data.
  • A typical machine learning problem is to program a computer so that it can automatically recognize handwritten postal codes on mail after learning from a set of examples.
• Machine learning is used to build new models and to search for the best model matching the test data.
• Data mining uses a number of machine learning methods, including inductive concept learning, conceptual clustering and decision tree induction.
Data Mining Techniques

• A decision tree is a classification tree that decides the class of an object by following the path from the root to a leaf node.
• Given below is a simple decision tree that is used for weather forecasting.
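A minimal sketch of classification by following a path from the root to a leaf. The tree below is a hypothetical weather example, not the one from the original slide; internal nodes test an attribute and leaves hold the class label.

```python
# Hypothetical tree: internal nodes are dicts, leaves are class labels.
tree = {
    "attribute": "sky",
    "branches": {
        "clear": "dry",
        "cloudy": {
            "attribute": "humidity",
            "branches": {"high": "rain", "low": "dry"},
        },
    },
}


def classify(node, sample):
    """Walk from the root to a leaf, choosing the branch matching the sample."""
    while isinstance(node, dict):
        value = sample[node["attribute"]]
        node = node["branches"][value]
    return node


print(classify(tree, {"sky": "cloudy", "humidity": "high"}))  # rain
```

Decision tree induction, mentioned among the machine learning methods, is the process of learning such a tree from examples; here the tree is simply written by hand.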
Data Mining Techniques

Neural Networks
• A neural network is a set of connected nodes called neurons. A neuron is a computing device that computes some function of its inputs, and the inputs can even be the outputs of other neurons.
• A neural network can be trained to find the relationship between the input attributes and the output attribute by adjusting the connections and the parameters of the nodes.
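As a sketch, a single neuron can be modelled as a weighted sum of its inputs passed through an activation function. The sigmoid activation, the weights and the biases below are assumptions chosen for illustration. Training, which is not shown, would adjust the weights and biases.

```python
import math


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs followed by a sigmoid activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Two hidden neurons read the input attributes...
x = [1.0, 0.0]
hidden = [neuron(x, [0.5, -0.5], 0.0), neuron(x, [-0.5, 0.5], 0.0)]

# ...and an output neuron takes the hidden neurons' outputs as its inputs.
output = neuron(hidden, [1.0, 1.0], -1.0)
print(round(output, 3))  # 0.5
```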
Data Mining Techniques

Database Oriented Techniques

• Advancements in database and data warehouse implementation help data mining in a number of ways.
• Database-oriented techniques are used mainly to derive characteristics of the available data.
• Iterative database scanning for frequent item sets, attribute focusing, and attribute-oriented induction are some of the database-oriented techniques widely used in data mining.
• Iterative database scanning searches for frequent item sets in a database.
• Attribute-oriented induction generalizes low-level data into high-level concepts using concept hierarchies.
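Iterative scanning for frequent item sets can be sketched in a level-wise, Apriori-style way: one scan of the transactions per item-set size, keeping only the sets that reach a minimum support count. The transactions and the threshold are hypothetical, and the slides do not name a specific algorithm.

```python
from itertools import combinations

# Hypothetical transactions; min_support is an absolute count.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support count} for all frequent item sets."""
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([i]) for i in items]
    frequent = {}
    size = 1
    while candidates:
        # One scan of the database per item-set size.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        size += 1
        # Build larger candidates only from items that are still frequent,
        # and keep a candidate only if all its smaller subsets were frequent.
        survivors = sorted({i for c in level for i in c})
        candidates = [
            frozenset(c)
            for c in combinations(survivors, size)
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        ]
    return frequent


frequent = frequent_itemsets(transactions, min_support=2)
for itemset, count in frequent.items():
    print(sorted(itemset), count)
```

With these transactions every single item and every pair is frequent, while the triple {bread, butter, milk} appears only once and is discarded.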
Data Mining Techniques

Data Visualization

• The information extracted from large volumes of data should be presented well to the end user, and data visualization techniques make this possible.
• Data is transformed into visual objects such as dots, lines and shapes and displayed in a two- or three-dimensional space. Data visualization is an effective way to identify trends, patterns, correlations and outliers in large amounts of data.

Note: A data mining system employs one or more techniques to handle different kinds of data, different data mining tasks, different application areas and different data requirements.
Applications

Sales/Marketing
• Data mining is used for market basket analysis to provide information on which product combinations were purchased together, when they were bought and in what sequence.
• This information helps businesses promote their most profitable products and maximize profit.
• In addition, it encourages customers to purchase related products that they may have missed or overlooked.
Applications

Banking / Finance
• Data mining is used to identify customer loyalty by analysing data on customers' purchasing activities, such as the frequency of purchases in a period of time, the total monetary value of all purchases and the date of the last purchase.
• After analysing those dimensions, a relative measure is generated for each customer. The higher the score, the more loyal the customer is relative to others.
• Data mining is also applied to help banks retain credit card customers. By analysing past data, it can help banks predict which customers are likely to change their credit card affiliation, so that they can plan and launch special offers to retain those customers.
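The three loyalty dimensions above (last purchase, frequency, monetary value) can be turned into a relative score in many ways. The sketch below uses one simple, assumed scheme: rank the customers on each dimension and add up the ranks. The purchase data and the scoring rule are hypothetical, and ties are not treated specially.

```python
from datetime import date

# Hypothetical purchase histories: (purchase date, amount).
purchases = {
    "ann": [(date(2024, 1, 5), 50.0), (date(2024, 3, 1), 70.0), (date(2024, 3, 20), 30.0)],
    "bob": [(date(2023, 11, 2), 200.0)],
}
today = date(2024, 4, 1)


def rfm(history, today):
    """Return (days since last purchase, number of purchases, total value)."""
    recency_days = (today - max(d for d, _ in history)).days
    frequency = len(history)
    monetary = sum(amount for _, amount in history)
    return recency_days, frequency, monetary


def loyalty_scores(purchases, today):
    """Sum of per-dimension ranks; a higher score means relatively more loyal."""
    metrics = {c: rfm(h, today) for c, h in purchases.items()}
    scores = {c: 0 for c in metrics}
    # (index into the tuple, whether a larger value is better)
    for index, larger_is_better in ((0, False), (1, True), (2, True)):
        ordered = sorted(metrics, key=lambda c: metrics[c][index],
                         reverse=not larger_is_better)
        for rank, customer in enumerate(ordered, start=1):
            scores[customer] += rank  # worst gets 1, best gets the top rank
    return scores


scores = loyalty_scores(purchases, today)
print(scores)  # {'ann': 5, 'bob': 4}
```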
Applications

Health Care and Insurance

• The growth of the insurance industry depends entirely on the ability to convert data into knowledge, information or intelligence about customers, competitors and markets.
  • Data mining is applied in claims analysis, such as identifying which medical procedures are claimed together.
  • Data mining makes it possible to forecast which customers will potentially purchase new policies.
  • Data mining allows insurance companies to detect risky customers' behaviour patterns.
  • Data mining helps detect fraudulent behaviour.
Applications

Transportation
• Data mining helps determine the distribution schedules among warehouses and outlets and analyze loading patterns.

Medicine
• Data mining makes it possible to characterize patient activities in order to anticipate incoming office visits.
• Data mining helps identify the patterns of successful medical therapies for different illnesses.
Challenges in Data Mining

Noisy and Incomplete Data

• Data in large quantities will often be inaccurate or unreliable. These problems could be due to errors in the instruments that measure the data or to human error.
• Suppose a retail chain collects the email IDs of customers who spend more than Tsh 20,000 and the billing staff enter the details into their system.
• A staff member might make spelling mistakes while entering an email ID, which results in incorrect data. Some customers might not be willing to disclose their email ID, which results in incomplete data.
• The data could also get altered due to system or human errors. All of this results in noisy and incomplete data, which makes data mining really challenging.
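A minimal sketch of how the email example might be checked during data cleaning. The records are invented, and the regular expression is a deliberately loose assumption (something@something.something), not a full email validator.

```python
import re

# Loose, assumed pattern: text, "@", text, ".", text, with no spaces.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Hypothetical customer records entered by billing staff.
records = [
    {"customer": "C1", "email": "asha@example.com"},
    {"customer": "C2", "email": "juma@example"},  # typing mistake: noisy
    {"customer": "C3", "email": ""},              # not disclosed: incomplete
]


def classify_email(value):
    """Label an email field as 'ok', 'noisy' (malformed) or 'missing'."""
    if not value:
        return "missing"
    if EMAIL_PATTERN.match(value):
        return "ok"
    return "noisy"


report = {r["customer"]: classify_email(r["email"]) for r in records}
print(report)  # {'C1': 'ok', 'C2': 'noisy', 'C3': 'missing'}
```

A check like this only catches malformed values; a well-formed but misspelled address would still pass.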
Challenges in Data Mining

Distributed Data
• Real-world data is usually stored on different platforms in distributed computing environments. It could be in databases, individual systems, or even on the Internet.
• It is practically very difficult to bring all the data into a centralized data repository, mainly for organizational and technical reasons.
• For example, different regional offices might have their own servers to store their data, and it may not be feasible to store all the data (millions of terabytes) from all the offices on a central server.
• So data mining demands the development of tools and algorithms that enable the mining of distributed data.
Challenges in Data Mining

Complex Data
• Real-world data is heterogeneous: it could be multimedia data (images, audio and video), complex data, temporal data, spatial data, time series, natural language text and so on.
• It is difficult to handle these different kinds of data and extract the required information.
• Often, new tools and methodologies have to be developed to extract relevant information.
Challenges in Data Mining

Performance
• The performance of a data mining system depends mainly on the efficiency of the algorithms and techniques used.
• If the algorithms and techniques are poorly designed, the performance of the data mining process suffers.
Challenges in Data Mining

Data Privacy and Security

• Data mining can lead to serious issues in terms of data security, privacy and governance.
• For example, when a retailer analyzes purchase details, it reveals information about the buying habits and preferences of customers without their permission.
Challenges in Data Mining

Data Visualization
• Data visualization is a very important process in data mining because it is the main process that displays the output to the user in a presentable manner.
• The information extracted should convey exactly what it is intended to convey.
• But it is often difficult to represent the information to the end user in an accurate and easy-to-understand way.
• Because the input data and the output information are complex, very effective data visualization techniques need to be applied.
Trends that Affect Data Mining

• Data Trends
  • Perhaps the most fundamental external trend is the explosion of digital data during the past two decades.
  • During this period, the amount of data has probably grown enormously.
  • Much of this data is accessible via networks.
  • On the other hand, during the same period the number of scientists, engineers, and other analysts available to analyze this data has remained relatively constant.
  • Only one conclusion is possible: either most of the data is destined to be write-only, or techniques such as data mining must be developed that can automate, in part, the analysis of this data, filter irrelevant information, and extract meaningful knowledge.
Trends that Affect Data Mining

• Hardware Trends
  • Data mining requires numerically and statistically intensive computations on large datasets.
  • The increasing memory and processing speed of workstations enable the mining, with current algorithms and techniques, of datasets that were too large to be mined just a few years ago.
  • In addition, the commoditization of high-performance computing through SMP workstations and high-performance workstation clusters enables attacking data mining problems that were accessible only to the largest supercomputers a few years ago.
Trends that Affect Data Mining

• Network Trends
  • The next generation Internet (NGI) will connect sites at OC-3 (155 Mbit/s) speeds and higher.
  • This is over 100 times faster than the connectivity provided by current networks. With this type of connectivity, it becomes possible to correlate distributed datasets using current algorithms and techniques.
  • In addition, new protocols, algorithms, and languages are being developed to facilitate distributed data mining using current and next generation networks.
Trends that Affect Data Mining

• Scientific Computing Trends
  • Scientists and engineers today view simulation as a third mode of science.
  • Data mining and knowledge discovery serve an important role linking the three modes of science (theory, experiment, and simulation), especially in cases where the experiment or simulation results in large datasets.
Trends that Affect Data Mining

• Business Trends
  • Today businesses must be more profitable, react quicker, and offer higher quality services than ever before, and do it all using fewer people and at lower cost.
  • With these types of expectations and constraints, data mining becomes a fundamental technology, enabling businesses to more accurately predict opportunities and risks generated by their customers and their customers' transactions.
Data Mining & Data Warehousing

• Data Warehouse: "a repository (or archive) of information gathered from multiple sources, stored under a unified schema, at a single site."
  • Collect data and store it in a single repository.
  • Having a single repository makes query development easier.
• Data Mining:
  • Analysing databases or data warehouses to discover patterns in the data and gain knowledge.
  • Knowledge is power.
Architecture of Data Mining

Data can be stored in databases and/or data warehouse systems.

• Question: Should we design a data mining system that is decoupled from or coupled with database and data warehouse systems?
• This question leads to four possible architectures of a data mining system, as follows:

1. No-coupling:
In this architecture, the data mining system does not use any functionality of a database or data warehouse system. A no-coupling data mining system retrieves data from a particular data source such as a file system, processes the data using data mining algorithms and stores the results in the file system.
The no-coupling architecture is considered a poor architecture for a data mining system; however, it is used for simple data mining processes.
Architecture of Data Mining

2. Loose Coupling:
In this architecture, the data mining system uses the database or data warehouse for data retrieval: it retrieves data from the database or data warehouse, processes the data using data mining algorithms and stores the result in those systems.
This architecture is mainly for memory-based data mining systems that do not require high scalability and high performance.
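Loose coupling can be sketched with Python's built-in sqlite3 module standing in for the database. The table names and data are hypothetical. The database is used only to fetch raw rows and to store the result; all processing happens in the mining program's own memory.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the database system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("ann", 120.0), ("ann", 80.0), ("bob", 40.0)],
)
conn.execute("CREATE TABLE mining_results (customer TEXT, total REAL)")

# 1. Retrieval: the database only hands over the raw rows.
rows = conn.execute("SELECT customer, amount FROM sales").fetchall()

# 2. Processing: done entirely in memory by the mining program.
totals = {}
for customer, amount in rows:
    totals[customer] = totals.get(customer, 0.0) + amount

# 3. Storage: the result is written back to the database.
conn.executemany("INSERT INTO mining_results VALUES (?, ?)", sorted(totals.items()))
conn.commit()

stored = conn.execute(
    "SELECT customer, total FROM mining_results ORDER BY customer"
).fetchall()
print(stored)  # [('ann', 200.0), ('bob', 40.0)]
```

Because every row is pulled into memory first, this approach stops scaling once the data no longer fits, which is the limitation noted above.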
Architecture of Data Mining

3. Semi-tight Coupling:
In the semi-tight coupling architecture, besides linking to the database or data warehouse system, the data mining system uses several features of that system to perform some data mining tasks, including sorting, indexing and aggregation.
In this architecture, some intermediate results can be stored in the database or data warehouse system for better performance.
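A sketch of the semi-tight idea, again with sqlite3 as a stand-in and hypothetical table names: indexing, aggregation and sorting are pushed into the database, and the aggregated intermediate result is kept in a table so that the mining program only sees summarized rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("ann", 120.0), ("ann", 80.0), ("bob", 40.0)],
)

# Indexing is delegated to the database.
conn.execute("CREATE INDEX idx_sales_customer ON sales (customer)")

# Aggregation is delegated too; the intermediate result is stored in a table.
conn.execute(
    """
    CREATE TABLE customer_totals AS
    SELECT customer, SUM(amount) AS total
    FROM sales
    GROUP BY customer
    """
)

# Sorting is delegated as well; the mining step works on summarized data.
top = conn.execute(
    "SELECT customer, total FROM customer_totals ORDER BY total DESC"
).fetchall()
big_spenders = [customer for customer, total in top if total >= 100]
print(big_spenders)  # ['ann']
```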
Architecture of Data Mining

4. Tight Coupling:
In the tight coupling architecture, the database or data warehouse is treated as an information retrieval component of the data mining system through integration.
All the features of the database or data warehouse are used to perform data mining tasks. This architecture provides system scalability, high performance, and integrated information.
Architecture of Data Mining

There are three tiers in the tight-coupling data mining architecture:

Data layer

The data layer can be defined as a database or data warehouse system. This layer is an interface to all data sources. Data mining results are stored in the data layer, so that they can be presented to the end user in the form of reports or other kinds of visualization.

Data mining application layer

This layer retrieves data from the database. Some transformation routines have to be performed here to convert the data into the desired format. The data is then processed using various data mining algorithms.

Front-end layer

This layer provides an intuitive and friendly user interface for the end user to interact with the data mining system. Data mining results are presented to the user in visual form in the front-end layer.
Architecture of Data Mining

A data mining system may have the following major components:

• A database, data warehouse, or other information repository: a set of databases, data warehouses, spreadsheets, or other kinds of information repositories.
• Knowledge base:
  • Contains the domain knowledge used to guide the search or to evaluate the interestingness of resulting patterns. For example, the knowledge base may contain metadata that describes data from multiple heterogeneous sources.
  • Knowledge such as user beliefs, which can be used to assess a pattern's interestingness based on its unexpectedness, may also be included.
Architecture of Data Mining

• Data mining engine
  • The data mining engine is the core component of any data mining system. It consists of a number of modules for performing data mining tasks, including association, classification, characterization, clustering, prediction, time-series analysis, etc.
• A database or data warehouse server
  • The database or data warehouse server contains the actual data that is ready to be processed. The server is therefore responsible for retrieving the relevant data based on the user's data mining request.
Architecture of Data Mining

• A pattern evaluation module
  • This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns.
  • It may use interestingness thresholds to filter out discovered patterns.
  • Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining method used.
• For efficient data mining, it is highly recommended to push the evaluation of pattern interestingness as deep as possible into the mining process so as to confine the search to only the interesting patterns.
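Filtering with interestingness thresholds can be sketched as follows. The slides do not name specific measures; support and confidence are assumed here because they are common choices for association rules, and the rules and numbers are invented.

```python
# Hypothetical discovered patterns with two assumed interestingness measures.
patterns = [
    {"rule": "bread -> milk", "support": 0.40, "confidence": 0.80},
    {"rule": "milk -> eggs", "support": 0.05, "confidence": 0.90},
    {"rule": "tea -> sugar", "support": 0.30, "confidence": 0.45},
]


def interesting(patterns, min_support, min_confidence):
    """Keep only the patterns that meet both thresholds."""
    return [
        p for p in patterns
        if p["support"] >= min_support and p["confidence"] >= min_confidence
    ]


kept = interesting(patterns, min_support=0.10, min_confidence=0.60)
print([p["rule"] for p in kept])  # ['bread -> milk']
```

This sketch filters after mining. Pushing the same thresholds into the mining step itself, as recommended above, avoids generating the uninteresting patterns in the first place.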
Architecture of Data Mining

• User interface
  • This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results.
  • In addition, this component allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.
How is a data warehouse different from a database?
How are they similar?

• A data warehouse is a repository of information collected from multiple sources, over a history of time, stored under a unified schema, and used for data analysis and decision support,

WHEREAS

• A database is a collection of interrelated data that represents the current status of the stored data. There could be multiple heterogeneous databases where the schema of one database may not agree with the schema of another. A database system supports ad hoc queries and online transaction processing.
How is a data warehouse different from a database?
How are they similar?

Similarities between a data warehouse and a database:

Both are repositories of information, storing huge amounts of persistent data.
Data mining: On what kind of data?

• Data mining can be applied to any kind of data as long as the data are meaningful for a target application.
• In principle, data mining should be applicable to any kind of information repository. This includes:
  • Relational databases
  • Data warehouses
  • Transactional databases
  • Advanced database systems
  • Flat files
  • The World Wide Web
Data mining: On what kind of data?

• Advanced database systems include:
  • Object-oriented databases
  • Object-relational databases
  • Special application-oriented databases such as
    • Spatial databases
    • Time-series databases
    • Text databases
    • Multimedia databases
Flat files

• Flat files are data files in text or binary form with a structure that can be easily extracted by data mining algorithms.
• Data stored in flat files have no relationships or paths among themselves; for example, if a relational database is stored in flat files, there will be no relations between the tables.
• Flat files are described by a data dictionary. E.g. a CSV file.
• For example, a spreadsheet application such as Excel can be used as a flat-file database.
• Application: used in data warehousing to store data, used in carrying data to and from a server, etc.
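As a small sketch of reading a flat file, Python's csv module can parse CSV text into records. The column names and rows are hypothetical, and an in-memory string stands in for a file on disk.

```python
import csv
import io

# Hypothetical CSV content; io.StringIO stands in for an opened file.
flat_file = io.StringIO(
    "trans_id,item,amount\n"
    "T100,bread,1500\n"
    "T100,milk,2000\n"
    "T200,bread,1500\n"
)

rows = list(csv.DictReader(flat_file))  # the header line supplies the field names

# Every value arrives as text: the file itself carries no types or relations.
amount_per_item = {}
for row in rows:
    amount_per_item[row["item"]] = amount_per_item.get(row["item"], 0) + int(row["amount"])

print(amount_per_item)  # {'bread': 3000, 'milk': 2000}
```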
Relational Databases

• A relational database is a collection of data organized in tables with rows and columns.
• The physical schema of a relational database defines the structure of the tables.
• The logical schema of a relational database defines the relationships among the tables.
• The standard query language (API) for relational databases is SQL.
• SQL allows retrieval and manipulation of the data stored in the tables, as well as the calculation of aggregate functions such as average, sum, min, max and count.
• Application: data mining, the relational online analytical processing (ROLAP) model, etc.
Relational Databases

• For instance, an SQL query to count the videos in each category would be:
SELECT category, count(*) FROM Items WHERE type = 'video' GROUP BY category;
• Data mining algorithms using relational databases can be more versatile than data mining algorithms specifically written for flat files, since they can take advantage of the structure inherent to relational databases.
• While data mining can benefit from SQL for data selection, transformation and consolidation, it goes beyond what SQL can provide, such as predicting, comparing, detecting deviations, etc.
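The grouped count can be tried out with Python's built-in sqlite3 module. The Items table layout and its rows are assumptions made for this sketch; only the table name and the type and category columns come from the query on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed layout of the Items table.
conn.execute("CREATE TABLE Items (title TEXT, type TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO Items VALUES (?, ?, ?)",
    [
        ("Movie A", "video", "drama"),
        ("Movie B", "video", "comedy"),
        ("Movie C", "video", "drama"),
        ("Album D", "audio", "jazz"),
    ],
)

# Count the videos in each category; ORDER BY makes the output order stable.
counts = conn.execute(
    "SELECT category, COUNT(*) FROM Items "
    "WHERE type = 'video' GROUP BY category ORDER BY category"
).fetchall()
print(counts)  # [('comedy', 1), ('drama', 2)]
```

Note that the string literal 'video' must be quoted, and the grouping column has to be selected for the counts to be attributable to a category.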
Data warehouses

• A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and which usually resides at a single site.
• Data warehouses are constructed via a process of data cleansing, data transformation, data integration, data loading, and periodic data refreshing.
• The next figure shows the basic architecture of a data warehouse.

Data warehouses

Figure: Architecture of a typical data warehouse (data sources in Mwanza, Mbeya and Kigoma)
Data warehouses

• In order to facilitate decision making, the data in a data warehouse are organized around major subjects, such as
  • Customer
  • Item
  • Supplier
  • Activity
• The data are stored to provide information from a historical perspective and are typically summarized.
• A data warehouse provides a multidimensional view of data and allows the precomputation and fast access of summarized data.
Data warehouses

This matrix is an example of a two-dimensional "array". An array is the fundamental component of a multidimensional database.
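A two-dimensional array of this kind can be sketched by summarizing detail records along two dimensions. The dimensions (region and quarter) and the figures are hypothetical; precomputing the cells and the row totals is what makes later access to summaries fast.

```python
# Hypothetical detail records: (region, quarter, units sold).
sales = [
    ("Mwanza", "Q1", 10),
    ("Mwanza", "Q2", 15),
    ("Mbeya", "Q1", 7),
    ("Mbeya", "Q2", 9),
    ("Mwanza", "Q1", 5),
]

regions = sorted({region for region, _, _ in sales})     # row dimension
quarters = sorted({quarter for _, quarter, _ in sales})  # column dimension

# Precompute the summarized cells of the region-by-quarter array.
matrix = [[0] * len(quarters) for _ in regions]
for region, quarter, units in sales:
    matrix[regions.index(region)][quarters.index(quarter)] += units

# A further summarized view: totals per region.
row_totals = {region: sum(matrix[i]) for i, region in enumerate(regions)}
print(matrix)      # [[7, 9], [15, 15]]
print(row_totals)  # {'Mbeya': 16, 'Mwanza': 30}
```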
Transactional databases

• In general, a transactional database consists of a flat file where each record represents a transaction.
• A transaction typically includes a unique transaction identity number (trans ID) and a list of the items making up the transaction (such as items purchased in a store), as shown below:
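A minimal sketch of such records, with invented transaction IDs and items, could look like this:

```python
# Hypothetical transactional records: one record per transaction.
transactions = [
    {"trans_id": "T100", "items": ["bread", "milk"]},
    {"trans_id": "T200", "items": ["bread", "butter", "eggs"]},
    {"trans_id": "T300", "items": ["milk"]},
]


def transactions_containing(item):
    """Return the IDs of all transactions that include the given item."""
    return [t["trans_id"] for t in transactions if item in t["items"]]


print(transactions_containing("bread"))  # ['T100', 'T200']
```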
Advanced database systems and advanced database applications
An object-oriented database

• An object-oriented database is designed based on the object-oriented programming paradigm, where data are a large number of objects organized into classes and class hierarchies.
• Each entity in the database is considered an object.
• An object contains:
  • a set of variables that describe the object;
  • a set of messages that the object can use to communicate with other objects or with the rest of the database system; and
  • a set of methods, where each method holds the code to implement a message.
• E.g. ObjectStore
A spatial database

• Spatial databases are databases that, in addition to the usual data, store geographical information such as maps and global or regional positioning.
• Most spatial databases allow representing simple geometric objects such as points, lines and polygons.
• Such spatial databases present new challenges to data mining algorithms. An example of a spatial database is a geographical (map) database.
Time-Series Databases:

• Time-series databases contain time-related data such as stock market data or logged activities.
• These databases usually have a continuous flow of new data coming in, which sometimes calls for challenging real-time analysis.
• Data mining in such databases commonly includes the study of trends and correlations between the evolution of different variables, as well as the prediction of trends and movements of the variables over time.
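One simple way to study a trend in time-series data is a moving average, which smooths out short-term fluctuations. The sketch below uses invented prices and an assumed window of three observations.

```python
def moving_average(values, window):
    """Average of each run of `window` consecutive values."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Hypothetical daily closing prices.
prices = [10, 12, 11, 13, 15, 14, 16]
trend = moving_average(prices, window=3)
print(trend)  # [11.0, 12.0, 13.0, 14.0, 15.0]

# The smoothed series rises steadily although the raw prices dip twice.
rising = all(later > earlier for earlier, later in zip(trend, trend[1:]))
print(rising)  # True
```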
Text Databases:

• A text database is a database that contains text documents or other word descriptions in the form of long sentences or paragraphs, such as:
  • Product specifications
  • Error or bug reports
  • Warning messages
  • Summary reports
  • Notes
  • Other documents
A multimedia database

• Multimedia databases include video, image, audio and text media.
• They can be stored in extended object-relational or object-oriented databases, or simply on a file system. Multimedia is characterized by its high dimensionality, which makes data mining even more challenging.
• Data mining from multimedia repositories may require computer vision, computer graphics, image interpretation, and natural language processing methodologies.
The World-Wide Web (WWW)

• The World Wide Web is the most heterogeneous and dynamic repository available.
• Data in the World Wide Web is organized in inter-connected documents. These documents can be text, audio, video, raw data, and even applications.
• Conceptually, the World Wide Web comprises three major components:
  • the content of the Web, which encompasses the documents available;
  • the structure of the Web, which covers the hyperlinks and the relationships between documents; and
  • the usage of the Web, which describes how and when the resources are accessed.
• Data mining in the World Wide Web, or web mining, tries to address all these issues and is often divided into web content mining, web structure mining and web usage mining.
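As a very small sketch of web structure mining, the hyperlinks between documents can be treated as a graph and the incoming links per page counted. The pages and links are hypothetical, and in-link counting is just one simple structural measure.

```python
from collections import Counter

# Hypothetical link structure: page -> pages it links to.
links = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home", "about"],
}

# Count how many pages link to each page.
in_links = Counter(target for targets in links.values() for target in targets)
print(dict(in_links))  # {'about': 2, 'products': 1, 'home': 2}
```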