Unit I - Chapter 1 - Data Mining
We further look at how data mining can meet this need by providing
tools to discover knowledge from data.
To understand how data are produced, we will take up two activities
that build an awareness of data collection.
The field is young, dynamic, and promising. Data mining has made, and
will continue to make, great strides in our journey from the data age
toward the coming information age.
Example: Data mining turns a large collection of data into knowledge.
What novel and useful knowledge can a search engine learn from the
huge collection of queries it gathers from users over time?
During the 1990s, the World Wide Web and web-based databases (e.g.,
XML databases) began to appear.
The effective and efficient analysis of such different forms of data,
through the integration of information retrieval, data mining, and
information network analysis technologies, is a challenging task.
To understand the importance of a data warehouse, consider the
following example.
The widening gap between data and information calls for the systematic
development of data mining tools that can turn data tombs into
golden nuggets of knowledge.
1.2 What Is Data Mining?
3. Data Selection - where data relevant to the analysis task are retrieved
from the database.
The data sources can include databases, data warehouses, the Web,
other information repositories, or data that are streamed into the
system dynamically.
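To make the data selection step concrete, here is a minimal Python sketch; the sales table, its columns, and its rows are invented for illustration, not taken from the book.

    # Minimal sketch of data selection: retrieve only the rows relevant
    # to the analysis task (here, third-quarter sales) from a database.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales(item TEXT, branch TEXT, amount REAL, qtr TEXT)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
        ("printer", "Vancouver", 250.0, "Q3"),
        ("webcam",  "New York",  89.0,  "Q2"),
    ])
    relevant = conn.execute(
        "SELECT item, branch, amount FROM sales WHERE qtr = ?", ("Q3",)
    ).fetchall()
    print(relevant)   # only the Q3 tuples enter the analysis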
1.3 - What Kinds of Data Can Be Mined?
The most basic forms of data for mining applications are Database
Data, Data Warehouse Data and Transactional Data.
1.3.1 - Database Data
Similarly, each of the relations item, employee, and branch consists of a set
of attributes describing the properties of these entities.
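As a rough sketch of how such relations might look, the schemas below are simplified assumptions, not the exact AllElectronics attributes.

    # Each relation is a table of tuples; each attribute (column)
    # describes one property of the entity it models.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE item(item_ID INTEGER, name TEXT, brand TEXT, price REAL);
    CREATE TABLE employee(empl_ID INTEGER, name TEXT, salary REAL);
    CREATE TABLE branch(branch_ID INTEGER, name TEXT, address TEXT);
    """)
    conn.execute("INSERT INTO item VALUES (1, 'HD camcorder', 'SharpView', 350.0)")
    for row in conn.execute("SELECT name, brand, price FROM item"):
        print(row)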
Now all the data are in one group; let us see how they are separated
into the different relations.
Who is confident that he/she will fail at least one subject this
semester?
This is a difficult task, particularly since the relevant data are spread
out over several databases physically located at numerous sites.
The solution: if AllElectronics had a data warehouse, this task would
be easy.
In the previous slide we said that if the AllElectronics company asks
for a third-quarter report, it would be difficult to produce.
Can you say how many students there are in Depaul College?
Can you say how many students have taken BCA, B.Com, BBA & BA?
The figure shows the typical framework for the construction and use of
a data warehouse for AllElectronics.
To facilitate decision making, the data in a data warehouse are
organized around major subjects (e.g., customer, item, supplier, and
activity).
For example, rather than storing the details of each sales transaction,
the data warehouse may store a summary of the transactions per item
type for each store or, summarized to a higher level, for each sales
region.
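A minimal sketch of this kind of summarization: detail transactions are rolled up per (store, item type), and only the totals are kept. The records below are made up so that the Vancouver security total matches the cube example that follows.

    from collections import defaultdict

    transactions = [
        ("Vancouver", "security", 120.0),
        ("Vancouver", "security", 280.0),
        ("New York",  "computer", 999.0),
    ]
    summary = defaultdict(float)
    for store, item_type, amount in transactions:
        summary[(store, item_type)] += amount   # keep totals, drop detail

    for key, total in summary.items():
        print(key, total)   # ('Vancouver', 'security') 400.0 ...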
For example, the total sales for the first quarter, Q1, for the items
related to security systems in Vancouver is $400 as stored in cell
(Vancouver, Q1, security).
By providing multidimensional data views and the precomputation of
summarized data, data warehouse systems can provide inherent
support for OLAP.
Do you have any idea how we can compress the details in the above
diagram?
For instance, we can drill down on sales data summarized by quarter to
see data summarized by month.
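Here is a toy illustration of drill-down: the same sales facts summarized first at the quarter level and then at the finer month level. The figures are invented.

    from collections import defaultdict

    sales = [("Jan", 130.0), ("Feb", 150.0), ("Mar", 120.0)]  # all in Q1
    month_to_quarter = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1"}

    by_quarter = defaultdict(float)
    by_month = defaultdict(float)
    for month, amount in sales:
        by_quarter[month_to_quarter[month]] += amount  # rolled-up view
        by_month[month] += amount                      # drilled-down view

    print(dict(by_quarter))  # {'Q1': 400.0}
    print(dict(by_month))    # {'Jan': 130.0, 'Feb': 150.0, 'Mar': 120.0}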
This kind of market basket data analysis would enable you to bundle
groups of items together as a strategy for boosting sales.
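A minimal market-basket sketch, counting how often item pairs are bought together over a few invented transactions; frequently co-occurring pairs are candidates for bundling.

    from collections import Counter
    from itertools import combinations

    baskets = [
        {"computer", "printer", "paper"},
        {"computer", "printer"},
        {"computer", "webcam"},
    ]
    pair_counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1

    print(pair_counts.most_common(2))  # ('computer', 'printer') leads with 2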
Data Streams (e.g., video surveillance and sensor data, which are
continuously transmitted)
For example, in the AllElectronics store, classes of items for sale include
computers and printers, and concepts of customers include bigSpenders
and budgetSpenders.
A data set may contain objects that do not comply with the general
behavior or model of the data.
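One simple way to flag such objects, under the common assumption that values far from the mean (here, more than two standard deviations) are suspicious; the numbers are made up.

    from statistics import mean, stdev

    values = [48, 52, 50, 49, 51, 250]   # 250 does not fit the general model
    mu, sigma = mean(values), stdev(values)
    outliers = [v for v in values if abs(v - mu) > 2 * sigma]
    print(outliers)   # [250]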
You note that many tuples have no recorded value for several attributes
such as customer income.
How can you go about filling in the missing values for this attribute?
One option is to ignore the tuple. This method is not very effective,
unless the tuple contains several attributes with missing values.
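A common alternative is to fill the missing value with the attribute mean. The sketch below uses a hypothetical income field and made-up values.

    from statistics import mean

    records = [{"income": 45000}, {"income": None}, {"income": 55000}]
    known = [r["income"] for r in records if r["income"] is not None]
    fill = mean(known)   # 50000
    for r in records:
        if r["income"] is None:
            r["income"] = fill   # replace the missing value with the mean
    print(records)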
In general, the larger the bin width, the greater the effect of the smoothing.
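A sketch of smoothing by bin means: sorted values are partitioned into equal-size bins, and each value is replaced by its bin's mean. Widening the bins smooths more aggressively.

    from statistics import mean

    data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
    width = 3   # try width = 9 to see heavier smoothing
    smoothed = []
    for i in range(0, len(data), width):
        bin_vals = data[i:i + width]
        smoothed.extend([mean(bin_vals)] * len(bin_vals))
    print(smoothed)   # [9, 9, 9, 22, 22, 22, 29, 29, 29]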
Regression:
Linear regression involves finding the best line to fit two attributes (or
variables) so that one attribute can be used to predict the other.
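A least-squares sketch of simple linear regression, fitting y = w1*x + w0 so that one attribute (x) predicts the other (y); the data points are invented.

    from statistics import mean

    xs = [1.0, 2.0, 3.0, 4.0, 5.0]
    ys = [2.1, 3.9, 6.2, 8.0, 9.8]

    x_bar, y_bar = mean(xs), mean(ys)
    w1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))   # slope
    w0 = y_bar - w1 * x_bar                      # intercept
    print(f"y = {w1:.2f}*x + {w0:.2f}")          # y = 1.95*x + 0.15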
So far, we have looked at techniques for handling missing data and for
smoothing data.
But data cleaning is a big job. What about data cleaning as a process?
How exactly does one proceed in tackling this task? Are there any tools
out there to help?
The first step in data cleaning as a process is Discrepancy Detection
(finding variations and inconsistencies in the data).
Errors can also occur when the data are (inadequately) used for
purposes other than originally intended.
So, how can we proceed with discrepancy detection? As a starting
point, use any knowledge you may already have regarding properties of
the data.
This is where we can make use of the knowledge we gained about our
data.
In this step, you may write your own scripts and/or use some of the
tools. From this, you may find noise, outliers, and unusual values that
need investigation.
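A sketch of what such a home-grown script might look like: simple rule checks over made-up records, flagging values that violate known properties of the data.

    records = [
        {"age": 34, "zip": "V5K0A1"},
        {"age": -3, "zip": "V5K0A1"},    # out of range
        {"age": 27, "zip": ""},          # missing value
    ]
    for i, r in enumerate(records):
        if not (0 <= r["age"] <= 120):
            print(f"row {i}: suspicious age {r['age']}")
        if not r["zip"]:
            print(f"row {i}: missing zip code")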
Further explanation of this topic will be given by Sr. Arul Devika.