UNIT I Introduction to Data Mining
• Query: The query might not be well formed or precisely stated. The
data miner might not even be exactly sure of what he wants to see.
• Data: The data accessed is usually a different version from that of the
original operational database. The data have been cleansed and modified to
better support the mining process.
• Output: The output of the data mining query probably is not a subset of the database. Instead, it is the output of some analysis of the contents of the database.
EXAMPLE 1.1
Credit card companies must determine whether to authorize credit card
purchases. Suppose that based on past historical information about
purchases, each purchase is placed into one of four classes: (1) authorize,
(2) ask for further identification before authorization, (3) do not
authorize, and (4) do not authorize but contact police. The data mining
functions here are twofold. First the historical data must be examined to
determine how the data fit into the four classes. Then the problem is to
apply this model to each new purchase. Although the second part indeed
may be stated as a simple database query, the first part cannot be.
Classification
Classification maps data into predefined groups or classes. Because the classes are determined before the data are examined, classification is often referred to as supervised learning. Example 1.2 illustrates the task.
EXAMPLE 1.2
An airport security screening station is used to determine: if passengers
are potential terrorists or criminals. To do this, the face of each
passenger is scanned and its basic pattern (distance between eyes, size
and shape of mouth, shape of head, etc.) is identified. This pattern is
compared to entries in a database to see if it matches any patterns that
are associated with known offenders.
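As a rough illustration of the classification tasks in Examples 1.1 and 1.2, the sketch below assigns a scanned facial-feature vector to the closest stored pattern, flagging a match only when the distance falls under a threshold. The feature values, stored patterns, and threshold are all invented for illustration; in practice the classes would be learned from labeled historical data rather than hand-coded.

```python
import math

# Hypothetical database of known-offender facial patterns:
# (distance between eyes, mouth width, head width), all in centimeters.
known_patterns = {
    "offender_17": (6.2, 5.1, 15.8),
    "offender_42": (5.8, 4.7, 14.9),
}

def classify_passenger(scan, threshold=0.5):
    """Return the closest known pattern if it is within the threshold,
    otherwise classify the passenger as 'no match'."""
    best_name, best_dist = None, float("inf")
    for name, pattern in known_patterns.items():
        dist = math.dist(scan, pattern)   # Euclidean distance between feature vectors
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= threshold else "no match"

print(classify_passenger((6.1, 5.0, 15.7)))   # close to offender_17
print(classify_passenger((7.5, 6.0, 17.0)))   # not close to any stored pattern
```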
Regression
Regression is used to map a data item to a real-valued prediction variable. In actuality, regression involves learning the function that does this mapping. Regression assumes that the target data fit some known type of function (e.g., linear, logistic, etc.) and then determines the best function of this type that models the given data. Some type of error analysis is used to determine which function is "best." Standard linear regression, as illustrated in Example 1.3, is a simple example of regression.
EXAMPLE 1.3
A college professor wishes to reach a certain level of savings before her retirement. Periodically, she predicts what her retirement savings will be based on their current value and several past values. She uses a simple linear regression formula to predict this value by fitting past behavior to a linear function and then using this function to predict the values at points in the future. Based on these values, she then alters her investment portfolio.
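As a minimal sketch of the simple linear regression described above, the code below fits a line to a few hypothetical past savings balances and extrapolates it to a future year. The data values are invented purely for illustration.

```python
# Hypothetical retirement-savings balances observed at the end of each year.
years   = [1, 2, 3, 4, 5]
savings = [52.0, 57.5, 61.0, 67.2, 71.8]   # in thousands of dollars

n = len(years)
mean_x = sum(years) / n
mean_y = sum(savings) / n

# Ordinary least-squares estimates of slope and intercept.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, savings)) / \
        sum((x - mean_x) ** 2 for x in years)
intercept = mean_y - slope * mean_x

# Use the fitted line y = intercept + slope * x to predict a future value.
year_10_prediction = intercept + slope * 10
print(f"Predicted savings in year 10: {year_10_prediction:.1f} thousand dollars")
```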
Prediction
Many real-world data mining applications can be seen as predicting future
data states based on past and current data. Prediction can be viewed as a
type of classification. (Note: This is a data mining task that is different
from the prediction model, although the prediction task is a type of
prediction model.) The difference is that prediction is predicting a future
state rather than a current state. Here we are referring to a type of
application rather than to a type of data mining modeling approach, as
discussed earlier. Prediction applications include flooding, speech
recognition, machine learning, and pattern recognition.
Although future values may be predicted using time series analysis or
regression techniques, other approaches may be used as well. Example 1.5
illustrates the process.
EXAMPLE 1.5
Predicting flooding is a difficult problem. One approach uses monitors
placed at various points in the river. These monitors collect data relevant
to flood prediction: water level, rain amount, time, humidity, and so on.
Then the water level at a potential flooding point in the river can be
predicted based on the data collected by the sensors upriver from this
point. The prediction must be made with respect to the time the data
were collected.
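Future values can also be projected directly from a time series, as the text above notes. The sketch below extrapolates the next water-level reading at one monitoring point from the average rise over the last few readings; the readings and the window size are invented for illustration, and a real flood-prediction system would combine many upriver sensors.

```python
# Hypothetical hourly water-level readings (in meters) at one monitor.
levels = [2.1, 2.3, 2.6, 3.0, 3.5]

# Average change between consecutive readings over a recent window.
window = levels[-4:]
avg_rise = sum(b - a for a, b in zip(window, window[1:])) / (len(window) - 1)

# Naive trend extrapolation: next reading = last reading + average rise.
next_level = levels[-1] + avg_rise
print(f"Predicted level for the next hour: {next_level:.2f} m")
```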
Clustering
Clustering is similar to classification except that the groups are not predefined, but rather defined by the data alone. Clustering is alternatively referred to as unsupervised learning or segmentation. It can be thought of as partitioning or segmenting the data into groups that might or might not be disjoint. The clustering is usually accomplished by determining the similarity among the data on predefined attributes. The most similar data are grouped into clusters. Example 1.6 provides a simple clustering example. Since the clusters are not predefined, a domain expert is often required to interpret the meaning of the created clusters.
EXAMPLE 1.6
A certain national department store chain creates special catalogs
targeted to various demographic groups based on attributes such as
income, location, and physical characteristics of potential customers (age,
height, weight, etc.). To determine the target mailings of the various
catalogs and to assist in the creation of new, more specific catalogs, the
company performs a clustering of potential customers based on the
determined attribute values. The results of the clustering exercise are
then used by management to create special catalogs and distribute them
to the correct target population based on the cluster for that catalog.
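One common clustering technique (not named in the text above) is k-means. The sketch below clusters a handful of hypothetical customers by income and age with a bare-bones k-means procedure and k = 2; all values, including the crude centroid initialization, are made up, and a real catalog study would use many more attributes and records.

```python
import math

# Hypothetical customers: (income in thousands of dollars, age in years).
customers = [(30, 25), (32, 30), (35, 27), (80, 50), (85, 55), (90, 48)]

def kmeans(points, k=2, iterations=10):
    centroids = points[:k]                       # crude initialization: first k points
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
    return clusters, centroids

clusters, centroids = kmeans(customers)
for i, cluster in enumerate(clusters):
    print(f"Cluster {i} (centroid {centroids[i]}): {cluster}")
```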
Summarization
Summarization maps data into subsets with associated simple
descriptions. Summarization is also called characterization or
generalization. It extracts or derives representative information about
the database. This may be accomplished by actually retrieving portions
of the data. Alternatively, summary type information (such as the mean of
some numeric attribute) can be derived from the data. The summarization
succinctly characterizes the contents of the database. Example 1.7
illustrates this process.
EXAMPLE 1.7
One of the many criteria used to compare universities by the U.S. News & World Report is the average SAT or ACT score [GM99]. This is a
summarization used to estimate the type and intellectual level of the
student body.
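Summarization of a numeric attribute often amounts to computing a few descriptive statistics, such as the mean mentioned above. The sketch below derives summary-type information from a set of hypothetical SAT scores in the spirit of Example 1.7; the scores are invented.

```python
import statistics

# Hypothetical SAT scores for one university's entering class.
sat_scores = [1180, 1250, 1310, 1390, 1420, 1270, 1330]

# Summary-type information characterizing the attribute.
print("mean  :", round(statistics.mean(sat_scores)))
print("median:", statistics.median(sat_scores))
print("range :", min(sat_scores), "-", max(sat_scores))
```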
Association Rules
Link analysis, alternatively referred to as affinity analysis or association,
refers to the data mining task of uncovering relationships among data. The
best example of this type of application is to determine association rules.
An association rule is a model that identifies specific types of data
associations. These associations are often used in the retail sales
community to identify items that are frequently purchased together.
Associations are also used in many other applications such as
predicting the failure of telecommunication switches.
EXAMPLE 1.8
A grocery store retailer is trying to decide whether to put bread on sale.
To help determine the impact of this decision, the retailer generates
association rules that show what other products are frequently purchased with bread. He finds that 60% of the times that bread is sold so are
pretzels and that 70% of the time jelly is also sold. Based on these facts,
he tries to capitalize on the association between bread, pretzels, and jelly
by placing some pretzels and jelly at the end of the aisle where the bread
is placed. In addition, he decides not to place either of these items on sale
at the same time.
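The percentages in Example 1.8 correspond to the confidence of the rules bread → pretzels and bread → jelly. A minimal sketch of computing support and confidence from a handful of invented market-basket transactions follows; the transactions and resulting percentages are illustrative only.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "pretzels", "jelly"},
    {"bread", "jelly"},
    {"bread", "pretzels"},
    {"bread", "jelly", "milk"},
    {"milk", "pretzels"},
]

def support(itemset):
    """Fraction of all transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

print("support(bread, pretzels)       =", support({"bread", "pretzels"}))
print("confidence(bread -> pretzels)  =", confidence({"bread"}, {"pretzels"}))
print("confidence(bread -> jelly)     =", confidence({"bread"}, {"jelly"}))
```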
EXAMPLE 1.9
The Webmaster at the XYZ Corp. periodically analyzes the Web log data
to determine how users of the XYZ's Web pages access them. He is
interested in determining what sequences of pages are frequently
accessed. He determines that 70 percent of the users of page A follow
one of the following patterns of behavior: (A, B, C), (A, D, B, C), or (A, E, B, C). He then decides to add a link directly from page A to page C.
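A rough sketch of the Webmaster's analysis in Example 1.9: count how often each recorded navigation path occurs and what share of sessions starting at page A eventually reach page C. The sessions below are invented for illustration.

```python
from collections import Counter

# Hypothetical user sessions, each a sequence of visited pages.
sessions = [
    ("A", "B", "C"),
    ("A", "D", "B", "C"),
    ("A", "E", "B", "C"),
    ("A", "B", "C"),
    ("A", "F"),
]

path_counts = Counter(sessions)                       # frequency of each exact path
starting_at_a = [s for s in sessions if s[0] == "A"]
reaching_c = [s for s in starting_at_a if "C" in s]

print("most common paths:", path_counts.most_common(2))
print("share of A-sessions that reach C:", len(reaching_c) / len(starting_at_a))
```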
The terms knowledge discovery in databases (KDD) and data mining are
often used interchangeably. In fact, there have been many other names
given to this process of discovering useful (hidden) patterns in data:
knowledge extraction, information discovery, exploratory data analysis,
information harvesting, and unsupervised pattern recognition.
Over the last few years KDD has been used to refer to a process
consisting of many steps, while data mining is only one of these steps.
DEFINITION 1.1. Knowledge discovery in databases (KDD) is the
process of finding useful information and patterns in data.
• Selection: The data needed for the data mining process may be
obtained from many different and heterogeneous data sources. This first
step obtains the data from various databases, files, and non-electronic sources.
• Preprocessing: The data to be used by the process may have incorrect
or missing data. There may be anomalous data from multiple sources
involving different data types and metrics. There may be many different
activities performed at this time. Erroneous data may be corrected or
removed, whereas missing data must be supplied or predicted (often using
data mining tools).
• Transformation: Data from different sources must be converted into a
common format for processing. Some data may be encoded or transformed
into more usable formats. Data reduction may be used to reduce the
number of possible data values being considered.
• Data mining: Based on the data mining task being performed, this step
applies algorithms to the transformed data to generate the desired
results.
• Interpretation/evaluation: How the data mining results are presented to the users is extremely important because the usefulness of the results depends on it. Various visualization and GUI strategies are used at this last step.
Transformation techniques are used to make the data easier to mine and more useful, and to provide more meaningful results; the actual distribution of the data may be changed in the process. Some attribute values may be combined to provide new values, thus reducing the complexity of the data. For example, current date and birth date could be replaced by age. One attribute could be substituted for another; an example would be replacing a sequence of actual attribute values with the differences between consecutive values. Real-valued attributes may be more easily handled by partitioning the values into ranges and using these discrete range values. Some data values may actually be removed: outliers, extreme values that occur infrequently, may be discarded. The data may also be transformed by applying a function to the values; a common transformation is to use the log of the value rather than the value itself. These techniques make the mining task easier by reducing the dimensionality (number of attributes) or by reducing the variability of the data values. The removal of outliers can actually improve the quality of the results. As with all steps in the KDD process, however, care must be used in performing transformation. If used incorrectly, the transformation could actually change the data such that the results of the data mining step are inaccurate. (A short sketch of these transformations is given after the visualization discussion below.)
Visualization refers to the visual presentation of data. The old expression "a picture is worth a thousand words" certainly holds true when examining the structure of data. For example, a line graph that shows the distribution of a data variable is easier to understand, and perhaps more informative, than the formula for the corresponding distribution. The use of visualization techniques allows users to summarize, extract, and grasp more complex results than mathematical or text-type descriptions would allow. Visualization techniques include graphical, geometric, icon-based, pixel-based, hierarchical, and hybrid techniques.
Data mining can be viewed from several different perspectives, each of which raises its own issues:
• One issue is how to define a data mining query and whether a query language (like SQL) can be developed to capture the many different types of data mining queries.
• Describing a large database can be viewed as using approximation to help uncover hidden information about the data.
• When dealing with large databases, the impact of size and efficiency of developing an abstract model can be thought of as a type of search problem.
It is interesting to think about the various data mining problems and how each may be viewed from several different perspectives based on the viewpoint and background of the researchers/developers.
1. Human interaction: Since data mining problems are often not precisely
stated, interfaces may be needed with both domain and technical experts.
Technical experts are used to formulate the queries and assist in
interpreting the results. Users are needed to identify training data and
desired results.
2. Overfitting: When a model is generated that is associated with a given database state, it is desirable that the model also fit future database states. Overfitting occurs when the model does not fit future states. This may be caused by assumptions that are made about the data or may simply be caused by the small size of the training database.
3. Outliers: There are often many data entries that do not fit nicely into
the derived model. This becomes even more of an issue with very large
databases. If a model is developed that includes these outliers, then the model may not behave well for data that are not outliers.
4. Interpretation of results: Currently, data mining output may require
experts to correctly interpret the results, which might otherwise be
meaningless to the average database user.
5. Visualization of results: To easily view and understand the output of
data mining algorithms, visualization of the results is helpful.
6. Large datasets: The massive datasets associated with data mining
create problems when applying algorithms designed for small datasets.
The running time of many modeling algorithms grows exponentially with the dataset size, making them too inefficient for larger datasets. Sampling and parallelization are effective tools to attack this scalability problem.
7. High dimensionality: A conventional database schema may be composed
of many different attributes. The problem here is that not all attributes
may be needed to solve a given data mining problem. In fact, the use of
some attributes may interfere with the correct completion of a data
mining task. The use of other attributes may simply increase the overall
complexity and decrease the efficiency of an algorithm. This problem is
sometimes referred to as the dimensionality curse, meaning that there are
many attributes (dimensions) involved and it is difficult to determine
which ones should be used. One solution to this high-dimensionality problem is to reduce the number of attributes, which is known as dimensionality reduction. However, determining which attributes are not needed is not always easy to do (a small sketch of one simple attribute-selection approach appears after this list).
8. Multimedia data: Most previous data mining algorithms are targeted to
traditional data types (numeric, character, text, etc.). The use of
multimedia data such as is found in GIS databases complicates or
invalidates many proposed algorithms.
9. Missing data: During the preprocessing phase of KDD, missing data
may be replaced with estimates. This and other approaches to handling
missing data can lead to invalid results in the data mining step.
10. Irrelevant data: Some attributes in the database might not be of
interest to the data mining task being developed.
11. Noisy data: Some attribute values might be invalid or incorrect.
These values are often corrected before running data mining applications.
12. Changing data: Databases cannot be assumed to be static. However,
most data mining algorithms do assume a static database. This requires
that the algorithm be completely rerun anytime the database changes.
13. Integration: The KDD process is not currently integrated into normal
data processing activities. KDD requests may be treated as special,
unusual, or one-time needs. This makes them inefficient, ineffective, and
not general enough to be used on an ongoing basis. Integration of data mining functions into traditional DBMSs is certainly a desirable goal.
14. Application: Determining the intended use for the information
obtained from the data mining function is a challenge. Indeed, how
business executives can effectively use the output is sometimes
considered the more difficult part, not the running of the algorithms
themselves. Because the data are of a type that has not previously
been known, business practices may have to be modified to determine how
to effectively use the information uncovered.
These issues should be addressed by data mining algorithms and products.
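As mentioned under issue 7, one very simple (and admittedly crude) way to reduce dimensionality is to drop attributes whose values barely vary and therefore carry little information. The sketch below does exactly that for an invented table of records; real dimensionality-reduction methods (attribute selection, principal component analysis, etc.) are considerably more sophisticated.

```python
import statistics

# Hypothetical records described by four attributes.
data = {
    "age":        [23, 45, 31, 52, 38],
    "income":     [28.0, 72.5, 41.0, 88.0, 55.5],   # thousands of dollars
    "country":    [1, 1, 1, 1, 1],                  # encoded; identical for everyone
    "num_visits": [3, 3, 4, 3, 3],
}

def low_variance_attributes(table, threshold=0.5):
    """Return attribute names whose variance falls below the threshold."""
    return [name for name, values in table.items()
            if statistics.pvariance(values) < threshold]

dropped = low_variance_attributes(data)
reduced = {name: values for name, values in data.items() if name not in dropped}
print("dropped attributes:  ", dropped)
print("remaining attributes:", list(reduced))
```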
ALGORITHM
Informal Definition
An informal definition could be "a set of rules that precisely defines
a sequence of operations" which would include all computer programs
(including programs that do not perform numeric calculations), and (for
example) any prescribed bureaucratic procedure or cook-book recipe.
In general, a program is only an algorithm if it stops eventually even
though infinite loops may sometimes prove desirable. Algorithms are
essential to the way computers process data. Many computer programs
contain algorithms that detail the specific instructions a computer should
perform—in a specific order—to carry out a specified task, such as
printing students' report cards.
In computer systems, an algorithm is basically an instance
of logic written in software by software developers, to be effective for
the intended "target" computer(s) to produce output from given (perhaps
null) input. An optimal algorithm, even running on old hardware, would produce faster results than a non-optimal (higher time complexity) algorithm for the same purpose running on more efficient hardware; that is why algorithms, like computer hardware, are considered technology.
Algorithm example
One of the simplest algorithms is to find the largest number in a list of
numbers of random order. Finding the solution requires looking at every
number in the list. From this follows a simple algorithm, which can be
stated in a high-level description in English as follows.
High-level description:
1. If there are no numbers in the list, then there is no highest number.
2. Assume the first number in the list is the largest.
3. For each remaining number in the list: if this number is larger than the current largest number, consider this number to be the largest.
4. When there are no numbers left to examine, the current largest number is the largest number in the list.
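The same algorithm written as a short Python function (Python's built-in max does the same thing):

```python
def find_largest(numbers):
    """Return the largest number in a list, scanning every element once."""
    if not numbers:                     # step 1: an empty list has no largest number
        return None
    largest = numbers[0]                # step 2: assume the first number is the largest
    for value in numbers[1:]:           # step 3: compare each remaining number
        if value > largest:
            largest = value
    return largest                      # step 4: the running maximum is the answer

print(find_largest([7, 3, 19, 4, 12]))   # prints 19
```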
Data Structure
In computer science, a data structure is a data organization, management, and storage format that enables efficient access and modification. More precisely, a data structure is a collection of data values, the relationships among them, and the functions or operations that can be applied to the data; i.e., it is an algebraic structure about data.
Data structures serve as the basis for abstract data types (ADT).
The ADT defines the logical form of the data type. The data structure
implements the physical form of the data type.
Different types of data structures are suited to different kinds of
applications, and some are highly specialized to specific tasks. For
example, relational databases commonly use B-tree indexes for data
retrieval, while compiler implementations usually use hash tables to look up
identifiers. Data structures provide a means to manage large amounts of
data efficiently for uses such as large databases and internet indexing
services. Usually, efficient data structures are key to designing
efficient algorithms. Some formal design methods and programming
languages emphasize data structures, rather than algorithms, as the key
organizing factor in software design. Data structures can be used to
organize the storage and retrieval of information stored in both main
memory and secondary memory.
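To make the ADT versus data-structure distinction above concrete, here is a small sketch: the Stack class exposes only the logical operations of the abstract data type (push, pop, peek), while the physical form happens to be a Python list underneath; it could just as well be a linked list without changing the interface. The class and its methods are illustrative, not taken from the text.

```python
class Stack:
    """Abstract data type: last-in, first-out push/pop/peek operations."""

    def __init__(self):
        self._items = []            # physical form: a dynamic array (Python list)

    def push(self, item):
        self._items.append(item)

    def pop(self):
        return self._items.pop()

    def peek(self):
        return self._items[-1]

    def is_empty(self):
        return not self._items

s = Stack()
s.push(10)
s.push(20)
print(s.peek())      # 20
print(s.pop())       # 20
print(s.is_empty())  # False
```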
Implementation
• A byte is the smallest amount of data that a computer's CPU can copy from memory to a register or back in a single CPU instruction; a byte stream is therefore the most efficient way to run big data through a computer, which is the basis of stream processing.
• An array is a number of elements in a specific order, typically all of the same type. Elements are accessed using an integer index to specify which element is required. Typical implementations allocate contiguous memory words for the elements of arrays. Arrays may be fixed-length or resizable.
• A linked list (also just list) is a linear collection of data elements of any
type, called nodes, where each node has itself a value, and points to the
next node in the linked list. The principal advantage of a linked list over
an array is that values can always be efficiently inserted and removed
without relocating the rest of the list.
• A record (also called tuple or struct) is an aggregate data structure. A
record is a value that contains other values, typically in fixed number
and sequence and typically indexed by names. The elements of records
are usually called fields or members.
• A union is a data structure that specifies which of a number of
permitted primitive types may be stored in its instances,
e.g. float or long integer. Contrast with a record, which could be
defined to contain a float and an integer; whereas in a union, there is
only one value at a time. Enough space is allocated to contain the widest
member data-type.
• A tagged union (also called variant, variant record, discriminated union,
or disjoint union) contains an additional field indicating its current type,
for enhanced type safety.
• An object is a data structure that contains data fields, like a record
does, as well as various methods which operate on the data contents. An
object is an in-memory instance of a class from a taxonomy. In the context of object-oriented programming, records are known as plain old data structures to distinguish them from objects.
In addition, hashes, graphs and binary trees are other commonly used data
structures.
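As a small illustration of the structures listed above, the sketch below builds a singly linked list of record-like nodes; both the node layout and the sample values are chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:                         # a record (struct): a fixed set of named fields
    value: int
    next: Optional["Node"] = None   # reference to the next node in the linked list

def prepend(head, value):
    """Insert a new value at the front of the list without moving existing nodes."""
    return Node(value, head)

def to_python_list(head):
    items = []
    while head is not None:         # walk the chain of next references
        items.append(head.value)
        head = head.next
    return items

head = None
for v in (3, 2, 1):
    head = prepend(head, v)
print(to_python_list(head))         # [1, 2, 3]
```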
Language support