
DATA PREPROCESSING

Data Mining
Unit - II
AGGREGATION

• “less is more”
• Aggregation - combining of two or more objects into a single object.
• Example (store transactions):
• One way to aggregate transactions for this data set is to replace all the transactions of a single store with a
single storewide transaction.
• This reduces the number of records (one record per store).
• How an aggregate transaction is created:
• Quantitative attributes, such as price, are typically aggregated by taking a sum or an average.
• A qualitative attribute, such as item, can either be omitted or summarized as the set of all the items that
were sold at that location.
• Aggregated data can also be viewed as a multidimensional array, where each attribute is a dimension.
• This view is used in OLAP.
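A minimal sketch of this kind of aggregation using pandas; the store, item, and price columns and their values are hypothetical, not from the text:

```python
import pandas as pd

# Hypothetical store-transaction data: one row per transaction.
transactions = pd.DataFrame({
    "store": ["S1", "S1", "S2", "S2", "S2"],
    "item":  ["bread", "milk", "bread", "soda", "milk"],
    "price": [2.5, 1.8, 2.4, 1.2, 1.9],
})

# One storewide transaction per store: sum the quantitative attribute (price)
# and summarize the qualitative attribute (item) as the set of items sold there.
storewide = transactions.groupby("store").agg(
    total_price=("price", "sum"),
    items=("item", lambda s: set(s)),
)
print(storewide)
```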
AGGREGATION
• Motivations for aggregation
• Smaller data sets require less memory and processing time which
allows the use of more expensive data mining algorithms.
• Availability of change of scope or scale
• by providing a high-level view of the data instead of a low-level view.
• Behavior of groups of objects or attributes is often more stable than
that of individual objects or attributes.
• Disadvantage of aggregation
• potential loss of interesting details.
AGGREGATION

Example: the average yearly precipitation has less variability than the average monthly precipitation.
SAMPLING
• Approach for selecting a subset of the data objects to be analyzed.
• Data miners sample because it is too expensive or time consuming to
process all the data.
• The key principle for effective sampling is the following:
• Using a sample will work almost as well as using the entire data set if the sample
is representative.
• A sample is representative if it has approximately the same property (of interest) as the
original set of data.
• Choose a sampling scheme/technique that gives a high probability of obtaining a representative sample.
SAMPLING
• Sampling Approaches: (a) Simple random (b) Stratified (c) Adaptive
• Simple random sampling
• equal probability of selecting any particular item.
• Two variations on random sampling:
• (1) sampling without replacement—as each item is selected, it is removed from the set of all objects that
together constitute the population, and
• (2) sampling with replacement—objects are not removed from the population as they are selected for the
sample.
• Problem: When the population consists of different types of objects, with widely different numbers of
objects, simple random sampling can fail to adequately represent those types of objects that are less
frequent.
• Stratified sampling:
• starts with prespecified groups of objects
• Simplest version: equal numbers of objects are drawn from each group, even though the groups are of different sizes.
• Other version: the number of objects drawn from each group is proportional to the size of that group.
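A minimal sketch contrasting simple random sampling with proportional stratified sampling in plain Python; the population, group labels, and sample size are hypothetical:

```python
import random
from collections import defaultdict

random.seed(0)

# Hypothetical population: (object_id, group) pairs with very unequal group sizes.
population = [(i, "rare" if i < 10 else "common") for i in range(1000)]

# Simple random sampling without replacement: rare objects may be missed entirely.
simple_sample = random.sample(population, 50)

# Stratified sampling (proportional version): draw from each group in proportion
# to its size, so less frequent groups are still represented.
by_group = defaultdict(list)
for obj in population:
    by_group[obj[1]].append(obj)

stratified_sample = []
for group, members in by_group.items():
    k = max(1, round(50 * len(members) / len(population)))  # at least one per group
    stratified_sample.extend(random.sample(members, k))
```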
SAMPLING

Sampling and Loss of Information


• Larger sample sizes increase the probability that a sample will be representative, but they also eliminate
much of the advantage of sampling.
• Conversely, with smaller sample sizes, patterns may be missed or erroneous patterns can be detected.
SAMPLING

Determining the Proper Sample Size


• Desired outcome: at least one point is obtained from each cluster (group).
• Example: for data with 10 groups, the probability of getting at least one object from each of the 10 groups
increases as the sample size runs from 10 to 60.
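A small simulation sketch of this effect, assuming 10 equal-sized groups and that each sampled object falls into a group uniformly at random (an idealization, not the text's exact setup):

```python
import random

random.seed(0)

def prob_all_groups_hit(sample_size, n_groups=10, trials=10000):
    """Estimate, by simulation, the probability that a random sample
    contains at least one object from each of n_groups equal-sized groups."""
    hits = 0
    for _ in range(trials):
        groups_seen = {random.randrange(n_groups) for _ in range(sample_size)}
        if len(groups_seen) == n_groups:
            hits += 1
    return hits / trials

for size in (10, 20, 30, 40, 50, 60):
    print(size, prob_all_groups_hit(size))
```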
SAMPLING

• Adaptive/Progressive Sampling:
• The proper sample size can be difficult to determine.
• Start with a small sample, and then increase the sample size until a
sample of sufficient size has been obtained.
• This eliminates the need to determine the correct sample size initially.
• Stop increasing the sample size at the leveling-off point (where no
further improvement in the outcome is observed).
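A schematic sketch of the progressive-sampling loop; `evaluate`, the starting size, the growth factor, and the tolerance are hypothetical placeholders for whatever quality measure and schedule are actually used:

```python
import random

random.seed(0)

def progressive_sample(data, evaluate, start=100, growth=2, tolerance=0.01):
    """Grow the sample until the evaluation score stops improving by more
    than `tolerance` (the leveling-off point). `evaluate` stands in for any
    model-quality measure computed on a sample."""
    size = start
    best = evaluate(random.sample(data, min(size, len(data))))
    while size < len(data):
        size = min(size * growth, len(data))
        score = evaluate(random.sample(data, size))
        if score - best <= tolerance:   # leveling off: no meaningful improvement
            break
        best = score
    return size
```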
DIMENSIONALITY REDUCTION

• Data sets can have a large number of features.


• Example
• a set of documents, where each document is represented by a vector
whose components are the frequencies with which each word occurs in
the document.
• thousands or tens of thousands of attributes (components), one for each
word in the vocabulary.
DIMENSIONALITY REDUCTION

• Benefits of dimensionality reduction:

• Data mining algorithms work better if the dimensionality is lower.
• It eliminates irrelevant features and reduces noise.
• Lead to a more understandable model
• fewer attributes
• Allow the data to be more easily visualized.
• Amount of time and memory required by the data mining algorithm is reduced with a reduction in
dimensionality.
• Reduce the dimensionality of a data set by creating new attributes that are a combination of the old
attributes.
• Feature subset selection or feature selection:
• The reduction of dimensionality by selecting new attributes that are a subset of the old.
DIMENSIONALITY REDUCTION

• The Curse of Dimensionality


• Data analysis becomes significantly harder as the dimensionality of the data
increases.
• The data becomes increasingly sparse in the space it occupies.
• Classification
• there are not enough data objects to allow the creation of a model that reliably assigns a class to all possible objects.
• Clustering
• the definitions of density and of the distance between points become less meaningful.
DIMENSIONALITY REDUCTION

• Linear Algebra Techniques for Dimensionality Reduction


• Principal Components Analysis (PCA)
• for continuous attributes
• finds new attributes (principal components) that
• (1) are linear combinations of the original attributes,
• (2) are orthogonal (perpendicular) to each other, and
• (3) capture the maximum amount of variation in the data.
• Singular Value Decomposition (SVD)
• Related to PCA
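A minimal sketch of PCA computed via the SVD of the mean-centered data matrix in NumPy; the random data and the choice of two retained components are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 100 objects with 5 continuous attributes.
X = rng.normal(size=(100, 5))

# PCA via SVD of the mean-centered data matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Rows of Vt are the principal components: orthogonal linear combinations
# of the original attributes, ordered by the variation they capture.
explained_variance = (S ** 2) / (len(X) - 1)

# Project onto the first 2 components to reduce 5 dimensions to 2.
X_reduced = Xc @ Vt[:2].T
```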
FEATURE SUBSET SELECTION
• Another way to reduce the dimensionality - use only a subset of the features.
• Redundant Features
• Example:
• Purchase price of a product and the amount of sales tax paid
• Redundant to each other
• contain much of the same information.

• Irrelevant features contain almost no useful information for the data mining task at hand.
• Example: Students’ ID numbers are irrelevant to the task of predicting students’ grade point averages.

• Redundant and irrelevant features


• reduce classification accuracy and the quality of the clusters that are found.
• Many can be eliminated immediately by using common sense or domain knowledge; otherwise a systematic
approach for selecting the best subset of features is needed.
• The ideal approach is to try all possible subsets of features as input to the data mining algorithm of interest, and
then take the subset that produces the best results; since n attributes have 2^n subsets, this is usually impractical,
which motivates the standard approaches described next.
FEATURE SUBSET SELECTION

• 3 standard approaches to feature selection:
• Embedded
• Filter
• Wrapper
FEATURE SUBSET SELECTION
• Embedded approaches:
• Feature selection occurs naturally as part of the data mining algorithm.
• During execution of algorithm, the Algorithm itself decides which attributes to use
and which to ignore.
• Example:- Algorithms for building decision tree classifiers

• Filter approaches:
• Features are selected before the data mining algorithm is run
• Approach that is independent of the data mining task.

• Wrapper approaches:
• Uses the target data mining algorithm as a black box to find the best subset of
attributes
• typically without enumerating all possible subsets.
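A schematic sketch of a wrapper approach using greedy forward selection; `evaluate` stands in for running the target data mining algorithm on a candidate subset (e.g., returning cross-validated accuracy) and is a hypothetical placeholder:

```python
def wrapper_forward_selection(features, evaluate, max_features=None):
    """Greedy forward selection: repeatedly add the single feature whose
    addition gives the best score from `evaluate`, stopping when no feature
    improves the score. Only a small fraction of all 2^n subsets is examined."""
    selected, best_score = [], float("-inf")
    remaining = list(features)
    while remaining and (max_features is None or len(selected) < max_features):
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        score, best_f = max(scored)
        if score <= best_score:        # stopping criterion: no improvement
            break
        best_score = score
        selected.append(best_f)
        remaining.remove(best_f)
    return selected
```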
FEATURE SUBSET SELECTION
• An Architecture for Feature Subset Selection :
• The feature selection process is viewed as consisting of four parts:
1. a measure for evaluating a subset,
2. a search strategy that controls the generation of a new subset of features,
3. a stopping criterion, and
4. a validation procedure.

• Filter methods and wrapper methods differ only in the way in which they
evaluate a subset of features.
• wrapper method – uses the target data mining algorithm
• filter approach - evaluation technique is distinct from the target data mining
algorithm.
FEATURE SUBSET SELECTION
• Feature subset selection is a search over all possible subsets of features.
• Evaluation step - determine the goodness of a subset of attributes with respect to a particular data mining task
• Filter approach: predict how well the actual data mining algorithm will perform on a given set of attributes.
• Wrapper approach: run the target data mining algorithm on the candidate subset and measure the resulting performance.

• Stopping criterion
• conditions involving the following:
• the number of iterations,
• whether the value of the subset evaluation measure is optimal or exceeds a certain threshold,
• whether a subset of a certain size has been obtained,
• whether simultaneous size and evaluation criteria have been achieved, and
• whether any improvement can be achieved by the options available to the search strategy.

• Validation:
• Finally, the results of the target data mining algorithm on the selected subset should be validated.
• An evaluation approach: run the algorithm with the full set of features and compare the full results to results
obtained using the subset of features.
FEATURE SUBSET SELECTION
• Feature Weighting
• An alternative to keeping or eliminating features.
• One Approach
• Higher weight - More important features
• Lower weight - less important features
• Another approach: automatic
• Example: some classification schemes, such as support vector machines, produce models in which more important features receive larger weights.
• Other approach
• The normalization of objects that takes place when the cosine similarity is used can also be regarded as a kind of feature weighting.
FEATURE CREATION
• Create a new set of attributes from the original attributes that captures the important
information in the data set
• much more effectively.
• The number of new attributes is usually smaller than the number of original attributes.
• Three related methodologies for creating new attributes:
1. Feature extraction
2. Mapping the data to a new space
3. Feature construction
FEATURE CREATION
• Feature Extraction
• The creation of a new set of features from the original raw data
• Example: Classify set of photographs based on existence of human face
(present or not)
• Raw data (set of pixels) - not suitable for many types of classification algorithms.
• If the raw data is converted to higher-level features (such as the presence or absence of certain types of edges and
areas that are highly correlated with the presence of human faces), then a much broader set of classification
techniques can be applied to this problem.

• Feature extraction is highly domain-specific


• New area means development of new features and feature extraction
methods.
FEATURE CREATION

Mapping the Data to a New Space


• A totally different view of the data can reveal important and interesting features.
• If there is only a single periodic pattern and not much noise, then the pattern is easily detected.
• If, there are a number of periodic patterns and a significant amount of noise is present, then these
patterns are hard to detect.
• Such patterns can be detected by applying a Fourier transform to the time series in order to
change to a representation in which frequency information is explicit.
• Example:
• Power spectrum that can be computed after applying a Fourier transform to the original time series.
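A minimal sketch of this idea using NumPy's FFT; the two frequencies, the noise level, and the series length are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical time series: two periodic patterns buried in noise.
t = np.linspace(0, 1, 1024, endpoint=False)
series = np.sin(2 * np.pi * 7 * t) + 0.5 * np.sin(2 * np.pi * 17 * t)
series += rng.normal(scale=1.0, size=t.size)

# Fourier transform: move to a representation where frequency is explicit.
spectrum = np.fft.rfft(series)
freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])
power = np.abs(spectrum) ** 2

# Peaks of the power spectrum reveal the hidden periodic patterns (7 and 17).
top = freqs[np.argsort(power)[-2:]]
print(sorted(top))
```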
FEATURE CREATION

• Feature Construction
• Sometimes the features in the original data set contain the necessary information, but in a form not suitable for the
data mining algorithm.
• In that case, new features constructed out of the original features can be more useful than the original features.

• Example (Density).
• Dataset contains the volume and mass of historical artifact.
• Density feature constructed from the mass and volume features, i.e., density = mass/volume, would most
directly yield an accurate classification.
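A tiny sketch of this construction; the mass and volume numbers are invented:

```python
# Hypothetical artifact records: (mass in grams, volume in cm^3).
artifacts = [(120.0, 15.0), (980.0, 50.0), (45.0, 22.5)]

# Constructed feature: density = mass / volume, which separates the
# underlying materials far more directly than mass or volume alone.
densities = [mass / volume for mass, volume in artifacts]
print(densities)
```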
DISCRETIZATION AND BINARIZATION

• Some classification algorithms require that the data be in the form of
categorical attributes.
• Algorithms that find association patterns require that the data be in
the form of binary attributes.
• Discretization - transforming a continuous attribute into a
categorical attribute
• Binarization - transforming both continuous and discrete attributes
into one or more binary attributes
DISCRETIZATION AND BINARIZATION
• Binarization of a categorical attribute (Simple technique):
• If there are m categorical values, then uniquely assign
each original value to an integer in the interval [0, m − 1].
• If the attribute is ordinal, then order must be maintained
by the assignment.
• Next, convert each of these m integers to a binary number
using n binary attributes.
• n = ⌈log2(m)⌉ binary digits are required to represent these
integers.
DISCRETIZATION AND BINARIZATION
Example: a categorical variable with 5 values
{awful, poor, OK, good, great}
require three binary variables x1, x2, and x3.
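A small sketch of this encoding, applied to the slide's 5-value attribute; the function and variable names are just illustrative:

```python
import math

def binarize_categorical(values, ordered_categories):
    """Map m categorical values to integers 0..m-1 (preserving order for
    ordinal attributes), then encode each integer with ceil(log2(m)) bits."""
    m = len(ordered_categories)
    n_bits = max(1, math.ceil(math.log2(m)))
    to_int = {cat: i for i, cat in enumerate(ordered_categories)}
    encoded = []
    for v in values:
        i = to_int[v]
        bits = [(i >> b) & 1 for b in reversed(range(n_bits))]  # x1..xn, MSB first
        encoded.append(bits)
    return encoded

# The 5-value example from the slide: 3 binary attributes are enough.
print(binarize_categorical(["awful", "great", "OK"],
                           ["awful", "poor", "OK", "good", "great"]))
```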
DISCRETIZATION AND BINARIZATION
• Discretization of Continuous Attributes ( classification or
association analysis)
• Transformation of a continuous attribute to a categorical attribute
involves two subtasks:
• decide no. of categories
• decide how to map the values of the continuous attribute to these
categories.
• Step I: sort the attribute values and divide them into n intervals by specifying n−1
split points.
• Step II: map all the values in one interval to the same categorical
value.
DISCRETIZATION AND BINARIZATION
• Discretization of Continuous Attributes
• Problem of discretization is
• Deciding how many split points to choose and
• where to place them.
• The result can be represented either as
• a set of intervals {(x0, x1], (x1, x2], ..., (xn−1, xn)},
where x0 and xn may be −∞ or +∞, respectively,
or
• as a series of inequalities x0 < x ≤ x1, ..., xn−1 < x < xn.
DISCRETIZATION AND BINARIZATION
• Unsupervised Discretization
• Discretization methods for classification:
• Supervised - class information is used
• Unsupervised - class information is not used
• Equal width approach:
• divides the range of the attribute into a user-specified number of
intervals, each having the same width.
• can be badly affected by outliers
• Equal frequency (equal depth) approach:
• Puts same number of objects into each interval
• K-means Clustering method
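A minimal sketch of the two simplest unsupervised schemes with NumPy; the generated data and the choice of 4 intervals are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(size=200)          # hypothetical continuous attribute
n_bins = 4

# Equal width: split the range into n intervals of identical width
# (a single extreme outlier can squeeze most values into one bin).
equal_width_edges = np.linspace(values.min(), values.max(), n_bins + 1)
equal_width_labels = np.digitize(values, equal_width_edges[1:-1])

# Equal frequency (equal depth): choose split points so that each
# interval receives (approximately) the same number of objects.
equal_freq_edges = np.quantile(values, np.linspace(0, 1, n_bins + 1))
equal_freq_labels = np.digitize(values, equal_freq_edges[1:-1])

print(np.bincount(equal_width_labels), np.bincount(equal_freq_labels))
```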
DISCRETIZATION AND BINARIZATION
Unsupervised Discretization

Original Data
DISCRETIZATION AND BINARIZATION
Unsupervised Discretization

Equal Width Discretization


DISCRETIZATION AND BINARIZATION
Unsupervised Discretization

Equal Frequency Discretization


DISCRETIZATION AND BINARIZATION
Unsupervised Discretization

K-means Clustering (better result)


DISCRETIZATION AND BINARIZATION
• Supervised Discretization
• When additional information (class labels) is used, discretization
often produces better results.
• Some Concerns: purity of an interval and the minimum size of
an interval.
• statistically based approaches:
• start with each attribute value as a separate interval and
create larger intervals by merging adjacent intervals that are
similar according to a statistical test.
• Entropy based approaches:
DISCRETIZATION AND BINARIZATION
• Supervised Discretization
• Entropy based approaches:
• Entropy of the i-th interval:

e_i = − Σ_{j=1}^{k} p_ij · log2(p_ij)

• p_ij = m_ij / m_i is the probability (fraction) of class j in the i-th interval,
• k - no. of different class labels,
• m_i - no. of values in the i-th interval of a partition, and
• m_ij - no. of values of class j in interval i.
DISCRETIZATION AND BINARIZATION
• Supervised Discretization
• Entropy
DISCRETIZATION AND BINARIZATION
• Supervised Discretization
• Entropy based approaches:
• The total entropy, e, of the partition is the weighted average of the
individual interval entropies:

e = Σ_{i=1}^{n} w_i · e_i

• m - total no. of values,
• w_i = m_i / m is the fraction of values in the i-th interval, and
• n - no. of intervals.
• Perfectly pure interval: entropy is 0
• the interval contains only values of one class.
• Most impure interval: entropy is maximum
• the classes of values in the interval occur equally often.
DISCRETIZATION AND BINARIZATION
• Supervised Discretization
• Entropy based approaches:
• Simple approach for partitioning a continuous attribute:
• starts by bisecting the initial values so that the resulting
two intervals give minimum entropy.
• consider each value as a possible split point
• The splitting process is then repeated with another interval,
• typically choosing the interval with the worst (highest) entropy,
• until a user-specified number of intervals is reached,
or
• a stopping criterion is satisfied.
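A small sketch of one bisection step of such an entropy-based procedure; the toy values and class labels are invented:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels: -sum(p * log2 p)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_split(values, labels):
    """Try each position as a split point and return the split that minimizes
    the weighted (total) entropy of the two resulting intervals."""
    pairs = sorted(zip(values, labels))
    best = (float("inf"), None)
    for i in range(1, len(pairs)):
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        w = len(left) / len(pairs)
        e = w * entropy(left) + (1 - w) * entropy(right)
        split_point = (pairs[i - 1][0] + pairs[i][0]) / 2
        best = min(best, (e, split_point))
    return best  # (total entropy, split point)

# Hypothetical attribute values with two classes.
print(best_split([1.0, 1.2, 1.4, 5.0, 5.2, 5.5], ["a", "a", "a", "b", "b", "b"]))
```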
DISCRETIZATION AND BINARIZATION
• Supervised Discretization
• Entropy based approaches:
• 3 categories for both x and y
DISCRETIZATION AND BINARIZATION
• Supervised Discretization
• Entropy based approaches:
• 5 categories for both x and y
• Observation:
• no improvement for 6 categories
DISCRETIZATION AND BINARIZATION
• Categorical Attributes with Too Many Values
• If the categorical attribute is ordinal,
• techniques similar to those for continuous attributes can be used.
• If the categorical attribute is nominal, other approaches are needed.
• Example:-
• University that has a large number of departments.
• department name attribute - dozens of diff. values.
• combine departments into larger groups, such as
• engineering,
• social sciences, or
• biological sciences.
Variable Transformation

• Transformation that is applied to all the values of a variable.


• Example: if only the magnitude of a variable is important,
• then the values of the variable can be transformed by taking the absolute
value.
• Simple Function Transformation:
• A simple mathematical function is applied to each value individually.
• If x is a variable, then examples of such transformations include
x^k, log x, e^x, √x, 1/x, sin x, and |x|.
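A tiny sketch applying a few of these element-wise transformations with NumPy; the sample values are made up:

```python
import numpy as np

x = np.array([0.25, 0.5, 1.0, 2.0, 4.0])   # hypothetical variable values

# A few simple function transformations applied to each value individually.
transforms = {
    "x^2":    x ** 2,
    "log x":  np.log(x),
    "sqrt x": np.sqrt(x),
    "1/x":    1.0 / x,
    "|x|":    np.abs(x),
}
for name, values in transforms.items():
    print(name, values)
```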
Variable Transformation

• Variable transformations should be applied with caution since they


change the nature of the data.
• Example: the transformation function is 1/x
• for values greater than or equal to 1, it reduces the magnitude of values
• the values {1, 2, 3} go to {1, 1/2, 1/3}
• for values between 0 and 1, it increases the magnitude of values
• the values {1, 1/2, 1/3} go to {1, 2, 3}.
• So it is better to ask questions such as the following:
• Does the order need to be maintained?
• Does the transformation apply to all values (including negative values and 0)?
• What is the effect of the transformation on the values between 0 and 1?
Variable Transformation

• Normalization or Standardization
• Goal of standardization or normalization
• To make an entire set of values have a particular property.
• A traditional example is that of “standardizing a variable” in statistics.
• If x̄ is the mean (average) of the attribute values and
• s_x is their standard deviation, then the transformation

x' = (x − x̄) / s_x

• creates a new variable that has a mean of 0 and a standard deviation
of 1.
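A minimal sketch of this standardization with NumPy; the attribute values are made up:

```python
import numpy as np

x = np.array([15.0, 22.0, 30.0, 45.0, 38.0])   # hypothetical attribute values

# Standard z-score standardization: subtract the mean, divide by the std. dev.
x_std = (x - x.mean()) / x.std()

print(x_std.mean(), x_std.std())   # approximately 0 and 1
```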
Variable Transformation

• Normalization or Standardization
• If different variables are to be combined, a transformation is necessary
to avoid having a variable with large values dominate the results of the
calculation.
• Example:
• comparing people based on two variables: age and income.
• For any two people, the difference in income will likely be much
higher in absolute terms (hundreds or thousands of dollars) than the
difference in age (less than 150).
• The income values (which are much larger) will dominate the calculation.
Variable Transformation

• Normalization or Standardization
• The mean and standard deviation are strongly affected by outliers, so they are often replaced:
• the mean is replaced by the median, i.e., the middle value.
• For a variable x, the absolute standard deviation is

σ_AAD = Σ_{i=1}^{m} |x_i − µ|

• x_i - i-th value of the variable,
• m - number of objects, and
• µ - either the mean or the median.
• Other approaches
• computing estimates of the location (center) and
• spread of a set of values in the presence of outliers
• These measures can also be used to define a standardization transformation.
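A small sketch of a robust variant along these lines, using the median and the absolute standard deviation defined above; the values (including the outlier) are invented:

```python
import numpy as np

x = np.array([15.0, 22.0, 30.0, 45.0, 38.0, 900.0])   # hypothetical values with an outlier

# Robust standardization: center on the median and scale by the
# absolute standard deviation, sum(|x_i - mu|) with mu = median.
mu = np.median(x)
abs_std = np.sum(np.abs(x - mu))
x_robust = (x - mu) / abs_std

print(x_robust)
```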
