
DATA PREPROCESSING

Data Mining
Unit - II
AGGREGATION

• “less is more”
• Aggregation - combining of two or more objects into a single object.
• Example (store transactions):
• One way to aggregate transactions for this data set is to replace all the transactions of a single store with a
single storewide transaction.
• This reduces the number of records (one record per store).
• How an aggregate transaction is created:
• Quantitative attributes, such as price, are typically aggregated by taking a sum or an average.
• A qualitative attribute, such as item, can either be omitted or summarized as the set of all the items that
were sold at that location.
• Aggregated data can also be viewed as a multidimensional array, where each attribute is a dimension.
• This view is used in OLAP.
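A minimal sketch of this kind of aggregation using pandas; the store, item, and price columns and their values are hypothetical, not from the text:

```python
import pandas as pd

# Hypothetical store-transaction data: one row per transaction.
transactions = pd.DataFrame({
    "store": ["S1", "S1", "S2", "S2", "S2"],
    "item":  ["bread", "milk", "bread", "soda", "milk"],
    "price": [2.5, 1.8, 2.4, 1.2, 1.9],
})

# One storewide transaction per store: sum the quantitative attribute (price)
# and summarize the qualitative attribute (item) as the set of items sold there.
storewide = transactions.groupby("store").agg(
    total_price=("price", "sum"),
    items=("item", lambda s: set(s)),
)
print(storewide)
```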
AGGREGATION
• Motivations for aggregation
• Smaller data sets require less memory and processing time which
allows the use of more expensive data mining algorithms.
• Availability of change of scope or scale
• by providing a high-level view of the data instead of a low-level view.
• Behavior of groups of objects or attributes is often more stable than
that of individual objects or attributes.
• Disadvantage of aggregation
• potential loss of interesting details.
AGGREGATION

Example: the average yearly precipitation has less variability than the average monthly precipitation.
SAMPLING
• Approach for selecting a subset of the data objects to be analyzed.
• Data miners sample because it is too expensive or time consuming to
process all the data.
• The key principle for effective sampling is the following:
• Using a sample will work almost as well as using the entire data set if the sample
is representative.
• A sample is representative if it has approximately the same property (of interest) as the
original set of data.
• Choose a sampling scheme/technique that gives a high probability of obtaining a representative sample.
SAMPLING
• Sampling Approaches: (a) Simple random (b) Stratified (c) Adaptive
• Simple random sampling
• equal probability of selecting any particular item.
• Two variations on random sampling:
• (1) sampling without replacement—as each item is selected, it is removed from the set of all objects that
together constitute the population, and
• (2) sampling with replacement—objects are not removed from the population as they are selected for the
sample.
• Problem: When the population consists of different types of objects, with widely different numbers of
objects, simple random sampling can fail to adequately represent those types of objects that are less
frequent.
• Stratified sampling:
• starts with prespecified groups of objects
• Simplest version: equal numbers of objects are drawn from each group, even though the groups are of different sizes.
• Other version: the number of objects drawn from each group is proportional to the size of that group.
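A minimal sketch contrasting simple random sampling with proportional stratified sampling in plain Python; the population, group labels, and sample size are hypothetical:

```python
import random
from collections import defaultdict

random.seed(0)

# Hypothetical population: (object_id, group) pairs with very unequal group sizes.
population = [(i, "rare" if i < 10 else "common") for i in range(1000)]

# Simple random sampling without replacement: rare objects may be missed entirely.
simple_sample = random.sample(population, 50)

# Stratified sampling (proportional version): draw from each group in proportion
# to its size, so less frequent groups are still represented.
by_group = defaultdict(list)
for obj in population:
    by_group[obj[1]].append(obj)

stratified_sample = []
for group, members in by_group.items():
    k = max(1, round(50 * len(members) / len(population)))  # at least one per group
    stratified_sample.extend(random.sample(members, k))
```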
SAMPLING

Sampling and Loss of Information


• Larger sample sizes increase the probability that a sample will be representative, but they also eliminate
much of the advantage of sampling.
• Conversely, with smaller sample sizes, patterns may be missed or erroneous patterns can be detected.
SAMPLING

Determining the Proper Sample Size


• Desired outcome: at least one point is obtained from each cluster (group).
• Example: for data with 10 groups, the probability of getting at least one object from each of the 10 groups
increases as the sample size runs from 10 to 60.
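A small simulation sketch of this effect, assuming 10 equal-sized groups and that each sampled object falls into a group uniformly at random (an idealization, not the text's exact setup):

```python
import random

random.seed(0)

def prob_all_groups_hit(sample_size, n_groups=10, trials=10000):
    """Estimate, by simulation, the probability that a random sample
    contains at least one object from each of n_groups equal-sized groups."""
    hits = 0
    for _ in range(trials):
        groups_seen = {random.randrange(n_groups) for _ in range(sample_size)}
        if len(groups_seen) == n_groups:
            hits += 1
    return hits / trials

for size in (10, 20, 30, 40, 50, 60):
    print(size, prob_all_groups_hit(size))
```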
SAMPLING

• Adaptive/Progressive Sampling:
• The proper sample size can be difficult to determine.
• Start with a small sample, and then increase the sample size until a
sample of sufficient size has been obtained.
• This eliminates the need to determine the correct sample size initially.
• Stop increasing the sample size at the leveling-off point (where no
further improvement in the outcome is observed).
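A schematic sketch of the progressive-sampling loop; `evaluate`, the starting size, the growth factor, and the tolerance are hypothetical placeholders for whatever quality measure and schedule are actually used:

```python
import random

random.seed(0)

def progressive_sample(data, evaluate, start=100, growth=2, tolerance=0.01):
    """Grow the sample until the evaluation score stops improving by more
    than `tolerance` (the leveling-off point). `evaluate` stands in for any
    model-quality measure computed on a sample."""
    size = start
    best = evaluate(random.sample(data, min(size, len(data))))
    while size < len(data):
        size = min(size * growth, len(data))
        score = evaluate(random.sample(data, size))
        if score - best <= tolerance:   # leveling off: no meaningful improvement
            break
        best = score
    return size
```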
DIMENSIONALITY REDUCTION

• Data sets can have a large number of features.


• Example
• a set of documents, where each document is represented by a vector
whose components are the frequencies with which each word occurs in
the document.
• thousands or tens of thousands of attributes (components), one for each
word in the vocabulary.
DIMENSIONALITY REDUCTION

• Benefits of dimensionality reduction:

• Data mining algorithms work better if the dimensionality is lower.
• It eliminates irrelevant features and reduces noise.
• Lead to a more understandable model
• fewer attributes
• Allow the data to be more easily visualized.
• Amount of time and memory required by the data mining algorithm is reduced with a reduction in
dimensionality.
• Reduce the dimensionality of a data set by creating new attributes that are a combination of the old
attributes.
• Feature subset selection or feature selection:
• The reduction of dimensionality by selecting new attributes that are a subset of the old.
DIMENSIONALITY REDUCTION

• The Curse of Dimensionality


• Data analysis becomes significantly harder as the dimensionality of the data
increases.
• The data becomes increasingly sparse in the space it occupies.
• Classification
• there are not enough data objects to allow the creation of a model that reliably assigns a class to all possible objects.
• Clustering
• the definitions of density and of the distance between points become less meaningful.
DIMENSIONALITY REDUCTION

• Linear Algebra Techniques for Dimensionality Reduction


• Principal Components Analysis (PCA)
• for continuous attributes
• finds new attributes (principal components) that
• (1) are linear combinations of the original attributes,
• (2) are orthogonal (perpendicular) to each other, and
• (3) capture the maximum amount of variation in the data.
• Singular Value Decomposition (SVD)
• Related to PCA
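A minimal sketch of PCA computed via the SVD of the mean-centered data matrix in NumPy; the random data and the choice of two retained components are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 100 objects with 5 continuous attributes.
X = rng.normal(size=(100, 5))

# PCA via SVD of the mean-centered data matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Rows of Vt are the principal components: orthogonal linear combinations
# of the original attributes, ordered by the variation they capture.
explained_variance = (S ** 2) / (len(X) - 1)

# Project onto the first 2 components to reduce 5 dimensions to 2.
X_reduced = Xc @ Vt[:2].T
```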
FEATURE SUBSET SELECTION
• Another way to reduce the dimensionality - use only a subset of the features.
• Redundant Features
• Example:
• Purchase price of a product and the amount of sales tax paid
• Redundant to each other
• contain much of the same information.

• Irrelevant features contain almost no useful information for the data mining task at hand.
• Example: Students’ ID numbers are irrelevant to the task of predicting students’ grade point averages.

• Redundant and irrelevant features


• reduce classification accuracy and the quality of the clusters that are found.
• Many can be eliminated immediately by using common sense or domain knowledge; otherwise a systematic
approach for selecting the best subset of features is needed.
• The ideal approach is to try all possible subsets of features as input to the data mining algorithm of interest, and
then take the subset that produces the best results; since n attributes have 2^n subsets, this is usually impractical,
which motivates the standard approaches described next.
FEATURE SUBSET SELECTION

• 3 standard approaches to feature selection:
• Embedded
• Filter
• Wrapper
FEATURE SUBSET SELECTION
• Embedded approaches:
• Feature selection occurs naturally as part of the data mining algorithm.
• During execution of algorithm, the Algorithm itself decides which attributes to use
and which to ignore.
• Example:- Algorithms for building decision tree classifiers

• Filter approaches:
• Features are selected before the data mining algorithm is run
• Approach that is independent of the data mining task.

• Wrapper approaches:
• Uses the target data mining algorithm as a black box to find the best subset of
attributes
• typically without enumerating all possible subsets.
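A schematic sketch of a wrapper approach using greedy forward selection; `evaluate` stands in for running the target data mining algorithm on a candidate subset (e.g., returning cross-validated accuracy) and is a hypothetical placeholder:

```python
def wrapper_forward_selection(features, evaluate, max_features=None):
    """Greedy forward selection: repeatedly add the single feature whose
    addition gives the best score from `evaluate`, stopping when no feature
    improves the score. Only a small fraction of all 2^n subsets is examined."""
    selected, best_score = [], float("-inf")
    remaining = list(features)
    while remaining and (max_features is None or len(selected) < max_features):
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        score, best_f = max(scored)
        if score <= best_score:        # stopping criterion: no improvement
            break
        best_score = score
        selected.append(best_f)
        remaining.remove(best_f)
    return selected
```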
FEATURE SUBSET SELECTION
• An Architecture for Feature Subset Selection :
• The feature selection process is viewed as consisting of four parts:
1. a measure for evaluating a subset,
2. a search strategy that controls the generation of a new subset of features,
3. a stopping criterion, and
4. a validation procedure.

• Filter methods and wrapper methods differ only in the way in which they
evaluate a subset of features.
• wrapper method – uses the target data mining algorithm
• filter approach - evaluation technique is distinct from the target data mining
algorithm.
FEATURE SUBSET SELECTION
• Feature subset selection is a search over all possible subsets of features.
• Evaluation step - determine the goodness of a subset of attributes with respect to a particular data mining task
• Filter approach: predict how well the actual data mining algorithm will perform on a given set of attributes.
• Wrapper approach: run the target data mining algorithm on the candidate subset and measure the resulting performance.

• Stopping criterion
• conditions involving the following:
• the number of iterations,
• whether the value of the subset evaluation measure is optimal or exceeds a certain threshold,
• whether a subset of a certain size has been obtained,
• whether simultaneous size and evaluation criteria have been achieved, and
• whether any improvement can be achieved by the options available to the search strategy.

• Validation:
• Finally, the results of the target data mining algorithm on the selected subset should be validated.
• An evaluation approach: run the algorithm with the full set of features and compare the full results to results
obtained using the subset of features.
FEATURE SUBSET SELECTION
• Feature Weighting
• An alternative to keeping or eliminating features.
• One Approach
• Higher weight - More important features
• Lower weight - less important features
• Another approach: automatic
• Example: some classification schemes, such as support vector machines, produce models in which more important features receive larger weights.
• Other approach
• The normalization of objects that takes place when the cosine similarity is used can also be regarded as a kind of feature weighting.
FEATURE CREATION
• Create a new set of attributes from the original attributes that captures the important
information in the data set
• much more effectively.
• The number of new attributes is usually smaller than the number of original attributes.
• Three related methodologies for creating new attributes:
1. Feature extraction
2. Mapping the data to a new space
3. Feature construction
FEATURE CREATION
• Feature Extraction
• The creation of a new set of features from the original raw data
• Example: Classify set of photographs based on existence of human face
(present or not)
• Raw data (set of pixels) - not suitable for many types of classification algorithms.
• If the raw data is converted to higher-level features (such as the presence or absence of certain types of edges and
areas that are highly correlated with the presence of human faces), then a much broader set of classification
techniques can be applied to this problem.

• Feature extraction is highly domain-specific


• New area means development of new features and feature extraction
methods.
FEATURE CREATION

Mapping the Data to a New Space


• A totally different view of the data can reveal important and interesting features.
• If there is only a single periodic pattern and not much noise, then the pattern is easily detected.
• If, there are a number of periodic patterns and a significant amount of noise is present, then these
patterns are hard to detect.
• Such patterns can be detected by applying a Fourier transform to the time series in order to
change to a representation in which frequency information is explicit.
• Example:
• Power spectrum that can be computed after applying a Fourier transform to the original time series.
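A minimal sketch of this idea using NumPy's FFT; the two frequencies, the noise level, and the series length are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical time series: two periodic patterns buried in noise.
t = np.linspace(0, 1, 1024, endpoint=False)
series = np.sin(2 * np.pi * 7 * t) + 0.5 * np.sin(2 * np.pi * 17 * t)
series += rng.normal(scale=1.0, size=t.size)

# Fourier transform: move to a representation where frequency is explicit.
spectrum = np.fft.rfft(series)
freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])
power = np.abs(spectrum) ** 2

# Peaks of the power spectrum reveal the hidden periodic patterns (7 and 17).
top = freqs[np.argsort(power)[-2:]]
print(sorted(top))
```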
FEATURE CREATION

• Feature Construction
• Sometimes the features in the original data set contain the necessary information, but in a form not suitable for the
data mining algorithm.
• In that case, new features constructed out of the original features can be more useful than the original features.

• Example (Density).
• Dataset contains the volume and mass of historical artifact.
• Density feature constructed from the mass and volume features, i.e., density = mass/volume, would most
directly yield an accurate classification.
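A tiny sketch of this construction; the mass and volume numbers are invented:

```python
# Hypothetical artifact records: (mass in grams, volume in cm^3).
artifacts = [(120.0, 15.0), (980.0, 50.0), (45.0, 22.5)]

# Constructed feature: density = mass / volume, which separates the
# underlying materials far more directly than mass or volume alone.
densities = [mass / volume for mass, volume in artifacts]
print(densities)
```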
DISCRETIZATION AND BINARIZATION

• Some classification algorithms require that the data be in the form of
categorical attributes.
• Algorithms that find association patterns require that the data be in
the form of binary attributes.
• Discretization - transforming a continuous attribute into a
categorical attribute
• Binarization - transforming both continuous and discrete attributes
into one or more binary attributes
DISCRETIZATION AND BINARIZATION
• Binarization of a categorical attribute (Simple technique):
• If there are m categorical values, then uniquely assign
each original value to an integer in the interval [0, m − 1].
• If the attribute is ordinal, then order must be maintained
by the assignment.
• Next, convert each of these m integers to a binary number
using n binary attributes.
• n = ⌈log2(m)⌉ binary digits are required to represent these
integers.
DISCRETIZATION AND BINARIZATION
Example: a categorical variable with 5 values
{awful, poor, OK, good, great}
require three binary variables x1, x2, and x3.
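A small sketch of this encoding, applied to the slide's 5-value attribute; the function and variable names are just illustrative:

```python
import math

def binarize_categorical(values, ordered_categories):
    """Map m categorical values to integers 0..m-1 (preserving order for
    ordinal attributes), then encode each integer with ceil(log2(m)) bits."""
    m = len(ordered_categories)
    n_bits = max(1, math.ceil(math.log2(m)))
    to_int = {cat: i for i, cat in enumerate(ordered_categories)}
    encoded = []
    for v in values:
        i = to_int[v]
        bits = [(i >> b) & 1 for b in reversed(range(n_bits))]  # x1..xn, MSB first
        encoded.append(bits)
    return encoded

# The 5-value example from the slide: 3 binary attributes are enough.
print(binarize_categorical(["awful", "great", "OK"],
                           ["awful", "poor", "OK", "good", "great"]))
```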
DISCRETIZATION AND BINARIZATION
• Discretization of Continuous Attributes ( classification or
association analysis)
• Transformation of a continuous attribute to a categorical attribute
involves two subtasks:
• decide no. of categories
• decide how to map the values of the continuous attribute to these
categories.
• Step I: sort the attribute values and divide them into n intervals by specifying n−1
split points.
• Step II: map all the values in one interval to the same categorical
value.
DISCRETIZATION AND BINARIZATION
• Discretization of Continuous Attributes
• Problem of discretization is
• Deciding how many split points to choose and
• where to place them.
• The result can be represented either as
• a set of intervals {(x0, x1], (x1, x2], ..., (xn−1, xn)},
where x0 and xn may be −∞ or +∞, respectively,
or
• as a series of inequalities x0 < x ≤ x1, ..., xn−1 < x < xn.
DISCRETIZATION AND BINARIZATION
• Unsupervised Discretization
• Discretization methods for classification:
• Supervised - class information is used
• Unsupervised - class information is not used
• Equal width approach:
• divides the range of the attribute into a user-specified number of
intervals, each having the same width.
• can be badly affected by outliers
• Equal frequency (equal depth) approach:
• Puts same number of objects into each interval
• K-means Clustering method
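A minimal sketch of the two simplest unsupervised schemes with NumPy; the generated data and the choice of 4 intervals are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(size=200)          # hypothetical continuous attribute
n_bins = 4

# Equal width: split the range into n intervals of identical width
# (a single extreme outlier can squeeze most values into one bin).
equal_width_edges = np.linspace(values.min(), values.max(), n_bins + 1)
equal_width_labels = np.digitize(values, equal_width_edges[1:-1])

# Equal frequency (equal depth): choose split points so that each
# interval receives (approximately) the same number of objects.
equal_freq_edges = np.quantile(values, np.linspace(0, 1, n_bins + 1))
equal_freq_labels = np.digitize(values, equal_freq_edges[1:-1])

print(np.bincount(equal_width_labels), np.bincount(equal_freq_labels))
```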
DISCRETIZATION AND BINARIZATION
Unsupervised Discretization

Original Data
DISCRETIZATION AND BINARIZATION
Unsupervised Discretization

Equal Width Discretization


DISCRETIZATION AND BINARIZATION
Unsupervised Discretization

Equal Frequency Discretization


DISCRETIZATION AND BINARIZATION
Unsupervised Discretization

K-means Clustering (better result)


DISCRETIZATION AND BINARIZATION
• Supervised Discretization
• When additional information (class labels) is used, discretization
often produces better results.
• Some Concerns: purity of an interval and the minimum size of
an interval.
• statistically based approaches:
• start with each attribute value as a separate interval and
create larger intervals by merging adjacent intervals that are
similar according to a statistical test.
• Entropy based approaches:
DISCRETIZATION AND BINARIZATION
• Supervised Discretization
• Entropy based approaches:
• Entropy of the i-th interval:

e_i = − Σ_{j=1}^{k} p_ij · log2(p_ij)

• p_ij = m_ij / m_i is the probability (fraction) of class j in the i-th interval,
• k - no. of different class labels,
• m_i - no. of values in the i-th interval of a partition, and
• m_ij - no. of values of class j in interval i.
DISCRETIZATION AND BINARIZATION
• Supervised Discretization
• Entropy
DISCRETIZATION AND BINARIZATION
• Supervised Discretization
• Entropy based approaches:
• The total entropy, e, of the partition is the weighted average of the
individual interval entropies:

e = Σ_{i=1}^{n} w_i · e_i

• m - total no. of values,
• w_i = m_i / m is the fraction of values in the i-th interval, and
• n - no. of intervals.
• Perfectly pure interval: entropy is 0
• the interval contains only values of one class.
• Most impure interval: entropy is maximum
• the classes of values in the interval occur equally often.
DISCRETIZATION AND BINARIZATION
• Supervised Discretization
• Entropy based approaches:
• Simple approach for partitioning a continuous attribute:
• starts by bisecting the initial values so that the resulting
two intervals give minimum entropy.
• consider each value as a possible split point
• The splitting process is then repeated with another interval,
• typically choosing the interval with the worst (highest) entropy,
• until a user-specified number of intervals is reached,
or
• a stopping criterion is satisfied.
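A small sketch of one bisection step of such an entropy-based procedure; the toy values and class labels are invented:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels: -sum(p * log2 p)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_split(values, labels):
    """Try each position as a split point and return the split that minimizes
    the weighted (total) entropy of the two resulting intervals."""
    pairs = sorted(zip(values, labels))
    best = (float("inf"), None)
    for i in range(1, len(pairs)):
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        w = len(left) / len(pairs)
        e = w * entropy(left) + (1 - w) * entropy(right)
        split_point = (pairs[i - 1][0] + pairs[i][0]) / 2
        best = min(best, (e, split_point))
    return best  # (total entropy, split point)

# Hypothetical attribute values with two classes.
print(best_split([1.0, 1.2, 1.4, 5.0, 5.2, 5.5], ["a", "a", "a", "b", "b", "b"]))
```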
DISCRETIZATION AND BINARIZATION
• Supervised Discretization
• Entropy based approaches:
• 3 categories for both x and y
DISCRETIZATION AND BINARIZATION
• Supervised Discretization
• Entropy based approaches:
• 5 categories for both x and y
• Observation:
• no improvement for 6 categories
DISCRETIZATION AND BINARIZATION
• Categorical Attributes with Too Many Values
• If the categorical attribute is ordinal,
• techniques similar to those for continuous attributes can be used.
• If the categorical attribute is nominal, other approaches are needed.
• Example:-
• University that has a large number of departments.
• department name attribute - dozens of diff. values.
• combine departments into larger groups, such as
• engineering,
• social sciences, or
• biological sciences.
Variable Transformation

• Transformation that is applied to all the values of a variable.


• Example: if only the magnitude of a variable is important,
• then the values of the variable can be transformed by taking the absolute
value.
• Simple Function Transformation:
• A simple mathematical function is applied to each value individually.
• If x is a variable, then examples of such transformations include
x^k, log x, e^x, √x, 1/x, sin x, and |x|.
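A tiny sketch applying a few of these element-wise transformations with NumPy; the sample values are made up:

```python
import numpy as np

x = np.array([0.25, 0.5, 1.0, 2.0, 4.0])   # hypothetical variable values

# A few simple function transformations applied to each value individually.
transforms = {
    "x^2":    x ** 2,
    "log x":  np.log(x),
    "sqrt x": np.sqrt(x),
    "1/x":    1.0 / x,
    "|x|":    np.abs(x),
}
for name, values in transforms.items():
    print(name, values)
```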
Variable Transformation

• Variable transformations should be applied with caution since they


change the nature of the data.
• Example: the transformation function is 1/x
• for values greater than or equal to 1, it reduces the magnitude of values
• the values {1, 2, 3} go to {1, 1/2, 1/3}
• for values between 0 and 1, it increases the magnitude of values
• the values {1, 1/2, 1/3} go to {1, 2, 3}.
• So it is better to ask questions such as the following:
• Does the order need to be maintained?
• Does the transformation apply to all values (including negative values and 0)?
• What is the effect of the transformation on the values between 0 and 1?
Variable Transformation

• Normalization or Standardization
• Goal of standardization or normalization
• To make an entire set of values have a particular property.
• A traditional example is that of “standardizing a variable” in statistics.
• If x̄ is the mean (average) of the attribute values and
• s_x is their standard deviation, then the transformation

x' = (x − x̄) / s_x

• creates a new variable that has a mean of 0 and a standard deviation
of 1.
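A minimal sketch of this standardization with NumPy; the attribute values are made up:

```python
import numpy as np

x = np.array([15.0, 22.0, 30.0, 45.0, 38.0])   # hypothetical attribute values

# Standard z-score standardization: subtract the mean, divide by the std. dev.
x_std = (x - x.mean()) / x.std()

print(x_std.mean(), x_std.std())   # approximately 0 and 1
```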
Variable Transformation

• Normalization or Standardization
• If different variables are to be combined, a transformation is necessary
to avoid having a variable with large values dominate the results of the
calculation.
• Example:
• comparing people based on two variables: age and income.
• For any two people, the difference in income will likely be much
higher in absolute terms (hundreds or thousands of dollars) than the
difference in age (less than 150).
• The income values (which are much larger) will dominate the calculation.
Variable Transformation

• Normalization or Standardization
• The mean and standard deviation are strongly affected by outliers, so they are often replaced:
• the mean is replaced by the median, i.e., the middle value.
• For a variable x, the absolute standard deviation is

σ_AAD = Σ_{i=1}^{m} |x_i − µ|

• x_i - i-th value of the variable,
• m - number of objects, and
• µ - either the mean or the median.
• Other approaches
• computing estimates of the location (center) and
• spread of a set of values in the presence of outliers
• These measures can also be used to define a standardization transformation.
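A small sketch of a robust variant along these lines, using the median and the absolute standard deviation defined above; the values (including the outlier) are invented:

```python
import numpy as np

x = np.array([15.0, 22.0, 30.0, 45.0, 38.0, 900.0])   # hypothetical values with an outlier

# Robust standardization: center on the median and scale by the
# absolute standard deviation, sum(|x_i - mu|) with mu = median.
mu = np.median(x)
abs_std = np.sum(np.abs(x - mu))
x_robust = (x - mu) / abs_std

print(x_robust)
```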
