
Chapter 3

Data Preprocessing

3.1 Exercises
1. Data quality can be assessed in terms of several issues, including accuracy, completeness, and consis-
tency. For each of the above three issues, discuss how the assessment of data quality can depend on
the intended use of the data, giving examples. Propose two other dimensions of data quality.
Answer:
There can be various examples illustrating that the assessment of data quality can depend on the
intended use of the data. Here we give just a few.

• For accuracy, first consider a recommendation system for online clothing purchases. When it
comes to birth date, the system may only care about the year in which the user was born, so that
it can suggest suitable choices. However, a Facebook app that makes birthday calendars for
friends must acquire the exact day on which a user was born to produce a credible calendar.
• For completeness, a product manager may not care much if customers’ address information is
missing while a marketing analyst considers address information essential for analysis.
• For consistency, consider a database manager who is merging two big movie information databases
into one. When he decides whether two entries refer to the same movie, he may check the entry’s
title and release date. In either database, the release date must therefore be consistent with the
title, or the merge will produce conflicting entries. But when a user is searching for a movie’s information just
for entertainment using either database, whether the release date is consistent with the title is
not so important. A user usually cares more about the movie’s content.

Two other dimensions that can be used to assess the quality of data can be taken from the following:
timeliness, believability, value added, interpretability, and accessibility. These can be used to assess
quality with regard to the following factors:

• Timeliness: Data must be available within a time frame that allows it to be useful for decision
making.
• Believability: Data values must be within the range of possible results in order to be useful for
decision making.
• Value added: Data must provide additional value in terms of information that offsets the cost
of collecting and accessing it.
• Interpretability: Data must not be so complex that the effort to understand the information it
provides exceeds the benefit of its analysis.


• Accessibility: Data must be accessible so that the effort to collect it does not exceed the benefit
from its use.

2. In real-world data, tuples with missing values for some attributes are a common occurrence. Describe
various methods for handling this problem.
Answer:
The various methods for handling the problem of missing values in data tuples include:
(a) Ignoring the tuple: This is usually done when the class label is missing (assuming the mining
task involves classification or description). This method is not very effective unless the tuple
contains several attributes with missing values. It is especially poor when the percentage of
missing values per attribute varies considerably.
(b) Manually filling in the missing value: In general, this approach is time-consuming and may
not be a reasonable task for large data sets with many missing values, especially when the value
to be filled in is not easily determined.
(c) Using a global constant to fill in the missing value: Replace all missing attribute values
by the same constant, such as a label like “Unknown,” or −∞. If missing values are replaced by,
say, “Unknown,” then the mining program may mistakenly think that they form an interesting
concept, since they all have a value in common — that of “Unknown.” Hence, although this
method is simple, it is not recommended.
(d) Using a measure of central tendency for the attribute, such as the mean (for sym-
metric numeric data), the median (for asymmetric numeric data), or the mode (for
nominal data): For example, suppose that the average income of AllElectronics customers is
$28,000 and that the data are symmetric. Use this value to replace any missing values for income.
(e) Using the attribute mean for numeric (quantitative) values or attribute mode for
nominal values, for all samples belonging to the same class as the given tuple: For
example, if classifying customers according to credit risk, replace the missing value with the
average income value for customers in the same credit risk category as that of the given tuple. If
the data are numeric and skewed, use the median value.
(f) Using the most probable value to fill in the missing value: This may be determined
with regression, inference-based tools using Bayesian formalism, or decision tree induction. For
example, using the other customer attributes in your data set, you may construct a decision tree
to predict the missing values for income.
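As an illustration of methods (d) and (e), here is a minimal pandas sketch; the DataFrame, the income
values, and the credit_risk classes are hypothetical examples, not data from the text.

import numpy as np
import pandas as pd

# Hypothetical customer data with missing income values.
df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income": [28000, np.nan, 41000, np.nan, 31000],
})

# (d) Fill with a global measure of central tendency (the mean, assuming symmetric data).
df["income_mean_filled"] = df["income"].fillna(df["income"].mean())

# (e) Fill with the mean income of the tuples in the same credit-risk class.
df["income_class_filled"] = df.groupby("credit_risk")["income"].transform(
    lambda s: s.fillna(s.mean())
)
print(df)

For skewed numeric data, replacing mean() with median() in either line gives the median-based variant.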

3. Exercise 2.2 gave the following data (in increasing order) for the attribute age: 13, 15, 16, 16, 19, 20,
20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.

(a) Use smoothing by bin means to smooth the above data, using a bin depth of 3. Illustrate your
steps. Comment on the effect of this technique for the given data.
(b) How might you determine outliers in the data?
(c) What other methods are there for data smoothing?

Answer:

(a) Use smoothing by bin means to smooth the above data, using a bin depth of 3. Illustrate your
steps. Comment on the effect of this technique for the given data.
The following steps are required to smooth the above data using smoothing by bin means with a
bin depth of 3.

• Step 1: Sort the data. (This step is not required here as the data are already sorted.)
• Step 2: Partition the data into equidepth bins of depth 3.
Bin 1: 13, 15, 16 Bin 2: 16, 19, 20 Bin 3: 20, 21, 22
Bin 4: 22, 25, 25 Bin 5: 25, 25, 30 Bin 6: 33, 33, 35
Bin 7: 35, 35, 35 Bin 8: 36, 40, 45 Bin 9: 46, 52, 70
• Step 3: Calculate the arithmetic mean of each bin.
• Step 4: Replace each of the values in each bin by the arithmetic mean calculated for the bin.
Bin 1: 14 2/3, 14 2/3, 14 2/3   Bin 2: 18 1/3, 18 1/3, 18 1/3   Bin 3: 21, 21, 21
Bin 4: 24, 24, 24   Bin 5: 26 2/3, 26 2/3, 26 2/3   Bin 6: 33 2/3, 33 2/3, 33 2/3
Bin 7: 35, 35, 35   Bin 8: 40 1/3, 40 1/3, 40 1/3   Bin 9: 56, 56, 56
This method smooths a sorted data value by consulting its "neighborhood," that is, the values
around it. It performs local smoothing. (A Python sketch of this binning appears after part (c) below.)
(b) How might you determine outliers in the data?
Outliers in the data may be detected by clustering, where similar values are organized into groups,
or ‘clusters’. Values that fall outside of the set of clusters may be considered outliers. Alterna-
tively, a combination of computer and human inspection can be used where a predetermined data
distribution is implemented to allow the computer to identify possible outliers. These possible
outliers can then be verified by human inspection with much less effort than would be required
to verify the entire initial data set.
(c) What other methods are there for data smoothing?
Other methods that can be used for data smoothing include alternate forms of binning such as
smoothing by bin medians or smoothing by bin boundaries. Alternatively, equiwidth bins can be
used to implement any of the forms of binning, where the interval range of values in each bin is
constant. Methods other than binning include using regression techniques to smooth the data by
fitting it to a function such as through linear or multiple regression. Also, classification techniques
can be used to implement concept hierarchies that can smooth the data by rolling-up lower level
concepts to higher-level concepts.
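Illustrating part (a), a minimal Python sketch of smoothing by bin means over the given age data (the
rounding to two decimals is only for display):

# Smoothing by bin means with bin depth 3 on the age data of Exercise 3.3.
age = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
       30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

depth = 3
smoothed = []
for i in range(0, len(age), depth):
    bin_values = age[i:i + depth]                  # one equidepth bin
    bin_mean = sum(bin_values) / len(bin_values)
    smoothed.extend([round(bin_mean, 2)] * len(bin_values))

print(smoothed)   # starts 14.67, 14.67, 14.67, 18.33, ...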

4. Discuss issues to consider during data integration.


Answer:
Data integration involves combining data from multiple sources into a coherent data store. Issues that
must be considered during such integration include:

• Schema integration: The metadata from the different data sources must be integrated in order
to match up equivalent real-world entities. This is referred to as the entity identification problem.
• Handling redundant data: Derived attributes may be redundant, and inconsistent attribute
naming may also lead to redundancies in the resulting data set. Also, duplications at the tuple
level may occur and thus need to be detected and resolved.
• Detection and resolution of data value conflicts: Differences in representation, scaling or
encoding may cause the same real-world entity attribute values to differ in the data sources being
integrated.

5. What are the value ranges of the following normalization methods?

(a) min-max normalization



(b) z-score normalization


(c) z-score normalization using the mean absolute deviation instead of standard deviation
(d) normalization by decimal scaling
Answer:
(a) Min-max normalization can define any value range [new_min_A, new_max_A] and linearly maps the
original data onto this range.
(b) Z-score normalization normalizes the values of an attribute A based on its mean Ā and standard
deviation σ_A. The value range is [(min_A − Ā)/σ_A, (max_A − Ā)/σ_A].
(c) Z-score normalization using the mean absolute deviation is a variation of z-score normalization that
replaces the standard deviation with the mean absolute deviation of A, denoted by s_A, which is
    s_A = (1/n)(|v_1 − Ā| + |v_2 − Ā| + ... + |v_n − Ā|).
The value range is [(min_A − Ā)/s_A, (max_A − Ā)/s_A].
(d) Normalization by decimal scaling normalizes by moving the decimal point of the values of attribute
A. The value range is [min_A/10^j, max_A/10^j], where j is the smallest integer such that
max(|v_i/10^j|) < 1.

6. Use the methods below to normalize the following group of data:


200, 300, 400, 600, 1000
(a) min-max normalization by setting min = 0 and max = 1
(b) z-score normalization
(c) z-score normalization using the mean absolute deviation instead of standard deviation
(d) normalization by decimal scaling
Answer:
(a) Min-max normalization with min = 0 and max = 1 computes the new value as
    v′_i = (v_i − 200)/(1000 − 200) × (1 − 0) + 0.
The normalized data are:
    0, 0.125, 0.25, 0.5, 1
(b) In z-score normalization, a value v_i of A is normalized to v′_i by computing
    v′_i = (v_i − Ā)/σ_A,
where
    Ā = (1/5)(200 + 300 + 400 + 600 + 1000) = 500,
    σ_A = sqrt((1/5)(200² + 300² + ... + 1000²) − Ā²) = 282.8.
The normalized data are:
    −1.06, −0.707, −0.354, 0.354, 1.77

(c) Z-score normalization using the mean absolute deviation instead of the standard deviation replaces
σ_A with s_A, where
    s_A = (1/5)(|200 − 500| + |300 − 500| + ... + |1000 − 500|) = 240.
The normalized data are:
    −1.25, −0.833, −0.417, 0.417, 2.08

(d) The smallest integer j such that max(|v_i/10^j|) < 1 is 3. After normalization by decimal scaling,
the data become:
    0.2, 0.3, 0.4, 0.6, 1.0
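The four normalizations above can be verified with a short script. A minimal NumPy sketch over the
given data (np.std computes the population standard deviation, matching the calculation above):

import numpy as np

v = np.array([200, 300, 400, 600, 1000], dtype=float)

# (a) min-max normalization to [0, 1]
minmax = (v - v.min()) / (v.max() - v.min())

# (b) z-score normalization
z = (v - v.mean()) / v.std()

# (c) z-score normalization using the mean absolute deviation
s_A = np.mean(np.abs(v - v.mean()))
z_mad = (v - v.mean()) / s_A

# (d) normalization by decimal scaling with j = 3
decimal = v / 10**3

print(minmax, z, z_mad, decimal, sep="\n")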

7. Using the data for age given in Exercise 3.3, answer the following:

(a) Use min-max normalization to transform the value 35 for age onto the range [0.0, 1.0].
(b) Use z-score normalization to transform the value 35 for age, where the standard deviation of age
is 12.94 years.
(c) Use normalization by decimal scaling to transform the value 35 for age.
(d) Comment on which method you would prefer to use for the given data, giving reasons as to why.

Answer:

(a) Use min-max normalization to transform the value 35 for age onto the range [0.0, 1.0].
Using the corresponding equation with min_A = 13, max_A = 70, new_min_A = 0, new_max_A = 1.0,
then v = 35 is transformed to v′ = 0.39.
(b) Use z-score normalization to transform the value 35 for age, where the standard deviation of age
is 12.94 years.
Using the corresponding equation where Ā = 809/27 ≈ 29.96 and σ_A = 12.94, then v = 35 is
transformed to v′ = 0.39.
(c) Use normalization by decimal scaling to transform the value 35 for age.
Using the corresponding equation where j = 2, v = 35 is transformed to v ′ = 0.35.
(d) Comment on which method you would prefer to use for the given data, giving reasons as to why.
Given the data, one may prefer decimal scaling for normalization as such a transformation would
maintain the data distribution and be intuitive to interpret, while still allowing mining on spe-
cific age groups. Min-max normalization has the undesired effect of not permitting any future
values to fall outside the current minimum and maximum values without encountering an “out
of bounds error”. As it is probable that such values may be present in future data, this method
is less appropriate. Also, z-score normalization transforms values into measures that represent
their distance from the mean, in terms of standard deviations. It is probable that this type of
transformation would not increase the information value of the attribute in terms of intuitiveness
to users or in usefulness of mining results.
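A few lines of Python suffice to verify the three transformed values; the mean, standard deviation, and
j below are those stated above:

# Verifying the transformations of v = 35 for age (Exercise 3.3 data).
v, min_a, max_a = 35, 13, 70
mean_a, std_a = 809 / 27, 12.94

print(round((v - min_a) / (max_a - min_a), 2))   # min-max         -> 0.39
print(round((v - mean_a) / std_a, 2))            # z-score         -> 0.39
print(round(v / 10**2, 2))                       # decimal scaling -> 0.35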

8. Using the data for age and body fat given in Exercise 2.4, answer the following:

(a) Normalize the two attributes based on z-score normalization.


(b) Calculate the correlation coefficient (Pearson’s product moment coefficient). Are these two at-
tributes positively or negatively correlated? Compute their covariance.

Answer:
(a) Normalize the two variables based on z-score normalization.

age 23 23 27 27 39 41 47 49 50
z-age -1.83 -1.83 -1.51 -1.51 -0.58 -0.42 0.04 0.20 0.28
%fat 9.5 26.5 7.8 17.8 31.4 25.9 27.4 27.2 31.2
z-%fat -2.14 -0.25 -2.33 -1.22 0.29 -0.32 -0.15 -0.18 0.27
age 52 54 54 56 57 58 58 60 61
z-age 0.43 0.59 0.59 0.74 0.82 0.90 0.90 1.06 1.13
%fat 34.6 42.5 28.8 33.4 30.2 34.1 32.9 41.2 35.7
z-%fat 0.65 1.53 0.0 0.51 0.16 0.59 0.46 1.38 0.77

(b) Calculate the correlation coefficient (Pearson’s product moment coefficient). Are these two vari-
ables positively or negatively correlated? Compute their covariance.
The correlation coefficient is 0.82, so the variables are positively correlated. Since Cov(age, %fat) =
r · σ_age · σ_%fat and r > 0, the covariance is positive as well.
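A minimal NumPy sketch for parts (a) and (b), using the age and %fat values tabulated above; the text
does not give a numeric covariance, so the script simply reports what NumPy computes:

import numpy as np

age = np.array([23, 23, 27, 27, 39, 41, 47, 49, 50,
                52, 54, 54, 56, 57, 58, 58, 60, 61], dtype=float)
fat = np.array([9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2,
                34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7])

# (a) z-score normalization (population standard deviation, as in the table above)
z_age = (age - age.mean()) / age.std()
z_fat = (fat - fat.mean()) / fat.std()

# (b) Pearson correlation coefficient (about 0.82) and covariance (positive)
r = np.corrcoef(age, fat)[0, 1]
cov = np.cov(age, fat, ddof=0)[0, 1]
print(round(r, 2), round(cov, 2))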

9. Suppose a group of 12 sales price records has been sorted as follows:

5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215.
Partition them into three bins by each of the following methods.
(a) equal-frequency (equidepth) partitioning
(b) equal-width partitioning
(c) clustering
Answer:
(a) equal-frequency (equidepth) partitioning
Partition the data into equidepth bins of depth 4:
Bin 1: 5, 10, 11, 13   Bin 2: 15, 35, 50, 55   Bin 3: 72, 92, 204, 215
(b) equal-width partitioning
Partitioning the data into 3 equi-width bins will require the width to be (215 − 5)/3 = 70. We
get:
Bin 1: 5, 10, 11, 13, 15, 35, 50, 55, 72 Bin 2: 92 Bin 3: 204, 215
(c) clustering
Using K-means clustering to partition the data into three bins we get:
Bin 1: 5, 10, 11, 13, 15, 35 Bin 2: 50, 55, 72, 92 Bin 3: 204, 215
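Parts (a) and (b) can be reproduced directly; a minimal Python sketch over the given price records
(the clustering in part (c) depends on the chosen algorithm and initialization, so it is not reproduced
here):

prices = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]

# (a) equal-frequency (equidepth) bins of depth 4
depth = 4
equi_depth = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# (b) equal-width bins of width (215 - 5) / 3 = 70, i.e. boundaries at 75 and 145
width = (max(prices) - min(prices)) / 3
edges = [min(prices) + width, min(prices) + 2 * width]
equi_width = [
    [p for p in prices if p <= edges[0]],
    [p for p in prices if edges[0] < p <= edges[1]],
    [p for p in prices if p > edges[1]],
]

print(equi_depth)
print(equi_width)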

10. Use a flowchart to summarize the following procedures for attribute subset selection:
(a) stepwise forward selection
(b) stepwise backward elimination
(c) a combination of forward selection and backward elimination
Answer:

Figure 3.1: Stepwise forward selection.

(a) Stepwise forward selection


See Figure 3.1.
(b) Stepwise backward elimination
See Figure 3.2.
(c) A combination of forward selection and backward elimination
See Figure 3.3.

11. Using the data for age given in Exercise 3.3,


(a) Plot an equal-width histogram of width 10.
(b) Sketch examples of each of the following sampling techniques: SRSWOR, SRSWR, cluster sam-
pling, stratified sampling. Use samples of size 5 and the strata “youth”, “middle-aged”, and
“senior”.
Answer:
(a) Plot an equiwidth histogram of width 10.
See Figure 3.4.
(b) Sketch examples of each of the following sampling techniques: SRSWOR, SRSWR, cluster sam-
pling, stratified sampling. Use samples of size 5 and the strata “young”, “middle-aged”, and
“senior”.
See Figure 3.5.
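A minimal Python sketch of SRSWOR, SRSWR, and stratified sampling on the age data; the strata
cut-offs and the 2/2/1 allocation are chosen to match Figure 3.5 and are otherwise an assumption:

import random

age = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
       30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

# SRSWOR: simple random sample of size 5 without replacement
srswor = random.sample(age, 5)

# SRSWR: simple random sample of size 5 with replacement
srswr = [random.choice(age) for _ in range(5)]

# Stratified sampling: partition into strata, then sample within each stratum
strata = {
    "young": [a for a in age if a < 30],
    "middle-aged": [a for a in age if 30 <= a < 70],
    "senior": [a for a in age if a >= 70],
}
alloc = {"young": 2, "middle-aged": 2, "senior": 1}   # allocation as in Figure 3.5
stratified = {name: random.sample(strata[name], n) for name, n in alloc.items()}

print(srswor, srswr, stratified, sep="\n")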

12. ChiMerge [Ker92] is a supervised, bottom-up (i.e., merge-based) data discretization method. It relies
on χ2 analysis: adjacent intervals with the least χ2 values are merged together until the chosen stopping
criterion is satisfied.

Figure 3.2: Stepwise backward elimination.

(a) Briefly describe how ChiMerge works.


(b) Take the IRIS data set, obtained from the UC-Irvine Machine Learning Data Repository
(http://www.ics.uci.edu/~mlearn/MLRepository.html), as a data set to be discretized. Perform
data discretization for each of the four numerical attributes using the ChiMerge method. (Let the
stopping criterion be: max-interval = 6.) You need to write a small program to do this to avoid
clumsy numerical computation. Submit your simple analysis and your test results: split points,
final intervals, and your documented source program.
Answer:
(a) The ChiMerge algorithm consists of an initialization step and a bottom-up merging process, where
intervals are continuously merged until a termination condition is met. ChiMerge is initialized by
first sorting the training examples according to their value for the attribute being discretized and
then constructing the initial discretization, in which each example is put into its own interval (i.e.,
place an interval boundary before and after each example). The interval merging process contains
two steps, repeated continuously: (1) compute the χ2 value for each pair of adjacent intervals, (2)
merge (combine) the pair of adjacent intervals with the lowest χ2 value. Merging continues until
a predefined stopping criterion is met.
(b) According to the description in (a), the ChiMerge algorithm can be easily implemented. Detailed
empirical results and discussions can be found in this paper: Kerber, R. (1992). ChiMerge :
Discretization of numeric attributes, In Proceedings of the Tenth National Conference on Artificial
Intelligence, 123-128.
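A minimal Python sketch of the merging process described in (a), using the interval count as the
stopping criterion as in the exercise; the toy values and class labels at the end are hypothetical and
are not the IRIS data:

from collections import Counter

def chi2(counts_a, counts_b, classes):
    # Chi-square statistic for the class counts of two adjacent intervals.
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    total = n_a + n_b
    stat = 0.0
    for c in classes:
        col = counts_a.get(c, 0) + counts_b.get(c, 0)
        for n_row, counts in ((n_a, counts_a), (n_b, counts_b)):
            expected = n_row * col / total
            if expected > 0:
                stat += (counts.get(c, 0) - expected) ** 2 / expected
    return stat

def chimerge(values, labels, max_intervals=6):
    # Initialization: sort and give each distinct value its own interval.
    classes = set(labels)
    intervals = []                       # list of (lower bound, Counter of class labels)
    for v, y in sorted(zip(values, labels)):
        if intervals and intervals[-1][0] == v:
            intervals[-1][1][y] += 1     # identical values share one initial interval
        else:
            intervals.append((v, Counter({y: 1})))
    # Bottom-up merging: repeatedly merge the adjacent pair with the lowest chi-square.
    while len(intervals) > max_intervals:
        scores = [chi2(intervals[i][1], intervals[i + 1][1], classes)
                  for i in range(len(intervals) - 1)]
        i = scores.index(min(scores))
        intervals[i:i + 2] = [(intervals[i][0], intervals[i][1] + intervals[i + 1][1])]
    return [lo for lo, _ in intervals]   # lower bounds of the final intervals

# Hypothetical toy data with two classes, only to show the call; not the IRIS data.
print(chimerge([1, 3, 7, 8, 9, 11, 23, 37, 39, 45, 46, 59],
               list("aaaaabbbbaaa"), max_intervals=3))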

13. Propose an algorithm, in pseudocode or in your favorite programming language, for the following:
(a) The automatic generation of a concept hierarchy for categorical data based on the number of
distinct values of attributes in the given schema
(b) The automatic generation of a concept hierarchy for numerical data based on the equal-width
partitioning rule

Figure 3.3: A combination of forward selection and backward elimination.

(c) The automatic generation of a concept hierarchy for numerical data based on the equal-frequency
partitioning rule

Answer:

(a) The automatic generation of a concept hierarchy for categorical data based on the number of
distinct values of attributes in the given schema
Pseudocode for the automatic generation of a concept hierarchy for categorical data based on the
number of distinct values of attributes in the given schema:

begin
// array to hold name and distinct value count of attributes
// used to generate concept hierarchy
array count_ary[];
string count_ary[].name;   // attribute name
int count_ary[].count;     // distinct value count

// array to represent concept hierarchy (as an ordered list of values)
array concept_hierarchy[];

for each attribute 'A' in schema {
    distinct_count = count distinct 'A';
    insert ('A', distinct_count) into count_ary[];
}

sort count_ary[] ascending by count;

for (i = 0; i < count_ary[].length; i++) {
    // generate concept hierarchy nodes
    concept_hierarchy[i] = count_ary[i].name;
}
end

Figure 3.4: An equiwidth histogram of width 10 for age.

To indicate a minimal count threshold necessary for generating another level in the concept
hierarchy, the user could specify an additional parameter.
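A runnable counterpart to the pseudocode above, as a minimal pandas sketch; the location attributes
and their values are hypothetical:

import pandas as pd

def schema_hierarchy(df: pd.DataFrame, attributes):
    # Order the attributes from fewest to most distinct values, i.e. from the
    # most general level of the concept hierarchy to the most specific one.
    return sorted(attributes, key=lambda a: df[a].nunique())

# Hypothetical location data: country < province_or_state < city < street
df = pd.DataFrame({
    "street": ["1 Main St", "2 Oak Ave", "5 Pine Rd", "9 Elm St", "12 Lake Dr"],
    "city": ["Vancouver", "Victoria", "Toronto", "Seattle", "Seattle"],
    "province_or_state": ["BC", "BC", "ON", "WA", "WA"],
    "country": ["Canada", "Canada", "Canada", "USA", "USA"],
})
print(schema_hierarchy(df, ["street", "city", "province_or_state", "country"]))
# ['country', 'province_or_state', 'city', 'street']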
(b) The automatic generation of a concept hierarchy for numeric data based on the equiwidth parti-
tioning rule
begin
// numerical attribute to be used to generate concept hierarchy
string concept_attb;

// array to represent concept hierarchy (as an ordered list of values)
array concept_hierarchy[];
string concept_hierarchy[].name;   // attribute name
int concept_hierarchy[].max;       // max value of bin
int concept_hierarchy[].min;       // min value of bin
int concept_hierarchy[].mean;      // mean value of bin
int concept_hierarchy[].sum;       // sum of bin
int concept_hierarchy[].count;     // tuple count of bin

int range_min;   // min data value - user specified
int range_max;   // max data value - user specified
int step;        // width of bins - user specified
int j = 0;

// initialize concept hierarchy array
for (i = range_min; i < range_max; i += step) {
    concept_hierarchy[j].name = 'level ' + j;
    concept_hierarchy[j].min = i;
    concept_hierarchy[j].max = i + step - 1;
    j++;
}
// extend the final bin's max value if necessary so that it covers range_max
if (concept_hierarchy[j-1].max < range_max) {
    concept_hierarchy[j-1].max = range_max;
}

// assign each value to a bin by incrementing the appropriate sum and count values
for each tuple T in task relevant data set {
    int k = 0;
    while (T.concept_attb > concept_hierarchy[k].max) { k++; }
    concept_hierarchy[k].sum += T.concept_attb;
    concept_hierarchy[k].count++;
}

// calculate the bin metric used to represent the value of each level
// in the concept hierarchy
for (i = 0; i < concept_hierarchy[].length; i++) {
    concept_hierarchy[i].mean = concept_hierarchy[i].sum / concept_hierarchy[i].count;
}
end
The user can specify more meaningful names for the concept hierarchy levels generated by review-
ing the maximum and minimum values of the bins, with respect to background knowledge about
the data (e.g., assigning the labels young, middle-aged, and old to a three-level hierarchy generated
for age.) Also, an alternative binning method could be implemented, such as smoothing by bin
modes.
(c) The automatic generation of a concept hierarchy for numeric data based on the equidepth parti-
tioning rule
Pseudocode for the automatic generation of a concept hierarchy for numeric data based on the
equidepth partitioning rule:
begin
// numerical attribute to be used to generate concept hierarchy
string concept_attb;

// array to represent concept hierarchy (as an ordered list of values)
array concept_hierarchy[];
string concept_hierarchy[].name;   // attribute name
int concept_hierarchy[].max;       // max value of bin
int concept_hierarchy[].min;       // min value of bin
int concept_hierarchy[].mean;      // mean value of bin
int concept_hierarchy[].sum;       // sum of bin
int concept_hierarchy[].count;     // tuple count of bin

int bin_depth;   // depth of bins to be used - user specified
int range_min;   // min data value - user specified
int range_max;   // max data value - user specified

// initialize concept hierarchy array
for (i = 0; i < (range_max / bin_depth); i++) {
    concept_hierarchy[i].name = 'level ' + i;
    concept_hierarchy[i].min = range_max;   // start high so the first value assigned replaces it
    concept_hierarchy[i].max = 0;
}

// sort the task-relevant data set
sort data set ascending by concept_attb;

int j = 1;
int k = 0;

// assign each value to a bin by incrementing the appropriate sum,
// min and max values as necessary
for each tuple T in task relevant data set {
    concept_hierarchy[k].sum += T.concept_attb;
    concept_hierarchy[k].count++;
    if (T.concept_attb <= concept_hierarchy[k].min) {
        concept_hierarchy[k].min = T.concept_attb;
    }
    if (T.concept_attb >= concept_hierarchy[k].max) {
        concept_hierarchy[k].max = T.concept_attb;
    }
    j++;
    if (j > bin_depth) {
        k++;
        j = 1;
    }
}

// calculate the bin metric used to represent the value of each level
// in the concept hierarchy
for (i = 0; i < concept_hierarchy[].length; i++) {
    concept_hierarchy[i].mean = concept_hierarchy[i].sum / concept_hierarchy[i].count;
}
end
This algorithm does not attempt to distribute data values across multiple bins in order to smooth
out any difference between the actual depth of the final bin and the desired depth to be imple-
mented. Also, the user can again specify more meaningful names for the concept hierarchy levels
generated by reviewing the maximum and minimum values of the bins, with respect to background
knowledge about the data.

14. Robust data loading poses a challenge in database systems because the input data are often dirty.
In many cases, an input record may be missing multiple values; some records may be contaminated, with
data values out of range or of a data type different from what is expected. Work out an automated data
cleaning and loading algorithm so that the erroneous data will be marked, and contaminated data will
not be mistakenly inserted into the database during data loading.
Answer:
We can tackle this automated data cleaning and loading problem from the following perspectives:
• Use metadata (e.g., domain, range, dependency, distribution).
• Check the unique rule, the consecutive rule, and the null rule.
• Check for field overloading.
• Apply spell-checking.

• Detect different attribute names which actually have the same meaning.
• Use domain knowledge to detect errors and make corrections.
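As one possible realization of the first three checks, here is a minimal Python sketch of a validating
loader; the field names, the metadata table, and the CSV input format are assumptions made for
illustration only:

import csv

# Hypothetical metadata: expected type and allowed range or domain per field.
METADATA = {
    "customer_id": {"type": int, "min": 1},
    "age":         {"type": int, "min": 0, "max": 120},
    "country":     {"type": str, "domain": {"CA", "US", "MX"}},
}

def validate(record):
    # Return a list of error messages; an empty list means the record is clean.
    errors = []
    for field, rules in METADATA.items():
        raw = record.get(field) or ""
        if raw == "":
            errors.append(f"{field}: missing value")
            continue
        try:
            value = rules["type"](raw)
        except ValueError:
            errors.append(f"{field}: expected {rules['type'].__name__}, got {raw!r}")
            continue
        if "min" in rules and value < rules["min"]:
            errors.append(f"{field}: {value} below minimum {rules['min']}")
        if "max" in rules and value > rules["max"]:
            errors.append(f"{field}: {value} above maximum {rules['max']}")
        if "domain" in rules and value not in rules["domain"]:
            errors.append(f"{field}: {value!r} not in allowed domain")
    return errors

def load(path):
    # Clean records may be inserted; rejected records are marked with their errors.
    clean, rejected = [], []
    with open(path, newline="") as f:
        for record in csv.DictReader(f):
            errors = validate(record)
            (rejected if errors else clean).append((record, errors))
    return clean, rejected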

3.2 Supplementary Exercises


1. The following table contains the attributes name, gender, trait-1, trait-2, trait-3, and trait-4, where
name is an object identifier, gender is a symmetric attribute, and the remaining trait attributes are
asymmetric, describing personal traits of individuals who desire a penpal. Suppose that a service exists
that attempts to find pairs of compatible penpals.

name      gender  trait-1  trait-2  trait-3  trait-4
Kevin     M       N        P        P        N
Caroline  F       N        P        P        N
Erik      M       P        N        N        P
...       ...     ...      ...      ...      ...

[MK note to Jiawei: Can we please discuss this exercise? There are many ambiguities.]
For asymmetric attribute values, let the value P be set to 1 and the value N be set to 0. Suppose that
the distance between objects (potential penpals) is computed based only on the asymmetric variables.

(a) Show the contingency matrix for each pair given Kevin, Caroline, and Erik (based on trait-1 to
trait-4 ).
(b) Compute the invariant dissimilarity of each pair using Equation (??).
[MK note: Based on our discussion, we no longer refer to the simple matching coefficient or the
Jaccard coefficient in Section 7.2.2. Ambiguity: why does part (b) use the equation for symmetric
binary variables when we instruct the reader to use only the four asymmetric variables? Note that
the answers we get for parts (b) and (c) are identical, so I see no point in asking this confusing
question.]
(c) Compute the noninvariant dissimilarity of each pair using Equation (??).
(d) Who do you suggest would make the best pair of penpals? Which pair of individuals would be
the least compatible?
(e) Suppose that we are to include the symmetric variable gender in our analysis. Based on Equa-
tion (??), who would be the most compatible pair, and why?
[MK note: Ambiguity: why are we asking the reader to use the equation for asymmetric variables
when including the symmetric variable gender (and, if so, we would need to specify whether M or F
should be coded as 1)? Shouldn't they be using the technique for variables of mixed types? I looked
at my copy of the answer book and, based on the calculations, it does appear that the equation for
variables of mixed types is used (which contradicts our question). However, I obtain different answers
than the answer book (although my copy may be outdated): d(K,C) = 1/1 = 1 (disagrees with the
answer book); d(K,E) = 4/5 (agrees with the answer book); d(C,E) = 5/5 = 1 (different derivation
than the answer book). Let's discuss. Thanks.]

Tuples
T1 13 T10 22 T19 33
T2 15 T11 25 T20 35
T3 16 T12 25 T21 35
T4 16 T13 25 T22 36
T5 19 T14 25 T23 40
T6 20 T15 30 T24 45
T7 20 T16 33 T25 46
T8 21 T17 33 T26 52
T9 22 T18 33 T27 70

SRSWOR vs. SRSWR


SRSWOR (n = 5) SRSWR (n = 5)
T4 16 T7 20
T6 20 T7 20
T10 22 T20 35
T11 25 T21 35
T26 52 T25 46

Cluster sampling: initial clusters

Cluster 1: T1 13, T2 15, T3 16, T4 16, T5 19
Cluster 2: T6 20, T7 20, T8 21, T9 22, T10 22
Cluster 3: T11 25, T12 25, T13 25, T14 25, T15 30
Cluster 4: T16 33, T17 33, T18 33, T19 33, T20 35
Cluster 5: T21 35, T22 36, T23 40, T24 45, T25 46
Cluster 6: T26 52, T27 70

Cluster sampling (m = 2): clusters 2 and 5 are selected
Cluster 2: T6 20, T7 20, T8 21, T9 22, T10 22
Cluster 5: T21 35, T22 36, T23 40, T24 45, T25 46

Stratified Sampling
T1 13 young T10 22 young T19 33 middle age
T2 15 young T11 25 young T20 35 middle age
T3 16 young T12 25 young T21 35 middle age
T4 16 young T13 25 young T22 36 middle age
T5 19 young T14 25 young T23 40 middle age
T6 20 young T15 30 middle age T24 45 middle age
T7 20 young T16 33 middle age T25 46 middle age
T8 21 young T17 33 middle age T26 52 middle age
T9 22 young T18 33 middle age T27 70 senior

Stratified Sampling (according to age)


T4 16 young
T12 25 young
T17 33 middle age
T25 46 middle age
T27 70 senior

Figure 3.5: Examples of sampling: SRSWOR, SRSWR, cluster sampling, stratified sampling.
