
Chapter 3

Data Preprocessing

3.1 Exercises
1. Data quality can be assessed in terms of several issues, including accuracy, completeness, and consis-
tency. For each of the above three issues, discuss how the assessment of data quality can depend on
the intended use of the data, giving examples. Propose two other dimensions of data quality.
Answer:
There can be various examples illustrating that the assessment of data quality can depend on the
intended use of the data. Here we give just a few.

• For accuracy, first consider a recommendation system for online clothing purchases. When it
comes to birth date, the system may only care about the year in which the user was born, so that
it can suggest suitable choices. However, a Facebook app that makes birthday calendars for
friends must acquire the exact day on which a user was born to produce a credible calendar.
• For completeness, a product manager may not care much if customers’ address information is
missing while a marketing analyst considers address information essential for analysis.
• For consistency, consider a database manager who is merging two big movie information databases
into one. When he decides whether two entries refer to the same movie, he may check the entry’s
title and release date. In either database, the release date must therefore be consistent with the
title, or the merge will produce conflicting entries. But when a user is searching for a movie’s information just
for entertainment using either database, whether the release date is consistent with the title is
not so important. A user usually cares more about the movie’s content.

Two other dimensions that can be used to assess the quality of data can be taken from the following:
timeliness, believability, value added, interpretability, and accessibility. These can be used to assess
quality with regard to the following factors:

• Timeliness: Data must be available within a time frame that allows it to be useful for decision
making.
• Believability: Data values must be within the range of possible results in order to be useful for
decision making.
• Value added: Data must provide additional value in terms of information that offsets the cost
of collecting and accessing it.
• Interpretability: Data must not be so complex that the effort to understand the information it
provides exceeds the benefit of its analysis.


• Accessibility: Data must be accessible so that the effort to collect it does not exceed the benefit
from its use.

2. In real-world data, tuples with missing values for some attributes are a common occurrence. Describe
various methods for handling this problem.
Answer:
The various methods for handling the problem of missing values in data tuples include:
(a) Ignoring the tuple: This is usually done when the class label is missing (assuming the mining
task involves classification or description). This method is not very effective unless the tuple
contains several attributes with missing values. It is especially poor when the percentage of
missing values per attribute varies considerably.
(b) Manually filling in the missing value: In general, this approach is time-consuming and may
not be a reasonable task for large data sets with many missing values, especially when the value
to be filled in is not easily determined.
(c) Using a global constant to fill in the missing value: Replace all missing attribute values
by the same constant, such as a label like “Unknown,” or −∞. If missing values are replaced by,
say, “Unknown,” then the mining program may mistakenly think that they form an interesting
concept, since they all have a value in common — that of “Unknown.” Hence, although this
method is simple, it is not recommended.
(d) Using a measure of central tendency for the attribute, such as the mean (for sym-
metric numeric data), the median (for asymmetric numeric data), or the mode (for
nominal data): For example, suppose that the average income of AllElectronics customers is
$28,000 and that the data are symmetric. Use this value to replace any missing values for income.
(e) Using the attribute mean for numeric (quantitative) values or attribute mode for
nominal values, for all samples belonging to the same class as the given tuple: For
example, if classifying customers according to credit risk, replace the missing value with the
average income value for customers in the same credit risk category as that of the given tuple. If
the data are numeric and skewed, use the median value.
(f) Using the most probable value to fill in the missing value: This may be determined
with regression, inference-based tools using Bayesian formalism, or decision tree induction. For
example, using the other customer attributes in your data set, you may construct a decision tree
to predict the missing values for income.
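As an illustration of methods (d) and (e), here is a minimal pandas sketch; the DataFrame, the income
values, and the credit_risk classes are hypothetical examples, not data from the text.

import numpy as np
import pandas as pd

# Hypothetical customer data with missing income values.
df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income": [28000, np.nan, 41000, np.nan, 31000],
})

# (d) Fill with a global measure of central tendency (the mean, assuming symmetric data).
df["income_mean_filled"] = df["income"].fillna(df["income"].mean())

# (e) Fill with the mean income of the tuples in the same credit-risk class.
df["income_class_filled"] = df.groupby("credit_risk")["income"].transform(
    lambda s: s.fillna(s.mean())
)
print(df)

For skewed numeric data, replacing mean() with median() in either line gives the median-based variant.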

3. Exercise 2.2 gave the following data (in increasing order) for the attribute age: 13, 15, 16, 16, 19, 20,
20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.

(a) Use smoothing by bin means to smooth the above data, using a bin depth of 3. Illustrate your
steps. Comment on the effect of this technique for the given data.
(b) How might you determine outliers in the data?
(c) What other methods are there for data smoothing?

Answer:

(a) Use smoothing by bin means to smooth the above data, using a bin depth of 3. Illustrate your
steps. Comment on the effect of this technique for the given data.
The following steps are required to smooth the above data using smoothing by bin means with a
bin depth of 3.

• Step 1: Sort the data. (This step is not required here as the data are already sorted.)
• Step 2: Partition the data into equidepth bins of depth 3.
Bin 1: 13, 15, 16 Bin 2: 16, 19, 20 Bin 3: 20, 21, 22
Bin 4: 22, 25, 25 Bin 5: 25, 25, 30 Bin 6: 33, 33, 35
Bin 7: 35, 35, 35 Bin 8: 36, 40, 45 Bin 9: 46, 52, 70
• Step 3: Calculate the arithmetic mean of each bin.
• Step 4: Replace each of the values in each bin by the arithmetic mean calculated for the bin.
Bin 1: 14 2/3, 14 2/3, 14 2/3   Bin 2: 18 1/3, 18 1/3, 18 1/3   Bin 3: 21, 21, 21
Bin 4: 24, 24, 24   Bin 5: 26 2/3, 26 2/3, 26 2/3   Bin 6: 33 2/3, 33 2/3, 33 2/3
Bin 7: 35, 35, 35   Bin 8: 40 1/3, 40 1/3, 40 1/3   Bin 9: 56, 56, 56
This method smooths a sorted data value by consulting its "neighborhood," that is, the values
around it. It performs local smoothing. (A Python sketch of this binning appears after part (c) below.)
(b) How might you determine outliers in the data?
Outliers in the data may be detected by clustering, where similar values are organized into groups,
or ‘clusters’. Values that fall outside of the set of clusters may be considered outliers. Alterna-
tively, a combination of computer and human inspection can be used where a predetermined data
distribution is implemented to allow the computer to identify possible outliers. These possible
outliers can then be verified by human inspection with much less effort than would be required
to verify the entire initial data set.
(c) What other methods are there for data smoothing?
Other methods that can be used for data smoothing include alternate forms of binning such as
smoothing by bin medians or smoothing by bin boundaries. Alternatively, equiwidth bins can be
used to implement any of the forms of binning, where the interval range of values in each bin is
constant. Methods other than binning include using regression techniques to smooth the data by
fitting it to a function such as through linear or multiple regression. Also, classification techniques
can be used to implement concept hierarchies that can smooth the data by rolling-up lower level
concepts to higher-level concepts.
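Illustrating part (a), a minimal Python sketch of smoothing by bin means over the given age data (the
rounding to two decimals is only for display):

# Smoothing by bin means with bin depth 3 on the age data of Exercise 3.3.
age = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
       30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

depth = 3
smoothed = []
for i in range(0, len(age), depth):
    bin_values = age[i:i + depth]                  # one equidepth bin
    bin_mean = sum(bin_values) / len(bin_values)
    smoothed.extend([round(bin_mean, 2)] * len(bin_values))

print(smoothed)   # starts 14.67, 14.67, 14.67, 18.33, ...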

4. Discuss issues to consider during data integration.


Answer:
Data integration involves combining data from multiple sources into a coherent data store. Issues that
must be considered during such integration include:

• Schema integration: The metadata from the different data sources must be integrated in order
to match up equivalent real-world entities. This is referred to as the entity identification problem.
• Handling redundant data: Derived attributes may be redundant, and inconsistent attribute
naming may also lead to redundancies in the resulting data set. Also, duplications at the tuple
level may occur and thus need to be detected and resolved.
• Detection and resolution of data value conflicts: Differences in representation, scaling or
encoding may cause the same real-world entity attribute values to differ in the data sources being
integrated.

5. What are the value ranges of the following normalization methods?

(a) min-max normalization



(b) z-score normalization


(c) z-score normalization using the mean absolute deviation instead of standard deviation
(d) normalization by decimal scaling
Answer:
(a) Min-max normalization can define any value range [new_min_A, new_max_A] and linearly maps the
original data onto this range.
(b) Z-score normalization normalizes the values of an attribute A based on its mean Ā and standard
deviation σ_A. The value range is [(min_A − Ā)/σ_A, (max_A − Ā)/σ_A].
(c) Z-score normalization using the mean absolute deviation is a variation of z-score normalization that
replaces the standard deviation with the mean absolute deviation of A, denoted by s_A, which is
    s_A = (1/n)(|v_1 − Ā| + |v_2 − Ā| + ... + |v_n − Ā|).
The value range is [(min_A − Ā)/s_A, (max_A − Ā)/s_A].
(d) Normalization by decimal scaling normalizes by moving the decimal point of the values of attribute
A. The value range is [min_A/10^j, max_A/10^j], where j is the smallest integer such that
max(|v_i/10^j|) < 1.

6. Use the methods below to normalize the following group of data:


200, 300, 400, 600, 1000
(a) min-max normalization by setting min = 0 and max = 1
(b) z-score normalization
(c) z-score normalization using the mean absolute deviation instead of standard deviation
(d) normalization by decimal scaling
Answer:
(a) Min-max normalization with min = 0 and max = 1 computes the new value as
    v′_i = (v_i − 200)/(1000 − 200) × (1 − 0) + 0.
The normalized data are:
    0, 0.125, 0.25, 0.5, 1
(b) In z-score normalization, a value v_i of A is normalized to v′_i by computing
    v′_i = (v_i − Ā)/σ_A,
where
    Ā = (1/5)(200 + 300 + 400 + 600 + 1000) = 500,
    σ_A = sqrt((1/5)(200² + 300² + ... + 1000²) − Ā²) = 282.8.
The normalized data are:
    −1.06, −0.707, −0.354, 0.354, 1.77

(c) Z-score normalization using the mean absolute deviation instead of the standard deviation replaces
σ_A with s_A, where
    s_A = (1/5)(|200 − 500| + |300 − 500| + ... + |1000 − 500|) = 240.
The normalized data are:
    −1.25, −0.833, −0.417, 0.417, 2.08

(d) The smallest integer j such that max(|v_i/10^j|) < 1 is 3. After normalization by decimal scaling,
the data become:
    0.2, 0.3, 0.4, 0.6, 1.0
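The four normalizations above can be verified with a short script. A minimal NumPy sketch over the
given data (np.std computes the population standard deviation, matching the calculation above):

import numpy as np

v = np.array([200, 300, 400, 600, 1000], dtype=float)

# (a) min-max normalization to [0, 1]
minmax = (v - v.min()) / (v.max() - v.min())

# (b) z-score normalization
z = (v - v.mean()) / v.std()

# (c) z-score normalization using the mean absolute deviation
s_A = np.mean(np.abs(v - v.mean()))
z_mad = (v - v.mean()) / s_A

# (d) normalization by decimal scaling with j = 3
decimal = v / 10**3

print(minmax, z, z_mad, decimal, sep="\n")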

7. Using the data for age given in Exercise 3.3, answer the following:

(a) Use min-max normalization to transform the value 35 for age onto the range [0.0, 1.0].
(b) Use z-score normalization to transform the value 35 for age, where the standard deviation of age
is 12.94 years.
(c) Use normalization by decimal scaling to transform the value 35 for age.
(d) Comment on which method you would prefer to use for the given data, giving reasons as to why.

Answer:

(a) Use min-max normalization to transform the value 35 for age onto the range [0.0, 1.0].
Using the corresponding equation with min_A = 13, max_A = 70, new_min_A = 0, new_max_A = 1.0,
then v = 35 is transformed to v′ = 0.39.
(b) Use z-score normalization to transform the value 35 for age, where the standard deviation of age
is 12.94 years.
Using the corresponding equation where Ā = 809/27 ≈ 29.96 and σ_A = 12.94, then v = 35 is
transformed to v′ = 0.39.
(c) Use normalization by decimal scaling to transform the value 35 for age.
Using the corresponding equation where j = 2, v = 35 is transformed to v ′ = 0.35.
(d) Comment on which method you would prefer to use for the given data, giving reasons as to why.
Given the data, one may prefer decimal scaling for normalization as such a transformation would
maintain the data distribution and be intuitive to interpret, while still allowing mining on spe-
cific age groups. Min-max normalization has the undesired effect of not permitting any future
values to fall outside the current minimum and maximum values without encountering an “out
of bounds error”. As it is probable that such values may be present in future data, this method
is less appropriate. Also, z-score normalization transforms values into measures that represent
their distance from the mean, in terms of standard deviations. It is probable that this type of
transformation would not increase the information value of the attribute in terms of intuitiveness
to users or in usefulness of mining results.
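A few lines of Python suffice to verify the three transformed values; the mean, standard deviation, and
j below are those stated above:

# Verifying the transformations of v = 35 for age (Exercise 3.3 data).
v, min_a, max_a = 35, 13, 70
mean_a, std_a = 809 / 27, 12.94

print(round((v - min_a) / (max_a - min_a), 2))   # min-max         -> 0.39
print(round((v - mean_a) / std_a, 2))            # z-score         -> 0.39
print(round(v / 10**2, 2))                       # decimal scaling -> 0.35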

8. Using the data for age and body fat given in Exercise 2.4, answer the following:

(a) Normalize the two attributes based on z-score normalization.


(b) Calculate the correlation coefficient (Pearson’s product moment coefficient). Are these two at-
tributes positively or negatively correlated? Compute their covariance.

Answer:
(a) Normalize the two variables based on z-score normalization.

age 23 23 27 27 39 41 47 49 50
z-age -1.83 -1.83 -1.51 -1.51 -0.58 -0.42 0.04 0.20 0.28
%fat 9.5 26.5 7.8 17.8 31.4 25.9 27.4 27.2 31.2
z-%fat -2.14 -0.25 -2.33 -1.22 0.29 -0.32 -0.15 -0.18 0.27
age 52 54 54 56 57 58 58 60 61
z-age 0.43 0.59 0.59 0.74 0.82 0.90 0.90 1.06 1.13
%fat 34.6 42.5 28.8 33.4 30.2 34.1 32.9 41.2 35.7
z-%fat 0.65 1.53 0.0 0.51 0.16 0.59 0.46 1.38 0.77

(b) Calculate the correlation coefficient (Pearson’s product moment coefficient). Are these two vari-
ables positively or negatively correlated? Compute their covariance.
The correlation coefficient is 0.82, so the variables are positively correlated. Since Cov(age, %fat) =
r · σ_age · σ_%fat and r > 0, the covariance is positive as well.
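A minimal NumPy sketch for parts (a) and (b), using the age and %fat values tabulated above; the text
does not give a numeric covariance, so the script simply reports what NumPy computes:

import numpy as np

age = np.array([23, 23, 27, 27, 39, 41, 47, 49, 50,
                52, 54, 54, 56, 57, 58, 58, 60, 61], dtype=float)
fat = np.array([9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2,
                34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7])

# (a) z-score normalization (population standard deviation, as in the table above)
z_age = (age - age.mean()) / age.std()
z_fat = (fat - fat.mean()) / fat.std()

# (b) Pearson correlation coefficient (about 0.82) and covariance (positive)
r = np.corrcoef(age, fat)[0, 1]
cov = np.cov(age, fat, ddof=0)[0, 1]
print(round(r, 2), round(cov, 2))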

9. Suppose a group of 12 sales price records has been sorted as follows:

5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215.
Partition them into three bins by each of the following methods.
(a) equal-frequency (equidepth) partitioning
(b) equal-width partitioning
(c) clustering
Answer:
(a) equal-frequency (equidepth) partitioning
Partition the data into equidepth bins of depth 4:
Bin 1: 5, 10, 11, 13   Bin 2: 15, 35, 50, 55   Bin 3: 72, 92, 204, 215
(b) equal-width partitioning
Partitioning the data into 3 equi-width bins will require the width to be (215 − 5)/3 = 70. We
get:
Bin 1: 5, 10, 11, 13, 15, 35, 50, 55, 72 Bin 2: 92 Bin 3: 204, 215
(c) clustering
Using K-means clustering to partition the data into three bins we get:
Bin 1: 5, 10, 11, 13, 15, 35 Bin 2: 50, 55, 72, 92 Bin 3: 204, 215
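Parts (a) and (b) can be reproduced directly; a minimal Python sketch over the given price records
(the clustering in part (c) depends on the chosen algorithm and initialization, so it is not reproduced
here):

prices = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]

# (a) equal-frequency (equidepth) bins of depth 4
depth = 4
equi_depth = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# (b) equal-width bins of width (215 - 5) / 3 = 70, i.e. boundaries at 75 and 145
width = (max(prices) - min(prices)) / 3
edges = [min(prices) + width, min(prices) + 2 * width]
equi_width = [
    [p for p in prices if p <= edges[0]],
    [p for p in prices if edges[0] < p <= edges[1]],
    [p for p in prices if p > edges[1]],
]

print(equi_depth)
print(equi_width)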

10. Use a flowchart to summarize the following procedures for attribute subset selection:
(a) stepwise forward selection
(b) stepwise backward elimination
(c) a combination of forward selection and backward elimination
Answer:

Figure 3.1: Stepwise forward selection.

(a) Stepwise forward selection


See Figure 3.1.
(b) Stepwise backward elimination
See Figure 3.2.
(c) A combination of forward selection and backward elimination
See Figure 3.3.

11. Using the data for age given in Exercise 3.3,


(a) Plot an equal-width histogram of width 10.
(b) Sketch examples of each of the following sampling techniques: SRSWOR, SRSWR, cluster sam-
pling, stratified sampling. Use samples of size 5 and the strata “youth”, “middle-aged”, and
“senior”.
Answer:
(a) Plot an equiwidth histogram of width 10.
See Figure 3.4.
(b) Sketch examples of each of the following sampling techniques: SRSWOR, SRSWR, cluster sam-
pling, stratified sampling. Use samples of size 5 and the strata “young”, “middle-aged”, and
“senior”.
See Figure 3.5.
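A minimal Python sketch of SRSWOR, SRSWR, and stratified sampling on the age data; the strata
cut-offs and the 2/2/1 allocation are chosen to match Figure 3.5 and are otherwise an assumption:

import random

age = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
       30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

# SRSWOR: simple random sample of size 5 without replacement
srswor = random.sample(age, 5)

# SRSWR: simple random sample of size 5 with replacement
srswr = [random.choice(age) for _ in range(5)]

# Stratified sampling: partition into strata, then sample within each stratum
strata = {
    "young": [a for a in age if a < 30],
    "middle-aged": [a for a in age if 30 <= a < 70],
    "senior": [a for a in age if a >= 70],
}
alloc = {"young": 2, "middle-aged": 2, "senior": 1}   # allocation as in Figure 3.5
stratified = {name: random.sample(strata[name], n) for name, n in alloc.items()}

print(srswor, srswr, stratified, sep="\n")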

12. ChiMerge [Ker92] is a supervised, bottom-up (i.e., merge-based) data discretization method. It relies
on χ2 analysis: adjacent intervals with the least χ2 values are merged together until the chosen stopping
criterion is satisfied.

Figure 3.2: Stepwise backward elimination.

(a) Briefly describe how ChiMerge works.


(b) Take the IRIS data set, obtained from the UC-Irvine Machine Learning Data Repository
(http://www.ics.uci.edu/~mlearn/MLRepository.html), as a data set to be discretized. Perform
data discretization for each of the four numerical attributes using the ChiMerge method. (Let the
stopping criterion be: max-interval = 6.) You need to write a small program to do this to avoid
clumsy numerical computation. Submit your simple analysis and your test results: split points,
final intervals, and your documented source program.
Answer:
(a) The ChiMerge algorithm consists of an initialization step and a bottom-up merging process, where
intervals are continuously merged until a termination condition is met. ChiMerge is initialized by
first sorting the training examples according to their value for the attribute being discretized and
then constructing the initial discretization, in which each example is put into its own interval (i.e.,
place an interval boundary before and after each example). The interval merging process contains
two steps, repeated continuously: (1) compute the χ2 value for each pair of adjacent intervals, (2)
merge (combine) the pair of adjacent intervals with the lowest χ2 value. Merging continues until
a predefined stopping criterion is met.
(b) According to the description in (a), the ChiMerge algorithm can be easily implemented. Detailed
empirical results and discussions can be found in this paper: Kerber, R. (1992). ChiMerge :
Discretization of numeric attributes, In Proceedings of the Tenth National Conference on Artificial
Intelligence, 123-128.
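A minimal Python sketch of the merging process described in (a), using the interval count as the
stopping criterion as in the exercise; the toy values and class labels at the end are hypothetical and
are not the IRIS data:

from collections import Counter

def chi2(counts_a, counts_b, classes):
    # Chi-square statistic for the class counts of two adjacent intervals.
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    total = n_a + n_b
    stat = 0.0
    for c in classes:
        col = counts_a.get(c, 0) + counts_b.get(c, 0)
        for n_row, counts in ((n_a, counts_a), (n_b, counts_b)):
            expected = n_row * col / total
            if expected > 0:
                stat += (counts.get(c, 0) - expected) ** 2 / expected
    return stat

def chimerge(values, labels, max_intervals=6):
    # Initialization: sort and give each distinct value its own interval.
    classes = set(labels)
    intervals = []                       # list of (lower bound, Counter of class labels)
    for v, y in sorted(zip(values, labels)):
        if intervals and intervals[-1][0] == v:
            intervals[-1][1][y] += 1     # identical values share one initial interval
        else:
            intervals.append((v, Counter({y: 1})))
    # Bottom-up merging: repeatedly merge the adjacent pair with the lowest chi-square.
    while len(intervals) > max_intervals:
        scores = [chi2(intervals[i][1], intervals[i + 1][1], classes)
                  for i in range(len(intervals) - 1)]
        i = scores.index(min(scores))
        intervals[i:i + 2] = [(intervals[i][0], intervals[i][1] + intervals[i + 1][1])]
    return [lo for lo, _ in intervals]   # lower bounds of the final intervals

# Hypothetical toy data with two classes, only to show the call; not the IRIS data.
print(chimerge([1, 3, 7, 8, 9, 11, 23, 37, 39, 45, 46, 59],
               list("aaaaabbbbaaa"), max_intervals=3))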

13. Propose an algorithm, in pseudocode or in your favorite programming language, for the following:
(a) The automatic generation of a concept hierarchy for categorical data based on the number of
distinct values of attributes in the given schema
(b) The automatic generation of a concept hierarchy for numerical data based on the equal-width
partitioning rule

Figure 3.3: A combination of forward selection and backward elimination.

(c) The automatic generation of a concept hierarchy for numerical data based on the equal-frequency
partitioning rule

Answer:

(a) The automatic generation of a concept hierarchy for categorical data based on the number of
distinct values of attributes in the given schema
Pseudocode for the automatic generation of a concept hierarchy for categorical data based on the
number of distinct values of attributes in the given schema:

begin
// array to hold name and distinct value count of attributes
// used to generate concept hierarchy
array count_ary[];
string count_ary[].name;   // attribute name
int count_ary[].count;     // distinct value count

// array to represent concept hierarchy (as an ordered list of values)
array concept_hierarchy[];

for each attribute 'A' in schema {
    distinct_count = count distinct 'A';
    insert ('A', distinct_count) into count_ary[];
}

sort count_ary[] ascending by count;

for (i = 0; i < count_ary[].length; i++) {
    // generate concept hierarchy nodes
    concept_hierarchy[i] = count_ary[i].name;
}
end

Figure 3.4: An equiwidth histogram of width 10 for age.

To indicate a minimal count threshold necessary for generating another level in the concept
hierarchy, the user could specify an additional parameter.
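A runnable counterpart to the pseudocode above, as a minimal pandas sketch; the location attributes
and their values are hypothetical:

import pandas as pd

def schema_hierarchy(df: pd.DataFrame, attributes):
    # Order the attributes from fewest to most distinct values, i.e. from the
    # most general level of the concept hierarchy to the most specific one.
    return sorted(attributes, key=lambda a: df[a].nunique())

# Hypothetical location data: country < province_or_state < city < street
df = pd.DataFrame({
    "street": ["1 Main St", "2 Oak Ave", "5 Pine Rd", "9 Elm St", "12 Lake Dr"],
    "city": ["Vancouver", "Victoria", "Toronto", "Seattle", "Seattle"],
    "province_or_state": ["BC", "BC", "ON", "WA", "WA"],
    "country": ["Canada", "Canada", "Canada", "USA", "USA"],
})
print(schema_hierarchy(df, ["street", "city", "province_or_state", "country"]))
# ['country', 'province_or_state', 'city', 'street']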
(b) The automatic generation of a concept hierarchy for numeric data based on the equiwidth parti-
tioning rule
begin
// numerical attribute to be used to generate concept hierarchy
string concept_attb;

// array to represent concept hierarchy (as an ordered list of values)
array concept_hierarchy[];
string concept_hierarchy[].name;   // attribute name
int concept_hierarchy[].max;       // max value of bin
int concept_hierarchy[].min;       // min value of bin
int concept_hierarchy[].mean;      // mean value of bin
int concept_hierarchy[].sum;       // sum of bin
int concept_hierarchy[].count;     // tuple count of bin

int range_min;   // min data value - user specified
int range_max;   // max data value - user specified
int step;        // width of bins - user specified
int j = 0;

// initialize concept hierarchy array
for (i = range_min; i < range_max; i += step) {
    concept_hierarchy[j].name = 'level ' + j;
    concept_hierarchy[j].min = i;
    concept_hierarchy[j].max = i + step - 1;
    j++;
}
// extend the final bin's max value if necessary so that it covers range_max
if (concept_hierarchy[j-1].max < range_max) {
    concept_hierarchy[j-1].max = range_max;
}

// assign each value to a bin by incrementing the appropriate sum and count values
for each tuple T in task relevant data set {
    int k = 0;
    while (T.concept_attb > concept_hierarchy[k].max) { k++; }
    concept_hierarchy[k].sum += T.concept_attb;
    concept_hierarchy[k].count++;
}

// calculate the bin metric used to represent the value of each level
// in the concept hierarchy
for (i = 0; i < concept_hierarchy[].length; i++) {
    concept_hierarchy[i].mean = concept_hierarchy[i].sum / concept_hierarchy[i].count;
}
end
The user can specify more meaningful names for the concept hierarchy levels generated by review-
ing the maximum and minimum values of the bins, with respect to background knowledge about
the data (e.g., assigning the labels young, middle-aged, and old to a three-level hierarchy generated
for age.) Also, an alternative binning method could be implemented, such as smoothing by bin
modes.
(c) The automatic generation of a concept hierarchy for numeric data based on the equidepth parti-
tioning rule
Pseudocode for the automatic generation of a concept hierarchy for numeric data based on the
equidepth partitioning rule:
begin
// numerical attribute to be used to generate concept hierarchy
string concept_attb;

// array to represent concept hierarchy (as an ordered list of values)
array concept_hierarchy[];
string concept_hierarchy[].name;   // attribute name
int concept_hierarchy[].max;       // max value of bin
int concept_hierarchy[].min;       // min value of bin
int concept_hierarchy[].mean;      // mean value of bin
int concept_hierarchy[].sum;       // sum of bin
int concept_hierarchy[].count;     // tuple count of bin

int bin_depth;   // depth of bins to be used - user specified
int range_min;   // min data value - user specified
int range_max;   // max data value - user specified

// initialize concept hierarchy array
for (i = 0; i < (range_max / bin_depth); i++) {
    concept_hierarchy[i].name = 'level ' + i;
    concept_hierarchy[i].min = range_max;   // start high so the first value assigned replaces it
    concept_hierarchy[i].max = 0;
}

// sort the task-relevant data set
sort data set ascending by concept_attb;

int j = 1;
int k = 0;

// assign each value to a bin by incrementing the appropriate sum,
// min and max values as necessary
for each tuple T in task relevant data set {
    concept_hierarchy[k].sum += T.concept_attb;
    concept_hierarchy[k].count++;
    if (T.concept_attb <= concept_hierarchy[k].min) {
        concept_hierarchy[k].min = T.concept_attb;
    }
    if (T.concept_attb >= concept_hierarchy[k].max) {
        concept_hierarchy[k].max = T.concept_attb;
    }
    j++;
    if (j > bin_depth) {
        k++;
        j = 1;
    }
}

// calculate the bin metric used to represent the value of each level
// in the concept hierarchy
for (i = 0; i < concept_hierarchy[].length; i++) {
    concept_hierarchy[i].mean = concept_hierarchy[i].sum / concept_hierarchy[i].count;
}
end
This algorithm does not attempt to distribute data values across multiple bins in order to smooth
out any difference between the actual depth of the final bin and the desired depth to be imple-
mented. Also, the user can again specify more meaningful names for the concept hierarchy levels
generated by reviewing the maximum and minimum values of the bins, with respect to background
knowledge about the data.

14. Robust data loading poses a challenge in database systems because the input data are often dirty.
In many cases, an input record may be missing multiple values; some records may be contaminated, with
data values out of range or of a data type different from what is expected. Work out an automated data
cleaning and loading algorithm so that the erroneous data will be marked, and contaminated data will
not be mistakenly inserted into the database during data loading.
Answer:
We can tackle this automated data cleaning and loading problem from the following perspectives:
• Use metadata (e.g., domain, range, dependency, distribution).
• Check the unique rule, the consecutive rule, and the null rule.
• Check for field overloading.
• Apply spell-checking.

• Detect different attribute names which actually have the same meaning.
• Use domain knowledge to detect errors and make corrections.
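As one possible realization of the first three checks, here is a minimal Python sketch of a validating
loader; the field names, the metadata table, and the CSV input format are assumptions made for
illustration only:

import csv

# Hypothetical metadata: expected type and allowed range or domain per field.
METADATA = {
    "customer_id": {"type": int, "min": 1},
    "age":         {"type": int, "min": 0, "max": 120},
    "country":     {"type": str, "domain": {"CA", "US", "MX"}},
}

def validate(record):
    # Return a list of error messages; an empty list means the record is clean.
    errors = []
    for field, rules in METADATA.items():
        raw = record.get(field) or ""
        if raw == "":
            errors.append(f"{field}: missing value")
            continue
        try:
            value = rules["type"](raw)
        except ValueError:
            errors.append(f"{field}: expected {rules['type'].__name__}, got {raw!r}")
            continue
        if "min" in rules and value < rules["min"]:
            errors.append(f"{field}: {value} below minimum {rules['min']}")
        if "max" in rules and value > rules["max"]:
            errors.append(f"{field}: {value} above maximum {rules['max']}")
        if "domain" in rules and value not in rules["domain"]:
            errors.append(f"{field}: {value!r} not in allowed domain")
    return errors

def load(path):
    # Clean records may be inserted; rejected records are marked with their errors.
    clean, rejected = [], []
    with open(path, newline="") as f:
        for record in csv.DictReader(f):
            errors = validate(record)
            (rejected if errors else clean).append((record, errors))
    return clean, rejected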

3.2 Supplementary Exercises


1. The following table contains the attributes name, gender, trait-1, trait-2, trait-3, and trait-4, where
name is an object identifier, gender is a symmetric attribute, and the remaining trait attributes are
asymmetric, describing personal traits of individuals who desire a penpal. Suppose that a service exists
that attempts to find pairs of compatible penpals.

name      gender  trait-1  trait-2  trait-3  trait-4
Kevin     M       N        P        P        N
Caroline  F       N        P        P        N
Erik      M       P        N        N        P
...       ...     ...      ...      ...      ...

[MK note to Jiawei: Can we please discuss this exercise? There are many ambiguities.]
For asymmetric attribute values, let the value P be set to 1 and the value N be set to 0. Suppose that
the distance between objects (potential penpals) is computed based only on the asymmetric variables.

(a) Show the contingency matrix for each pair given Kevin, Caroline, and Erik (based on trait-1 to
trait-4 ).
(b) Compute the invariant dissimilarity of each pair using Equation (??).
[MK note: Based on our discussion, we no longer refer to the simple matching coefficient or the
Jaccard coefficient in Section 7.2.2. Ambiguity: why does part (b) use the equation for symmetric
binary variables when we instruct the reader to use only the four asymmetric variables? Note that
the answers we get for parts (b) and (c) are identical, so I see no point in asking this confusing
question.]
(c) Compute the noninvariant dissimilarity of each pair using Equation (??).
(d) Who do you suggest would make the best pair of penpals? Which pair of individuals would be
the least compatible?
(e) Suppose that we are to include the symmetric variable gender in our analysis. Based on Equa-
tion (??), who would be the most compatible pair, and why?
[MK note: Ambiguity: why are we asking the reader to use the equation for asymmetric variables
when including the symmetric variable gender (and, if so, we would need to specify whether M or F
should be coded as 1)? Shouldn't they be using the technique for variables of mixed types? I looked
at my copy of the answer book and, based on the calculations, it does appear that the equation for
variables of mixed types is used (which contradicts our question). However, I obtain different answers
than the answer book (although my copy may be outdated): d(K,C) = 1/1 = 1 (disagrees with the
answer book); d(K,E) = 4/5 (agrees with the answer book); d(C,E) = 5/5 = 1 (different derivation
than the answer book). Let's discuss. Thanks.]

Tuples
T1 13 T10 22 T19 33
T2 15 T11 25 T20 35
T3 16 T12 25 T21 35
T4 16 T13 25 T22 36
T5 19 T14 25 T23 40
T6 20 T15 30 T24 45
T7 20 T16 33 T25 46
T8 21 T17 33 T26 52
T9 22 T18 33 T27 70

SRSWOR vs. SRSWR


SRSWOR (n = 5) SRSWR (n = 5)
T4 16 T7 20
T6 20 T7 20
T10 22 T20 35
T11 25 T21 35
T26 52 T25 46

Cluster sampling: initial clusters

Cluster 1: T1 13, T2 15, T3 16, T4 16, T5 19
Cluster 2: T6 20, T7 20, T8 21, T9 22, T10 22
Cluster 3: T11 25, T12 25, T13 25, T14 25, T15 30
Cluster 4: T16 33, T17 33, T18 33, T19 33, T20 35
Cluster 5: T21 35, T22 36, T23 40, T24 45, T25 46
Cluster 6: T26 52, T27 70

Cluster sampling (m = 2): clusters 2 and 5 are selected
Cluster 2: T6 20, T7 20, T8 21, T9 22, T10 22
Cluster 5: T21 35, T22 36, T23 40, T24 45, T25 46

Stratified Sampling
T1 13 young T10 22 young T19 33 middle age
T2 15 young T11 25 young T20 35 middle age
T3 16 young T12 25 young T21 35 middle age
T4 16 young T13 25 young T22 36 middle age
T5 19 young T14 25 young T23 40 middle age
T6 20 young T15 30 middle age T24 45 middle age
T7 20 young T16 33 middle age T25 46 middle age
T8 21 young T17 33 middle age T26 52 middle age
T9 22 young T18 33 middle age T27 70 senior

Stratified Sampling (according to age)


T4 16 young
T12 25 young
T17 33 middle age
T25 46 middle age
T27 70 senior

Figure 3.5: Examples of sampling: SRSWOR, SRSWR, cluster sampling, stratified sampling.
