Unit 2: Quantitative Techniques
Classification of Data
Introduction to the Classification of Data
2. The basis of classification is the similarity of characteristics or features inherent in the collected
data.
5. To facilitate comparison.
3. Mutually Exclusive: When a classification is mutually exclusive, each item of the data can be
placed only in one of the classes.
4. Flexibility: A good classification should be capable of being adjusted according to the changed
situations and conditions.
5. Stability: The principle of classification, once decided, should remain the same throughout the analysis; otherwise it will not be possible to get meaningful results. In the absence of stability, the results of the same type of investigation at different time periods may not be comparable.
8. Revealing: A classification is said to be revealing if it brings out the essential features of the collected data. This can be done by selecting a suitable number of classes. Using too few classes leads to over-summarization, while too many classes fail to reveal any pattern in the behaviour of the variable.
• The nature of classification depends upon the purpose and objective of investigation. The following are some
very common types of classification:
• 1. Geographical (or spatial) classification: When the data are classified according to geographical location or
region, it is called a geographical classification.
• 2. Chronological classification: When the data are classified on the basis of their time of occurrence, it is called a chronological classification. Various time series such as National Income figures (annual), annual output of wheat, monthly expenditure of a household, daily consumption of milk, etc., are some examples of chronological classification.
• 3. Conditional classification: When the data are classified according to certain conditions, other than
geographical or chronological, it is called a conditional classification.
Types of Classification
(a) Discrete Variable: A discrete variable can assume only some specific values in a given interval. For example, the number of children in a family, the number of rooms on each floor of a multi-storeyed building, etc.
(b) Continuous Variable: A continuous variable can assume any value in a given interval. For example, the monthly income of a worker can take any value between, say, 1,000 and 2,500. The income of a worker can be 1,500.25, etc.
Formation of Frequency Distribution
1. Construction of a Discrete Frequency Distribution
A discrete frequency distribution may be ungrouped or grouped. In an ungrouped frequency distribution, various values of the variable are shown along with their corresponding frequencies. If this distribution fails to reveal any pattern, grouping of the observations becomes necessary. The resulting distribution is known as a grouped frequency distribution of a discrete variable. Furthermore, a grouped frequency distribution is also constructed when the number of possible values that the variable can take is large.
Ungrouped Frequency Distribution of a Discrete Variable
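As an illustration (not taken from the text), the sketch below tallies a hypothetical set of observations of a discrete variable into an ungrouped frequency distribution; the data values are assumed for the example.

```python
# A minimal sketch (hypothetical data): ungrouped frequency distribution
# of a discrete variable using a simple tally.
from collections import Counter

# Assumed observations: number of children in 20 families
children = [2, 1, 3, 2, 0, 1, 2, 4, 1, 2, 3, 0, 1, 2, 2, 3, 1, 0, 2, 1]

frequency = Counter(children)            # value -> frequency
for value in sorted(frequency):
    print(value, frequency[value])
```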
2. Construction of a Continuous Frequency Distribution
• As opposed to a discrete variable, a continuous variable can take any value in an interval. Measurements like height, age, income, time, etc., are some examples of a continuous variable. As mentioned earlier, when data are collected on these variables, they will show discreteness, which depends upon the degree of precision of the measuring instrument. Therefore, in such a situation, even if the recorded data appear to be discrete, they should be treated as continuous. Since a continuous variable can take any value in a given interval, the frequency distribution of a continuous variable is always a grouped frequency distribution.
• To construct a grouped frequency distribution, the whole interval of the continuous variable, given by the
difference of its largest and the smallest possible values, is divided into various mutually exclusive and
exhaustive sub-intervals. These sub-intervals are termed as class intervals. Then, the frequency of each class
interval is determined by counting the number of observations falling under it.
The construction of such a distribution is explained below. Several factors need to be taken into consideration in the construction of a frequency distribution of a continuous variable. These are explained in the following slides.
The approximate number of classes can also be determined by Sturges' formula: n = 1 + 3.322 × log₁₀ N, where n (rounded to the next whole number) denotes the number of classes and N denotes the total number of observations.
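A minimal sketch of Sturges' rule, using an assumed value of N for illustration:

```python
# Sturges' rule for the approximate number of classes; N is assumed here.
import math

N = 100                              # assumed total number of observations
n = 1 + 3.322 * math.log10(N)        # Sturges' formula
print(math.ceil(n))                  # rounded to the next whole number -> 8
```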
Designation of Class Limits
• The designation of class limits for various class intervals can be done in two ways: (1) Exclusive Method and
(2) Inclusive Method.
• Exclusive Method: In this method the upper limit of a class is taken to be equal to the lower limit of the
following class. To keep various class intervals as mutually exclusive, the observations with magnitude greater
than or equal to lower limit but less than the upper limit of a class are included in it. For example, if the lower
limit of a class is 10 and its upper limit is 20, then this class, written as 10-20, includes all the observations
which are greater than or equal to 10 but less than 20. The observations with magnitude 20 will be included in
the next class.
• Inclusive Method: Here, all observations with magnitude greater than or equal to the lower limit and less than or equal to the upper limit of a class are included in it.
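The exclusive method can be illustrated with a short sketch; the observations and class limits below are assumed for the example, and each class counts values greater than or equal to its lower limit and strictly less than its upper limit.

```python
# Tallying assumed observations into exclusive-method classes:
# each class covers values >= lower limit and < upper limit.
data = [12, 19, 20, 25, 33, 10, 27, 38, 20, 15]   # assumed observations
limits = [10, 20, 30, 40]                          # class limits for 10-20, 20-30, 30-40

freq = {(lo, hi): 0 for lo, hi in zip(limits, limits[1:])}
for x in data:
    for lo, hi in freq:
        if lo <= x < hi:                           # upper limit is exclusive
            freq[(lo, hi)] += 1
            break

for (lo, hi), f in freq.items():
    print(f"{lo}-{hi}: {f}")                       # 10-20: 4, 20-30: 4, 30-40: 2
```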
In exclusive types of class intervals, the mid-value of a class is defined as the arithmetic mean of its lower and upper limits. However, in the case of inclusive types of class intervals, there is a gap between the upper limit of a class and the lower limit of the following class, which is eliminated by determining the class boundaries. Here, the mid-value of a class is defined as the arithmetic mean of its lower and upper boundaries. To find class boundaries, we note that the given data on the measurements of the diameter of a wire are expressed in millimetres, approximated up to two places after the decimal.
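A sketch of the boundary adjustment described above, assuming inclusive classes recorded to two decimal places so that the gap between consecutive classes is 0.01 mm; the class limits below are hypothetical.

```python
# Converting inclusive class limits to class boundaries and mid-values,
# assuming measurements recorded to two decimal places (gap of 0.01 mm).
inclusive_classes = [(1.00, 1.09), (1.10, 1.19), (1.20, 1.29)]   # assumed limits in mm
gap = 0.01

for lower, upper in inclusive_classes:
    lcb = lower - gap / 2                  # lower class boundary
    ucb = upper + gap / 2                  # upper class boundary
    mid = (lcb + ucb) / 2                  # mid-value from the boundaries
    print(f"{lower:.2f}-{upper:.2f}: boundaries {lcb:.3f}-{ucb:.3f}, mid-value {mid:.3f}")
```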
3. Relative or Percentage Frequency Distribution
If instead of frequencies of various classes their relative or percentage frequencies are written, we get a relative or
percentage frequency distribution.
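A brief sketch of the conversion, using assumed class frequencies:

```python
# Relative and percentage frequencies from assumed class frequencies.
frequencies = {"10-20": 4, "20-30": 4, "30-40": 2}
total = sum(frequencies.values())

for cls, f in frequencies.items():
    relative = f / total
    print(f"{cls}: relative {relative:.2f}, percentage {100 * relative:.0f}%")
```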
4. Cumulative Frequency Distribution
• Less than cumulative frequency distribution: It is obtained by finding the cumulative total of frequencies starting from the lowest to the highest class.
• More than cumulative frequency distribution: It is obtained by finding the cumulative total of frequencies starting from the highest to the lowest class. These frequency distributions, for the data on the measurements of diameter of a wire, are shown in Table I and Table II respectively.
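A sketch of both cumulative distributions on assumed frequencies; the "less than" totals run from the lowest class upwards and the "more than" totals from the highest class downwards.

```python
# 'Less than' and 'more than' cumulative frequencies for assumed data.
freq = [4, 4, 2]                       # frequencies of classes 10-20, 20-30, 30-40

less_than, total = [], 0
for f in freq:                         # cumulate from the lowest class upwards
    total += f
    less_than.append(total)

more_than, total = [], 0
for f in reversed(freq):               # cumulate from the highest class downwards
    total += f
    more_than.append(total)
more_than.reverse()

print(less_than)                       # [4, 8, 10]
print(more_than)                       # [10, 6, 2]
```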
5. Frequency Density
Frequency density in a class is defined as the number of observations per unit of its width. Frequency density gives the rate of concentration of observations in a class.
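A small sketch with assumed class intervals of unequal width:

```python
# Frequency density = frequency / class width, for assumed classes of unequal width.
classes = [(10, 20, 4), (20, 30, 4), (30, 45, 3)]   # (lower, upper, frequency)

for lower, upper, f in classes:
    density = f / (upper - lower)                   # observations per unit of width
    print(f"{lower}-{upper}: density {density:.2f}")
```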
Bivariate
In the frequency distributions discussed so far, the data are classified according to only one characteristic. These distributions are known as univariate frequency distributions. There may be a situation where it is necessary to classify data simultaneously according to two characteristics. A frequency distribution obtained by the simultaneous classification of data according to two characteristics is known as a bivariate frequency distribution. An example of such a classification is given below, where 100 couples are classified according to the two characteristics, Age of Husband and Age of Wife. The tabular representation of a bivariate frequency distribution is known as a contingency table.
Multivariate
If the classification is done simultaneously according to more than two characteristics, the resulting frequency distribution is known as a multivariate frequency distribution.
• Example: Find the lower and upper limits of the classes when their mid-values are given as 15, 25, 35, 45, 55,
65, 75, 85 and 95.
• Solution: Note that the difference between two successive mid-values is the same, i.e., 10. Half of this difference is subtracted from and added to the mid-value of a class in order to get the lower limit and the upper limit, respectively. Hence, the required class intervals are 10 - 20, 20 - 30, 30 - 40, 40 - 50, 50 - 60, 60 - 70, 70 - 80, 80 - 90, 90 - 100.
• Example: Find the lower and upper limits of the classes if their mid-values are 10, 20, 35, 55, 85.
• Solution: Here the differences between successive mid-values are not equal. In order to find the limits of the first class, half of the difference between the second and first mid-values is subtracted from and added to the first mid-value. Therefore, the limits of the first class are 5 - 15. The lower limit of the second class is taken as equal to the upper limit of the first class.
• The upper limit of a class = lower limit + width, where width = 2(Mid-value - lower limit). The upper limit of the
second class = 15 + 2(20 - 15) = 25. Thus, second class interval will be 15 - 25. Similarly, we can find the limits
of third, fourth and fifth classes as 25 - 45, 45 - 65 and 65 - 105, respectively.
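The procedure used in the two examples above can be sketched in a few lines; the mid-values are those of the second example.

```python
# Recovering class limits from mid-values (second example: unequally spaced mid-values).
mids = [10, 20, 35, 55, 85]

lower = mids[0] - (mids[1] - mids[0]) / 2        # lower limit of the first class: 5
limits = []
for m in mids:
    width = 2 * (m - lower)                      # width = 2 * (mid-value - lower limit)
    upper = lower + width
    limits.append((lower, upper))
    lower = upper                                # next lower limit = this upper limit
print(limits)   # [(5.0, 15.0), (15.0, 25.0), (25.0, 45.0), (45.0, 65.0), (65.0, 105.0)]
```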
• Data sampling provides a collection of techniques that transform a training dataset in order to balance or better
balance the class distribution.
• Once balanced, standard machine learning algorithms can be trained directly on the transformed dataset
without any modification.
• This allows the challenge of imbalanced classification, even with severely imbalanced class distributions, to be
addressed with a data preparation method.
• There are many different types of data sampling methods that can be used, and there is no single best method
to use on all classification problems and with all classification models. Like choosing a predictive model, careful
experimentation is required to discover what works best for your project.
• A chief problem with imbalanced classification datasets is that standard machine learning algorithms do not
perform well on them. Many machine learning algorithms rely upon the class distribution in the training dataset
to gauge the likelihood of observing examples in each class when the model will be used to make predictions.
• As such, many machine learning algorithms, like decision trees, k-nearest neighbors, and neural networks, will learn that the minority class is not as important as the majority class, and will therefore pay more attention to, and perform better on, the majority class.
• The hitch with imbalanced datasets is that standard classification learning algorithms are often biased towards
the majority classes (known as “negative”) and therefore there is a higher misclassification rate in the minority
class instances (called the “positive” class).
• — Page 79, Learning from Imbalanced Data Sets, 2018.
• This is a problem because the minority class is exactly the class that we care most about in imbalanced
classification problems.
• The reason is that the majority class often reflects a normal case, whereas the minority class represents a positive case for a diagnosis, fault, fraud, or other type of exceptional circumstance.
• The most popular solution to an imbalanced classification problem is to change the composition of the training
dataset.
• Techniques designed to change the class distribution in the training dataset are generally referred to as
sampling methods or resampling methods as we are sampling an existing data sample.
• Sampling methods seem to be the dominate type of approach in the community as they tackle imbalanced
learning in a straightforward manner.
• — Page 3, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.
• The reason that sampling methods are so common is because they are simple to understand and implement,
and because once applied to transform the training dataset, a suite of standard machine learning algorithms can
then be used directly.
• This means that any of the tens or hundreds of machine learning algorithms developed for balanced (or mostly balanced) classification can then be fit on the training dataset without any modification to adapt them for the imbalance in observations.
• Sampling methods are a very popular approach for dealing with imbalanced data. These methods are primarily employed to address the problem of relative rarity but do not address the issue of absolute rarity.
There are two main types of data sampling used on the training dataset: oversampling and undersampling. In the next
section, we will take a tour of popular methods from each type, as well as methods that combine multiple approaches.
• Oversampling Techniques
• Oversampling methods duplicate examples in the minority class or synthesize new examples from the examples in the
minority class.
• Some of the more widely used and implemented oversampling methods include:
• Random Oversampling
• Synthetic Minority Oversampling Technique (SMOTE)
• Borderline-SMOTE
• Borderline Oversampling with SVM
• Adaptive Synthetic Sampling (ADASYN)
• There are many extensions to the SMOTE method that aim to be more selective about the types of examples in the minority class from which new synthetic examples are generated.
• Borderline-SMOTE involves selecting those instances of the minority class that are misclassified, such as with
a k-nearest neighbor classification model, and only generating synthetic samples that are “difficult” to classify.
• Borderline Oversampling is an extension to SMOTE that fits an SVM to the dataset and uses the decision
boundary as defined by the support vectors as the basis for generating synthetic examples, again based on the
idea that the decision boundary is the area where more minority examples are required.
• Adaptive Synthetic Sampling (ADASYN) is another extension to SMOTE that generates synthetic samples
inversely proportional to the density of the examples in the minority class. It is designed to create synthetic
examples in regions of the feature space where the density of minority examples is low, and fewer or none
where the density is high.
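• The oversampling methods listed above are available in the open-source imbalanced-learn library; a minimal sketch (assuming imbalanced-learn and scikit-learn are installed, and using a synthetic imbalanced dataset rather than any dataset from this text) is:

```python
# Oversampling a synthetic imbalanced dataset with imbalanced-learn.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import (RandomOverSampler, SMOTE, BorderlineSMOTE,
                                    SVMSMOTE, ADASYN)

# Synthetic dataset: roughly 99% majority class, 1% minority class
X, y = make_classification(n_samples=10000, n_classes=2, weights=[0.99, 0.01],
                           random_state=1)
print("original:", Counter(y))

for sampler in [RandomOverSampler(), SMOTE(), BorderlineSMOTE(), SVMSMOTE(), ADASYN()]:
    X_res, y_res = sampler.fit_resample(X, y)     # add minority-class examples
    print(type(sampler).__name__, Counter(y_res))
```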
• Undersampling Techniques
• Undersampling methods delete or select a subset of examples from the majority class.
• Some of the more widely used and implemented undersampling methods include:
• Random Undersampling
• Condensed Nearest Neighbor Rule (CNN)
• Near Miss Undersampling
• Tomek Links Undersampling
• Edited Nearest Neighbors Rule (ENN)
• One-Sided Selection (OSS)
• Neighborhood Cleaning Rule (NCR)
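• A corresponding sketch for several of the undersampling methods above, again assuming imbalanced-learn is installed and using a synthetic dataset:

```python
# Undersampling a synthetic imbalanced dataset with imbalanced-learn.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import (RandomUnderSampler, NearMiss, TomekLinks,
                                     EditedNearestNeighbours, OneSidedSelection,
                                     NeighbourhoodCleaningRule)

X, y = make_classification(n_samples=10000, n_classes=2, weights=[0.99, 0.01],
                           random_state=1)

for sampler in [RandomUnderSampler(), NearMiss(version=1), TomekLinks(),
                EditedNearestNeighbours(), OneSidedSelection(),
                NeighbourhoodCleaningRule()]:
    X_res, y_res = sampler.fit_resample(X, y)     # remove majority-class examples
    print(type(sampler).__name__, Counter(y_res))
```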
• Combinations of Techniques
• Although an oversampling or undersampling method when used alone on a training dataset can be effective,
experiments have shown that applying both types of techniques together can often result in better overall
performance of a model fit on the resulting transformed dataset.
• Some of the more widely used and implemented combinations of data sampling methods include:
• SMOTE and Random Undersampling
• SMOTE and Tomek Links
• SMOTE and Edited Nearest Neighbors Rule
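• A sketch of the combined approaches, assuming imbalanced-learn is installed; the sampling ratios in the pipeline are illustrative choices, not recommendations.

```python
# Combined over- and undersampling with imbalanced-learn.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek, SMOTEENN
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

X, y = make_classification(n_samples=10000, n_classes=2, weights=[0.99, 0.01],
                           random_state=1)

# Ready-made combinations: SMOTE + Tomek Links, SMOTE + Edited Nearest Neighbours
for sampler in [SMOTETomek(), SMOTEENN()]:
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))

# SMOTE followed by random undersampling (illustrative ratios)
combo = Pipeline([("over", SMOTE(sampling_strategy=0.1)),
                  ("under", RandomUnderSampler(sampling_strategy=0.5))])
X_res, y_res = combo.fit_resample(X, y)
print("SMOTE + RandomUnderSampler", Counter(y_res))
```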