
Subject Name: Quantitative Techniques
Unit 2: Classification of Data
Introduction to Classification of Data

• The collected data are a complex and unorganised mass of figures which is very difficult to analyse and interpret. Therefore, it becomes necessary to organise the data so that their broad features become easier to grasp.

• Further, in order to apply the tools of analysis and interpretation, it is essential that the data are arranged in a definite form. This task is accomplished by the process of classification and tabulation.



Chief Characteristics of Classification

1. The collected data are arranged into homogeneous groups.

2. The basis of classification is the similarity of characteristics or features inherent in the collected
data.

3. Classification of data signifies unity in diversity.

4. Classification of data may be actual or notional.

5. Classification of data may be according to certain measurable or non-measurable characteristics, or according to some combination of both.



Objectives of Good Classification

1. To present a mass of data in a condensed form.

2. To highlight the points of similarity and dissimilarity.

3. To bring out the relationship between variables.

4. To highlight the effect of one variable by eliminating the effect of others.

5. To facilitate comparison.

6. To prepare data for tabulation and analysis.



Requisites of Good Classification

1. Unambiguous: The classification should not lead to any ambiguity or confusion.

2. Exhaustive: A classification is said to be exhaustive if there is no item that cannot be allotted a class.

3. Mutually Exclusive: When a classification is mutually exclusive, each item of the data can be
placed only in one of the classes.

4. Flexibility: A good classification should be capable of being adjusted according to the changed
situations and conditions.




5. Stability: The principle of classification, once decided, should remain same throughout the analysis, otherwise it
will not be possible to get meaningful results. In the absence of stability, the results of the same type of
investigation at different time periods may not be comparable.

6. Suitability: The classification should be suitable to the objective(s) of investigation.

7. Homogeneity: A classification is said to be homogeneous if similar items are placed in a class.

8. Revealing: A classification is said to be revealing if it brings out the essential features of the collected data. This can be done by selecting a suitable number of classes: too few classes mean over-summarisation, while too many classes fail to reveal any pattern of behaviour of the variable.



Types of Classification

• The nature of classification depends upon the purpose and objective of investigation. The following are some
very common types of classification:

• 1. Geographical (or spatial) classification: When the data are classified according to geographical location or
region, it is called a geographical classification.

• 2. Chronological classification: When the data are classified on the basis of its time of occurrence, it is called
a chronological classification. Various time series such as National Income figures (annual), annual output of
wheat, monthly expenditure of a household, daily consumption of milk, etc., are some examples of chronological
classification.

• 3. Conditional classification: When the data are classified according to certain conditions, other than
geographical or chronological, it is called a conditional classification.


4. Qualitative classification or classification according to attributes:


When the characteristics of the data are non-measurable, the data are called qualitative data. Examples of non-measurable characteristics are the sex of a person, marital status, colour, honesty, intelligence, etc. These characteristics are also known as attributes. When qualitative data are given, the various items can be classified into two or more groups according to a characteristic. If the data are classified into only two categories according to the presence or absence of an attribute, the classification is termed a dichotomous or twofold classification.
It should be noted here that in a two-way classification, it is possible to have simultaneous classification according
to an attribute and a variable. On the other hand, if the data are classified into more than two categories according
to an attribute, it is called a manifold classification. For example, classification of various students of a college
according to the colour of their eyes like black, brown, grey, blue, etc. The conditional classification, given above,
is also an example of a manifold classification.
If the classification is done according to a single attribute, it is known as a one-way classification. On the other
hand, the classification done according to two or more attributes is known as a two-way or multiway classification,
respectively.


Examples of Two- and Three-way Classification


• Quantitative classification or classification according to variables: In the case of quantitative data, the characteristic is measurable in terms of numbers and is termed a variable, e.g., weight, height, income, the number of children in a family, the number of crime cases in a city, the life of an electric bulb, etc. A variable can take a different value for each item of the population or universe.

• Variables can be of two types (a) Discrete and (b) Continuous.

(a) Discrete Variable: A discrete variable can assume only some specific values in a given interval, e.g., the number of children in a family, the number of rooms on each floor of a multi-storeyed building, etc.
(b) Continuous Variable: A continuous variable can assume any value in a given interval. For example, the monthly income of a worker can take any value between, say, 1,000 and 2,500; it can be 1,500.25, etc.

Formation of Frequency Distribution

1. Construction of a Discrete Frequency Distribution
A discrete frequency distribution may be ungrouped or grouped. In an ungrouped frequency distribution, the various values of the variable are shown along with their corresponding frequencies. If this distribution fails to reveal any pattern, grouping of the observations becomes necessary. The resulting distribution is known as a grouped frequency distribution of a discrete variable. A grouped frequency distribution is also constructed when the number of possible values the variable can take is large.
Ungrouped Frequency Distribution of a Discrete Variable


Counting of Frequency using Tally Marks


The method of tally marks is used to count the number of observations, or the frequency, of each value of the variable. Each possible value of the variable is written in a column. For every observation, a tally mark, denoted by |, is recorded against its corresponding value. Every fifth observation is denoted by a tally mark that crosses the earlier four, so that the marks form groups of five. The method of tally marks is used below to determine the frequencies of the various values of the variable for the data given above.
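Since the slide's data table is not reproduced here, the counting step can be sketched with hypothetical observations; `collections.Counter` does programmatically what tally marks do by hand:

```python
from collections import Counter

# Hypothetical observations of a discrete variable,
# e.g. the number of children per family in a small survey.
observations = [2, 1, 0, 2, 3, 1, 2, 2, 0, 1, 3, 2]

# Counter records the frequency of each value of the variable,
# just as tally marks accumulate a stroke per observation.
frequency = Counter(observations)

for value in sorted(frequency):
    print(value, "|" * frequency[value], frequency[value])
```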


Grouped Frequency Distribution of a Discrete Variable


2. Construction of a Continuous Frequency Distribution
• As opposed to a discrete variable, a continuous variable can take any value in an interval. Measurements like height, age, income, time, etc., are examples of continuous variables. As mentioned earlier, data collected on such variables will show discreteness, which depends upon the degree of precision of the measuring instrument. Therefore, even if the recorded data appear to be discrete, they should be treated as continuous. Since a continuous variable can take any value in a given interval, the frequency distribution of a continuous variable is always a grouped frequency distribution.

• To construct a grouped frequency distribution, the whole interval of the continuous variable, given by the difference between its largest and smallest possible values, is divided into various mutually exclusive and exhaustive sub-intervals. These sub-intervals are termed class intervals. Then, the frequency of each class interval is determined by counting the number of observations falling within it.
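The procedure just described can be sketched in a few lines of Python; the data and the class limits below are hypothetical:

```python
# Grouping a continuous variable into mutually exclusive, exhaustive
# class intervals of equal width (lower limit included, upper excluded).
data = [12.5, 18.3, 21.0, 25.7, 33.4, 37.9, 41.2, 15.6, 28.8, 39.5]

lower, width, n_classes = 10, 10, 4   # classes 10-20, 20-30, 30-40, 40-50

frequencies = [0] * n_classes
for x in data:
    k = int((x - lower) // width)     # index of the class containing x
    frequencies[k] += 1

for i, f in enumerate(frequencies):
    lo = lower + i * width
    print(f"{lo}-{lo + width}: {f}")
```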

The construction of such a distribution is explained below. Several factors need to be taken into consideration when constructing a frequency distribution of a continuous variable; these are explained in the following slides.


• Number of Class Intervals


Though there is no hard and fast rule regarding the number of classes to be formed, their number should be neither very large nor very small. If there are too many classes, the frequency distribution appears too fragmented to reveal the pattern of behaviour of the characteristic. Fewer classes imply broader class intervals, each of which would then include a large number of observations.

The approximate number of classes can also be determined by Sturges' formula: n = 1 + 3.322 × log10 N, where n (rounded up to the next whole number) denotes the number of classes and N denotes the total number of observations.
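As a quick sketch, Sturges' rule can be computed directly, rounding up to the next whole number as stated above:

```python
import math

# Sturges' rule: n = 1 + 3.322 * log10(N), rounded up.
def sturges_classes(N):
    return math.ceil(1 + 3.322 * math.log10(N))

print(sturges_classes(100))    # log10(100) = 2, so 1 + 6.644 rounds up to 8
print(sturges_classes(1000))   # 1 + 9.966 rounds up to 11
```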


• Width of Class Intervals


After determining the number of class intervals, one has to determine their width. The problem of determining the
width of a class interval is closely related to the number of class intervals. As far as possible, all the class intervals
should be of equal width. However, there can be situations where it may not be possible to have equal width of all
the classes. Suppose that there is a frequency distribution, having all classes of equal width, in which the pattern
of behaviour of the observations is not regular, i.e., there are nil or very few observations in some classes while
there is concentration of observations in other classes. In such a situation, one may be compelled to have unequal
class intervals in order that the frequency distribution becomes regular.

Designation of Class Limits
• The designation of class limits for various class intervals can be done in two ways: (1) Exclusive Method and
(2) Inclusive Method.
• Exclusive Method: In this method the upper limit of a class is taken to be equal to the lower limit of the
following class. To keep various class intervals as mutually exclusive, the observations with magnitude greater
than or equal to lower limit but less than the upper limit of a class are included in it. For example, if the lower
limit of a class is 10 and its upper limit is 20, then this class, written as 10-20, includes all the observations
which are greater than or equal to 10 but less than 20. The observations with magnitude 20 will be included in
the next class.

• Inclusive Method: Here all observations with magnitude greater than or equal to the lower limit and less than or equal to the upper limit of a class are included in it.
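The two methods can be contrasted in a small sketch, using the class 10-20 from the example above:

```python
# Exclusive method: lower limit included, upper limit excluded.
def in_class_exclusive(x, lower, upper):
    return lower <= x < upper

# Inclusive method: both limits included.
def in_class_inclusive(x, lower, upper):
    return lower <= x <= upper

print(in_class_exclusive(20, 10, 20))   # False: 20 falls in the next class
print(in_class_inclusive(20, 10, 20))   # True: 20 belongs to 10-20
```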


Lastly, we consider the mid-value of the class

In exclusive-type class intervals, the mid-value of a class is defined as the arithmetic mean of its lower and upper limits. In the case of inclusive-type class intervals, however, there is a gap between the upper limit of a class and the lower limit of the following class, which is eliminated by determining the class boundaries. Here, the mid-value of a class is defined as the arithmetic mean of its lower and upper boundaries. To find the class boundaries, note that the data referred to here, measurements of the diameter of a wire, are expressed in millimetres, approximated to two decimal places.
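Since the wire-diameter table itself is not reproduced, a sketch with hypothetical inclusive classes shows how boundaries and mid-values are obtained; the gap between successive classes is assumed here to be 0.01 mm, so each limit is moved half the gap outwards:

```python
# Hypothetical inclusive classes of wire diameters (mm, two decimals):
# 1.00-1.09, 1.10-1.19, 1.20-1.29. The gap between classes is 0.01.
classes = [(1.00, 1.09), (1.10, 1.19), (1.20, 1.29)]
gap = 0.01

for lower, upper in classes:
    lb = lower - gap / 2          # lower class boundary
    ub = upper + gap / 2          # upper class boundary
    mid = (lb + ub) / 2           # mid-value = mean of the boundaries
    print(f"{lower}-{upper}: boundaries {lb:.3f}-{ub:.3f}, mid-value {mid:.3f}")
```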


3. Relative or Percentage Frequency Distribution
If instead of frequencies of various classes their relative or percentage frequencies are written, we get a relative or
percentage frequency distribution.
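A minimal sketch, with hypothetical class frequencies:

```python
# Relative frequency = class frequency / total; percentage = 100 x relative.
frequencies = [3, 7, 12, 5, 3]
total = sum(frequencies)

relative = [f / total for f in frequencies]
percentage = [100 * r for r in relative]

print(relative)     # fractions summing to 1
print(percentage)   # percentages summing to 100
```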


4. Cumulative Frequency Distribution

There are two types of cumulative frequency distributions.


• Less than cumulative frequency distribution: It is obtained by successively adding the frequencies of all the previous classes, including the class against which it is written. Cumulation starts from the lowest class and proceeds to the highest.

• More than cumulative frequency distribution: It is obtained by finding the cumulative total of frequencies starting from the highest class and proceeding to the lowest. These frequency distributions, for the data on the measurements of the diameter of a wire, are shown in Table I and Table II respectively.
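Both cumulations can be sketched with `itertools.accumulate` on hypothetical frequencies:

```python
from itertools import accumulate

# Hypothetical frequencies for classes 10-20, 20-30, 30-40, 40-50.
frequencies = [3, 7, 12, 5]

# "Less than" cumulation runs from the lowest class upwards;
# "more than" cumulation runs from the highest class downwards.
less_than = list(accumulate(frequencies))
more_than = list(accumulate(frequencies[::-1]))[::-1]

print(less_than)   # [3, 10, 22, 27]
print(more_than)   # [27, 24, 17, 5]
```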


5. Frequency Density

Frequency density in a class is defined as the number of observations per unit of its width. Frequency density gives the rate of concentration of observations in a class.
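A small sketch with hypothetical classes of unequal width:

```python
# Frequency density = class frequency / class width, making classes
# of unequal width comparable. All figures are hypothetical.
classes = [(0, 10, 5), (10, 20, 12), (20, 40, 8)]   # (lower, upper, frequency)

densities = [freq / (upper - lower) for lower, upper, freq in classes]
for (lower, upper, freq), d in zip(classes, densities):
    print(f"{lower}-{upper}: frequency {freq}, density {d}")
```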



Bivariate and Multivariate Frequency Distribution

Bivariate
In the frequency distributions discussed so far, the data are classified according to only one characteristic. These distributions are known as univariate frequency distributions. There may be situations where it is necessary to classify data simultaneously according to two characteristics. A frequency distribution obtained by the simultaneous classification of data according to two characteristics is known as a bivariate frequency distribution. An example of such a classification is one in which 100 couples are classified according to the two characteristics Age of Husband and Age of Wife. The tabular representation of a bivariate frequency distribution is known as a contingency table.
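Since the couples table is not reproduced here, a contingency table can be sketched from hypothetical (husband, wife) age-class pairs:

```python
from collections import Counter

# Hypothetical couples classified by age-class of husband and of wife.
pairs = [("20-30", "20-30"), ("20-30", "20-30"), ("30-40", "20-30"),
         ("30-40", "30-40"), ("30-40", "30-40"), ("40-50", "30-40")]

# Each cell of the contingency table is the frequency of one
# (husband age-class, wife age-class) combination.
table = Counter(pairs)

for (husband, wife), freq in sorted(table.items()):
    print(f"husband {husband}, wife {wife}: {freq}")
```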


Multivariate

If the classification is done, simultaneously, according to more than two characteristics, the resulting frequency
distribution is known as a multivariate frequency distribution.
• Example: Find the lower and upper limits of the classes when their mid-values are given as 15, 25, 35, 45, 55, 65, 75, 85 and 95.

• Solution: Note that the difference between two successive mid-values is the same, i.e., 10. Half of this difference is subtracted from and added to the mid-value of a class to get the lower limit and the upper limit, respectively. Hence, the required class intervals are 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80, 80-90 and 90-100.
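The solution can be checked in code:

```python
# Class limits from equally spaced mid-values: half the common
# difference is subtracted from and added to each mid-value.
mids = [15, 25, 35, 45, 55, 65, 75, 85, 95]
half = (mids[1] - mids[0]) / 2

limits = [(m - half, m + half) for m in mids]
print(limits)   # (10.0, 20.0), (20.0, 30.0), ..., (90.0, 100.0)
```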


• Example: Find the lower and upper limits of the classes if their mid-values are 10, 20, 35, 55, 85.

• Solution: Here the differences between successive mid-values are not equal. To find the limits of the first class, half of the difference between the second and first mid-values is subtracted from and added to the first mid-value. Therefore, the limits of the first class are 5-15. The lower limit of the second class is taken as equal to the upper limit of the first class.

• The upper limit of a class = lower limit + width, where width = 2 × (mid-value − lower limit). The upper limit of the second class = 15 + 2(20 − 15) = 25. Thus, the second class interval is 15-25. Similarly, the limits of the third, fourth and fifth classes are 25-45, 45-65 and 65-105, respectively.
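The rule described above can be verified in code:

```python
# Class limits from unequally spaced mid-values: the first class takes
# half the first difference on each side; thereafter each lower limit
# equals the previous upper limit, and upper = lower + 2*(mid - lower).
mids = [10, 20, 35, 55, 85]

half = (mids[1] - mids[0]) / 2
limits = [(mids[0] - half, mids[0] + half)]          # first class: 5-15
for m in mids[1:]:
    lower = limits[-1][1]                            # previous upper limit
    limits.append((lower, lower + 2 * (m - lower)))

print(limits)   # [(5.0, 15.0), (15.0, 25.0), (25.0, 45.0), (45.0, 65.0), (65.0, 105.0)]
```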



Data Sampling Methods for Imbalanced Classification

• Data sampling provides a collection of techniques that transform a training dataset in order to balance or better
balance the class distribution.

• Once balanced, standard machine learning algorithms can be trained directly on the transformed dataset
without any modification.

• This allows the challenge of imbalanced classification, even with severely imbalanced class distributions, to be
addressed with a data preparation method.
• There are many different types of data sampling methods that can be used, and there is no single best method
to use on all classification problems and with all classification models. Like choosing a predictive model, careful
experimentation is required to discover what works best for your project.


• A chief problem with imbalanced classification datasets is that standard machine learning algorithms do not
perform well on them. Many machine learning algorithms rely upon the class distribution in the training dataset
to gauge the likelihood of observing examples in each class when the model will be used to make predictions.
• As such, many machine learning algorithms, like decision trees, k-nearest neighbors, and neural networks, will therefore learn that the minority class is not as important as the majority class, paying more attention to, and performing better on, the majority class.
• The hitch with imbalanced datasets is that standard classification learning algorithms are often biased towards
the majority classes (known as “negative”) and therefore there is a higher misclassification rate in the minority
class instances (called the “positive” class).
• — Page 79, Learning from Imbalanced Data Sets, 2018.
• This is a problem because the minority class is exactly the class that we care most about in imbalanced
classification problems.
• This is because the majority class often reflects a normal case, whereas the minority class represents a positive case for a diagnostic, fault, fraud, or other type of exceptional circumstance.



Balance the Class Distribution With Data Sampling

• The most popular solution to an imbalanced classification problem is to change the composition of the training
dataset.
• Techniques designed to change the class distribution in the training dataset are generally referred to as
sampling methods or resampling methods as we are sampling an existing data sample.
• Sampling methods seem to be the dominate type of approach in the community as they tackle imbalanced
learning in a straightforward manner.
• — Page 3, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.
• The reason that sampling methods are so common is because they are simple to understand and implement,
and because once applied to transform the training dataset, a suite of standard machine learning algorithms can
then be used directly.
• This means that any of the tens or hundreds of machine learning algorithms developed for balanced (or mostly balanced) classification can then be fit on the training dataset without any modification to adapt them for the imbalance in observations.


• Sampling methods are a very popular method for dealing with imbalanced data. These methods are primarily employed
to address the problem with relative rarity but do not address the issue of absolute rarity.
There are two main types of data sampling used on the training dataset: oversampling and undersampling. In the next
section, we will take a tour of popular methods from each type, as well as methods that combine multiple approaches.

• Oversampling Techniques
• Oversampling methods duplicate examples in the minority class or synthesize new examples from the examples in the
minority class.
• Some of the more widely used and implemented oversampling methods include:
• Random Oversampling
• Synthetic Minority Oversampling Technique (SMOTE)
• Borderline-SMOTE
• Borderline Oversampling with SVM
• Adaptive Synthetic Sampling (ADASYN)
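A minimal stdlib-only sketch of the simplest of these, Random Oversampling, on hypothetical toy data (libraries such as imbalanced-learn provide ready-made implementations of this and of SMOTE):

```python
import random
from collections import Counter

# Random Oversampling: duplicate randomly chosen minority-class
# examples until both classes have the same number of examples.
random.seed(0)

X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.7], [0.8]]
y = [0, 0, 0, 0, 0, 0, 1, 1]           # imbalanced toy labels

minority = [i for i, label in enumerate(y) if label == 1]
deficit = y.count(0) - y.count(1)      # examples needed to balance

for i in random.choices(minority, k=deficit):
    X.append(X[i])                     # duplicate a minority example
    y.append(1)

print(Counter(y))                      # both classes now have 6 examples
```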


• There are many extensions to the SMOTE method that aim to be more selective for the types of examples in
the majority class that are synthesized.
• Borderline-SMOTE involves selecting those instances of the minority class that are misclassified, such as with
a k-nearest neighbor classification model, and only generating synthetic samples that are “difficult” to classify.
• Borderline Oversampling is an extension to SMOTE that fits an SVM to the dataset and uses the decision
boundary as defined by the support vectors as the basis for generating synthetic examples, again based on the
idea that the decision boundary is the area where more minority examples are required.
• Adaptive Synthetic Sampling (ADASYN) is another extension to SMOTE that generates synthetic samples inversely proportional to the density of the examples in the minority class. It is designed to create synthetic examples in regions of the feature space where the density of minority examples is low, and fewer or none where the density is high.


• Undersampling Techniques
• Undersampling methods delete or select a subset of examples from the majority class.
• Some of the more widely used and implemented undersampling methods include:
• Random Undersampling
• Condensed Nearest Neighbor Rule (CNN)
• Near Miss Undersampling
• Tomek Links Undersampling
• Edited Nearest Neighbors Rule (ENN)
• One-Sided Selection (OSS)
• Neighborhood Cleaning Rule (NCR)
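Random Undersampling, the simplest of these, can likewise be sketched with the standard library on hypothetical toy data:

```python
import random
from collections import Counter

# Random Undersampling: keep all minority examples and a random
# subset of the majority class of the same size.
random.seed(0)

X = [[v / 10] for v in range(10)]
y = [0] * 8 + [1] * 2                  # majority class 0, minority class 1

minority = [i for i, label in enumerate(y) if label == 1]
majority = [i for i, label in enumerate(y) if label == 0]
keep = minority + random.sample(majority, len(minority))

X_res = [X[i] for i in keep]
y_res = [y[i] for i in keep]
print(Counter(y_res))                  # 2 examples of each class
```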


• Combinations of Techniques
• Although an oversampling or undersampling method when used alone on a training dataset can be effective,
experiments have shown that applying both types of techniques together can often result in better overall
performance of a model fit on the resulting transformed dataset.
• Some of the more widely used and implemented combinations of data sampling methods include:
• SMOTE and Random Undersampling
• SMOTE and Tomek Links
• SMOTE and Edited Nearest Neighbors Rule



Thank You

