
Fundamentals of Data Science

Unit I

Need for data science – benefits and uses – facets of data – data science process – setting the research goal – retrieving data – cleansing, integrating and transforming data – exploratory data analysis – build the models – presenting and building applications.

Unit II

Frequency distributions – outliers – relative frequency distributions – cumulative frequency distributions – frequency distributions for nominal data – interpreting distributions – graphs – averages – mode – median – mean – averages for qualitative and ranked data.

Unit III

Normal distributions – z scores – normal curve problems – finding proportions – finding scores – more about z scores – correlation – scatter plots – correlation coefficient for quantitative data – computational formula for correlation coefficient.

Unit IV

Basics of NumPy arrays, aggregations, computations on arrays, comparisons, structured arrays, data manipulation, data indexing and selection, operating on data, missing data, hierarchical indexing, combining datasets – aggregation and grouping, pivot tables.

Unit V

Visualization with matplotlib, line plots, scatter plots, visualizing errors, density and contour plots, histograms, binnings, and density, three-dimensional plotting, geographic data.

Data Science
Data Science is a multi-disciplinary field whose objective is to perform data analysis and generate knowledge that can be used for decision making. This knowledge can take the form of recurring patterns, predictive planning models, forecasting models and so on.
A data science application collects data and information from multiple heterogeneous sources; cleans, integrates, processes and analyses this data using various tools; and presents information and knowledge in various visual forms.

Benefits and Uses of Data Science


Benefits of Data Science
Improved Decision Making
 Data-driven insights: By analyzing large datasets, organizations can
make more informed, evidence-based decisions rather than relying
on intuition or gut feeling.
 Predictive analytics: Data science can help forecast future trends,
such as sales forecasts, customer behavior, or stock prices, which
supports better strategic planning.
Automation of Processes
 Operational efficiency: Machine learning algorithms and AI can
automate repetitive tasks, reducing human error and increasing
operational speed.
 Optimizing workflows: Data science can be used to streamline
processes, reducing costs and time spent on manual tasks.
Personalization
 Customized experiences: Data science allows companies to analyze
customer preferences and behaviors, enabling them to deliver
personalized products, services, and marketing strategies (e.g.,
recommendation engines on streaming platforms like Netflix or
Amazon).
 Targeted marketing: Businesses can segment customers based on
data insights, resulting in more effective advertising and higher
customer engagement.
Cost Reduction and Resource Optimization
 Resource allocation: Data science can help identify inefficiencies in
resource usage (e.g., energy consumption or labor distribution),
enabling businesses to optimize resource allocation and reduce
operational costs.
 Fraud detection: In industries like banking, data science is used to
detect unusual patterns that may indicate fraudulent activity,
saving businesses from costly losses.
Risk Management and Fraud Prevention
 Anomaly detection: Advanced algorithms can identify irregular
patterns in large datasets, allowing businesses to detect potential
fraud, cybersecurity threats, or operational risks.
 Financial risk modeling: In the financial industry, data science is
used to assess the risk of investments, loans, and insurance
policies.
Enhanced Customer Insights and Engagement
 Customer segmentation: Data science helps businesses categorize
customers into meaningful groups based on their behavior, needs,
and preferences, allowing for better customer relationship
management.

 Voice of the customer: Text analytics on reviews, surveys, and
social media data can provide actionable insights into customer
satisfaction and areas for improvement.

Uses of Data Science


Healthcare and Medicine
 Predictive healthcare analytics: Data science can be used to predict
disease outbreaks, patient outcomes, and the likelihood of certain
medical conditions, enabling more proactive healthcare.
 Medical imaging: Machine learning algorithms help in the analysis
of medical images (X-rays, MRIs, etc.), assisting doctors in
diagnosing conditions more accurately and efficiently.
 Drug discovery: Data science aids in identifying potential drug
compounds, understanding clinical trial data, and optimizing the
development of new treatments.
Finance and Banking
 Algorithmic trading: Financial markets use data science techniques
to develop trading algorithms that can predict market movements
and execute trades in real-time.
 Credit scoring and risk assessment: Banks use data science to
evaluate the creditworthiness of individuals and businesses by
analyzing historical financial data and transaction behavior.
 Fraud detection: Data science helps identify potentially fraudulent
activity in real time by analyzing transaction patterns.
E-commerce and Retail
 Recommendation systems: E-commerce platforms use data science
to suggest products to users based on browsing and purchase
history, increasing sales and improving user experience.
 Inventory management: Data science can help retailers forecast
demand and optimize inventory, reducing the costs associated with
overstocking or stockouts.

 Dynamic pricing: Retailers can adjust prices in real time based on
data insights such as demand fluctuations, competitor pricing, or
supply chain disruptions.

Transportation and Logistics


 Route optimization: Logistics companies use data science to
optimize delivery routes, reduce fuel costs, and improve delivery
times.
 Demand forecasting for ride-sharing: Companies like Uber and Lyft
use data science to predict ride demand, adjusting pricing and
driver availability to balance supply and demand.
 Autonomous vehicles: Data science, particularly machine learning,
is essential for the development of self-driving cars, enabling
vehicles to understand and navigate their environment.
Manufacturing and Supply Chain
 Predictive maintenance: Sensors on machines collect data that,
when analyzed, can predict when equipment will fail, allowing for
maintenance to be done before problems occur, thus minimizing
downtime.
 Supply chain optimization: Data science is used to optimize the
entire supply chain by predicting demand, tracking inventory, and
identifying bottlenecks in production and distribution.
 Quality control: Data analysis helps detect anomalies in
manufacturing processes, ensuring the quality of products.
Energy and Environment
 Energy consumption forecasting: Utilities use data science to
predict energy demand and optimize the distribution of energy
resources.
 Renewable energy optimization: Data science can be used to
predict weather patterns and optimize the production of renewable
energy sources such as wind and solar power.

 Environmental monitoring: Analyzing data from sensors and
satellites helps in monitoring environmental conditions, such as air
quality, deforestation, or ocean pollution.
Sports and Entertainment
 Sports analytics: Data science is widely used in sports for
performance analysis, player tracking, injury prevention, and
strategy optimization.
 Fan engagement: Sports teams and entertainment platforms use
data to understand audience preferences, offering personalized
content, experiences, and marketing strategies.
 Game development: Video game developers use data science to
improve gameplay, balance game mechanics, and provide a more
personalized user experience.
Government and Public Sector
 Crime prediction and prevention: Data science helps law
enforcement agencies predict and prevent crimes by analyzing
patterns of criminal activity and identifying areas at high risk.
 Urban planning: Governments use data science to analyze traffic
patterns, environmental factors, and population demographics to
inform urban development and planning.
 Public health: During epidemics or pandemics, data science is used
to model disease spread, optimize resource allocation, and support
policy decisions.

Facets of data
Very large amounts of data are generated in big data and data science. This data comes in various types; the main categories are as follows:
 Structured
 Unstructured
 Natural language
 Machine-generated
 Graph-based
 Audio, video, and images
 Streaming

Structured
Structured data is arranged in a row and column format, which makes it easy for applications to retrieve and process. A database management system is used for storing structured data.
The most common form of structured data or records is a database
where specific information is stored based on a methodology of columns
and rows.
Structured data is also searchable by data type within content.
Structured data is understood by computers and is also efficiently
organized for human readers.

Unstructured Data
Unstructured data is data that does not follow a specified format.
Rows and columns are not used for unstructured data; therefore, it is difficult to retrieve the required information. Unstructured data has no identifiable structure.

Unstructured data can take the form of text (documents, email messages, customer feedback), audio, video or images.
Email is an example of unstructured data: its content can be of any type and does not follow any structural rules.

Natural language
Natural language is a special type of unstructured data. Natural
language processing enables machines to recognize characters,
words and sentences, then apply meaning and understanding to that
information.
For natural language processing to help machines understand
human language, it must go through speech recognition, natural
language understanding and machine translation. It is an iterative
process comprising several layers of text analysis.
Machine - Generated Data
Machine-generated data is information created without human interaction as a result of a computer process or application activity. This means that data entered manually by an end user is not considered machine-generated.
Machine data contains a definitive record of all activity and
behavior of our customers, users, transactions, applications, servers,
networks, factory machinery and so on.

Examples of machine data are web server logs, call detail
records, network event logs and telemetry.
Both Machine-to-Machine (M2M) and Human-to-Machine (H2M)
interactions generate machine data. Machine data is generated
continuously by every processor-based system, as well as many
consumer-oriented systems.

Graph-based or Network Data


Graphs are data structures to describe relationships and
interactions between entities in complex systems. In general, a graph
contains a collection of entities called nodes and another collection of
interactions between a pair of nodes called edges.
Nodes represent entities, which can be of any object type that is
relevant to our problem domain. By connecting nodes with edges, we will
end up with a graph (network) of nodes.
Graph databases can also help users easily detect relationship patterns, such as multiple people associated with a personal email address or multiple people sharing the same IP address but residing at different physical addresses.

Graph databases are used to store graph-based data and are queried with specialized query languages such as SPARQL.
Audio, Image and Video
Audio, image and video are data types that pose specific challenges
to a data scientist. Tasks that are trivial for humans, such as recognizing
objects in pictures, turn out to be challenging for computers.
The terms audio and video commonly refer to time-based media storage formats for sound/music and moving-picture information. Digital audio and video recordings, encoded with audio and video codecs, can be uncompressed, losslessly compressed or lossy compressed depending on the desired quality and use case.
Data Science is playing an important role to address these
challenges in multimedia data. Multimedia data usually contains various
forms of media, such as text, image, video, geographic coordinates and
even pulse waveforms, which come from multiple sources.

Streaming Data
Streaming data is data that is generated continuously by thousands
of data sources, which typically send in the data records simultaneously
and in small sizes (order of Kilobytes).
Streaming data includes a wide variety of data such as log files
generated by customers using mobile or web applications,
ecommerce purchases, in-game player activity, information from
social networks, financial trading floors or geospatial services and
telemetry from connected devices or instrumentation in data centers.

Data Science Process


The data science process is a systematic approach to analyzing and
interpreting data to extract meaningful insights and solve real-world
problems.
The data science process typically consists of six steps.

Setting the research goal
This step involves understanding the what, the why and the how of the project, and defining a clear research goal that answers the business question.

Retrieving data
This is the collection of the data required for the project. It is the process of gaining a business understanding of the data you have and translating what each piece of data means.
Data can also be delivered by third-party companies and takes
many forms ranging from Excel spreadsheets to different types of
databases.

Data preparation
Data can have many inconsistencies, such as missing values, blank columns and incorrect data formats, which need to be cleaned. We need to process, explore and condition the data before modeling. Clean data gives better predictions.
Data exploration
Data exploration is related to deeper understanding of data. Try to
understand how variables interact with each other, the distribution of the
data and whether there are outliers.

To achieve this, use descriptive statistics, visual techniques and simple modeling. This step is also called Exploratory Data Analysis.

Data modeling
In this step, the actual model building process starts. Here, the data scientist splits the dataset into training and testing sets. Techniques like association, classification and clustering are applied to the training data set. The model, once prepared, is tested against the "testing" dataset.

Presentation and automation


Deliver the final baselined model with reports, code and technical
documents in this stage. The model is deployed into a real-time production environment after thorough testing.
In this stage, the key findings are communicated to all
stakeholders. This helps to decide if the project results are a success or a
failure based on the inputs from the model.

Setting the research goal


A project starts by understanding the what, the why, and the how
of your project. The outcome should be a clear research goal, a good
understanding of the context, well-defined deliverables, and a
plan of action with a timetable. This information is then best placed in a
project charter.
The length and formality can, of course, differ between projects and
companies. In this early phase of the project, people skills and business
acumen are more important than great technical prowess, which is why
this part will often be guided by more senior personnel.

Spend time understanding the goals and context of your
research:
 An essential outcome is the research goal that states the purpose
of your assignment in a clear and focused manner.
 Understanding the business goals and context is critical for project
success.
 Continue asking questions and devising examples until you grasp
the exact business expectations, identify how your project fits in
the bigger picture, appreciate how your research is going to change
the business, and understand how they’ll use your results.
Create a project charter
A project charter requires teamwork, and your input covers at least the
following:
 A clear research goal
 The project mission and context
 How you’re going to perform your analysis
 What resources you expect to use
 Proof that it’s an achievable project, or a proof of concept
 Deliverables and a measure of success
 A timeline

Retrieving data
Retrieving the required data is the second phase of a data science project. Sometimes data scientists need to go into the field and design a data collection process.
 Many companies will have already collected and stored the data
and what they don't have can often be bought from third parties.
 Much high-quality data is freely available for public and commercial use. Data can be stored in various formats, ranging from text files to tables in a database, and may be internal or external.

Start working on internal data (i.e., data stored within the company)
 The data scientist's first step is to verify the internal data: assess the relevance and quality of the data that is readily available in the company.
Most companies have a program for maintaining key data, so much
of the cleaning work may already be done.
 This data can be stored in official data repositories such as
databases, data marts, data warehouses and data lakes
maintained by a team of IT professionals.

 Data repository is also known as a data library or data archive.
This is a general term to refer to a data set isolated to be mined for
data reporting and analysis.
 A data repository is a large database infrastructure: several databases that collect, manage and store data sets for data analysis, sharing and reporting.
Data repository can be used to describe several ways to collect and store
data:
 Data warehouse is a large data repository that aggregates
data usually from multiple sources or segments of a business,
without the data being necessarily related.
 Data lake is a large data repository that stores unstructured
data that is classified and tagged with metadata.
 Data marts are subsets of the data repository. These data
marts are more targeted to what the data user needs and easier
to use.
Do not be afraid to shop around
 If the required data is not available within the company, take the help of other companies that provide such data. For example, Nielsen and GfK provide data for the retail industry. Data scientists also draw on data from Twitter, LinkedIn and Facebook.
 Government organizations share their data for free with the world.
This data can be of excellent quality; it depends on the institution
that creates and manages it. The information they share covers a
broad range of topics such as the number of accidents or amount of
drug abuse in a certain region and its demographics.

Cleansing, integrating, and transforming data
The model needs the data in a specific format, so data transformation is a necessary step. It’s a good habit to correct data errors as early in the process as possible.

Cleansing data
Data cleansing is a subprocess of the data science process. It focuses on removing errors in the data, so that the data becomes a true and consistent representation of the processes it originates from.
Types of errors:
 Interpretation error – taking a value in the data for granted, such as a recorded age greater than 300 years.
 Inconsistencies – for example, putting “Female” in one table and “F” in another when they represent the same thing.
An overview of common errors

Data Entry Errors


Data collection and data entry are error-prone processes. They often require human intervention, and because humans are only human, they make typos and lose concentration. Data collected by machines or computers isn’t free from errors either: some errors arise from human sloppiness, whereas others are due to machine or hardware failure.
Most errors of this type are easy to fix with simple assignment statements and if-then-else rules:

if x == "Godo":
    x = "Good"
if x == "Bade":
    x = "Bad"
Redundant Whitespace:
Whitespaces tend to be hard to detect but cause errors like other
redundant characters.
Example: a mismatch of keys such as “FR ” – “FR”
Fixing redundant whitespace – in Python, the strip() function removes leading and trailing spaces.
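For illustration, a minimal Python sketch of this fix, reusing the "FR " key from the example above:

key = "FR "                       # key read in with a trailing space
print(key.strip() == "FR")        # True: strip() removes leading and trailing whitespace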

Fixing Capital Letter Mismatches:


 Capital letter mismatches – the distinction between “Brazil” and “brazil”.
 Fix these by converting both strings to lowercase, e.g. with .lower() in Python:
"Brazil".lower() == "brazil".lower() evaluates to True

Combining data from different data sources


Data comes from several different places, and in this substep we
focus on integrating these different sources. Data varies in size, type,
and structure, ranging from databases and Excel files to text documents.
The Different Ways of Combining Data:
There are two operations for combining information from different data sources.
Joining: enriching an observation from one table with information from
another table.
The second operation is appending or stacking: adding the
observations of one table to those of another table.
Joining Tables
Joining tables allows you to combine the information about an observation found in one table with the information found in another table.

Appending Table
Appending or stacking tables is effectively adding observations
from one table to another table.
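For illustration, a minimal sketch of both operations. It assumes the pandas library (not named in the text above) and two small made-up tables:

import pandas as pd

# Joining: enrich client records with country-level information via a shared key
clients = pd.DataFrame({"client_id": [1, 2], "country": ["FR", "BR"]})
countries = pd.DataFrame({"country": ["FR", "BR"], "population_m": [68, 216]})
joined = clients.merge(countries, on="country", how="left")

# Appending (stacking): add the rows of one table to another table with the same columns
january = pd.DataFrame({"client_id": [1, 2], "sales": [100, 250]})
february = pd.DataFrame({"client_id": [1, 3], "sales": [90, 300]})
stacked = pd.concat([january, february], ignore_index=True)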

Transforming data
In data transformation, the data are transformed or consolidated
into forms appropriate for mining. Relationships between an input
variable and an output variable aren't always linear.

Reducing the Number of Variables
Having too many variables in the model makes the model difficult
to handle, and certain techniques don't perform well when you overload them with too many input variables.
All the techniques based on a Euclidean distance perform well
only up to 10 variables. Data scientists use special methods to reduce
the number of variables but retain the maximum amount of data.
Euclidean distance:
Euclidean distance is used to measure the similarity between observations. It is calculated as the square root of the sum of the squared differences between the coordinates of two points.

Euclidean distance = √((x₁ − x₂)² + (y₁ − y₂)²)
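A short NumPy sketch of this calculation for two illustrative points:

import numpy as np

p1 = np.array([1.0, 2.0])
p2 = np.array([4.0, 6.0])
distance = np.sqrt(np.sum((p1 - p2) ** 2))   # 5.0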

Turning Variables into Dummies


Variables can be turned into dummy variables. Dummy variables can take only two values: true (1) or false (0).
They’re used to indicate the presence or absence of a categorical effect that may explain the observation. In this case you make a separate column for each class stored in the variable and mark it with 1 if the class is present and 0 otherwise.

Example
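For illustration, a minimal pandas sketch of this encoding (pandas is assumed here; the column values are made up):

import pandas as pd

df = pd.DataFrame({"country": ["Brazil", "France", "Brazil"]})
dummies = pd.get_dummies(df["country"], dtype=int)
# Each class becomes its own 0/1 column:
#    Brazil  France
# 0       1       0
# 1       0       1
# 2       1       0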

Exploratory Data Analysis


Exploratory Data Analysis (EDA) is a general approach to exploring
datasets by means of simple summary statistics and graphic
visualizations in order to gain a deeper understanding of data.

EDA is used by data scientists to analyze and investigate data sets


and summarize their main characteristics, often employing data
visualization methods.

Example
A bar chart, a line plot, and a distribution plot are some of the graphs used in exploratory analysis.

Brushing and linking

With brushing and linking, you combine and link different graphs and tables (or views) so that changes in one graph are automatically transferred to the other graphs.

Pareto diagram
Example
 A Pareto diagram is a combination of the values and a cumulative
distribution.
 It’s easy to see from this diagram that the first 50% of the countries
contain slightly less than 80% of the total amount.
 If this graph represented customer buying power and we sell
expensive products, we probably don’t need to spend our
marketing budget in every country; we could start with the first
50%.

Box plot
A box plot is a type of chart often used in exploratory data analysis to visually show the distribution of numerical data and its skewness by displaying the data quartiles (or percentiles) and averages.
Example
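For illustration, a minimal matplotlib sketch of a box plot over a made-up list of scores:

import matplotlib.pyplot as plt

scores = [55, 67, 78, 89, 92, 85, 56, 76, 90, 83,
          77, 66, 60, 73, 88, 80, 65, 95, 79, 82]
plt.boxplot(scores, vert=False)   # box spans the quartiles; the line inside marks the median
plt.xlabel("Score")
plt.show()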

Build the models


To build a model, the data should be clean and its content properly understood.

The components of model building are as follows:
 Selection of model and variable
 Execution of model
 Model diagnostic and model comparison
Model and Variable Selection
For this phase, consider model performance and whether the project meets all the requirements to use the model, as well as other factors:
 Must the model be moved to a production environment and, if so,
would it be easy to implement?
 How difficult is the maintenance on the model: how long will it
remain relevant if left untouched?
 Does the model need to be easy to explain?
Model Execution
Various programming languages can be used to implement the model. For model execution, Python provides libraries like StatsModels or Scikit-learn. These packages implement several of the most popular techniques.
Coding a model is a nontrivial task in most cases, so having these
libraries available can speed up the process. Following are the remarks
on output:

a) Model fit: R-squared or adjusted R-squared is used.
b) Predictor variables have a coefficient: For a linear model this is
easy to interpret.
c) Predictor significance: Coefficients are great, but sometimes not
enough evidence exists to show that the influence is there.
• Linear regression works if we want to predict a value, but to classify something, classification models are used. The k-nearest neighbors method is one of the most widely used.
Example
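A hedged StatsModels sketch on synthetic data, showing where each of the three remarks above appears in the fitted output:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(size=100)

X = sm.add_constant(x)            # adds the intercept column
results = sm.OLS(y, X).fit()
print(results.rsquared)           # (a) model fit
print(results.params)             # (b) coefficients of the predictors
print(results.pvalues)            # (c) significance of each predictor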

The following commercial tools are used:


1. SAS enterprise miner: This tool allows users to run predictive and
descriptive models based on large volumes of data from across the
enterprise.
2. SPSS modeler: It offers methods to explore and analyze data through
a GUI.
3. Matlab: Provides a high-level language for performing a variety of
data analytics, algorithms and data exploration.
4. Alpine miner: This tool provides a GUI front end for users to develop
analytic workflows and interact with Big Data tools and platforms on the
back end.
• Open Source tools:
1. R and PL/R: PL/R is a procedural language for PostgreSQL with R.
2. Octave: A free software programming language for computational
modeling, has some of the functionality of Matlab.

3. WEKA: It is a free data mining software package with an analytic
workbench. The functions created in WEKA can be executed within Java
code.
4. Python is a programming language that provides toolkits for machine
learning and analysis.
5. SQL in-database implementations, such as MADlib, provide an alternative to in-memory desktop analytical tools.

Model Diagnostics and Model Comparison


Try to build multiple models and then select the best one based on multiple criteria. Working with a holdout sample helps you pick the best-performing model.
In the holdout method, the data is split into two different datasets, labeled as a training and a testing dataset. This can be a 60/40, 70/30 or 80/20 split. This technique is called the hold-out validation technique.
Mean square error is a simple measure: check for every prediction
how far it was from the truth, square this error, and add up the error of
every prediction.
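For illustration, a hedged scikit-learn sketch of a 70/30 holdout evaluation on synthetic data. Note that scikit-learn's mean_squared_error averages the squared errors rather than only summing them:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X = np.arange(100).reshape(-1, 1)                          # one predictor
y = 3 * X.ravel() + np.random.default_rng(1).normal(size=100)

# 70/30 holdout split: the model never sees the test rows during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

model = LinearRegression().fit(X_train, y_train)
print(mean_squared_error(y_test, model.predict(X_test)))   # average squared prediction error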

Presenting Findings and Building Applications

The team delivers final reports, briefings, code and technical documents.
 In addition, the team may run a pilot project to implement the models in a production environment.
 The last stage of the data science process is where your soft skills will be most useful.
 Present your results to the stakeholders and industrialize your analysis process for repetitive reuse and integration with other tools.

Unit – II

Frequency Distribution
A frequency distribution is a representation, either in a graphical or
tabular format, that displays the number of observations within a
given interval.
There are two main types of frequency distributions for quantitative data:
1. Ungrouped Frequency Distribution
2. Grouped Frequency Distribution
Ungrouped Frequency Distribution
An ungrouped frequency distribution is used when the dataset
is small and consists of individual data points (usually discrete). It lists
each unique data value along with its corresponding frequency (i.e., the
number of times it occurs).
Steps to Create an Ungrouped Frequency Distribution:
 Step 1: Arrange the data in ascending or descending order.
 Step 2: Count how many times each unique value appears in the
dataset.
 Step 3: Organize the data into a table with two columns: one for
the data values and the other for their corresponding frequencies.

Example:
Consider the following dataset of ages: 18,21,18,22,22,23,21,22,24,21
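A minimal Python sketch that counts these frequencies with collections.Counter:

from collections import Counter

ages = [18, 21, 18, 22, 22, 23, 21, 22, 24, 21]
frequencies = Counter(sorted(ages))       # sorting first gives ascending order
for value, count in frequencies.items():
    print(value, count)
# 18: 2, 21: 3, 22: 3, 23: 1, 24: 1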

Grouped Frequency Distribution


A grouped frequency distribution is used for larger datasets or
when the data is continuous. In this case, data is divided into
intervals (or classes), and the frequencies represent how many data
points fall within each class. This method simplifies the dataset, making it
easier to understand the overall distribution.
Steps to Create a Grouped Frequency Distribution:
Step 1: Determine the number of intervals (also called classes). A
common guideline is to use Sturges' Rule, which provides a formula for
the number of classes:

k = 1 + 3.322 log₁₀(n),   where n is the number of observations

Step 2: Calculate the class width. The class width is the range of values within each interval. It is determined by dividing the range of the data by the number of intervals:

Class width = Range / Number of classes = (Maximum value − Minimum value) / k
Step 3: Create intervals (or classes). The intervals should cover the full
range of the data and be mutually exclusive (no overlaps).

Step 4: Count the frequencies for each class.

Step 5: Construct a table showing the intervals (classes) and the
corresponding frequencies.

Example

Consider the following dataset of test scores:


55, 67, 78, 89, 92, 85, 56, 76, 90, 83, 77, 66, 60, 73, 88, 80, 65, 95, 79, 82

Step 1: Determine the number of classes.


Using Sturges' Rule with n = 20:

k = 1 + 3.322 log₁₀(20) ≈ 5.32

Rounding this up gives 6 intervals.


Step 2: Calculate the class width.
The range of the data is:

Range = 95 − 55 = 40

The class width is:

Class width = 40 / 6 ≈ 6.67, rounded up to 7
Step 3: Create the intervals.

The class intervals will be:


 55 - 61
 62 - 68
 69 - 75
 76 - 82
 83 - 89
 90 – 96

Step 4: Count the frequencies for each interval.
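A short NumPy sketch that reproduces these counts, using the class boundaries chosen above:

import numpy as np

scores = [55, 67, 78, 89, 92, 85, 56, 76, 90, 83,
          77, 66, 60, 73, 88, 80, 65, 95, 79, 82]
edges = [55, 62, 69, 76, 83, 90, 97]      # one edge per class boundary
counts, _ = np.histogram(scores, bins=edges)
print(counts)                             # [3 3 1 6 4 3]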

Relative Frequency Distributions
A relative frequency distribution is a way of showing how often
each value or class interval occurs relative to the total number of data
points. It expresses the frequency of each data value or class as a
proportion or percentage of the total dataset.
Formula for Relative Frequency:
The relative frequency of a value or class interval is calculated by
dividing the frequency of that value or class by the total number of
observations in the dataset.

Relative frequency = Frequency of the value or class / Total number of observations

To express the relative frequency as a percentage, multiply the result by 100:

Relative frequency (%) = (Frequency / Total number of observations) × 100
Steps to Create a Relative Frequency Distribution:


1. Organize the data: List the data values or class intervals (for
grouped data).
2. Calculate the frequency: For each value (or class), count how
many times it occurs in the dataset.
3. Calculate the total number of observations: Sum all the
frequencies to determine the total number of data points.
4. Calculate the relative frequency: For each value or class, divide
its frequency by the total number of observations.
5. Optional: Multiply by 100 to express the relative frequency as a
percentage.

Example 1: Relative Frequency Distribution for Ungrouped Data
Let’s say we have the following dataset representing the number of
students in different age groups:
18, 21, 18, 22, 22, 23, 21, 22, 24, 21
Step 1: Count the frequency of each value.

Step 2: Calculate the total number of data points.


The total number of students is:
Total=2+3+3+1+1=10
Step 3: Calculate the relative frequency for each age.

Step 4: Construct the relative frequency table.
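A minimal Python sketch that builds this table from the raw ages:

from collections import Counter

ages = [18, 21, 18, 22, 22, 23, 21, 22, 24, 21]
total = len(ages)
for value, count in sorted(Counter(ages).items()):
    print(value, count, count / total, f"{100 * count / total:.0f}%")
# e.g. age 18 occurs 2 times, relative frequency 0.2, i.e. 20%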

Example 2: Relative Frequency Distribution for Grouped Data
Consider the following dataset of test scores:
55, 67, 78, 89, 92, 85, 56, 76, 90, 83, 77, 66, 60, 73, 88, 80,
65, 95, 79, 82
Step 1: Group the data into intervals:
55 – 61 , 62 – 68 , 69 – 75 , 76 – 82, 83 – 89, 90 – 96
Step 2: Count the frequencies for each interval.

Step 3: Calculate the total number of data points.


The total number of test scores is 20.
Step 4: Calculate the relative frequency for each class interval.

Step 5: Construct the relative frequency table.

Cumulative frequency distributions


Cumulative frequency distributions show the total number of
observations in each class and in all lower-ranked classes.
What is Cumulative Frequency?
 Frequency: The number of occurrences of each value or range in a
dataset.
 Cumulative Frequency: The total number of values that are less
than or equal to a given value or fall within a certain class interval.

Steps to Construct a Cumulative Frequency Distribution:


1. Sort the Data: If the data is raw, organize it in order (ascending or
descending).
2. Create Class Intervals (if needed): For large datasets, you can
group the data into intervals (or bins).

3. Find the Frequency: Count how many data points fall into each
class interval.
4. Calculate the Cumulative Frequency: Start from the first-class
interval and add the frequency of each interval to the sum of the
previous intervals.

Example of a Cumulative Frequency Distribution:


Let’s say you have the following dataset of exam scores (out of 100):
56,62,70,72,85,88,90,90,92,95,97,99
We can organize the data into intervals and calculate cumulative
frequencies.
Step 1: Organize the Data into Intervals

Step 2: Calculate Cumulative Frequency


 For the interval 50-59, the frequency is 1, so the cumulative frequency is 1.
 For 60-69, the cumulative frequency is 1 + 1 = 2.
 For 70-79, the cumulative frequency is 2 + 2 = 4.
 For 80-89, the cumulative frequency is 4 + 2 = 6.
 For 90-99, the cumulative frequency is 6 + 6 = 12.
Cumulative Frequency Distribution:
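A short NumPy sketch of the same running total, using the frequencies from the example above:

import numpy as np

frequencies = np.array([1, 1, 2, 2, 6])   # counts for 50-59, 60-69, 70-79, 80-89, 90-99
print(np.cumsum(frequencies))             # [ 1  2  4  6 12]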

Frequency distributions for nominal Data
Frequency distributions for nominal data are used to summarize
and display the counts or frequencies of categories within a variable
that represents distinct groups or categories with no inherent order.
Nominal Data: This is data that consists of categories or groups that are
mutually exclusive. Examples include:
 Gender: Male, Female, Other
 Color: Red, Blue, Green
 Country: USA, Canada, Mexico
Frequency Distribution for Nominal Data
A frequency distribution for nominal data shows how many times
each category appears in the dataset. It typically includes both:
 Frequency: The count of observations in each category.
 Relative Frequency: The proportion or percentage of the total
that each category represents.
Components of a Frequency Distribution for Nominal Data:
 Categories (Levels): These are the different groups or labels that
make up the nominal variable.
Example: For the variable "favorite color," the categories
might be Red, Blue, and Green.
 Frequency: The count of how many times each category occurs in
the dataset.
Example: How many people in the survey like the color Red,
Blue, or Green.
 Relative Frequency (optional): The proportion of observations
that fall into each category. It is calculated by dividing the
frequency of a category by the total number of observations.
Formula:

Relative frequency = Frequency of the category / Total number of observations
Example: If there are 20 responses and 5 of those are for the color Red, the relative frequency for Red would be 5/20 = 0.25, or 25%.

Example: Frequency Distribution for Nominal Data


Let’s assume we have a survey of 15 people asking about their
favorite color. The responses are as follows:
 Red, Blue, Red, Green, Blue, Red, Green, Blue, Red, Red, Blue,
Green, Green, Blue, Red.
1. Categories:
The categories (favorite colors) are:
 Red
 Blue
 Green
2. Frequency:
We count how many times each color appears:
 Red: 6 times
 Blue: 5 times
 Green: 4 times
Relative Frequency:
There are 15 total observations. The relative frequency for each color is calculated as follows:
 Red: 6/15 ≈ 0.40 (40%)
 Blue: 5/15 ≈ 0.33 (33%)
 Green: 4/15 ≈ 0.27 (27%)
1. Create a Frequency Distribution Table:

Graphical Representation:
To visualize the frequency distribution of nominal data, a bar chart or pie chart is typically used.
Interpreting distributions
GRAPHS
Data can be described clearly and concisely with the aid of a well-
constructed frequency distribution.
Graphs for Quantitative Data
Histograms
A histogram is a special kind of bar graph that applies to
quantitative data (discrete or continuous).
The horizontal axis represents the range of data values. The bar
height represents the frequency of data values falling within the interval
formed by the width of the bar.
Some of the more important features of histograms
 Equal units along the horizontal axis (the X axis) reflect the various
class intervals of the frequency distribution.
 Equal units along the vertical axis (the Y axis, or ordinate) reflect
increases in frequency. (The units along the vertical axis do not
have to be the same width as those along the horizontal axis.)
 The intersection of the two axes defines the origin at which both
numerical scales equal 0.

 Numerical scales always increase from left to right along the
horizontal axis and from bottom to top along the vertical axis.
 The body of the histogram consists of a series of bars whose
heights reflect the frequencies for the various classes.
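A minimal matplotlib sketch of a histogram, reusing the test scores from the grouped-frequency example in this unit:

import matplotlib.pyplot as plt

scores = [55, 67, 78, 89, 92, 85, 56, 76, 90, 83,
          77, 66, 60, 73, 88, 80, 65, 95, 79, 82]
plt.hist(scores, bins=6, edgecolor="black")   # six equal-width class intervals
plt.xlabel("Score")
plt.ylabel("Frequency")
plt.show()
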
Frequency polygon
Frequency polygons are a graphical device for understanding the
shapes of distributions. They serve the same purpose as histograms, but
are especially helpful for comparing sets of data.
Frequency polygons are also a good choice for displaying
cumulative frequency distributions.
A frequency polygon depicts the shape and trends of the data. It can be drawn with or without a histogram.

The midpoints will be used for the position on the horizontal axis
and the frequency for the vertical axis.

A line indicates that there is a continuous movement. A frequency
polygon should therefore be used for scale variables that are binned, but
sometimes a frequency polygon is also used for ordinal variables.
Frequency polygons are useful for comparing distributions. This is
achieved by overlaying the frequency polygons drawn for different data
sets.
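A minimal matplotlib sketch of a frequency polygon, plotting the class midpoints of the grouped example against their frequencies:

import matplotlib.pyplot as plt

midpoints = [58, 65, 72, 79, 86, 93]      # midpoints of the classes 55-61 ... 90-96
frequencies = [3, 3, 1, 6, 4, 3]          # class frequencies from the grouped example
plt.plot(midpoints, frequencies, marker="o")
plt.xlabel("Score (class midpoint)")
plt.ylabel("Frequency")
plt.show()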

Stem and Leaf diagram:


 Stem and leaf diagrams allow you to display raw data visually. Each raw score is divided into a stem and a leaf: the leaf is typically the last digit of the raw value, and the stem is the remaining digits.
 Data points are split into a leaf (usually the ones digit) and a stem
(the other digits)
 To generate a stem and leaf diagram, first create a vertical column
that contains all of the stems. Then list each leaf next to the corresponding stem. In these diagrams, all of the scores are represented without the loss of any information.
 A stem-and-leaf plot retains the original data. The leaves are
usually the last digit in each data value and the stems are the
remaining digits.
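A short Python sketch that builds a stem-and-leaf view from an illustrative list of scores:

from collections import defaultdict

data = [55, 67, 78, 89, 92, 85, 56, 76, 90, 83]
stems = defaultdict(list)
for value in sorted(data):
    stems[value // 10].append(value % 10)   # stem = tens digit, leaf = ones digit
for stem, leaves in sorted(stems.items()):
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
# 5 | 5 6
# 6 | 7
# 7 | 6 8
# 8 | 3 5 9
# 9 | 0 2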

Graph for Qualitative (Nominal) Data


There are a couple of graphs that are appropriate for qualitative
data that has no natural ordering.
Bar graphs
 Bar Graphs are like histograms, but the horizontal axis has the
name of each category and there are spaces between the bars.
 Usually, the bars are ordered with the categories in alphabetical
order. One variant of a bar graph is called a Pareto Chart. These are
bar graphs with the categories ordered by frequency, from largest
to smallest.

 Bars of a bar graph can be represented both vertically and horizontally.
 In bar graph, bars are used to represent the amount of data in each
category; one axis displays the categories of qualitative data and
the other axis displays the frequencies.
MODE
The mode is the value that appears most frequently in the data set.
Characteristics of the Mode:
 Most Frequent Value: The mode is simply the value that appears
the most frequently.
 Applicable to Any Type of Data: The mode can be used with
numerical, categorical, or even ordinal data.
 One, More Than One, or None:
A unimodal distribution has one mode.
A bimodal distribution has two modes.
A multimodal distribution has more than two modes.
A distribution with no mode means no value repeats.
How to Calculate the Mode:
 Organize the data: First, arrange the data in ascending or
descending order (although this step is not strictly necessary for
finding the mode).
 Count the frequency of each value: Identify which value
appears most frequently.
 Identify the mode: The value that appears the most is the mode.
If multiple values tie for the most frequent, the dataset has more
than one mode.
Examples:
Single Mode (Unimodal):
Data: {1, 2, 3, 3, 4, 5}
Frequency: 1 occurs once, 2 occurs once, 3 occurs twice, 4 occurs
once, 5 occurs once.
Mode: 3, because it appears twice, more than any other number.
Multiple Modes (Bimodal):
Data: {1, 2, 3, 3, 4, 4, 5}
Frequency: 1 occurs once, 2 occurs once, 3 occurs twice, 4 occurs
twice, 5 occurs once.
Mode: 3 and 4, because both appear twice.
No Mode:
Data: {1, 2, 3, 4, 5}
Frequency: All values appear once.
Mode: There is no mode, because no value repeats.
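A minimal Python sketch of the three cases using statistics.multimode:

from statistics import multimode

print(multimode([1, 2, 3, 3, 4, 5]))      # [3]     unimodal
print(multimode([1, 2, 3, 3, 4, 4, 5]))   # [3, 4]  bimodal
print(multimode([1, 2, 3, 4, 5]))         # every value ties, so there is no repeated mode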

MEDIAN
The median is the middle value when the data points are arranged
in ascending or descending order.
How to Calculate the Median?
 Sort the Data: Arrange the data in ascending (or descending)
order.
 Determine the Position of the Median:
o If the dataset contains an odd number of values, the median
is the middle value.
o If the dataset contains an even number of values, the median
is the average of the two middle values.
Steps for Finding the Median:
1. Sort the data in numerical order.
2. Find the position of the median:
o If the number of data points n is odd, the median is the value
at position (n+1)/2
o If the number of data points n is even, the median is the average of the two middle values, which are located at positions n/2 and (n/2) + 1.

Examples:
Odd Number of Data Points (Simple Median):
Data: {1, 3, 5, 7, 9}
Sorted data: {1, 3, 5, 7, 9}
The median is the middle value, which is 5.

Even Number of Data Points (Average of Two Middle Values):


Data: {1, 3, 5, 7}
Sorted data: {1, 3, 5, 7}
There are 4 numbers, so the median is the average of the two middle values:

(3 + 5) / 2 = 4

The median is 4.
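A minimal Python sketch of both cases using statistics.median:

from statistics import median

print(median([1, 3, 5, 7, 9]))   # 5   (odd count: the middle value)
print(median([1, 3, 5, 7]))      # 4.0 (even count: average of 3 and 5)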

MEAN
The mean is one of the most fundamental statistical concepts. The
basic idea is to sum all the data points and divide by the number of points in the dataset:

Mean = (x₁ + x₂ + … + xₙ) / n
Step-by-Step Process to Calculate the Mean


o Sum all the data points: Add together every number in the
dataset.
o Divide the sum by the number of data points: Take the sum
from Step 1 and divide by how many numbers are in the dataset.

Example 1: Simple Calculation of the Mean


Let's take a smaller dataset to demonstrate the process of calculating the
mean:

Data: {4, 6, 8, 10, 12}
Step 1: Sum all the values
4+6+8+10+12=40
Step 2: Divide by the number of values
There are 5 values in the dataset, so divide 40 by 5:
Mean=40/5=8
So, the mean of this dataset is 8.
Example 2: Using Larger Numbers
Consider a larger dataset with more complex numbers, such as test
scores or salaries in thousands:
Data: {1200, 1500, 1800, 2100, 2400, 2700, 3000}
Step 1: Sum all the values
1200+1500+1800+2100+2400+2700+3000=14100
Step 2: Divide by the number of values
There are 7 values in this dataset, so:
Mean=14100 / 7 = 2014.29
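A minimal Python sketch of both examples using statistics.mean:

from statistics import mean

print(mean([4, 6, 8, 10, 12]))                                     # 8
print(round(mean([1200, 1500, 1800, 2100, 2400, 2700, 3000]), 2))  # 2014.29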

Averages for Qualitative and Ranked Data


1. Averages for Qualitative (Categorical) Data:
Qualitative data represents categories, so calculating an average in
the traditional sense (like the mean) isn't directly applicable. Instead, we
typically look at the mode.
 Mode: The mode is the most frequent category or value in the
dataset. It is the most common or most observed category.
For example:
In a survey where participants choose their favorite color, you may have
categories like red, blue, and green. If blue is chosen by most
participants, the mode is "blue."
 Frequency Distribution: This method counts how often each
category occurs. While it's not technically an "average," it can help
identify the most common categories and the proportion of each
category in the dataset.
Example: In a survey of favorite colors:
o Red: 10 responses
o Blue: 15 responses
o Green: 5 responses
o Total respondents = 30
 Proportions or Percentages: This method calculates the
proportion of each category relative to the total number of
responses. You can find percentages by dividing the frequency of
each category by the total number of data points.
Example: If 15 out of 30 people choose "Blue" as their favorite color, the percentage of people who prefer blue is:

(15 / 30) × 100 = 50%
2. Averages for Ranked (Ordinal) Data:


Ranked or ordinal data represents categories with a meaningful
order but no fixed intervals between them. While the arithmetic mean
may not be appropriate, there are methods to summarize and find
central tendencies in ordinal data:
 Median: The middle value when the data is arranged in order. It is
a good measure of central tendency for ordinal data because it
divides the data into two equal halves.
 Mode: As with qualitative data, the mode (the most frequent rank)
can also be used for ordinal data.
 Mean Rank: The mean of the ranks can sometimes be used,
though it assumes equal spacing between ranks, which might not
be a valid assumption for all ordinal data. To calculate the mean
rank:
1. Assign numerical values to the ranks.
2. Compute the arithmetic mean of the numerical
values.

Example:
Ranking of customer satisfaction on a scale from 1 to 5, where 1 is
"very dissatisfied" and 5 is "very satisfied."
 Customer ratings: [3, 5, 4, 2, 5]
 To find the mean rank:
o Sum the ranks: 3+5+4+2+5=19
o Divide by the number of observations: 19/5=3.8
o The mean rank is 3.8, which suggests an average level of
satisfaction slightly above "neutral."
