
DEPARTMENT OF CS(AI & DS)

Fundamentals of Data Science


SYLLABUS

UNIT-I
Need for data science – benefits and uses – facets of data – data science process – setting the research goal – retrieving data – cleansing, integrating and transforming data – exploratory data analysis – build the models – presenting and building applications.
UNIT-II
Frequency distributions – outliers – relative frequency distributions – cumulative frequency distributions – frequency distributions for nominal data – interpreting distributions – graphs – averages – mode – median – mean
UNIT-III
Normal distributions – z scores – normal curve problems – finding proportions – finding scores – more about z scores – correlation – scatter plots – correlation coefficient for quantitative data – computational formula for correlation coefficient – averages for qualitative and ranked data
UNIT-IV
Basics of NumPy arrays, aggregations, computations on arrays, comparisons, structured arrays, data manipulation, data indexing and selection, operating on data, missing data, hierarchical indexing, combining datasets – aggregation and grouping, pivot tables
UNIT-V
Visualization with Matplotlib, line plots, scatter plots, visualizing errors, density and contour plots, histograms, binnings, and density, three-dimensional plotting, geographic data

Text Books:
1. David Cielen, Arno D. B. Meysman, and Mohamed Ali, "Introducing Data Science", Manning Publications, 2016.

2. Robert S. Witte and John S. Witte, "Statistics", Eleventh Edition, Wiley Publications, 2017.
3. Jake VanderPlas, "Python Data Science Handbook", O'Reilly, 2016.
References :
1. Allen B. Downey, "Think Stats: Exploratory Data Analysis in Python", Green Tea Press, 2014.
Web Resources
● https://www.w3schools.com/datascience/
● https://www.geeksforgeeks.org/data-science-tutorial/
● https://www.coursera.org/
Mapping with Programme Outcomes:

S-Strong-3 M-Medium-2 L-Low-1

CO/PSO                  PSO 1   PSO 2   PSO 3   PSO 4   PSO 5   PSO 6

CO1                       3       3       3       3       3       2
CO2                       3       3       3       2       2       3
CO3                       2       2       2       3       3       3
CO4                       3       3       3       3       3       2
CO5                       3       3       3       3       3       1
Weightage of course      14      14      14      14      14      11
contributed to each PSO

UNIT-I
Need for data science – benefits and uses – facets of data – data science process – setting the research goal – retrieving data – cleansing, integrating and transforming data – exploratory data analysis – build the models – presenting and building applications.
What is Data Science?
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems
to extract knowledge and insights from structured and unstructured data. In simpler terms, data science
is about obtaining, processing, and analyzing data to gain insights for many purposes.
Data Science is a combination of multiple disciplines that uses statistics, data analysis, and machine
learning to analyze data and to extract knowledge and insights from it.

Data Science is about data gathering, analysis, and decision-making.
Data Science is about finding patterns in data through analysis and making future predictions.
By using Data Science, companies are able to make:
● Better decisions (should we choose A or B)
● Predictive analysis (what will happen next?)
● Pattern discoveries (finding patterns, or maybe hidden information, in the data)

Where is Data Science Needed?


Data Science is used in many industries in the world today, e.g. banking, consultancy, healthcare, and
manufacturing.
Examples of where Data Science is needed:
● For route planning: to discover the best shipping routes
● To foresee delays for flights, ships, trains, etc. (through predictive analysis)
● To create promotional offers
● To find the best-suited time to deliver goods
● To forecast next year's revenue for a company
● To analyze the health benefits of training
● To predict who will win elections
Data Science can be applied in nearly every part of a business where data is available. Examples are:
● Consumer goods
● Stock markets
● Industry
● Politics
● Logistics companies
● E-commerce

The data science lifecycle


The data science lifecycle refers to the various stages a data science project generally undergoes, from
initial conception and data collection to communicating results and insights.

Despite every data science project being unique—depending on the problem, the industry it's applied in,
and the data involved—most projects follow a similar lifecycle.

This lifecycle provides a structured approach for handling complex data, drawing accurate conclusions,
and making data-driven decisions.

Here are the five main phases that structure the data science lifecycle:

Data collection and storage

This initial phase involves collecting data from various sources, such as databases, Excel files, text files,
APIs, web scraping, or even real-time data streams. The type and volume of data collected largely
depend on the problem you’re addressing.

Once collected, this data is stored in an appropriate format ready for further processing. Storing the data
securely and efficiently is important to allow quick retrieval and processing.

Data preparation

Often considered the most time-consuming phase, data preparation involves cleaning and transforming
raw data into a suitable format for analysis. This phase includes handling missing or inconsistent data,
removing duplicates, normalization, and data type conversions. The objective is to create a clean,
high-quality dataset that can yield accurate and reliable analytical results.

Exploration and visualization


During this phase, data scientists explore the prepared data to understand its patterns, characteristics, and
potential anomalies. Techniques like statistical analysis and data visualization summarize the data's main
characteristics, often with visual methods.

Visualization tools, such as charts and graphs, make the data more understandable, enabling stakeholders
to comprehend the data trends and patterns better.

Experimentation and prediction

Data scientists use machine learning algorithms and statistical models to identify patterns, make
predictions, or discover insights in this phase. The goal here is to derive something significant from the
data that aligns with the project's objectives, whether predicting future outcomes, classifying data, or
uncovering hidden patterns.

Data Storytelling and communication

The final phase involves interpreting and communicating the results derived from the data analysis. It's
not enough to have insights; you must communicate them effectively, using clear, concise language and
compelling visuals. The goal is to convey these findings to non-technical stakeholders in a way that
influences decision-making or drives strategic initiatives.

Understanding and implementing this lifecycle allows for a more systematic and successful approach to
data science projects. Let's now delve into why data science is so important.

Why is Data Science Important?

Data science has emerged as a revolutionary field that is crucial in generating insights from data and
transforming businesses. It's not an overstatement to say that data science is the backbone of modern
industries. But why has it gained so much significance?

● Data volume. Firstly, the rise of digital technologies has led to an explosion of data. Every
online transaction, social media interaction, and digital process generates data. However, this
data is valuable only if we can extract meaningful insights from it. And that's precisely where
data science comes in.
● Value-creation. Secondly, data science is not just about analyzing data; it's about interpreting
and using this data to make informed business decisions, predict future trends, understand
customer behavior, and drive operational efficiency. This ability to drive decision-making based
on data is what makes data science so essential to organizations.
● Career options. Lastly, the field of data science offers lucrative career opportunities. With the
increasing demand for professionals who can work with data, jobs in data science are among the
highest paying in the industry. As per Glassdoor, the average salary for a data scientist in the
United States is $137,984, making it a rewarding career choice.

What is Data Science Used For?


Data science is used for an array of applications, from predicting customer behavior to optimizing
business processes. The scope of data science is vast and encompasses various types of analytics.

● Descriptive analytics. Analyzes past data to understand the current state and identify trends. For
instance, a retail store might use it to analyze last quarter's sales or identify best-selling products.
● Diagnostic analytics. Explores data to understand why certain events occurred, identifying
patterns and anomalies. If a company's sales fall, it would identify whether poor product quality,
increased competition, or other factors caused it.
● Predictive analytics. Uses statistical models to forecast future outcomes based on past data, used
widely in finance, healthcare, and marketing. A credit card company may employ it to predict
customer default risks.
● Prescriptive analytics. Suggests actions based on results from other types of analytics to mitigate
future problems or leverage promising trends. For example, a navigation app advising the fastest
route based on current traffic conditions.

The increasing sophistication from descriptive to diagnostic to predictive to prescriptive analytics can
provide companies with valuable insights to guide decision-making and strategic planning. You can read
more about the four types of analytics in a separate article.

What are the Benefits of Data Science?

Data science can add value to any business which uses its data effectively. From statistics to predictions,
effective data-driven practices can put a company on the fast track to success. Here are some ways in
which data science is used:

Optimize business processes

Data Science can significantly improve a company's operations in various departments, from logistics
and supply chain to human resources and beyond. It can help in resource allocation, performance
evaluation, and process automation. For example, a logistics company can use data science to optimize
routes, reduce delivery times, save fuel costs, and improve customer satisfaction.

Unearth new insights

Data Science can uncover hidden patterns and insights that might not be evident at first glance. These
insights can provide companies with a competitive edge and help them understand their business better.
For instance, a company can use customer data to identify trends and preferences, enabling them to tailor
their products or services accordingly.

Create innovative products and solutions

Companies can use data science to innovate and create new products or services based on customer
needs and preferences. It also allows businesses to predict market trends and stay ahead of the
competition. For example, streaming services like Netflix use data science to understand viewer
preferences and create personalized recommendations, enhancing user experience.

Which Industries Use Data Science?

The implications of data science span across all industries, fundamentally changing how organizations
operate and make decisions. While every industry stands to gain from implementing data science, it's
especially influential in data-rich sectors.

Let's delve deeper into how data science is revolutionizing these key industries:

Data science applications in finance

The finance sector has been quick to harness the power of data science. From fraud detection and
algorithmic trading to portfolio management and risk assessment, data science has made complex
financial operations more efficient and precise. For instance, credit card companies utilize data science
techniques to detect and prevent fraudulent transactions, saving billions of dollars annually.

Learn more about the finance fundamentals in Python and how you can make data-driven financial
decisions with our skill track.

Data science applications in healthcare

Healthcare is another industry where data science has a profound impact. Applications range from
predicting disease outbreaks and improving patient care quality to enhancing hospital management and
drug discovery. Predictive models help doctors diagnose diseases early, and treatment plans can be
customized according to the patient's specific needs, leading to improved patient outcomes.

You can discover more about how data science is transforming healthcare in a DataFramed Podcast
episode.

Data science applications in marketing

Marketing is a field that has been significantly transformed by the advent of data science. The
applications in this industry are diverse, ranging from customer segmentation and targeted advertising to
sales forecasting and sentiment analysis. Data science allows marketers to understand consumer
behavior in unprecedented detail, enabling them to create more effective campaigns. Predictive analytics
can also help businesses identify potential market trends, giving them a competitive edge.
Personalization algorithms can tailor product recommendations to individual customers, thereby
increasing sales and customer satisfaction.

We have a separate blog post on five ways to use data science in marketing, exploring some of the
methods used in the industry. You can also learn more in our Marketing Analytics with Python skill
track.

Data science applications in technology


Technology companies are perhaps the most significant beneficiaries of data science. From powering
recommendation engines to enhancing image and speech recognition, data science finds applications in
diverse areas. Ride-hailing platforms, for example, rely on data science for connecting drivers with ride
hailers and optimizing the supply of drivers depending on the time of day.

How is Data Science Different from Other Data-Related Fields?

While data science overlaps with many fields that also work with data, it carries a unique blend of
principles, tools, and techniques designed to extract insightful patterns from data.

Distinguishing between data science and these related fields can give a better understanding of the
landscape and help in setting the right career path. Let's demystify these differences.

Data science vs data analytics

Data science and data analytics both serve crucial roles in extracting value from data, but their focuses
differ. Data science is an overarching field that uses methods including machine learning and predictive
analytics, to draw insights from data. In contrast, data analytics concentrates on processing and
performing statistical analysis on existing datasets to answer specific questions.

Data science vs business analytics

While business analytics also deals with data analysis, it is more centered on leveraging data for
strategic business decisions. It is generally less technical and more business-focused than data science.
Data science, though it can inform business strategies, often dives deeper into the technical aspects, like
programming and machine learning.

Data science vs data engineering

Data engineering focuses on building and maintaining the infrastructure for data collection, storage, and
processing, ensuring data is clean and accessible. Data science, on the other hand, analyzes this data,
using statistical and machine learning models to extract valuable insights that influence business
decisions. In essence, data engineers create the data 'roads', while data scientists 'drive' on them to derive
meaningful insights. Both roles are vital in a data-driven organization.

Data science vs machine learning

Machine learning is a subset of data science, concentrating on creating and implementing algorithms
that let machines learn from and make decisions based on data. Data science, however, is broader and
incorporates many techniques, including machine learning, to extract meaningful information from data.

Data Science vs Statistics

Statistics, a mathematical discipline dealing with data collection, analysis, interpretation, and
organization, is a key component of data science. However, data science integrates statistics with other
methods to extract insights from data, making it a more multidisciplinary field.
Field               Focus                                                        Technical Emphasis

Data Science        Driving value with data across the 4 levels of analytics    Programming, ML, statistics

Data Analytics      Perform statistical analysis on existing datasets           Statistical analysis

Business Analytics  Leverage data for strategic business decisions              Business strategies, data analysis

Data Engineering    Build and maintain data infrastructure                      Data collection, storage, processing

Machine Learning    Creating and implementing algorithms for machine learning   Algorithm development, model implementation

Statistics          Data collection, analysis, interpretation, and organization Statistical analysis, mathematical principles
Having understood these distinctions, we can now delve into the key concepts every data scientist needs
to master.
Key Data Science Concepts
A successful data scientist doesn't just need technical skills but also an understanding of core concepts
that form the foundation of the field. Here are some key concepts to grasp:
Statistics and probability
These are the bedrock of data science. Statistics is used to derive meaningful insights from data, while
probability allows us to make predictions about future events based on available data. Understanding
distributions, statistical tests, and probability theories is essential for any data scientist.

OVERVIEW OF THE DATA SCIENCE PROCESS


The typical data science process consists of six steps through which you’ll iterate, as shown in figure
➢ The first step of this process is setting a research goal. The main purpose here is making sure all the
stakeholders understand the what, how, and why of the project. In every serious project this will result in
a project charter.
➢ The second phase is data retrieval. You want to have data available for analysis, so this step includes
finding suitable data and getting access to the data from the data owner. The result is data in its raw
form, which probably needs polishing and transformation before it becomes usable.
➢ Now that you have the raw data, it’s time to prepare it. This includes transforming the data from a
raw form into data that’s directly usable in your models. To achieve this, you’ll detect and correct
different kinds of errors in the data, combine data from different data sources, and transform it. If you
have successfully completed this step, you can progress to data visualization and modeling.
➢ The fourth step is data exploration. The goal of this step is to gain a deep understanding of the data.
You’ll look for patterns, correlations, and deviations based on visual and descriptive techniques. The
insights you gain from this phase will enable you to start modeling.
➢ Finally, we get to model building (often referred to as “data modeling” throughout this book). It is
now that you attempt to gain the insights or make the predictions stated in your project charter. Now is
the time to bring out the heavy guns, but remember research has taught us that often (but not always) a
combination of simple models tends to outperform one complicated model. If you’ve done this phase
right, you’re almost done.
➢ The last step of the data science model is presenting your results and automating the analysis, if
needed. One goal of a project is to change a process and/or make better decisions. You may still need to
convince the business that your findings will indeed change the business process as expected. This is
where you can shine in your influencer role. The importance of this step is more apparent in projects on
a strategic and tactical level. Certain projects require you to perform the business process over and over
again, so automating the project will save time.
DEFINING RESEARCH GOALS
A project starts by understanding the what, the why, and the how of your project. The outcome should be
a clear research goal, a good understanding of the context, well-defined deliverables, and a plan of action
with a timetable. This information is then best placed in a project charter. Spend time understanding the
goals and context of your research:
➢ An essential outcome is the research goal that states the purpose of your assignment in a clear and
focused manner.
➢ Understanding the business goals and context is critical for project success.
➢ Continue asking questions and devising examples until you grasp the exact business expectations, identify how your project fits in the bigger picture, appreciate how your research is going to change the business, and understand how they'll use your results.
Create a project charter
A project charter requires teamwork, and your input covers at least the following:
➢ A clear research goal
➢ The project mission and context
➢ How you’re going to perform your analysis
➢ What resources you expect to use
➢ Proof that it’s an achievable project, or proof of concepts
➢ Deliverables and a measure of success
➢ A timeline
RETRIEVING DATA
➢ The next step in data science is to retrieve the required data. Sometimes you need to go into the field
and design a data collection process yourself, but most of the time you won’t be involved in this step.
➢ Many companies will have already collected and stored the data for you, and what they don’t have
can often be bought from third parties.
➢ More and more organizations are making even high-quality data freely available for public and
commercial use.
➢ Data can be stored in many forms, ranging from simple text files to tables in a database. The objective now is acquiring all the data you need.
Start with data stored within the company (internal data)
➢ Most companies have a program for maintaining key data; so much of the cleaning work may already
be done. This data can be stored in official data repositories such as databases, data marts, data
warehouses, and data lakes maintained by a team of IT professionals.
➢ Data warehouses and data marts are home to pre-processed data, data lakes contain data in its natural
or raw format.
➢ Finding data even within your own company can sometimes be a challenge. As companies grow, their
data becomes scattered around many places. The data may be dispersed as people change positions and
leave the company.
➢ Getting access to data is another difficult task. Organizations understand the value and sensitivity of
data and often have policies in place so everyone has access to what they need and nothing more.
➢ These policies translate into physical and digital barriers called Chinese walls. These "walls" are mandatory and well-regulated for customer data in most countries.
External data
➢ If data isn’t available inside your organization, look outside your organizations. Companies provide
data so that you, in turn, can enrich their services and ecosystem. Such is the case with Twitter,
LinkedIn, and Facebook.
➢ More and more governments and organizations share their data for free with the world.
➢ A list of open data providers that should get you started.

Data cleaning
Data cleaning, also known as data cleansing or data scrubbing, is a crucial step in the data science
pipeline. It involves identifying and correcting errors, inconsistencies, and inaccuracies in datasets to
improve their quality and reliability. This process ensures that the data used for analysis and modeling is
accurate, complete, and suitable for its intended purpose.
In this article, we’ll explore the importance of data cleaning, common issues that data scientists
encounter, and various techniques and best practices for effective data cleaning.
The Importance of Data Cleaning
Data cleaning plays a vital role in the data science process for several reasons:
Data Quality: Clean data leads to more accurate analyses and reliable insights. Poor data quality can
result in flawed conclusions and misguided decisions.
Model Performance: Machine learning models trained on clean data tend to perform better and
generalize more effectively to new, unseen data.
Efficiency: Clean data reduces the time and resources spent on troubleshooting and fixing issues during
later stages of analysis or model development.
Consistency: Data cleaning helps ensure consistency across different data sources and formats, making
it easier to integrate and analyze data from multiple origins.
Compliance: In many industries, clean and accurate data is essential for regulatory compliance and
reporting purposes.
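As an illustration of the cleaning operations described above, here is a minimal pandas sketch that removes duplicates, fixes data types, standardizes text, and drops unusable rows. The file name `sales.csv` and its column names are assumptions made for this example only.

```python
import pandas as pd

# Hypothetical raw dataset; the file and column names are illustrative
df = pd.read_csv("sales.csv")

# Remove exact duplicate rows
df = df.drop_duplicates()

# Correct data types: parse dates and force numeric values
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

# Standardize inconsistent text entries (e.g., " NY ", "ny" -> "ny")
df["region"] = df["region"].str.strip().str.lower()

# Drop rows where key fields are still missing after coercion
df = df.dropna(subset=["order_date", "amount"])

print(df.info())
```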

Exploratory data analysis is one of the basic and essential steps of a data science project. A data scientist spends nearly 70% of their time doing EDA on the dataset. In this section, we will discuss what Exploratory Data Analysis (EDA) is and the steps to perform it.

What is Exploratory Data Analysis (EDA)?


Exploratory Data Analysis (EDA) is a crucial initial step in data science projects. It involves analyzing
and visualizing data to understand its key characteristics, uncover patterns, locate outliers, and identify
relationships between variables. EDA is normally carried out as a preliminary step before undertaking
more formal statistical analyses or modeling.
Key aspects of EDA include:
Distribution of Data: Examining the distribution of data points to understand their range, central
tendencies (mean, median), and dispersion (variance, standard deviation).
Graphical Representations: Utilizing charts such as histograms, box plots, scatter plots, and bar charts to
visualize relationships within the data and distributions of variables.
Outlier Detection: Identifying unusual values that deviate from other data points. Outliers can influence
statistical analyses and might indicate data entry errors or unique cases.
Correlation Analysis: Checking the relationships between variables to understand how they might
affect each other. This includes computing correlation coefficients and creating correlation matrices.
Handling Missing Values: Detecting and deciding how to address missing data points, whether by
imputation or removal, depending on their impact and the amount of missing data.
Summary Statistics: Calculating key statistics that provide insight into data trends and nuances.
Testing Assumptions: Many statistical tests and models assume the data meet certain conditions (like
normality or homoscedasticity). EDA helps verify these assumptions.
Why is Exploratory Data Analysis Important?
Exploratory Data Analysis (EDA) is important for several reasons, especially in the context of data
science and statistical modeling. Here are some of the key reasons why EDA is a critical step in the
data analysis process:
Understanding Data Structures: EDA helps in getting familiar with the dataset, understanding the
number of features, the type of data in each feature, and the distribution of data points. This
understanding is crucial for selecting appropriate analysis or prediction techniques.
Identifying Patterns and Relationships: Through visualizations and statistical summaries, EDA can
reveal hidden patterns and intrinsic relationships between variables. These insights can guide further
analysis and enable more effective feature engineering and model building.
Detecting Anomalies and Outliers: EDA is essential for identifying errors or unusual data points that
may adversely affect the results of your analysis. Detecting these early can prevent costly mistakes in
predictive modeling and analysis.
Testing Assumptions: Many statistical models assume that data follow a certain distribution or that
variables are independent. EDA involves checking these assumptions. If the assumptions do not hold, the
conclusions drawn from the model could be invalid.
Informing Feature Selection and Engineering: Insights gained from EDA can inform which features
are most relevant to include in a model and how to transform them (scaling, encoding) to improve model
performance.
Optimizing Model Design: By understanding the data’s characteristics, analysts can choose appropriate
modeling techniques, decide on the complexity of the model, and better tune model parameters.
Facilitating Data Cleaning: EDA helps in spotting missing values and errors in the data, which are
critical to address before further analysis to improve data quality and integrity.
Enhancing Communication: Visual and statistical summaries from EDA can make it easier to
communicate findings and convince others of the validity of your conclusions, particularly when
explaining data-driven insights to stakeholders without technical backgrounds.
Types of Exploratory Data Analysis
EDA, or Exploratory Data Analysis, refers back to the method of analyzing and analyzing information
units to uncover styles, pick out relationships, and gain insights. There are various sorts of EDA
strategies that can be hired relying on the nature of the records and the desires of the evaluation.
Depending on the number of columns we are analyzing we can divide EDA into three types: Univariate,
bivariate and multivariate.
1. Univariate Analysis
Univariate analysis focuses on a single variable to understand its internal structure. It is primarily
concerned with describing the data and finding patterns existing in a single feature. This type of analysis
examines individual variables in the dataset: it involves summarizing and visualizing a single variable at
a time to understand its distribution, central tendency, spread, and other relevant statistics. Common
techniques include:
Histograms: Used to visualize the distribution of a variable.
Box plots: Useful for detecting outliers and understanding the spread and skewness of the data.
Bar charts: Employed for categorical data to show the frequency of each category.
Summary statistics: Calculations like mean, median, mode, variance, and standard deviation that
describe the central tendency and dispersion of the data.
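A minimal univariate sketch of these techniques using pandas, Matplotlib, and Seaborn is shown below; the small DataFrame and its `age` and `gender` columns are made up purely for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up dataset with one numeric and one categorical column
df = pd.DataFrame({
    "age": [23, 25, 31, 35, 40, 41, 44, 52, 60, 95],
    "gender": ["F", "M", "F", "F", "M", "M", "F", "M", "F", "M"],
})

# Summary statistics: central tendency and spread
print(df["age"].describe())
print("median:", df["age"].median(), "mode:", df["age"].mode()[0])

# Histogram of a numeric variable
sns.histplot(df["age"], bins=5)
plt.title("Distribution of age")
plt.show()

# Box plot to spot outliers and skewness
sns.boxplot(x=df["age"])
plt.show()

# Bar chart of a categorical variable
df["gender"].value_counts().plot(kind="bar")
plt.show()
```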
2. Bivariate Analysis
Bivariate analysis explores the relationship between two variables, helping to find associations,
correlations, and dependencies between pairs of variables. Some key techniques used in bivariate
analysis:
Scatter Plots: These are one of the most common tools used in bivariate analysis. A scatter plot helps
visualize the relationship between two continuous variables.
Correlation Coefficient: This statistical measure (often Pearson’s correlation coefficient for linear
relationships) quantifies the degree to which two variables are related.
Cross-tabulation: Also known as contingency tables, cross-tabulation is used to analyze the relationship
between two categorical variables. It shows the frequency distribution of categories of one variable in
rows and the other in columns, which helps in understanding the relationship between the two variables.
Line Graphs: In the context of time series data, line graphs can be used to compare two variables over
time. This helps in identifying trends, cycles, or patterns that emerge in the interaction of the variables
over the specified period.
Covariance: Covariance is a measure used to determine how much two random variables change
together. However, it is sensitive to the scale of the variables, so it’s often supplemented by the
correlation coefficient for a more standardized assessment of the relationship.
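The short sketch below illustrates the scatter plot, correlation coefficient, and covariance techniques just listed; the two columns (`hours_studied`, `exam_score`) are invented for the example.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up dataset with two numeric variables
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "exam_score":    [35, 45, 50, 58, 62, 70, 74, 83],
})

# Scatter plot of the two variables
plt.scatter(df["hours_studied"], df["exam_score"])
plt.xlabel("hours_studied")
plt.ylabel("exam_score")
plt.show()

# Pearson correlation coefficient and covariance
print("correlation:", df["hours_studied"].corr(df["exam_score"]))
print("covariance :", df["hours_studied"].cov(df["exam_score"]))

# Cross-tabulation applies to two categorical variables, e.g.:
# pd.crosstab(df["category_a"], df["category_b"])
```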
3. Multivariate Analysis
Multivariate analysis examines the relationships among more than two variables in the dataset. It aims to
understand how variables interact with one another, which is crucial for most statistical modeling
techniques. Techniques include:
Pair plots: Visualize relationships across several variables simultaneously to capture a comprehensive
view of potential interactions.
Principal Component Analysis (PCA): A dimensionality reduction technique used to reduce the
dimensionality of large datasets, while preserving as much variance as possible.
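As a small multivariate sketch, the code below draws a pair plot and runs PCA; it uses Seaborn's bundled 'iris' example dataset as a stand-in for your own data.

```python
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Seaborn's example dataset standing in for your own data
iris = sns.load_dataset("iris")

# Pair plot: pairwise relationships across several variables at once
sns.pairplot(iris, hue="species")

# PCA: reduce the four numeric measurements to two components
X = StandardScaler().fit_transform(iris.drop(columns="species"))
pca = PCA(n_components=2)
components = pca.fit_transform(X)
print("explained variance ratio:", pca.explained_variance_ratio_)
```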
Specialized EDA Techniques
In addition to univariate, bivariate, and multivariate analysis, there are specialized EDA techniques tailored for
specific types of data or analysis needs:
Spatial Analysis: For geographical data, using maps and spatial plotting to understand the geographical
distribution of variables.
Text Analysis: Involves techniques like word clouds, frequency distributions, and sentiment analysis to
explore text data.
Time Series Analysis: This type of analysis is mainly applied to datasets that have a temporal
component. Time series analysis involves examining and modeling patterns, trends, and seasonality
in the data over time. Techniques like line plots, autocorrelation analysis, moving averages, and
ARIMA (AutoRegressive Integrated Moving Average) models are commonly used in time series analysis.
Tools for Performing Exploratory Data Analysis
Exploratory Data Analysis (EDA) can be effectively performed using a variety of tools and software,
each offering unique features suitable for handling different types of data and analysis requirements.
1. Python Libraries
Pandas: Provides extensive functions for data manipulation and analysis, including data structure
handling and time series functionality.
Matplotlib: A plotting library for creating static, interactive, and animated visualizations in Python.
Seaborn: Built on top of Matplotlib, it provides a high-level interface for drawing attractive and
informative statistical graphics.
Plotly: An interactive graphing library for making interactive plots and offers more sophisticated
visualization capabilities.
2. R Packages
ggplot2: Part of the tidyverse, it’s a powerful tool for making complex plots from data in a data frame.
dplyr: A grammar of data manipulation, providing a consistent set of verbs that help you solve the most
common data manipulation challenges.
tidyr: Helps to tidy your data. Tidying your data means storing it in a consistent form that matches the
semantics of the dataset with the way it is stored.
Steps for Performing Exploratory Data Analysis
Performing Exploratory Data Analysis (EDA) involves a series of steps designed to help you understand
the data you’re working with, uncover underlying patterns, identify anomalies, test hypotheses, and
ensure the data is clean and suitable for further analysis.
Step 1: Understand the Problem and the Data
The first step in any data analysis project is to clearly understand the problem you are trying to
solve and the data you have at your disposal. This involves asking questions such as:
What is the business goal or research question you are trying to address?
What are the variables in the data, and what do they mean?
What are the data types (numerical, categorical, text, etc.)?
Are there any known data quality issues or limitations?
Are there any relevant domain-specific considerations or constraints?
By thoroughly understanding the problem and the data, you can better formulate your analysis
approach and avoid making incorrect assumptions or drawing misguided conclusions. It is also important to
consult domain experts or stakeholders at this stage to ensure you have a complete
understanding of the context and requirements.
Step 2: Import and Inspect the Data
Once you have a clear understanding of the problem and the data, the next step is to import the
data into your analysis environment (e.g., Python, R, or a spreadsheet program). During this step,
inspecting the data is critical to gain an initial understanding of its structure, variable types, and
potential issues.
Here are a few tasks you can carry out at this stage:
Load the data into your analysis environment, ensuring that it is imported correctly and
without errors or truncations.
Examine the size of the data (number of rows and columns) to get a sense of its scale and complexity.
Check for missing values and their distribution across variables, as missing data can notably
affect the quality and reliability of your analysis.
Identify data types and formats for each variable, as this information is needed for the subsequent
data manipulation and analysis steps.
Look for any apparent errors or inconsistencies in the data, such as invalid values, mismatched
units, or outliers, that can indicate quality issues with the data.
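A minimal pandas sketch of these inspection tasks follows; the file name `dataset.csv` is a hypothetical placeholder for your own data.

```python
import pandas as pd

# Hypothetical file name; replace with your own dataset
df = pd.read_csv("dataset.csv")

# Size and structure of the data
print(df.shape)        # number of rows and columns
print(df.head())       # first few records
print(df.info())       # column names, dtypes, non-null counts
print(df.describe())   # basic summary statistics for numeric columns

# Missing values per column
print(df.isnull().sum())
```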
Step 3: Handle Missing Data
Missing data is a common problem in many datasets, and it can significantly impact the quality and
reliability of your analysis. During the EDA process, it is critical to identify and deal with missing
data appropriately, as ignoring or mishandling missing data can result in biased or misleading
outcomes.
Here are some techniques you can use to handle missing data:
Understand the patterns and potential reasons for missing data: Is the data missing completely at
random (MCAR), missing at random (MAR), or missing not at random (MNAR)? Understanding the
underlying mechanisms can inform the proper method for handling missing data.
Decide whether to remove observations with missing values (listwise deletion) or impute (fill in)
missing values: Removing observations with missing values can result in a loss of information and
potentially biased outcomes, especially if the missing data are not MCAR. Imputing missing values
can help preserve valuable information; however, the imputation approach needs to be chosen carefully.
Use suitable imputation strategies, such as mean/median imputation, regression imputation, multiple
imputation, or machine-learning-based imputation methods like k-nearest neighbors (KNN) or
decision trees. The choice of imputation technique should be based on the
characteristics of the data and the assumptions underlying each method.
Consider the impact of missing data: Even after imputation, missing data can introduce
uncertainty and bias. It is important to acknowledge these limitations and interpret your results with
caution.
Handling missing data properly can improve the accuracy and reliability of your analysis and
prevent biased or misleading conclusions. It is also important to document the techniques used to address
missing data and the rationale behind your decisions.
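The sketch below contrasts deletion, median imputation, and KNN imputation on a tiny made-up dataset; the column names and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Small made-up dataset with missing values
df = pd.DataFrame({
    "age":    [25, np.nan, 31, 44, np.nan, 52],
    "income": [32_000, 41_000, np.nan, 58_000, 49_000, np.nan],
})

# Option 1: listwise deletion - drop rows with any missing value
dropped = df.dropna()

# Option 2: simple imputation - fill each column with its median
median_filled = df.fillna(df.median())

# Option 3: model-based imputation with k-nearest neighbours
imputed = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns)

print(dropped, median_filled, imputed, sep="\n\n")
```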
Step 4: Explore Data Characteristics
After addressing missing data, the next step in the EDA process is to explore the characteristics
of your data. This involves examining the distribution, central tendency, and variability of your variables
and identifying any potential outliers or anomalies. Understanding the characteristics of your data
is critical in choosing appropriate analytical techniques, spotting potential data quality
issues, and gaining insights that can inform subsequent analysis and modeling decisions.
Calculate summary statistics (mean, median, mode, standard deviation, skewness, kurtosis, and so
on) for numerical variables: these statistics provide a concise assessment of the distribution and central
tendency of each variable, aiding in the identification of potential issues or deviations from expected
patterns.
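The short pandas sketch below computes these summary statistics, including skewness and kurtosis, on a small made-up dataset (the `income` and `age` columns are invented for the example).

```python
import pandas as pd

# Small made-up numeric dataset for illustration
df = pd.DataFrame({
    "income": [22_000, 25_000, 27_500, 31_000, 34_000, 250_000],
    "age":    [21, 25, 29, 34, 41, 38],
})

# Central tendency and spread
print(df.mean())
print(df.median())
print(df.std())

# Shape of each distribution
print(df.skew())       # asymmetry (income is strongly right-skewed here)
print(df.kurtosis())   # heaviness of the tails
```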
Step 5: Perform Data Transformation
Data transformation is a critical step within the EDA process because it prepares your
data for further analysis and modeling. Depending on the characteristics of your data and the
requirements of your analysis, you may need to carry out various transformations to ensure that your data
are in the most appropriate format.
Here are a few common data transformation techniques:
Scaling or normalizing numerical variables to a standard range (e.g., min-max scaling, standardization)
Encoding categorical variables for use in machine learning models (e.g., one-hot encoding, label
encoding)
Applying mathematical transformations to numerical variables (e.g., logarithmic, square root) to correct for
skewness or non-linearity
Creating derived variables or features based on existing variables (e.g., calculating ratios,
combining variables)
Aggregating or grouping data based on specific variables or conditions
By transforming your data appropriately, you can ensure that your analysis and modeling
techniques are applied correctly and that your results are reliable and meaningful.
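Here is a minimal sketch of a few of these transformations with pandas and scikit-learn; the `income` and `city` columns are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({
    "income": [20_000, 35_000, 50_000, 1_000_000],      # skewed numeric variable
    "city":   ["Delhi", "Mumbai", "Delhi", "Chennai"],  # categorical variable
})

# Scaling / standardization of a numeric variable
df["income_minmax"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()
df["income_zscore"] = StandardScaler().fit_transform(df[["income"]]).ravel()

# Log transform to correct for skewness
df["income_log"] = np.log1p(df["income"])

# One-hot encoding of a categorical variable
df = pd.get_dummies(df, columns=["city"])
print(df.head())
```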
Step 6: Visualize Data Relationships
Visualization is a powerful tool in the EDA process, as it helps to discover relationships between
variables and identify patterns or trends that may not be immediately apparent from summary
statistics or numerical outputs. To visualize data relationships, explore univariate, bivariate, and
multivariate views of the data.
Create frequency tables, bar plots, and pie charts for categorical variables: these visualizations can help you
understand the distribution of categories and discover any potential imbalances or unusual patterns.
Generate histograms, box plots, violin plots, and density plots to visualize the distribution of
numerical variables: these visualizations can reveal important information about the shape, spread, and
potential outliers in the data.
Examine the correlation or association between variables using scatter plots, correlation matrices, or
statistical tests like Pearson's correlation coefficient or Spearman's rank correlation:
understanding the relationships between variables can inform feature selection, dimensionality
reduction, and modeling choices.
Step 7: Handling Outliers
An outlier is a data item/object that deviates significantly from the rest of the (so-called normal) objects.
Outliers can be caused by measurement or execution errors. The analysis for outlier detection is referred to
as outlier mining. There are many ways to detect outliers, and removing them from a pandas DataFrame
works the same way as removing any other rows.
Identify and inspect potential outliers using techniques like the interquartile range (IQR),
Z-scores, or domain-specific rules: outliers can considerably impact the results of statistical analyses
and machine learning models, so it is essential to identify and handle them appropriately.
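Below is a minimal sketch of the IQR and Z-score rules on a made-up sales series; the data values and the z > 3 threshold are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up daily sales figures with two extreme values appended
rng = np.random.default_rng(0)
sales = pd.Series(np.append(rng.normal(130, 5, size=100), [450, 20]))

# IQR rule: flag values more than 1.5 * IQR beyond the quartiles
q1, q3 = sales.quantile(0.25), sales.quantile(0.75)
iqr = q3 - q1
iqr_outliers = sales[(sales < q1 - 1.5 * iqr) | (sales > q3 + 1.5 * iqr)]

# Z-score rule: flag values more than 3 standard deviations from the mean
z = pd.Series(stats.zscore(sales), index=sales.index)
z_outliers = sales[z.abs() > 3]

print("IQR outliers:\n", iqr_outliers)
print("Z-score outliers:\n", z_outliers)
```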
Step 8: Communicate Findings and Insights
The final step in the EDA process is to communicate your findings and insights effectively. This includes
summarizing your analysis, highlighting key discoveries, and presenting your results
clearly and compellingly.
Here are a few tips for effective communication:
Clearly state the objectives and scope of your analysis
Provide context and background information to help others understand your approach
Use visualizations and graphics to support your findings and make them more accessible
Highlight key insights, patterns, or anomalies discovered during the EDA process
Discuss any limitations or caveats related to your analysis
Suggest potential next steps or areas for further investigation
Effective communication is critical for ensuring that your EDA efforts have a meaningful impact and that
your insights are understood and acted upon by stakeholders.

Data science has become a leading driver of decision-making, automation, and insight across industries in
today's fast-paced, technology-driven world. In essence, the nuts and bolts of data science involve handling
very large datasets, searching for patterns in the data, predicting specific outcomes based on the patterns
found, and finally acting or making informed decisions on such data. This is operationalized through data
science modeling, which involves designing the algorithms and statistical models that process and
analyze data. This process can be challenging for learners who are only beginning their steps
in the field. Laid out in crystal clear steps, however, even a beginner will be able to
follow this data science journey and create models effectively.
What is Data Science Modelling
Data science modeling is a set of steps from defining the problem to deploying the model in reality. The
aim here is to demystify this process and provide a very simple, stepwise guide that anyone with a basic
grasp of data science ideas can follow with ease. Each step is explained in the simplest of language so that
even a beginner can easily apply these practices in their own projects.
Data Science Modelling Steps
1. Define Your Objective
2. Collect Data
3. Clean Your Data
4. Explore Your Data
5. Split Your Data
6. Choose a Model
7. Train Your Model
8. Evaluate Your Model
9. Improve Your Model
10. Deploy Your Model
These 10 easy steps guide a beginner through the modeling process in data science and are meant to
be an easily readable guide for beginners who want to build models that can analyze data and give
insights. Each step is crucial and builds upon the previous one, ensuring a comprehensive understanding
of the entire process. Designed for students, professionals who would like to switch career paths,
and curious minds in pursuit of knowledge, this guide gives a solid foundation for
delving deeper into the world of data science models.
1. Define Your Objective
First, define very clearly what problem you are going to solve. Whether that is a customer churn
prediction, better product recommendations, or patterns in data, you first need to know your direction.
This should bring clarity to the choice of data, algorithms, and evaluation metrics.
2. Collect Data
Gather data relevant to your objective. This can include internal data from your company, publicly
available datasets, or data purchased from external sources. Ensure you have enough data to train your
model effectively.
3. Clean Your Data
Data cleaning is a critical step to prepare your dataset for modeling. It involves handling missing values,
removing duplicates, and correcting errors. Clean data ensures the reliability of your model's predictions.
4. Explore Your Data
Data exploration, or exploratory data analysis (EDA), involves summarizing the main characteristics of
your dataset. Use visualizations and statistics to uncover patterns, anomalies, and relationships between
variables.
5. Split Your Data
Divide your dataset into training and testing sets. The training set is used to train your model, while the
testing set evaluates its performance. A common split ratio is 80% for training and 20% for testing.
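A minimal sketch of an 80/20 split with scikit-learn is shown below; the tiny DataFrame and its column names are invented for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny made-up dataset: two feature columns and a target column
df = pd.DataFrame({
    "feature_1": range(10),
    "feature_2": [5, 3, 6, 2, 8, 7, 1, 9, 4, 0],
    "target":    [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

X = df[["feature_1", "feature_2"]]
y = df["target"]

# 80% of the rows for training, 20% held back for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)   # (8, 2) (2, 2)
```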
6. Choose a Model
Select a model that suits your problem type (e.g., regression, classification) and data. Beginners can start
with simpler models like linear regression or decision trees before moving on to more complex models
like neural networks.
7. Train Your Model
Feed your training data into the model. This process involves the model learning from the data, adjusting
its parameters to minimize errors. Training a model can take time, especially with large datasets or
complex models.
8. Evaluate Your Model
After training, assess your model's performance using the testing set. Common evaluation metrics
include accuracy, precision, recall, and F1 score. Evaluation helps you understand how well your model
will perform on unseen data.
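The sketch below trains a simple classifier and reports the evaluation metrics named above; it uses synthetic data from scikit-learn as a stand-in for a real dataset, and a decision tree is only one possible model choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Synthetic binary-classification data standing in for your own dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a simple model on the training set
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluate on the held-out testing set
y_pred = model.predict(X_test)
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1 score :", f1_score(y_test, y_pred))
```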
9. Improve Your Model
Based on the evaluation, you may need to refine your model. This can involve tuning hyperparameters,
choosing a different model, or going back to data cleaning and preparation for further improvements.
10. Deploy Your Model
Once satisfied with your model's performance, deploy it for real-world use. This could mean integrating
it into an application or using it for decision-making within your organization.

Presenting Findings and Building Applications


• The team delivers final reports, briefings, code, and technical documents.
• In addition, the team may run a pilot project to implement the models in a production environment.
• The last stage of the data science process is where your soft skills will be most useful.
• It involves presenting your results to the stakeholders and industrializing your analysis process for repetitive reuse
and integration with other tools.
Build the Models
• To build the model, the data should be clean and its content properly understood. The components of
model building are as follows:
a) Selection of model and variable
b) Execution of model
c) Model diagnostic and model comparison
• Building a model is an iterative process. Most models consist of the following main steps:
1. Selection of a modeling technique and variables to enter in the model
2. Execution of the model
3. Diagnosis and model comparison
Model and Variable Selection
• For this phase, consider model performance and whether the project meets all the requirements to use the
model, as well as other factors:
1. Must the model be moved to a production environment and, if so, would it be easy to implement?
2. How difficult is the maintenance of the model: how long will it remain relevant if left untouched?
3. Does the model need to be easy to explain?
Model Execution
• Various programming languages can be used to implement the model. For model execution, Python
provides libraries like StatsModels or Scikit-learn. These packages use several of the most popular
techniques.
• Coding a model is a nontrivial task in most cases, so having these libraries available can speed up the
process. The following are remarks on the model output (see the StatsModels sketch after this list):
a) Model fit: R-squared or adjusted R-squared is used.
b) Predictor variables have a coefficient: For a linear model this is easy to interpret.
c) Predictor significance: Coefficients are great, but sometimes not enough evidence exists to show that
the influence is there.
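The following minimal StatsModels sketch illustrates those three remarks: the summary output reports R-squared (model fit), the coefficient of each predictor, and its p-value (predictor significance). The data are randomly generated for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Small made-up dataset: predict y from a single predictor x
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.5 * x + rng.normal(scale=0.5, size=100)

X = sm.add_constant(pd.DataFrame({"x": x}))  # add an intercept term
model = sm.OLS(y, X).fit()

# The summary reports R-squared (model fit), the coefficient of each
# predictor, and its p-value (predictor significance)
print(model.summary())
```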
• Linear regression works if we want to predict a value, but to classify something, classification models
are used. The k-nearest neighbors method is one of the most widely used.
• The following commercial tools are used:
1. SAS enterprise miner: This tool allows users to run predictive and descriptive models based on large
volumes of data from across the enterprise.
2. SPSS modeler: It offers methods to explore and analyze data through a GUI.
3. Matlab: Provides a high-level language for performing a variety of data analytics, algorithms and data
exploration.
4. Alpine miner: This tool provides a GUI front end for users to develop analytic workflows and interact
with Big Data tools and platforms on the back end.
• Open Source tools:
1. R and PL/R: PL/R is a procedural language for PostgreSQL with R.
2. Octave: A free software programming language for computational modeling, has some of the
functionality of Matlab.
3. WEKA: It is a free data mining software package with an analytic workbench. The functions created
in WEKA can be executed within Java code.
4. Python is a programming language that provides toolkits for machine learning and analysis.
5. SQL in-database implementations, such as MADlib, provide an alternative to in-memory desktop
analytical tools.
Model Diagnostics and Model Comparison
Try to build multiple models and then select the best one based on multiple criteria. Working with a holdout
sample helps the user pick the best-performing model.
• In the holdout method, the data is split into two different datasets labeled as a training and a testing
dataset. This can be a 60/40, 70/30, or 80/20 split. This technique is called the hold-out validation
technique.
Suppose we have a database with house prices as the dependent variable and two independent variables
showing the square footage of the house and the number of rooms. Now, imagine this dataset has 30
rows. The whole idea is that you build a model that can predict house prices accurately.
• To 'train' our model, or see how well it performs, we randomly subset 20 of those rows and fit the
model. The second step is to predict the values of the 10 rows that we excluded and measure how accurate
our predictions were.
• As a rule of thumb, experts suggest randomly sampling 80% of the data into the training set and 20%
into the test set.
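A minimal sketch of this house-price holdout experiment with scikit-learn is given below; the 30-row dataset is randomly generated, and the column names (sqft, rooms, price) are assumptions based on the description above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Made-up 30-row dataset with the two predictors described above
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "sqft":  rng.integers(600, 3000, size=30),
    "rooms": rng.integers(1, 6, size=30),
})
df["price"] = 50 * df["sqft"] + 10_000 * df["rooms"] + rng.normal(0, 5_000, 30)

# Hold out 10 rows for testing, fit on the remaining 20
train, test = train_test_split(df, test_size=10, random_state=42)
model = LinearRegression().fit(train[["sqft", "rooms"]], train["price"])

# Predict the 10 excluded rows and measure the error
preds = model.predict(test[["sqft", "rooms"]])
print("test RMSE:", mean_squared_error(test["price"], preds) ** 0.5)
```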
• The holdout method has two basic drawbacks:
1. It requires an extra dataset.
2. Because it is a single train-and-test experiment, the holdout estimate of the error rate will be misleading if we
happen to get an "unfortunate" split.
___________________________________________________________________________________
UNIT-II

Frequency Distribution is a tool in statistics that helps us organize the data and also helps us reach
meaningful conclusions. It tells us how often any specific values occur in the dataset. A frequency
distribution in a tabular form organizes data by showing the frequencies (the number of times values
occur) within a dataset.
A frequency distribution represents the pattern of how frequently each value of a variable appears in a
dataset. It shows the number of occurrences for each possible value within the dataset.
Let’s learn about Frequency Distribution including its definition, graphs, solved examples, and frequency
distribution table in detail.
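As a small illustration, the pandas sketch below builds a frequency distribution table (with relative and cumulative frequencies) from a made-up set of exam marks grouped into class intervals.

```python
import pandas as pd

# Hypothetical exam marks for 20 students
marks = pd.Series([12, 25, 33, 41, 47, 55, 58, 62, 64, 66,
                   68, 71, 74, 77, 80, 83, 85, 90, 94, 98])

# Frequency distribution with class intervals of width 20
bins = [0, 20, 40, 60, 80, 100]
freq = pd.cut(marks, bins=bins).value_counts().sort_index()

table = pd.DataFrame({
    "frequency": freq,
    "relative frequency": freq / freq.sum(),
    "cumulative frequency": freq.cumsum(),
})
print(table)
```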

What is Outlier?
Outliers, in the context of data analysis, are data points that deviate significantly from the other
observations in a dataset. These anomalies can show up as surprisingly high or low values, disrupting
the distribution of the data. For instance, in a dataset of monthly sales figures, if the sales for one month
are considerably higher than the sales for all the other months, that high sales figure would be
considered an outlier.
Why Removing Outliers is Necessary?
● Impact on Analysis: Outliers can have a disproportionate influence on statistical measures
like the mean, skewing the overall results and leading to misguided conclusions.
Removing outliers can help ensure the analysis is based on a more representative
sample of the data.
● Statistical Significance: Outliers can affect the validity and reliability of statistical
inferences drawn from the data. Removing outliers, when appropriate, can help maintain
the statistical significance of the analysis.
Identifying and accurately dealing with outliers is critical in data analysis to ensure the integrity
and accuracy of the results.
Types of Outliers
Outliers manifest in different forms, each presenting unique challenges:
● Univariate Outliers: These outliers occur when a point in a single variable substantially
deviates from the rest of the dataset. For example, if you are studying the heights of
adults in a certain area and most fall in the range of 5 feet 5 inches to 6 feet, a person who
measures 7 feet tall would be considered a univariate outlier.
● Multivariate Outliers: In contrast to univariate outliers, multivariate outliers involve
observations that are outliers in multiple variables simultaneously, highlighting
complicated relationships in the data. Continuing with our example, consider
evaluating height and weight together, and you discover a person who is especially tall and
relatively heavy compared to the rest of the population. This person would be
considered a multivariate outlier, as their characteristics in both height and
weight simultaneously deviate from the norm.
● Point Outliers: These are points that lie far away from the rest of the data.
For instance, in a dataset of typical household energy usage, a value that is
exceptionally high or low compared to the rest is a point outlier.
● Contextual Outliers: Sometimes known as conditional outliers, these are data points that
deviate from the norm only in a specific context or condition. For instance, a very low
temperature might be normal in winter but unusual in summer.
● Collective Outliers: These outliers consist of a set of data points that might not be extreme
by themselves but are unusual as a whole. This type of outlier often
signals a change in data behavior or an emergent phenomenon.
Main Causes of Outliers
Outliers can arise from various sources, making their detection vital:
● Data Entry Errors: Simple human errors in entering data can create extreme values.
● Measurement Error: Faulty instruments or problems with the experimental setup can cause abnormally high or low readings.
● Experimental Errors: Flaws in experimental design can produce data points that do not represent what they are supposed to measure.
● Intentional Outliers: In some cases, data might be manipulated deliberately to produce outlier
effects, often seen in fraud cases.
● Data Processing Errors: During the collection and processing stages, technical glitches can
introduce erroneous data.
● Natural Variation: Inherent variability in the underlying data can also lead to outliers.
How Can Outliers be Identified?
Identifying outliers is a vital step in data analysis, helping to uncover anomalies, errors, or valuable insights within datasets. One common approach is to use visualizations, where the data is represented graphically to highlight any points that deviate appreciably from the overall pattern. Techniques like box plots and scatter plots offer intuitive visual cues for recognizing outliers based on their position relative to the rest of the data.
Another approach uses statistical methods, such as the Z-score, the DBSCAN algorithm, or the Isolation Forest algorithm, which quantitatively assess how far data points deviate from the mean or identify outliers based on their density within the data space.
By combining visual inspection with statistical analysis, analysts can efficiently identify outliers and gain deeper insight into the underlying characteristics of the data.
1. Outlier Identification Using Visualizations
Visualizations offer insight into data distributions and anomalies. Visual tools such as scatter plots and box plots can effectively highlight data points that deviate notably from the majority. In a scatter plot, outliers often appear as points lying far from the main cluster or showing unusual patterns compared to the rest. Box plots give a clear depiction of the data's central tendency and spread, with outliers represented as individual points beyond the whiskers.
1.1 Identifying outliers with box plots
Box plots are valuable tools in data analysis for visually summarizing the distribution of a dataset. They are useful for outlier identification because they offer a concise illustration of key statistical measures such as the median, quartiles, and range. A box plot consists of a rectangular "box" that spans the interquartile range (IQR), with a line indicating the median. "Whiskers" extend from the box to the minimum and maximum values within a specified range, often set at 1.5 times the IQR. Any data points beyond those whiskers are considered potential outliers. These outliers, shown as individual points, can provide essential insight into the dataset's variability and potential anomalies. Thus, box plots serve as a visual aid in outlier detection, allowing analysts to pick out data points that deviate notably from the general pattern and warrant further investigation.
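To make this concrete, here is a minimal pandas/matplotlib sketch of the same idea; the monthly sales figures are hypothetical, and the numeric check simply applies the 1.5 × IQR rule that the whiskers use:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical monthly sales with one unusually high month
sales = pd.Series([120, 135, 128, 142, 130, 125, 138, 131, 127, 133, 129, 410])

plt.boxplot(sales)             # the extreme month appears as a point beyond the whisker
plt.title("Monthly sales")
plt.show()

# The same 1.5 * IQR rule, applied numerically
q1, q3 = sales.quantile(0.25), sales.quantile(0.75)
iqr = q3 - q1
print(sales[(sales < q1 - 1.5 * iqr) | (sales > q3 + 1.5 * iqr)])   # flags the 410 entry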
1.2 Identifying outliers with Scatter Plots
Scatter plots serve as vital tools for identifying outliers within datasets, particularly when exploring relationships between two continuous variables. These visualizations plot individual data points as dots on a graph, with one variable represented on each axis. Outliers in scatter plots often show up as points that deviate markedly from the overall pattern or trend observed among the majority of data points.
They might appear as isolated dots lying far from the main cluster, or exhibit unusual patterns compared to the bulk of the data. By visually inspecting scatter plots, analysts can quickly pinpoint potential outliers, prompting further investigation into their nature and their potential impact on the analysis. This preliminary identification lays the groundwork for deeper exploration and understanding of the data's behavior and distribution.
2. Outlier Identification using Statistical Methods
2.1 Identifying outliers with Z-Score
The Z-score, a widely used statistical measure, quantifies how many standard deviations a data point is from the mean of the dataset. In outlier detection using the Z-score, data points with Z-scores beyond a certain threshold (usually set at ±3) are considered outliers. A large positive or negative Z-score indicates that the data point is unusually far from the mean, signaling its potential outlier status. By calculating the Z-score for each data point, analysts can systematically identify outliers based on their deviation from the mean, providing a robust quantitative approach to outlier detection.
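A small illustrative sketch of this method in Python; the data is synthetic, with one extreme value appended, and the ±3 cut-off is the conventional threshold mentioned above:

import numpy as np

rng = np.random.default_rng(0)
# Roughly normal synthetic data plus one extreme value
data = np.append(rng.normal(loc=50, scale=5, size=200), 90.0)

# Z-score: how many standard deviations each point lies from the mean
z_scores = (data - data.mean()) / data.std()

# Points with |z| beyond 3 are flagged as potential outliers
outliers = data[np.abs(z_scores) > 3]
print(outliers)   # includes the appended 90.0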
2.2 Identifying outliers with DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that identifies outliers based on the density of data points in their neighborhood. Unlike traditional clustering algorithms that require specifying the number of clusters in advance, DBSCAN automatically determines clusters based on data density. Data points that fall outside dense clusters or fail to satisfy the density criteria are labeled as outliers. By analyzing the local density of data points, DBSCAN effectively identifies outliers in datasets with complex structure and varying densities, making it especially suitable for outlier detection in spatial data analysis and other applications.
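A minimal sketch of this approach using scikit-learn's DBSCAN on synthetic two-dimensional data; points labeled −1 are the ones DBSCAN treats as noise (outliers). The eps and min_samples values are illustrative and would normally be tuned to the dataset:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense clusters plus a few scattered points
cluster1 = rng.normal(loc=[0, 0], scale=0.3, size=(50, 2))
cluster2 = rng.normal(loc=[5, 5], scale=0.3, size=(50, 2))
noise = np.array([[2.5, 2.5], [8.0, 0.0], [-3.0, 6.0]])
X = np.vstack([cluster1, cluster2, noise])

# eps and min_samples define what counts as a "dense" neighbourhood;
# points belonging to no dense cluster receive the label -1
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
print(X[labels == -1])   # the scattered points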
2.3 Identifying outliers with Isolation Forest algorithm
The Isolation Forest algorithm is an anomaly detection method based on the idea of isolating outliers in a dataset. It constructs a random forest of decision trees and isolates outliers by recursively partitioning the dataset into subsets. Outliers are identified as instances that require fewer partitions to isolate them from the rest of the data. Since outliers are usually few in number and have attribute values that differ markedly from ordinary instances, they are more likely to be isolated early in the tree-building process. The Isolation Forest algorithm offers a scalable and efficient approach to outlier detection, especially in high-dimensional datasets, and is robust against the presence of irrelevant features.
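A short scikit-learn sketch of the Isolation Forest idea on synthetic data; the contamination value is an assumption about the expected fraction of anomalies:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Mostly "normal" observations plus three injected anomalies
X = np.vstack([rng.normal(0, 1, size=(300, 2)),
               [[6, 6], [-7, 5], [8, -6]]])

# fit_predict returns +1 for inliers and -1 for anomalies
iso = IsolationForest(contamination=0.01, random_state=42)
pred = iso.fit_predict(X)
print(X[pred == -1])   # typically the three injected points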
When Should You Remove Outliers?
Deciding when to remove outliers depends on the context of the analysis. Outliers should be removed when they are due to errors or anomalies that do not represent the true nature of the data. A few considerations for removing outliers are:
● Impact on Analysis: Removing outliers can affect statistical measures and model accuracy.
● Statistical Significance: Consider the consequences of outlier removal on the validity of the analysis.

Frequency Distribution

What is Frequency Distribution in Statistics?


A frequency distribution is an overview of all values of some variable and the number of times they
occur. It tells us how frequencies are distributed over the values. That is how many values lie between
different intervals. They give us an idea about the range where most values fall and the ranges where
values are scarce.
Frequency Distribution Graphs
To represent the Frequency Distribution, there are various methods such as Histogram, Bar Graph,
Frequency Polygon, and Pie Chart.
A brief description of all these graphs is as follows:

● Histogram: Represents the frequency of each interval of continuous data using bars of equal width. Used for analyzing the distribution of continuous data.
● Bar Graph: Represents the frequency of each category using bars of equal width; can also represent discrete data. Used for comparing discrete data categories.
● Frequency Polygon: Connects the midpoints of class frequencies using lines, similar to a histogram but without bars. Used for comparing various datasets.
● Pie Chart: A circular graph showing data as slices of a circle, indicating the proportional size of each slice relative to the whole dataset. Used for showing the relative sizes of data portions.

Frequency Distribution Table


A frequency distribution table is a way to organize and present data in a tabular form which helps us
summarize the large dataset into a concise table. In the frequency distribution table, there are two
columns one representing the data either in the form of a range or an individual data set and the other
column shows the frequency of each interval or individual.
For example, let’s say we have a dataset of students’ test scores in a class.

Test Score Frequency

0-20 6

20-40 12
40-60 22

60-80 15

80-100 5



Types of Frequency Distribution
There are four types of frequency distribution:

1. Grouped Frequency Distribution

2. Ungrouped Frequency Distribution

3. Relative Frequency Distribution

4. Cumulative Frequency Distribution

Grouped Frequency Distribution

In Grouped Frequency Distribution observations are divided between different intervals known as class
intervals and then their frequencies are counted for each class interval. This Frequency Distribution is
used mostly when the data set is very large.
Example: Make the Frequency Distribution Table for the ungrouped data given as follows:
23, 27, 21, 14, 43, 37, 38, 41, 55, 11, 35, 15, 21, 24, 57, 35, 29, 10, 39, 42, 27, 17, 45, 52, 31, 36, 39, 38,
43, 46, 32, 37, 25
Solution: As the observations lie between 10 and 57, we can choose the class intervals 10-20, 20-30, 30-40, 40-50, and 50-60. These class intervals cover all the observations, and we can count the frequency for each interval.
Thus, the Frequency Distribution Table for the given data is as follows:

Class Interval Frequency

10 – 20 5
20 – 30 8

30 – 40 12

40 – 50 6

50 – 60 3
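For reference, a grouped frequency table like this can also be produced in pandas with pd.cut; the marks below are hypothetical and used only to show the technique:

import pandas as pd

# Hypothetical marks, grouped into class intervals of width 20
marks = pd.Series([12, 35, 47, 58, 63, 71, 22, 39, 41, 55, 68, 74, 18, 49, 66])
table = pd.cut(marks, bins=[0, 20, 40, 60, 80, 100]).value_counts().sort_index()
print(table)   # frequency of each class interval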

Ungrouped Frequency Distribution

In Ungrouped Frequency Distribution, all distinct observations are mentioned and counted individually.
This Frequency Distribution is often used when the given dataset is small.
Example: Make the Frequency Distribution Table for the ungrouped data given as follows:
10, 20, 15, 25, 30, 10, 15, 10, 25, 20, 15, 10, 30, 25

Solution:
The unique observations in the given data are 10, 15, 20, 25, and 30, and we count the frequency of each.
Thus the Frequency Distribution Table of the given data is as follows:

Value Frequency

10 4

15 3

20 2

25 3
30 2

Relative Frequency Distribution

This distribution displays the proportion or percentage of observations in each interval or class. It is
useful for comparing different data sets or for analyzing the distribution of data within a set.
Relative Frequency is given by:
Relative Frequency = (Frequency of Event)/(Total Number of Events)

Example: Make the Relative Frequency Distribution Table for the following data:

Score Range 0-20 21-40 41-60 61-80 81-100

Frequency 5 10 20 10 5

Solution:
To Create the Relative Frequency Distribution table, we need to calculate Relative Frequency for each
class interval. Thus Relative Frequency Distribution table is given as follows:

Score Range Frequency Relative Frequency

0-20 5 5/50 = 0.10

21-40 10 10/50 = 0.20


41-60 20 20/50 = 0.40

61-80 10 10/50 = 0.20

81-100 5 5/50 = 0.10

Total 50 1.00
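The same relative frequencies can be computed by dividing each frequency by the total, as in this short pandas sketch based on the table above:

import pandas as pd

# Frequencies from the score-range table above (total = 50)
freq = pd.Series({"0-20": 5, "21-40": 10, "41-60": 20, "61-80": 10, "81-100": 5})

# Relative frequency = frequency / total number of observations
relative = freq / freq.sum()
print(relative)   # 0.10, 0.20, 0.40, 0.20, 0.10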

Cumulative Frequency Distribution

Cumulative frequency is defined as the sum of all the frequencies in the previous values or intervals up
to the current one. The frequency distributions which represent the frequency distributions using
cumulative frequencies are called cumulative frequency distributions. There are two types of cumulative
frequency distributions:

● Less than Type: We sum the frequencies of all class intervals up to and including the current one.
● More than Type: We sum the frequencies of the current class interval and all the intervals after it.


Let’s see how to represent a cumulative frequency distribution through an example,


Example: The table below gives the values of runs scored by Virat Kohli in the last 25 T-20 matches.
Represent the data in the form of less-than-type cumulative frequency distribution:
45 34 50 75 22

56 63 70 49 33

0 8 14 39 86

92 88 70 56 50

57 45 42 12 39

Solution:
Since there are a lot of distinct values, we’ll express this in the form of grouped distributions with
intervals like 0-10, 10-20 and so. First let’s represent the data in the form of grouped frequency
distribution.

Runs Frequency

0-10 2
10-20 2
20-30 1
30-40 4
40-50 4
50-60 5
60-70 1
70-80 3
80-90 2
90-100 1

Now we will convert this frequency distribution into a cumulative frequency distribution by summing up the values of the current interval and all the previous intervals.

Runs scored by Virat Kohli Cumulative Frequency

Less than 10 2

Less than 20 4

Less than 30 5
Less than 40 9

Less than 50 13

Less than 60 18

Less than 70 19

Less than 80 22

Less than 90 24

Less than 100 25

This table represents the cumulative frequency distribution of less than type.

Runs scored by Virat Kohli Cumulative Frequency

More than 0 25
More than 10 23

More than 20 21

More than 30 20

More than 40 16

More than 50 12

More than 60 7

More than 70 6

More than 80 3

More than 90 1

This table represents the cumulative frequency distribution of more than type.
We can plot both the type of cumulative frequency distribution to make the Cumulative Frequency
Curve.
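Both cumulative tables can be reproduced with a running sum; a short pandas sketch using the grouped frequencies above:

import pandas as pd

# Grouped frequencies of the runs scored (from the table above)
freq = pd.Series([2, 2, 1, 4, 4, 5, 1, 3, 2, 1],
                 index=["0-10", "10-20", "20-30", "30-40", "40-50",
                        "50-60", "60-70", "70-80", "80-90", "90-100"])

less_than = freq.cumsum()              # less-than-type cumulative frequency
more_than = freq[::-1].cumsum()[::-1]  # more-than-type cumulative frequency
print(less_than)   # 2, 4, 5, 9, 13, 18, 19, 22, 24, 25
print(more_than)   # 25, 23, 21, 20, 16, 12, 7, 6, 3, 1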
Frequency Distribution Curve
A frequency distribution curve, also known as a frequency curve, is a graphical representation of a data
set’s frequency distribution. It is used to visualize the distribution and frequency of values or
observations within a dataset. Let’s understand it’s different types based on the shape of it, as follows:

Frequency Distribution Curve Types


● Normal Distribution: Symmetric and bell-shaped; data concentrated around the mean.
● Skewed Distribution: Not symmetric; can be positively skewed (right-tailed) or negatively skewed (left-tailed).
● Bimodal Distribution: Two distinct peaks or modes in the frequency distribution, suggesting data from different populations.
● Multimodal Distribution: More than two distinct peaks or modes in the frequency distribution.
● Uniform Distribution: All values or intervals have roughly the same frequency, resulting in a flat, constant distribution.
● Exponential Distribution: Rapid drop-off in frequency as values increase, resembling an exponential function.
● Log-Normal Distribution: The logarithm of the data follows a normal distribution; often used for multiplicative, positively skewed data.

Frequency Distribution Formula


There are various formulas which can be learned in the context of Frequency Distribution, one such
formula is the coefficient of variation. This formula for Frequency Distribution is discussed below in
detail.
Coefficient of Variation
We can use mean and standard deviation to describe the dispersion in the values. But sometimes while
comparing the two series or frequency distributions becomes a little hard as sometimes both have
different units.
The coefficient of variation is defined as

C.V. = (σ / x̄) × 100

Where,
● σ represents the standard deviation
● x̄ represents the mean of the observations
Note: Data with greater C.V. is said to be more variable than the other. The series having lesser C.V. is
said to be more consistent than the other.
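A small Python sketch of the coefficient of variation; the two series of daily sales are hypothetical and chosen only to contrast a consistent series with a more variable one:

import numpy as np

# Hypothetical daily sales of two shops with roughly the same mean
shop_a = np.array([52, 55, 49, 60, 58, 51])
shop_b = np.array([30, 80, 45, 95, 20, 60])

def coefficient_of_variation(x):
    # C.V. = (standard deviation / mean) * 100
    return np.std(x) / np.mean(x) * 100

print(coefficient_of_variation(shop_a))   # smaller C.V. -> more consistent series
print(coefficient_of_variation(shop_b))   # larger C.V.  -> more variable series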

Comparing Two Frequency Distributions with the Same Mean


We have two frequency distributions. Let's say σ1 and x̄1 are the standard deviation and mean of the first series, and σ2 and x̄2 are the standard deviation and mean of the second series. The Coefficient of Variation (C.V.) is calculated as follows:

C.V. of first series = (σ1 / x̄1) × 100

C.V. of second series = (σ2 / x̄2) × 100

We are given that both series have the same mean, i.e., x̄2 = x̄1 = x̄. So now the C.V. for both series are:

C.V. of the first series = (σ1 / x̄) × 100

C.V. of the second series = (σ2 / x̄) × 100

Notice that now both series can be compared with the value of standard deviation only. Therefore, we
can say that for two series with the same mean, the series with a larger deviation can be considered more
variable than the other one.
Frequency Distribution Examples
Example 1: Suppose we have a series with a mean of 20 and a variance of 100. Find the Coefficient of Variation.
Solution:
We know the formula for the Coefficient of Variation:
C.V. = (σ / x̄) × 100
Given mean x̄ = 20 and variance σ² = 100, the standard deviation is σ = √100 = 10.
Substituting the values in the formula,
C.V. = (10 / 20) × 100 = 50
Example 2: Given two series with Coefficients of Variation 70 and 80. The means are 20 and 30. Find
the values of standard deviation for both series.
Solution:
In this question we need to apply the formula for C.V. and substitute the given values.
Standard deviation of the first series:
C.V. = (σ / x̄) × 100
70 = (σ / 20) × 100
1400 = σ × 100
σ = 14
Thus, the standard deviation of the first series = 14
Standard deviation of the second series:
C.V. = (σ / x̄) × 100
80 = (σ / 30) × 100
2400 = σ × 100
σ = 24
Thus, the standard deviation of the second series = 24
Example 3: Draw the frequency distribution table for the following data:
2, 3, 1, 4, 2, 2, 3, 1, 4, 4, 4, 2, 2, 2
Solution:
Since there are only very few distinct values in the series, we will plot the ungrouped frequency
distribution.

Value Frequency

1 2

2 6

3 2

4 4

Total 14

Example 4: The table below gives the values of temperature recorded in Hyderabad for 25 days in
summer. Represent the data in the form of less-than-type cumulative frequency distribution:

37 34 36 27 22
25 25 24 26 28

30 31 29 28 30

32 31 28 27 30

30 32 35 34 29

Solution:
Since there are so many distinct values here, we will use grouped frequency distribution. Let’s say the
intervals are 20-25, 25-30, 30-35. Frequency distribution table can be made by counting the number of
values lying in these intervals.

Temperature Number of Days

20-25 2

25-30 10

30-35 13

This is the grouped frequency distribution table. It can be converted into cumulative frequency
distribution by adding the previous values.

Temperature Number of Days

Less than 25 2
Less than 30 12

Less than 35 25

Example 5: Make a Frequency Distribution Table as well as the curve for the data:
{45, 22, 37, 18, 56, 33, 42, 29, 51, 27, 39, 14, 61, 19, 44, 25, 58, 36, 48, 30, 53, 41, 28, 35, 47, 21, 32,
49, 16, 52, 26, 38, 57, 31, 59, 20, 43, 24, 55, 17, 50, 23, 34, 60, 46, 13, 40, 54, 15, 62}
Solution:
To create the frequency distribution table for given data, let’s arrange the data in ascending order as
follows:
{13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62}
Now, we can count the observations for intervals: 10-20, 20-30, 30-40, 40-50, 50-60 and 60-70.

Interval Frequency

10 – 20 7

20 – 30 10

30 – 40 10

40 – 50 10

50 – 60 10

60 – 70 3

From this data, we can plot the Frequency Distribution Curve as follows:
A cumulative frequency is defined as the total of frequencies that are distributed over different class
intervals. It means that the data and the total are represented in the form of a table in which the
frequencies are distributed according to the class interval. In this article, we are going to discuss in detail
the cumulative frequency distribution, types of cumulative frequencies, and the construction of the
cumulative frequency distribution table with examples in detail.
What is Meant by Cumulative Frequency Distribution?
The cumulative frequency is the total of frequencies, in which the frequency of the first class interval is
added to the frequency of the second class interval and then the sum is added to the frequency of the
third class interval and so on. Hence, the table that represents the cumulative frequencies that are divided
over different classes is called the cumulative frequency table or cumulative frequency distribution.
Generally, the cumulative frequency distribution is used to identify the number of observations that lie
above or below the particular frequency in the provided data set.
Types of Cumulative Frequency Distribution
The cumulative frequency distribution is classified into two different types namely: less than ogive or
cumulative frequency and more/greater than cumulative frequency.
Less Than Cumulative Frequency:
The Less than cumulative frequency distribution is obtained by adding successively the frequencies of
all the previous classes along with the class against which it is written. In this type, the cumulate begins
from the lowest to the highest size.
Greater Than Cumulative Frequency:
The greater than cumulative frequency is also known as the more than type cumulative frequency. Here,
the greater than cumulative frequency distribution is obtained by determining the cumulative total
frequencies starting from the highest class to the lowest class.
Graphical Representation of Less Than and More Than Cumulative Frequency
Representation of cumulative frequency graphically is easy and convenient as compared to representing
it using a table, bar-graph, frequency polygon etc.
The cumulative frequency graph can be plotted in two ways:
1. Cumulative frequency distribution curve(or ogive) of less than type
2. Cumulative frequency distribution curve(or ogive) of more than type
Steps to Construct Less than Cumulative Frequency Curve
The steps to construct the less than cumulative frequency curve are as follows:
1. Mark the upper limit on the horizontal axis or x-axis.
2. Mark the cumulative frequency on the vertical axis or y-axis.
3. Plot the points (x, y) in the coordinate plane where x represents the upper limit value and y
represents the cumulative frequency.
4. Finally, join the points and draw the smooth curve.
5. The curve so obtained gives a cumulative frequency distribution graph of less than type.
To draw a cumulative frequency distribution graph of less than type, consider the following cumulative
frequency distribution table which gives the number of participants in any level of essay writing
competition according to their age:
Table 1 Cumulative Frequency distribution table of less than type

Level of Essay    Age Group (class interval)    Age Group    Number of Participants (Frequency)    Cumulative Frequency

Level 1 10-15 Less than 15 20 20

Level 2 15-20 Less than 20 32 52

Level 3 20-25 Less than 25 18 70

Level 4 25-30 Less than 30 30 100

On plotting corresponding points according to table 1, we have


Steps to Construct Greater than Cumulative Frequency Curve
The steps to construct the more than/greater than cumulative frequency curve are as follows:
1. Mark the lower limit on the horizontal axis.
2. Mark the cumulative frequency on the vertical axis.
3. Plot the points (x, y) in the coordinate plane where x represents the lower limit value, and y
represents the cumulative frequency.
4. Finally, draw the smooth curve by joining the points.
5. The curve so obtained gives the cumulative frequency distribution graph of more than type.
To draw a cumulative frequency distribution graph of more than type, consider the same cumulative
frequency distribution table, which gives the number of participants in any level of essay writing
competition according to their age:
Table 2 Cumulative Frequency distribution table of more than type

Level of Essay    Age Group (class interval)    Age Group    Number of Participants (Frequency)    Cumulative Frequency

Level 1 10-30 More than 10 20 100

Level 2 15-30 More than 15 32 80

Level 3 20-30 More than 20 18 48

Level 4 25-30 More than 25 30 30

On plotting these points, we get a curve as shown in the graph 2.


These graphs are helpful in figuring out the median of a given data set. The median can be found by
drawing both types of cumulative frequency distribution curves on the same graph. The value of the
point of intersection of both the curves gives the median of the given set of data. For the given table 1,
the median can be calculated as shown:

Example on Cumulative Frequency


Example:
Create a cumulative frequency table for the following information, which represent the number of hours
per week that Arjun plays indoor games:
Arjun’s game time:

Days No. of Hours

Monday 2 hrs

Tuesday 1 hr

Wednesday 2 hrs

Thursday 3 hrs

Friday 4 hrs

Saturday 2 hrs

Sunday 6 hrs
Solution:
Let the no. of hours be the frequency.
Hence, the cumulative frequency table is calculated as follows:

Days No. of Hours (Frequency) Cumulative Frequency

Monday 2 hrs 2

Tuesday 1 hr 2+1 = 3

Wednesday 2 hrs 3+2 = 5

Thursday 3 hrs 5+3 = 8

Friday 4 hrs 8+4 = 12

Saturday 2 hrs 12+2 = 14

Sunday 6 hrs 14+6 = 20

Therefore, Arjun spends 20 hours a week playing indoor games.


What is Nominal Data?
Nominal data is a type of data classification used in statistical analysis to categorize variables without
assigning any quantitative value. This form of data is identified by labels or names that serve the sole
purpose of distinguishing one group from another, without suggesting any form of hierarchy or order
among them. The essence of nominal data lies in its ability to organize data into discrete categories,
making it easier for researchers and analysts to sort, identify, and analyze variables based on qualitative
rather than quantitative attributes. Such categorization is fundamental in various research fields,
enabling the collection and analysis of data related to demographics, preferences, types, and other
non-numeric characteristics.

Nominal data example:


Types of Payment Methods - Credit Card, Debit Card, Cash, Electronic Wallet. Each payment method
represents a distinct category that helps in identifying consumer preferences in transactions without
implying any numerical value or order among the options.

Ordinal data example:


Customer Satisfaction Ratings - Very Unsatisfied, Unsatisfied, Neutral, Satisfied, Very Satisfied. This
classification not only categorizes responses but also implies a clear order or ranking from least to most
satisfied, distinguishing it from nominal data by introducing a hierarchy among the categories.
The significance of understanding what is nominal data extends beyond mere classification; it impacts
how data is interpreted and the statistical methods applied to it. Since nominal data does not imply any
numerical relationship or order among its categories, traditional measures of central tendency like mean
or median are not applicable.

Characteristics of Nominal Data


Nominal data, distinguished by its role in categorizing and labeling, has several defining characteristics
that set it apart from other data types. These characteristics are essential for researchers to understand as
they dictate how nominal data can be collected, analyzed, and interpreted. Below are the key
characteristics of nominal data:

● Categorical Classification:
Nominal data is used to categorize variables into distinct groups based on qualitative attributes,
without any numerical significance or inherent order.
● Mutually Exclusive:
Each data point can belong to only one category, ensuring clear and precise classification without
overlap between groups.
● No Order or Hierarchy:
The categories within nominal data do not have a ranked sequence or hierarchy; all categories
are considered equal but different.
● Identified by Labels:
Categories are often identified using names or labels, which can occasionally include numbers
used as identifiers rather than quantitative values.
● Limited Statistical Analysis:
Analysis of nominal data primarily involves counting frequency, determining mode, and using
chi-square tests, as measures of central tendency like mean or median are not applicable.

Analysis of Nominal Data


Analyzing nominal data involves techniques that are tailored to its qualitative nature and the
characteristics that define what is nominal data. Since nominal data categorizes variables without
implying any numerical value or order, the analysis focuses on identifying patterns, distributions, and
relationships within the categorical data. Here's how nominal data is typically analyzed:

● Frequency Distribution:
One of the most common methods of analyzing nominal data is to count the frequency of
occurrences in each category. This helps in understanding the distribution of data across the
different categories. For instance, in a nominal data example like survey responses on preferred
types of cuisine, frequency distribution would reveal how many respondents prefer each type of
cuisine.
● Mode Determination:
The mode, or the most frequently occurring category in the dataset, is a key measure of central
tendency that can be applied to nominal data. It provides insight into the most common or
popular category among the data points. For example, if analyzing nominal data on pet
ownership, the mode would indicate the most common type of pet among participants.
● Cross-tabulation:
Cross-tabulation involves comparing two or more nominal variables to identify relationships
between categories. This analysis can reveal patterns and associations that are not immediately
apparent. For instance, cross-tabulating nominal data on consumers' favorite fast-food chains
with their age groups could uncover preferences trends among different age demographics.
● Chi-square Test:
For more complex analysis involving nominal data, the chi-square test is used to examine the
relationships between two nominal variables. It tests whether the distribution of sample
categorical data matches an expected distribution. As an example, researchers might use a
chi-square test to analyze whether there is a significant association between gender (a nominal
data example) and preference for a particular brand of product.
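As a rough illustration of the last two points, the sketch below cross-tabulates two hypothetical nominal variables and runs a chi-square test of independence with scipy; the survey data is invented for the example:

import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical survey: gender vs. preferred brand (both nominal variables)
df = pd.DataFrame({
    "gender": ["M", "M", "F", "F", "F", "M", "F", "M", "F", "M"],
    "brand":  ["A", "B", "A", "A", "B", "B", "A", "A", "B", "B"],
})

# Cross-tabulation (contingency table) of the two nominal variables
table = pd.crosstab(df["gender"], df["brand"])

# Chi-square test of independence between the two categorical variables
chi2, p_value, dof, expected = chi2_contingency(table)
print(table)
print(f"chi2 = {chi2:.3f}, p-value = {p_value:.3f}")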

Examples
To illustrate the concept of nominal data more concretely, here are some practical examples that
showcase its application across various fields and contexts:

● Survey Responses on Favorite Color:


○ Categories:
Red, Blue, Green, Yellow, etc.
○ This nominal data example involves categorizing survey participants based on their
favorite color. Each color represents a distinct category without any implied hierarchy or
numerical value.
● Types of Pets Owned:
○ Categories:
Dog, Cat, Bird, Fish, None.
○ In a study on pet ownership, the types of pets individuals own are classified into separate
categories. Each category is mutually exclusive, highlighting the categorical nature of
nominal data.
● Vehicle Types in a Parking Lot:
○ Categories:
Car, Motorcycle, Bicycle, Truck.
○ Observing a parking lot to categorize vehicles by type is another nominal data example.
This involves identifying vehicles without assigning any order or quantitative assessment
to the categories.
● Nationality of Respondents in a Multinational Survey:
○ Categories:
American, Canadian, British, Australian, etc.
○ When conducting multinational surveys, researchers often categorize participants by
nationality. This classification is based solely on qualitative attributes, underscoring the
essence of what is nominal data.
Nominal Vs Ordinal Data
Understanding the difference between nominal and ordinal data is fundamental in the field of statistics
and research, as it influences the choice of analysis methods and how conclusions are drawn from data.
Here’s a comparison to highlight the key distinctions:

● Definition: Nominal data is categorized based on names or labels without any quantitative significance or inherent order; ordinal data is categorized into ordered categories that indicate a sequence or relative ranking.
● Nature: Nominal data is qualitative; ordinal data is qualitative, with an element of order.
● Order: Nominal data has no inherent order among categories; ordinal data has an inherent order or ranking among categories.
● Examples: Nominal: gender (Male, Female, Other), blood type (A, B, AB, O). Ordinal: satisfaction level (High, Medium, Low), education level (High School, Bachelor's, Master's, PhD).
● Quantitative Value: Nominal data has none; ordinal data implies it through the order of categories, but not precisely.
● Analysis Techniques: Nominal: frequency counts, mode, chi-square tests. Ordinal: median, percentile, rank correlation, non-parametric tests.
● Application: Nominal data is used for categorizing data without any need for ranking; ordinal data is used when classification requires a hierarchy or ranking.

Interpreting Distributions

1. Normal Distribution (Gaussian)

- Symmetric, bell-shaped
- Mean = Median = Mode
- Characteristics:
- Most data points cluster around mean
- Tails decrease exponentially
- 68% data within 1 standard deviation
- 95% data within 2 standard deviations
- Examples: Height, IQ scores, measurement errors

2. Skewed Distribution

- Asymmetric, tails on one side


- Types:
- Positive Skew: Tail on right side (e.g., income distribution, wealth distribution)
- Negative Skew: Tail on left side (e.g., failure time distribution, response times)
- Characteristics:
- Mean ≠ Median ≠ Mode
- Tails are longer on one side
- Data is concentrated on one side
- Examples: Income, wealth, failure times

3. Bimodal Distribution

- Two distinct peaks


- Characteristics:
- Two modes (local maxima)
- Valley between peaks
- Data has two distinct groups
- Examples: Customer segmentation, gene expression data

4. Multimodal Distribution

- Multiple peaks
- Characteristics:
- Multiple modes (local maxima)
- Multiple valleys
- Data has multiple distinct groups
- Examples: Gene expression data, text analysis

5. Uniform Distribution

- Equal probability across range


- Characteristics:
- Constant probability density
- No distinct modes or peaks
- Data is evenly distributed
- Examples: Random number generation, simulation studies
6. Exponential Distribution

- Rapid decline, long tail


- Characteristics:
- High probability of small values
- Low probability of large values
- Memoryless property
- Examples: Failure time analysis, reliability engineering

7. Power Law Distribution


- Heavy-tailed, few extreme values
- Characteristics:
- Few very large values
- Many small values
- Scale-free property
- Examples: City population sizes, word frequencies

8. Lognormal Distribution

- Log-transformed normal distribution


- Characteristics:
- Positive values only
- Skewed to right
- Logarithmic transformation yields normal distribution
- Examples: Stock prices, income distribution

9. Binomial Distribution

- Discrete, two outcomes


- Characteristics:
- Fixed number of trials (n)
- Probability of success (p)
- Number of successes (k)
- Examples: Coin toss, medical diagnosis
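To see several of these shapes side by side, one can sample from the corresponding distributions with NumPy and plot histograms; a minimal sketch with arbitrary parameters:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
samples = {
    "Normal":      rng.normal(loc=0, scale=1, size=10_000),
    "Uniform":     rng.uniform(low=0, high=1, size=10_000),
    "Exponential": rng.exponential(scale=1.0, size=10_000),
    "Lognormal":   rng.lognormal(mean=0, sigma=0.5, size=10_000),
    "Binomial":    rng.binomial(n=10, p=0.5, size=10_000),
}

fig, axes = plt.subplots(1, 5, figsize=(18, 3))
for ax, (name, data) in zip(axes, samples.items()):
    ax.hist(data, bins=40)     # the histogram approximates the distribution's shape
    ax.set_title(name)
plt.tight_layout()
plt.show()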
What are Data Types?
Data is largely divided into two major categories, quantitative and qualitative. They are further divided
into other parts. Refer to the graph given below for reference –
Types of Data
● Quantitative data: This type of data consists of numerical values that can be measured or
counted. Examples include time, speed, temperature, and the number of items.
● Qualitative data: This type includes non-numerical values representing qualities or attributes.
Examples include colors, yes or no responses, and opinions.
Types of Quantitative Data
● Discrete data: This refers to separate and distinct values, typically counting numbers. For
instance, the numbers on a dice or the number of students in a class are discrete data points.
● Continuous data: This type of data can take on any value within a range and be measured with
high precision. Examples include height, weight, and temperature.
Data Types Based on Level of Measurement
Data can be further classified into four types based on the level of measurement: nominal, ordinal,
interval, and ratio.
● Nominal data: This represents categorical information without any inherent order or ranking.
Examples include gender, religion, or marital status.
● Ordinal data: This type has a defined order or ranking among the values. Examples include exam
grades (A, B, C) or positions in a competition (1st place, 2nd place, 3rd place).
● Interval data: Interval data has a defined order and equal intervals between the values. An
example is the Celsius temperature scale, where the difference between 30°C and 20°C is the
same as the difference between 20°C and 10°C.
● Ratio data: Ratio data possesses all the characteristics of interval data but has a meaningful zero
point. In addition to setting up inequalities, ratios can also be formed with this data type.
Examples include height, weight, or income.
What is Measure of Central Tendency?
We should first understand the term Central Tendency. Data tend to accumulate around the average value
of the total data under consideration. Measures of central tendency will help us to find the middle, or the
average, of a data set. If most of the data is centrally located and there is a very small spread, it will form a symmetric bell curve. In such conditions the values of mean, median and mode are equal.
Mean, Median, Mode
Let’s understand the definition and role of mean, median and mode with the help of examples –
Mean
It is the average of the values. Consider three temperature values, 30 °C, 40 °C and 50 °C; then the mean is (30 + 40 + 50)/3 = 40 °C.
Median
It is the centrally located value of the data set sorted in ascending order. Consider 11 (ODD) values
1,2,3,7,8,3,2,5,4,15,16. We first sort the values in ascending order 1,2,2,3,3,4,5,7,8,15,16 then the
median is 4 which is located at the 6th number and will have 5 numbers on either side.

If the data set is having an even number of values then the median can be found by taking the average of
the two middle values. Consider 10 (EVEN) values 1,2,3,7,8,3,2,5,4,15. We first sort the values in
ascending order 1,2,2,3,3,4,5,7,8,15 then the median is (3+4)/2=3.5 which is the average of the two
middle values i.e. the values which are located at the 5th and 6th number in the sequence and will have 4
numbers on either side.
Mode
It is the most frequent value in the data set. We can easily get the mode by counting the frequency of
occurrence. Consider a data set with the values 1,5,5,6,8,2,6,6. In this data set, we can observe the
following,

The value 6 occurs the most hence the mode of the data set is 6.
We often test our data by plotting the distribution curve, if most of the values are centrally located and
very few values are off from the center then we say that the data is having a normal distribution. At that
time the values of mean, median, and mode are almost equal.
However, when our data is skewed, for example, as with the right-skewed data set below:

We can say that the mean is being dragged in the direction of the skew. In this skewed distribution, mode
< median < mean. The more skewed the distribution, the greater the difference between the median and
mean, here we consider median for the conclusion. The best example of the right-skewed distribution is
salaries of employees, where higher-earners provide a false representation of the typical income if
expressed as mean salaries and not the median salaries.
For left-skewed distribution mean < median < mode. In such a case also, we emphasize the median
value of the distribution.
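A quick way to see this effect is to generate right-skewed data and compare its mean and median; a small NumPy sketch using lognormal values as stand-in salaries:

import numpy as np

rng = np.random.default_rng(0)
# Right-skewed "salaries": a few large values drag the mean upward
salaries = rng.lognormal(mean=10, sigma=0.8, size=1000)

print(np.mean(salaries))     # pulled toward the long right tail
print(np.median(salaries))   # noticeably smaller than the mean here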

Mean, Median & Mode Example


To understand this let us consider an example. An OTT platform company has conducted a survey in a
particular region based on the watch time, language of streaming, and age of the viewer. For our
understanding, we have taken a sample of 10 people.
import pandas as pd

# Load the sample survey data (Watch Time, Age, Language for 10 viewers)
df = pd.read_csv("viewer.csv")
df
df["Watch Time"].mean()
2.5

df["Watch Time"].mode()
0 1.5
dtype: float64

df["Watch Time"].median()
2.0
From these values we can conclude that the mean watch time is 2.5 hours, which appears reasonably correct. For the age of the viewers, the following results are obtained:
df["Age"].median()
12.5

df["Age"].mean()
19.9

df["Age"].mode()
0 12
1 15
dtype: int64
The value of the mean age looks somewhat removed from the actual data: most of the viewers are in the range of 10 to 15, while the mean comes out to 19.9. This is because of the outliers present in the dataset. We can easily find the outliers using a boxplot.
import seaborn as sns

sns.boxplot(df['Age'], orient='vertical')

If we observe the value of the median age, the result looks correct; the mean is very sensitive to outliers.
For the most popular language, we cannot calculate the mean or median since this is nominal data.
sns.barplot(x="Language",y="Age",data=df)
sns.barplot(x="Language",y="Watch Time",data=df)

If we observe the graphs, the Tamil bar is the largest in both the Language vs Age and Language vs Watch Time plots. But this is misleading, because there is only one person who watches shows in Tamil.
df["Language"].value_counts()
Hindi 4
English 3
Tamil 1
Telgu 1
Marathi 1
Name: Language, dtype: int64

df["Language"].mode()
0 Hindi
dtype: object
Result
From the above results, we conclude that the most popular language is Hindi; this is observed when we find the mode of the dataset.
Hence, from the sample survey we conclude that the typical viewer is about 12.5 years old (the median age) and watches shows in Hindi for about 2.5 hours daily.
There is no single best measure of central tendency, because the choice always depends on the type of data. For ordinal, interval, and ratio data (if skewed) we prefer the median; for nominal data, the mode is preferred; and for interval and ratio data (if not skewed), the mean is preferred.
Measures of Central Tendency and Dispersion
Dispersion measures indicate how data values are spread out. The range, which is the difference between
the highest and lowest values, is a simple measure of dispersion. The standard deviation measures the
expected difference between a data value and the mean.
________________________________________________________________________________
UNIT -III

Normal distributions

Normal Distribution is the most common or normal form of distribution of Random Variables, hence the
name “normal distribution.” It is also called Gaussian Distribution in Statistics or Probability. We use
this distribution to represent a large number of random variables. It serves as a foundation for statistics
and probability theory.
It also describes many natural phenomena, forms the basis of the Central Limit Theorem, and also
supports numerous statistical methods.
The normal distribution is the most important and most widely used distribution in statistics. It is
sometimes called the “bell curve,” although the tonal qualities of such a bell would be less than pleasing.
It is also called the "Gaussian curve" or Gaussian distribution, after the mathematician Carl Friedrich Gauss.
Strictly speaking, it is not correct to talk about “the normal distribution” since there are many normal
distributions. Normal distributions can differ in their means and in their standard deviations. Figure 4.1
shows three normal distributions. The blue (left-most) distribution has a mean of −3 and a standard
deviation of 0.5, the distribution in red (the middle distribution) has a mean of 0 and a standard deviation
of 1, and the black (right-most) distribution has a mean of 2 and a standard deviation of 3. These as well
as all other normal distributions are symmetric with relatively more values at the center of the
distribution and relatively few in the tails. What is consistent about all normal distribution is the shape
and the proportion of scores within a given distance along the x-axis. We will focus on the standard
normal distribution (also known as the unit normal distribution), which has a mean of 0 and a standard
deviation of 1 (i.e., the red distribution in Figure 4.1).
Figure 4.1. Normal distributions differing in mean and standard deviation. (“Normal Distributions with
Different Means and Standard Deviations” by Judy Schmitt is licensed under CC BY-NC-SA 4.0.)

Seven features of normal distributions are listed below.

1. Normal distributions are symmetric around their mean.


2. The mean, median, and mode of a normal distribution are equal.
3. The area under the normal curve is equal to 1.0.
4. Normal distributions are denser in the center and less dense in the tails.

5. Normal distributions are defined by two parameters, the mean (μ) and the standard deviation (σ).
6. 68% of the area of a normal distribution is within one standard deviation of the mean.
7. Approximately 95% of the area of a normal distribution is within two standard deviations
of the mean.
These properties enable us to use the normal distribution to understand how scores relate to one another
within and across a distribution. But first, we need to learn how to calculate the standardized score that
makes up a standard normal distribution.

Z Scores
A z score is a standardized version of a raw score (x) that gives information about the relative location
of that score within its distribution. The formula for converting a raw score into a z score is

z = (x − μ) / σ

for values from a population and

z = (x − x̄) / s

for values from a sample.

As you can see, z scores combine information about where the distribution is located (the mean/center)
with how wide the distribution is (the standard deviation/spread) to interpret a raw score (x).
Specifically, z scores will tell us how far the score is away from the mean in units of standard deviations
and in what direction.

The value of a z score has two parts: the sign (positive or negative) and the magnitude (the actual
number). The sign of the z score tells you in which half of the distribution the z score falls: a positive
sign (or no sign) indicates that the score is above the mean and on the right-hand side or upper end of the
distribution, and a negative sign tells you the score is below the mean and on the left-hand side or lower
end of the distribution. The magnitude of the number tells you, in units of standard deviations, how far
away the score is from the center or mean. The magnitude can take on any value between negative and
positive infinity, but for reasons we will see soon, they generally fall between −3 and 3.

Let’s look at some examples. A z score value of −1.0 tells us that this z score is 1 standard deviation
(because of the magnitude 1.0) below (because of the negative sign) the mean. Similarly, a z score value
of 1.0 tells us that this z score is 1 standard deviation above the mean. Thus, these two scores are the
same distance away from the mean but in opposite directions. A z score of −2.5 is two-and-a-half
standard deviations below the mean and is therefore farther from the center than both of the previous
scores, and a z score of 0.25 is closer than all of the ones before. In Unit 2, we will learn to formalize the
distinction between what we consider “close to” the center or “far from” the center. For now, we will use
a rough cut-off of 1.5 standard deviations in either direction as the difference between close scores
(those within 1.5 standard deviations or between z = −1.5 and z = 1.5) and extreme scores (those farther
than 1.5 standard deviations—below z = −1.5 or above z = 1.5).

We can also convert raw scores into z scores to get a better idea of where in the distribution those scores
fall. Let’s say we get a score of 68 on an exam. We may be disappointed to have scored so low, but
perhaps it was just a very hard exam. Having information about the distribution of all scores in the class
would be helpful to put some perspective on ours. We find out that the class got an average score of 54
with a standard deviation of 8. To find out our relative location within this distribution, we simply
convert our test score into a z score:

z = (68 − 54) / 8 = 14 / 8 = 1.75

We find that we are 1.75 standard deviations above the average, above our rough cut-off for close and
far. Suddenly our 68 is looking pretty good!
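The same calculation takes a couple of lines of Python, using the class mean and standard deviation given above:

# z = (x - mean) / standard deviation
x, mean, sd = 68, 54, 8
z = (x - mean) / sd
print(z)   # 1.75 -> 1.75 standard deviations above the class average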

Figure 4.2 shows both the raw score and the z score on their respective distributions. Notice that the red
line indicating where each score lies is in the same relative spot for both. This is because transforming a
raw score into a z score does not change its relative location, it only makes it easier to know precisely
where it is.
Figure 4.2. Raw and standardized versions of a single score. (“Raw and Standardized Versions of a
Score” by Judy Schmitt is licensed under CC BY-NC-SA 4.0.)
z Scores are also useful for comparing scores from different distributions. Let’s say we take the SAT and
score 501 on both the math and critical reading sections. Does that mean we did equally well on both?
Scores on the math portion are distributed normally with a mean of 511 and standard deviation of 120,
so our z score on the math section is

z_math = (501 − 511) / 120 = −0.08

which is just slightly below average (note the use of “math” as a subscript; subscripts are used when
presenting multiple versions of the same statistic in order to know which one is which and have no
bearing on the actual calculation). The critical reading section has a mean of 495 and standard deviation
of 116, so

z_reading = (501 − 495) / 116 = 0.05.

So even though we were almost exactly average on both tests, we did a little bit better on the critical
reading portion relative to other people.

Finally, z scores are incredibly useful if we need to combine information from different measures that
are on different scales. Let’s say we give a set of employees a series of tests on things like job
knowledge, personality, and leadership. We may want to combine these into a single score we can use to
rate employees for development or promotion, but look what happens when we take the average of raw
scores from different scales, as shown in Table 4.1.

Table 4.1. Raw test scores on different scales (ranges in parentheses).

Employee      Job Knowledge (0–100)    Personality (1–5)    Leadership (1–5)    Average
Employee 1    98                       4.2                  1.1                 34.43
Employee 2    96                       3.1                  4.5                 34.53
Employee 3    97                       2.9                  3.6                 34.50

Because the job knowledge scores were so big and the scores were so similar, they overpowered the
other scores and removed almost all variability in the average. However, if we standardize these scores
into z scores, our averages retain more variability and it is easier to assess differences between
employees, as shown in Table 4.2.

Table 4.2. Standardized scores.


Employee      Job Knowledge    Personality    Leadership    Average
Employee 1     1.00             1.14          −1.12          0.34
Employee 2    −1.00            −0.43           0.81         −0.20
Employee 3     0.00            −0.71           0.30         −0.14

Setting the Scale of a Distribution


Another convenient characteristic of z scores is that they can be converted into any “scale” that we
would like. Here, the term scale means how far apart the scores are (their spread) and where they are
located (their central tendency). This can be very useful if we don’t want to work with negative numbers
or if we have a specific range we would like to present. The formulas for transforming z to x are:

x = zσ + μ

for a population and

x = zs + x̄

for a sample. Notice that these are just simple rearrangements of the original formulas for calculating z
from raw scores.

Let’s say we create a new measure of intelligence, and initial calibration finds that our scores have a
mean of 40 and standard deviation of 7. Three people who have scores of 52, 43, and 34 want to know
how well they did on the measure. We can convert their raw scores into z scores:

z = (52 − 40) / 7 = 1.71
z = (43 − 40) / 7 = 0.43
z = (34 − 40) / 7 = −0.86

A problem is that these new z scores aren’t exactly intuitive for many people. We can give people
information about their relative location in the distribution (for instance, the first person scored well
above average), or we can translate these z scores into the more familiar metric of IQ scores, which have
a mean of 100 and standard deviation of 16:

IQ = 1.71(16) + 100 = 127.36

IQ = 0.43(16) + 100 = 106.88

IQ = −0.86(16) + 100 = 86.24

We would also likely round these values to 127, 107, and 86, respectively, for convenience.
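A short sketch of this rescaling in Python, using the hypothetical intelligence measure above (raw mean 40, SD 7) and the IQ metric (mean 100, SD 16):

raw_mean, raw_sd = 40, 7
iq_mean, iq_sd = 100, 16

for raw in (52, 43, 34):
    z = (raw - raw_mean) / raw_sd     # standardize the raw score
    iq = z * iq_sd + iq_mean          # x = z * sd + mean on the new scale
    print(f"raw {raw} -> z {z:.2f} -> IQ {round(iq)}")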

Z Scores and the Area under the Curve


z Scores and the standard normal distribution go hand-in-hand. A z score will tell you exactly where in
the standard normal distribution a value is located, and any normal distribution can be converted into a
standard normal distribution by converting all of the scores in the distribution into z scores, a process
known as standardization.

We saw in Chapter 3 that standard deviations can be used to divide the normal distribution: 68% of the
distribution falls within 1 standard deviation of the mean, 95% within (roughly) 2 standard deviations,
and 99.7% within 3 standard deviations. Because z scores are in units of standard deviations, this means
that 68% of scores fall between z = −1.0 and z = 1.0 and so on. We call this 68% (or any percentage we
have based on our z scores) the proportion of the area under the curve. Any area under the curve is
bounded by (defined by, delineated by, etc.) a single z score or pair of z scores.

An important property to point out here is that, by virtue of the fact that the total area under the curve of
a distribution is always equal to 1.0 (see section on Normal Distributions at the beginning of this
chapter), these areas under the curve can be added together or subtracted from 1 to find the proportion in
other areas. For example, we know that the area between z = −1.0 and z = 1.0 (i.e., within one standard
deviation of the mean) contains 68% of the area under the curve, which can be represented in decimal
form as .6800. (To change a percentage to a decimal, simply move the decimal point 2 places to the left.)
Because the total area under the curve is equal to 1.0, that means that the proportion of the area outside z
= −1.0 and z = 1.0 is equal to 1.0 − .6800 = .3200 or 32% (see Figure 4.3). This area is called the area in
the tails of the distribution. Because this area is split between two tails and because the normal
distribution is symmetrical, each tail has exactly one-half, or 16%, of the area under the curve.
Figure 4.3. Shaded areas represent the area under the curve in the tails. (“Area under the Curve in the
Tails” by Judy Schmitt is licensed under CC BY-NC-SA 4.0.)

We will have much more to say about this concept in the coming chapters. As it turns out, this is a quite
powerful idea that enables us to make statements about how likely an outcome is and what that means
for research questions we would like to answer and hypotheses we would like to test.

Why do normal distributions matter?


All kinds of variables in natural and social sciences are normally or approximately normally

distributed. Height, birth weight, reading ability, job satisfaction, or SAT scores are just a few

examples of such variables.

Because normally distributed variables are so common, many statistical tests are designed for

normally distributed populations.

Understanding the properties of normal distributions means you can use inferential statistics to

compare different groups and make estimates about populations using samples.
What are the properties of normal distributions?
Normal distributions have key characteristics that are easy to spot in graphs:

● The mean, median and mode are exactly the same.

● The distribution is symmetric about the mean—half the values fall below the mean and

half above the mean.

● The distribution can be described by two values: the mean and the standard deviation.

The mean is the location parameter while the standard deviation is the scale parameter.

The mean determines where the peak of the curve is centered. Increasing the mean moves

the curve right, while decreasing it moves the curve left.


The standard deviation stretches or squeezes the curve. A small standard deviation results in

a narrow curve, while a large standard deviation leads to a wide curve.


Empirical rule

The empirical rule, or the 68-95-99.7 rule, tells you where most of your values lie in a normal distribution:

● Around 68% of values are within 1 standard deviation from the mean.
● Around 95% of values are within 2 standard deviations from the mean.
● Around 99.7% of values are within 3 standard deviations from the mean.

Example: Using the empirical rule in a normal distribution
You collect SAT scores from students in a new test preparation course. The data follows a normal distribution with a mean score (M) of 1150 and a standard deviation (SD) of 150.

Following the empirical rule:

● Around 68% of scores are between 1,000 and 1,300 (1 standard deviation above and below the mean).
● Around 95% of scores are between 850 and 1,450 (2 standard deviations above and below the mean).
● Around 99.7% of scores are between 700 and 1,600 (3 standard deviations above and below the mean).

The empirical rule is a quick way to get an overview of your data and check for any outliers or extreme values that don't follow this pattern.

If data from small samples do not closely follow this pattern, then other distributions like the t-distribution may be more appropriate. Once you identify the distribution of your variable, you can apply appropriate statistical tests.
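
The SAT intervals above are easy to compute directly. A minimal sketch in plain Python, using the example's mean of 1150 and standard deviation of 150:

Example (Python):

mean, sd = 1150, 150

for k, pct in [(1, "68%"), (2, "95%"), (3, "99.7%")]:
    lower, upper = mean - k * sd, mean + k * sd
    print(f"About {pct} of scores fall between {lower} and {upper}")

# About 68% of scores fall between 1000 and 1300
# About 95% of scores fall between 850 and 1450
# About 99.7% of scores fall between 700 and 1600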

Central limit theorem

The central limit theorem is the basis for how normal distributions work in statistics.

In research, to get a good idea of a population mean, ideally you would collect data from multiple random samples within the population. A sampling distribution of the mean is the distribution of the means of these different samples.

The central limit theorem shows the following:

● Law of large numbers: As you increase the sample size (or the number of samples), the sample mean approaches the population mean.
● With multiple large samples, the sampling distribution of the mean is normally distributed, even if your original variable is not normally distributed.

Parametric statistical tests typically assume that samples come from normally distributed populations, but the central limit theorem means that this assumption does not need to be met when you have a large enough sample.

You can use parametric tests for large samples from populations with any kind of distribution, as long as other important assumptions are met. A sample size of 30 or more is generally considered large.

For small samples, the assumption of normality is important because the sampling distribution of the mean is not known. For accurate results, you have to be sure that the population is normally distributed before you can use parametric tests with small samples.
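
The central limit theorem is easy to see in a small simulation. The sketch below assumes NumPy is available; the skewed exponential population, the sample size of 50, and the 2,000 repeated samples are arbitrary choices for illustration, not part of the theorem itself.

Example (Python):

import numpy as np

rng = np.random.default_rng(0)

# A strongly skewed (non-normal) population: exponential with mean 10
population = rng.exponential(scale=10, size=100_000)

# Draw many samples and record each sample's mean
sample_means = np.array(
    [rng.choice(population, size=50).mean() for _ in range(2000)]
)

print(population.mean())     # population mean, about 10
print(sample_means.mean())   # close to the population mean
print(sample_means.std())    # roughly the population sd divided by sqrt(50)
# A histogram of sample_means would look approximately bell-shaped,
# even though the population itself is heavily skewed.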

Formula of the normal curve

Once you have the mean and standard deviation of a normal distribution, you can fit a normal curve to your data using a probability density function.

In a probability density function, the area under the curve tells you probability. The normal distribution is a probability distribution, so the total area under the curve is always 1, or 100%.

The formula for the normal probability density function looks fairly complicated, but to use it you only need to know the population mean and standard deviation. For any value of x, you can plug the mean and standard deviation into the formula to find the probability density of the variable taking on that value of x.

Normal probability density formula:

f(x) = (1 / √(2πσ²)) · e^( −(x − μ)² / (2σ²) )

Explanation:

● f(x) = probability density
● x = value of the variable
● μ = mean
● σ = standard deviation
● σ² = variance
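
To make the formula concrete, the sketch below (assuming NumPy and SciPy) implements the density by hand and checks it against scipy.stats.norm.pdf, using the SAT example's mean of 1150 and standard deviation of 150.

Example (Python):

import numpy as np
from scipy.stats import norm

def normal_pdf(x, mu, sigma):
    # f(x) = 1 / sqrt(2*pi*sigma^2) * exp(-(x - mu)^2 / (2*sigma^2))
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

mu, sigma = 1150, 150
print(normal_pdf(1380, mu, sigma))          # density at x = 1380, computed by hand
print(norm.pdf(1380, loc=mu, scale=sigma))  # the same value from SciPy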

Example: Using the probability density function
You want to know the probability that SAT scores in your sample exceed 1380.

On your graph of the probability density function, the probability is the shaded area under the curve that lies to the right of where your SAT scores equal 1380.

You can find the probability value of this score using the standard normal distribution.

What is the standard normal distribution?

The standard normal distribution, also called the z-distribution, is a special normal distribution where the mean is 0 and the standard deviation is 1.

Every normal distribution is a version of the standard normal distribution that has been stretched or squeezed and moved horizontally right or left.

While individual observations from normal distributions are referred to as x, they are referred to as z in the z-distribution. Every normal distribution can be converted to the standard normal distribution by turning the individual values into z-scores.

Z-scores tell you how many standard deviations away from the mean each value lies. You only need to know the mean and standard deviation of your distribution to find the z-score of a value.

Z-score formula:

z = (x − μ) / σ

Explanation:

● x = individual value
● μ = mean
● σ = standard deviation
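
A minimal sketch of the conversion in plain Python, using the SAT example's mean and standard deviation:

Example (Python):

def z_score(x, mu, sigma):
    # Number of standard deviations that x lies from the mean
    return (x - mu) / sigma

print(z_score(1380, 1150, 150))  # about 1.53
print(z_score(1000, 1150, 150))  # -1.0, one standard deviation below the mean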

We convert normal distributions into the standard normal distribution for several reasons:

● To find the probability of observations in a distribution falling above or below a given value.
● To find the probability that a sample mean significantly differs from a known population mean.
● To compare scores on different distributions with different means and standard deviations.

Finding probability using the z-distribution

Each z-score is associated with a probability, or p-value, that tells you the likelihood of values below that z-score occurring. If you convert an individual value into a z-score, you can then find the probability of all values up to that value occurring in a normal distribution.

Example: Finding probability using the z-distribution
To find the probability of SAT scores in your sample exceeding 1380, you first find the z-score.

The mean of our distribution is 1150, and the standard deviation is 150. The z-score tells you how many standard deviations away 1380 is from the mean.

Formula: z = (x − μ) / σ
Calculation: z = (1380 − 1150) / 150 ≈ 1.53

For a z-score of 1.53, the p-value is 0.937. This is the probability of SAT scores being 1380 or less (93.7%), and it is the area under the curve to the left of the shaded area.

To find the shaded area, you subtract 0.937 from 1, the total area under the curve:

Probability of x > 1380 = 1 − 0.937 = 0.063

That means only about 6.3% of SAT scores in your sample are expected to exceed 1380.
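
The same result can be obtained directly in Python. The sketch below (assuming SciPy) uses the standard normal cumulative distribution function for the z-score and, equivalently, the survival function of the original distribution.

Example (Python):

from scipy.stats import norm

z = (1380 - 1150) / 150                  # about 1.53
p_below = norm.cdf(z)                    # about 0.937
p_above = 1 - p_below                    # about 0.063

# Equivalent, without converting to a z-score first:
p_above_direct = norm.sf(1380, loc=1150, scale=150)

print(round(p_above, 3), round(p_above_direct, 3))  # 0.063 0.063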
