Exploratory Data Analysis
Understanding your data through visualization and statistics
Presented By:
Prof. Tithirupa Tapaswini
Assistant Professor
Department of CSE
Medicaps University Indore
WHAT IS DATA?
Data is a set of values of subjects with respect to qualitative or
quantitative variables.
Data is raw, unorganized facts that need to be processed. Data can
be something simple and seemingly random and useless until it is
organized.
When data is processed, organized, structured or presented in a
given context so as to make it useful, it is called information.
Information needed for research activities can be obtained in
different forms.
TYPES OF DATA:
Structured Data:
Unstructured Data:
Semi-structured Data:
Structured Data:
Data that is factual, to the point, and highly organized is referred to as structured
data. It is typically quantitative in nature, meaning it contains measurable values such
as numbers, dates, and times.
Unstructured Data:
Unstructured data is data that lacks any predefined model or format. Examples include
free-form files, log files, audio files, and image files. Some organizations hold a
great deal of such data but do not know how to derive value from it because the data
is raw.
Semi-structured Data:
Semi-structured data refers to data that is not captured or formatted in conventional ways.
It does not follow a tabular data model or the format of a relational database because it
has no fixed schema, although it usually carries tags or keys that describe its fields
(for example, JSON or XML).
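As a minimal sketch of what "no fixed schema" means in practice, the Python snippet below parses two JSON records that carry different fields and then flattens them into table-like rows. The records and field names are invented for illustration.

```python
import json

# Two hypothetical records: each one describes its own fields,
# but the two do not share a fixed schema.
raw = """
[
  {"id": 1, "name": "Asha", "email": "asha@example.com"},
  {"id": 2, "name": "Ravi", "phones": ["555-0101", "555-0102"]}
]
"""

records = json.loads(raw)

# Collect every field that appears in at least one record.
all_fields = sorted({key for record in records for key in record})

# Flatten into table-like rows, using None where a record lacks a field.
rows = [[record.get(field) for field in all_fields] for record in records]

print(all_fields)
for row in rows:
    print(row)
```

The None entries show the price of forcing semi-structured data into a rectangular shape: fields missing from a record become gaps in the table.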
Structured Data
When we talk about structured data, we are usually talking about tabular (rectangular)
data, i.e. rows and columns from a database. These tables mainly contain two types of
structured data:
1. Numerical Data: Data that is expressed on a numerical scale. It takes two forms:
Continuous: Data that can take any value in an interval. For example, the speed
of a car, heart rate, etc.
Discrete: Data that can take only integer values, such as counts. For example, the
number of heads in 20 flips of a coin.
2. Categorical Data: Data that can take only a specific set of values representing
possible categories. Also called enumerated (enum), factor, or nominal data.
Binary: A special case of categorical data where the feature is dichotomous, i.e. it
can take only 0/1 or True/False.
Ordinal: Categorical data that has an explicit ordering. For example, the five-star
rating of a restaurant (1, 2, 3, 4, 5).
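The distinctions above can be sketched in Python. The helper guess_kind below is a hypothetical heuristic written for this example, not a standard function, and the sample columns are made up. Ordinal data cannot be detected from the values alone, so its ordering is declared explicitly.

```python
def guess_kind(values):
    """Rough guess at the kind of a column from its values (a heuristic only)."""
    distinct = set(values)
    if distinct <= {0, 1}:                       # also matches True/False
        return "binary"
    if all(isinstance(v, int) for v in values):
        return "discrete"
    if all(isinstance(v, (int, float)) for v in values):
        return "continuous"
    return "categorical"


speeds = [61.5, 48.2, 72.0]          # speed of a car
heads = [9, 11, 10]                  # heads in 20 coin flips
passed = [True, False, True]         # yes/no feature
colours = ["red", "green", "red"]    # unordered categories

for name, column in [("speeds", speeds), ("heads", heads),
                     ("passed", passed), ("colours", colours)]:
    print(name, "->", guess_kind(column))

# Ordinal: the order is part of the definition, so we state it explicitly.
size_order = ["small", "medium", "large"]
observed = ["large", "small", "medium"]
ranked = sorted(observed, key=size_order.index)
print(ranked)
```

In practice the analyst usually assigns the type from knowledge of the data rather than by inspection, since a column of integers may be a count, a rating, or an arbitrary code.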
The next step is to look more closely at structured data and at how third-party
packages and libraries can be used to manipulate it. There are two main types of
structures or data storage models:
Rectangular
Non-Rectangular
Rectangular Data
Most analyses in data science are done with a rectangular, two-dimensional data
object such as a dataframe, spreadsheet, CSV file, or database table. It consists
of rows that represent records (observations) and columns that represent features
(variables).
Data frame: A rectangular data structure (like a spreadsheet) for efficient manipulation
and for applying statistical and machine learning models.
Feature: A column within a dataframe is commonly referred to as a feature.
Synonyms: attribute, input, predictor, variable.
Outcome: Many data science projects involve predicting an outcome, often a
yes/no outcome. Synonyms: dependent variable, response, target, output.
Record: A row within a dataframe is commonly referred to as a record.
Synonyms: case, example, instance, observation, pattern, sample.
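To make the vocabulary concrete, here is a small sketch that reads a made-up CSV table with Python's standard csv module and separates features from the outcome. The column names and values (including the choice of "churned" as the outcome) are assumptions for illustration only.

```python
import csv
import io

# Hypothetical CSV: each row is a record, each column a feature;
# "churned" plays the role of the yes/no outcome.
csv_text = """age,plan,monthly_spend,churned
34,basic,20.5,no
51,premium,64.0,yes
27,basic,18.0,no
"""

reader = csv.DictReader(io.StringIO(csv_text))
records = list(reader)                      # one dict per record (row)

outcome_name = "churned"
feature_names = [name for name in reader.fieldnames if name != outcome_name]

# Split the rectangle into features (X) and outcome (y).
X = [[record[name] for name in feature_names] for record in records]
y = [record[outcome_name] for record in records]

print(len(records), "records,", len(feature_names), "features")
print("outcome values:", y)
```

Note that csv.DictReader returns every value as a string; converting columns to numeric types is a separate step.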
Non-rectangular Data
Besides rectangular data, there are several other data structures that come under
the umbrella of non-rectangular data.
Classification Of Data:
1. Based on Observation:
Cross-sectional Data: Cross-sectional data is collected in a single time period and is
characterized by individual units such as people, companies, or countries. Some examples
include:
Student grades at the end of the current semester;
Household data of the previous year - expenditure on food, unemployment, income,
etc.
Time Series Data: Data collected at a number of specific points in time is
called time series data. Examples include stock prices, interest rates,
exchange rates, product prices, and GDP.
2. Based on Measurement Scale:
Ordinal: If the values of a variable follow a particular order, the variable is
ordinal. A lower value ranks below a higher value, although the gaps between
values need not be equal.
Interval: In interval data, 0 has no true meaning; it is simply another valid value.
Temperature is the classic example: on the Celsius scale, 0 does not mean no
temperature.
Ratio: If 0 has a true meaning, the data is of ratio type. For example, for length or
income, a value of 0 means no length or no income.
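A short worked example, with made-up numbers, of why the meaning of 0 matters: on an interval scale differences are meaningful but ratios are not, while on a ratio scale both are. The Kelvin conversion is added here only to show what the ratio looks like on a scale with a true zero.

```python
# Interval scale (Celsius): differences are meaningful, ratios are not.
celsius_a, celsius_b = 10.0, 20.0
difference = celsius_b - celsius_a      # 10 degrees warmer: meaningful
naive_ratio = celsius_b / celsius_a     # 2.0, but 20 C is not "twice as hot"

# Kelvin has a true zero, so the ratio of the same two temperatures is meaningful.
kelvin_ratio = (celsius_b + 273.15) / (celsius_a + 273.15)

# Ratio scale (income): zero means "no income", so ratios are meaningful.
income_a, income_b = 30000, 60000
income_ratio = income_b / income_a      # genuinely twice the income

print(f"difference={difference}, naive_ratio={naive_ratio}")
print(f"kelvin_ratio={kelvin_ratio:.4f}, income_ratio={income_ratio}")
```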
3. Based on Availability:
Primary Sources: These are records of events or evidence as they were first
described or as they actually happened, without any interpretation or
commentary. Examples: theses, dissertations, scholarly journal articles
(research based).
Tertiary Sources: These are sources that index, abstract, organize, compile, or
digest other sources. Examples: dictionaries and encyclopedias (which may also
be secondary), Wikipedia, bibliographies.
4. Based on Structural Form:
Structured
Unstructured
Semi-structured
5. Based on Inherent
Quantitative Analysis:
Data are expressed in numbers and analyzed with statistical methods. They are concise
and measurable, and they help bring the research to conclusions.
Qualitative Analysis:
Data are based on properties or characteristics. They are exploratory and may
lead to further evaluation.
Concept Of Sample Data & Population:
A population is the entire group that you want to draw conclusions about.
A sample is the specific group that you will collect data from. The size of the sample is
always less than the total size of the population.
Collecting data from a population
Populations are used when your research question requires, or when you have access to,
data from every member of the population.
Collecting data from a sample
Example: You want to study political attitudes in young people. Your population
is the 300,000 undergraduate students in the Netherlands. Because it is not
practical to collect data from all of them, you use a sample of 300 undergraduate
volunteers from three Dutch universities. This is the group who will complete
your online survey.
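The sizes in this example can be reproduced with a short sketch. Note the difference in technique: the example above relies on volunteers, whereas the code below draws a simple random sample, in which every member of the population is equally likely to be chosen. The integer IDs are hypothetical stand-ins for students.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

population_size = 300_000   # undergraduates in the example
sample_size = 300

# Hypothetical student IDs standing in for the population.
population = range(population_size)

# Simple random sample without replacement.
sample = random.sample(population, sample_size)

sampling_fraction = sample_size / population_size
print(len(sample), "students sampled;", f"fraction = {sampling_fraction:.3%}")
```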
Reasons for sampling
Necessity: Sometimes it’s simply not possible to study the whole population due to
its size or inaccessibility.
Practicality: It’s easier and more efficient to collect data from a sample.
Cost-effectiveness: There are fewer participant, laboratory, equipment, and
researcher costs involved.
Manageability: Storing and running statistical analyses on smaller datasets is
easier and more reliable.
Population parameter vs. sample statistic
When you collect data from a population or a sample, there are various measurements
and numbers you can calculate from the data. A parameter is a measure that
describes the whole population. A statistic is a measure that describes the sample.
You can use estimation or hypothesis testing to estimate how likely it is that a sample
statistic differs from the population parameter.
Sampling error
A sampling error is the difference between a population parameter and a sample
statistic.
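The following sketch illustrates a parameter, a statistic, and the sampling error between them, using the mean as the measure. The population is synthetic (random values generated for this example), and the sign convention, statistic minus parameter, is an assumption.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Synthetic population of 10,000 values (assumed, for illustration only).
population = [random.gauss(50, 10) for _ in range(10_000)]

# Simple random sample of 100 values from that population.
sample = random.sample(population, 100)

parameter = statistics.mean(population)   # describes the whole population
statistic = statistics.mean(sample)       # describes the sample only
sampling_error = statistic - parameter

print(f"parameter = {parameter:.2f}")
print(f"statistic = {statistic:.2f}")
print(f"sampling error = {sampling_error:.2f}")
```

In real studies the parameter is usually unknown, so the sampling error cannot be computed directly and has to be estimated.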
Statistics & Its Types:
Statistics, in its simplest sense, means numerical data. It is also the field of
mathematics that deals with the collection, tabulation, and interpretation of
numerical data.
1. Descriptive Statistics:
Descriptive statistics describes a population or dataset through numerical
calculations, graphs, or tables. It summarizes the data at hand, either numerically
or graphically.
2. Inferential Statistics:
Inferential statistics makes inferences and predictions about a population based on a
sample of data taken from that population. It generalizes from the sample to the
larger population and uses probability to draw conclusions.
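The two types can be contrasted on one small dataset. The sketch below first summarizes a hypothetical sample of exam scores (descriptive), then estimates the population mean with a 95% confidence interval (inferential). The scores are invented, and the interval uses a normal approximation as a simplifying assumption; for a sample this small a t-based interval would normally be preferred.

```python
import math
import statistics

# Hypothetical sample of 12 exam scores.
sample = [62, 71, 58, 90, 77, 66, 84, 73, 69, 80, 75, 68]

# Descriptive statistics: summarize the data we actually have.
n = len(sample)
mean = statistics.mean(sample)
median = statistics.median(sample)
stdev = statistics.stdev(sample)   # sample standard deviation (n - 1 denominator)

# Inferential statistics: estimate the population mean from the sample.
# Normal-approximation 95% confidence interval (an assumption of this sketch).
z = statistics.NormalDist().inv_cdf(0.975)
margin = z * stdev / math.sqrt(n)
ci_low, ci_high = mean - margin, mean + margin

print(f"n={n}, mean={mean}, median={median}, stdev={stdev:.2f}")
print(f"95% CI for the population mean: ({ci_low:.2f}, {ci_high:.2f})")
```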