Document (9)

Chapter 7 discusses data processing, focusing on structured data, its types, and representation methods such as graphs, including line graphs, bar charts, pie charts, and scatter plots. It explains key concepts like correlation, levels of measurement, data matrices, and statistical measures such as mean, median, mode, variance, and standard deviation. The chapter emphasizes the importance of data visualization and statistical analysis in understanding and interpreting datasets.

Uploaded by

shub6328
Copyright
© All Rights Reserved
Chapter 7

Data processing
Notes
Structured Data
 Structured data is formatted and has a predefined data type and arrangement.
This means the data is organized in a specific way, making it easy to search, modify,
and manipulate.
 Examples of structured data found in everyday life are names, dates, addresses,
credit card numbers, and stock information.
 Common sources of structured data include Excel files, RDBMS (Databases), and
online forms.
Types of Structured Data
 Date and Time Data Type: Used to store date and time. There are various formats
for storing date and time. Some examples of date formats include dd-mm-yyyy, dd-
mon-yyyy, mm-dd-yy, and yyyy-mm-dd. Time can be represented in formats such as
HH:MM:SS or HH:MM AM/PM.
 String Data Type: A sequence of characters that can include letters (A-Z and a-z),
digits (0-9), and special characters such as spaces, @, *, and ?. String data is
enclosed within double or single quotes. For example, 'I stay at 23, Pocket Z, Chandni
Chowk'.
 Categorical Data: This data type groups data into two or more categories. For
example, types of pizza ordered or gender (0 for women and 1 for men). Categorical
data can also take numerical values that do not have any mathematical meaning. The
easiest way to verify if the data is categorical is to ask whether an average of the
values would be meaningful. If it would not, the data is categorical.
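The date and time formats listed above can be read with Python's standard datetime module. The sketch below is only an illustration: the sample values and the names DATE_PATTERNS and parse_date are assumptions, not part of the notes, and the month-name and AM/PM patterns assume an English locale.

```python
from datetime import datetime

# Map each date layout named in the notes to a strptime pattern.
DATE_PATTERNS = {
    "dd-mm-yyyy": "%d-%m-%Y",
    "dd-mon-yyyy": "%d-%b-%Y",   # %b = abbreviated month name (locale dependent)
    "mm-dd-yy": "%m-%d-%y",
    "yyyy-mm-dd": "%Y-%m-%d",
}

def parse_date(text, layout):
    """Parse text written in one of the layouts above into a date object."""
    return datetime.strptime(text, DATE_PATTERNS[layout]).date()

# The same calendar day written four ways (illustrative sample values).
samples = [
    ("15-08-2023", "dd-mm-yyyy"),
    ("15-Aug-2023", "dd-mon-yyyy"),
    ("08-15-23", "mm-dd-yy"),
    ("2023-08-15", "yyyy-mm-dd"),
]
parsed = [parse_date(text, layout) for text, layout in samples]

# Times: 24-hour HH:MM:SS and 12-hour HH:MM AM/PM.
t24 = datetime.strptime("14:30:00", "%H:%M:%S").time()
t12 = datetime.strptime("02:30 PM", "%I:%M %p").time()

for d in parsed:
    print(d.isoformat())
print(t24, t12)
```

Storing the parsed value rather than the original string means the same day compares equal regardless of the format it arrived in.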
Representation of Data
 Data should be presented in a manner that enables the user to interpret the important
aspects of the data with minimal effort.
 Graphs are an effective way to represent data visually.
Line Graphs
 A line graph is a pictorial display of data that changes continuously over time.
 It is especially useful for showing trends (upwards or downwards).
 Line graphs are useful for comparing different datasets and can reveal changes over
both the short and the long term, even when the changes are small.
 The example in the source shows car sales dipping from April, reaching a low point in
June, and then increasing again.
Bar Charts
 A bar chart uses bars to compare data between categories. The length of the bar is
directly proportional to the value it represents.
 Bars can be displayed either horizontally or vertically.
 Bar charts provide a graphical representation for quick comparison of quantities in
various categories.
 They help to easily identify relationships.
 Bar charts can show big changes over time.
 The example in the sources shows that in Covid-19 vaccinations, Bihar administered
the maximum number of Dose 1 while West Bengal administered the maximum
number of Dose 2.
Pie Charts
 A pie chart is a circular chart divided into segments or sections, where each division
indicates the relative size, or contribution of every class or category to the total.
 Pie charts are mostly used to visualize facts from a small dataset.
 A pie chart should have no more than seven categories, and zero values cannot
be displayed.
 Pie charts are used to examine elements of a whole and do not display changes over
time.
 Pie charts can condense a large dataset into a visual form.
 They can be visually simpler than other types of graphs.
 They allow the audience to see a data comparison at a glance, allowing for
immediate analysis.
 The example pie chart about books read in a month shows that people like to read
fiction the most.
 To construct a pie chart:
o Find the percentage of each category by dividing each value by the total
value and multiplying by 100.
o Find the central angle for each category using the formula: (Value of the
component / Total value) * 360.
o Draw a circle, draw a horizontal radius, and then draw radii with the
calculated central angles.
 The formula to calculate the angle of a slice in a pie chart is: (Value of the
component / Total value) * 360.
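The percentage and central-angle steps above can be sketched in a few lines of Python. The function name pie_slices and the book counts are made up for illustration; the notes mention a books-read example but give no figures.

```python
def pie_slices(values):
    """Return {category: (percentage, central_angle_in_degrees)}."""
    total = sum(values.values())
    return {
        name: (value / total * 100, value / total * 360)
        for name, value in values.items()
    }

# Hypothetical counts of books read in a month, by genre.
books = {"Fiction": 40, "Non-fiction": 25, "Comics": 20, "Poetry": 15}
slices = pie_slices(books)

for name, (percent, angle) in slices.items():
    print(f"{name}: {percent:.1f}% of the total, {angle:.1f} degrees")
```

The percentages should add up to 100 and the angles to 360, which is a quick check that no category was left out.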
Scatter Plots
 A scatter plot uses Cartesian coordinates to display values of two variables.
 It typically consists of an X-axis (the horizontal axis) and a Y-axis (the vertical axis).
Each dot on the scatter plot signifies one observation from a dataset.
 The position of the dot represents the values of each variable respectively.
 Scatter plots help analyze how one variable relates to another.
 The relationship between the two variables is called a correlation.
 Scatter plots are used with continuous data.
 They show the structure of the whole dataset rather than single observations.
 An example shows the relationship between the number of years of education and
income.
 Each point on the graph represents an ordered pair (years, income).
Correlation in Scatter Plots
 Correlation is the relationship between variables.
 Positive Correlation: Both variables move in the same direction. As one variable
increases, so does the other.
 Negative Correlation: Both variables move in opposite directions. As one variable
increases, the other variable decreases.
 No Correlation: There is no relationship between the two variables.
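One common way to put a number on the direction of a relationship is the Pearson correlation coefficient, which the notes do not name; it is used here only as an illustration. The sketch below computes it by hand. The datasets are made up, and the 0.3 cut-off in describe is an arbitrary assumption, not a standard.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient, between -1 and +1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def describe(r, threshold=0.3):
    """Label the direction of r; the threshold is an arbitrary assumption."""
    if r >= threshold:
        return "positive correlation"
    if r <= -threshold:
        return "negative correlation"
    return "little or no correlation"

# Illustrative (made-up) datasets.
years = [8, 10, 12, 14, 16, 18]
income = [15, 18, 24, 30, 41, 52]      # rises with years of education
absences = [0, 2, 4, 6, 8, 10]
marks = [90, 85, 70, 65, 50, 40]       # falls as absences rise
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 3, 1, 2]                   # no linear pattern

for a, b in [(years, income), (absences, marks), (xs, ys)]:
    r = pearson_r(a, b)
    print(f"r = {r:+.2f}: {describe(r)}")
```

A value near +1 or -1 indicates a strong linear relationship, but it does not by itself show that one variable causes the other.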
Exploring Data
 Data analysts use data visualization and statistical tools to characterize datasets,
known as data exploration.
 Data exploration techniques involve manual analysis and automated data exploration
software.
 These techniques are used to connect variables, examine the structure of the dataset,
and analyze the distribution of data values in order to find patterns and insights.
 The process typically involves: Gathering Data, Cleaning the data, Checking for
patterns, Segmenting data, Analysing segments, and Implementing insights into
the business.
Case, Variables, and Levels of Measurement
 A case refers to the people or items from whom data is collected.
 A variable is a characteristic that is measured and can have multiple values.
 A constant is a characteristic that remains the same for all cases in a study.
 The level of measurement refers to the relationship between values given to the
attributes of a variable. There are four levels of measurement: nominal, ordinal,
interval, and ratio.
o Nominal Measurement: Numerical values represent a unique "name" of the
attribute. The cases may be ordered in any manner. For example, jersey
numbers in cricket are nominal.
o Ordinal Measurement: Attributes can be ordered. The distances or intervals
between attributes are irrelevant. For example, educational qualifications like
secondary, senior secondary, or graduation are ordinal.
o Interval Measurement: The distance between attributes is meaningful. For
example, temperature in degrees Celsius is interval. Because there is no true
zero, ratios of values are not meaningful: 20 °C is not twice as hot as 10 °C.
o Ratio Measurement: There is an absolute zero point that makes sense, so ratios
of values are meaningful. Weight, for example, is a ratio measure.
 Nominal data is 'named' level data, Ordinal data is 'named and ordered' level
data, Interval data is 'named, ordered, and proportional interval' level data and
Ratio data is 'named, ordered, proportional interval and has an absolute zero
point' level data.
Data Matrix and Frequency Tables
 Data is stored in a data matrix that contains rows and columns. Each row contains
details about a case or sample, while columns represent the variable.
 A frequency table shows the number of occurrences of a particular data value in a
dataset.
 A frequency table is created by ordering the collected data values in ascending order
of magnitude and recording their respective frequencies.
 A frequency table can also use class intervals.
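A data matrix and the two kinds of frequency table described above can be sketched with plain Python structures. The student names, marks, class width, and the helper name class_frequencies are illustrative assumptions.

```python
from collections import Counter

# Data matrix: one row per case, one column (key) per variable. Made-up values.
data_matrix = [
    {"student": "A", "marks": 7},
    {"student": "B", "marks": 5},
    {"student": "C", "marks": 7},
    {"student": "D", "marks": 9},
    {"student": "E", "marks": 5},
    {"student": "F", "marks": 7},
    {"student": "G", "marks": 3},
    {"student": "H", "marks": 9},
]

# Pull out one variable (a column) and count how often each value occurs.
marks = [row["marks"] for row in data_matrix]
freq_table = sorted(Counter(marks).items())   # ascending order of value

def class_frequencies(values, start, width, n_classes):
    """Count values falling in each class interval [lower, upper)."""
    table = []
    for k in range(n_classes):
        lower = start + k * width
        upper = lower + width
        count = sum(1 for v in values if lower <= v < upper)
        table.append(((lower, upper), count))
    return table

class_table = class_frequencies(marks, start=0, width=4, n_classes=3)

print(freq_table)
print(class_table)
```

Here each class interval includes its lower limit and excludes its upper limit, which is one common convention; other conventions exist.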
Graphs and Shapes of Distributions
 Data can be summarized through data matrices, frequency tables, or graphs.
 Distributions show the extent, dispersion, variability, and scatter of the data.
 The shape of a distribution describes its number of peaks and its symmetry, tendency
to skew, or uniformity.
 Number of Peaks:
o A unimodal distribution has one peak or mode.
o A bimodal distribution has two peaks or modes.
o A multimodal distribution has three or more peaks.
 Symmetry: A distribution is symmetric if its left half forms a mirror image of its
right half.
 Skewness: Skewness is a measure of a distribution's asymmetry.
o Left-skewed distribution: The tail on the left side is longer than the tail on
the right. The values are more spread out on the left side. The median and
mean are typically less than the mode.
o Right-skewed distribution: The tail on the right side is longer than the tail on
the left. The values are more spread out on the right side. The median and
mean are typically greater than the mode.
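The ordering of mean, median, and mode described above can be checked on small samples. The sketch below uses the sign of (mean - median) as a rough indicator of skew direction; this is a rule of thumb rather than a formal skewness statistic, and the datasets and the name skew_direction are made up for illustration.

```python
from statistics import mean, median, mode

def skew_direction(values):
    """Rule-of-thumb skew direction from comparing the mean and the median."""
    m, md = mean(values), median(values)
    if m > md:
        return "right-skewed"
    if m < md:
        return "left-skewed"
    return "roughly symmetric"

# Made-up samples: a long tail on the right, and its mirror image.
right = [1, 2, 2, 2, 3, 3, 4, 10, 20]
left = [21 - v for v in right]

for name, data in [("right", right), ("left", left)]:
    print(name, "mode:", mode(data), "median:", median(data),
          "mean:", round(mean(data), 2), "->", skew_direction(data))
```

In the right-tailed sample the few large values pull the mean above the median, while the mode stays at the peak; the mirrored sample shows the opposite ordering.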
Mean, Median, and Mode
 Mean is the average, or arithmetic average, of a group of numbers.
o To calculate the mean for individual series, find the sum of all values, then
divide by the total number of items.
o To calculate the mean for discrete series, multiply each value by its frequency,
sum these values and then divide by the sum of frequencies.
o To calculate the mean for a frequency distribution series, find the mid value of
each interval, then multiply by its frequency, sum these values, and divide by
the sum of the frequencies.
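 Median is the middle value in a given list of numbers arranged in ascending or
descending order.
o To find the median for an individual series, arrange the data in ascending
order. If n is odd, the median is the (n+1)/2 th item. If n is even, the median
is the average of the (n/2)th and (n/2 + 1)th values.
o To find the median for a frequency distribution series, find the cumulative
frequencies, locate the class containing the (n/2)th item (the median class), and
use the formula: M = l + ((n/2 - cf) / f) * i, where l is the lower limit of the
median class, n is the total frequency, cf is the cumulative frequency of the
class before the median class, f is the frequency of the median class, and i is
the class width.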
 Mode is the value that appears most often in a given list of numbers.
o To find the mode in an individual series, arrange the data in ascending order
and see which number is repeated the most.
o To find the mode for a discrete series (frequency array), find the value with
the highest frequency.
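The calculations above for the different kinds of series can be sketched as small Python functions. The function names and all sample values are illustrative assumptions; the grouped median follows the formula M = l + ((n/2 - cf) / f) * i.

```python
from collections import Counter

def mean_discrete(values, freqs):
    """Mean of a discrete series: sum(value * frequency) / sum(frequency)."""
    return sum(v * f for v, f in zip(values, freqs)) / sum(freqs)

def mean_grouped(classes, freqs):
    """Mean of a frequency distribution, using each class's mid value."""
    mids = [(lower + upper) / 2 for lower, upper in classes]
    return mean_discrete(mids, freqs)

def median_individual(values):
    """Middle item if n is odd, average of the two middle items if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def median_grouped(classes, freqs):
    """M = l + ((n/2 - cf) / f) * i, applied to the median class."""
    n = sum(freqs)
    cf = 0  # cumulative frequency of the classes before the current one
    for (lower, upper), f in zip(classes, freqs):
        if f > 0 and cf + f >= n / 2:
            return lower + (n / 2 - cf) / f * (upper - lower)
        cf += f

def mode_individual(values):
    """Most frequent value (the first one found if there is a tie)."""
    return Counter(values).most_common(1)[0][0]

# Made-up sample data.
classes = [(0, 10), (10, 20), (20, 30), (30, 40)]
freqs = [5, 8, 12, 5]

print(mean_discrete([1, 2, 3], [2, 3, 5]))
print(mean_grouped(classes, freqs))
print(median_individual([7, 3, 9, 5]))
print(median_grouped(classes, freqs))
print(mode_individual([2, 3, 3, 5, 3, 7]))
```

In the grouped example the total frequency is 30, so the 15th item falls in the 20-30 class, which has 13 items before it and 12 items in it.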
Range, Interquartile Range, and Box Plot
 Range is the difference between the largest and smallest values in a dataset. It is a
straightforward measure of variability.
 Interquartile Range (IQR) is a measure of statistical dispersion. It is the
difference between the upper (Q3) and lower (Q1) quartiles: IQR = Q3 - Q1.
o The median (Q2) corresponds to the median of the whole data set. The lower
quartile (Q1) is the median of the lower half of the dataset. The upper quartile
(Q3) is the median of the upper half of the dataset.
 A box plot is a graphical form used frequently in exploratory data analysis. It is also
called a box and whisker plot. It is used to visualize the distribution of numerical data
and skewness.
 Box plots display the dataset's minimum score, first (lower) quartile, median, third
(upper) quartile, and maximum score.
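The five values a box plot displays, together with the range and IQR, can be computed following the "median of each half" description above. This is a sketch under one assumption the notes do not state: when the number of values is odd, the overall median is left out of both halves. Other quartile conventions give slightly different results. The names and sample data are illustrative.

```python
def middle(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Minimum, Q1, median, Q3, maximum, plus range and IQR."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    # Assumption: for odd n, the median itself is excluded from both halves.
    upper_half = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    q1, q2, q3 = middle(lower_half), middle(ordered), middle(upper_half)
    return {
        "min": ordered[0],
        "Q1": q1,
        "median": q2,
        "Q3": q3,
        "max": ordered[-1],
        "range": ordered[-1] - ordered[0],
        "IQR": q3 - q1,
    }

# Made-up sample data.
summary = five_number_summary([1, 3, 4, 6, 7, 8, 10, 12, 15])
print(summary)
```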
Variance and Standard Deviation
 Variance is defined as the average of the squared deviations from the mean. It
measures how far, on average, the data points in a collection lie from the mean.
 The square root of the variance is the standard deviation.
 Steps for calculating variance and standard deviation:
o Calculate the mean.
o Subtract the mean from each value and square the result.
o Find the sum of the squared differences.
o Divide that sum by the number of values to get the variance.
o Take the square root of the variance to get the standard deviation.
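The steps above translate directly into Python. Because the sum is divided by the number of values, this is the population variance; the sample values and the function names are made up for illustration.

```python
from math import sqrt

def variance(values):
    """Population variance, following the listed steps."""
    n = len(values)
    mean = sum(values) / n                            # calculate the mean
    squared_diffs = [(v - mean) ** 2 for v in values] # subtract and square
    return sum(squared_diffs) / n                     # sum, then divide by n

def standard_deviation(values):
    """Square root of the variance."""
    return sqrt(variance(values))

# Made-up sample data with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print("variance:", variance(data))
print("standard deviation:", standard_deviation(data))
```

For this data the squared differences sum to 32, so the variance is 32 / 8 and the standard deviation is its square root.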
Z-scores
 A statistical measurement known as the Z-score indicates how many standard
deviations a value lies from the mean of a group of values. It is calculated from
the mean and the standard deviation.
 A Z-score of 0 means the data point is equal to the mean. A Z-score of 1.0 means it
is one standard deviation above the mean.
 A positive Z-score indicates that the value is above the mean, while a negative value
indicates it is below the mean.
 The formula to calculate the Z-score is: z = (x - μ) / σ, where x is the value being
evaluated, μ is the mean and σ is the standard deviation.
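The formula z = (x - μ) / σ can be applied with Python's statistics module. The sketch below uses the population standard deviation (pstdev); the sample data and the function name z_score are illustrative assumptions.

```python
from statistics import mean, pstdev

def z_score(x, values):
    """z = (x - mu) / sigma, using the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return (x - mu) / sigma

# Made-up sample data with mean 5 and standard deviation 2.
data = [2, 4, 4, 4, 5, 5, 7, 9]

for x in (9, 5, 3):
    print(f"value {x}: z = {z_score(x, data):+.2f}")
```

A positive result places the value above the mean, a negative result below it, and zero exactly at the mean.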
At a Glance
 Statistical methods are required to understand the data used to train a machine
learning model and to interpret the results of testing different machine learning
models.
 Structured data is formatted and has a predefined data type and arrangement.
 A variable is a characteristic that is measured and can have multiple values.
 The level of measurement refers to the relationship between the values given to the
attributes of a variable.
 In nominal measurement, the numerical values represent a unique "name" of the
attribute.
 In ordinal measurement, attributes can be ordered. The distances or intervals between
attributes are irrelevant.
 In interval measurements, the distance between attributes is important.
 Ratio measurements are based on proportion and can have an absolute zero value.
 The tabular format to present cases and variables used in statistical study is known as
a data matrix.
 The shape of a data distribution is described by the number of peaks, symmetry, and
tendency to skew, or its uniformity.
 A distribution can be unimodal, bimodal, or multimodal.
 A distribution is symmetric if its left half forms a mirror image of its right half.
 A distribution that is not symmetric has values that tend to be more spread out on one
side of the graph than the other.
 In symmetric distributions with a single peak, the median, mean, and mode are the
same.
 The mean is the average, or arithmetic average, of a group of numbers.
 The median is the middle value in a list of numbers arranged in ascending or
descending order.
 The mode is the value that appears most often in a given list of numbers.

MCQ
1. What type of data is formatted and has a predefined data type and arrangement?
o a) Unstructured data
o b) Structured data
o c) Categorical data
o d) String data
2. Which of the following is an example of structured data?
o a) A social media post
o b) A handwritten letter
o c) Names, addresses, and credit card numbers
o d) An audio file
3. Which of the following is NOT a source of structured data?
o a) Excel files
o b) Databases
o c) Online Forms
o d) Social media feeds
4. Which data type is used to store date and time?
o a) String data type
o b) Categorical data
o c) Date and Time data type
o d) Numerical data
5. What is a sequence of characters that include alphabets, numbers, and special
characters called?
o a) Categorical Data
o b) Date and Time Data
o c) String Data
o d) Numerical Data
6. What type of data can be grouped according to variables presented in a certificate,
such as class, gender, or department?
o a) String Data
o b) Numerical Data
o c) Categorical Data
o d) Date and Time Data
7. What is a line graph especially useful for displaying?
o a) Comparisons between categories
o b) Trends that change over time
o c) Parts of a whole
o d) Correlations between variables
8. A bar chart is a graphical tool that uses bars to do what?
o a) Show trends over time
o b) Display parts of a whole
o c) Compare data between categories
o d) Show correlations between variables
9. What is a pie chart most useful for visualizing?
o a) Data that changes over time
o b) Facts from a small dataset
o c) Correlations between variables
o d) Trends over time
10. In a pie chart, what does each division indicate?
o a) Trends over time
o b) The relative size or contribution of each category to the total
o c) Correlations between variables
o d) The number of categories
11. What should the total number of categories in a pie chart ideally not exceed?
o a) 5
o b) 7
o c) 10
o d) 12
12. What is the formula to calculate the central angle for each category in a pie chart?
o a) (Total value / Value of the component) * 360
o b) (Value of the component * Total value) / 360
o c) (Value of the component / Total value) * 100
o d) (Value of the component / Total value) * 360
13. What type of graph uses Cartesian coordinates to display values of two variables?
o a) Bar chart
o b) Pie chart
o c) Line graph
o d) Scatter plot
14. What is the relationship between two variables in a scatter plot called?
o a) Association
o b) Distribution
o c) Correlation
o d) Dispersion
15. In a scatter plot, if both variables move in the same direction, what type of correlation
is it?
o a) Negative correlation
o b) No correlation
o c) Positive correlation
o d) Inverse correlation
16. If one variable increases and the other variable decreases, what type of correlation is
it?
o a) Positive correlation
o b) No correlation
o c) Negative correlation
o d) Direct correlation
17. Which type of data visualization is considered the simplest way to convey facts and is
less stressful for the brain to process than numbers alone?
o a) Tabular data
o b) Numerical data
o c) Graphical representations
o d) String data
18. In data analysis, what is a "case"?
o a) The people or items from whom the data is collected
o b) A characteristic that is measured
o c) A constant in a study
o d) A collection of data points
19. What is a "variable" in data analysis?
o a) A constant in a study
o b) A characteristic that is measured and can have multiple values
o c) A level of measurement
o d) A collection of cases
20. What does nominal measurement mean?
o a) Attributes can be ordered
o b) Distances between attributes are important
o c) Numerical values represent a unique "name" of the attribute
o d) There is an absolute zero point
21. In ordinal measurement, what is the relationship between attributes?
o a) They represent unique names
o b) They can be ordered
o c) Distances between them are important
o d) There is an absolute zero point
22. What is true of interval measurement?
o a) Attributes can be ordered
o b) Distances between attributes are important
o c) There is an absolute zero point
o d) Numerical values represent a unique name
23. What does ratio measurement have that other levels of measurement do not?
o a) Attributes that can be ordered
o b) Distances between attributes that are important
o c) An absolute zero point
o d) Numerical values that represent unique names
24. What is a data matrix?
o a) A visual representation of data
o b) A tabular format to present cases and variables
o c) A statistical calculation
o d) A type of graph
25. What does the "mode" represent in a dataset?
o a) The average value
o b) The middle value
o c) The value that appears most often
o d) The range of the values
Short Questions and Answers
 Q: What are some sources of structured data? A: Structured data can be found
in Excel files, RDBMS (Databases), online forms, and server & website logs.
 Q: What are three types of structured data? A: The three main types of structured
data are date and time, string, and categorical data.
 Q: What are the three types of correlation in scatter plots? A: The three types of
correlation in scatter plots are positive correlation, negative correlation, and no
correlation.
 Q: What are the four levels of measurement? A: The four levels of measurement
are nominal, ordinal, interval, and ratio.
 Q: What are the measures of central tendency? A: The measures of central
tendency are mean, median, and mode.
 Q: What is a z-score? A: A z-score indicates how many standard deviations a value
lies from the mean of a group of values.
 Q: What is skewness? A: Skewness is a measure of a distribution's asymmetry.
 Q: What is a box plot? A: A box plot is a diagram that depicts the distribution of
data and shows minimum score, lower quartile, median, upper quartile, and
maximum scores.
