Data Science
Line Charts
Line charts are one of the most commonly used charts for comparing two data sets.
Use line charts when the number of data points is high, and you want to show a
trend in the data over time.
Use cases for line charts:
A company’s quarterly sales for the past five years.
The number of customers per week in the first year of a new retail shop.
Changes in a stock’s price from opening to the closing bell.
Best practices for line charts:
Label the axes and the reference lines used to measure the graph
coordinates. It is common to plot time on the x-axis (horizontal) and the data
values on the y-axis (vertical).
Use a solid line to connect the data points to illustrate trends.
Keep the number of plotted lines to a minimum, typically no more than 5, so
the chart does not become cluttered and difficult to read.
Add a legend, a small visual representation of the chart’s data, that tells what
each line represents to help your audience understand what they are viewing.
Always add a title.
Column Charts
Column charts are positioned vertically, as shown in the figure. They are probably
the most common chart type used to display the numerical value of a specific data
point and compare that value across similar categories. They allow for easy
comparison among several data points.
Use cases for column charts:
Revenue by country, as shown in the chart example.
Last year’s sales for the top four car companies in the US.
Average student test scores for each of six math classes.
Best practices for column charts:
Label the axes.
If the chart shows changes over time, plot the time increments on the x-axis.
If time is not part of the data, consider ordering the column heights to ascend
or descend to demonstrate changes or trends.
Keep the number of columns low, typically no more than 7, so the viewer can
see the value for each column.
Start the value of the y-axis at zero to accurately reflect the column's total
value.
The spacing between columns should be roughly half the width of a column.
Bar Charts
Bar charts are similar to column charts, except the data is horizontally displayed.
Bar charts also allow for easy comparison between several data points. The data
point labels on the horizontal bar chart are on the left side and are more readable
when the label contains text rather than values.
Use cases for bar charts:
Gross domestic product (GDP) of the 25 highest-grossing nations.
The number of cars at a dealership sold by each sales representative.
Exam scores for each student in a math class.
Best practices for bar charts:
Label the axes.
Consider ordering the bars so that the lengths go from longest to shortest.
The data type will most likely determine whether the longest bar should be on
the bottom or the top to best illustrate the intended pattern or trend.
Start the value of the x-axis at zero to accurately reflect the total value of the
bars.
The spacing between bars should be roughly half the width of a bar.
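The bar chart practices above can be sketched in plain Python without a plotting library. The function name `render_bar_chart` and the sample sales figures are hypothetical and chosen only for illustration. The sketch orders the bars from longest to shortest and scales every bar from zero.

```python
def render_bar_chart(values, width=40):
    """Return text lines for a bar chart, longest bar first.

    `values` maps a category label to a non-negative number.
    """
    if not values:
        return []
    # Order the bars from longest to shortest.
    ordered = sorted(values.items(), key=lambda item: item[1], reverse=True)
    largest = ordered[0][1]
    label_width = max(len(label) for label in values)
    lines = []
    for label, value in ordered:
        # Scale from zero so the bar length is proportional to the value.
        bar_length = round(width * value / largest) if largest else 0
        lines.append(f"{label.ljust(label_width)} | {'#' * bar_length} {value}")
    return lines


# Hypothetical data: cars sold by each sales representative.
cars_sold = {"Avery": 12, "Blake": 30, "Casey": 21}
chart_lines = render_bar_chart(cars_sold, width=30)
print("\n".join(chart_lines))
```

Placing the text labels on the left mirrors the layout described for horizontal bar charts, where long category names stay readable.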
Pie Charts
Pie charts show parts of a whole. Each slice, or segment, of the “pie” represents a
percentage of the total. The segments must sum to 100%. A
pie chart displays the different values of a given variable. Use cases for
pie charts include:
Annual expenses categories for a corporation (e.g., rent, administrative,
utilities, production)
A country’s energy sources (e.g., oil, coal, gas, solar, wind)
Survey results for favorite type of movie (e.g., action, romance, comedy,
drama, science fiction)
Some best practices for pie charts include:
Keep the number of categories minimal so the viewer can differentiate
between segments. Beyond ten segments, the slices begin to lose meaning
and impact. If necessary, consolidate smaller segments into one segment with
a label such as “Other” or “Miscellaneous”.
Use a different color or darkness of grayscale for each segment.
Order the segments according to size.
Make sure the values of all segments sum to 100%.
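As a rough sketch of the pie chart practices above, the following Python function orders segments by size, merges the smallest ones into a single segment, and converts the values to percentages. The function name `pie_segments`, its parameters, and the expense figures are assumptions made for illustration.

```python
def pie_segments(values, max_segments=10, other_label="Other"):
    """Return (label, percent) pairs ordered by size, merging small items."""
    total = sum(values.values())
    if total <= 0:
        raise ValueError("values must sum to a positive number")
    # Order the segments from largest to smallest.
    ordered = sorted(values.items(), key=lambda item: item[1], reverse=True)
    if len(ordered) > max_segments:
        # Keep the largest segments and merge the rest into one segment.
        kept = ordered[: max_segments - 1]
        merged = sum(value for _, value in ordered[max_segments - 1:])
        ordered = kept + [(other_label, merged)]
    return [(label, 100 * value / total) for label, value in ordered]


# Hypothetical annual expense categories (arbitrary units).
expenses = {"Rent": 50, "Production": 30, "Administrative": 12,
            "Utilities": 5, "Travel": 3}
segments = pie_segments(expenses, max_segments=4)
for label, percent in segments:
    print(f"{label}: {percent:.1f}%")
```

Because every percentage is computed against the same total, the segments sum to 100% regardless of how many small categories are merged.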
Scatter Plots
Scatter plots are very popular for correlation visualizations or when you want to
show the distribution, or all possible values, of a large number of data points. Scatter
plots are also useful for demonstrating clustering or identifying outliers in the data.
Use cases for scatter plots include:
Comparing countries’ life expectancies to their GDPs (Gross Domestic
Product).
Comparing the daily sales of ice cream to the average outside temperature
across multiple days.
Comparing the weight to the height of each person in a large group.
Some best practices for scatter plots include:
Label your axes.
Make sure the data set is large enough to provide visualization for clustering
or outliers.
Start the value of the y-axis at zero to represent the data accurately. The
value of the x-axis will depend on the data. For example, age ranges might be
labeled on the x-axis.
Consider adding a trend line if a scatter plot shows a correlation between x-
and y-axes.
Do not use more than two trend lines.
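A trend line for a scatter plot is commonly fitted with ordinary least squares. The sketch below assumes that method, since the text does not name one. The function name `trend_line` and the temperature and sales figures are hypothetical. The function also returns the Pearson correlation coefficient `r`, which indicates how strongly the two variables move together.

```python
from math import sqrt


def trend_line(xs, ys):
    """Return (slope, intercept, r) for a least-squares line through the points."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long lists with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross products.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y values must each vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


# Hypothetical data: average outside temperature and daily ice cream sales.
temperatures = [18, 21, 24, 27, 30]
ice_cream_sales = [37, 43, 49, 55, 61]
slope, intercept, r = trend_line(temperatures, ice_cream_sales)
print(f"sales = {slope:.2f} * temperature + {intercept:.2f} (r = {r:.2f})")
```

A value of `r` near 1 or -1 suggests a trend line is worth drawing, while a value near 0 suggests the plot shows no linear correlation.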
Discrete Data
Discrete data is data collected by counting. It is whole, concrete numbers. Discrete
data typically involves counting rather than measuring and is often prefixed with “the
number of”. The number of customers who bought bicycles, the number of
employees in each department, and the amount of diesel fuel purchased each week
for a delivery truck are discrete data.
Continuous Data
Continuous data is data collected by measuring. It can take any value within a
range, including fractional values, and it often fluctuates over time. The
temperature inside the store, the speed of a rider in a race, and the
distance traveled by a moving bicycle are continuous data.
The number of employees in each department is discrete data because it has
a limited number of possible values.
The temperature inside the store is continuous data because temperature
can be in an infinite range of values.
The distance traveled by a moving bicycle is continuous data because
distance can be in an infinite range of values.
The number of customers who bought bicycles is discrete data because it
has a limited number of possible values.
The speed of a rider in a bike race is continuous data because speed
can be in an infinite range of values.
The amount of diesel fuel purchased each week for a delivery truck
is discrete data because it has a limited number of possible values.
Each column (or field) in the Employee Information table has data that is all the same type.
The next table describes the various data types. Read the descriptions for each data type.
Compare them to the data displayed in the Employee Information table and think about
which type each field is (for example, Job Title contains string data).
Data Types
Data Type: Description
String: Data that is treated as text. It is composed of letters, numbers that are not used in computation, and symbols such as punctuation. String data also includes white space, or the spaces used to separate and format text. Examples of string data are “hello world” and “Building 153”.
Integer: Whole numbers, or numbers that do not include decimals or fractions. One use of integers is to order or rank things. Another is for counts and basic quantities. Examples of integers include 0, 1, 2, 3, and 10,546.
Floating point: Numbers with decimal places. These numbers are frequently employed in statistical analysis. Examples of floating-point numbers include 0.0003, 1.2, and -3.67.
Date and time: Stores an instant in time that is expressed as a calendar date and time of day. Date and time data is important in recording when an observation in a data set is made. Date and time formats can vary between data sources. Examples of date and time data formats include YYYY-MM-DD such as 2022-08-15, and YYYY-MM-DD hh:mm:ss such as 2022-01-01 19:24:05.
Boolean: Data that is treated as either True or False. Typically, True and False are capitalized to represent a boolean instead of a string. Boolean values can also be represented as “Yes” or “1” (for True) and “No” or “0” (for False). Examples of boolean expressions are “15 is greater than 30” = False and “User John Smith has a membership account” = Yes.
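The data types described above map onto built-in Python types, as the short sketch below suggests. The helper `parse_boolean` is a hypothetical function written for this example, and the sample values are taken from the descriptions above.

```python
from datetime import datetime


def parse_boolean(text):
    """Map common spellings of True and False to a Python bool."""
    normalized = text.strip().lower()
    if normalized in {"true", "yes", "1"}:
        return True
    if normalized in {"false", "no", "0"}:
        return False
    raise ValueError(f"not a boolean value: {text!r}")


building = "Building 153"   # string: the digits are text, not a quantity
count = int("10546")        # integer: a whole number
reading = float("-3.67")    # floating point: a number with decimal places
# Date and time in the YYYY-MM-DD hh:mm:ss format.
observed_at = datetime.strptime("2022-01-01 19:24:05", "%Y-%m-%d %H:%M:%S")
is_member = parse_boolean("Yes")   # boolean written as "Yes"
comparison = 15 > 30               # boolean expression

print(type(building).__name__, type(count).__name__, type(reading).__name__)
print(observed_at.isoformat(), is_member, comparison)
```

The type determines which operations make sense: `count + 1` is arithmetic, while `building + "1"` only appends a character to the text.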
Employee Information
Refer to the Employee Information table below. You are tasked with defining the
correct data types that can be stored in each of the columns.
Name     Employee ID  Employment Date  Job Title          Base City          Average Weekly Hours  Contract  Yearly Vacation Days  Gym Membership
Helen    100200       2010-05-05       PR Manager         Chicago, IL        38.0                  No        20                    Yes
Bob      100289       2008-03-01       Sales              Berlin, DE         40.0                  No        22                    No
Cynthia  500788       2010-01-10       Software Engineer  Beijing, CN        55.75                 No        17                    No
Jordan   100305       2006-11-22       Data Analyst       Los Angeles, CA    40.0                  No        18                    No
Alex     100819       2011-09-05       Software Engineer  New York City, NY  54.1                  Yes       20                    Yes
For each column name in the Employee Information table, select the correct data
type. Each data type can be used more than once.
In order to process, store, and analyze all of these different types of data, it is
important to think about whether they are structured data or unstructured data.
Structured Data
Structured data makes up about 10%-20% of generated data and has clearly defined
data types and patterns that make it easy to store and organize into columns
and rows. This organization makes structured data easy to search and analyze.
Sources of structured data include sales records, airline reservation systems, and
inventory control. Structured data is usually stored in relational databases such as
Structured Query Language (SQL) databases or in spreadsheets such as Microsoft
Excel.
Unstructured Data
Unstructured data makes up most generated data, about 80%, and cannot be
organized into rows and columns. This makes unstructured data difficult to search,
manage, and analyze. Sources of unstructured data include images, PDFs, sensor
data, and social media posts. Unstructured data is usually stored in a non-relational
database, also known as a NoSQL database.
1.2.9 Practice Item - Selecting Relevant Data
Employee Information
Refer to the Employee Information table below. One of Data Crunchers' clients is a
large national firm with multiple branch office locations. You are gathering data to
help prepare a report regarding general workforce wellness. The report will be
presented at the yearly meeting with branch managers and needs to have the
wellness data broken down by branch office.
Note: Base City has been changed to Branch Office.
Name     Employee ID  Employment Date  Job Title          Branch Office      Average Weekly Hours  Contract  Yearly Vacation Days  Gym Membership
Helen    100200       2010-05-05       PR Manager         Chicago, IL        38.0                  No        20                    Yes
Bob      100289       2008-03-01       Sales              Berlin, DE         40.0                  No        22                    No
Cynthia  500788       2010-01-10       Software Engineer  Beijing, CN        55.75                 No        17                    No
Jordan   100305       2006-11-22       Data Analyst       Los Angeles, CA    40.0                  No        18                    No
Alex     100819       2011-09-05       Software Engineer  New York City, NY  54.1                  Yes       20                    Yes
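One possible way to break the data down by branch office is sketched below in Python. Treating Average Weekly Hours, Yearly Vacation Days, and Gym Membership as the wellness-related columns is an assumption made for illustration, as are the function name `summarize_by_branch` and the dictionary keys. The values are copied from the Employee Information table.

```python
from collections import defaultdict

# Only the columns assumed to be relevant to workforce wellness are kept.
employees = [
    {"branch_office": "Chicago, IL", "weekly_hours": 38.0,
     "vacation_days": 20, "gym_membership": True},
    {"branch_office": "Berlin, DE", "weekly_hours": 40.0,
     "vacation_days": 22, "gym_membership": False},
    {"branch_office": "Beijing, CN", "weekly_hours": 55.75,
     "vacation_days": 17, "gym_membership": False},
    {"branch_office": "Los Angeles, CA", "weekly_hours": 40.0,
     "vacation_days": 18, "gym_membership": False},
    {"branch_office": "New York City, NY", "weekly_hours": 54.1,
     "vacation_days": 20, "gym_membership": True},
]


def summarize_by_branch(rows):
    """Group rows by branch office and average the wellness columns."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["branch_office"]].append(row)
    summary = {}
    for branch, members in grouped.items():
        count = len(members)
        summary[branch] = {
            "employees": count,
            "average_weekly_hours": sum(m["weekly_hours"] for m in members) / count,
            "average_vacation_days": sum(m["vacation_days"] for m in members) / count,
            # True counts as 1 and False as 0, so this is the share of members.
            "gym_membership_rate": sum(m["gym_membership"] for m in members) / count,
        }
    return summary


summary = summarize_by_branch(employees)
for branch, stats in summary.items():
    print(branch, stats)
```

With one employee per branch in the sample table, each average equals that employee's value. The grouping only becomes informative on a larger data set.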
A car dealership hired a data analyst to analyze trends to increase new car sales. The
analysis recommended that the dealership purchase newspaper and television advertising
during specific time periods to target new buyers. Which two analysis trends would support
this strategy? (Choose two.)
The average number of years that customers kept their cars before purchasing a new one.
The average amount of time (in months) that elapses before customers purchase oil change
services.
1.3.3 Video - Humanitarian Insights from Data Analytics
Businesses are not the only beneficiaries of the explosion in data and analytics.
Data is created through many modern daily activities. Organizations gather this data
and apply data analytics to inform practical business applications. There are three
main types of data: observed data, volunteered data, and inferred data. The correct
data visualization can intuitively present complex data patterns and trends.
The main factors to consider when choosing a visualization are:
The number of variables to be shown.
The number of data points, or units of information, in each variable.
Whether the data illustrates changes over time.
The need to make a comparison or correlation between different groups of
data points.
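The factors above can be read as a simple decision procedure. The Python sketch below is a rough heuristic based on the chart descriptions in these notes, not a definitive rule. The function name `suggest_chart`, its parameters, and the order in which the questions are checked are all assumptions.

```python
def suggest_chart(over_time=False, parts_of_whole=False,
                  two_numeric_variables=False, long_text_labels=False):
    """Suggest a chart type from a few yes/no answers about the data."""
    if two_numeric_variables:
        # Correlation, clustering, or outliers across many data points.
        return "scatter plot"
    if parts_of_whole:
        # Segments that sum to 100% of a total.
        return "pie chart"
    if over_time:
        # A trend across many data points in time order.
        return "line chart"
    if long_text_labels:
        # Horizontal bars keep text labels readable.
        return "bar chart"
    # Default: compare values across a small number of categories.
    return "column chart"


print(suggest_chart(over_time=True))              # quarterly sales over five years
print(suggest_chart(two_numeric_variables=True))  # height compared to weight
print(suggest_chart(parts_of_whole=True))         # a country's energy sources
```

Real data sets often satisfy more than one condition, so the result is a starting point rather than a final choice.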
Topic Objective: Compare and contrast the different types of data.
Data analysis begins with understanding the types of data. Data is defined as
either quantitative or qualitative. Quantitative data is divided into discrete and continuous
data. The data type tells a system how to interpret the data’s value, so the system
can perform operations to transform and use the data in computations.
Data types include:
String
Integer
Floating point
Date and time
Boolean
The data should be categorized as structured or unstructured before it is processed,
stored, and analyzed. After data is defined and selected, it becomes relevant in
determining the questions to be answered.
Topic Objective: Evaluate the value gained through analytics.
Data science enables businesses to better understand the impact of their products
and services, adjust their methods and goals, and provide their customers with better
products faster. Trend analysis is one way to gain insights into key performance
indicators (KPIs) over time.
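One simple form of trend analysis is the percent change of a KPI from one period to the next. The sketch below assumes that approach. The function name `period_over_period_change` and the quarterly sales figures are hypothetical.

```python
def period_over_period_change(values):
    """Return the percent change between each consecutive pair of values."""
    changes = []
    for previous, current in zip(values, values[1:]):
        if previous == 0:
            raise ValueError("cannot compute a percent change from zero")
        changes.append(100 * (current - previous) / previous)
    return changes


# Hypothetical KPI: sales for four consecutive quarters.
quarterly_sales = [200, 220, 198, 297]
changes = period_over_period_change(quarterly_sales)
for quarter, change in enumerate(changes, start=2):
    print(f"Q{quarter}: {change:+.1f}% versus the previous quarter")
```

A run of changes with the same sign suggests a sustained trend, while alternating signs, as in this sample, suggest fluctuation that may need a longer time window to interpret.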
Humanitarian organizations use data to serve their communities and the world.
Predictive analytics can focus humanitarian efforts on preventive rather than reactive
actions. Environmental agencies track climate change data through observations
which enable predictions of societal impact.
1.4.2 Reflection
The world population continues to climb. The need to feed that growing population
makes the agricultural sector more dependent on data than ever before. Crop
farmers use data on weather, soil yield, and supply and demand to choose
which crops to grow. These analytics impact what to plant, when to plant, when to
harvest, and when to sell. For livestock farmers, data analytics drives breeding rates,
land management, purchasing of hay and grain, and determining the sales market.
The farmer's role may have stayed relatively the same over the decades, but
the approach has drastically changed due to data analytics. The increased need for
a sustainable and consistent food supply raises new questions. Questions for
reflection include:
What percentage of land should be dedicated to farming versus communities?
What other data should be gathered to ensure a good food source for
generations while not negatively impacting the environment?
Data analytics will continue to play a significant role in improving farming methods for
future generations.