CD101 Fundamental of Data Science

Data Science Tutorial for Beginners

Data science has become one of the most in-demand jobs of the 21st century. Every
organization is looking for candidates with knowledge of data science. This
tutorial gives an introduction to data science, along with data science job
roles, tools, components, applications, etc.

So let's start,

What is Data Science?


Data science is the deep study of massive amounts of data. It involves
extracting meaningful insights from raw, structured, and unstructured data
using the scientific method, different technologies, and algorithms.
It is a multidisciplinary field that uses tools and techniques to manipulate
data so that you can find something new and meaningful.

Data science uses powerful hardware, programming systems, and efficient
algorithms to solve data-related problems. It is the future of artificial
intelligence.

In short, we can say that data science is all about:

o Asking the correct questions and analyzing the raw data.
o Modeling the data using various complex and efficient algorithms.
o Visualizing the data to get a better perspective.
o Understanding the data to make better decisions and find the final result.

Example:
Suppose we want to travel from station A to station B by car. We need to
make some decisions, such as which route will get us to the location fastest,
which route will have no traffic jam, and which will be cost-effective. All
these decision factors act as input data, and working out an appropriate
answer from them is data analysis, which is a part of data science.

Need for Data Science:


Some years ago, data was scarce and mostly available in a structured form,
which could easily be stored in Excel sheets and processed using BI tools.

But in today's world, data has become so vast that approximately 2.5
quintillion bytes of data are generated every day, leading to a data
explosion. Researchers estimated that by 2020, 1.7 MB of data would be
created every single second by every person on earth. Every company requires
data to work, grow, and improve its business.

Handling such a huge amount of data is a challenging task for every
organization. To handle, process, and analyze it, we need complex, powerful,
and efficient algorithms and technology, and that is where data science
comes in. The following are some main reasons for using data science
technology:

o With the help of data science technology, we can convert massive
  amounts of raw and unstructured data into meaningful insights.
o Data science technology is being adopted by various companies, whether big
  brands or startups. Google, Amazon, Netflix, and others, which handle huge
  amounts of data, use data science algorithms for a better customer
  experience.
o Data science is helping to automate transportation, such as creating self-
  driving cars, which are the future of transportation.
o Data science can help with different predictions, such as surveys,
  elections, flight ticket confirmation, etc.

Data science Jobs:


As per various surveys, data scientist is becoming the most in-demand
job of the 21st century due to the increasing demand for data science. Some
people also call it "the hottest job title of the 21st century". Data
scientists are experts who use various statistical tools and machine
learning algorithms to understand and analyze data.

The average salary range for a data scientist is approximately $95,000
to $165,000 per annum, and as per different studies, about 11.5
million jobs will be created by the year 2026.

Types of Data Science Job


If you learn data science, you get the opportunity to find various
exciting job roles in this domain. The main job roles are given below:

1. Data Scientist
2. Data Analyst
3. Machine learning expert
4. Data engineer
5. Data Architect
6. Data Administrator
7. Business Analyst
8. Business Intelligence Manager

Below are explanations of some key data science job titles.

1. Data Analyst:

A data analyst is an individual who mines huge amounts of data, models the
data, and looks for patterns, relationships, trends, and so on. At the end
of the day, he or she produces visualizations and reports used for analyzing
the data, making decisions, and solving problems.

Skill required: To become a data analyst, you need a good background in
mathematics, business intelligence, and data mining, plus basic knowledge
of statistics. You should also be familiar with some computer languages
and tools such as MATLAB, Python, SQL, Hive, Pig, Excel, SAS, R, JS, and
Spark.

2. Machine Learning Expert:

The machine learning expert is the one who works with the various machine
learning algorithms used in data science, such as regression, clustering,
classification, decision trees, random forests, etc.

Skill Required: Computer programming languages such as Python, C++, R, and
Java, plus Hadoop. You should also have an understanding of various
algorithms, problem-solving and analytical skills, probability, and
statistics.

3. Data Engineer:

A data engineer works with massive amounts of data and is responsible for
building and maintaining the data architecture of a data science project.
A data engineer also creates the data set processes used in modeling,
mining, acquisition, and verification.

Skill required: A data engineer must have in-depth knowledge of SQL,
MongoDB, Cassandra, HBase, Apache Spark, Hive, and MapReduce, along with
language knowledge of Python, C/C++, Java, Perl, etc.

4. Data Scientist:

A data scientist is a professional who works with enormous amounts of
data to come up with compelling business insights through the deployment
of various tools, techniques, methodologies, algorithms, etc.

Skill required: To become a data scientist, one should have technical
language skills such as R, SAS, SQL, Python, Hive, Pig, Apache Spark, and
MATLAB. Data scientists must also have an understanding of statistics and
mathematics, plus visualization and communication skills.

Prerequisite for Data Science


Non-Technical Prerequisite:
o Curiosity: To learn data science, one must be curious. When you are
  curious and ask various questions, you can understand the business
  problem easily.
o Critical Thinking: This is also required for a data scientist so that you
  can find multiple new ways to solve a problem efficiently.
o Communication skills: Communication skills are most important for a data
  scientist, because after solving a business problem you need to
  communicate it to the team.

Technical Prerequisite:
o Machine learning: To understand data science, one needs to understand
the concept of machine learning. Data science uses machine learning
algorithms to solve various problems.
o Mathematical modeling: Mathematical modeling is required to make fast
mathematical calculations and predictions from the available data.
o Statistics: Basic understanding of statistics is required, such as mean,
median, or standard deviation. It is needed to extract knowledge and obtain
better results from the data.
o Computer programming: For data science, knowledge of at least one
  programming language is required. R, Python, and Spark are some of the
  programming languages used in data science.
o Databases: A deep understanding of databases, such as SQL, is essential
  for data science, in order to get the data and to work with it.

Difference between BI and Data Science


BI stands for business intelligence, which is also used for the analysis of
business data. Below are some differences between BI and data science:

Criterion    Business Intelligence                  Data Science

Data Source  Deals with structured data,            Deals with structured and unstructured
             e.g., data warehouses.                 data, e.g., weblogs, feedback, etc.

Method       Analytical (historical data)           Scientific (goes deeper to know the
                                                    reason behind the data)

Skills       Statistics and visualization are       Statistics, visualization, and machine
             the two skills required for            learning are the required skills for
             business intelligence.                 data science.

Focus        Focuses on both past and               Focuses on past data, present data,
             present data.                          and also future predictions.

Data Science Components:


The main components of Data Science are given below:

1. Statistics: Statistics is one of the most important components of data
science. Statistics is a way to collect and analyze numerical data in large
amounts and find meaningful insights from it.

2. Domain Expertise: Domain expertise binds data science together. Domain
expertise means specialized knowledge or skill in a particular area. In data
science, there are various areas for which we need domain experts.

3. Data engineering: Data engineering is the part of data science that
involves acquiring, storing, retrieving, and transforming the data. Data
engineering also includes adding metadata (data about data) to the data.

4. Visualization: Data visualization means representing data in a visual
context so that people can easily understand its significance. Data
visualization makes it easy to take in huge amounts of data through visuals.

5. Advanced computing: Advanced computing does the heavy lifting of data
science. It involves designing, writing, debugging, and maintaining the
source code of computer programs.

6. Mathematics: Mathematics is a critical part of data science. Mathematics
involves the study of quantity, structure, space, and change. For a data
scientist, good knowledge of mathematics is essential.

7. Machine learning: Machine learning is the backbone of data science.
Machine learning is all about training a machine so that it can act like a
human brain. In data science, we use various machine learning algorithms to
solve problems.

Tools for Data Science


Following are some tools required for data science:

o Data Analysis tools: R, Python, Statistics, SAS, Jupyter, RStudio,
  MATLAB, Excel, RapidMiner.
o Data Warehousing: ETL, SQL, Hadoop, Informatica/Talend, AWS Redshift.
o Data Visualization tools: R, Jupyter, Tableau, Cognos.
o Machine learning tools: Spark, Mahout, Azure ML Studio.

Machine learning in Data Science


To become a data scientist, one should also be aware of machine learning
and its algorithms, as data science makes broad use of various machine
learning algorithms. The following are some machine learning algorithms
used in data science:

o Regression
o Decision tree
o Clustering
o Principal component analysis
o Support vector machines
o Naive Bayes
o Artificial neural network
o Apriori

Below is a brief introduction to a few of the important algorithms.

1. Linear Regression Algorithm: Linear regression is the most popular
machine learning algorithm based on supervised learning. It performs
regression, which is a method of modeling target values based on
independent variables. It takes the form of a linear equation relating a
set of inputs to the predicted output. This algorithm is mostly used in
forecasting and prediction. Since it models a linear relationship between
the input and output variables, it is called linear regression.

The equation below describes the relationship between the x and y variables:

y = mx + c

Where, y = dependent variable
x = independent variable
m = slope
c = intercept
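As a quick illustration, here is a minimal sketch of fitting y = mx + c with NumPy; the x and y values below are made up for the example:

```python
# Sketch: fitting y = m*x + c with NumPy on small made-up data.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([3, 5, 7, 9, 11], dtype=float)  # lies exactly on y = 2x + 1

m, c = np.polyfit(x, y, 1)  # a degree-1 fit returns slope and intercept
print(m, c)                 # slope close to 2.0, intercept close to 1.0

y_pred = m * 6 + c          # predict the output for a new input x = 6
print(y_pred)
```

Because the sample points lie exactly on a line, the fitted slope and intercept recover the line that generated them.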

2. Decision Tree: The decision tree algorithm is another supervised machine
learning algorithm, and one of the most popular. It can be used for both
classification and regression problems.

In the decision tree algorithm, we solve the problem using a tree
representation in which each node represents a feature, each branch
represents a decision, and each leaf represents an outcome.

Consider, for example, a job-offer problem. In the decision tree, we start
from the root of the tree and compare the value of the root attribute with
the record's attribute. On the basis of this comparison, we follow the
corresponding branch and move to the next node. We continue comparing
values until we reach a leaf node with the predicted class value.
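As a rough sketch, a decision tree for a job-offer-style problem could be built with scikit-learn; the feature columns, values, and labels below are entirely hypothetical:

```python
# Sketch: a decision tree for a made-up "job offer" problem (scikit-learn).
from sklearn.tree import DecisionTreeClassifier

# Each row: [salary_in_thousands, commute_km, provides_cab (1/0)] - hypothetical features
X = [[60, 5, 1], [20, 30, 0], [45, 10, 1], [25, 25, 0], [50, 8, 0], [22, 28, 1]]
y = [1, 0, 1, 0, 1, 0]  # 1 = accept the offer, 0 = decline

model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)  # learns the feature splits (internal nodes and branches)

# Prediction walks from the root down the branches to a leaf.
print(model.predict([[55, 7, 1]]))
```

Here a high-salary, short-commute offer falls on the "accept" side of the learned splits.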

3. K-Means Clustering: K-means clustering is one of the most popular
machine learning algorithms. It belongs to unsupervised learning and solves
the clustering problem.

If we are given a data set of items with certain features and values, and
we need to categorize those items into groups, this type of problem can be
solved using the k-means clustering algorithm.

The k-means clustering algorithm aims at minimizing an objective function,
known as the squared error function, which is given as:

J(V) = Σ (i = 1 to c) Σ (j = 1 to c_i) ( ||x_i − v_j|| )²

Where, J(V) => objective function
||x_i − v_j|| => Euclidean distance between point x_i and cluster center v_j
c_i => number of data points in the i-th cluster
c => number of clusters
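A minimal sketch with scikit-learn, using made-up 2-D points; the model's inertia_ attribute is the squared-error objective described above, summed over all points:

```python
# Sketch: k-means clustering with scikit-learn on made-up 2-D points.
import numpy as np
from sklearn.cluster import KMeans

# Two visually obvious groups: three points near (1, 1), three near (8, 9).
X = np.array([[1, 1], [1.5, 2], [1, 0], [8, 8], [9, 9], [8, 9]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.labels_)           # cluster index assigned to each point
print(km.cluster_centers_)  # the centroids v_j
print(km.inertia_)          # sum of squared Euclidean distances to centroids
```

With well-separated groups like these, the algorithm recovers the two clusters regardless of its random initialization.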

How to solve a problem in Data Science using Machine learning algorithms?

Now, let's understand the most common types of problems that occur in data
science and the approach to solving them. In data science, problems are
solved using algorithms; the algorithms applicable to typical questions are
described below:

Is this A or B?

This refers to problems that have only two fixed solutions, such as Yes or
No, 1 or 0, may or may not. Problems of this type can be solved using
classification algorithms.

Is this different?

This refers to questions about a set of patterns in which we need to find
the odd one out. Such problems can be solved using anomaly detection
algorithms.
How much or how many?

Other problems ask for numerical values or figures, such as what the time
is now or what the temperature will be today. These can be solved using
regression algorithms.

How is this organized?

If you have a problem that deals with the organization of data, it can be
solved using clustering algorithms.

A clustering algorithm organizes and groups data based on features, colors,
or other common characteristics.

Data Science Lifecycle


The main phases of the data science life cycle are given below:

1. Discovery: The first phase is discovery, which involves asking the right
questions. When you start any data science project, you need to determine
the basic requirements, priorities, and project budget. In this phase, we
determine all the requirements of the project, such as the number of
people, technology, time, data, and the end goal, and then we can frame the
business problem at a first hypothesis level.

2. Data preparation: Data preparation is also known as data munging. In
this phase, we need to perform the following tasks:

o Data cleaning
o Data reduction
o Data integration
o Data transformation

After performing all the above tasks, we can easily use the data in our
further processes.
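A rough pandas sketch of these tasks, using a small made-up DataFrame (data integration, which combines multiple sources, would use pd.merge or pd.concat):

```python
# Sketch of the data-munging tasks above with pandas; the data is made up.
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "age": [25, np.nan, 40, 25],
    "salary": ["50000", "60000", "70000", "50000"],
})

clean = raw.dropna().drop_duplicates().copy()    # cleaning + reduction
clean["salary"] = clean["salary"].astype(float)  # transformation: fix the type
# Data integration would combine this frame with others via pd.merge / pd.concat.

print(clean)
```

The row with a missing age and the duplicated row are dropped, and the salary column becomes numeric so it can be analyzed.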

3. Model Planning: In this phase, we determine the methods and techniques
for establishing the relationships between input variables. We apply
exploratory data analysis (EDA), using various statistical formulas and
visualization tools, to understand the relationships between variables and
to see what the data can tell us. Common tools used for model planning are:

o SQL Analysis Services
o R
o SAS
o Python

4. Model building: In this phase, the process of model building starts. We
create datasets for training and testing purposes, and apply techniques
such as association, classification, and clustering to build the model.

The following are some common model-building tools:

o SAS Enterprise Miner
o WEKA
o SPSS Modeler
o MATLAB
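The training/testing split mentioned above can be sketched with scikit-learn's train_test_split; the toy data below is made up:

```python
# Sketch: creating training and testing datasets (scikit-learn, toy data).
from sklearn.model_selection import train_test_split

X = [[i] for i in range(10)]         # 10 samples, one feature each
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]   # toy labels

# Hold out 30% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

print(len(X_train), len(X_test))
```

The model is then fitted on the training set and evaluated on the held-out test set.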

5. Operationalize: In this phase, we deliver the final reports of the
project, along with briefings, code, and technical documents. This phase
also provides a clear overview of the complete project's performance and
other components on a small scale before full deployment.

6. Communicate results: In this phase, we check whether we have reached the
goal set in the initial phase, and we communicate the findings and final
result to the business team.

Applications of Data Science:


o Image recognition and speech recognition:
  Data science is currently used for image and speech recognition.
  When you upload an image on Facebook, you start getting suggestions
  to tag your friends. This automatic tagging suggestion uses an image
  recognition algorithm, which is part of data science.
  When you say something to "Ok Google", Siri, Cortana, etc., and these
  devices respond to voice control, that is made possible by speech
  recognition algorithms.
o Gaming world:
  In the gaming world, the use of machine learning algorithms is
  increasing day by day. EA Sports, Sony, and Nintendo are widely using
  data science to enhance the user experience.
o Internet search:
  When we want to search for something on the internet, we use
  different search engines such as Google, Yahoo, Bing, Ask, etc. All
  these search engines use data science technology to make the search
  experience better, and you get a search result within a fraction of a
  second.
o Transport:
  Transport industries are also using data science technology to create
  self-driving cars. With self-driving cars, it will be easier to reduce
  the number of road accidents.
o Healthcare:
In the healthcare sector, data science is providing lots of benefits. Data
science is being used for tumor detection, drug discovery, medical
image analysis, virtual medical bots, etc.
o Recommendation systems:
  Most companies, such as Amazon, Netflix, Google Play, etc., use data
  science technology to create a better user experience with personalized
  recommendations. For example, when you search for something on Amazon
  and start getting suggestions for similar products, that is thanks to
  data science technology.
o Risk detection:
  Finance industries have always had issues with fraud and the risk of
  losses, but with the help of data science, these can be reduced.
  Most finance companies are looking for data scientists to help avoid
  risk and losses while increasing customer satisfaction.
Data Science Introduction
Data Science is a combination of multiple disciplines that uses statistics,
data analysis, and machine learning to analyze data and to extract
knowledge and insights from it.

What is Data Science?


Data Science is about data gathering, analysis and decision-making.

Data Science is about finding patterns in data through analysis, and making
future predictions.

By using Data Science, companies are able to make:

 Better decisions (should we choose A or B?)
 Predictive analysis (what will happen next?)
 Pattern discoveries (find patterns, or maybe hidden information, in the
data)

Where is Data Science Needed?


Data Science is used in many industries in the world today, e.g. banking,
consultancy, healthcare, and manufacturing.

Examples of where Data Science is needed:

 For route planning: To discover the best routes to ship
 To foresee delays for flight/ship/train etc. (through predictive analysis)
 To create promotional offers
 To find the best-suited time to deliver goods
 To forecast next year's revenue for a company
 To analyze the health benefits of training
 To predict who will win elections
Data Science can be applied in nearly every part of a business where data is
available. Examples are:

 Consumer goods
 Stock markets
 Industry
 Politics
 Logistic companies
 E-commerce

How Does a Data Scientist Work?


A Data Scientist requires expertise in several backgrounds:

 Machine Learning
 Statistics
 Programming (Python or R)
 Mathematics
 Databases

A Data Scientist must find patterns within the data. Before he/she can find the
patterns, he/she must organize the data in a standard format.

Here is how a Data Scientist works:

1. Ask the right questions - To understand the business problem.
2. Explore and collect data - From databases, web logs, customer
   feedback, etc.
3. Extract the data - Transform the data to a standardized format.
4. Clean the data - Remove erroneous values from the data.
5. Find and replace missing values - Check for missing values and
   replace them with a suitable value (e.g. an average value).
6. Normalize data - Scale the values into a practical range (e.g. 140 cm is
   smaller than 1.8 m; however, the number 140 is larger than 1.8, so
   scaling is important).
7. Analyze data, find patterns and make future predictions.
8. Represent the result - Present the result with useful insights in a way
   the "company" can understand.
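Step 6 can be sketched in plain Python; the height values are made up, with 1.8 m converted to centimeters before min-max scaling:

```python
# Sketch of step 6 (normalize data): one unit first, then min-max scaling.
heights_cm = [140.0, 1.8 * 100, 160.0, 195.0]  # 1.8 m becomes 180 cm

lo, hi = min(heights_cm), max(heights_cm)
scaled = [(h - lo) / (hi - lo) for h in heights_cm]

print(scaled)  # every value now lies between 0 and 1
```

After scaling, 140 cm and 1.8 m are directly comparable, which is exactly the point of the note about 140 versus 1.8.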

Where to Start?
In this tutorial, we will start by presenting what data is and how data can be
analyzed.
You will learn how to use statistics and mathematical functions to make
predictions.

What is Data?
Data is a collection of information.

One purpose of Data Science is to structure data, making it interpretable
and easy to work with.

Data can be categorized into two groups:

 Structured data
 Unstructured data

Unstructured Data
Unstructured data is not organized. We must organize the data for analysis
purposes.
Structured Data
Structured data is organized and easier to work with.

How to Structure Data?


We can use an array or a database table to structure or present data.

Example of an array:
[80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

The following example shows how to create an array in Python:

Example
Array = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
print(Array)


It is common to work with very large data sets in Data Science.

In this tutorial we will try to make the concepts of Data Science as easy
as possible to understand. We will therefore work with a small data set
that is easy to interpret.

Database Table
A database table is a table with structured data.
The following table shows a database table with health data extracted from a
sports watch:

Duration  Average_Pulse  Max_Pulse  Calorie_Burnage  Hours_Work  Hours_Sleep
30        80             120        240              10          7
30        85             120        250              10          7
45        90             130        260              8           7
45        95             130        270              8           7
45        100            140        280              0           7
60        105            140        290              7           8
60        110            145        300              7           8
60        115            145        310              8           8
75        120            150        320              0           8
75        125            150        330              8           8


This dataset contains information about a typical training session, such as
duration, average pulse, calorie burnage, etc.

         Column 1  Column 2       Column 3   Column 4         Column 5    Column 6
         Duration  Average_Pulse  Max_Pulse  Calorie_Burnage  Hours_Work  Hours_Sleep
Row 1    30        80             120        240              10          7
Row 2    30        85             120        250              10          7
Row 3    45        90             130        260              8           7
Row 4    45        95             130        270              8           7
Row 5    45        100            140        280              0           7
Row 6    60        105            140        290              7           8
Row 7    60        110            145        300              7           8
Row 8    60        115            145        310              8           8
Row 9    75        120            150        320              0           8
Row 10   75        125            150        330              8           8

Database Table Structure

A database table consists of column(s) and row(s):

A row is a horizontal representation of data.

A column is a vertical representation of data.


Variables
A variable is defined as something that can be measured or counted.

Examples can be characters, numbers or time.

In the example below, we can observe that each column represents a variable.

Duration Average_Pulse Max_Pulse Calorie_Burnage Hours_Work Hours_Sleep

30 80 120 240 10 7

30 85 120 250 10 7

45 90 130 260 8 7

45 95 130 270 8 7

45 100 140 280 0 7

60 105 140 290 7 8

60 110 145 300 7 8

60 115 145 310 8 8

75 120 150 320 0 8

75 125 150 330 8 8


There are 6 columns, meaning that there are 6 variables (Duration,
Average_Pulse, Max_Pulse, Calorie_Burnage, Hours_Work, Hours_Sleep).

There are 11 rows, meaning that each variable has 10 observations.

But if there are 11 rows, how come there are only 10 observations?

It is because the first row is the label, meaning that it is the name of
the variable.

Python
Python is a programming language widely used by Data Scientists.

Python has in-built mathematical libraries and functions, making it easier
to calculate mathematical problems and to perform data analysis.

We will provide practical examples using Python.

To learn more about Python, please visit our Python Tutorial.

Python Libraries
Python has libraries with large collections of mathematical functions and
analytical tools.

In this course, we will use the following libraries:

 Pandas - This library is used for structured data operations, like
importing CSV files, creating dataframes, and preparing data
 Numpy - This is a mathematical library. Has a powerful N-dimensional
array object, linear algebra, Fourier transform, etc.
 Matplotlib - This library is used for visualization of data.
 SciPy - This library has linear algebra modules

We will use these libraries throughout the course to create examples.


Create a DataFrame with Pandas
A data frame is a structured representation of data.

Let's define a data frame with 3 columns and 5 rows with fictional numbers:

Example
import pandas as pd

d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11]}

df = pd.DataFrame(data=d)

print(df)

Example Explained
 Import the Pandas library as pd
 Define data with column and rows in a variable named d
 Create a data frame using the function pd.DataFrame()
 The data frame contains 3 columns and 5 rows
 Print the data frame output with the print() function

We write pd. in front of DataFrame() to let Python know that we want to
activate the DataFrame() function from the Pandas library.

Be aware of the capital D and F in DataFrame!

Interpreting the Output

This is the output:

   col1  col2  col3
0     1     4     7
1     2     5     8
2     3     6    12
3     4     9     1
4     7     5    11

We see that "col1", "col2" and "col3" are the names of the columns.

Do not be confused about the vertical numbers ranging from 0-4. They tell us
the position of the rows.

In Python, the numbering of rows starts with zero.

Now, we can use Python to count the columns and rows.

We can use df.shape[1] to find the number of columns:

Example
Count the number of columns:

count_column = df.shape[1]
print(count_column)


We can use df.shape[0] to find the number of rows:

Example
Count the number of rows:

count_row = df.shape[0]
print(count_row)

Why Can We Not Just Count the Rows and Columns
Ourselves?
If we work with larger data sets with many columns and rows, it is
confusing to count them yourself, and you risk counting wrongly. If we use
the built-in functions in Python correctly, we can be sure the count is
correct.

Data Science Functions



This chapter shows three commonly used functions when working with Data
Science: max(), min(), and mean().

The Sports Watch Data Set


Duration Average_Pulse Max_Pulse Calorie_Burnage Hours_Work Hours_Sleep

30 80 120 240 10 7

30 85 120 250 10 7

45 90 130 260 8 7

45 95 130 270 8 7
45 100 140 280 0 7

60 105 140 290 7 8

60 110 145 300 7 8

60 115 145 310 8 8

75 120 150 320 0 8

75 125 150 330 8 8

The data set above consists of 6 variables, each with 10 observations:

 Duration - How long did the training session last, in minutes?
 Average_Pulse - What was the average pulse of the training session?
This is measured in beats per minute
 Max_Pulse - What was the max pulse of the training session?
 Calorie_Burnage - How many calories were burnt during the training
session?
 Hours_Work - How many hours did we work at our job before the training
session?
 Hours_Sleep - How much did we sleep the night before the training
session?

We use an underscore (_) to separate words in the names, because Python
cannot read a space as a separator.

The max() function
The Python max() function is used to find the highest value in a set of
values.

Example
Average_pulse_max = max(80, 85, 90, 95, 100, 105, 110, 115, 120, 125)

print(Average_pulse_max)

The min() function

The Python min() function is used to find the lowest value in a set of
values.

Example
Average_pulse_min = min(80, 85, 90, 95, 100, 105, 110, 115, 120, 125)

print(Average_pulse_min)

The mean() function

The NumPy mean() function is used to find the average value of an array.

Example
import numpy as np

Calorie_burnage = [240, 250, 260, 270, 280, 290, 300, 310, 320, 330]

Average_calorie_burnage = np.mean(Calorie_burnage)

print(Average_calorie_burnage)

Note: We write np. in front of mean to let Python know that we want to
activate the mean() function from the NumPy library.


Data Science - Data Preparation

Before analyzing data, a Data Scientist must extract the data and make it
clean and valuable.

Extract and Read Data With Pandas

Before data can be analyzed, it must be imported/extracted.

In the example below, we show you how to import data using Pandas in Python.

We use the read_csv() function to import a CSV file with the health data:

Example
import pandas as pd
health_data = pd.read_csv("data.csv", header=0, sep=",")

print(health_data)


Example Explained
 Import the Pandas library
 Name the data frame as health_data.
 header=0 means that the headers for the variable names are to be found
in the first row (note that 0 means the first row in Python)
 sep="," means that "," is used as the separator between the values. This
is because we are using the file type .csv (comma separated values)

Tip: If you have a large CSV file, you can use the head() function to show
only the top 5 rows:

Example
import pandas as pd

health_data = pd.read_csv("data.csv", header=0, sep=",")

print(health_data.head())


Data Cleaning
Look at the imported data. As you can see, the data is "dirty", with wrong
or unregistered values:
 There are some blank fields
 Average pulse of 9 000 is not possible
 9 000 will be treated as non-numeric, because of the space separator
 One observation of max pulse is denoted as "AF", which does not make
sense

So, we must clean the data in order to perform the analysis.

Remove Blank Rows

We see that the non-numeric values (9 000 and AF) are in the same rows as
the missing values.

Solution: We can remove the rows with missing observations to fix this problem.

When we load a data set using Pandas, all blank cells are automatically
converted into "NaN" values.

So, removing the NaN cells gives us a clean data set that can be analyzed.

We can use the dropna() function to remove the NaNs. axis=0 means that we
want to remove all rows that have a NaN value:

Example
health_data.dropna(axis=0, inplace=True)

print(health_data)

The result is a data set without NaN rows:

Data Categories
To analyze data, we also need to know the types of data we are dealing with.

Data can be split into two main categories:

1. Quantitative Data - Can be expressed as a number or can be quantified.
Can be divided into two sub-categories:
o Discrete data: Numbers are counted as "whole", e.g. number of
students in a class, number of goals in a soccer game
o Continuous data: Numbers can be of infinite precision, e.g. weight
of a person, shoe size, temperature
2. Qualitative Data - Cannot be expressed as a number and cannot be
quantified. Can be divided into two sub-categories:
o Nominal data: Example: gender, hair color, ethnicity
o Ordinal data: Example: school grades (A, B, C), economic status
(low, middle, high)
By knowing the type of your data, you will be able to know what technique to
use when analyzing them.
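As a small sketch, the categories above can be expressed as pandas dtypes; the columns and values below are made up:

```python
# Sketch: the data categories above expressed as pandas dtypes (made-up data).
import pandas as pd

df = pd.DataFrame({
    "goals": [1, 3, 2],                       # quantitative, discrete
    "weight": [70.5, 82.1, 65.0],             # quantitative, continuous
    "hair_color": ["brown", "black", "red"],  # qualitative, nominal
    "grade": pd.Categorical(                  # qualitative, ordinal
        ["B", "A", "C"], categories=["C", "B", "A"], ordered=True),
})

print(df.dtypes)
```

Numeric columns can go straight into calculations, while the ordered categorical carries a ranking without pretending to be a number.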

Data Types
We can use the info() function to list the data types within our data set:

Example
print(health_data.info())


Result:

We see that this data set has two different types of data:

 Float64
 Object

We cannot use objects to calculate and perform analysis here. We must convert
the type object to float64 (float64 is a number with a decimal in Python).

We can use the astype() function to convert the data into float64.

The following example converts "Average_Pulse" and "Max_Pulse" into data type
float64 (the other variables are already of data type float64):
Example
health_data["Average_Pulse"] = health_data["Average_Pulse"].astype(float)
health_data["Max_Pulse"] = health_data["Max_Pulse"].astype(float)

print(health_data.info())

Result:

Now, the data set has only float64 data types.

Analyze the Data

When we have cleaned the data set, we can start analyzing the data.

We can use the describe() function in Python to summarize data:

Example
print(health_data.describe())


Result:
       Duration  Average_Pulse  Max_Pulse  Calorie_Burnage  Hours_Work
Count  10.0      10.0           10.0       10.0             10.0
Mean   51.0      102.5          137.0      285.0            6.6
Std    10.49     15.4           11.35      30.28            3.6
Min    30.0      80.0           120.0      240.0            0.0
25%    45.0      91.25          130.0      262.5            7.0
50%    52.5      102.5          140.0      285.0            8.0
75%    60.0      113.75         145.0      307.5            8.0
Max    60.0      125.0          150.0      330.0            10.0

(The Hours_Sleep column is cut off at the right edge of this output.)

 Count - Counts the number of observations
 Mean - The average value
 Std - Standard deviation (explained in the statistics chapter)
 Min - The lowest value
 25%, 50% and 75% are percentiles (explained in the statistics chapter)
 Max - The highest value
