CD101 Fundamentals of Data Science
Data science has become one of the most in-demand jobs of the 21st century. Every
organization is looking for candidates with knowledge of data science. In this
tutorial, we give an introduction to data science, covering data science job
roles, tools for data science, components of data science, applications, etc.
So let's start.
Data science uses powerful hardware, programming systems, and
efficient algorithms to solve data-related problems. It is often described as the
future of artificial intelligence.
Example:
Suppose we want to travel from station A to station B by car. We need to
make several decisions, such as which route will get us to the destination
fastest, which route will have no traffic jams, and which will be the most
cost-effective. All these decision factors act as input data, and we derive an
appropriate answer from them. This analysis of data is called data analysis,
which is a part of data science.
Now, handling such a huge amount of data is a challenging task for every
organization. To handle, process, and analyze it, we require complex,
powerful, and efficient algorithms and technology, and that technology is
data science. The following are some main reasons for using data science
technology:
o With the help of data science technology, we can convert the massive
amount of raw and unstructured data into meaningful insights.
o Data science technology is adopted by various companies, whether big
brands or startups. Google, Amazon, Netflix, and others, which handle huge
amounts of data, use data science algorithms to provide a better customer
experience.
o Data science is used to automate transportation, such as creating self-
driving cars, which are the future of transportation.
o Data science can help with different kinds of predictions, such as surveys,
elections, flight ticket confirmation, etc.
The average salary range for a data scientist is approximately $95,000
to $165,000 per annum, and according to various studies, about 11.5
million data science jobs will be created by the year 2026. Some of the
prominent job roles in data science are:
1. Data Scientist
2. Data Analyst
3. Machine learning expert
4. Data engineer
5. Data Architect
6. Data Administrator
7. Business Analyst
8. Business Intelligence Manager
1. Data Analyst:
Skills required: To become a data analyst, you need a good
background in mathematics, business intelligence, and data mining, along
with a basic knowledge of statistics. You should also be familiar with
computer languages and tools such as MATLAB, Python, SQL, Hive, Pig,
Excel, SAS, R, JavaScript, and Spark.
2. Machine Learning Expert:
A machine learning expert works with the various machine learning
algorithms used in data science, such as regression, clustering,
classification, decision trees, and random forests.
3. Data Engineer:
A data engineer works with massive amounts of data and is responsible for
building and maintaining the data architecture of a data science project.
Data engineers also create the data set processes used in modeling, mining,
acquisition, and verification.
4. Data Scientist:
Technical Prerequisites:
o Machine learning: To understand data science, one needs to understand
the concept of machine learning. Data science uses machine learning
algorithms to solve various problems.
o Mathematical modeling: Mathematical modeling is required to make fast
mathematical calculations and predictions from the available data.
o Statistics: Basic understanding of statistics is required, such as mean,
median, or standard deviation. It is needed to extract knowledge and obtain
better results from the data.
o Computer programming: For data science, knowledge of at least one
programming language is required. R, Python, Spark are some required
computer programming languages for data science.
o Databases: A deep understanding of databases, such as SQL, is essential
for data science in order to get the data and to work with it.
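As a minimal, hypothetical sketch of what "getting the data" with SQL can look like from Python, here is an example using the standard-library sqlite3 module and an invented table (the table name, columns, and values are all made up for illustration):

```python
import sqlite3

# Build a small in-memory database with a hypothetical health table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE health (duration INTEGER, average_pulse INTEGER)")
conn.executemany("INSERT INTO health VALUES (?, ?)",
                 [(30, 80), (30, 85), (45, 90)])

# SQL is how we get the data out before working with it
rows = conn.execute(
    "SELECT duration, average_pulse FROM health "
    "WHERE duration = 30 ORDER BY average_pulse"
).fetchall()
print(rows)  # [(30, 80), (30, 85)]
```

In practice the data would live in an external database server rather than in memory, but the query pattern is the same.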
Data Source: Business intelligence deals with structured data, e.g., a data
warehouse; data science deals with both structured and unstructured data,
e.g., weblogs, feedback, etc.
Method: Business intelligence is analytical (works on historical data); data
science is scientific (goes deeper to find the reason behind the data).
Skills: Statistics and visualization are the two skills required for business
intelligence; statistics, visualization, and machine learning are the required
skills for data science.
Focus: Business intelligence focuses on both past and present data; data
science focuses on past and present data as well as future predictions.
o Regression
o Decision tree
o Clustering
o Principal component analysis
o Support vector machines
o Naive Bayes
o Artificial neural network
o Apriori
We will provide a brief introduction to a few of the important algorithms
here.
1. Linear Regression: fits a straight line, y = mx + c, to the data, where m is
the slope and c is the intercept.
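A quick sketch of fitting such a straight line in Python (the data points below are invented and lie exactly on y = 2x + 1):

```python
import numpy as np

# Hypothetical points lying exactly on the line y = 2x + 1
x = np.array([1, 2, 3, 4, 5])
y = np.array([3, 5, 7, 9, 11])

# np.polyfit with degree 1 returns the best-fit slope m and intercept c
m, c = np.polyfit(x, y, 1)
print(round(m, 2), round(c, 2))  # 2.0 1.0
```

With real, noisy data the recovered slope and intercept would only approximate the underlying relationship.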
2. Decision Tree: in the decision tree algorithm, we solve the problem using
a tree representation in which each node represents a feature, each branch
represents a decision, and each leaf represents an outcome.
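A toy sketch of that structure, with a hand-built tree over invented weather features (real decision trees are learned from data rather than written by hand):

```python
# A toy decision tree as nested dicts: each node tests a feature,
# each branch is a decision, and each leaf (a plain string) is an outcome.
tree = {
    "feature": "outlook",
    "branches": {
        "sunny": {"feature": "humidity",
                  "branches": {"high": "don't play", "normal": "play"}},
        "rainy": "don't play",
        "overcast": "play",
    },
}

def predict(node, sample):
    # Walk down the tree until a leaf is reached
    while isinstance(node, dict):
        node = node["branches"][sample[node["feature"]]]
    return node

print(predict(tree, {"outlook": "sunny", "humidity": "normal"}))  # play
```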
3. K-Means Clustering: if we are given a data set of items with certain
features and values, and we need to categorize those items into groups, such
problems can be solved using the k-means clustering algorithm.
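A bare-bones sketch of the k-means idea in NumPy (the points and cluster count below are invented; real projects would typically use a library implementation):

```python
import numpy as np

def kmeans(points, k, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    # Start with k randomly chosen points as the initial centroids
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels

# Two obvious groups of 2-D points
pts = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5]])
labels = kmeans(pts, k=2)
print(labels)  # the first two points share one cluster id, the last two the other
```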
Is this A or B?:
This refers to problems that have only two fixed answers, such as yes or no,
1 or 0, may or may not. Problems of this type can be solved using
classification algorithms.
Is this different?:
This refers to questions where we look at various patterns and need to find
the odd one out. Problems of this type can be solved using anomaly
detection algorithms.
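One simple sketch of spotting the odd value out is a z-score check (the pulse readings below are invented, and the threshold of 2 is an arbitrary choice, not a standard rule):

```python
import numpy as np

# Hypothetical pulse readings; 9000 is clearly a recording error
pulses = np.array([80.0, 85.0, 90.0, 95.0, 100.0, 9000.0])

# A value is "different" if it lies far from the mean,
# measured in standard-deviation units (the z-score)
z_scores = np.abs(pulses - pulses.mean()) / pulses.std()
outliers = pulses[z_scores > 2]
print(outliers)  # [9000.]
```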
How much or how many?:
Other problems ask for a numerical value, such as what the temperature will
be today. Problems of this type can be solved using regression algorithms.
If you have a problem that requires organizing data into groups, it can be
solved using clustering algorithms.
1. Discovery: The first phase is discovery, which involves asking the right
questions. When you start any data science project, you need to determine
the basic requirements, priorities, and project budget. In this phase, we
determine all the requirements of the project, such as the number of people,
technology, time, data, and the end goal, and then we can frame the
business problem at a first hypothesis level.
o Data cleaning
o Data Reduction
o Data integration
o Data transformation
After performing all the above tasks, we can easily use this data for our
further processes.
Data science is about finding patterns in data through analysis, and making
future predictions. It is applied in many areas, such as:
Consumer goods
Stock markets
Industry
Politics
Logistic companies
E-commerce
A data scientist requires expertise in several backgrounds:
Machine Learning
Statistics
Programming (Python or R)
Mathematics
Databases
A Data Scientist must find patterns within the data. Before he/she can find the
patterns, he/she must organize the data in a standard format.
Where to Start?
In this tutorial, we will start by presenting what data is and how data can be
analyzed.
You will learn how to use statistics and mathematical functions to make
predictions.
What is Data?
Data is a collection of information. Data can be split into two main categories:
Structured data
Unstructured data
Unstructured Data
Unstructured data is not organized. We must organize the data for analysis
purposes.
Structured Data
Structured data is organized and easier to work with.
Example of an array:
[80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
Example
Array = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
print(Array)
Database Table
A database table is a table with structured data.
The following table shows a database table with health data extracted from a
sports watch:
Duration  Average_Pulse  Max_Pulse  Calorie_Burnage  Hours_Work  Hours_Sleep
30        80             120        240              10          7
30        85             120        250              10          7
45        90             130        260              8           7
45        95             130        270              8           7
Variables
A variable is defined as something that can be measured or counted.
Duration  Average_Pulse  Max_Pulse  Calorie_Burnage  Hours_Work  Hours_Sleep
30        80             120        240              10          7
30        85             120        250              10          7
45        90             130        260              8           7
45        95             130        270              8           7
But if there are 11 rows, how come there are only 10 observations?
It is because the first row is the label, meaning that it is the name of
the variable.
Python
Python is a programming language widely used by Data Scientists.
Python Libraries
Python has libraries with large collections of mathematical functions and
analytical tools.
Pandas - This library is used for structured data operations, like importing
CSV files, creating DataFrames, and preparing data.
Numpy - This is a mathematical library. It has a powerful N-dimensional
array object, linear algebra routines, Fourier transforms, etc.
Matplotlib - This library is used for visualization of data.
SciPy - This library has linear algebra modules
Let's define a data frame with 3 columns and 5 rows with fictional numbers:
Example
import pandas as pd
# d holds the fictional numbers: 3 columns, 5 rows (values are arbitrary)
d = {"col1": [1, 2, 3, 4, 5], "col2": [6, 7, 8, 9, 10], "col3": [11, 12, 13, 14, 15]}
df = pd.DataFrame(data=d)
print(df)
Example Explained
Import the Pandas library as pd
Define the data with columns and rows in a variable named d
Create a data frame using the function pd.DataFrame()
The data frame contains 3 columns and 5 rows
Print the data frame output with the print() function
Do not be confused by the vertical numbers ranging from 0 to 4. They
indicate the position (index) of each row.
Example
Count the number of columns:
count_column = df.shape[1]
print(count_column)
Example
Count the number of rows:
count_row = df.shape[0]
print(count_row)
Why Can We Not Just Count the Rows and Columns
Ourselves?
If we work with larger data sets that have many columns and rows, counting
by ourselves becomes confusing, and we risk counting wrongly. If we use the
built-in functions in Python correctly, we can be sure the count is correct.
This chapter shows three commonly used functions when working with Data
Science: max(), min(), and mean().
Duration  Average_Pulse  Max_Pulse  Calorie_Burnage  Hours_Work  Hours_Sleep
30        80             120        240              10          7
30        85             120        250              10          7
45        90             130        260              8           7
45        95             130        270              8           7
45        100            140        280              0           7
We use an underscore (_) to separate the words in variable names, because
Python does not allow spaces in names.
The max() function
The Python max() function is used to find the highest value in an array.
Example
Average_pulse_max = max(80, 85, 90, 95, 100, 105, 110, 115, 120, 125)
print(Average_pulse_max)
The min() function
The Python min() function is used to find the lowest value in an array.
Example
Average_pulse_min = min(80, 85, 90, 95, 100, 105, 110, 115, 120, 125)
print(Average_pulse_min)
The mean() function
The NumPy mean() function is used to find the average value of an array.
Example
import numpy as np

Calorie_burnage = [240, 250, 260, 270, 280, 290, 300, 310, 320, 330]
Average_calorie_burnage = np.mean(Calorie_burnage)
print(Average_calorie_burnage)
Note: We write np. in front of mean to let Python know that we want to
use the mean() function from the NumPy library.
Before analyzing data, a Data Scientist must extract the data, and make it
clean and valuable.
In the example below, we show you how to import data using Pandas in Python.
We use the read_csv() function to import a CSV file with the health data:
Example
import pandas as pd
health_data = pd.read_csv("data.csv", header=0, sep=",")
print(health_data)
Example Explained
Import the Pandas library
Name the data frame as health_data.
header=0 means that the headers for the variable names are to be found
in the first row (note that 0 means the first row in Python)
sep="," means that "," is used as the separator between the values. This
is because we are using the file type .csv (comma separated values)
Tip: If you have a large CSV file, you can use the head() function to show
only the top five rows:
Example
import pandas as pd

health_data = pd.read_csv("data.csv", header=0, sep=",")
print(health_data.head())
Data Cleaning
Look at the imported data. As you can see, the data are "dirty", with wrongly
registered or missing values:
There are some blank fields
Average pulse of 9 000 is not possible
9 000 will be treated as non-numeric, because of the space separator
One observation of max pulse is denoted as "AF", which does not make
sense
Solution: We can remove the rows with missing observations to fix this problem.
When we load a data set using Pandas, all blank cells are automatically
converted into "NaN" values.
So, removing the NaN cells gives us a clean data set that can be analyzed.
We can use the dropna() function to remove the NaNs. axis=0 means that we
want to remove all rows that have a NaN value:
Example
health_data.dropna(axis=0, inplace=True)
print(health_data)
Data Categories
To analyze data, we also need to know the types of data we are dealing with.
Data Types
We can use the info() function to list the data types within our data set:
Example
print(health_data.info())
We see that this data set has two different types of data:
Float64
Object
We cannot use objects to calculate and perform analysis here. We must convert
the type object to float64 (float64 is a number with a decimal in Python).
We can use the astype() function to convert the data into float64.
The following example converts "Average_Pulse" and "Max_Pulse" into data type
float64 (the other variables are already of data type float64):
Example
health_data["Average_Pulse"] = health_data["Average_Pulse"].astype(float)
health_data["Max_Pulse"] = health_data["Max_Pulse"].astype(float)
print(health_data.info())
Example
print(health_data.describe())
Result:
Duration Average_Pulse Max_Pulse Calorie_Burnage Ho