A Beginner’s Guide To Data Science
As the world entered the era of big data, the need to store that data also grew. Storage was the main challenge and concern for enterprise industries until 2010, and the main focus was on building frameworks and solutions to store data. Now that Hadoop and other frameworks have successfully solved the problem of storage, the focus has shifted to the processing of this data. Data Science is the secret sauce here. All the ideas you see in Hollywood sci-fi movies can actually turn into reality through Data Science. Data Science is the future of Artificial Intelligence. Therefore, it is very important to understand what Data Science is and how it can add value to your business.
Data Science is primarily used to make decisions and predictions, making use of predictive causal analytics, prescriptive analytics (predictive plus decision science) and machine learning.
● Predictive causal analytics – If you want a model that can predict the possibility of a particular event occurring in the future, you need to apply predictive causal analytics. Say, if you are providing money on credit, then the probability of customers making future credit payments on time is a matter of concern for you. Here, you can build a model that performs predictive analytics on the payment history of the customer to predict whether future payments will be on time or not.
● Prescriptive analytics – If you want a model that has the intelligence to take its own decisions and the ability to modify itself with dynamic parameters, you certainly need prescriptive analytics. This relatively new field is all about providing advice. In other terms, it not only predicts but suggests a range of prescribed actions and associated outcomes.
The best example of this is Google’s self-driving car. The data gathered by vehicles can be used to train self-driving cars. You can run algorithms on this data to bring intelligence to it. This will enable your car to take decisions like when to turn, which path to take, and when to slow down or speed up.
● Machine learning for making predictions – If you have the transactional data of a finance company and need to build a model to determine future trends, then machine learning algorithms are the best bet. This falls under the paradigm of supervised learning. It is called supervised because you already have the data on which you can train your machines. For example, a fraud detection model can be trained using a historical record of fraudulent purchases.
● Machine learning for pattern discovery – If you don’t have the parameters on which to make predictions, then you need to find the hidden patterns within the dataset to be able to make meaningful predictions. This is nothing but an unsupervised model, as you don’t have any predefined labels for grouping. The most common algorithm used for pattern discovery is clustering.
Let’s say you are working for a telephone company and you need to establish a network by putting towers in a region. You can use the clustering technique to find those tower locations which will ensure that all users receive optimum signal strength; a minimal code sketch of this idea follows the list.
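As a rough illustration of the tower-placement idea, here is a minimal sketch using scikit-learn’s KMeans. The user coordinates are invented purely for the example; each cluster centre becomes a candidate tower site.

# A minimal clustering sketch: place one tower per cluster of users.
# The user coordinates below are made up for illustration.
import numpy as np
from sklearn.cluster import KMeans

# (x, y) locations of users in the region, e.g. in kilometres
user_locations = np.array([
    [1.0, 2.0], [1.5, 1.8], [1.2, 2.2],   # one pocket of users
    [8.0, 8.0], [8.3, 7.7], [7.9, 8.4],   # another pocket
    [5.0, 0.5], [5.2, 0.8], [4.8, 0.2],   # a third pocket
])

# One tower per cluster; each cluster centre is a candidate tower site
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(user_locations)

print("Candidate tower locations:\n", kmeans.cluster_centers_)
print("User-to-tower assignment:", kmeans.labels_)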
Let’s see how the proportion of the above-described approaches differs for Data Analysis as compared to Data Science. As you can see in the image below, Data Analysis includes descriptive analytics and prediction to a certain extent. Data Science, on the other hand, is more about Predictive Causal Analytics and Machine Learning.
Now that you know what exactly Data Science is, let’s find out why it was needed in the first place.
Why Data Science?
● Traditionally, the data that we had was mostly structured and small in size, and it could be analyzed using simple BI tools. Unlike data in traditional systems, which was mostly structured, today most data is unstructured or semi-structured. Have a look at the data trends in the image below, which shows that by 2020 more than 80% of data will be unstructured.
This data is generated from different sources like financial logs, text files, multimedia
forms, sensors, and instruments. Simple BI tools are not capable of processing this
huge volume and variety of data. This is why we need more complex and advanced
analytical tools and algorithms for processing, analyzing and drawing meaningful
insights out of it.
This is not the only reason why Data Science has become so popular. Let’s dig deeper and
see how Data Science is being used in various domains.
● How about if you could understand the precise requirements of your customers from existing data such as their past browsing history, purchase history, age and income? No doubt you had all this data earlier too, but now, with a far greater amount and variety of data, you can train models more effectively and recommend products to your customers with much more precision. Wouldn’t it be amazing, given that it will bring more business to your organisation?
● Let’s take a different scenario to understand the role of Data Science in decision making. How about if your car had the intelligence to drive you home? Self-driving cars collect live data from sensors, including radars, cameras and lasers, to create a map of their surroundings. Based on this data, the car takes decisions like when to speed up, when to slow down, when to overtake and where to take a turn, making use of advanced machine learning algorithms.
● Let’s see how Data Science can be used in predictive analytics, taking weather forecasting as an example. Data from ships, aircraft, radars and satellites can be collected and analysed to build models. These models will not only forecast the weather but also help in predicting the occurrence of natural calamities. This helps you take appropriate measures beforehand and save many precious lives.
Let’s have a look at the infographic below to see all the domains where Data Science is making its impact.
This was all about what Data Science is; now let’s understand the lifecycle of Data Science. A common mistake made in Data Science projects is rushing into data collection and analysis without understanding the requirements or even framing the business problem properly. It is therefore very important for you to follow all the phases of the Data Science lifecycle to ensure the project runs smoothly.
Statistics:
Statistics is the most critical unit of Data Science basics. It is the method or science of collecting and analysing numerical data in large quantities to derive useful insights.
Visualization:
Visualization techniques help you digest huge amounts of data by turning it into easy-to-understand visuals.
Lifecycle of Data Science
Here is a brief overview of the main phases of the Data Science Lifecycle:
Phase 1—Discovery / Framing the Problem: Before you begin the project, it is important
to understand the various specifications, requirements, priorities and required budget. You
must possess the ability to ask the right questions. Here, you assess if you have the required
resources present in terms of people, technology, time and data to support the project. In
this phase, you also need to frame the business problem and formulate initial hypotheses
(IH) to test.
A great way to go through this phase is to ask the right questions about the people, technology, time, data and budget available for the project.
Phase 2—Data preparation: In this phase, you require an analytical sandbox in which you can perform analytics for the entire duration of the project. You need to explore, preprocess and condition data prior to modelling, and you will perform ETLT (extract, transform, load and transform) to get the data into the sandbox.
Data can have many inconsistencies, like missing values, blank columns and incorrect data formats, which need to be cleaned; the cleaner your data, the better your predictions. You can use R for data cleaning, transformation and visualization. This will help you spot the outliers and establish relationships between the variables/attributes. Once you have cleaned and prepared the data, it’s time to do exploratory analytics on it. Let’s see how you can achieve that with the sketch below.
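As a rough illustration of these conditioning steps, here is a minimal sketch. The blog works in R, but the same steps are shown here in pandas; the column names and values are hypothetical stand-ins for real data.

# A minimal pandas sketch of typical data-conditioning steps.
# The column names and values below are hypothetical stand-ins.
import pandas as pd

df = pd.DataFrame({
    "amount": ["250", "abc", None, "410"],  # wrong format, an invalid entry, a gap
    "region": ["north", "south", "south", None],
    "notes":  [None, None, None, None],     # a completely blank column
})

# Spot the inconsistencies: missing values and incorrect formats
print(df.isna().sum())   # missing values per column
print(df.dtypes)         # 'amount' shows up as object, not a number

# Condition the data prior to modelling
df = df.drop(columns=["notes"])                              # drop the blank column
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")  # fix the format
df["amount"] = df["amount"].fillna(df["amount"].median())    # fill missing values
print(df)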
Phase 3—Model planning: Here, you will determine the methods and techniques to draw
the relationships between variables. These relationships will set the base for the algorithms
which you will implement in the next phase. You will apply Exploratory Data Analytics (EDA)
using various statistical formulas and visualisation tools.
1. R has a complete set of modelling capabilities and provides a good environment for
building interpretive models.
2. SQL Analysis services can perform in-database analytics using common data mining
functions and basic predictive models.
3. SAS/ACCESS can be used to access Hadoop data and to create repeatable and
reusable model flow diagrams.
Although many tools are present in the market, R and Python are the most commonly used; a small Python sketch of this exploratory step follows.
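Since R and Python are named as the usual choices, here is a minimal Python sketch of the relationship-hunting EDA this phase describes; the tiny dataset is invented purely for illustration.

# A minimal model-planning EDA sketch: look for relationships between variables.
import pandas as pd

df = pd.DataFrame({            # a tiny, made-up dataset
    "age":    [25, 32, 47, 51, 38],
    "income": [30, 42, 88, 95, 60],
    "spend":  [5, 8, 20, 22, 12],
})

# Pairwise correlations hint at which variables are related, which is the
# kind of relationship you want mapped out before choosing an algorithm.
print(df.corr())

# A quick statistical summary of every numeric column
print(df.describe())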
Now that you have insights into the nature of your data and have decided on the algorithms to be used, in the next stage you will apply those algorithms and build a model.
Phase 4—Model building: You will develop datasets for training and testing purposes in this phase. Here, you need to consider whether your existing tools will suffice for running the models or whether you will need a more robust environment (like fast and parallel processing). You will analyze various learning techniques like classification, association and clustering to build the model; a minimal sketch of the train/test split follows.
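Here is a minimal sketch of splitting data into training and testing sets with scikit-learn; the features, labels and choice of classifier are synthetic stand-ins, not a prescription.

# A minimal sketch of building training and testing datasets.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))             # 100 samples, 4 made-up features
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # a synthetic binary label

# Hold out 30% of the data so the model is scored on unseen examples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression().fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))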
Phase 5—Operationalize: In this phase, you deliver final reports, briefings, code and
technical documents. In addition, sometimes a pilot project is also implemented in a
real-time production environment. This will provide you with a clear picture of the performance and other related constraints on a small scale before full deployment.
Phase 6—Communicate results: Now it is important to evaluate whether you have achieved the goal you planned in the first phase. So, in the last phase, you identify all the key findings, communicate them to the stakeholders, and determine whether the results of the project are a success or a failure based on the criteria developed in Phase 1.
Introduction
Data Science is a field associated with Big Data, designed to analyze large volumes of complex, raw data and provide meaningful information to the company based on that data. It combines many fields, such as statistics, mathematics and computation, to interpret and present data for effective decision-making by business leaders. Data Science helps businesses improve their performance, efficiency and customer satisfaction, and meet financial goals more easily.
But for data scientists to use data science effectively and deliver beneficial, productive results, a deep understanding of the data science process is required. The different stages of the data science process help in converting data into practical outcomes: they help in analyzing, extracting, visualizing, storing and managing data more effectively.
As the stages of the data science process help in converting raw data into monetary gains and overall profits, any data scientist should be well aware of the process and its significance. Now, let us discuss these steps in detail.
Step 1: Defining the Problem
Numbers alone will need much more context to become insights. At the end of this step, you must have as much information at hand as possible.
Step 2: Collecting the Raw Data for the Problem
After defining the problem, you will need to collect the requisite data
to derive insights and turn the business problem into a probable
solution. The process involves thinking through your data and finding
ways to collect and get the data you need. It can include scanning your
internal databases or purchasing databases from external sources.
Once the data is collected, you will have to clean it. The most common errors that you can encounter and should look out for are:
1. Missing values
2. Corrupted values like invalid entries
3. Time zone differences
4. Date range errors like a recorded sale before the sales even
started
You will also have to look at the aggregates of all the rows and columns in the file and see whether the values you obtain make sense. If they don’t, you will have to remove or replace the data that doesn’t make sense. Once you have completed the data cleaning process, your data will be ready for exploratory data analysis (EDA). A minimal sketch of these checks follows.
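Here is a minimal pandas sketch of the error checks listed above; the column names, values and launch date are all hypothetical.

# A minimal sketch of the common data-collection error checks.
import pandas as pd

df = pd.DataFrame({
    "sale_date": pd.to_datetime(
        ["2019-12-20", "2020-01-05", "2020-02-11", None]),
    "amount": [120.0, -5.0, 80.0, 95.0],   # includes an invalid negative entry
})

# 1. Missing values
print(df.isna().sum())

# 2. Corrupted values, e.g. invalid entries such as negative amounts
print(df[df["amount"] < 0])

# 3. Time zone differences: put all timestamps in one zone, e.g. UTC
df["sale_date"] = df["sale_date"].dt.tz_localize("UTC")

# 4. Date range errors, e.g. a sale recorded before sales even started
launch = pd.Timestamp("2020-01-01", tz="UTC")
print(df[df["sale_date"] < launch])

# Aggregate sanity check: do the totals make sense?
print(df["amount"].describe())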
Suppose, for example, you are investigating why a product is underperforming. You might find several aspects that affect its customers; for instance, some people may prefer being reached over the phone rather than via social media. These findings can prove helpful, as most marketing nowadays is done on social media and aimed only at the youth. How the product is marketed hugely affects sales, and you will have to target demographics that are not a lost cause after all. Once you are done with this step, you can combine the quantitative and qualitative data that you have and move them into action.
You need to link the data you have collected and your insights with the sales head’s knowledge so that they can understand them better. You can start by explaining why the product was underperforming and why specific demographics were not interested in the sales pitch. After presenting the problem, you can move on to its solution. You will have to build a strong narrative with clarity and clear objectives.
__________________________________________________________________
Now, I will take a case study to explain the various phases described above.
Case Study: Diabetes Prevention
What if we could predict the occurrence of diabetes and take appropriate measures
beforehand to prevent it?
In this use case, we will predict the occurrence of diabetes using the entire lifecycle we
discussed earlier. Let’s go through the various steps.
Step 1:
● First, we will collect the data based on the medical history of the patient as discussed
in Phase 1. You can refer to the sample data below.
Attributes in the data include npreg (number of pregnancies), bp (blood pressure), bmi (body mass index), ped (diabetes pedigree) and age, along with an income column.
Step 2:
● Now, once we have the data, we need to clean and prepare the data for data
analysis.
● This data has a lot of inconsistencies, like missing values, blank columns, impossible values and incorrect data formats, which need to be cleaned.
● Here, we have organized the data into a single table under different attributes –
making it look more structured.
● Let’s have a look at the sample data below.
1. In the column npreg, “one” is written in words, whereas it should be in numeric form, like 1.
2. In the column bp, one of the values is 6600, which is impossible (at least for humans), as blood pressure cannot reach such a huge value.
3. As you can see, the Income column is blank and also makes no sense for predicting diabetes. It is therefore redundant and should be removed from the table.
● So, we will clean and preprocess this data by removing the outliers, filling in the null values and normalizing the data types. If you remember, this is our second phase, data preprocessing.
● Finally, we get the clean data, as shown below and in the sketch that follows, which can be used for analysis.
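Here is a minimal pandas sketch of the clean-up just described; the small inline table is a made-up stand-in for the real patient records.

# A minimal sketch of the diabetes-data clean-up described above.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "npreg":  ["one", 3, 2, 5],    # "one" written in words instead of 1
    "bp":     [80, 6600, 72, 90],  # 6600 is an impossible blood pressure
    "bmi":    [26.6, 23.3, np.nan, 28.1],
    "income": [np.nan] * 4,        # blank column, irrelevant to diabetes
})

# 1. Normalize the data type: convert words like "one" to numbers
df["npreg"] = df["npreg"].replace({"one": 1}).astype(int)

# 2. Remove the outlier: a bp of 6600 is not humanly possible
df = df[df["bp"] <= 300]

# 3. Drop the blank income column; it adds nothing to the prediction
df = df.drop(columns=["income"])

# 4. Fill the remaining null values, e.g. with the column median
df["bmi"] = df["bmi"].fillna(df["bmi"].median())
print(df)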
Step 3:
● First, we will load the data into the analytical sandbox and apply various statistical functions to it. For example, R has functions like describe, which gives us the number of missing and unique values. We can also use the summary function, which gives us statistical information like mean, median, range, and min and max values.
● Then, we use visualization techniques like histograms, line graphs and box plots to get a fair idea of the distribution of the data; a Python equivalent is sketched below.
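For readers working in Python rather than R, here is a minimal sketch that mirrors R’s summary()/describe() and the basic distribution plots; the tiny table is invented so the example runs on its own.

# A minimal exploratory sketch mirroring R's summary()/describe().
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({                 # made-up patient-style records
    "npreg": [1, 3, 2, 5, 4],
    "bp":    [80, 85, 72, 90, 76],
    "bmi":   [26.6, 23.3, 24.7, 28.1, 31.0],
    "age":   [50, 31, 26, 45, 38],
})

# Statistical information: mean, min, max, quartiles (like R's summary)
print(df.describe())

# Number of missing and unique values (like R's describe)
print(df.isna().sum())
print(df.nunique())

# Histograms and box plots to see the distribution of the data
df.hist()
plt.show()
df.boxplot()
plt.show()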
Step 4:
Now, based on the insights derived from the previous step, the best fit for this kind of problem is a decision tree. Let’s see why.
● Since we already have the major attributes for analysis, like npreg, bmi, etc., we will use a supervised learning technique to build the model here.
● Further, we have used a decision tree in particular because it takes all attributes into consideration in one go: both those which have a linear relationship and those which have a non-linear one. In our case, npreg and age have a linear relationship, whereas npreg and ped have a non-linear relationship.
● Decision tree models are also very robust, as we can use different combinations of attributes to make various trees and then finally implement the one with the maximum efficiency.
If you want to learn more about the implementation of decision trees, refer to this blog: How To Create A Perfect Decision Tree. A minimal code sketch of the model follows.
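Here is a minimal sketch of such a decision-tree model in scikit-learn; the tiny dataset and the label column (type) are invented stand-ins for the real patient table.

# A minimal decision-tree sketch for the diabetes case study.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({                 # made-up records for illustration
    "npreg": [1, 3, 2, 5, 4, 0, 6, 2],
    "bmi":   [26.6, 23.3, 24.7, 28.1, 31.0, 35.2, 29.4, 22.0],
    "age":   [50, 31, 26, 45, 38, 29, 52, 24],
    "ped":   [0.35, 0.67, 0.17, 0.29, 0.52, 0.61, 0.44, 0.19],
    "type":  [1, 0, 0, 1, 1, 1, 1, 0],   # 1 = diabetic, 0 = not (hypothetical)
})

X = df[["npreg", "bmi", "age", "ped"]]
y = df["type"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))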
Step 5:
In this phase, we will run a small pilot project to check whether our results are appropriate. We will also look for performance constraints, if any. If the results are inaccurate, we need to replan and rebuild the model.
Step 6:
Once we have executed the project successfully, we will share the output for full deployment.
Being a Data Scientist is easier said than done. So, let’s see all that you need to become a Data Scientist. A Data Scientist requires skills from three major areas, as shown below.
As you can see in the above image, you need to acquire various hard skills and soft skills. You need to be good at statistics and mathematics to analyze and visualize data. Needless to say, Machine Learning forms the heart of Data Science, and you need to be good at it. You also need a solid understanding of the domain you are working in to understand its business problems clearly. Your task does not end here: you should be capable of implementing various algorithms, which requires good coding skills. Finally, once you have made certain key decisions, it is important for you to deliver them to the stakeholders, so good communication will definitely add brownie points to your skillset.
I urge you to watch this Data Science video tutorial, which explains what Data Science is and covers all that we have discussed in this blog. Go ahead, enjoy the video, and tell me what you think.
What Is Data Science? Data Science Course – Data Science Tutorial For Beginners
This Data Science course video will take you through the need for data science, what data science is, data science use cases for business, BI vs data science, data analytics tools, and the data science lifecycle, along with a demo.
In the end, it won’t be wrong to say that the future belongs to Data Scientists. It is predicted that by the end of 2018 there will be a need for around one million Data Scientists. More and more data will provide opportunities to drive key business decisions, and it will soon change how we look at a world deluged with data. A Data Scientist should therefore be highly skilled and motivated to solve the most complex problems. Businesses that incorporate data science methods into their operations can predict their growth in the coming years, anticipate potential problems, and develop data-driven strategies to achieve success. This is the best opportunity to kick off your career in data science by taking the Data Science Masters Program.