Edureka Data Science Ebook
Chapter 1
INTRODUCTION TO
DATA SCIENCE
The term Data Science emerged recently with the evolution of mathematical statistics and data analysis. Data Science, also known as data-driven science, makes use of scientific methods, processes, and systems to extract knowledge or insights from data in its various forms, i.e. structured or unstructured.
Researchers from MIT claim that, with the help of Data Science, we will be able to predict the future in the next few years. They have already reached a milestone in predicting the future with their awesome research in various domains.
01 Is this A or B?
Classification Algorithm
02 Is this weird?
Anomaly Detection Algorithm
These algorithms are based on human psychology. We like being appreciated, right? Computers
implement these algorithms and expect to be appreciated when being trained. How? Let’s see.
Rather than teaching the computer what to do, you let it decide what to do, and at the end of that
action, you give either positive or negative feedback.
It’s just like training your dog. You cannot control what your dog does, right? But you can scold him when he does something wrong and pat him on the back when he does what is expected.
With each piece of feedback, your system learns and becomes more accurate in its next decision.
This type of learning is called Reinforcement Learning.
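To make this feedback loop concrete, here is a minimal sketch in Python (not from the ebook; the actions and rewards are made up): a simple agent repeatedly picks an action, receives positive or negative feedback, and updates its estimate of how good each action is.

```python
import random

# Hypothetical example: an agent chooses between two actions and learns
# from positive (+1) or negative (-1) feedback, like praising or scolding a dog.
actions = ["sit", "jump"]
values = {a: 0.0 for a in actions}   # estimated value of each action
counts = {a: 0 for a in actions}

def feedback(action):
    # Stand-in for the trainer: "sit" is the desired behaviour.
    return 1.0 if action == "sit" else -1.0

for step in range(1000):
    # Explore occasionally, otherwise exploit the best-known action.
    if random.random() < 0.1:
        action = random.choice(actions)
    else:
        action = max(values, key=values.get)
    reward = feedback(action)
    counts[action] += 1
    # Incrementally average the rewards observed for this action.
    values[action] += (reward - values[action]) / counts[action]

print(values)  # the value of "sit" converges toward +1, "jump" toward -1
```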
Chapter 2
1. DISCOVERY
Before you begin the project, it is important to understand the various specifications, requirements,
priorities and required budget. You must possess the ability to ask the right questions. Here, you assess
if you have the required resources present in terms of people, technology, time and data to support the
project. In this phase, you also need to frame the business problem and formulate initial hypotheses (IH)
to test.
2. DATA PREPARATION
In this phase, you require an analytical sandbox in which you can perform analytics for the entire duration of the project. You need to explore, preprocess and condition the data prior to modeling. Further, you will perform ETLT (extract, transform, load and transform) to get the data into the sandbox.
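As a rough illustration of this phase, the pandas sketch below walks through a hypothetical extract-transform-load pass into a sandbox; the file name and columns are invented for the example.

```python
import pandas as pd

# Extract: load raw data into the analytical sandbox (hypothetical file and columns).
raw = pd.read_csv("sales_raw.csv")

# Explore: basic profiling before any modeling.
print(raw.shape)
print(raw.dtypes)
print(raw.isna().sum())

# Transform / condition: handle missing values and derive features.
clean = raw.dropna(subset=["revenue"]).copy()
clean["order_date"] = pd.to_datetime(clean["order_date"])
clean["revenue_per_unit"] = clean["revenue"] / clean["units"]

# Load: persist the conditioned data for the rest of the project.
clean.to_parquet("sandbox/sales_clean.parquet", index=False)
```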
3. MODEL PLANNING
Here, you will determine the methods and techniques to draw out the relationships between variables. These relationships will set the base for the algorithms that you will implement in the next phase. You will apply Exploratory Data Analysis (EDA) using various statistical formulas and visualization tools.
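The snippet below is a minimal, hypothetical EDA pass in Python (assuming the conditioned dataset from the previous phase): summary statistics, pairwise correlations and a quick plot to spot candidate relationships between variables.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical conditioned dataset carried over from the preparation phase.
df = pd.read_parquet("sandbox/sales_clean.parquet")

# Summary statistics and pairwise correlations to identify candidate relationships.
print(df.describe())
print(df.corr(numeric_only=True))

# Simple visual check of one candidate relationship.
df.plot.scatter(x="units", y="revenue")
plt.title("Units sold vs. revenue")
plt.show()
```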
4. MODEL BUILDING
In this phase, you will develop datasets for training and testing purposes. Here you need to consider whether your existing tools will suffice for running the models or whether you will need a more robust environment (like fast, parallel processing). You will analyze various learning techniques like classification, association and clustering to build the model.
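A minimal model building sketch in Python with scikit-learn, using the bundled Iris dataset purely as a stand-in: split the data into training and testing sets, fit a classification model, and evaluate it on held-out data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Split the data into training and testing sets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit a classification model on the training set only.
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

# Evaluate on held-out test data.
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```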
5. OPERATIONALIZE
In this phase, you will deliver final reports, briefings, code and technical documents. Moreover, sometimes a pilot project is also implemented in a real-time production environment. This will provide you with a clear picture of the performance and other related constraints on a small scale before full deployment.
6. COMMUNICATE RESULTS
Now it is important to evaluate whether you have been able to achieve the goal you planned in the first phase. So, in the last phase, you identify all the key findings, communicate them to the stakeholders and determine whether the results of the project are a success or a failure based on the criteria developed in Phase 1.
Chapter 3
INTRODUCTION TO
MACHINE LEARNING
The term Machine Learning was first coined by Arthur Samuel in the year 1959. Looking back, that year
was probably the most significant in terms of technological advancements. If you browse through the
internet about 'What is Machine Learning’, you’ll get at least 100 different definitions. However, the
very first formal definition was:
"
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.
"
Tom M. Mitchell
What is Machine Learning?
In simple terms, Machine learning is a subset of Artificial Intelligence (AI) which provides machines the
ability to learn automatically & improve from experience without being explicitly programmed to do so.
In this sense, it is the practice of getting machines to solve problems by gaining the ability to think.
It enables the computers or the machines to make data-driven decisions rather than being explicitly
programmed for carrying out a certain task. These programs or algorithms are designed in such a way
that they learn and improve over time when exposed to new data.
But wait, can a machine think or make decisions? Well, if you feed the machine a good amount of data, it
will learn how to interpret, process and analyze this data by using Machine Learning Algorithms. A
machine can learn to solve a problem by following any one of the following three approaches.
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
1 Supervised Learning
Supervised learning is a technique in which we teach or train the machine using data that is well
labeled. To understand Supervised Learning, let’s consider an analogy. As kids we all needed guidance
to solve math problems. Our teachers helped us understand what addition is, and how it is done.
Similarly, you can think of Supervised Learning as a type of Machine Learning that involves a guide.
The labeled data set acts as the teacher that trains the model to recognize patterns in the data. Here, the labeled data set is the training data set.
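As a tiny illustration (the fruit measurements and labels here are invented), a labeled training set can "teach" a scikit-learn classifier to predict the label of new data:

```python
from sklearn.neighbors import KNeighborsClassifier

# Tiny hypothetical labeled training set: [weight in g, colour score 0-1] -> fruit label.
X_train = [[150, 0.9], [170, 0.8], [130, 0.3], [140, 0.2]]
y_train = ["apple", "apple", "mango", "mango"]

# The labels act as the "teacher" that guides the model.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# Predict the label of a new, unseen fruit.
print(model.predict([[160, 0.85]]))  # expected: 'apple'
```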
2 Unsupervised Learning
Unsupervised learning involves training by using unlabeled data and allowing the model to act on that
information without guidance. The model learns through observation and finds structures in the data.
Once the model is given a dataset, it automatically detects patterns and relationships in the dataset by creating clusters in it. What it cannot do is add labels to those clusters: it cannot say that this is a group of 'apples' or 'mangoes', but it will separate all the 'apples' from the 'mangoes'. Suppose we present images of 'apples', 'bananas' and 'mangoes' to the model. Based on the patterns and relationships it finds, it creates clusters and divides the dataset into them. If new data is then fed to the model, it adds it to one of the created clusters. A minimal clustering sketch follows the list of algorithms below.
01 Clustering
a. Hierarchical Clustering
b. K-Means Clustering
c. K-NN Clustering
02 Association
a. Apriori Algorithm
b. FP-Growth Algorithm
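Here is the minimal clustering sketch referred to above, using K-Means from scikit-learn on made-up fruit measurements; the model groups similar points without ever seeing a label.

```python
from sklearn.cluster import KMeans

# Unlabeled data: [weight in g, colour score 0-1] for apples, bananas and mangoes (made up).
X = [[150, 0.9], [160, 0.85], [120, 0.1], [118, 0.12], [200, 0.5], [210, 0.55]]

# The model groups similar points without being told what the groups mean.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(labels)                     # cluster index for each fruit, e.g. [0 0 1 1 2 2]

# A new fruit is assigned to the nearest existing cluster.
print(kmeans.predict([[155, 0.88]]))
```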
3 Reinforcement Learning
Reinforcement Learning, described in Chapter 1, trains a model through positive or negative feedback on its actions. The reinforcement can be of two types:
01 Positive
02 Negative
Chapter 4
BEST LANGUAGES
Among the languages used for Data Science, R is considered to be the best programming language for any
possesses an extensive catalog of statistical and graphical methods. Python, on the other hand, can do
pretty much the same work as R but it is preferred by data scientists or data analysts because of its
simplicity and high performance. R is a powerful scripting language and highly flexible with a vibrant
community and resource bank whereas Python is a widely used object-oriented language that is easy to
learn and debug.
2. LEARNING CURVE
With most programming languages, the learning curve tends to be parabolic: the language is hard to grasp early on, but learning becomes easier as you grow familiar with it. Python, in contrast, is easy to pick up from the start because of its simple syntax and concise code.
4. DATA VISUALIZATION
Provides a comprehensive set of functionalities for data visualization that helps in understanding a dataset
and the relationship between various variables.
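As a small, hypothetical example of this on the Python side, a few lines of matplotlib are enough to turn a dataset into a chart (the data here is invented):

```python
import matplotlib.pyplot as plt

# Hypothetical data: monthly revenue figures.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
revenue = [120, 135, 128, 150, 170, 165]

plt.plot(months, revenue, marker="o")
plt.xlabel("Month")
plt.ylabel("Revenue (in thousands)")
plt.title("Monthly revenue trend")
plt.show()
```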
Chapter 6
1. Data Storage
2. Exploratory Data Analysis
3. Data Modelling
4. Data Visualization
APACHE HADOOP
Apache Hadoop is a free, open-source framework that can manage and
store tons and tons of data. It provides distributed computing of massive
data sets over a cluster of 1000s of computers. It is used for high-level
computations and data processing.
AZURE HDINSIGHT
Azure HDInsight is a cloud platform provided by Microsoft for the purpose
of data storage, processing, and analytics. Enterprises such as Adobe, Jet,
and Milliman use Azure HDInsight to process and manage massive
amounts of data.
RAPIDMINER
There is no surprise that RapidMiner is one of the most popular tools for implementing
Data Science. RapidMiner was named a Leader in the 2017 Gartner Magic Quadrant for
Data Science Platforms and in the Forrester Wave for predictive analytics and
ML.
DATAROBOT
DataRobot is an AI-driven automation platform that aids in developing accurate predictive
models. DataRobot makes it easy to implement a wide range of Machine Learning algorithms,
including Clustering, Classification and Regression Models.
QLIKVIEW
QlikView is another Data Visualization tool that is used by more than 24,000
organizations worldwide. It is one of the most effective visualization platforms for
visually analyzing data to derive useful business insights.
Chapter 7
BEST FRAMEWORKS
7.1 TensorFlow
7.3 PyTorch
FEATURES
1. NATIVE SUPPORT
Native support for Python and use of its libraries.
2. USED IN FACEBOOK
Actively used in the development of Facebook for all of its Deep Learning requirements on the platform.
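A minimal sketch of what that native Python support looks like in practice: PyTorch tensors are created and manipulated in ordinary Python code, with gradients computed automatically.

```python
import torch

# Tensors behave like NumPy arrays but support automatic differentiation.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()

# Backpropagate to get dy/dx = 2x, all from ordinary Python code.
y.backward()
print(x.grad)  # tensor([2., 4., 6.])
```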
FEATURIZATION
Includes feature extraction, transformation, dimensionality reduction and selection.
PERSISTENCE
Persistence helps in saving and loading algorithms, models and Pipelines.
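To show how featurization and persistence fit together, here is a hedged PySpark sketch (the data and save path are hypothetical): raw columns are assembled into a feature vector, a pipeline is trained, then saved and reloaded.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Hypothetical training data with two features and a binary label.
df = spark.createDataFrame(
    [(1.0, 0.2, 0), (0.3, 0.9, 1), (0.8, 0.1, 0), (0.1, 0.7, 1)],
    ["f1", "f2", "label"],
)

# Featurization: assemble the raw columns into a single feature vector.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

# Fit the whole pipeline, then persist and reload it.
model = Pipeline(stages=[assembler, lr]).fit(df)
model.write().overwrite().save("/tmp/demo_pipeline")
reloaded = PipelineModel.load("/tmp/demo_pipeline")
```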
MLlib Algorithms
Chapter 8
FREQUENTLY
ASKED
INTERVIEW
QUESTIONS
As per Glassdoor and LinkedIn, Data Scientists have consistently been ranked at number 1 because of the high demand for them in the business and tech industries. This chapter covers questions that will help you in your Data Science interviews and open up various career opportunities for a Data Science aspirant.
1. What is Data Science? How would you say it is similar to or different from Business Analytics and Business Intelligence?
2. Which package is used to do data import in R and Python? How do you do data import in SAS?
3. How do you build a custom function in Python or R?
4. What is an RDBMS? Name some examples of RDBMS. What is CRUD?
5. Define a SQL query. What is the difference between a SELECT and an UPDATE query? How do you use SQL in SAS, Python and R?
6. What is an API? What are APIs used for?
7. What is NoSQL? Name some examples of NoSQL databases. What is a key-value store? What is column storage? What is a document database?
8. What is a data warehouse?
9. What is JSON and what is XML?
10. Name some kinds of graphs and explain how you would build them in Python or R.
11. How do you check for data quality?
12. What is an outlier? How do you treat outlier data?
13. What is missing value imputation? How do you handle missing values in Python or R?
14. Why do you need a for loop? How do you write for loops in Python and R?
15. What is the advantage of using the apply family of functions in R? How do you use lambda in Python?
16. What packages are used for data mining in Python and R?
17. What is Machine Learning? What is the difference between Supervised and Unsupervised methods?
18. What are random forests and how are they different from decision trees?
19. What are logistic and linear regression? Name some packages in R and Python for building regression models.
20. What is linear optimization? Where is it used? What is the travelling salesman problem?
21. What are CART and CHAID? How is bagging different from boosting?
22. What are the Z test, Chi-Square test, F test and T test?
23. What are Entropy and Information Gain in the decision tree algorithm?
Data Engineers
Data Analyst
www.edureka.co/data-science-python-certification-course
www.edureka.co/masters-program/data-scientist-certification
www.edureka.co/machine-learning-certification-training
www.edureka.co/ai-deep-learning-with-tensorflow
2500+ Technical Blogs
3000+ Video Tutorials on YouTube
30+ Active Free Monthly Community Webinars
WWW.EDUREKA.CO/DATA-SCIENCE
About Us
There are countless online education marketplaces on the internet. And there’s us. We
are not the biggest. We are not the cheapest. But we are the fastest growing. We have
the highest course completion rate in the industry. We aim to become the largest
online learning ecosystem for continuing education, in partnership with corporates
and academia. To achieve that we remain ridiculously committed to our students. Be it
constant reminders, relentless masters or 24 x 7 online technical support - we will
absolutely make sure that you run out of excuses to not complete the course.