SYAI Sem 3 - DS Unit 2

Course Learning Objective

The objectives of this course are:


● To impart to students fundamental knowledge of Data Science tools and programming to
identify patterns in data, analyze the data, and derive insights using a variety of statistical
techniques and languages.
● To prepare individuals for roles where they can effectively leverage data to make informed
decisions, solve complex problems, and contribute meaningfully to their respective fields.

Course Learning Outcomes


After completing this course, the learner will be able to:
1. Describe the process of Data Science.
2. Compare Data Science with related fields such as Data Analytics and Machine Learning.
3. Explain the Big Data life cycle.

Modules [30L]

Module 1: Data Science [10L]
1.1 Data Science, need of data science, pillars of Data Science
1.2 Drew Conway's Venn diagram of data science
1.3 Data Science job roles and responsibilities
1.4 Difference between Data Scientist, Data Analyst, and Data Engineer
1.5 Data Science processes
1.6 Applications of Data Science
1.7 Tools for Data Science

Module 2: Data in Data Science and Comparison of Data Science [10L]
2.1 Facets of data, what is data, big data
2.2 Process of Data Science
2.3 Lifecycle of Data Science
2.4 What is Data Analytics?
2.5 Difference between Data Science and Data Analytics
2.6 What is Machine Learning?
2.7 Drew Conway's Venn diagram of Data Science
2.8 Difference between Data Science and Machine Learning

Module 3: Data Preparation and Analysis [10L]
3.1 Big Data Analytics: data life cycle, traditional data mining life cycle
3.2 CRISP-DM methodology, SEMMA methodology
3.3 Big Data life cycle
3.4 Data formats: primary and secondary, qualitative and quantitative

Reference textbooks:
1. Foundations of Data Science by Avrim Blum
2. Data Science for Dummies by Aviral Sharma and Chaitan Baru
3. Introduction to Data Science: Practical Approach with R and Python by B. Uma Maheswari and R. Sujatha
4. Data Science Fundamentals and Practical Approaches: Understand Why Data Science Is the Next (English Edition, 1 January 2020) by Dr Gypsy Anand and Dr Rupam Sharma

Additional reference websites:
1. https://www.geeksforgeeks.org/data-science-fundamentals/
2. https://builtin.com/data-science
Unit 2: Data in Data Science and Comparison of Data Science

2.1 Facets of Data, What is Data, Big Data
Big data is a voluminous set of structured, unstructured, and semi-structured datasets that is
challenging to manage using traditional data processing tools. It requires additional infrastructure
to govern, analyze, and convert into insights.
Facets of Data

● A very large amount of data is generated in big data and data science. This data comes in
various types, and the main categories of data are as follows:
● a) Structured.
● b) Natural language.
● c) Graph-based.
● d) Streaming.
● e) Unstructured.
● f) Machine-generated.
● g) Audio, video and images.

Difference between Traditional data and Big data


1. Traditional data:
Traditional data is the structured data that is maintained by all types of businesses, from very
small firms to large organizations. In a traditional database system, a centralized database
architecture is used to store and maintain the data in a fixed format or in fields in a file.
Structured Query Language (SQL) is used for managing and accessing the data.
● Traditional data is characterized by its high level of organization and structure, which
makes it easy to store, manage, and analyze.
● Traditional data analysis techniques involve using statistical methods and visualizations
to identify patterns and trends in the data.
● Traditional data is often collected and managed by enterprise resource planning (ERP)
systems and other enterprise-level applications.
● This data is critical for businesses to make informed decisions and drive performance
improvements.
2. Big data:
Big data is an extension of traditional data.
Big data deals with data sets that are too large or complex to manage with traditional
data-processing application software.
It involves large volumes of structured, semi-structured, and unstructured data.
Volume, Velocity, Variety, Veracity (the quality of being true or correct), and Value are the
five V characteristics (5 Vs) of big data.
Big data refers not only to a large amount of data but also to extracting meaningful insights by
analyzing huge, complex data sets.
At its core, big data is characterized by the three Vs: volume, velocity, and variety.
Volume refers to the vast amount of data that is generated and collected;
velocity refers to the speed at which data is generated and must be processed; and
variety refers to the many different types and formats of data that must be analyzed, including
structured, semi-structured, and unstructured data.
Due to the size and complexity of big data sets, traditional data management tools and techniques
are often inadequate for processing and analyzing the data.
Big data technologies, such as Hadoop, Spark, and NoSQL databases, have emerged to help
organizations store, manage, and analyze large volumes of data.
The main differences between traditional data and big data are as follows:
● Volume: Traditional data typically refers to small to medium-sized datasets that can
be easily stored and analyzed using traditional data processing technologies. In
contrast, big data refers to extremely large datasets that cannot be easily managed or
processed using traditional technologies.
● Variety: Traditional data is typically structured, meaning it is organized in a
predefined manner such as tables, columns, and rows. Big data, on the other hand,
can be structured, unstructured, or semi-structured, meaning it may contain text,
images, videos, or other types of data.
● Velocity: Traditional data is usually static and updated on a periodic basis. In contrast,
big data is constantly changing and updated in real-time or near real-time.
● Complexity: Traditional data is relatively simple to manage and analyze. Big data, on
the other hand, is complex and requires specialized tools and techniques to manage,
process, and analyze.
● Value: Traditional data typically has a lower potential value than big data because it is
limited in scope and size. Big data, on the other hand, can provide valuable insights
into customer behavior, market trends, and other business-critical information.

There are also some similarities between them, including:


● Data Quality: The quality of data is essential in both traditional and big data
environments. Accurate and reliable data is necessary for making informed business
decisions.
● Data Analysis: Both traditional and big data require some form of analysis to derive
insights and knowledge from the data. Traditional data analysis methods typically
involve statistical techniques and visualizations, while big data analysis may require
machine learning and other advanced techniques.
● Data Storage: In both traditional and big data environments, data needs to be stored
and managed effectively. Traditional data is typically stored in relational databases,
while big data may require specialized technologies such as Hadoop, NoSQL, or
cloud-based storage systems.
● Data Security: Data security is a critical consideration in both traditional and big data
environments. Protecting sensitive information from unauthorized access, theft, or
misuse is essential in both contexts.
● Business Value: Both traditional and big data can provide significant value to
organizations. Traditional data can provide insights into historical trends and patterns,
while big data can uncover new opportunities and help organizations make more
informed decisions.

The differences between traditional data and big data are as follows:

| Traditional Data | Big Data |
| --- | --- |
| Traditional data is generated at the enterprise level. | Big data is generated outside the enterprise level. |
| Its volume ranges from gigabytes to terabytes. | Its volume ranges from petabytes to zettabytes or exabytes. |
| A traditional database system deals with structured data. | A big data system deals with structured, semi-structured, and unstructured data. |
| Traditional data is generated per hour, per day, or less frequently. | Big data is generated much more frequently, mainly per second. |
| The data source is centralized and managed in a centralized form. | The data source is distributed and managed in a distributed form. |
| Data integration is very easy. | Data integration is very difficult. |
| A normal system configuration is capable of processing traditional data. | A high-end system configuration is required to process big data. |
| The size of the data is very small. | The size is much larger than traditional data. |
| Traditional database tools are required to perform any database operation. | Special database tools are required to perform any schema-based operation. |
| Normal functions can manipulate the data. | Special functions are needed to manipulate the data. |
| Its data model is strict-schema based and static. | Its data model is flat-schema based and dynamic. |
| Traditional data is stable, with known inter-relationships. | Big data is not stable, and relationships are unknown. |
| Traditional data is in a manageable volume. | Big data is in a huge volume which becomes unmanageable. |
| It is easy to manage and manipulate the data. | It is difficult to manage and manipulate the data. |
| Its data sources include ERP transaction data, CRM transaction data, financial data, organizational data, web transaction data, etc. | Its data sources include social media, device data, sensor data, video, images, audio, etc. |

2.2 Process of Data Science


Data can prove to be very fruitful if we know how to manipulate it to extract hidden patterns.
This logic behind the data, or the process behind its manipulation, is what is known as
Data Science. The Data Science process runs from formulating the problem statement and
collecting data to extracting the required results from them, and the professional who ensures
that the whole process runs smoothly is known as a Data Scientist. There are other job roles in
this domain as well, such as:
1. Data Engineers
2. Data Analysts
3. Data Architect
4. Machine Learning Engineer
5. Deep Learning Engineer

Data Science Process Life Cycle

There are some steps that are necessary for any task in the field of data science in order to
derive fruitful results from the data at hand.
● Data Collection – After formulating the problem statement, the main task is to
collect data that can help us in our analysis and manipulation. Sometimes data is
collected by performing some kind of survey, and at other times it is done by web
scraping.
● Data Cleaning – Most real-world data is not structured and requires cleaning
and conversion into structured data before it can be used for any analysis or
modeling.
● Exploratory Data Analysis – This is the step in which we try to find the hidden
patterns in the data at hand. We also analyze the different factors that affect the
target variable and the extent to which they do so. How the independent features are
related to each other, and what can be done to achieve the desired results, can also be
answered in this step. This gives us a direction in which to work when we start the
modeling process.
● Model Building – Different types of machine learning algorithms and techniques
have been developed that can easily identify complex patterns in the data, which
would be a very tedious task for a human.
● Model Deployment – After a model is developed and gives good results on the
holdout or real-world dataset, we deploy it and monitor its performance. This is the
part where we apply what we have learned from the data to real-world applications
and use cases.
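
As a rough illustration of these steps, the sketch below runs a toy version of the life cycle in Python with scikit-learn. The synthetic dataset, the logistic regression model, and the 80/20 split are illustrative assumptions, not part of the syllabus.

```python
# A minimal, end-to-end sketch of the data science life cycle on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# "Data Collection": here we simply generate a toy dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# "Data Cleaning" would normally happen here (missing values, duplicates, etc.).

# Hold out part of the data so the model can later be judged on unseen examples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# "Model Building": fit a simple classifier.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# "Model Deployment" starts with checking holdout performance before going live.
print("Holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))
```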



Components of Data Science Process
Data Science is a very vast field, and to get the best out of the data at hand, one has to apply
multiple methodologies and use different tools to make sure the integrity of the data remains
intact throughout the process, while keeping data privacy in mind. Machine Learning and data
analysis are the parts where we focus on the results that can be extracted from the data at hand,
while data engineering is the part whose main task is to ensure that the data is managed properly
and that proper data pipelines are created for smooth data flow. The main components of Data
Science are:
● Data Analysis – There are times when there is no need to apply advanced deep
learning or complex methods to the data at hand to derive patterns from it. Before
moving on to the modeling part, we therefore first perform an exploratory data
analysis to get a basic idea of the data and the patterns present in it; this gives us a
direction to work in if we later want to apply more complex analysis methods to our
data.
● Statistics – It is a natural phenomenon that many real-life datasets follow a normal
distribution, and when we already know that a particular dataset follows some known
distribution, most of its properties can be analyzed at once. Descriptive statistics,
along with correlations and covariances between two features of the dataset, help us
get a better understanding of how one factor is related to another in our dataset (a
short sketch follows this list).
● Data Engineering – When we deal with a large amount of data, we have to make
sure that the data is kept safe from online threats and that it is easy to retrieve and
modify. Data Engineers play a crucial role in ensuring that the data is used
efficiently.
● Advanced Computing
○ Machine Learning – Machine Learning has opened new horizons
that have helped us build advanced applications and methodologies,
so that machines become more efficient, provide a personalized
experience to each individual, and perform in an instant tasks that
earlier required heavy human labor and time.
○ Deep Learning – This is also a part of Artificial Intelligence and
Machine Learning, but it is a bit more advanced than machine
learning itself. High computing power and huge corpora of data
have led to the emergence of this field in data science.
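
A minimal sketch of the statistics component mentioned above, using pandas: descriptive statistics plus the correlation and covariance between two features. The tiny DataFrame and its column names are hypothetical.

```python
# Descriptive statistics, correlation and covariance between two features.
import pandas as pd

df = pd.DataFrame({
    "ad_spend": [10, 20, 30, 40, 50],   # hypothetical feature 1
    "sales":    [12, 24, 33, 48, 55],   # hypothetical feature 2
})

print(df.describe())                       # count, mean, std, min, quartiles, max per column
print(df["ad_spend"].corr(df["sales"]))    # Pearson correlation between the two features
print(df.cov())                            # covariance matrix
```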

2.3 Lifecycle of Data Science


1. Business Understanding: The complete cycle revolves around the business goal; there is
nothing to solve if you do not have a specific problem. It is extremely important to understand
the business objective clearly, because that will be the ultimate aim of the analysis. Only after a
proper understanding can we set a specific goal of analysis that is in sync with the business
objective. You need to understand, for example, whether the client wants to minimize losses or
predict the price of a commodity.
2. Data Understanding: After business understanding, the next step is data understanding. This
involves collecting all the available data. Here you need to work closely with the business team,
as they know what data is present, what data could be used for this business problem, and other
relevant information. This step includes describing the data, its structure, its relevance, and its
data types. Explore the data using graphical plots; essentially, extract whatever information you
can about the data by simply exploring it.
3. Preparation of Data: Next comes the data preparation stage. This consists of steps like
selecting the relevant data, integrating the data by merging data sets, cleaning it, treating
missing values by either removing or imputing them, removing inaccurate data, and checking
for outliers using box plots and handling them. It also includes constructing new data and
deriving new features from existing ones, formatting the data into the preferred structure, and
removing unwanted columns and features. Data preparation is the most time-consuming yet
arguably the most important step in the complete life cycle. Your model will only be as good as
your data.
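
A minimal data-preparation sketch in pandas, covering a few of the operations described above (merging sources, imputing a missing value, deriving a new feature, and dropping an unwanted column). The two toy tables and their column names are invented for illustration.

```python
import pandas as pd
import numpy as np

# Two hypothetical source tables.
orders = pd.DataFrame({"customer_id": [1, 2, 3], "amount": [250.0, np.nan, 400.0]})
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "region": ["North", "South", "East"],
                          "legacy_code": ["A1", "B2", "C3"]})

data = orders.merge(customers, on="customer_id")               # integrate the data sets
data["amount"] = data["amount"].fillna(data["amount"].mean())  # impute the missing value
data["amount_k"] = data["amount"] / 1000                       # derive a new feature
data = data.drop(columns=["legacy_code"])                      # remove an unwanted column
print(data)
```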
4. Exploratory Data Analysis: This step involves getting some idea about the solution and the
factors affecting it before building the actual model. The distribution of data within individual
variables is explored graphically using bar graphs, while relations between different features are
captured through graphical representations like scatter plots and heat maps. Many data
visualization techniques are used extensively to explore each feature individually and in
combination with other features.
5. Data Modeling: Data modeling is the heart of data analysis. A model takes the prepared data
as input and gives the desired output. This step includes choosing the appropriate kind of model,
depending on whether the problem is a classification, regression, or clustering problem. After
choosing the model family, we need to carefully select and implement the specific algorithms
within that family. We need to tune the hyperparameters of each model to achieve the desired
performance. We also need to make sure there is the right balance between performance and
generalizability: we do not want the model to memorize the data and perform poorly on new
data.
6. Model Evaluation: Here the model is evaluated to check whether it is ready to be deployed.
The model is tested on unseen data and evaluated on a carefully chosen set of evaluation
metrics. We also need to make sure that the model conforms to reality. If we do not get a
satisfactory result in the evaluation, we have to iterate over the whole modeling process again
until the desired level of metrics is achieved. Any data science solution or machine learning
model, just like a human, must evolve: it must be able to improve itself with new data and adapt
to a new evaluation metric. We can build multiple models for a given phenomenon, but many of
them may be imperfect; model evaluation helps us choose and build the best one.
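
A minimal sketch of step 6: evaluating a fitted model on held-out, unseen data with several metrics rather than a single number. The synthetic dataset and the decision-tree model are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy binary-classification data with a 75/25 train/test split.
X, y = make_classification(n_samples=400, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pred = model.predict(X_test)

# Evaluate on a carefully chosen set of metrics, not just one number.
print("accuracy :", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred))
print("recall   :", recall_score(y_test, pred))
print("f1       :", f1_score(y_test, pred))
```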
7. Model Deployment: After rigorous evaluation, the model is finally deployed in the desired
format and channel. This is the last step in the data science life cycle. Each step described
above must be worked on carefully; if any step is performed improperly, it affects the
subsequent steps and the whole effort goes to waste. For example, if data is not collected
properly, you will lose information and will not be able to build an accurate model. If the data is
not cleaned properly, the model will not work. If the model is not evaluated properly, it will fail
in the real world. Right from business understanding to model deployment, every step has to be
given appropriate attention, time, and effort.
2.4 What is Data Analytics?


Data analytics, also known as data analysis, is a crucial component of modern business
operations. It involves examining datasets to uncover useful information that can be used to
make informed decisions. This process is used across industries to optimize performance,
improve decision-making, and gain a competitive edge.
Data analytics can be done in the following steps, which are described below:
1. Data Collection:

Data collection is the process of gathering and analyzing accurate data from various sources to
find answers to research problems, identify trends and probabilities, and evaluate possible
outcomes.
Knowledge is power, information is knowledge, and data is information in digitized form, at
least as defined in IT. Hence, data is power. But before you can leverage that data into a
successful strategy for your organization or business, you need to gather it. That’s your first step.
What is Data Collection?
Data collection is the process of collecting and evaluating information or data from multiple
sources to find answers to research problems, answer questions, evaluate outcomes, and forecast
trends and probabilities.

It is an essential phase in all types of research, analysis, and decision-making, including that
done in the social sciences, business, and healthcare.

Accurate data collection is necessary to make informed business decisions, ensure quality
assurance, and keep research integrity.

During data collection, the researchers must identify the data types, the sources of data, and what
methods are being used.

Before an analyst begins collecting data, they must answer three questions first:

● What’s the goal or purpose of this research?


● What kinds of data are they planning on gathering?
● What methods and procedures will be used to collect, store, and process the
information?

Additionally, we can break up data into qualitative and quantitative types.

Qualitative data covers descriptions such as color, size, quality, and appearance. Quantitative
data, unsurprisingly, deals with numbers, such as statistics, poll numbers, percentages, etc.

You need data collection to help you make better choices.


What Are the Different Data Collection Methods?
Primary and secondary methods of data collection are two approaches used to gather information
for research or analysis purposes.
1. Primary Data Collection:

Primary data collection involves the collection of original data directly from the source or
through direct interaction with the respondents. This method allows researchers to obtain
firsthand information specifically tailored to their research objectives. There are various
techniques for primary data collection, including:

a. Surveys and Questionnaires: Researchers design structured questionnaires or surveys to collect


data from individuals or groups. These can be conducted through face-to-face interviews,
telephone calls, mail, or online platforms.

b. Interviews: Interviews involve direct interaction between the researcher and the respondent.
They can be conducted in person, over the phone, or through video conferencing. Interviews can
be structured (with predefined questions), semi-structured (allowing flexibility), or unstructured
(more conversational).

c. Observations: Researchers observe and record behaviors, actions, or events in their natural
setting. This method is useful for gathering data on human behavior, interactions, or phenomena
without direct intervention.

d. Experiments: Experimental studies involve the manipulation of variables to observe their


impact on the outcome. Researchers control the conditions and collect data to draw conclusions
about cause-and-effect relationships.

e. Focus Groups: Focus groups bring together a small group of individuals who discuss specific
topics in a moderated setting. This method helps in understanding opinions, perceptions, and
experiences shared by the participants.

2. Secondary Data Collection:


Secondary data collection involves using existing data collected by someone else for a purpose
different from the original intent. Researchers analyze and interpret this data to extract relevant
information. Secondary data can be obtained from various sources, including:

a. Published Sources: Researchers refer to books, academic journals, magazines, newspapers,


government reports, and other published materials that contain relevant data.

b. Online Databases: Numerous online databases provide access to a wide range of secondary
data, such as research articles, statistical information, economic data, and social surveys.

c. Government and Institutional Records: Government agencies, research institutions, and


organizations often maintain databases or records that can be used for research purposes.

d. Publicly Available Data: Data shared by individuals, organizations, or communities on public


platforms, websites, or social media can be accessed and utilized for research.

e. Past Research Studies: Previous research studies and their findings can serve as valuable
secondary data sources. Researchers can review and analyze the data to gain insights or build
upon existing knowledge.

Data Collection Tools

● Word Association

The researcher gives the respondent a set of words and asks them what comes to mind when they
hear each word.

● Sentence Completion
Researchers use sentence completion to understand what kind of ideas the respondent has. This
tool involves giving an incomplete sentence and seeing how the interviewee finishes it.

● Role-Playing

Respondents are presented with an imaginary situation and asked how they would act or react if
it was real.

● In-Person Surveys

The researcher asks questions in person.

● Online/Web Surveys

These surveys are easy to accomplish, but some users may be unwilling to answer truthfully, if at
all.

● Mobile Surveys

These surveys take advantage of the increasing proliferation of mobile technology. Mobile
collection surveys rely on mobile devices like tablets or smartphones to conduct surveys via
SMS or mobile apps.

● Phone Surveys

No researcher can call thousands of people at once, so they need a third party to handle the
chore. However, many people have call screening and won’t answer.

● Observation
Sometimes, the simplest method is the best. Researchers who make direct observations collect
data quickly and easily, with little intrusion or third-party bias. Naturally, it’s only effective in
small-scale situations.

The Importance of Ensuring Accurate and Appropriate Data Collection
The effects of data collection done incorrectly include the following:

● Erroneous conclusions that squander resources


● Decisions that compromise public policy
● Incapacity to correctly respond to research inquiries
● Bringing harm to participants who are humans or animals
● Deceiving other researchers into pursuing futile research avenues
● The study's inability to be replicated and validated

Issues Related to Maintaining the Integrity of Data Collection

Each strategy is used at various stages of the research timeline:

● Quality control - tasks that are performed both during and after data collection
● Quality assurance - activities that happen before data gathering starts

What are Common Challenges in Data Collection?

● Data Quality Issues


● Inconsistent Data
● Data Downtime
● Ambiguous Data
● Duplicate Data
● Too Much Data
● Inaccurate Data
● Finding Relevant Data

What are the Key Steps in the Data Collection Process?

1. Decide What Data You Want to Gather

2. Establish a Deadline for Data Collection

3. Select a Data Collection Approach

4. Gather Information

5. Examine the Information and Apply Your Findings

2. Data Cleansing: After collecting the data, the next step is to improve its quality, as the
collected data usually has many quality problems such as errors, duplicate entries, and
white space, which need to be corrected before moving to the next step. These errors can
be corrected by running data profiling and data cleansing tasks. The data is then
organised by the analysts according to the needs of the analytical model.

What is data cleaning?

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted,
duplicate, or incomplete data within a dataset.

When combining multiple data sources, there are many opportunities for data to be duplicated or
mislabeled. If data is incorrect, outcomes and algorithms are unreliable, even though they may
look correct.

There is no one absolute way to prescribe the exact steps in the data cleaning process because the
processes will vary from dataset to dataset. But it is crucial to establish a template for your data
cleaning process so you know you are doing it the right way every time.

What is the difference between data cleaning and data transformation?


Data cleaning is the process that removes data that does not belong in your dataset.

Data transformation is the process of converting data from one format or structure into another.
Transformation processes can also be referred to as data wrangling, or data munging,
transforming and mapping data from one "raw" data form into another format for warehousing
and analyzing.

While the techniques used for data cleaning may vary according to the types of data your
company stores, you can follow these basic steps to map out a framework for your organization.

Step 1: Remove duplicate or irrelevant observations

Remove unwanted observations from your dataset, including duplicate observations or irrelevant
observations. Duplicate observations will happen most often during data collection. When you
combine data sets from multiple places, scrape data, or receive data from clients or multiple
departments, there are opportunities to create duplicate data. De-duplication is one of the largest
areas to be considered in this process. Irrelevant observations are when you notice observations
that do not fit into the specific problem you are trying to analyze. For example, if you want to
analyze data regarding millennial customers, but your dataset includes older generations, you
might remove those irrelevant observations. This can make analysis more efficient and minimize
distraction from your primary target—as well as creating a more manageable and more
performant dataset.
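
As a rough sketch of this step, the pandas snippet below drops duplicate rows and filters out observations that fall outside the segment of interest (here, a made-up millennial birth-year range). The column names and cut-offs are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "customer": ["A", "A", "B", "C"],          # "A" appears twice: a duplicate observation
    "birth_year": [1990, 1990, 1975, 1995],
})

df = df.drop_duplicates()                       # remove duplicate observations
df = df[df["birth_year"].between(1981, 1996)]   # keep only the target segment (assumed range)
print(df)
```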

Step 2: Fix structural errors

Structural errors are when you measure or transfer data and notice strange naming conventions,
typos, or incorrect capitalization. These inconsistencies can cause mislabeled categories or
classes. For example, you may find “N/A” and “Not Applicable” both appear, but they should be
analyzed as the same category.
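
A small sketch of fixing structural errors with pandas: trimming whitespace, normalising capitalisation, and mapping "N/A" onto the same category as "Not Applicable". The example values are invented.

```python
import pandas as pd

df = pd.DataFrame({"status": ["Paid", "paid", "N/A", "Not Applicable"]})

# Normalise the inconsistent labels so they are analyzed as the same categories.
df["status"] = (df["status"]
                .str.strip()
                .str.lower()
                .replace({"n/a": "not applicable"}))
print(df["status"].value_counts())
```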

Step 3: Filter unwanted outliers

Often, there will be one-off observations where, at a glance, they do not appear to fit within the
data you are analyzing. If you have a legitimate reason to remove an outlier, like improper
data-entry, doing so will help the performance of the data you are working with. However,
sometimes it is the appearance of an outlier that will prove a theory you are working on.
Remember: just because an outlier exists, doesn’t mean it is incorrect. This step is needed to
determine the validity of that number. If an outlier proves to be irrelevant for analysis or is a
mistake, consider removing it.
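
One common way to flag candidate outliers is the interquartile-range (IQR) rule; the sketch below applies it to a made-up series before any decision is taken on whether the flagged values are errors or genuine observations. The rule and the data are illustrative assumptions, not a prescription from the text.

```python
import pandas as pd

values = pd.Series([12, 14, 15, 13, 16, 14, 95])   # 95 looks like a possible outlier

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
mask = values.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # inside the IQR "fences"

print("suspected outliers:", values[~mask].tolist())
print("remaining data:", values[mask].tolist())
```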

Step 4: Handle missing data


You can’t ignore missing data because many algorithms will not accept missing values. There are
a couple of ways to deal with missing data. Neither is optimal, but both can be considered.

1. As a first option, you can drop observations that have missing values, but doing this will
drop or lose information, so be mindful of this before you remove it.
2. As a second option, you can input missing values based on other observations; again,
there is an opportunity to lose integrity of the data because you may be operating from
assumptions and not actual observations.
3. As a third option, you might alter the way the data is used to effectively navigate null
values.
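
The sketch below illustrates the three options listed above on a hypothetical column with missing values: dropping the rows, imputing from other observations, and keeping an explicit missing-value flag.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"income": [50_000, np.nan, 62_000, np.nan, 58_000]})

dropped = df.dropna()                                     # option 1: drop observations
imputed = df.fillna({"income": df["income"].median()})    # option 2: impute from other rows
flagged = df.assign(income_missing=df["income"].isna())   # option 3: make missingness explicit

print(dropped, imputed, flagged, sep="\n\n")
```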

Step 5: Validate and QA

At the end of the data cleaning process, you should be able to answer these questions as a part of
basic validation:

● Does the data make sense?


● Does the data follow the appropriate rules for its field?
● Does it prove or disprove your working theory, or bring any insight to light?
● Can you find trends in the data to help you form your next theory?
● If not, is that because of a data quality issue?

False conclusions because of incorrect or “dirty” data can inform poor business strategy and
decision-making. False conclusions can lead to an embarrassing moment in a reporting meeting
when you realize your data doesn’t stand up to scrutiny. Before you get there, it is important to
create a culture of quality data in your organization. To do this, you should document the tools
you might use to create this culture and what data quality means to you.

3. Data Analysis and Data Interpretation: Analytical models are created using software
and other tools which interpret the data and help us understand it. The tools include
Python, Excel, R, Scala and SQL. The model is tested repeatedly until it works as
required; then, in production mode, the data set is run against the model.
4. Data Visualisation: Data visualisation is the process of creating visual representations
of data using plots, charts and graphs, which helps to analyse patterns and trends and
to get valuable insights from the data. By comparing and analysing the datasets, data
analysts extract useful information from the raw data.

Types of Data Analytics


There are different types of data analysis in which raw data is converted into valuable insights.
Some of the types of data analysis are mentioned below:

1. Descriptive Data Analytics: Descriptive data analytics is a type of data analysis that
summarises the data set; it is used to compare past results, differentiate between
weaknesses and strengths, and identify anomalies. Companies use descriptive data
analysis to identify problems in the data set, as it helps in recognising patterns.
2. Real-time Data Analytics: Real-time data analytics does not use data from past
events. It is a type of data analysis that uses data as soon as it is entered in the
database. Companies use this type of analysis to identify trends and track
competitors' operations.
3. Diagnostic Data Analytics: Diagnostic data analytics uses past data sets to analyse
the cause of an anomaly. Some of the techniques used in diagnostic analysis are
correlation analysis, regression analysis, and analysis of variance. The results
provided by diagnostic analysis help companies give accurate solutions to the
problems.
4. Predictive Data Analytics: This type of analytics is done on current data to predict
future outcomes. To build predictive models, it uses machine learning algorithms and
statistical modelling techniques to identify trends and patterns. Predictive data
analysis is also used in sales forecasting, risk estimation, and predicting customer
behaviour.
5. Prescriptive Data Analytics: Prescriptive data analytics is the analysis of selecting
the best solutions to problems. This type of data analysis is used in loan approval,
pricing models, machine repair scheduling, analysing decisions, and so on. Companies
use prescriptive data analysis to automate decision making.

Methods of Data Analytics


There are two types of methods in data analysis which are mentioned below:
1. Qualitative Data Analytics

Qualitative data analysis doesn’t use statistics and derives data from the words, pictures and
symbols. Some common qualitative methods are:
● Narrative Analytics is used for working with data acquired from diaries, interviews
and so on.
● Content Analytics is used for Analytics of verbal data and behaviour.
● Grounded theory is used to explain some given event by studying.

2. Quantitative Data Analysis

Quantitative data analytics is used to collect data and then process it into numerical data. Some
of the quantitative methods are mentioned below (a short sketch follows this list):
● Hypothesis testing assesses a given hypothesis about the data set.
● Sample size determination is the method of taking a small sample from a large group
of people and then analysing it.
● The average or mean of a data set is obtained by dividing the sum of the numbers in
the list by the number of items in the list.
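
A minimal sketch of two of these quantitative methods in Python: computing a mean and running a two-sample t-test with SciPy. The two small samples are hypothetical.

```python
from statistics import mean
from scipy import stats

group_a = [23, 25, 27, 24, 26, 28]   # hypothetical sample A
group_b = [30, 29, 31, 32, 28, 33]   # hypothetical sample B

print("mean of group A:", mean(group_a))   # sum of the values / number of items

# Hypothesis test: null hypothesis is that both groups have the same population mean.
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print("t statistic:", t_stat, "p-value:", p_value)
```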
Skills Required for Data Analytics
Multiple skills are required to be a data analyst. Some of the main ones are mentioned below:
● Programming languages: commonly R and Python.
● Structured Query Language (SQL) for working with databases.
● Machine learning, which is used in data analysis.
● Probability and statistics, to better analyse and interpret data.
● Data management, for collecting and organising data.
● Data visualisation, to present data using charts and graphs.

Data Analytics Jobs


In Data Analytics, the following entry-level job roles are available:
● Junior Data Analyst
● Junior Data Scientist
● Associate Data Analyst

2.5 Difference Between Data Science and Data Analytics


What is Data Science
Data Science is a field that deals with extracting meaningful information and insights by
applying various algorithms, preprocessing steps, and scientific methods to structured and
unstructured data. This field is related to Artificial Intelligence and is currently one of the most
in-demand skills. Data science combines mathematics, computation, statistics, programming,
etc., to gain meaningful insights from the large amounts of data provided in various formats.
What is Data Analytics
Data Analytics is used to draw conclusions by processing raw data. It is helpful in various
businesses as it helps a company make decisions based on the conclusions drawn from the data.
Basically, data analytics helps convert a large number of figures in the form of data into plain
English, i.e., conclusions, which are further helpful in making in-depth decisions. Below is a
table of differences between Data Science and Data Analytics:

| Feature | Data Science | Data Analytics |
| --- | --- | --- |
| Coding language | Python is the most commonly used language for data science, along with other languages such as C++, Java, Perl, etc. | Knowledge of Python and the R language is essential for data analytics. |
| Programming skills | In-depth knowledge of programming is required for data science. | Basic programming skills are necessary for data analytics. |
| Use of machine learning | Data science makes use of machine learning algorithms to get insights. | Data analytics does not use machine learning to get insights from data. |
| Other skills | Data science makes use of data mining activities to get meaningful insights. | Hadoop-based analysis is used to draw conclusions from raw data. |
| Scope | The scope of data science is large. | The scope of data analytics is micro, i.e., small. |
| Goals | Data science deals with exploration and new innovations. | Data analytics makes use of existing resources. |
| Data type | Data science mostly deals with unstructured data. | Data analytics deals with structured data. |
| Statistical skills | Statistical skills are necessary in the field of data science. | Statistical skills are of minimal or no use in data analytics. |

2.6 What is Machine Learning?


● Machine learning is a method of data analysis that automates analytical model building.
● It is a branch of artificial intelligence based on the idea that systems can learn from data,
identify patterns and make decisions with minimal human intervention.
Do you get automatic recommendations on Netflix and Amazon Prime about the movies you
should watch next? Or maybe you get options for People You may know on Facebook or
LinkedIn? You might also use Siri, Alexa, etc. on your phones. That’s all Machine Learning!
Machine Learning, as the name says, is all about machines learning automatically without being
explicitly programmed or learning without any direct human intervention.
● This machine learning process starts with feeding them good quality data and then
training the machines by building various machine learning models using the data and
different algorithms.
● The choice of algorithms depends on what type of data we have and what kind of task we
are trying to automate.
● As for the formal definition of Machine Learning, we can say that a Machine Learning
algorithm learns from experience E with respect to some type of task T and
performance measure P, if its performance at tasks in T, as measured by P, improves
with experience E.
For example, if a Machine Learning algorithm is used to play chess, then the experience E is
playing many games of chess, the task T is playing chess against many players, and the
performance measure P is the probability that the algorithm will win a game of chess.

What is the difference between Artificial Intelligence and Machine Learning?


Artificial Intelligence and Machine Learning are correlated with each other, and yet they have
some differences.
Artificial Intelligence is an overarching concept that aims to create intelligence that mimics
human-level intelligence.
Artificial Intelligence is a general concept that deals with creating human-like critical thinking
capability and reasoning skills for machines. On the other hand, Machine Learning is a subset or
specific application of Artificial Intelligence that aims to create machines that can learn
autonomously from data.
Machine Learning is specific, not general, which means it allows a machine to make predictions
or take some decisions on a specific problem using data.

What are the types of Machine Learning?

1. Supervised Machine Learning


Imagine a teacher supervising a class. The teacher already knows the correct answers but the
learning process doesn’t stop until the students learn the answers as well. This is the essence of
Supervised Machine Learning Algorithms. Here, the algorithm learns from a training dataset and
makes predictions that are compared with the actual output values. If the predictions are not
correct, then the algorithm is modified until it is satisfactory. This learning process continues
until the algorithm achieves the required level of performance. Then it can provide the desired
output values for any new inputs.
2. Unsupervised Machine Learning
In this case, there is no teacher for the class and the students are left to learn for themselves! So
for Unsupervised Machine Learning Algorithms, there is no specific answer to be learned and
there is no teacher. In this way, the algorithm doesn’t figure out any output for input but it
explores the data. The algorithm is left unsupervised to find the underlying structure in the data
in order to learn more and more about the data itself.
3. Semi-Supervised Machine Learning
The students learn both from their teacher and by themselves in Semi-Supervised Machine
Learning. And you can guess that from the name itself! This is a combination of Supervised and
Unsupervised Machine Learning that uses a little amount of labeled data like Supervised
Machine Learning and a larger amount of unlabeled data like Unsupervised Machine Learning to
train the algorithms. First, the labeled data is used to partially train the Machine Learning
Algorithm, and then this partially trained model is used to pseudo-label the rest of the unlabeled
data. Finally, the Machine Learning Algorithm is fully trained using a combination of labeled
and pseudo-labeled data.
4. Reinforcement Machine Learning
Well, here are the hypothetical students who learn from their own mistakes over time (that’s like
life!). So the Reinforcement Machine Learning Algorithms learn optimal actions through trial
and error. This means that the algorithm decides the next action by learning behaviors that are
based on its current state and that will maximize the reward in the future. This is done using
reward feedback that allows the Reinforcement Algorithm to learn which are the best behaviors
that lead to maximum reward. This reward feedback is known as a reinforcement signal.

What are some popular Machine Learning algorithms?

Let’s look at some of the popular Machine Learning algorithms that are based on specific types
of Machine Learning.

Supervised Machine Learning

Supervised Machine Learning includes Regression and Classification algorithms. Some of the
more popular algorithms in these categories are:
1. Linear Regression Algorithm
The Linear Regression Algorithm provides the relation between an independent and a dependent
variable. It demonstrates the impact on the dependent variable when the independent variable is
changed in any way. So the independent variable is called the explanatory variable and the
dependent variable is called the factor of interest. An example of the Linear Regression
Algorithm usage is to analyze the property prices in the area according to the size of the property,
number of rooms, etc.
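
A minimal sketch of the property-price example with scikit-learn's LinearRegression. The sizes, room counts and prices below are made-up numbers.

```python
from sklearn.linear_model import LinearRegression

X = [[50, 1], [75, 2], [100, 3], [120, 3], [150, 4]]   # [size in sq. m, number of rooms]
y = [110, 160, 205, 240, 300]                          # hypothetical price in thousands

model = LinearRegression().fit(X, y)
print("coefficients:", model.coef_, "intercept:", model.intercept_)
print("predicted price for a 90 sq. m, 2-room property:", model.predict([[90, 2]])[0])
```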
2. Logistic Regression Algorithm
The Logistic Regression Algorithm deals in discrete values whereas the Linear Regression
Algorithm handles predictions in continuous values. This means that Logistic Regression is a
better option for binary classification. An event in Logistic Regression is classified as 1 if it
occurs and it is classified as 0 otherwise. Hence, the probability of a particular event occurrence
is predicted based on the given predictor variables. An example of the Logistic Regression
Algorithm usage is in medicine to predict if a person has malignant breast cancer tumors or not
based on the size of the tumors.
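
A minimal sketch of binary classification with scikit-learn's LogisticRegression, in the spirit of the tumour example: predicting a 0/1 class from a single made-up measurement.

```python
from sklearn.linear_model import LogisticRegression

X = [[1.2], [2.3], [2.9], [3.8], [4.5], [5.1]]   # hypothetical tumour size
y = [0, 0, 0, 1, 1, 1]                           # 0 = benign, 1 = malignant

clf = LogisticRegression().fit(X, y)
print("predicted class for size 3.5:", clf.predict([[3.5]])[0])
print("probability of class 1:", clf.predict_proba([[3.5]])[0][1])
```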
3. Naive Bayes Classifier Algorithm
Naive Bayes Classifier Algorithm is used to classify data texts such as a web page, a document,
an email, among other things. This algorithm is based on the Bayes Theorem of Probability and
it allocates the element value to a population from one of the categories that are available. An
example of the Naive Bayes Classifier Algorithm usage is for Email Spam Filtering. Gmail uses
this algorithm to classify an email as Spam or Not Spam.
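
A minimal sketch of Naive Bayes text classification with scikit-learn (not Gmail's actual implementation): a bag-of-words CountVectorizer feeding a MultinomialNB classifier. The tiny training corpus is invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["win a free prize now", "lowest price guaranteed",
         "meeting at 10 tomorrow", "please review the attached report"]
labels = ["spam", "spam", "not spam", "not spam"]

# Convert text to word counts, then apply the Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["free prize meeting"]))
```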

Unsupervised Machine Learning

Unsupervised Machine Learning mainly includes Clustering algorithms. Some of the more
popular algorithms in this category are:
1. K Means Clustering Algorithm
Let’s imagine that you want to search the name “Harry” on Wikipedia. Now, “Harry” can refer to
Harry Potter, Prince Harry of England, or any other popular Harry on Wikipedia! So Wikipedia
groups the web pages that talk about the same ideas using the K Means Clustering Algorithm
(since it is a popular algorithm for cluster analysis). K Means Clustering Algorithm in general
uses K number of clusters to operate on a given data set. In this manner, the output contains K
clusters with the input data partitioned among the clusters.
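
A minimal sketch of the K Means Clustering Algorithm with scikit-learn, partitioning a handful of 2-D points into K = 2 clusters. The points and the choice of K are arbitrary.

```python
from sklearn.cluster import KMeans

points = [[1, 2], [1, 4], [1, 0],
          [10, 2], [10, 4], [10, 0]]   # two visually separated groups of points

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("cluster labels:", kmeans.labels_)
print("cluster centres:", kmeans.cluster_centers_)
```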
2. Apriori Algorithm
The Apriori Algorithm uses the if-then format to create association rules. This means that if a
certain event 1 occurs, then there is a high probability that a certain event 2 also occurs. For
example: IF someone buys a car, THEN there is a high chance they buy car insurance as well.
The Apriori Algorithm generates this association rule by observing the number of people who
bought car insurance after buying a car. For example, Google auto-complete uses the Apriori
Algorithm. When a word is typed in Google, the Apriori Algorithm looks for the associated
words that are usually typed after that word and displays the possibilities.
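
Rather than calling an Apriori library, the sketch below illustrates the underlying if-then idea by counting transactions to estimate the support and confidence of the hypothetical rule "car, then car insurance". The transaction list is made up.

```python
# Estimate support and confidence for the rule: IF car THEN insurance.
transactions = [
    {"car", "insurance"},
    {"car", "insurance", "accessories"},
    {"car"},
    {"bike"},
    {"car", "insurance"},
]

car_buyers = [t for t in transactions if "car" in t]
both = [t for t in car_buyers if "insurance" in t]

support = len(both) / len(transactions)    # how often the pair occurs overall
confidence = len(both) / len(car_buyers)   # P(insurance | car)
print(f"support={support:.2f}, confidence={confidence:.2f}")
```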

2.7 Drew Conway’s Venn Diagram of Data Science,

Data Science

It is the complex study of the large amounts of data in a company or organization’s repository.
This study includes where the data has originated from, the actual study of its content matter, and
how this data can be useful for the growth of the company in the future. The data relating to an
organization is always in two forms: Structured or unstructured. When we study this data, we
get valuable information about business or market patterns which helps the business have an
edge over the other competitors since they’ve increased their effectiveness by recognizing
patterns in the data set.
Data scientists are specialists who excel in converting raw data into critical business matters.
These scientists are skilled in algorithmic coding along with concepts like data mining, machine
learning, and statistics. Data science is used extensively by companies like Amazon, Netflix, the
healthcare sector, in the fraud detection sector, internet search, airlines, etc.

Machine Learning

Machine Learning is a field of study that gives computers the capability to learn without being
explicitly programmed. Machine Learning is applied using Algorithms to process the data and
get trained for delivering future predictions without human intervention. The inputs for Machine
Learning are the set of instructions or data or observations. Machine Learning is used extensively
by companies like Facebook, Google, etc.

What Makes These Two Techniques Different?

Below is the Drew Conway’s Venn Diagram. Let’s have a look at the Venn Diagram.
You can see the two terms “Data Science” and “Machine Learning” in the above Venn diagram.
So let’s understand the diagram. In Drew Conway’s Venn Diagram of Data Science, the primary
colors of data are
● Hacking Skills,
● Math and Statistics Knowledge, and
● Substantive Expertise

But why has he highlighted these three? Let's understand each term.
Hacking Skills: It is known to everyone that data is the key part of data science, and data is a
commodity traded electronically; so, in order to be in this market, "one needs to speak hacker".
What does this mean? Being able to manipulate text files at the command line, understanding
vectorized operations, and thinking algorithmically are the hacking skills that make for a
successful data hacker.
Math and Statistics Knowledge: Once you have collected and cleaned the data, the next step is
to actually obtain insight from it. In order to do this, you need to use appropriate mathematical
and statistical methods, that demand at least a baseline familiarity with these tools. This is not
to say that a Ph.D. in statistics is required to be a skilled data scientist, but it does need
understanding what an ordinary least squares regression is and how to explain it.
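
As a small illustration of the kind of baseline statistical tool referred to here, the sketch below fits an ordinary least squares (OLS) line y = a*x + b to a few made-up points with NumPy.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # hypothetical observations

slope, intercept = np.polyfit(x, y, deg=1)   # least-squares fit of a degree-1 polynomial
print(f"OLS fit: y = {slope:.2f} * x + {intercept:.2f}")
```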
Substantive Expertise: The third important part is substantive expertise, and this is where the
confusion is resolved.
According to Drew Conway, “Data plus Math and Statistics Knowledge only gets you
Machine Learning”, which is excellent if that is what you are interested in, but not if you are
doing Data Science. Science is about experimentation and building knowledge, which demands
some motivating questions about the world and hypotheses that can be brought to data and tested
with statistical methods.

This is the main point of difference between these two terms. If you want to be a Data Scientist,
you must have knowledge of the domain area. But why? The foremost objective of data science
is to extract useful insights from data so that they can be profitable to the company's business. If
you are not aware of the business side of the company, of how its business model works and how
to make it better, you are of little use to the company. You need to know how to ask the right
questions of the right people so that you can obtain the information you need. Below is a
complete table of differences between Data Science and Machine Learning.

2.8 Difference Between Data Science and Machine Learning

| S.No | Data Science | Machine Learning |
| --- | --- | --- |
| 1. | Data Science is a field about processes and systems to extract data from structured and semi-structured data. | Machine Learning is a field of study that gives computers the capability to learn without being explicitly programmed. |
| 2. | Needs the entire analytics universe. | A combination of machine and data science. |
| 3. | A branch that deals with data. | Machines utilize data science techniques to learn about the data. |
| 4. | Data in Data Science may or may not have evolved from a machine or mechanical process. | It uses various techniques like regression and supervised clustering. |
| 5. | Data Science, as a broader term, not only focuses on algorithms and statistics but also takes care of data processing. | It is focused only on algorithms and statistics. |
| 6. | It is a broad term for multiple disciplines. | It fits within data science. |
| 7. | It covers many operations, that is, data gathering, data cleaning, data manipulation, etc. | It is of three types: supervised learning, unsupervised learning, and reinforcement learning. |
| 8. | Example: Netflix uses Data Science technology. | Example: Facebook uses Machine Learning technology. |
