Syai Sem3 - Ds Unit2
Data Science
Unit 1:
1.1 Data Science, need of data science, pillars of Data Science
1.2 Drew Conway’s Venn diagram of data science
1.3 Data Science Jobs Roles & Responsibilities
1.4 Difference between Data Scientist, Data Analyst, and Data Engineer
1.5 Data Science Processes
1.6 Applications of Data Science
1.7 Tools for Data Science
Ref: Textbooks:
1. Foundations of Data Science by Avrim Blum
2. "Data Science for Dummies" by Aviral Sharma and Chaitan Baru
3. Introduction to Data Science: Practical Approach with R and Python by B. Uma Maheswari and R. Sujatha
4. Data Science Fundamentals and Practical Approaches: Understand Why Data Science Is the Next (English Edition), 1 January 2020, by Dr Gypsy Anand / Dr Rupam Sharma
Additional References websites:
1. https://www.geeksforgeeks.org/data-science-fundamentals/
2. https://builtin.com/data-science
Unit 2: Data in Data Science and Comparison of Data Science
2.1 Facets of data, what is data, big data
Big data is a voluminous set of structured, unstructured, and semi-structured datasets, which is
challenging to manage using traditional data processing tools. It requires additional infrastructure
to govern, analyze, and convert into insights.
Facets of Data
A very large amount of data is generated in big data and data science. This data is of various types, and the main categories are as follows:
● a) Structured
● b) Natural language
● c) Graph-based
● d) Streaming
● e) Unstructured
● f) Machine-generated
● g) Audio, video and images
The difference between traditional data and big data is as follows:
● Volume: Traditional data ranges from gigabytes to terabytes; big data ranges from petabytes to zettabytes or exabytes.
● Data types: A traditional database system deals with structured data; a big data system deals with structured, semi-structured, and unstructured data.
● Generation rate: Traditional data is generated per hour or per day; big data is generated far more frequently, often every second.
● Source: A traditional data source is centralized and managed in centralized form; a big data source is distributed and managed in distributed form.
● Data model: The traditional data model is strict-schema based and static; the big data model is flat-schema based and dynamic.
● Stability: Traditional data is stable with known inter-relationships; big data is not stable and its relationships are unknown.
There are some steps that are necessary for any of the tasks that are being done in the field of
data science to derive any fruitful results from the data at hand.
● Data Collection – After formulating a problem statement, the main task is to collect data that can help us in our analysis and manipulation. Sometimes data is collected by performing some kind of survey, and at other times it is done by performing web scraping.
● Data Cleaning – Most of the real-world data is not structured and requires cleaning
and conversion into structured data before it can be used for any analysis or
modeling.
● Exploratory Data Analysis – This is the step in which we try to find the hidden patterns in the data at hand. We also analyze the different factors that affect the target variable and the extent to which they do so, how the independent features are related to each other, and what can be done to achieve the desired results. This also gives us a direction in which to work to get started with the modeling process (a minimal sketch follows this list).
● Model Building – Different types of machine learning algorithms and techniques have been developed that can easily identify complex patterns in the data, which would be a very tedious task for a human to do.
● Model Deployment – After a model is developed and performs well on the holdout or real-world dataset, we deploy it and monitor its performance. This is the main part where our learning from the data is applied in real-world applications and use cases.
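As an illustration of the exploratory data analysis step above, here is a minimal sketch in Python using pandas; the file name sales.csv and the column name target are hypothetical placeholders, not part of the original material.

```python
import pandas as pd

# Load a (hypothetical) structured dataset into a DataFrame
df = pd.read_csv("sales.csv")

# Basic profiling: size, column types, and summary statistics
print(df.shape)
print(df.dtypes)
print(df.describe())

# How strongly is each numeric feature related to the (hypothetical) target column?
print(df.corr(numeric_only=True)["target"].sort_values(ascending=False))
```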
The process of gathering and analyzing accurate data from various sources to find answers to research problems, identify trends and probabilities, and evaluate possible outcomes is known as data collection.
Knowledge is power, information is knowledge, and data is information in digitized form, at
least as defined in IT. Hence, data is power. But before you can leverage that data into a
successful strategy for your organization or business, you need to gather it. That’s your first step.
What is Data Collection?
Data collection is the process of collecting and evaluating information or data from multiple
sources to find answers to research problems, answer questions, evaluate outcomes, and forecast
trends and probabilities.
It is an essential phase in all types of research, analysis, and decision-making, including that
done in the social sciences, business, and healthcare.
Accurate data collection is necessary to make informed business decisions, ensure quality
assurance, and keep research integrity.
During data collection, the researchers must identify the data types, the sources of data, and what
methods are being used.
Before an analyst begins collecting data, they must answer three questions first:
Qualitative data covers descriptions such as color, size, quality, and appearance. Quantitative
data, unsurprisingly, deals with numbers, such as statistics, poll numbers, percentages, etc.
Primary data collection involves the collection of original data directly from the source or
through direct interaction with the respondents. This method allows researchers to obtain firsthand information specifically tailored to their research objectives. There are various techniques
for primary data collection, including:
b. Interviews: Interviews involve direct interaction between the researcher and the respondent.
They can be conducted in person, over the phone, or through video conferencing. Interviews can
be structured (with predefined questions), semi-structured (allowing flexibility), or unstructured
(more conversational).
c. Observations: Researchers observe and record behaviors, actions, or events in their natural
setting. This method is useful for gathering data on human behavior, interactions, or phenomena
without direct intervention.
e. Focus Groups: Focus groups bring together a small group of individuals who discuss specific
topics in a moderated setting. This method helps in understanding opinions, perceptions, and
experiences shared by the participants.
Secondary data collection, in contrast, relies on data that has already been gathered by others. Sources include:
b. Online Databases: Numerous online databases provide access to a wide range of secondary data, such as research articles, statistical information, economic data, and social surveys.
e. Past Research Studies: Previous research studies and their findings can serve as valuable
secondary data sources. Researchers can review and analyze the data to gain insights or build
upon existing knowledge.
● Word Association
The researcher gives the respondent a set of words and asks them what comes to mind when they
hear each word.
● Sentence Completion
Researchers use sentence completion to understand what kind of ideas the respondent has. This
tool involves giving an incomplete sentence and seeing how the interviewee finishes it.
● Role-Playing
Respondents are presented with an imaginary situation and asked how they would act or react if
it was real.
● In-Person Surveys
● Online/Web Surveys
These surveys are easy to accomplish, but some users may be unwilling to answer truthfully, if at
all.
● Mobile Surveys
These surveys take advantage of the increasing proliferation of mobile technology. Mobile
collection surveys rely on mobile devices like tablets or smartphones to conduct surveys via
SMS or mobile apps.
● Phone Surveys
No researcher can call thousands of people at once, so they need a third party to handle the
chore. However, many people have call screening and won’t answer.
● Observation
Sometimes, the simplest method is the best. Researchers who make direct observations collect
data quickly and easily, with little intrusion or third-party bias. Naturally, it’s only effective in
small-scale situations.
The Importance of Ensuring Accurate and Appropriate Data Collection
To guard against the effects of data collection done incorrectly, two practices are applied:
● Quality control – tasks that are performed both during and after data collection
● Quality assurance – activities that take place before data gathering starts
2. Data Cleansing: After collecting the data, the next step is to improve its quality, as the collected data contains many quality problems such as errors, duplicate entries and white spaces, which need to be corrected before moving to the next step. These errors can be corrected by running data profiling and data cleansing tasks. The data is then organised by the analysts according to the needs of the analytical model.
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted,
duplicate, or incomplete data within a dataset.
When combining multiple data sources, there are many opportunities for data to be duplicated or
mislabeled. If data is incorrect, outcomes and algorithms are unreliable, even though they may
look correct.
There is no one absolute way to prescribe the exact steps in the data cleaning process because the processes will vary from dataset to dataset. But it is crucial to establish a template for your data cleaning process so you know you are doing it the right way every time.
Data transformation is the process of converting data from one format or structure into another. Transformation processes are also referred to as data wrangling or data munging: transforming and mapping data from one "raw" data form into another format for warehousing and analysis.
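A minimal data wrangling sketch with pandas is shown below; the column names and values are hypothetical and only illustrate mapping raw fields into an analysis-ready format.

```python
import pandas as pd

# Hypothetical "raw" data as it might arrive from a source system
raw = pd.DataFrame({
    "order_date": ["2023-01-05", "2023-01-06"],
    "amount": ["1,200", "950"],
    "region": ["north", "NORTH"],
})

# Transform: parse dates, convert amounts to numbers, standardise categories
clean = raw.assign(
    order_date=pd.to_datetime(raw["order_date"]),
    amount=raw["amount"].str.replace(",", "", regex=False).astype(float),
    region=raw["region"].str.lower(),
)
print(clean.dtypes)
```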
While the techniques used for data cleaning may vary according to the types of data your
company stores, you can follow these basic steps to map out a framework for your organization.
Remove unwanted observations from your dataset, including duplicate observations or irrelevant
observations. Duplicate observations will happen most often during data collection. When you
combine data sets from multiple places, scrape data, or receive data from clients or multiple
departments, there are opportunities to create duplicate data. De-duplication is one of the largest
areas to be considered in this process. Irrelevant observations are when you notice observations
that do not fit into the specific problem you are trying to analyze. For example, if you want to
analyze data regarding millennial customers, but your dataset includes older generations, you
might remove those irrelevant observations. This can make analysis more efficient and minimize
distraction from your primary target—as well as creating a more manageable and more
performant dataset.
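A small pandas sketch of removing duplicate and irrelevant observations, mirroring the millennial-customer example above; the column names and values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "generation": ["millennial", "millennial", "boomer", "millennial"],
    "spend": [120.0, 120.0, 80.0, 60.0],
})

# Drop exact duplicate rows created while combining sources
df = df.drop_duplicates()

# Keep only the observations relevant to the analysis (millennial customers here)
df = df[df["generation"] == "millennial"]
print(df)
```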
Structural errors are when you measure or transfer data and notice strange naming conventions,
typos, or incorrect capitalization. These inconsistencies can cause mislabeled categories or
classes. For example, you may find “N/A” and “Not Applicable” both appear, but they should be
analyzed as the same category.
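A short sketch of fixing such structural errors with pandas, assuming a hypothetical status column in which "N/A" and "Not Applicable" should be treated as one category.

```python
import pandas as pd

df = pd.DataFrame({"status": ["N/A", "Not Applicable", "approved", "Approved "]})

# Normalise whitespace and capitalisation, then map spelling variants onto one label
df["status"] = (
    df["status"]
    .str.strip()
    .str.lower()
    .replace({"n/a": "not applicable"})
)
print(df["status"].value_counts())
```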
Often, there will be one-off observations where, at a glance, they do not appear to fit within the
data you are analyzing. If you have a legitimate reason to remove an outlier, like improper
data-entry, doing so will help the performance of the data you are working with. However,
sometimes it is the appearance of an outlier that will prove a theory you are working on.
Remember: just because an outlier exists, doesn’t mean it is incorrect. This step is needed to
determine the validity of that number. If an outlier proves to be irrelevant for analysis or is a
mistake, consider removing it.
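One common (but not the only) convention for flagging such one-off observations is the 1.5 × IQR rule; the sketch below, with made-up numbers, only flags the values so they can be inspected before any removal decision.

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 300])  # 300 looks like a possible data-entry error

# Flag values outside 1.5 * IQR of the middle 50% of the data
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers)  # inspect before deciding whether removal is justified
```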
1. As a first option, you can drop observations that have missing values, but doing this will drop or lose information, so be mindful of this before you remove them.
2. As a second option, you can impute missing values based on other observations; again, there is an opportunity to lose integrity of the data because you may be operating from assumptions and not actual observations.
3. As a third option, you might alter the way the data is used to effectively navigate null values (the first two options are sketched in the code below).
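The first two options can be sketched with pandas as follows; the columns are hypothetical, and median imputation is just one possible choice for option 2.

```python
import pandas as pd

df = pd.DataFrame({"age": [25, None, 40], "income": [50000, 60000, None]})

# Option 1: drop rows that contain missing values (loses information)
dropped = df.dropna()

# Option 2: impute missing values from the other observations (column medians here)
imputed = df.fillna(df.median(numeric_only=True))

print(dropped)
print(imputed)
```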
At the end of the data cleaning process, you should be able to answer these questions as a part of
basic validation:
False conclusions because of incorrect or “dirty” data can inform poor business strategy and
decision-making. False conclusions can lead to an embarrassing moment in a reporting meeting
when you realize your data doesn’t stand up to scrutiny. Before you get there, it is important to
create a culture of quality data in your organization. To do this, you should document the tools
you might use to create this culture and what data quality means to you
3. Data Analysis and Data Interpretation: Analytical models are created using software and other tools that interpret the data and help understand it. The tools include Python, Excel, R, Scala and SQL. The model is tested repeatedly until it works as required; then, in production mode, the data set is run against the model.
4. Data Visualisation: Data visualisation is the process of creating visual representations of data using plots, charts and graphs, which helps to analyse patterns and trends and to gain valuable insights from the data (a minimal sketch is shown below). By comparing and analysing the datasets, data analysts find the useful data in the raw data.
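A minimal visualisation sketch using matplotlib; the monthly sales figures are invented purely to show how a simple chart exposes a trend.

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 128, 150]  # hypothetical values

# A simple line chart makes the upward trend easy to see
plt.plot(months, sales, marker="o")
plt.title("Monthly sales")
plt.xlabel("Month")
plt.ylabel("Units sold")
plt.show()
```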
Qualitative data analysis does not use statistics; it derives data from words, pictures and symbols. Some common qualitative methods are:
● Narrative analytics is used for working with data acquired from diaries, interviews and so on.
● Content analytics is used for the analysis of verbal data and behaviour.
● Grounded theory is used to explain a given event by studying the data collected about it.
Quantitative data analytics is used to collect data and then process it into numerical form. Some of the quantitative methods are mentioned below (see the sketch after this list):
● Hypothesis testing assesses a given hypothesis about the data set.
● Sample size determination is the method of taking a small sample from a large group of people and then analysing it.
● The average, or mean, is calculated by dividing the sum of the numbers in a list by the number of items in that list.
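A small sketch of the mean and a hypothesis test, assuming the SciPy library; the sample values and the hypothesised population mean of 50 are made up.

```python
from statistics import mean
from scipy import stats

sample = [51.2, 49.8, 50.5, 52.1, 48.9, 50.7]

# Average: the sum of the values divided by the number of values
print(mean(sample))

# One-sample t-test of the hypothesis that the population mean equals 50
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
print(t_stat, p_value)
```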
Skills Required for Data Analytics
There are multiple skills required to be a data analyst. Some of the main skills are mentioned below:
● Some of the common programming languages used are R and Python.
● Structured Query Language (SQL) is the programming language used for databases.
● Machine learning is used in data analysis.
● Probability and statistics are used to better analyse and interpret data.
● Data management is used for collecting and organising data.
● Data visualisation is used to present data with charts and graphs.
The knowledge of coding language: in data science, Python is the most commonly used language, along with the use of other languages such as C++, Java, Perl, etc.; in data analytics, knowledge of Python and the R language is essential.
Let’s look at some of the popular Machine Learning algorithms that are based on specific types
of Machine Learning.
Supervised Machine Learning includes Regression and Classification algorithms. Some of the
more popular algorithms in these categories are:
1. Linear Regression Algorithm
The Linear Regression Algorithm provides the relation between an independent and a dependent
variable. It demonstrates the impact on the dependent variable when the independent variable is
changed in any way. So the independent variable is called the explanatory variable and the
dependent variable is called the factor of interest. An example of the Linear Regression
Algorithm usage is to analyze the property prices in the area according to the size of the property,
number of rooms, etc.
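A minimal scikit-learn sketch of the property-price example; the sizes, room counts and prices are invented for illustration.

```python
from sklearn.linear_model import LinearRegression

# Hypothetical properties: [size in sq. ft., number of rooms] -> price
X = [[750, 2], [900, 3], [1200, 3], [1500, 4]]
y = [150000, 180000, 230000, 290000]

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # effect of each explanatory variable
print(model.predict([[1000, 3]]))     # estimated price for a new property
```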
2. Logistic Regression Algorithm
The Logistic Regression Algorithm deals in discrete values whereas the Linear Regression
Algorithm handles predictions in continuous values. This means that Logistic Regression is a
better option for binary classification. An event in Logistic Regression is classified as 1 if it
occurs and it is classified as 0 otherwise. Hence, the probability of a particular event occurrence
is predicted based on the given predictor variables. An example of the Logistic Regression
Algorithm usage is in medicine to predict if a person has malignant breast cancer tumors or not
based on the size of the tumors.
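A short sketch of binary classification with logistic regression, using scikit-learn's built-in breast cancer dataset (which matches the example above); the train/test split and max_iter setting are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Tumour measurements -> malignant (0) or benign (1)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on the held-out data
```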
3. Naive Bayes Classifier Algorithm
Naive Bayes Classifier Algorithm is used to classify data texts such as a web page, a document,
an email, among other things. This algorithm is based on the Bayes Theorem of Probability and
it allocates the element value to a population from one of the categories that are available. An
example of the Naive Bayes Classifier Algorithm usage is for Email Spam Filtering. Gmail uses
this algorithm to classify an email as Spam or Not Spam.
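A toy spam-filtering sketch with a Naive Bayes classifier from scikit-learn; the four example emails and their labels are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical emails labelled as spam (1) or not spam (0)
emails = ["win a free prize now", "meeting agenda attached",
          "free money offer", "project status update"]
labels = [1, 0, 1, 0]

# Turn the text into word counts, then fit the Naive Bayes model
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)
clf = MultinomialNB().fit(X, labels)

print(clf.predict(vectorizer.transform(["claim your free prize"])))
```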
Unsupervised Machine Learning mainly includes Clustering algorithms. Some of the more
popular algorithms in this category are:
1. K Means Clustering Algorithm
Let’s imagine that you want to search the name “Harry” on Wikipedia. Now, “Harry” can refer to
Harry Potter, Prince Harry of England, or any other popular Harry on Wikipedia! So Wikipedia
groups the web pages that talk about the same ideas using the K Means Clustering Algorithm
(since it is a popular algorithm for cluster analysis). K Means Clustering Algorithm in general
uses K number of clusters to operate on a given data set. In this manner, the output contains K
clusters with the input data partitioned among the clusters.
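A minimal K-Means sketch with scikit-learn; the six 2-D points are made up so that K = 2 clusters are easy to see.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical points forming two obvious groups
points = np.array([[1, 1], [1.2, 0.8], [0.9, 1.1],
                   [8, 8], [8.2, 7.9], [7.8, 8.1]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # the K cluster centres
```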
2. Apriori Algorithm
The Apriori Algorithm uses the if-then format to create association rules. This means that if a
certain event 1 occurs, then there is a high probability that a certain event 2 also occurs. For
example: IF someone buys a car, THEN there is a high chance they buy car insurance as well.
The Apriori Algorithm generates this association rule by observing the number of people who
bought car insurance after buying a car. For example, Google auto-complete uses the Apriori
Algorithm. When a word is typed in Google, the Apriori Algorithm looks for the associated
words that are usually typed after that word and displays the possibilities.
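A sketch of the car/insurance association rule, assuming the third-party mlxtend library (its apriori and association_rules helpers are not part of the original text); the one-hot transaction table and thresholds are invented.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One-hot encoded transactions: did each customer buy the item?
transactions = pd.DataFrame({
    "car":       [1, 1, 1, 0, 1],
    "insurance": [1, 1, 0, 0, 1],
    "bicycle":   [0, 0, 0, 1, 0],
}).astype(bool)

# IF-THEN rules such as {car} -> {insurance} with enough support and confidence
frequent = apriori(transactions, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```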
Data Science
It is the complex study of the large amounts of data in a company or organization’s repository.
This study includes where the data has originated from, the actual study of its content matter, and
how this data can be useful for the growth of the company in the future. The data relating to an
organization is always in two forms: Structured or unstructured. When we study this data, we
get valuable information about business or market patterns which helps the business have an
edge over the other competitors since they’ve increased their effectiveness by recognizing
patterns in the data set.
Data scientists are specialists who excel in converting raw data into critical business insights. These scientists are skilled in algorithmic coding along with concepts like data mining, machine learning, and statistics. Data science is used extensively by companies like Amazon and Netflix, and in areas such as healthcare, fraud detection, internet search, airlines, etc.
Machine Learning
Machine Learning is a field of study that gives computers the capability to learn without being
explicitly programmed. Machine Learning is applied using Algorithms to process the data and
get trained for delivering future predictions without human intervention. The inputs for Machine
Learning are the set of instructions or data or observations. Machine Learning is used extensively
by companies like Facebook, Google, etc.
Let us have a look at Drew Conway's Venn diagram. You can see the two terms "Data Science" and "Machine Learning" in the diagram. In Drew Conway's Venn Diagram of Data Science, the primary colors of data are
● Hacking Skills,
● Math and Statistics Knowledge, and
● Substantive Expertise
But the question is: why has he highlighted these three? Let's understand why.
Hacking Skills: Everyone knows that data is the key part of data science, and data is a commodity traded electronically; so, in order to be in this market, "one needs to speak hacker". What does this line mean? Being able to manage text files at the command line, learning vectorized operations, and thinking algorithmically are the hacking skills that make for a successful data hacker.
Math and Statistics Knowledge: Once you have collected and cleaned the data, the next step is to actually obtain insight from it. In order to do this, you need to apply appropriate mathematical and statistical methods, which demands at least a baseline familiarity with these tools. This is not to say that a Ph.D. in statistics is required to be a skilled data scientist, but it does require knowing what an ordinary least squares regression is and how to interpret it.
Substantive Expertise: The third important part is substantive expertise, and this is where the confusion between data science and machine learning is cleared up.
According to Drew Conway, “Data plus Math and Statistics Knowledge only gets you
Machine Learning”, which is excellent if that is what you are interested in, but not if you are
doing Data Science. Science is about experimentation and building knowledge, which demands
some motivating questions about the world and hypotheses that can be brought to data and tested
with statistical methods.
And this is the main point of difference between these two terms. If you want to be a Data Scientist, then you must have knowledge of the domain area. But why? The foremost objective of data science is to extract useful insights from data so that they can be profitable to the company's business. If you are not aware of the business side of the company, how its business model works, and how it could be built better, you are of little use to the company. You need to know how to ask the right questions of the right people so that you can obtain the information you need. Below is a table of differences between Data Science and Machine Learning.
5. Data Science, as a broader term, not only focuses on algorithms and statistics but also takes care of the data processing; Machine Learning is focused only on algorithms and statistics.