Unit 1
Data Science is a combination of multiple disciplines that uses statistics, data analysis, and
machine learning to analyze data and to extract knowledge and insights from it.
Data Science is about finding patterns in data through analysis and making future predictions.
Data Science is used in many industries in the world today, e.g. banking, consultancy,
healthcare, and manufacturing.
Data Science can be applied in nearly every part of a business where data is available.
Examples are:
Consumer goods
Stock markets
Industry
Politics
Logistic companies
E-commerce
A Data Scientist must find patterns within the data. Before they can find the patterns, they must organize the data in a standard format.
What is Data?
One purpose of Data Science is to structure data, making it interpretable and easy to work
with.
Data can be categorized into two groups:
Structured data
Unstructured data
Unstructured Data
Unstructured data is not organized. We must organize the data for analysis purposes.
Structured Data
Structured data is organized and easier to work with.
Example of an array:
[80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
Example
Array = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
print(Array)
Data science incorporates various disciplines -- for example, data engineering, data
preparation, data mining, predictive analytics, machine learning and data
visualization, as well as statistics, mathematics and software programming.
What are the 4 major components of data science?
The four pillars of data science are domain knowledge, math and statistics skills,
computer science, and communication and visualization. Each is essential for the success
of any data scientist. Domain knowledge is critical to understanding the data, what it
means, and how to use it.
How do you set research goals in data science?
Setting the research goal: Understanding the business or activity that our data science
project is part of is key to ensuring its success, and it is the first phase of any sound data
analytics project. Defining the what, the why, and the how of our project in a project
charter is the foremost task.
What are the goals of the data science process?
The goal of data science is to construct the means for extracting business-focused
insights from data. This requires an understanding of how value and information
flows in a business, and the ability to use that understanding to identify business
opportunities.
What are the six steps of the data science process?
Data science life cycle is a collection of individual steps that need to be taken to
prepare for and execute a data science project. The steps include identifying the
project goals, gathering relevant data, analyzing it using appropriate tools and
techniques, and presenting results in a meaningful way.
Another term for the data science process is the data science life cycle. The terms can be used
interchangeably, and both describe a workflow process that begins with collecting data, and
ends with deploying a model that will hopefully answer your questions. The steps include:
Collecting Data
The first step is to collect the right set of data. High-quality, targeted data—and the
mechanisms to collect them—are crucial to obtaining meaningful results. Since much of the
roughly 2.5 quintillion bytes of data created every day come in unstructured formats, you’ll
likely need to extract the data and export it into a usable format, such as a CSV or JSON file.
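As a small, hedged illustration of this step, the sketch below uses pandas to load semi-structured JSON records and export them to CSV; the file names (raw_events.json, events.csv) are hypothetical.

import pandas as pd

# Read raw, semi-structured records (file name is illustrative only).
df = pd.read_json("raw_events.json")

# Export to a structured, analysis-friendly CSV file.
df.to_csv("events.csv", index=False)
print(df.shape)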
Cleaning Data
Most of the data you collect during the collection phase will be unstructured, irrelevant, and
unfiltered. Bad data produces bad results, so the accuracy and efficacy of your analysis will
depend heavily on the quality of your data.
Cleaning data eliminates duplicate and null values, corrupt data, inconsistent data types,
invalid entries, missing data, and improper formatting.
This step is the most time-intensive process, but finding and resolving flaws in your data is
essential to building effective models.
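A minimal pandas sketch of the cleaning operations described above; the DataFrame and its columns are made up for illustration.

import pandas as pd

# Hypothetical raw data with common quality problems.
df = pd.DataFrame({
    "age": [25, 25, None, 40, 200],                         # duplicate, missing, outlier
    "salary": ["50000", "50000", "62000", "bad", "70000"],  # inconsistent types
})

df = df.drop_duplicates()                                    # remove duplicate rows
df["salary"] = pd.to_numeric(df["salary"], errors="coerce")  # fix invalid entries
df = df.dropna()                                             # drop rows with missing values
df = df[df["age"] <= 120]                                    # remove an implausible outlier
print(df)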
Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is an approach that is used to analyze the data and discover trends,
patterns, or check assumptions in data with the help of statistical summaries and graphical
representations.
Now that you have a large amount of organized, high-quality data, you can begin conducting
an exploratory data analysis (EDA). Effective EDA lets you uncover valuable insights that
will be useful in the next phase of the data science lifecycle.
Communicating Results
Your stakeholders are mainly interested in what your results mean for their organization, and
often won’t care about the complex back-end work that was used to build your model.
Communicate your findings in a clear, engaging way that highlights their value in strategic
business planning and operation.
CRISP-DM
CRISP-DM stands for Cross Industry Standard Process for Data Mining. It’s an industry-
standard methodology and process model that’s popular because it’s flexible and
customizable. It’s also a proven method to guide data mining projects. The CRISP-DM
model includes six phases in the data process life cycle. Those six phases are:
1. Business Understanding
The first step in the CRISP-DM process is to clarify the business’s goals and bring focus to
the data science project. Clearly defining the goal should go beyond simply identifying the
metric you want to change. Analysis, no matter how comprehensive, can’t change metrics
without action.
To better understand the business, data scientists meet with stakeholders, subject matter
experts, and others who can offer insights into the problem at hand. They may also do
preliminary research to see how others have tried to solve similar problems. Ultimately,
they’ll have a clearly defined problem and a roadmap to solving it.
2. Data Understanding
The next step in CRISP-DM is understanding your data. In this phase, you’ll determine what
data you have, where you can get more of it, what your data includes, and its quality. You’ll
also decide what data collection tools you’ll use and how you’ll collect your initial data. Then
you’ll describe the properties of your initial data, such as the format, the quantity, and the
records or fields of your data sets.
Collecting and describing your data will allow you to begin exploring it. You can then
formulate your first hypothesis by asking data science questions that can be answered through
queries, visualization, or reporting. Finally, you’ll verify the quality of your data by
determining if there are errors or missing values.
3. Data Preparation
Data preparation is often the most time-consuming phase, and you may need to revisit this
phase multiple times throughout your project.
Data comes from various sources and is usually unusable in its raw state, as it often has
corrupt and missing attributes, conflicting values, and outliers. Data preparation resolves
these issues and improves the quality of your data, allowing it to be used effectively in the
modeling stage.
Data preparation involves many activities that can be performed in different ways. The main
activities of data preparation are selecting, cleaning, constructing, integrating, and formatting the data.
Facets of data
In data science and big data you’ll come across many different types of data, and each of
them tends to require different tools and techniques. The main categories of data are these:
■ Structured
■ Unstructured
■ Natural language
■ Machine-generated
■ Graph-based
■ Streaming
To build a model on a dataset (say, an employee dataset), we have to analyze all the information
present across it, such as the salary distribution of employees, the bonuses they receive, their
starting times, and their assigned teams. All of these steps of analyzing and modifying the data
come under EDA.
Types of EDA
Depending on the number of variables we are analyzing, we can divide EDA into the following types.
1. Univariate Analysis – In univariate analysis, we analyze or deal with only one
variable at a time. The analysis of univariate data is thus the simplest form of analysis
since the information deals with only one quantity that changes. It does not deal with
causes or relationships and the main purpose of the analysis is to describe the data and
find patterns that exist within it.
2. Bi-Variate analysis – This type of data involves two different variables. The analysis
of this type of data deals with causes and relationships and the analysis is done to find
out the relationship between the two variables.
3. Multivariate Analysis – When the data involves three or more variables, it is
categorized under multivariate.
Depending on the type of analysis, we can also subcategorize EDA into two parts: graphical analysis and non-graphical analysis. A small sketch of univariate, bivariate, and multivariate analysis follows below.
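As a brief, illustrative sketch of univariate, bivariate, and multivariate analysis, pandas summaries can be used; the employee columns below are made up.

import pandas as pd

# Hypothetical employee data used only for illustration.
df = pd.DataFrame({
    "salary": [42000, 50000, 61000, 58000, 45000],
    "bonus":  [2000, 2500, 4000, 3500, 2200],
    "team":   ["A", "B", "A", "C", "B"],
})

# Univariate analysis: one variable at a time.
print(df["salary"].describe())      # centre, spread, and range of salary
print(df["team"].value_counts())    # frequency of each team

# Bivariate analysis: relationship between two variables.
print(df["salary"].corr(df["bonus"]))

# Multivariate analysis: several variables considered together.
print(df.groupby("team")[["salary", "bonus"]].mean())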
The first step involved in Data Science Modelling is understanding the problem. A Data
Scientist listens for keywords and phrases when interviewing a line-of-business expert about
a business challenge. The Data Scientist breaks down the problem into a procedural flow that
always involves a holistic understanding of the business challenge, the Data that must be
collected, and the various Artificial Intelligence and Data Science approaches that can be used to
address the problem.
The next step in Data Science Modelling is Data Extraction. Not just any Data, but the
Unstructured Data pieces you collect, relevant to the business problem you’re trying to
address. Data Extraction is done from various sources such as online data, surveys, and existing
databases.
Step 3: Data Cleaning
Data Cleaning is necessary because data needs to be sanitized as it is gathered. Typical causes of
data inconsistencies and errors include duplicate entries, missing values, typos, and inconsistent formats.
Exploratory Data Analysis (EDA) is a robust technique for familiarising yourself with Data
and extracting useful insights. Data Scientists sift through Unstructured Data to find patterns
and infer relationships between Data elements. Data Scientists use Statistics and Visualisation
tools to summarise Central Measurements and variability to perform EDA.
If Data skewness persists, appropriate transformations are used to scale the distribution
around its mean. When Datasets have a lot of features, exploring them can be difficult. As a
result, to reduce the complexity of Model inputs, Feature Selection is used to rank them in
order of significance in Model Building for enhanced efficiency. Using Business Intelligence
tools like Tableau, MicroStrategy, etc. can be quite beneficial in this step. This step is crucial
in Data Science Modelling as the Metrics are studied carefully for validation of Data
Outcomes.
Feature Selection is the process of identifying and selecting the features that contribute the
most to the prediction variable or output that you are interested in, either automatically or
manually.
The presence of irrelevant characteristics in your Data can reduce Model accuracy and
cause your Model to train on irrelevant features. Conversely, if the selected features are
strong enough, the Machine Learning Algorithm will give far better outcomes. Both relevant and
irrelevant characteristics must therefore be identified and addressed.
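A minimal feature-selection sketch using scikit-learn's SelectKBest; the synthetic dataset and the choice of k are purely illustrative, not a prescribed method.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 100 samples, 8 features, only 3 of them informative.
X, y = make_classification(n_samples=100, n_features=8, n_informative=3, random_state=0)

# Keep the 3 features that score highest against the target.
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)          # (100, 3)
print(selector.get_support())    # boolean mask showing which features were kept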
This is one of the most crucial processes in Data Science Modelling as the Machine Learning
Algorithm aids in creating a usable Data Model. There are a lot of algorithms to pick from,
and the Model is selected based on the problem at hand. There are three types of Machine Learning
methods that are incorporated:
1) Supervised Learning
It is based on the results of previous operations related to the existing business
operation. Based on previous patterns, Supervised Learning aids in the prediction of an
outcome (a short sketch follows the list below). Some of the Supervised Learning Algorithms are:
Linear Regression
Random Forest
Support Vector Machines
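A minimal supervised-learning sketch with scikit-learn, using Random Forest from the list above; the synthetic data stands in for labeled historical records.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic labeled data standing in for historical business records.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train a Random Forest on past (labeled) data and score it on held-out data.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))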
2) Unsupervised Learning
It is used when there are no labeled outcomes to learn from; the algorithm discovers patterns and groupings in the data on its own. Common Unsupervised Learning Algorithms include K-Means Clustering and Hierarchical Clustering; a small clustering sketch follows.
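A minimal clustering sketch with scikit-learn's KMeans on synthetic, unlabeled data; the number of clusters is chosen arbitrarily for illustration.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, unlabeled points grouped around 3 centres.
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# KMeans discovers the groupings without any labels being provided.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(labels[:10])                 # cluster assignment of the first 10 points
print(kmeans.cluster_centers_)     # coordinates of the discovered centres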
3) Reinforcement Learning
It is a fascinating Machine Learning technique that uses a dynamic Dataset that interacts with
the real world. In simple terms, it is a mechanism by which a system learns from its mistakes
and improves over time. Some of the Reinforcement Learning Algorithms are listed below, followed by a toy sketch:
Q-Learning
State-Action-Reward-State-Action (SARSA)
Deep Q Network
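A toy sketch of the tabular Q-Learning update rule on a made-up three-state chain environment; all names, rewards, and parameters are illustrative assumptions, not part of any real library.

import random

# Toy environment: from each state, action 0 stays put, action 1 moves right.
# Reaching the last state yields a reward of 1 and ends the episode.
n_states, n_actions = 3, 2
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount, exploration

def step(state, action):
    next_state = min(state + action, n_states - 1)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    done = next_state == n_states - 1
    return next_state, reward, done

for episode in range(200):
    state, done = 0, False
    while not done:
        # Epsilon-greedy action selection: mostly exploit, sometimes explore.
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: Q[state][a])
        next_state, reward, done = step(state, action)
        # Q-Learning update: learn from the observed transition.
        best_next = max(Q[next_state])
        Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
        state = next_state

print(Q)   # action 1 ("move right") should end up with the higher value in each state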
This is the next phase, and it’s crucial to check that our Data Science Modelling efforts meet
the expectations. The Data Model is applied to the Test Data to check if it’s accurate and
houses all desirable features. You can further test your Data Model to identify any
adjustments that might be required to enhance the performance and achieve the desired
results. If the required precision is not achieved, you can go back to Step 5 (Machine
Learning Algorithms), choose an alternate Data Model, and then test the model again.
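A minimal sketch of testing candidate models against held-out data with scikit-learn; the synthetic data and the two candidate models are chosen only for illustration.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data split into training and test sets.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# Evaluate two candidate models on the same held-out test data.
for name, model in [("LogisticRegression", LogisticRegression(max_iter=1000)), ("SVC", SVC())]:
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(name, "accuracy:", accuracy_score(y_test, preds))
    print(classification_report(y_test, preds))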
The Model which provides the best result based on test findings is completed and deployed in
the production environment whenever the desired result is achieved through proper testing as
per the business needs. This concludes the process of Data Science Modelling.
Applications of Data Science
Every industry benefits from the experience of Data Science companies, but the most
common areas where Data Science techniques are employed are the following:
Banking and Finance: The banking industry can benefit from Data Science in many
aspects. Fraud Detection is a well-known application in this field that assists banks in
reducing non-performing assets.
Healthcare: Health concerns are being monitored and prevented using Wearable
Data. The Data acquired from the body can be used in the medical field to prevent
future calamities.
Marketing: Marketing offers a lot of potential, such as a more effective price
strategy. Pricing based on Data Science can help companies like Uber and E-
Commerce businesses enhance their profits.
Government Policies: Based on Data gathered through surveys and other official
sources, the government can use Data Science to better build policies that cater to
the interests and wishes of the people.
Data mining is the process of searching and analyzing a large batch of raw data in order to
identify patterns and extract useful information.
Companies use data mining software to learn more about their customers. It can help them to
develop more effective marketing strategies, increase sales, and decrease costs. Data mining
relies on effective data collection, warehousing, and computer processing.
Data mining involves exploring and analyzing large blocks of information to glean
meaningful patterns and trends. It is used in credit risk management, fraud detection, and
spam filtering. It also is a market research tool that helps reveal the sentiment or opinions of a
given group of people. The data mining process breaks down into four steps:
Data is collected and loaded into data warehouses on-site or on a cloud service.
Business analysts, management teams, and information technology professionals
access the data and determine how they want to organize it.
Custom application software sorts and organizes the data.
The end user presents the data in an easy-to-share format, such as a graph or table.
Data mining uses algorithms and various other techniques to convert large collections of data
into useful output. The most popular types of data mining techniques include:
Association rules, also referred to as market basket analysis, search for relationships
between variables. This relationship in itself creates additional value within the data
set as it strives to link pieces of data. For example, association rules would search a
company's sales history to see which products are most commonly purchased
together; with this information, stores can plan, promote, and forecast.
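As a rough illustration of the idea behind market basket analysis, the sketch below counts how often pairs of products appear together in some made-up transactions; it is not a full Apriori implementation.

from collections import Counter
from itertools import combinations

# Made-up sales transactions, each a set of purchased products.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "cereal"},
    {"butter", "milk"},
]

# Count how often each pair of products is bought together.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Pairs with the highest co-occurrence suggest candidate association rules.
for pair, count in pair_counts.most_common(3):
    print(pair, "support =", count / len(transactions))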
Classification uses predefined classes to assign to objects. These classes describe the
characteristics of items or represent what the data points have in common with each other.
This data mining technique allows the underlying data to be more neatly categorized
and summarized across similar features or product lines.
Clustering is similar to classification. However, clustering identifies similarities
between objects, then groups those items based on what makes them different from
other items. While classification may result in groups such as "shampoo,"
"conditioner," "soap," and "toothpaste," clustering may identify groups such as "hair
care" and "dental health."
Decision trees are used to classify or predict an outcome based on a set list of criteria
or decisions. A decision tree is used to ask for the input of a series of cascading
questions that sort the dataset based on the responses given. Sometimes depicted as a
tree-like visual, a decision tree allows for specific direction and user input when
drilling deeper into the data.
K-Nearest neighbor (KNN) is an algorithm that classifies data based on its proximity
to other data. The basis for KNN is rooted in the assumption that data points that are
close to each other are more similar to each other than other bits of data. This non-
parametric, supervised technique is used to predict the features of a group based on
individual data points.
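A minimal KNN classification sketch with scikit-learn; the synthetic data and the choice of 5 neighbours are illustrative.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic labeled points; KNN classifies new points by their closest neighbours.
X, y = make_classification(n_samples=200, n_features=4, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))   # accuracy on held-out data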
Neural networks process data through the use of nodes. These nodes are composed
of inputs, weights, and an output. Data is mapped through supervised learning, similar
to the ways in which the human brain is interconnected. This model can be
programmed to give threshold values to determine a model's accuracy.
Predictive analysis strives to leverage historical information to build graphical or
mathematical models to forecast future outcomes. Overlapping with regression
analysis, this technique aims at estimating an unknown future figure based on the
current data on hand.
To be most effective, data analysts generally follow a certain flow of tasks along the data
mining process. Without this structure, an analyst may encounter an issue in the middle of
their analysis that could have easily been prevented had they prepared for it earlier. The data
mining process is usually broken into the following steps.
Before any data is touched, extracted, cleaned, or analyzed, it is important to understand the
underlying entity and the project at hand. What are the goals the company is trying to achieve
by mining data? What is their current business situation? What are the findings of a SWOT
analysis? Before looking at any data, the mining process starts by understanding what will
define success at the end of the process.
Once the business problem has been clearly defined, it's time to start thinking about data.
This includes what sources are available, how they will be secured and stored, how the
information will be gathered, and what the final outcome or analysis may look like. This step
also includes determining the limits of the data, storage, security, and collection and assesses
how these constraints will affect the data mining process.
Step 3: Prepare the Data
With the relevant data gathered, cleaned, and prepared, it's time to crunch the numbers. Data scientists use the types
of data mining above to search for relationships, trends, associations, or sequential patterns.
The data may also be fed into predictive models to assess how previous bits of information
may translate into future outcomes.
The data-centered aspect of data mining concludes by assessing the findings of the data
model or models. The outcomes from the analysis may be aggregated, interpreted, and
presented to decision-makers that have largely been excluded from the data mining process to
this point. In this step, organizations can choose to make decisions based on the findings.
The data mining process concludes with management taking steps in response to the findings
of the analysis. The company may decide the information was not strong enough or the
findings were not relevant, or the company may strategically pivot based on findings. In
either case, management reviews the ultimate business impacts and creates future
data mining loops by identifying new business problems or opportunities.
Data mining ensures a company is collecting and analyzing reliable data. It is often a more
rigid, structured process that formally identifies a problem, gathers data related to the
problem, and strives to formulate a solution. Therefore, data mining helps a business become
more profitable, more efficient, or operationally stronger.
Data mining can look very different across applications, but the overall process can be used
with almost any new or legacy application. Essentially any type of data can be gathered and
analyzed, and almost every business problem that relies on quantifiable evidence can be
tackled using data mining.
The end goal of data mining is to take raw bits of information and determine if there is
cohesion or correlation among the data. This benefit of data mining allows a company to
create value with the information they have on hand that would otherwise not be overly
apparent. Though data models can be complex, they can also yield fascinating results, unearth
hidden trends, and suggest unique strategies.
Limitations of Data Mining
The complexity of data mining is one of its greatest disadvantages. Data analytics often
requires technical skill sets and certain software tools. Smaller companies may find this to be
a barrier to entry that is too difficult to overcome.
Data mining doesn't always guarantee results. A company may perform statistical analysis,
make conclusions based on strong data, implement changes, and not reap any benefits.
Through inaccurate findings, market changes, model errors, or inappropriate data
populations, data mining can only guide decisions and not ensure outcomes.
There is also a cost component to data mining. Data tools may require costly subscriptions,
and some bits of data may be expensive to obtain. Security and privacy concerns can be
pacified, though additional IT infrastructure may be costly as well. Data mining may also be
most effective when using huge data sets; however, these data sets must be stored and require
heavy computational power to analyze.
A data warehouse is a type of data management system that is designed to enable and
support business intelligence (BI) activities, especially analytics. Data warehouses are solely
intended to perform queries and analysis and often contain large amounts of historical data.
The data within a data warehouse is usually derived from a wide range of sources such as
application log files and transaction applications.
A data warehouse centralizes and consolidates large amounts of data from multiple sources.
Its analytical capabilities allow organizations to derive valuable business insights from their
data to improve decision-making. Over time, it builds a historical record that can be
invaluable to data scientists and business analysts. Because of these capabilities, a data
warehouse can be considered an organization’s “single source of truth.”
Data warehouses offer the overarching and unique benefit of allowing organizations to
analyze large amounts of variant data and extract significant value from it, as well as to keep
a historical record.
Four unique characteristics (described by computer scientist William Inmon, who is
considered the father of the data warehouse) allow data warehouses to deliver this
overarching benefit. According to this definition, data warehouses are
Subject-oriented. They can analyze data about a particular subject or functional area
(such as sales).
Integrated. Data warehouses create consistency among different data types from
disparate sources.
Nonvolatile. Once data is in a data warehouse, it’s stable and doesn’t change.
Time-variant. Data warehouse analysis looks at change over time.
A well-designed data warehouse will perform queries very quickly, deliver high data
throughput, and provide enough flexibility for end users to “slice and dice” or reduce the
volume of data for closer examination to meet a variety of demands—whether at a high level
or at a very fine, detailed level. The data warehouse serves as the functional foundation for
middleware BI environments that provide end users with reports, dashboards, and other
interfaces.
Simple. All data warehouses share a basic design in which metadata, summary data,
and raw data are stored within the central repository of the warehouse. The repository
is fed by data sources on one end and accessed by end users for analysis, reporting,
and mining on the other end.
Simple with a staging area. Operational data must be cleaned and processed before
being put in the warehouse. Although this can be done programmatically, many data
warehouses add a staging area for data before it enters the warehouse, to simplify data
preparation.
Hub and spoke. Adding data marts between the central repository and end users
allows an organization to customize its data warehouse to serve various lines of
business. When the data is ready for use, it is moved to the appropriate data mart.
Sandboxes. Sandboxes are private, secure, safe areas that allow companies to quickly
and informally explore new datasets or ways of analyzing data without having to
conform to or comply with the formal rules and protocol of the data warehouse.
When data warehouses first came onto the scene in the late 1980s, their purpose was to help
data flow from operational systems into decision-support systems (DSSs). These early data
warehouses required an enormous amount of redundancy. Most organizations had multiple
DSS environments that served their various users. Although the DSS environments used
much of the same data, the gathering, cleaning, and integration of the data was often
replicated for each environment.
As data warehouses became more efficient, they evolved from information stores that
supported traditional BI platforms into broad analytics infrastructures that support a wide
variety of applications, such as operational analytics and performance management.
Basic statistical description of data
The three main types of descriptive statistics are frequency distribution, central tendency,
and variability of a data set. The frequency distribution records how often data occurs,
central tendency records the data's center point of distribution, and variability of a data set
records its degree of dispersion.
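A small sketch of these three kinds of descriptive statistics using Python's built-in statistics module; the scores are made up.

import statistics
from collections import Counter

scores = [80, 85, 85, 90, 95, 95, 95, 100, 105, 110]

# Frequency distribution: how often each value occurs.
print(Counter(scores))

# Central tendency: the centre point of the distribution.
print(statistics.mean(scores), statistics.median(scores), statistics.mode(scores))

# Variability: the degree of dispersion around the centre.
print(statistics.variance(scores), statistics.stdev(scores), max(scores) - min(scores))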
The evolution of Data Science is a result of the inclusion of contemporary technologies like
Machine Learning (ML), Artificial Intelligence (AI), and the Internet of Things (IoT). The
application of data science started to spread to several other fields, such as engineering and
medicine.
The term “data science” — and the practice itself — has evolved over the years. In recent
years, its popularity has grown considerably due to innovations in data collection,
technology, and mass production of data worldwide. Gone are the days when those who
worked with data had to rely on expensive programs and mainframes. The proliferation of
programming languages like Python and procedures to collect, analyze, and interpret data
paved the way for data science to become the popular field it is today.
Data science began in statistics. Part of the evolution of data science was the inclusion of
concepts such as machine learning, artificial intelligence, and the internet of things. With the
flood of new information coming in and businesses seeking new ways to increase profit and
make better decisions, data science started to expand to other fields, including medicine,
engineering, and more.
In this article, we'll share a concise summary of the evolution of data science — from its
humble beginnings as a statistician’s dream to its current state as a unique science in its own
right recognized by every imaginable industry.
Origins, Predictions, Beginnings
We could say that data science was born from the idea of merging applied statistics with
computer science. The resulting field of study would use the extraordinary power of modern
computing. Scientists realized they could not only collect data and solve statistical problems
but also use that data to solve real-world problems and make reliable fact-driven predictions.
1962: American mathematician John W. Tukey first articulated the data science dream. In his
now-famous article "The Future of Data Analysis," he foresaw the inevitable emergence of a
new field nearly two decades before the first personal computers. While Tukey was ahead of
his time, he was not alone in his early appreciation of what would come to be known as "data
science." Another early figure was Peter Naur, a Danish computer engineer whose book
Concise Survey of Computer Methods offers one of the very first definitions of data science:
"The science of dealing with data, once they have been established, while the relation of the
data to what they represent is delegated to other fields and sciences."
1977: The theories and predictions of "pre" data scientists like Tukey and Naur became more
concrete with the establishment of The International Association for Statistical Computing
(IASC), whose mission was "to link traditional statistical methodology, modern computer
technology, and the knowledge of domain experts in order to convert data into information
and knowledge."
1980s and 1990s: Data science began taking more significant strides with the emergence of
the first Knowledge Discovery in Databases (KDD) workshop and the founding of the
International Federation of Classification Societies (IFCS). These two societies were among
the first to focus on educating and training professionals in the theory and methodology of
data science (though that term had not yet been formally adopted).
It was at this point that data science started to garner more attention from leading
professionals hoping to monetize big data and applied statistics.
1990s and early 2000s: Data science clearly emerged as a recognized and specialized
field. Several data science academic journals began to circulate, and data
science proponents like Jeff Wu and William S. Cleveland continued to help develop and
expound upon the necessity and potential of data science.
2000s: Technology made enormous leaps by providing nearly universal access to internet
connectivity, communication, and (of course) data collection.
2005: Big data enters the scene. With tech giants such as Google and Facebook uncovering
large amounts of data, new technologies capable of processing them became necessary.
Hadoop rose to the challenge, and later on Spark and Cassandra made their debuts.
2014: Due to the increasing importance of data, and organizations’ interest in finding patterns
and making better business decisions, demand for data scientists began to see dramatic
growth in different parts of the world.
2015: Machine learning, deep learning, and Artificial Intelligence (AI) officially enter the
realm of data science. These technologies have driven innovations over the past decade,
from personalized shopping and entertainment to self-driven vehicles, along with all the
insights needed to bring these real-life applications of AI efficiently into our daily lives.
2018: New regulations in the field are perhaps one of the biggest aspects of the evolution of
data science.
2020s: We are seeing additional breakthroughs in AI and machine learning, and an ever-increasing
demand for qualified professionals in Big Data.
Seeing how much of our world is currently powered by data and data science, we can
reasonably ask, Where do we go from here? What does the future of data science hold? While
it's difficult to know exactly what the hallmark breakthroughs of the future will be, all signs
seem to indicate the critical importance of machine learning. Data scientists are searching for
ways to use machine learning to produce more intelligent and autonomous AI.
In other words, data scientists are working tirelessly toward developments in deep learning to
make computers smarter. These developments can bring about advanced robotics paired with
a powerful AI. Experts predict AI that is capable of understanding and interacting
seamlessly with humans, along with self-driving vehicles and automated public transportation,
in a world interconnected like never before. This new world will be made possible by data
science.
Perhaps, on the more exciting side, we may see an age of extensive automation of labor in the
near future. This is expected to revolutionize the healthcare, finance, transportation, and
defense industries.
Data scientists collaborate closely with business leaders and other key players to comprehend
company objectives and identify data-driven strategies for achieving those objectives. A data
scientist’s job is to gather a large amount of data, analyze it, separate out the essential
information, and then utilize tools like SAS, R programming, Python, etc. to extract insights
that may be used to increase the productivity and efficiency of the business. Depending on an
organization’s needs, data scientists have a wide range of roles and responsibilities.
We can think of a data science pipeline as a unified system consisting of customized tools
and processes which enable organizations to get the maximum value out of their data.
Depending on factors like scale, the nature of the problem at hand, and the domain, the data
science pipeline can be as simple as a basic ETL process, or it can be very complex,
consisting of different stages with multiple processes working together to achieve the final
objective.
The data science pipeline of any organization is a fair reflection of how data-driven the
organization is and how much influence derived insights have on business-critical decisions.
A typical data science pipeline consists of the following stages (a minimal end-to-end sketch
follows the list):
1. Data Acquisition: The data science pipeline starts with data. In most companies, data
engineers create tables for data collection, and in some cases you may use APIs to fetch
data and make it available at the start of the pipeline.
2. Data Splitting: The second stage, data splitting, is very important. Here we break the
dataset into training, validation, and test sets. Training data is used to train the model,
and we do this split at an early stage so that we can avoid data leakage. In some cases,
we split the data after preprocessing.
3. Data Preprocessing: In data science we often say "garbage in, garbage out"; the quality
of the data matters if we want quality in the outcome. This step is usually about
cleaning the data and normalizing it. Generally, it takes care of getting rid of
characters that are irrelevant. The purpose of normalization is to rescale the numerical
values in the data to a common scale without distorting the actual differences in values.
This is applicable wherever a variable has a very large range of values.
4. Feature Engineering: This step consists of multiple tasks, which include missing value
treatment, outlier treatment, combinations (using current features to make new
features), aggregations, transformations, and encoding for categorical data types.
5. Labeling: This step is applicable in supervised cases where labels are required but not
yet available, and the model is expected to perform better if labels are fed to it. There
are two ways to approach this: manual labeling and rule-based labeling.
6. Exploratory Data Analysis: Some people do this part early, but for simplicity it is
suggested to perform EDA after we have the relevant features and labels in hand. EDA
can guide you during feature engineering.
7. Model Training: Model training refers to experimenting with different ML models for
the task and choosing the best one for the problem at hand.
8. Performance Monitoring: After training a model, it is important to spend time on
model performance monitoring. We can produce relevant model metrics, reports,
charts, and visuals that provide clarity about model performance.
9. Interpretation: It is critical for businesses to understand what the model is doing and
why. There are many ways to approach this; global and local interpretations are two
examples.
10. Iteration: This step talks about modifying the model to get better performance. It takes
the feedback loop into consideration.
11. Deployment: The next step is deployment, which means putting the model into
production. How this is done depends on the systems involved, whether it is on the
cloud, and how the company intends to use the built model.
12. Monitoring: After deploying the model, one has to keep monitoring its performance
against unseen data. Often, the model needs to be re-trained due to data drift, i.e., the
distribution of the unseen data has changed compared to what was used during the
training and validation phases.
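Tying several of these stages together, here is a minimal end-to-end sketch with scikit-learn; the synthetic data, preprocessing, and model choices are illustrative assumptions rather than a prescribed pipeline.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Data acquisition: synthetic data stands in for collected data.
X, y = make_classification(n_samples=500, n_features=10, random_state=3)

# Data splitting: hold out test data early to avoid leakage.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)

# Preprocessing: bring numerical features to a common scale.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)        # reuse training statistics only

# Model training.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Performance monitoring on held-out data.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))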
Why Is the Data Science Pipeline Important?
Here are some of the points explaining the importance of a data science pipeline for an organization:
1. Organizations can leverage the insights gained with the help of the pipeline and hence
make critical decisions faster.
2. Data science pipelines allow organizations to understand the behavioral patterns of
their target audience and then recommend personalized products and services.
3. They allow for efficiency in processes by identifying anti-patterns and bottlenecks.
1. In Search Engines
One of the most useful applications of Data Science is in search engines. When we want to
search for something on the internet, we mostly use search engines like Google, Yahoo, Bing,
etc. Data Science is used to make these searches faster and more relevant.
For example, when we search for something like "Data Structure and Algorithm courses", the
first link we get is for GeeksforGeeks courses. This happens because the GeeksforGeeks
website is visited most often for information regarding Data Structure courses and
computer-related subjects. This analysis is done using Data Science, which surfaces the
most-visited web links at the top.
2. In Transport
Data Science has also entered the transport field, for example with driverless cars. With the
help of driverless cars, it becomes easier to reduce the number of accidents.
For example, in driverless cars the training data is fed into the algorithm, and with the help
of Data Science techniques the data is analyzed to determine things like the speed limit on
highways, busy streets, and narrow roads, and how to handle different situations while driving.
3. In Finance
Data Science plays a key role in the financial industry, which constantly faces problems of
fraud and risk of losses. Financial firms need to automate risk-of-loss analysis in order to
make strategic decisions for the company. They also use Data Science analytics tools to
predict the future, which allows companies to predict customer lifetime value and stock
market moves.
For example, in the stock market, Data Science is used to examine past behavior with past
data, with the goal of predicting future outcomes. Data is analyzed in such a way that it
becomes possible to predict future stock prices over a set timeframe.
4. In E-Commerce
E-commerce websites like Amazon, Flipkart, etc. use Data Science to create a better user
experience with personalized recommendations.
For example, when we search for something on an e-commerce website, we get suggestions
similar to our past choices based on our past data, and we also get recommendations based on
the most-bought, most-rated, and most-searched products. This is all done with the help of
Data Science.
5. In Health Care
In the healthcare industry, Data Science acts as a boon. Data Science is used for:
Tumor detection.
Drug discovery.
Medical image analysis.
Virtual medical bots.
Genetics and genomics.
Predictive modeling for diagnosis, etc.
6. Image Recognition
Currently, Data Science is also used in image recognition. For example, when we upload a
photo with a friend on Facebook, Facebook suggests tagging the people who are in the
picture. This is done with the help of machine learning and Data Science. When an image is
recognized, data analysis is done on one's Facebook friends, and if a face in the picture
matches another profile, Facebook suggests auto-tagging.
7. Targeting Recommendation
Targeted recommendation is one of the most important applications of Data Science.
Whatever a user searches for on the internet, they will then see related posts everywhere.
For example, suppose I want a mobile phone, so I search for it on Google and then change my
mind and decide to buy it offline. Data Science helps the companies who are paying for
advertisements for that phone, so everywhere on the internet, on social media, on websites,
and in apps, I will see recommendations for the phone I searched for, which nudges me to
buy it online.
8. Airline Route Planning
With the help of Data Science, the airline sector is also growing; for example, it becomes
easier to predict flight delays. It also helps to decide whether to fly directly to the
destination or take a halt in between, for instance a flight can take a direct route from Delhi
to the U.S.A. or it can halt in between and then reach the destination.
9. In Gaming
In most games where a user plays against a computer opponent, Data Science concepts are
used with machine learning, so that with the help of past data the computer can improve its
performance. Many games like Chess, EA Sports titles, etc. use Data Science concepts.
10. Medicine and Drug Development
The process of creating medicine is very difficult and time-consuming and has to be done
with full discipline because it is a matter of someone's life. Without Data Science, it takes a
lot of time, resources, and finance to develop a new medicine or drug, but with the help of
Data Science it becomes easier because the probability of success can be determined based
on biological data and other factors. Algorithms based on Data Science can forecast how a
compound will react in the human body without lab experiments.
11. In Delivery Logistics
Various Logistics companies like DHL, FedEx, etc. make use of Data Science. Data Science
helps these companies to find the best route for the Shipment of their Products, the best time
suited for delivery, the best mode of transport to reach the destination, etc.
12. Autocomplete
The autocomplete feature is an important application of Data Science: the user only needs to
type a few letters or words, and the rest of the line is completed automatically. In Google
Mail, for example, when we are writing a formal mail to someone, the Data Science concept
of autocomplete is used to suggest an efficient completion of the whole line.
Data security is the process of protecting corporate data and preventing data loss through
unauthorized access. This includes protecting your data from attacks that can encrypt or
destroy data, such as ransomware, as well as attacks that can modify or corrupt your data.
Data security also ensures data is available to anyone in the organization who has access to it.
Some industries require a high level of data security to comply with data protection
regulations. For example, organizations that process payment card information must use and
store payment card data securely, and healthcare organizations in the USA must secure
private health information (PHI) in line with the HIPAA standard.
Data Analytics: Deals primarily with structured data. Mostly important for the following
industries – healthcare, marketing, retail, supply chain, entertainment, etc.
Data Science: Works with both unstructured and structured data. Mostly important for the
following industries – e-commerce, manufacturing, academics, ML/AI, fintech, etc.
Data scientists and data analysts analyze data sets to glean knowledge and insights. Data
engineers build systems for collecting, validating, and preparing that high-quality data.
Data engineers gather and prepare the data and data scientists use the data to promote better
business decisions.
Here is a list of the roles and responsibilities data engineers are expected to perform:
1. Design Data Architectures
They use a systematic approach to plan, create, and maintain data architectures while also
keeping them aligned with business requirements.
2. Collect Data
Before initiating any work on the database, they have to obtain data from the right sources.
After formulating a set of dataset processes, data engineers store optimized data.
3. Conduct Research
Data engineers conduct research in the industry to address any issues that can arise while
tackling a business problem.
4. Improve Skills
Data engineers don’t rely on theoretical database concepts alone. They must have the
knowledge and prowess to work in any development environment regardless of their
programming language. Similarly, they must keep themselves up-to-date with machine
learning and its algorithms like the random forest, decision tree, k-means, and others.
5. Identify Patterns and Build Models
They are proficient in analytics tools like Tableau, Knime, and Apache Spark. They use these
tools to generate valuable business insights for all types of industries. For instance, data
engineers can make a difference in the health industry and identify patterns in patient
behavior to improve diagnosis and treatment. Similarly, data engineers working with law
enforcement can observe changes in crime rates.
Data engineers use a descriptive data model for data aggregation to extract historical insights.
They also make predictive models where they apply forecasting techniques to learn about the
future with actionable insights. Likewise, they utilize a prescriptive model, allowing users to
take advantage of recommendations for different outcomes. A considerable chunk of a data
engineer’s time is spent on identifying hidden patterns from stored data.
6. Automate Tasks
Data engineers dive into data and pinpoint tasks where manual participation can be
eliminated with automation.
Data science uses a combination of various tools, algorithms, formulas, and machine
learning principles to draw hidden patterns from raw data. These patterns can then be used
to gain a better understanding of a variety of factors and influence decision making. Data
science does more than just crunch numbers — it reveals the “why” behind your data.
Data science is the key to making information actionable by using massive volumes of data
to predict behaviors and infer meaning from correlating data in a meaningful way. From
finding the best customers and charging the right prices to allocating costs accurately and
minimizing work-in-progress and inventory, data science is helping businesses maximize
innovation.
Data science tools and technologies have come a long way, but no development was more
important than the improvement of artificial intelligence (AI). AI is the ability of computers
to perform tasks that formerly were exclusive to humans. AI used to rely entirely on human
programming, but thanks to the application of machine learning, computers can now learn
from data to further develop their abilities. As a result, AIs can now read, write, listen,
and chat like a human can – though at a scope and speed that far exceeds what any
one person is capable of doing.
How Data Science Can Impact Your Business
Data science can positively impact many business functions, both customer-facing and
internal. And while the benefits and potential uses of data science are vast, here are some
of the primary ways organizations have used data science in their operations, and the
solutions they are using to get results.
Recruiting and Hiring
Recruiting and retaining quality, skilled employees is a struggle for many businesses,
regardless of industry. Natural language processing (NLP) is making a difference here by
automating aspects of the
recruiting process to help organizations find better candidates, faster. Using unique
algorithms, data science can “read” resumes and decide whether or not a candidate is worth
pursuing. It can even select resumes based on specific character and personality traits,
which enables businesses to get very specific about the type of person they are looking to
hire.
Opportunity Identification
Another capability of data science tools and analytics is opportunity identification. Using
historical and forecasted market data, businesses can identify geographic areas to target for
sales and marketing initiatives with greater accuracy. Data can inform new
market decisions and make predictions as to whether a new venture is likely to be cost
effective. This will ultimately help organizations determine what is worth the investment
and whether they can expect to see a return.