White Paper The Evolution of Data Science
Suman Singh
AI, backed by technology and data, brings vast value to business. With technology
advancing faster than it did a decade ago, we have accumulated vast amounts of data,
and data is what drives AI. Today, the volume and velocity at which data is growing
pose a huge challenge: harnessing its power in real time or near real time.
In this paper, we take you through the evolution of AI & ML and its current state,
which has been shaped by the challenges business leaders face in using data effectively
for faster, outcome-assured decisions while constrained by tools, skills, and
resources.
Let us start with the basics. What are Artificial Intelligence (AI) and Machine Learning
(ML)? How are they different?
In computer science, AI refers to any system or program that can perceive its
environment, taking in different types of data and then using that data to take actions
that maximize its chance of success at a given task.
Machine learning (ML) is a subset of AI. All ML systems come under AI, but not all AI
systems are ML related. The core concept behind ML is that we let machines learn by
themselves. Usually, computer programmers write programs to accomplish tasks.
However, through ML, scientists aim to avoid ‘programming’ all knowledge into
machines, instead feeding a huge data set to the machine and letting the machine do the
learning.
THE EVOLUTION OF DATA SCIENCE IN BUSINESSES
In its nascent stage, data science was just analytics. Data analysis is the process of
examining huge data sets to gain insights from them. Faced with scattered sets of data
that, by themselves, make no sense, a data analyst tries to find patterns. Through this
analysis, analysts derive insights and arrive at answers to “what happened?” and “how
did it happen?”
As the demands placed on data grew, it became imperative to deliver more dimensions
and scenarios, and the need to know what is happening and to predict what will happen
took center stage. Statistical modeling takes data analytics one step further. Instead
of only deriving insights about past events, statistical modeling helps scientists make
predictions about future events using current data. Let us take a simple example: you
need to predict whether an over-the-counter check deposit transaction is risky, based
on its probability of fraud. A statistical model takes a data set, looks for signals
such as a transaction amount greater than the average balance or greater than the
historical maximum transaction amount, and predicts which transactions are likely to
be fraudulent. This is one of the simplest applications of statistical modeling.
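The check deposit example can be sketched in a few lines of Python. This is a minimal illustration, not a production fraud model: the `Deposit` fields, the sample history and the idea of estimating fraud probability from the number of signals that fire are all assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Deposit:
    """One over-the-counter check deposit (hypothetical schema)."""
    amount: float
    average_balance: float
    historical_max_amount: float


def risk_flags(d):
    """Return the names of the two risk signals that fire for this deposit."""
    flags = []
    if d.amount > d.average_balance:
        flags.append("amount_above_average_balance")
    if d.amount > d.historical_max_amount:
        flags.append("amount_above_historical_max")
    return flags


def fraud_rate_by_flag_count(history):
    """Estimate P(fraud | number of flags) from labeled past deposits."""
    totals, frauds = {}, {}
    for deposit, was_fraud in history:
        n = len(risk_flags(deposit))
        totals[n] = totals.get(n, 0) + 1
        frauds[n] = frauds.get(n, 0) + int(was_fraud)
    return {n: frauds[n] / totals[n] for n in totals}


# Illustrative labeled history: (deposit, was it fraudulent?)
history = [
    (Deposit(200, 1500, 900), False),
    (Deposit(300, 2500, 1000), False),
    (Deposit(1200, 1000, 1500), False),
    (Deposit(1800, 1500, 2000), True),
    (Deposit(5000, 1200, 2000), True),
    (Deposit(7000, 3000, 2500), True),
]
rates = fraud_rate_by_flag_count(history)

new_deposit = Deposit(amount=4000, average_balance=1000, historical_max_amount=1500)
flags = risk_flags(new_deposit)
score = rates.get(len(flags), 0.0)  # 0.0 if this flag count was never observed
print(f"flags={flags} estimated fraud probability={score:.2f}")
```

On this toy history, both signals fire for the new deposit, and every past deposit with two signals was fraudulent, so the estimate is 1.00. A real model would be fitted on far more data and would weigh the signals rather than count them.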
The advent of technology tools and the availability of huge amounts of data, along with
the efficiency of cloud and processing capabilities, led to the dawn of machine learning
for predictions. Before machine learning, data science and statistical modeling were
used mostly by theoretical scientists and mathematicians, and practical applications
were limited. ML has brought a wave of change in the way organizations make automated
decisions, because it bases decisions on the wisdom acquired from past data. Hence ML
influences reactions and decisions and can answer two questions: What will happen? What
should happen? Organizations are using more and more automated machine learning (AutoML)
models to make quick and informed decisions. These AutoML models are powerful enough to
derive knowledge from data and draw inferences from it. But AutoML still cannot solve
all of data science’s problems.
The volume of data generated every day is too enormous for any business to process.
Even if a business employs an army of top-notch data scientists and uses advanced
machine learning models, it still faces a mountain of data. For efficient data
processing and smart decision making, we need the capability to process vast amounts of
meaningful data fast, and data scientists must be relieved of mundane tasks such as
cleaning the data and handling missing values, anomalies and high cardinality.
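As an illustration of the mundane cleaning tasks mentioned above, the sketch below fills missing values and flags anomalies. Median imputation and the 1.5 × IQR rule are common choices assumed here for the example; the paper does not prescribe specific techniques, and the `balances` data is made up.

```python
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


# Hypothetical account balances with two missing entries and one extreme value.
balances = [120.0, None, 135.0, 128.0, 9500.0, 131.0, None, 125.0]

cleaned = impute_median(balances)
anomalies = iqr_outliers(cleaned)
print(cleaned)
print("anomalous positions:", anomalies)
```

The median is used rather than the mean so that the extreme value does not distort the filled-in entries. Whether a flagged value is an error or a genuine event is still a judgment call for the analyst.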
This heralded the advent of human and machine collaboration: Enter AI Driven Data Science
& Machine Learning
With its unique ability to sift through volumes of information rapidly and objectively,
AutoML is becoming the tool of choice for organizing data in an ever-changing, complex
world. AutoML enables people with limited knowledge of analytics and machine learning
to make informed decisions. It is even more powerful in the hands of a data scientist.
AutoML has become one of the key weapons of data-driven decisioning.
Now let us look at the essential steps in data science and machine learning (DSML) that
AutoML can automate.
Data engineering
Data engineering is essentially a preprocessing step that improves the quality of data.
AutoML can help a data scientist by automating the essential steps of data engineering
like:
Feature engineering
Feature engineering is the backbone of a consistent and accurate machine learning
model. A data scientist needs to construct new features (feature derivation) that
capture hidden patterns in the existing raw data and select the most suitable features
(feature selection).
Every data scientist must ask a few important questions before deriving new features:
• Does the list of features suffer from a parsimony problem? How can you overcome
this issue?
• Can you explain the feature in business language? Can you interpret the feature?
Can you measure the impact of the feature on business?
• Do you have methods to track data treatment activities on selected features?
• Are you applying the right feature selection methods? Are you losing important
features that can help you to improve the model accuracy?
• Are you using noisy or redundant features in your model, which can lead to
overfitting?
• How can you deal with high cardinality data elements in categorical features?
Feature derivation is the key to building a good model. Raw features alone generally do
not produce a well-fitted model, and the resulting model tends to lack predictive
accuracy. AutoML helps data scientists avoid common mistakes in implementing feature
derivation. For example, if a particular feature has high cardinality, you need to
perform collapsing or indexing.
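The collapsing and indexing step for a high-cardinality feature can be sketched as follows. The `min_count` cutoff, the `OTHER` bucket name and the `merchant_city` column are assumptions for illustration; an AutoML tool may use different rules.

```python
from collections import Counter


def collapse_rare(values, min_count=2, other="OTHER"):
    """Collapse categories seen fewer than min_count times into one bucket."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


def index_encode(values):
    """Map each category to an integer index, most frequent first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically for a stable order.
    ordered = sorted(counts, key=lambda c: (-counts[c], c))
    mapping = {category: i for i, category in enumerate(ordered)}
    return [mapping[v] for v in values], mapping


# Hypothetical categorical column with several rarely seen values.
merchant_city = ["Pune", "Mumbai", "Pune", "Nashik", "Mumbai", "Pune", "Satara", "Kolhapur"]

collapsed = collapse_rare(merchant_city)
encoded, mapping = index_encode(collapsed)
print(collapsed)
print(encoded, mapping)
```

Collapsing reduces five distinct cities to three categories here, and indexing then turns those categories into integers that a model can consume.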
Why Feature Selection
Too many input variables add complexity when predicting the response and can result in
overfitting. Prediction accuracy can suffer when a model has a large number of
variables.
There are multiple reasons for feature selection in decision making; a few examples are:
AutoML ensures that the best features are selected before the model development
process. Automation takes away the manual steps and bias involved and ensures high
quality models are developed.
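One simple way to automate feature selection is a correlation filter: rank features by how strongly they relate to the target, then skip any feature that nearly duplicates one already kept. This is only a sketch of one filter method, assumed for illustration. The feature names, the 0.9 threshold and the toy data are hypothetical, and AutoML products typically combine several selection methods.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def select_features(features, target, max_pairwise=0.9):
    """Greedy filter: rank by |correlation with target|, skip near-duplicates."""
    ranked = sorted(features, key=lambda name: -abs(pearson(features[name], target)))
    kept = []
    for name in ranked:
        redundant = any(
            abs(pearson(features[name], features[k])) > max_pairwise for k in kept
        )
        if not redundant:
            kept.append(name)
    return kept


features = {
    "txn_amount": [100, 250, 400, 800, 1200, 3000],
    # Same information in different units, so it is redundant.
    "txn_amount_cents": [10000, 25000, 40000, 80000, 120000, 300000],
    "account_age_years": [3, 9, 1, 7, 2, 5],
}
is_fraud = [0, 0, 0, 1, 1, 1]

selected = select_features(features, is_fraud)
print(selected)
```

The filter keeps only one of the two amount features, because they carry the same information, and retains the account age feature. Removing such redundant inputs is one way to reduce the overfitting risk described above.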
Model development
Models should be developed at the lowest feasible level of detail, as this helps the
business make more granular decisions. The data scientist or citizen data scientist
should meet with the business sponsor throughout the model development process to
understand the impact of predictors on the business and to fine-tune the list of
predictors based on the objective. Throughout the model development process,
reasonability tests should be applied to predictor variables.
Every data set has unique characteristics, which makes it hard for data scientists to
know which machine learning model will give better results for a particular use case.
AutoML helps you automate the entire data science and machine learning process. It lets
you test multiple hypotheses and run multiple experiments in a matter of a few hours to
build a best-in-class model. It not only takes away the iterative steps and the many
lines of code typically involved in developing models, but also lets you build several
models in parallel so that you can select the model with the best possible results in
the shortest amount of time.
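The idea of building several candidate models in parallel and keeping the best one can be sketched with the standard library alone. The candidates here are deliberately tiny one-feature threshold rules, and the data, feature names and use of validation accuracy as the selection metric are assumptions for illustration. A real AutoML system would search over far richer model families, and would typically use separate processes or a cluster rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical training and validation data.
train_rows = [
    {"amount": 100, "account_age": 5}, {"amount": 200, "account_age": 1},
    {"amount": 300, "account_age": 8}, {"amount": 900, "account_age": 2},
    {"amount": 1500, "account_age": 7}, {"amount": 2500, "account_age": 3},
]
train_labels = [0, 0, 0, 1, 1, 1]
valid_rows = [
    {"amount": 150, "account_age": 4}, {"amount": 1200, "account_age": 6},
    {"amount": 250, "account_age": 2}, {"amount": 2000, "account_age": 1},
]
valid_labels = [0, 1, 0, 1]


def accuracy(feature, threshold, rows, labels):
    """Share of rows where the rule 'value > threshold means 1' is right."""
    preds = [int(row[feature] > threshold) for row in rows]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def fit_stump(feature, rows, labels):
    """Pick the threshold with the highest training accuracy."""
    candidates = sorted({row[feature] for row in rows})
    return max(candidates, key=lambda t: accuracy(feature, t, rows, labels))


def train_and_score(feature):
    """Fit one candidate model and score it on the validation set."""
    threshold = fit_stump(feature, train_rows, train_labels)
    return feature, threshold, accuracy(feature, threshold, valid_rows, valid_labels)


# Train every candidate concurrently, then keep the best validation score.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(train_and_score, ["amount", "account_age"]))

best = max(results, key=lambda r: r[2])
print(results)
print("selected model:", best)
```

Scoring on data held out from training is what keeps the comparison fair: a candidate that merely memorizes the training rows gains no advantage when the models are ranked.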
Model evaluation
Every data scientist wants to ensure that model evaluation is robust. Being able to
evaluate a model’s performance and trust it is critical for the business. However, it is
sometimes tricky to finalize model performance based only on Area Under the Curve (AUC),
the Kolmogorov-Smirnov (KS) chart, and similar but limited evaluation techniques.
Incorrect evaluation might distort results, and this, in turn, can lead to business
losses. AutoML guards against such errors by applying several evaluation techniques to
your models automatically, so that you can be confident of the models’ accuracy and
consistency over time.
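For readers unfamiliar with the two metrics named above, the sketch below computes AUC and the KS statistic from scratch on a handful of made-up scores. It is written for clarity rather than speed, and the sample `scores` and `labels` are assumptions for illustration.

```python
def auc(scores, labels):
    """AUC as the probability that a positive outscores a negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ks_statistic(scores, labels):
    """Largest gap between the cumulative score distributions of the two classes."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = 0.0
    for threshold in sorted(set(scores)):
        cum_pos = sum(1 for s, y in zip(scores, labels) if y == 1 and s <= threshold) / n_pos
        cum_neg = sum(1 for s, y in zip(scores, labels) if y == 0 and s <= threshold) / n_neg
        best = max(best, abs(cum_neg - cum_pos))
    return best


# Hypothetical model scores and true outcomes (1 = event, 0 = non-event).
scores = [0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]
labels = [0, 0, 1, 0, 0, 1, 1, 1]

model_auc = auc(scores, labels)
model_ks = ks_statistic(scores, labels)
print(f"AUC={model_auc:.3f} KS={model_ks:.3f}")
```

On this sample the AUC is 0.875 and the KS statistic is 0.750. Both numbers summarize how well the scores rank events above non-events, but neither says anything about calibration or stability over time, which is one reason they are described here as limited on their own.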
There are other methods to evaluate machine learning models but below are two
important model evaluation techniques:
A robust AutoML solution should allow you to measure and evaluate model performance
automatically, using several novel techniques, including those described above.
This ensures that data scientists spend their time on actually generating business
value, and also gives them the necessary confidence to defend their model’s performance
with business users.
Exhibit B – Auto ML
Most AI & ML solutions stop here – just automation of the traditional DSML life cycle
of data engineering, feature engineering, model development and model evaluation.
However, according to VentureBeat’s Transform 2019 report, 87% of data science and
machine learning projects do not progress beyond the prototype or research and
development stage.
AI models fail in production for various reasons, such as wrong feature derivation
while building the model and challenges in data processing, but more importantly
because of difficulty in deployment, lack of explainability and simply a lack of trust
in the model’s results.
To ensure AI models move from the lab into production, and to avoid adverse business
impact from AI failure, businesses need more than just AutoML. They need MLOps to
ensure that machine learning projects can be easily and automatically deployed in
complex production environments. Predictive accuracy and business value need to be
measured and improved on an ongoing basis, and models need to adapt to constantly
changing conditions, ensuring you get the ROI you are looking for.
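Measuring predictive accuracy on an ongoing basis can be as simple as tracking a rolling window of recent outcomes and raising a flag when accuracy falls below an agreed level. The sketch below shows that idea. The `AccuracyMonitor` class, the window size and the 0.7 threshold are hypothetical choices for illustration and do not describe any particular MLOps platform.

```python
from collections import deque


class AccuracyMonitor:
    """Track accuracy over the most recent `window` scored outcomes."""

    def __init__(self, window=5, min_accuracy=0.7):
        self.outcomes = deque(maxlen=window)  # oldest outcome drops off automatically
        self.min_accuracy = min_accuracy

    def record(self, predicted, actual):
        """Store whether the latest prediction matched what actually happened."""
        self.outcomes.append(predicted == actual)

    @property
    def accuracy(self):
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else None

    def needs_retraining(self):
        """True once the window is full and accuracy is below the threshold."""
        full = len(self.outcomes) == self.outcomes.maxlen
        return full and self.accuracy < self.min_accuracy


monitor = AccuracyMonitor(window=5, min_accuracy=0.7)

# (predicted, actual) pairs: the model tracks reality, then conditions change.
stream = [(1, 1), (0, 0), (1, 1), (0, 0), (1, 1), (1, 0), (1, 0), (0, 1)]

alerts = []
for predicted, actual in stream:
    monitor.record(predicted, actual)
    alerts.append(monitor.needs_retraining())
print(alerts)
```

The alert stays off while the model is accurate and turns on after the second consecutive miss, when rolling accuracy drops to 0.6. In practice the flag would trigger a review or an automated retraining job rather than a print statement.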
The below capabilities should be available in an MLOps platform to ensure that the
power of AI stretches beyond AutoML.
Automated data science and machine learning can help businesses accelerate their AI
journey and increase adoption of decisioning driven by predictive intelligence.
Previously, solving a business problem with data science required a team of data
scientists, data engineers, technologists, and business experts, and it took a few
months to develop and operationalize an AI model.
In comparison, AutoML+ and MLOps now enable businesses to reduce the time from data to
decision to a few days. The power of AI-driven technology lies not just in speed and
accuracy: it also allows the data scientist and business user to collaborate more with
stakeholders and focus on building innovative solutions to business problems instead of
performing repetitive, redundant and manual tasks in the data science lifecycle.
iTuring
Our innovative solution to completely automate the DSML lifecycle.
iTuring AutoML+ is a zero-code solution with an intuitive UI that democratizes data
science and puts the power of data science and machine learning in the hands of your
employees, without their having to worry about the complexity. With the entire
lifecycle automated (data transformation, feature engineering, model development and
evaluation), data scientists now have the capacity to address more business problems
faster and at scale. With detailed explanations at model and transaction level, iTuring
provides data-driven insights in hours instead of weeks for business leaders to make
informed decisions.
With iTuring MLOps, machine learning projects can be easily and automatically deployed
in complex production environments in a few clicks, irrespective of where they were
built. Predictive accuracy and business value are measured and improved on an ongoing
basis, and using CI’s Dynamic AI, models adapt to constantly changing conditions,
ensuring you get the ROI you are looking for. Real-time scoring and production-ready
REST APIs ensure that an automated decision-making capability can be integrated into a
company’s operations to automate decisions rapidly and at scale.
To learn more about our flagship product iTuring, please visit www.cyborgintell.com.