ML_MLOPs_Platform_Comparison

The document provides a comparative analysis of DataRobot and Cloudera in terms of MLOps capabilities, highlighting strengths and weaknesses. DataRobot is recommended for its ease of use and comprehensive built-in features, achieving the highest maturity level of MLOps, while Cloudera is suited for organizations with existing data setups but requires more integration effort. The final recommendation favors DataRobot for its superior end-to-end model deployment and MLOps capabilities.

OUR WORK

•Conducted a comprehensive analysis comparing DataRobot and Cloudera across critical MLOps dimensions: Feature Management, Model Management, Explainability, Deployment Automation, Model Monitoring, Collaboration, and AI Governance.

•Evaluated the strengths of each platform by analysing their capabilities and researching how each platform supports these dimensions.

•Based on our analysis, we provided a recommendation on which platform demonstrates superior performance.

INFERENCE

Cloudera:
• Great for handling large amounts of data, but lacks built-in AI management tools.
• Requires spending more time setting up and integrating other tools to reach a full or even partial MLOps maturity level.
• Suited for companies that already have a strong data setup and are willing to put in the extra integration work.
• The lowest maturity level of MLOps can be achieved through Cloudera.

DataRobot:
• Easy to use, with most features ready to go right away.
• Everything you need for managing data and models is built in.
• Little or no manual work is needed; most processes are automated.
• Ideal for companies that want to scale their AI projects quickly and efficiently.
• The highest maturity level of MLOps can be achieved through DataRobot.

FINAL RECOMMENDATION

Cloudera Data Science Workbench is good at handling ML workloads and can be used as the underlying infrastructure for them. DataRobot scores over Cloudera in its end-to-end model deployment and MLOps capabilities and supports the full model lifecycle well. Hence we propose DataRobot as an alternative to Cloudera Data Science Workbench.
MLOps Capability Dimensions and Evaluation Capabilities/Criteria

Feature Management
• Versioned Feature Stores
• Feature Catalog / Glossary
• Feature Selection
• Feature Inference

Model Management
• Model Registry
• Model Metadata
• Model Portability
• AutoML

Explainability
• Model Interpretability
• Model Explainability
• Explainability Narratives

Deployment Automation
• CI/CD Automation
• Model Deployment
• Pipeline Automation

Model Monitoring
• Data Drift Monitoring
• Model Drift/Decay Monitoring
• Performance Metrics Monitoring

Collaboration
• Collaboration Channels
• Wiki Integrations
• ITSM Integration

AI Governance
• Bias
• Ethics
• Responsible AI
DataRobot

Commentary

Feature stores serve as a central repository where frequently used features are stored and organized for reuse and sharing.

AI Catalog in DataRobot provides a centralized platform for managing and accessing data assets, enhancing collaboration and efficiency in machine learning projects.

DataRobot's Python client can accomplish feature selection by creating an aggregated Feature Impact score using models created during Autopilot. It supports both automated feature engineering and custom feature selection.
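
A minimal sketch of this aggregation with the datarobot Python client is shown below; the project ID, the top-5 model cutoff, and the aggregation logic are illustrative assumptions, not DataRobot's exact recipe.

# Sketch: aggregate Feature Impact across Autopilot models.
# The project ID and top-5 model cutoff are hypothetical.
from collections import defaultdict
import datarobot as dr

dr.Client(endpoint="https://app.datarobot.com/api/v2", token="YOUR_API_TOKEN")
project = dr.Project.get("5f3c0123abcd")     # hypothetical project ID
totals = defaultdict(float)

for model in project.get_models()[:5]:       # top Leaderboard models
    for row in model.get_or_request_feature_impact():
        totals[row["featureName"]] += row["impactNormalized"]

# Rank features by aggregated impact and keep the strongest ones.
for name, score in sorted(totals.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{name}: {score:.3f}")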

In DataRobot, "feature inference" typically refers to the process of automatically identifying and analyzing the most important features (or variables) in your dataset. It is achieved in several ways: 1) automatic feature engineering; 2) feature importance, selection, visualization, and validation.
The Model Registry is an organizational hub for the
variety of models used in DataRobot. Models are
registered as deployment-ready model packages. Each
registered model package functions the same way,
regardless of the origin of its model.

The model-metadata.yaml file is used to specify additional information about a custom task or a custom inference model.

The Portable Prediction Server (PPS) is a DataRobot execution environment for DataRobot model packages (.mlpkg files) distributed as a self-contained Docker image. After you configure the Portable Prediction Server, you can begin running single-model or multi-model portable real-time predictions and portable batch prediction jobs.
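
As a rough illustration, scoring a CSV against a locally running single-model PPS might look like the sketch below; the port and the /predictions route follow typical PPS setups but should be treated as assumptions and checked against the PPS documentation.

# Sketch: score a CSV against a locally running single-model PPS.
# The http://localhost:8080/predictions route is an assumption.
import requests

with open("scoring_data.csv", "rb") as f:
    resp = requests.post(
        "http://localhost:8080/predictions",
        data=f,
        headers={"Content-Type": "text/csv"},
    )
print(resp.json())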

A data scientist can perform AutoML in DataRobot through a structured process that simplifies the creation, evaluation, and deployment of machine learning models.
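
A minimal sketch of that process with the datarobot Python client follows; the dataset path, project name, and target column are placeholders, and older client versions use project.set_target() instead of analyze_and_model().

# Sketch: run Autopilot end to end (dataset, names, target are placeholders).
import datarobot as dr

dr.Client(endpoint="https://app.datarobot.com/api/v2", token="YOUR_API_TOKEN")

project = dr.Project.create(sourcedata="train.csv", project_name="churn-demo")
project.analyze_and_model(target="churned")  # starts Autopilot
project.wait_for_autopilot()                 # block until training finishes

best_model = project.get_models()[0]         # Leaderboard is sorted by metric
print(best_model.model_type, best_model.metrics)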

In DataRobot (Workbench is SHAP-only), when selecting between XEMP and SHAP, consider your need for accuracy versus interpretability and performance.

Prediction explanations identify the most impactful individual factors for a single prediction. Two common methods of calculating them, XEMP and SHAP, are supported in DataRobot.
1) Prediction Explanations overview - Describes the SHAP and XEMP methodologies, including benefits and tradeoffs.
2) SHAP Prediction Explanations - Describes how to work with SHAP-based Prediction Explanations.
3) XEMP Prediction Explanations - Describes how to work with XEMP-based Prediction Explanations.
4) Text Prediction Explanations - Helps to interpret the output of text-based explanations.
5) Prediction Explanations for clusters - Helps to interpret Prediction Explanations for clustering projects (XEMP only).
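
To illustrate the SHAP idea itself (not DataRobot's internal implementation), the open-source shap package computes per-feature contributions for a single prediction like this, using a toy dataset as a stand-in:

# Sketch: SHAP-style explanation for one row, using the open-source shap
# package on a toy dataset as a stand-in for a DataRobot model.
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X.iloc[:1])  # explain a single prediction
print(shap_values)                               # per-feature contributions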

DataRobot integrates with GitHub through the GitHub Actions CI/CD plugin, allowing for automated workflows. This integration enables data scientists to push model updates to a remote repository, triggering automated build, test, and deployment processes.

1) Allows deploying custom models created in the Custom Model Workshop.
2) Allows deploying external (remote) models from the Model Registry or by uploading training data in the deployment inventory.
3) Allows deploying DataRobot models from the Leaderboard or the Model Registry.
4) Different ways to deploy:
   a) Real-time API deployments: for making predictions in real time.
   b) Batch deployments: for processing large volumes of data in batch mode.
   c) Integration with cloud services: for deploying models on cloud platforms like AWS, Azure, or Google Cloud.
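
A minimal deployment sketch with the datarobot Python client; the model ID and label are placeholders, and the example assumes at least one prediction server is available.

# Sketch: deploy a Leaderboard model via the Python client.
# model_id, label, and description are hypothetical.
import datarobot as dr

dr.Client(endpoint="https://app.datarobot.com/api/v2", token="YOUR_API_TOKEN")

server = dr.PredictionServer.list()[0]  # first available prediction server
deployment = dr.Deployment.create_from_learning_model(
    model_id="64a10123abcd",            # hypothetical Leaderboard model ID
    label="churn-model-prod",
    description="Churn model deployed from the Leaderboard",
    default_prediction_server_id=server.id,
)
print(deployment.id)
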
DataRobot automates MLOps pipelines by handling model
training, continuous integration, real-time and batch
predictions, and performance monitoring. It supports
automated deployments, tracks data and model versions,
and integrates with other systems via APIs for
streamlined, scalable machine learning operations.

The platform tracks both target drift (changes in the prediction target) and feature drift (changes in input features). Users can configure notifications and monitoring settings through the Data Drift tab in the deployment settings.
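
A hedged sketch of reading drift metrics programmatically; it assumes a recent datarobot client that exposes get_target_drift and get_feature_drift on deployments, plus a placeholder deployment ID.

# Sketch: read drift metrics for a deployment (method availability
# depends on your datarobot client version; the ID is a placeholder).
import datarobot as dr

dr.Client(endpoint="https://app.datarobot.com/api/v2", token="YOUR_API_TOKEN")

deployment = dr.Deployment.get("65b20123abcd")
print(deployment.get_target_drift().drift_score)

for fd in deployment.get_feature_drift():
    print(fd.name, fd.drift_score)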

DataRobot offers tools to monitor model performance metrics continuously, comparing current predictions against historical data to detect any degradation. This includes tracking accuracy, latency, and error rates, which helps in identifying when a model may need retraining or replacement.

Custom Metrics tab
This feature allows you to implement your organization's specialized metrics, expanding on the insights provided by the built-in Service Health, Data Drift, and Accuracy metrics.
The DataRobot AI Catalog fosters collaboration by providing users with a system of record for datasets, the ability to publish and share datasets, models, applications, and monitoring dashboards with colleagues, tag datasets, and manage the lineage of the dataset throughout the entire project.

NA

Can be achieved by connecting to the ITSM data sources and building an AI/ML use case or data analysis on top of the data. Both the data and model pipelines can also be deployed and monitored. Integration with other cloud partners like Snowflake, GCP, Azure, and AWS is possible.

The Fairness tab helps you understand why a deployment is failing fairness tests and which protected features are below the predefined fairness threshold. It provides two interactive and exportable visualizations that help identify which feature is failing fairness testing and why:
- Per-Class Bias chart
- Fairness Over Time chart

DataRobot's Trusted AI team focuses on building models that adhere to ethical guidelines, ensuring that AI solutions are designed with fairness and transparency in mind.
DataRobot supports responsible AI by providing tools for
fairness analysis, explainability, and model governance,
ensuring that machine learning models are ethical,
transparent, and accountable. Features include bias
detection, interpretability with SHAP and LIME, and
comprehensive model documentation and monitoring to
maintain compliance and address potential issues.
Supporting Links / References (DataRobot)

https://docs.datarobot.com/en/docs/data/index.html#how-to-build-a-feature-store-in-datarobo

https://docs.datarobot.com/en/docs/data/ai-catalog/index.html

Advanced feature selection with Python: DataRobot docs

https://docs.datarobot.com/en/docs/data/transform-data/feature-discovery/enrich-data-using-
Model Registry: DataRobot docs

Model metadata and validation schema: DataRobot docs

Portable Prediction Server: DataRobot docs

See How AutoML Will Transform Your Business | DataRobot

SHAP reference: DataRobot docs

Explainability as a Dimension of Trusted AI | DataRobot


https://docs.datarobot.com/en/docs/modeling/analyze-models/understand/pred-explain/index.

https://youtu.be/cPrLlLDD6UU?si=Kea-WOJe5_Uvnfwn

Deploy DataRobot models: DataRobot docs


https://docs.datarobot.com/en/docs/mlops/monitor/data-drift.html

Custom Metrics tab: DataRobot docs


https://www.datarobot.com/blog/driving-ai-success-by-engaging-a-cross-functional-team/#:~:

https://docs.datarobot.com/en/docs/integrations/index.html#integrations

https://docs.datarobot.com/en/docs/mlops/governance/mlops-fairness.html#:~:text=The%20

DataRobot’s State of AI Bias Report Reveals 81% of Technology Leaders Want Government Re
https://docs.datarobot.com/en/docs/mlops/deployment-settings/fairness-settings.html#select-a-fairne
Cloudera

Commentary

Cloudera doesn't have native Feature Store functionality. You can connect to external databases or applications like Feast and launch it inside CML by running it on a Spark cluster.
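
A minimal sketch of serving features from a Feast repository inside a CML session; the feature view, feature names, and entity key are assumptions for illustration.

# Sketch: online feature lookup with Feast (feature names are hypothetical).
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # directory containing feature_store.yaml

features = store.get_online_features(
    features=[
        "customer_stats:avg_order_value",  # hypothetical feature_view:feature
        "customer_stats:orders_last_30d",
    ],
    entity_rows=[{"customer_id": 1001}],
).to_dict()
print(features)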

Feature cataloging can be accomplished by using Feast - the feature catalog includes a registry that maintains records of available features and their metadata. This registry can help us quickly find and utilize existing features instead of recreating them, which saves time and resources.

There is no native feature selection capability; it requires writing a custom Python script through a containerized deployment. Cloudera provides AMPs (Applied ML Prototypes) - accelerator kits for ML projects that come with pre-built .py files that can be customized for the ML models in scope.

While Cloudera itself may not have built-in automated feature selection tools like some AutoML platforms, users can implement feature selection techniques using libraries such as scikit-learn, XGBoost, or custom algorithms in Spark. Cloudera offers Jupyter notebooks, so this can be done through Python code via notebooks.
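
For example, a univariate selection pass with scikit-learn that runs in any CML notebook session (the toy dataset stands in for your own table):

# Sketch: keep the 10 strongest features by ANOVA F-score.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)
selected = X.columns[selector.get_support()]
print(list(selected))
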
Cloudera ML supports a native Model Registry framework - under Experiments, you can choose to register a model based on its metrics. Cloudera ML also provides support for MLflow, through which you can register your models.

Model versioning and metadata are captured through the MLflow UI - model name, description, and performance metrics.
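
A minimal sketch of logging metrics and registering a model through MLflow from a CML session; the experiment and registered-model names are placeholders.

# Sketch: log a run and register the model in the MLflow registry.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

mlflow.set_experiment("churn-experiment")    # hypothetical experiment name
with mlflow.start_run():
    model = LogisticRegression(max_iter=1000).fit(X, y)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(
        model, "model", registered_model_name="churn-classifier"
    )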

Model portability doesn't have native support in Cloudera ML. The data scientist has to specify the format in which the model file gets saved - PMML and ONNX can be leveraged.
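
For instance, exporting a scikit-learn model to ONNX with skl2onnx (the input shape and file name are illustrative):

# Sketch: save a scikit-learn model in the portable ONNX format.
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

onnx_model = convert_sklearn(
    model, initial_types=[("input", FloatTensorType([None, X.shape[1]]))]
)
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())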

Cloudera doesn't have any native AutoML capability. A data scientist can accomplish this with AutoML libraries, using Cloudera's compute and clusters as the underlying infrastructure.
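
The referenced Cloudera AMP uses TPOT; a minimal sketch of the same idea follows, with a deliberately tiny search budget so it finishes quickly.

# Sketch: AutoML search with TPOT (toy dataset, tiny search budget).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

automl = TPOTClassifier(generations=3, population_size=10, random_state=0)
automl.fit(X_train, y_train)
print(automl.score(X_test, y_test))
automl.export("best_pipeline.py")  # emit the winning pipeline as a .py file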

Cloudera provides accelerator code for model interpretability using the SHAP and LIME packages. This starter code can be packaged and customized for the models that the data scientist chooses to build.

Cloudera provides accelerator code for model explainability using the SHAP and LIME packages. This starter code can be packaged and customized for the models that the data scientist chooses to build.
The LIME and SHAP packages help with explanation narratives for each prediction. There is no native ML explainability and interpretability inside Cloudera ML.
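
A minimal LIME sketch producing a per-prediction narrative, using a toy dataset and model as stand-ins for ones built in CML:

# Sketch: explain one prediction with LIME (toy data and model).
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(data.data, data.target)

explainer = LimeTabularExplainer(
    data.data,
    feature_names=list(data.feature_names),
    class_names=list(data.target_names),
    mode="classification",
)
explanation = explainer.explain_instance(data.data[0], model.predict_proba)
print(explanation.as_list())  # (feature rule, weight) pairs for this row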

CML supports CI/CD automation by integrating with GitHub Actions and Bitbucket Pipelines. The REST API deployed inside Cloudera can be used with any CI/CD pipeline tool.

Cloudera ML supports model deployment as an endpoint through its SDK and through click-based deployment.
Cloudera ML supports automated ML pipelines through custom code: .py files can be run in sequence from a shell script, and these jobs can be monitored. There are no native pipelines such as data pre-processing or ML pipelines; however, Cloudera ships accelerator scripts as a jumpstarter kit.

CML doesn't support any native model monitoring applications. The data scientist has to build a custom dashboard for data drift. There are accelerator scripts, packaged and deployed in Cloudera, that make use of an open-source dashboard for data drift monitoring and can be leveraged.

CML can leverage external API providers like MLflow for model drift analysis. There is no native support.

Cloudera ML has a dashboard that shows the performance of the deployed models, jobs, applications, etc., and allows the user to slice and dice the data based on different personas - users, teams, and projects.
Collaboration happens through Projects and Teams, with access provided at the Viewer, Operator, Contributor, and Admin levels. Data scientists can collaborate through a Git repo and can fork their projects.

Cloudera ML model endpoints can be integrated with any application or front end, as each endpoint is a REST API.
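
A rough sketch of calling such an endpoint from any client; the URL and the accessKey/request payload shape follow the common CML pattern but are assumptions here.

# Sketch: call a CML model endpoint over REST.
# URL, access key, and feature names are placeholders.
import requests

resp = requests.post(
    "https://modelservice.your-cml-domain.example/model",
    json={
        "accessKey": "YOUR_MODEL_ACCESS_KEY",
        "request": {"feature_1": 3.2, "feature_2": "blue"},
    },
)
print(resp.json())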

There is no native framework to detect bias. By creating custom .py scripts using the K-S test, bias can be detected. The .py file can be run as a job and deployed inside the Cloudera environment.
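
For example, a two-sample K-S test on model scores split by group, using synthetic scores as stand-ins for real model outputs:

# Sketch: K-S test comparing score distributions across two groups.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
scores_group_a = rng.normal(0.60, 0.10, 500)  # synthetic scores, group A
scores_group_b = rng.normal(0.55, 0.10, 500)  # synthetic scores, group B

stat, p_value = ks_2samp(scores_group_a, scores_group_b)
print(f"KS statistic={stat:.3f}, p={p_value:.4f}")
# A small p-value means the score distributions differ between groups,
# which can be a signal of bias worth investigating.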

There is no native framework for gauging ethical AI metrics. However, libraries like FairML allow for integration with Cloudera ML to audit the ML model.
There is no native framework for gauging responsible AI. This can be approximated by performing careful feature selection, applying the right data collection strategy, and checking fairness, using mitigation methods like ExponentiatedGradient to reduce bias.
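
A minimal sketch of ExponentiatedGradient from the Fairlearn library on synthetic data; the sensitive feature and the data are stand-ins.

# Sketch: bias mitigation with Fairlearn's ExponentiatedGradient.
import numpy as np
from fairlearn.reductions import DemographicParity, ExponentiatedGradient
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
sensitive = rng.integers(0, 2, size=1000)  # stand-in protected attribute
y = (X[:, 0] + 0.5 * sensitive + rng.normal(size=1000) > 0).astype(int)

mitigator = ExponentiatedGradient(
    LogisticRegression(), constraints=DemographicParity()
)
mitigator.fit(X, y, sensitive_features=sensitive)
print(mitigator.predict(X)[:10])
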
References (Cloudera)

https://community.cloudera.com/t5/Community-Articles/How-to-integrate-a-Feature-Store-on-
https://docs.cloudera.com/machine-learning/cloud/models/topics/ml-using-model-registry.htm

https://docs.cloudera.com/cdp-public-cloud-preview-features/cloud/model-registry/model-regi

https://blog.cloudera.com/putting-machine-learning-models-into-production/

https://github.com/cloudera/CML_AMP_AutoML_with_TPOT

https://github.com/cloudera/CML_AMP_Explainability_LIME_SHAP/tree/master
https://community.cloudera.com/t5/Community-Articles/How-to-set-up-CI-CD-workflows-in-Clo

https://docs.cloudera.com/machine-learning/cloud/models/topics/ml-creating-and-deploying-a
https://www.cloudera.com/services-and-support/tutorials/building-automated-ml-pipelines-in-c

https://github.com/cloudera/CML_AMP_Continuous_Model_Monitoring

https://community.cloudera.com/t5/Community-Articles/CML-Model-Deployment-with-MLFlow-

https://docs.cloudera.com/observability/cloud/cluster-management/topics/obs-cml-workload-p
https://docs.cloudera.com/machine-learning/cloud/projects/topics/ml-collaborate.html