DSML Projects

This document outlines ten data-driven system projects, including a real-time retail recommendation engine, an automated document intelligence platform, a scalable autonomous vehicle simulation and planning system, a real-time fraud detection system, and a conversational AI chatbot. Each project lists a description, its requirements, and a step-by-step process covering data ingestion, processing, modeling, deployment, and monitoring. Technologies include Kafka, Spark, AWS services, and machine learning frameworks such as PyTorch and TensorFlow.

1. Real-Time Retail Recommendation Engine

Description: Build a low-latency, personalized product recommender for an e-commerce platform, ingesting clickstreams and purchase history.

Requirements:
- Data processing: Kafka, Spark/PySpark, Hadoop
- Feature store: Redis, Snowflake
- Modeling: Scikit-Learn, LightGBM, XGBoost, Bayesian Optimization, SHAP
- Serving: FastAPI, Docker, Kubernetes, AWS EC2/EKS, Lambda, SageMaker Endpoints
- Monitoring & Logging: Prometheus, Grafana, CloudWatch, ELK (Elasticsearch)
- UI: Streamlit, Flask

Process:
1. Data ingestion: Deploy Kafka producers to capture user events (see the sketch after this list).
2. ETL pipeline: Use Spark on a Hadoop cluster; write cleaned tables into Snowflake.
3. Feature engineering: Build features with Pandas and NumPy; model seasonality with Prophet.
4. Model training & tuning: Train LightGBM/XGBoost on SageMaker; optimize hyperparameters via Bayesian Optimization.
5. Explainability: Analyze feature impact with SHAP.
6. Containerize: Package the model server in Docker; push to ECR.
7. Deploy: Orchestrate with Kubernetes/EKS; expose via FastAPI behind an AWS ALB.
8. Real-time scoring: Lambda functions triggered by Kafka; cache lookups in Redis.
9. Monitoring: Track latency and errors in Prometheus/Grafana; ship logs to Elasticsearch.
10. Dashboard: Build an admin UI in Streamlit to visualize recommendations and drift.
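A minimal sketch of step 1's Kafka producer, using the kafka-python client. The broker address, "user-events" topic, and event schema (user_id, item_id, action, ts) are assumptions for illustration, not part of the original spec.

    import json
    import time

    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",  # assumed broker address
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    def emit_event(user_id: str, item_id: str, action: str) -> None:
        """Publish one clickstream event to the user-events topic."""
        event = {
            "user_id": user_id,
            "item_id": item_id,
            "action": action,  # e.g. "view", "add_to_cart", "purchase"
            "ts": time.time(),
        }
        # Keying by user_id keeps one user's events in a single partition,
        # so per-user ordering is preserved downstream.
        producer.send("user-events", key=user_id.encode(), value=event)

    emit_event("u123", "sku-987", "view")
    producer.flush()  # block until the event is actually delivered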
2. Automated Document Intelligence Platform

Description: End-to-end system to ingest, OCR-extract, classify, and analyze business documents (invoices, contracts).

Requirements:
- Ingestion & Storage: S3, Glue, Athena
- OCR & NLP: AWS Textract, Tesseract, spaCy, NLTK, Transformers (BERT), LangChain, Whisper
- Database & Search: Elasticsearch, Pinecone, Neo4j (knowledge graph)
- Orchestration: Airflow, Step Functions, AWS Lambda
- Web UI: Flask, React + Streamlit, Gradio
- Deployment: Docker, AWS ECS/Fargate, Kubernetes
- Monitoring: CloudWatch, Prometheus, Grafana

Process:
1. File ingestion: Upload docs to S3; trigger a Glue crawler to catalog them.
2. OCR: Invoke AWS Textract via Lambda; fall back to Tesseract if needed (see the sketch after this list).
3. Text cleaning: Preprocess with spaCy and NLTK.
4. Embedding & semantic search: Generate embeddings via Transformers; store in Pinecone.
5. Entity extraction & classification: Use BERT fine-tuned on invoices; build a knowledge graph in Neo4j.
6. Workflow orchestration: Airflow DAG or Step Functions for ETL→OCR→NLU.
7. API layer: Expose search/classify endpoints via FastAPI; containerize.
8. UI: Interactive demo in Gradio or Streamlit.
9. Monitoring & alerts: CloudWatch + Prometheus metrics.
10. Versioning & retraining: MLflow for the model registry; schedule retraining with Airflow.
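A sketch of step 2's OCR with fallback, assuming configured AWS credentials and a single-page image document; the bucket, key, and region names are placeholders. Textract's synchronous detect_document_text call handles single images, while multi-page PDFs would need the asynchronous API (not shown).

    import boto3

    textract = boto3.client("textract", region_name="us-east-1")  # assumed region

    def ocr_document(bucket: str, key: str) -> str:
        """Return the extracted text of a single-page document stored in S3."""
        try:
            resp = textract.detect_document_text(
                Document={"S3Object": {"Bucket": bucket, "Name": key}}
            )
            # Textract returns one Block per page/line/word; keep the lines.
            lines = [b["Text"] for b in resp["Blocks"] if b["BlockType"] == "LINE"]
            return "\n".join(lines)
        except Exception:
            # Fallback: download the file and OCR locally with Tesseract.
            import pytesseract
            from PIL import Image

            s3 = boto3.resource("s3")
            s3.Bucket(bucket).download_file(key, "/tmp/doc.png")
            return pytesseract.image_to_string(Image.open("/tmp/doc.png"))

    print(ocr_document("my-docs-bucket", "invoices/inv-001.png"))  # placeholder names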
3. Scalable Autonomous Vehicle Simulation & Planning

Description: Simulate fleets of autonomous vehicles, plan routes, and evaluate with traffic models.

Requirements:
- Simulator: SUMO, ROS
- Routing & Optimization: OR-Tools, Gurobi, Bayesian Optimization
- Data handling: Hadoop, Spark, Snowflake
- Compute: CUDA, TensorRT, ONNX, PyTorch, TensorFlow
- Orchestration: Kubernetes, Docker, AWS Batch, EC2 GPU (p3)
- Visualization: OpenCV, Matplotlib, Seaborn, Grafana

Process:
1. Scenario setup: Define road networks in SUMO; control via Python ROS scripts.
2. Data ingestion: Use Spark on Hadoop to process traffic logs; store in Snowflake.
3. Route planning: Formulate as an optimization problem; solve with OR-Tools/Gurobi (see the sketch after this list).
4. Model acceleration: Convert PyTorch/TensorFlow planning networks to ONNX; optimize with TensorRT on CUDA GPUs.
5. Batch simulations: Launch parallel jobs on Kubernetes/EKS or AWS Batch.
6. Metrics collection: Aggregate latency and collision stats in Snowflake; push to Prometheus.
7. Visualization: Plot routes and performance as OpenCV overlays; chart with Matplotlib/Seaborn.
8. Hyperparameter tuning: Use Bayesian Optimization to refine planning heuristics.
9. Continuous integration: MLflow to track experiments; deploy new planners via Docker.
10. Dashboard: Grafana dashboards for fleet health and simulation KPIs.
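To make step 3 concrete, here is a sketch of route planning with OR-Tools' routing solver. The 4-waypoint distance matrix is invented for the example; a real run would derive distances from the SUMO road network and extend to many vehicles.

    from ortools.constraint_solver import pywrapcp, routing_enums_pb2

    DIST = [  # symmetric distances between 4 waypoints (arbitrary units)
        [0, 9, 7, 4],
        [9, 0, 3, 6],
        [7, 3, 0, 8],
        [4, 6, 8, 0],
    ]

    # One vehicle starting and ending at waypoint 0 (the depot).
    manager = pywrapcp.RoutingIndexManager(len(DIST), 1, 0)
    routing = pywrapcp.RoutingModel(manager)

    def dist_cb(from_index, to_index):
        return DIST[manager.IndexToNode(from_index)][manager.IndexToNode(to_index)]

    transit = routing.RegisterTransitCallback(dist_cb)
    routing.SetArcCostEvaluatorOfAllVehicles(transit)

    params = pywrapcp.DefaultRoutingSearchParameters()
    params.first_solution_strategy = (
        routing_enums_pb2.FirstSolutionStrategy.PATH_CHEAPEST_ARC)

    solution = routing.SolveWithParameters(params)
    index, route = routing.Start(0), []
    while not routing.IsEnd(index):
        route.append(manager.IndexToNode(index))
        index = solution.Value(routing.NextVar(index))
    print("route:", route, "cost:", solution.ObjectiveValue())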
4. Real-Time Fraud Detection & Alerting System

Description: Detect anomalous transactions in streaming finance data and raise alerts with low false positives.

Requirements:
- Streaming: Kafka, Kinesis, Spark Streaming, PySpark
- Feature store: Redis, Snowflake
- Anomaly Detection: Scikit-Learn, Isolation Forest, Prophet (seasonal patterns)
- Model hosting: SageMaker, FastAPI, Lambda
- Alerting: AWS SNS, WebSocket, Grafana Alertmanager
- Logging & Audit: CloudTrail, Elasticsearch, Kibana

Process:
1. Stream ingestion: Deploy Kafka consumers; ingest the transaction stream.
2. Feature computation: Compute real-time features in Spark Streaming; store in Redis.
3. Baseline modeling: Fit seasonal models with Prophet; detect outliers.
4. Machine learning: Train Isolation Forest/XGBoost on historical data stored in Snowflake (see the sketch after this list).
5. Deployment: Expose models via SageMaker endpoints and FastAPI containers.
6. Real-time scoring: Lambda functions invoked per transaction; push scores via WebSocket to the front end.
7. Alerting: SNS pushes SMS/email for high-risk flags; Grafana monitors thresholds.
8. Audit logs: Stream logs to Elasticsearch; build Kibana dashboards.
9. Feedback loop: Label confirmed fraud; retrain weekly via Glue & Airflow.
10. Security & compliance: Enforce IAM policies; encrypt data at rest and in transit.
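A sketch of step 4's unsupervised detector: an Isolation Forest fitted on synthetic transaction features. The feature columns and the 1% contamination rate are assumptions; in the real pipeline the training matrix would be read from Snowflake.

    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.default_rng(42)
    # Illustrative columns: amount, seconds since last txn, km from home.
    normal = rng.normal([50, 3600, 5], [20, 1800, 3], size=(1000, 3))
    fraud = rng.normal([900, 30, 400], [200, 20, 100], size=(10, 3))
    X = np.vstack([normal, fraud])

    # contamination=0.01 assumes ~1% of traffic is fraudulent.
    model = IsolationForest(contamination=0.01, random_state=0).fit(X)
    scores = model.decision_function(X)  # lower score = more anomalous
    flags = model.predict(X)             # -1 = anomaly, 1 = normal
    print(f"flagged {np.sum(flags == -1)} of {len(X)} transactions")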
5. Conversational AI & Virtual Assistant

Description: End-to-end chatbot with multimodal capabilities and context management.

Requirements:
- NLP: Dialogflow, Rasa (optional), Transformers (BERT, Whisper), spaCy, NLTK
- Vector DB: FAISS, Pinecone
- Orchestration: LangChain, Airflow
- Backend: FastAPI, Flask, Firebase
- Deployment: Docker, Kubernetes, AWS Lambda, Cloud Functions
- Monitoring: Prometheus, Grafana, CloudWatch

Process:
1. Intent and entity design: Configure Dialogflow with training phrases.
2. Embedding pipeline: Use Whisper for speech-to-text; generate embeddings via Transformers; index in FAISS/Pinecone.
3. Context management: Build LangChain chains; store session state in Firebase.
4. Backend APIs: FastAPI endpoints for chat; integrate Dialogflow via webhook (see the sketch after this list).
5. Orchestration: Use Airflow to retrain NLU models daily with new transcripts.
6. UI: Web chat widget built in Flask or Streamlit.
7. Containerization & scaling: Docker + Kubernetes on AWS EKS; Lambda for lightweight tasks.
8. Monitoring: Track user sessions and latencies in Prometheus/Grafana.
9. Analytics: Store logs in Elasticsearch; analyze in Kibana.
10. Continuous improvement: Use MLflow for experiment tracking; deploy updated NLU models.
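A sketch of step 4's chat API in FastAPI. The /chat route, ChatTurn schema, and echo reply are placeholders; a production handler would forward the message to Dialogflow via webhook or to a LangChain chain keyed by session_id.

    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class ChatTurn(BaseModel):
        session_id: str
        message: str

    @app.post("/chat")
    def chat(turn: ChatTurn) -> dict:
        # Placeholder NLU: a real handler would call Dialogflow or a
        # LangChain chain and fetch session state from Firebase.
        reply = f"You said: {turn.message}"
        return {"session_id": turn.session_id, "reply": reply}

    # Run with: uvicorn chat_app:app --reload   (assumes this file is chat_app.py)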
6. Social Media Analytics & Trend Prediction

Description: Ingest social feeds, analyze sentiment and topics, and forecast viral trends.

Requirements:
- Scraping: Beautiful Soup, Selenium
- Streaming & storage: Kafka, Hadoop, Snowflake
- NLP: Transformers (BERT), spaCy, NLTK, TextBlob
- Time series: Prophet, SciPy
- Visualization: Matplotlib, Seaborn, Tableau, Grafana
- Deployment: Docker, AWS EC2, Lambda, Airflow

Process:
1. Data collection: Automate with Selenium and Beautiful Soup; push to Kafka.
2. Storage & ETL: Spark jobs on Hadoop; land data in Snowflake.
3. Preprocessing: Clean text with spaCy/NLTK.
4. Topic modeling: Train LDA or Transformer-based classifiers.
5. Sentiment analysis: Fine-tune BERT; serve via FastAPI.
6. Trend forecasting: Forecast topic volumes with Prophet; refine with SciPy optimizations (see the sketch after this list).
7. Dashboard: Build real-time Grafana dashboards; embed print-quality charts in Tableau.
8. Pipeline orchestration: Orchestrate scraping→ETL→modeling via Airflow.
9. Alerts: Lambda-triggered SNS alerts for spikes.
10. Scaling & monitoring: Kubernetes on AWS; metrics in CloudWatch/Prometheus.
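A sketch of step 6's forecast with Prophet, fitted on a synthetic daily series with a weekly cycle; in the real pipeline the y column would be topic mention counts pulled from Snowflake.

    import numpy as np
    import pandas as pd
    from prophet import Prophet

    days = pd.date_range("2024-01-01", periods=120, freq="D")
    # Synthetic volume: baseline + weekly seasonality + noise.
    volume = 100 + 20 * np.sin(np.arange(120) * 2 * np.pi / 7)
    volume += np.random.default_rng(0).normal(0, 5, 120)

    df = pd.DataFrame({"ds": days, "y": volume})  # Prophet's expected schema
    model = Prophet(weekly_seasonality=True)
    model.fit(df)

    future = model.make_future_dataframe(periods=14)  # two weeks ahead
    forecast = model.predict(future)
    print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail(3))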
7. Personalized Healthcare Predictive Analytics

Description: Predict patient outcomes, recommend interventions, and visualize risk.

Requirements:
- Data platforms: Hadoop, Snowflake, AWS S3/Glue/Athena
- Modeling: Scikit-Learn, XGBoost, LightGBM, PyTorch, Keras, Transformers
- Optimization: Bayesian Optimization, SHAP for explainability
- Deployment: SageMaker, FastAPI, Docker, Kubernetes, AWS Lambda
- Visualization: Dashboards in Tableau, Streamlit, Matplotlib, Seaborn

Process:
1. Data ingestion: Load EHR data into S3; catalog with Glue.
2. Exploratory analysis: Query via Athena; visualize with Seaborn.
3. Feature engineering: Use Pandas and SciPy for numeric features, Transformers for unstructured notes.
4. Model training: Compare XGBoost/LightGBM against deep nets in PyTorch/Keras; tune with Bayesian Optimization (see the sketch after this list, which covers this step and the next).
5. Explainability: Compute SHAP values; build interpretability reports.
6. Deployment: Package the best model in Docker; deploy to a SageMaker endpoint.
7. API & UI: Serve predictions via FastAPI; interactive dashboard in Streamlit.
8. Monitoring & retraining: Track data drift via SageMaker Model Monitor; schedule retraining with Airflow.
9. Security: Use IAM, encrypt PHI, and manage credentials with Secrets Manager.
10. Reporting: Export KPI charts to Tableau for clinicians.
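A sketch of steps 4-5: a gradient-boosted outcome model with SHAP attributions. A public scikit-learn dataset stands in for EHR features, since no real schema is given, and the model settings are illustrative.

    import shap
    import xgboost as xgb
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split

    # Stand-in for an EHR feature matrix with a binary outcome label.
    X, y = load_breast_cancer(return_X_y=True, as_frame=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    model = xgb.XGBClassifier(n_estimators=200, max_depth=4,
                              eval_metric="logloss")
    model.fit(X_tr, y_tr)
    print("held-out accuracy:", model.score(X_te, y_te))

    # Per-feature contributions for each prediction, for the reports.
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_te)
    shap.summary_plot(shap_values, X_te, show=False)  # save/embed downstream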
8. High-Frequency Trading Analytics Platform

Description: Ultra-low-latency analytics and backtesting system for trading strategies.

Requirements:
- Streaming: Kinesis, Kafka
- Compute: EC2 Spot GPU (for backtest ML), Ray for distributed execution
- Frameworks: PyTorch, TensorFlow, CUDA, TensorRT
- Backtesting: QuantLib, OR-Tools, Gurobi
- Data store: Redis, Snowflake, Hadoop
- Orchestration: Airflow, Step Functions
- Visualization: Grafana, Matplotlib, Seaborn

Process:
1. Market data feed: Ingest via Kinesis; buffer in Kafka.
2. Feature extraction: Compute real-time features in Spark Streaming; cache in Redis.
3. Strategy modeling: Build RL/supervised models in PyTorch; accelerate inference via TensorRT.
4. Backtesting engine: Use QuantLib for pricing; solve portfolio allocation via Gurobi and OR-Tools.
5. Distributed execution: Orchestrate with Ray across EC2 spot clusters (see the sketch after this list).
6. Results storage: Persist trade logs to Snowflake.
7. Dashboard & alerts: Grafana for latency and trade P&L; SNS alerts for anomalies.
8. CI/CD: Dockerize strategies; deploy via Kubernetes/EKS.
9. Scheduling: Airflow DAG for nightly backtests; Step Functions for deployment pipelines.
10. Performance tuning: Profile with CUDA tools; iterate.
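A sketch of step 5's fan-out with Ray, where each backtest runs as an independent remote task. The moving-average strategy, synthetic price path, and local 4-CPU cluster are all assumptions; on EC2 the same code would connect to an existing cluster via ray.init(address=...).

    import numpy as np
    import ray

    ray.init(num_cpus=4)  # assumed local cluster for the example

    @ray.remote
    def backtest(window: int, seed: int = 0) -> tuple:
        """Toy moving-average strategy P&L on a synthetic price path."""
        prices = 100 + np.cumsum(np.random.default_rng(seed).normal(0, 1, 5000))
        moving_avg = np.convolve(prices, np.ones(window) / window, "same")
        long = prices > moving_avg            # hold while price is above MA
        pnl = float(np.sum(np.diff(prices) * long[:-1].astype(int)))
        return window, pnl

    # Each window size runs in parallel as its own Ray task.
    results = ray.get([backtest.remote(w) for w in (10, 20, 50, 100)])
    print("best window by P&L:", max(results, key=lambda r: r[1]))
    ray.shutdown()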
9. Intelligent Video Analytics & Surveillance

Description: Detect, track, and classify objects and activities in live video feeds.

Requirements:
- CV frameworks: OpenCV, YOLO, Roboflow, ONNX, TensorRT
- Streaming: Kafka, Kinesis Video Streams
- Storage: S3, Hadoop
- Modeling: PyTorch, TensorFlow, Transformers (for captioning), Whisper (audio)
- Deployment: AWS EC2 GPU, Lambda (for triggers), Docker, Kubernetes
- Monitoring: CloudWatch, Prometheus, Grafana

Process:
1. Video ingestion: Stream from Kinesis Video; shard to Kafka.
2. Preprocessing: Decode frames with OpenCV; resize and normalize (see the sketch after this list).
3. Object detection: Run YOLO in PyTorch; convert to ONNX → TensorRT on GPU.
4. Tracking & classification: Use SORT/DeepSORT; classify actions with deep nets.
5. Audio analytics: Extract audio; run Whisper for speech-to-text.
6. Event pipeline: Lambda triggers on detections; store metadata in Snowflake.
7. Dashboard: Real-time metrics in Grafana; playback UI in Streamlit.
8. Model updates: Roboflow pipeline to label new data; retrain weekly.
9. Scalability: Kubernetes auto-scaling based on stream volume.
10. Alerts & export: SNS email/SMS on critical events; export clips to S3.
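A sketch of step 2's frame preprocessing with OpenCV. The input file name, 32-frame batch size, and 640x640 detector input resolution are assumptions.

    import cv2
    import numpy as np

    cap = cv2.VideoCapture("sample_feed.mp4")  # or an RTSP/stream URL
    frames = []
    while len(frames) < 32:                    # small batch for the detector
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, (640, 640))           # detector input size
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # OpenCV decodes as BGR
        frames.append(frame.astype(np.float32) / 255.0) # scale to [0, 1]
    cap.release()

    batch = np.stack(frames) if frames else np.empty((0, 640, 640, 3))
    print("batch shape:", batch.shape)  # (N, H, W, C), ready for the model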
10. End-to-End MLOps Platform with MLflow

Description: Build a generic MLOps framework to track, package, deploy, and monitor any ML model.

Requirements:
- Experiment tracking: MLflow, Weights & Biases
- CI/CD: GitHub Actions, AWS CodePipeline, CodeBuild, CodeDeploy
- Containers & Orchestration: Docker, Kubernetes, EKS/ECS
- Model registry & serving: MLflow Serving, SageMaker
- Data & feature store: Snowflake, Redis, AWS S3/Glue/Athena
- Monitoring: Prometheus, Grafana, CloudWatch
- Languages & libs: Python, Scikit-Learn, TensorFlow, PyTorch, XGBoost, LightGBM, Transformers, spaCy, NLTK

Process:
1. Repo setup: Define a standard project template with MLflow integration.
2. Data ingestion: Ingest from S3 or Snowflake; register features in Redis.
3. Experimentation: Use MLflow to log metrics and artifacts for sklearn, TensorFlow, and PyTorch runs (see the sketch after this list).
4. Hyperparameter tuning: Integrate Bayesian Optimization into the MLflow pipeline.
5. CI/CD: Configure GitHub Actions → build Docker images → push to ECR.
6. Model registry: Promote models through Dev→Staging→Prod in MLflow.
7. Deployment: Serve via MLflow Serving or SageMaker endpoints; orchestrate with Kubernetes.
8. Monitoring: Instrument code for Prometheus metrics; dashboards in Grafana.
9. Alerts: CloudWatch alarms for drift; SNS notifications.
10. Documentation & templates: Provide a Streamlit UI to kick off new experiments; include examples using Transformers, XGBoost, deep learning, and NLP with spaCy.
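A sketch of step 3's experiment logging with MLflow, using a small scikit-learn model as the stand-in. The experiment name is arbitrary, and runs land in the local ./mlruns store unless a tracking URI is configured.

    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    mlflow.set_experiment("demo-experiment")  # placeholder experiment name
    with mlflow.start_run():
        model = LogisticRegression(max_iter=500, C=1.0)
        mlflow.log_param("C", 1.0)
        model.fit(X_tr, y_tr)
        mlflow.log_metric("test_accuracy", model.score(X_te, y_te))
        # Logged model artifact can later be promoted through the registry.
        mlflow.sklearn.log_model(model, "model")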
