CampusX DSMP 2.0 Syllabus
-----------------------------------------------------------------------
XGBoost (Extreme Gradient Boosting)
1. Introduction to XGBoost
* Introduction
* Features
* Performance
* Speed
* Flexibility
2. XGBoost for Regression
1. Regression Problem Statement
2. Step-by-Step Mathematical Calculation
3. XGBoost for Classification
1. Classification Problem Statement
2. Step-by-Step Mathematical Calculation
4. The Complete Maths of XGBoost
1. Prerequisite & Disclaimer
2. Boosting as an Additive Model
3. XGBoost Loss Function
4. Deriving Objective Function
5. Problem With Objective Function and Solution
1. The Taylor series
2. Applying Taylor Series
3. Simplification
6. Output Value for Regression
7. Output Value for Classification
8. Derivation of Similarity Score
9. Final Calculation of Similarity Score
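For quick reference, the closed-form results this section builds toward can be summarised as follows (a standard XGBoost summary; lambda is the L2 regularisation parameter, r_i the residuals in a leaf, N the number of residuals, and p_i the previously predicted probability):

```latex
% Regression: leaf output value and similarity score
\text{Output}_{\text{reg}} = \frac{\sum_i r_i}{N + \lambda},
\qquad
\text{Similarity}_{\text{reg}} = \frac{\left(\sum_i r_i\right)^2}{N + \lambda}

% Classification (log loss): the count N is replaced by \sum_i p_i(1-p_i)
\text{Output}_{\text{clf}} = \frac{\sum_i r_i}{\sum_i p_i\,(1 - p_i) + \lambda},
\qquad
\text{Similarity}_{\text{clf}} = \frac{\left(\sum_i r_i\right)^2}{\sum_i p_i\,(1 - p_i) + \lambda}
```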
MLOps Curriculum
4. Session 3: Reproducibility
1. Story
2. Industry Tools
3. Cookiecutter
1. Step 1: Install the Cookiecutter Library and start a project
2. Step 2: Explore the Template Structure
3. Step 3: Customize the Cookiecutter Variables
4. Step 4: Benefits of Using Cookiecutter Templates in Data Science
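The four steps above can be sketched on the command line. This assumes Python and pip are available; the template URL shown is the widely used drivendata data-science template, given here only as an illustration:

```shell
# Step 1: install the Cookiecutter library and start a project
pip install cookiecutter
cookiecutter https://github.com/drivendata/cookiecutter-data-science
# Step 3: cookiecutter interactively prompts for the template variables
# (project_name, author_name, etc.) and then generates the project.

# Step 2: explore the generated template structure
ls <your-project-name>/
# Step 4 (benefit): every project starts from the same reproducible skeleton.
```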
5. Session 4: Data Versioning Control
1. Introduction
2. Prerequisites
3. Setup
1. Step 1: Initialize a Git repository
2. Step 2: Set up DVC in your project
3. Step 3: Add a dataset to your project
4. Step 4: Commit changes to Git
5. Step 5: Create and version your machine learning pipeline
6. Step 6: Track changes and reproduce experiments
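The six setup steps above map to a handful of commands. A minimal sketch, assuming Git and DVC are installed and using hypothetical paths (data/raw.csv, src/train.py, models/model.pkl):

```shell
git init                        # Step 1: initialize a Git repository
dvc init                        # Step 2: set up DVC in the project
dvc add data/raw.csv            # Step 3: DVC tracks the file, writes data/raw.csv.dvc
git add data/raw.csv.dvc .gitignore
git commit -m "Track raw dataset with DVC"   # Step 4: commit changes to Git

# Step 5: create and version an ML pipeline stage
dvc stage add -n train -d src/train.py -d data/raw.csv \
    -o models/model.pkl python src/train.py

dvc repro                       # Step 6: track changes and reproduce the pipeline
```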
6. Doubt Clearance Session 2
1. Assignment Solution on DVC: 10:19
2. Doubt Clearance
1. DVC with Google Drive: 42:50
2. DVC Setup Error: 48:45
3. Containerization with a Virtual Environment: 49:40
4. Create Version and ML Pipeline: 56:50
5. DVC Checkout: 57:50
6. How to decide which commit ID to go back to - through commit messages?: 1:00:00
7. What is Kubernetes?
8. Not able to understand by reading the documentation: 1:04:30
9. Getting a commit count of 11k+: 1:09:40
Week 3: End-to-end ML lifecycle management
Setting up MLflow: Understand MLflow and its alternatives in depth.
Life-cycle components: Projects, model registry, performance tracking.
Best Practices for ML Lifecycle
7. Session 5 - ML Pipelines and Experimentation Tracking
1. Doubts
1. DVC Track by Add
2. Git clone with SSH vs HTTPS
2. Recap
3. Pipelines + DVC + Experimentation Tracking
4. MLflow
8. Session 6 on MLOps
1. Recap of Pipelines - Credit Card Example
2. Writing dvc.yaml File
3. Reproducibility after Data Changes
4. Reproducibility after Params Changes
5. ML End-to-End Pipeline
6. Tools for different Stages of Pipeline
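The dvc.yaml file discussed above can be sketched like this; the stage names, scripts, and paths are hypothetical, not the session's actual credit-card files:

```yaml
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - src/preprocess.py
      - data/raw.csv
    outs:
      - data/processed.csv
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed.csv
    params:
      - train.n_estimators   # changing this and re-running `dvc repro`
    outs:                    # re-executes only the affected stages
      - models/model.pkl
```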
9. Doubt Clearance Session 3
1. Assignment Solution
2. File not found error - Joblib/Data
3. Models Not found error
4. Get Familiar with the Terminal - dvc --help
5. DVC Repro vs DVC Exp
6. Session 2 on Discretization
1. Types of Discretization
1. Uniform Binning
2. Quantile Binning
3. K-Means Binning
4. Decision Tree Based Binning
5. Custom Binning
6. Threshold Binning (Binarization)
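Most of the binning types listed above are available through scikit-learn's KBinsDiscretizer; a minimal sketch on made-up example data:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Toy 1-D feature, invented for illustration
X = np.array([[1.0], [2.0], [2.5], [8.0], [9.0], [40.0]])

# Uniform, quantile, and k-means binning differ only in the strategy
for strategy in ["uniform", "quantile", "kmeans"]:
    disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    binned = disc.fit_transform(X)
    print(strategy, binned.ravel())

# Threshold binning (binarization) is just a comparison against a cutoff
binary = (X > 5.0).astype(int)
print("binarized", binary.ravel())
```

Decision-tree-based and custom binning are not built into KBinsDiscretizer; they are usually done by fitting a shallow tree or by passing hand-chosen edges to np.digitize.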
7. Session 1 on Handling Missing Data
1. Missing Values
2. The missingno library
3. Why do missing values occur?
4. Types of missing values
5. How do missing values impact ML?
6. How to handle missing values?
1. Removing
2. Imputing
7. Removing Missing Data
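The removing-vs-imputing choice above can be sketched in a few lines of pandas; the DataFrame is toy data invented for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "city": ["Delhi", "Pune", None, "Delhi"]})

# Removing: drop every row that has any missing value
dropped = df.dropna()

# Imputing: fill numeric columns with the mean, categorical with the mode
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].mean())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])

print(dropped.shape)                # two rows survive removal
print(imputed.isna().sum().sum())   # no missing values after imputation
```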
Unsupervised Learning
1. KMeans Clustering
1. Session 1 on KMeans Clustering
1. Plan of Attack (Getting Started with Clustering)
2. Types of ML Learning
3. Applications of Clustering
4. Geometric Intuition of K-Means
5. Elbow Method for Deciding Number of Clusters
1. Code Example
2. Limitation of Elbow Method
6. Assumptions of KMeans
7. Limitations of K Means
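The elbow method covered in this session can be sketched as follows: fit KMeans for a range of k and inspect the inertia; the blobs dataset is synthetic:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

inertias = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

# Inertia always decreases with k; the "elbow" where the drop flattens
# suggests the number of clusters (this is also the method's limitation:
# the elbow is often ambiguous on real data).
print([round(i) for i in inertias])
```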
2. Session 2 on KMeans Clustering
1. Recap of Last class
2. Assignment Solution
3. Silhouette Score
4. KMeans Hyperparameters
1. Number of Clusters(k)
2. Initialization Method (K Means++)
3. Number of Initialization Runs (n_init)
4. Maximum Number of Iterations (max_iter)
5. Tolerance (tol)
6. Algorithm (auto, full, ..)
7. Random State
5. K Means ++
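A sketch combining the silhouette score with the k-means++ initializer and the other hyperparameters listed above; the blobs data is synthetic:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

scores = {}
for k in [2, 3, 4, 5]:
    km = KMeans(n_clusters=k, init="k-means++", n_init=10,
                max_iter=300, tol=1e-4, random_state=0).fit(X)
    scores[k] = silhouette_score(X, km.labels_)
    print(k, round(scores[k], 3))

# The silhouette score lies in [-1, 1]; the k closest to +1 is preferred.
best_k = max(scores, key=scores.get)
print("best k by silhouette:", best_k)
```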
3. Session 3 on KMeans Clustering
1. K-Means Mathematical Formulation (Lloyd's Algorithm)
2. K-Means Time and Space Complexity
3. Mini Batch K Means
4. Types of Clustering
1. Partitional Clustering
2. Hierarchical Clustering
3. Density Based Clustering
4. Distribution/Model-based Clustering
4. K-Means Clustering Algorithms from Scratch in Python
1. Algorithms implementation from Scratch in Python
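A from-scratch sketch of Lloyd's algorithm from Session 3: assign each point to its nearest centroid, move each centroid to the mean of its points, repeat until nothing changes. The two-blob dataset is synthetic:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its cluster
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])
labels, centroids = kmeans(X, k=2)
print(sorted(np.bincount(labels)))  # the two well-separated blobs split 50/50
```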
2. Other Clustering Algorithms
1. Session on DBSCAN
1. Why DBSCAN?
2. What is Density Based Clustering
3. MinPts & Epsilon
4. Core Points, Border Points & Noise Points
5. Density Connected Points
6. DBSCAN Algorithm
7. Code
8. Limitations
9. Visualization
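A minimal sketch of the DBSCAN algorithm above on the classic two-moons shape (where k-means fails); eps and min_samples correspond to the Epsilon and MinPts hyperparameters:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                     # -1 marks noise points
n_clusters = len(set(labels) - {-1})    # clusters found, noise excluded
print("clusters:", n_clusters, "noise points:", list(labels).count(-1))
```

Note that DBSCAN never asks for the number of clusters; it falls out of the density parameters.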
2. Session on Hierarchical Clustering
1. Need for Other Clustering Methods
2. Introduction
3. Algorithm
4. Types of Agglomerative Clustering
1. Min (Single-link)
2. Max (Complete Link)
3. Average
4. Ward
5. How to find the ideal number of clusters
6. Hyperparameter
7. Code Example
8. Benefits/Limitations
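The four linkage criteria above map directly to scikit-learn's AgglomerativeClustering; a short sketch on synthetic blobs:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=7)

results = {}
for linkage in ["single", "complete", "average", "ward"]:
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit(X)
    results[linkage] = sorted(set(model.labels_))
    print(linkage, "->", results[linkage])
```

In practice the ideal number of clusters is read off a dendrogram (scipy.cluster.hierarchy.dendrogram) rather than fixed in advance as here.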
3. Session - 1 on Gaussian Mixture Models (GMM)
1. The Why?
2. The What?
3. Geometric Intuition
4. Multivariate Normal Distribution
5. Geometric Intuition 2D
6. EM (Expectation-Maximization) Algorithm
7. Python Code
4. Session - 2 on Gaussian Mixture Models
1. Recap of Session 1
2. Covariance Types: Spherical, Diagonal, Full, and Tied
3. How to decide n_components?
1. Akaike Information Criterion (AIC)
2. Bayesian Information Criterion (BIC)
3. Likelihood Formula for GMM
4. Python Implementation
5. Why not Silhouette Score?
4. Visualization
5. Assumptions
6. Advantages & Disadvantages
7. K Means vs GMM
8. DBSCAN vs GMM
9. Applications of GMM
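Choosing n_components via AIC/BIC, as outlined above, can be sketched like this; the data is a synthetic two-component mixture, so n = 2 should win:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(6, 1, (200, 2))])

bics = {}
for n in [1, 2, 3, 4]:
    gmm = GaussianMixture(n_components=n, covariance_type="full",
                          random_state=0).fit(X)
    bics[n] = gmm.bic(X)
    print(n, "AIC:", round(gmm.aic(X)), "BIC:", round(bics[n]))

# Lowest AIC/BIC wins; the silhouette score is less suitable here because
# GMM clusters are soft and may overlap.
print("best n by BIC:", min(bics, key=bics.get))
```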
5. Session on t-SNE
1. What is t-SNE?
2. Why learn t-SNE?
3. Geometric Intuition
4. Mathematical Formulation
5. Code Implementation
1. Session 2 on t-SNE
1. Mathematical Formulation
2. Some Questions!
1. Why use probabilities instead of distances to calculate similarity?
2. Why use Gaussian distribution to calculate similarity in high dimensions?
3. How is variance calculated for each Gaussian distribution?
4. Why use the t-distribution in lower dimensions?
3. Code Example
4. Hyperparameters
1. Perplexity
2. Learning Rate
3. Number of Iterations
5. Points of Wisdom
6. Advantages & Disadvantages
2. Apriori
1. Introduction: Principles of association rule mining
2. Key Concepts: Support, Confidence, Lift
3. Algorithm Steps: Candidate generation, Pruning
4. Applications: Market Basket Analysis, Recommender Systems
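The three key concepts above (support, confidence, lift) can be computed by hand on a toy basket dataset; the transactions are invented for illustration:

```python
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "milk"},
]
n = len(transactions)

def support(itemset):
    # Fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / n

# Rule: {bread} -> {milk}
sup_bread = support({"bread"})          # 4/5
sup_milk = support({"milk"})            # 4/5
sup_both = support({"bread", "milk"})   # 3/5
confidence = sup_both / sup_bread       # P(milk | bread)
lift = confidence / sup_milk            # < 1 means a slightly negative association

print(round(sup_both, 2), round(confidence, 2), round(lift, 4))
```

Apriori's candidate generation and pruning steps then use the fact that any subset of a frequent itemset must itself be frequent.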
1. Adaboost
1. Introduction: Overview and intuition of the algorithm
2. Components: Weak Learners, Weights, Final Model
3. Hyperparameters: Learning Rate, Number of Estimators
4. Applications: Use Cases in Classification and Regression
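A sketch of AdaBoost with the two hyperparameters named above; the dataset is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Weak learners (depth-1 trees by default) are combined with weights
# into the final model; learning_rate shrinks each learner's contribution.
clf = AdaBoostClassifier(n_estimators=100, learning_rate=0.5,
                         random_state=0).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(round(acc, 3))
```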
2. Stacking
1. Introduction: Concept of model ensembling
2. Steps: Base Models, Meta-Model, Final Prediction
3. Variations: Different approaches and modifications
4. Best Practices: Tips for effective stacking
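The three steps above (base models, meta-model, final prediction) can be sketched with scikit-learn's StackingClassifier on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, random_state=1)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50, random_state=1)),
                ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(),  # the meta-model
    cv=5,  # out-of-fold base predictions avoid leakage into the meta-model
)
stack.fit(X, y)
acc = stack.score(X, y)
print(round(acc, 3))
```

The cv argument is the main best-practice knob: the meta-model must be trained on predictions the base models made for data they did not see.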
3. LightGBM
Session 1 on Introduction to LightGBM
1. Introduction and core features
2. Boosting and Objective Function
3. Histogram-Based Split Finding
4. Best-fit Tree (Leaf-wise Growth Strategy)
5. Gradient-based One-Side Sampling (GOSS)
6. Exclusive Feature Bundling (EFB)
Session 2 on LightGBM (GOSS & EFB)
1. Recap - Features and Technical Aspects
2. Revisiting GOSS
3. EFB
4. CatBoost
Session 1 on CatBoost - Practical Introduction
1. Introduction
2. Advantages and Technical Aspects
3. Practical Implementation of CatBoost on Medical Cost Dataset
Miscellaneous Topics
1. NoSQL
1. Introduction: Overview of NoSQL databases
2. Types: Document, Key-Value, Column-Family, Graph
3. Use Cases: When to use NoSQL over SQL databases
4. Popular Databases: MongoDB, Cassandra, Redis, Neo4j
2. Model Explainability
1. Introduction: Importance of interpretable models
2. Techniques: LIME, SHAP, Feature Importance
3. Application: Applying techniques to various models
4. Best Practices: Ensuring reliable and accurate explanations
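Of the techniques listed above, permutation feature importance is the one built into scikit-learn (LIME and SHAP are separate libraries, not shown here); a sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Shuffle each feature in turn; the accuracy drop measures its importance
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for i, imp in enumerate(result.importances_mean):
    print(f"feature {i}: {imp:.3f}")
```

For reliable explanations, importance should be computed on held-out data rather than the training set used here for brevity.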
3. FastAPI
1. Introduction: Modern, fast web framework for building APIs
2. Features: Type checking, Automatic validation, Documentation
3. Building APIs: Steps and best practices
4. Deployment: Hosting and scaling FastAPI applications
4. AWS Sagemaker
1. Introduction: Fully managed service for machine learning
2. Features: Model building, Training, Deployment
3. Usage: Workflow from data preprocessing to model deployment
4. Best Practices: Optimizing costs and performance
Note: The schedule is tentative and topics can be added/removed from it in the
future.