
5-day KVCET Bootcamp - Data Analytics

MODE OF DELIVERY:
Offline (on-campus)

METHODOLOGY:
Hands-on & industry-aligned

TENTATIVE PROGRAM DURATION:
5 days

INDICATIVE CURRICULA:

Course Title: "Data Mastery: From Raw Data to Actionable Insights"

This 32-hour course takes students through the entire process of data gathering,
cleaning, and engineering, covering essential techniques to transform raw data into
structured, high-quality datasets ready for analysis or machine learning. The course
focuses on practical, real-world applications, using hands-on assignments and
industry-standard tools to prepare students for the data challenges they will face on the job.

Unit 1: Data Gathering - Collecting Data from Diverse Sources (6 hours)


Objective: Introduce students to the different methods of gathering structured and
unstructured data, and prepare them for handling data from a variety of real-world
sources.

Sub-unit 1.1: Introduction to Data Sources and Collection Methods (2 hours)



●​ Topics:
○​ Different Types of Data: Structured, semi-structured, and unstructured
○​ APIs: Using REST APIs to collect data (e.g., public datasets, third-party
services)
○​ Web Scraping: Extracting data from websites using tools like BeautifulSoup,
Scrapy, and Selenium
○​ Unstructured Data: Gathering data from non-structured sources like PDFs,
emails, or social media
○​ Cloud and Streaming Data: Collecting real-time data using cloud-based
services (AWS, Google Cloud, and APIs)
●​ Real-World Assignment:
○​ Task: Write a script that scrapes product data (e.g., name, price, and reviews)
from an e-commerce website and stores it in a structured CSV or JSON file.
○​ Deliverables: A Python script that scrapes live product data from a website
using BeautifulSoup and saves it in the required format.
●​ Test:
○​ Practical: Scrape data from a website and ensure that it's structured in the
correct format (CSV or JSON).
○​ Theory: Multiple-choice questions on types of data and methods of collection.
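For illustration, a minimal Python sketch of the Sub-unit 1.1 scraping assignment might look like the following. The URL and the CSS selectors (div.product, h2.name, span.price, span.review-count) are placeholders and would need to be adapted to the actual HTML of the target site.

    # Fetch a product listing page, parse it with BeautifulSoup, and save the
    # extracted fields to a CSV file. URL and selectors are placeholders.
    import csv
    import requests
    from bs4 import BeautifulSoup

    url = "https://example.com/products"              # placeholder URL
    response = requests.get(url, timeout=10)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    rows = []
    for card in soup.select("div.product"):           # assumed container class
        name = card.select_one("h2.name")
        price = card.select_one("span.price")
        reviews = card.select_one("span.review-count")
        rows.append({
            "name": name.get_text(strip=True) if name else "",
            "price": price.get_text(strip=True) if price else "",
            "reviews": reviews.get_text(strip=True) if reviews else "",
        })

    with open("products.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "price", "reviews"])
        writer.writeheader()
        writer.writerows(rows)

The same records can be written as JSON instead by replacing the csv block with json.dump(rows, f) after importing json.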

Sub-unit 1.2: Real-Time Data Collection and Cloud Integration (4 hours)

●​ Topics:

○​ Real-Time Data Streaming: Introduction to Apache Kafka, AWS Kinesis, and
Google Cloud Pub/Sub
○​ Working with REST APIs: Collecting real-time data using API endpoints for
live updates
○​ Integrating with Cloud Data Platforms: Collecting and storing data in
cloud-based databases (e.g., AWS DynamoDB, Google BigQuery)
●​ Real-World Assignment:
○​ Task: Set up a real-time data pipeline using Kafka or AWS Kinesis to stream
data from an API (e.g., Twitter) and store it in MongoDB or Google
BigQuery.
○​ Deliverables: A script that sets up real-time streaming and stores data in a
cloud database.
●​ Test:
○​ Practical: Create a real-time data pipeline that collects and stores data from
a public API and demonstrates handling real-time data streams.
○​ Theory: Questions on real-time data collection and cloud-based data
integration.
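As a reference point for Sub-unit 1.2, the sketch below wires a simple producer/consumer pair together using the kafka-python and pymongo client libraries. The broker address, topic name, API endpoint, and database names are placeholders; a managed service such as AWS Kinesis or Google Cloud Pub/Sub would use its own client SDK instead of kafka-python.

    # Poll a (placeholder) JSON API, publish each record to a Kafka topic, then
    # consume the topic and insert the records into MongoDB.
    import json

    import requests
    from kafka import KafkaConsumer, KafkaProducer
    from pymongo import MongoClient

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    def produce_once():
        # Fetch one batch from the API and publish it; in practice this would
        # run in a loop or on a schedule.
        data = requests.get("https://example.com/api/events", timeout=10).json()
        for record in data:
            producer.send("live-events", record)
        producer.flush()

    def consume_to_mongo():
        # Read messages from the topic and store them in a MongoDB collection.
        consumer = KafkaConsumer(
            "live-events",
            bootstrap_servers="localhost:9092",
            value_deserializer=lambda b: json.loads(b.decode("utf-8")),
            auto_offset_reset="earliest",
        )
        collection = MongoClient("mongodb://localhost:27017")["bootcamp"]["events"]
        for message in consumer:
            collection.insert_one(message.value)

In a real deployment the producer and consumer would run as separate processes, and the consumer could write to Google BigQuery through its own client library instead of MongoDB.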

Unit 2: Data Cleaning - Transforming Raw Data into Structured Format (10 hours)

Objective: Equip students with the skills to clean, preprocess, and prepare raw data
for analysis or machine learning.

Sub-unit 2.1: Handling Missing Data and Outliers (4 hours)

●​ Topics:
○​ Techniques for Handling Missing Data: Imputation methods (mean, median,
mode, KNN imputation)
○​ Dealing with Duplicates: Identifying and removing duplicates in datasets
○​ Outlier Detection and Treatment: Techniques for identifying and removing
outliers (IQR, Z-Score)
○​ Visualizing Missing Data: Using Python tools like Seaborn and Matplotlib to
visualize missing data
●​ Real-World Assignment:
○​ Task: Clean a messy dataset with missing values and outliers (e.g., a real
estate or customer dataset). Apply imputation and remove outliers where
appropriate.
○​ Deliverables: A cleaned dataset with imputation, outlier removal, and
visualizations showing the impact of these changes.

●​ Test:
○​ Practical: Clean a given dataset by handling missing values and outliers
using different techniques.
○​ Theory: Multiple-choice questions on handling missing data and outliers.
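A compact pandas sketch of the Sub-unit 2.1 cleaning steps is shown below. The file name and the "price" column are placeholders standing in for whichever dataset the assignment uses.

    # Visualize missingness, drop duplicates, impute missing values, and remove
    # outliers with the IQR rule. File and column names are placeholders.
    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns

    df = pd.read_csv("housing.csv")                   # placeholder dataset

    sns.heatmap(df.isna(), cbar=False)                # visualize missing values
    plt.title("Missing values before cleaning")
    plt.show()

    df = df.drop_duplicates()

    # Impute: median for numeric columns, mode for everything else
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        elif not df[col].mode().empty:
            df[col] = df[col].fillna(df[col].mode().iloc[0])

    # Remove outliers in a numeric column using the 1.5 * IQR rule
    q1, q3 = df["price"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df = df[df["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

KNN imputation, mentioned in the topics, would typically use sklearn.impute.KNNImputer on the numeric columns instead of the per-column fillna calls.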
Sub-unit 2.2: Data Transformation and Feature Engineering (6 hours)

●​ Topics:
○​ Data Transformation: Scaling, normalization, and standardization techniques
(e.g., Min-Max Scaling, Z-Score Normalization)
○​ Feature Engineering: Creating meaningful features from raw data (e.g.,
extracting date-related features, text analysis)
○​ Feature Selection and Dimensionality Reduction: Using techniques like PCA
and LDA to reduce the number of features
○​ Handling Categorical Data: Encoding techniques like One-Hot Encoding and
Label Encoding
●​ Real-World Assignment:

○​ Task: Transform a dataset using normalization, feature extraction, and feature
selection. Apply one-hot encoding to categorical variables and scale
numerical features.
○​ Deliverables: A transformed dataset with new features and encoded
categorical data, ready for machine learning.
●​ Test:
○​ Practical: Implement feature engineering and data transformation techniques
on a given dataset.
○​ Theory: Short-answer questions on scaling, normalization, and feature
extraction techniques.
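The transformations in Sub-unit 2.2 can be sketched with pandas and scikit-learn as follows; the input file and column names (signup_date, region, age, income) are placeholders.

    # Date-based feature extraction, one-hot encoding, Min-Max scaling, and an
    # optional PCA step. Column names are placeholders.
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import MinMaxScaler

    df = pd.read_csv("customers.csv")                 # placeholder dataset

    # Feature engineering: derive date-related features from a timestamp column
    df["signup_date"] = pd.to_datetime(df["signup_date"])
    df["signup_month"] = df["signup_date"].dt.month
    df["signup_dayofweek"] = df["signup_date"].dt.dayofweek

    # One-hot encode a categorical column
    df = pd.get_dummies(df, columns=["region"], drop_first=True)

    # Scale numeric features into the [0, 1] range
    numeric_cols = ["age", "income", "signup_month", "signup_dayofweek"]
    df[numeric_cols] = MinMaxScaler().fit_transform(df[numeric_cols])

    # Optional dimensionality reduction on the scaled numeric block
    components = PCA(n_components=2).fit_transform(df[numeric_cols])

Label encoding and LDA follow the same fit/transform pattern via sklearn.preprocessing.LabelEncoder and sklearn.discriminant_analysis.LinearDiscriminantAnalysis.
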
Unit 3: Advanced Data Engineering - Building Data Pipelines and
Automating Workflows (8 hours)

Objective: Teach students how to automate and scale their data processing
workflows, handling large datasets efficiently.

Sub-unit 3.1: Building and Automating ETL Pipelines (4 hours)

●​ Topics:
○​ What is ETL? Understanding Extract, Transform, Load processes
○​ Automating ETL: Using Apache Airflow for scheduling and managing ETL
workflows

○​ Building Data Pipelines: Using Python, Airflow, and AWS Lambda for
automating ETL tasks
○​ Data Warehousing: Introduction to cloud data warehouses (e.g., Google
BigQuery, AWS Redshift)
●​ Real-World Assignment:
○​ Task: Set up an automated ETL pipeline using Airflow to extract data from an
API, transform it (cleaning, feature engineering), and load it into a
cloud-based data warehouse.
○​ Deliverables: A functional ETL pipeline that runs on a schedule, extracts data
from an API, processes it, and stores it in a cloud data warehouse.
●​ Test:
○​ Practical: Build a simple ETL pipeline using Airflow to automate the process
of data extraction, transformation, and loading into a database.
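One possible shape for the Sub-unit 3.1 pipeline is sketched below as an Airflow DAG. The API URL, the temporary file paths, and the load step are placeholders; a production pipeline would load the cleaned data into BigQuery or Redshift through the matching Airflow provider hook rather than printing a message.

    # A three-task ETL DAG: extract from an API, transform with pandas, load
    # into a warehouse (stubbed out here). Paths and URLs are placeholders.
    from datetime import datetime

    import pandas as pd
    import requests
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        data = requests.get("https://example.com/api/sales", timeout=30).json()
        pd.DataFrame(data).to_csv("/tmp/raw_sales.csv", index=False)

    def transform():
        df = pd.read_csv("/tmp/raw_sales.csv")
        df = df.drop_duplicates().dropna()
        df.to_csv("/tmp/clean_sales.csv", index=False)

    def load():
        # Placeholder: replace with a warehouse load (BigQuery, Redshift, etc.)
        print("loading /tmp/clean_sales.csv into the warehouse")

    with DAG(
        dag_id="sales_etl",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        transform_task = PythonOperator(task_id="transform", python_callable=transform)
        load_task = PythonOperator(task_id="load", python_callable=load)
        extract_task >> transform_task >> load_task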

Sub-unit 3.2: Big Data Engineering and Real-Time Data Processing (4 hours)

●​ Topics:

○​ Introduction to Big Data Technologies: Using Apache Spark for distributed
data processing
○​ Working with Hadoop and Spark: Writing Spark jobs for large-scale data
transformation
○​ Real-Time Data Processing: Using Apache Kafka and Apache Flink for
streaming data processing
○​ Cloud Data Processing: Using AWS EMR or Google Dataproc for scalable
data processing in the cloud
●​ Real-World Assignment:
○​ Task: Process a large dataset (e.g., logs or sales data) using Apache Spark
on AWS EMR or Google Dataproc, performing distributed transformations.
○​ Deliverables: A data processing job that uses Spark to process and analyze
a large dataset in a distributed environment.
●​ Test:
○​ Practical: Write a Spark job that processes a large dataset and outputs a
transformed dataset.
○​ Theory: Multiple-choice questions on big data technologies, Spark, and
real-time processing.
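A small PySpark job in the spirit of Sub-unit 3.2 is sketched below. The bucket paths and column names are placeholders; on AWS EMR or Google Dataproc the same script would be submitted with spark-submit against s3:// or gs:// paths.

    # Read a large CSV dataset, aggregate daily totals per region, and write the
    # result as Parquet. Paths and columns are placeholders.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("daily-sales-aggregation").getOrCreate()

    df = spark.read.csv("s3://my-bucket/sales/*.csv", header=True, inferSchema=True)

    daily_totals = (
        df.withColumn("order_date", F.to_date("order_timestamp"))
          .groupBy("order_date", "region")
          .agg(
              F.sum("amount").alias("total_amount"),
              F.count("*").alias("order_count"),
          )
    )

    daily_totals.write.mode("overwrite").parquet("s3://my-bucket/output/daily_totals")
    spark.stop()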

Unit 4: Final Project - From Raw Data to Actionable Insights (8 hours)

Objective: Bring together all the concepts and techniques learned in the course to
complete a full data engineering project.

Sub-unit 4.1: Capstone Project Setup and Data Preparation (4 hours)

●​ Topics:
○​ Selecting a Real-World Dataset: Choosing a large dataset from a domain
such as finance, healthcare, or e-commerce
○​ Data Cleaning and Transformation: Applying all learned data cleaning,
transformation, and engineering techniques
○​ Exploratory Data Analysis (EDA): Visualizing the data and summarizing key
patterns and insights
●​ Real-World Assignment:
○​ Task: Choose a real-world dataset and apply cleaning, transformation, and
feature engineering to prepare the data for analysis or machine learning.
○​ Deliverables: A fully cleaned and transformed dataset with visualizations and
insights from exploratory data analysis.
●​ Test:
○​ Practical: Clean and preprocess a real-world dataset and provide EDA
insights.
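A quick EDA pass for the Sub-unit 4.1 capstone preparation could start along these lines; the dataset file and the target_value column are placeholders for whichever domain the student chooses.

    # Basic exploratory data analysis: shape, summary statistics, missing-value
    # shares, a correlation heatmap, and one distribution plot.
    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns

    df = pd.read_csv("capstone_dataset.csv")          # placeholder dataset

    print(df.shape)
    print(df.describe(include="all"))
    print(df.isna().mean().sort_values(ascending=False))   # share missing per column

    sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
    plt.title("Correlation matrix")
    plt.show()

    sns.histplot(df["target_value"], bins=30)         # placeholder column
    plt.title("Distribution of target_value")
    plt.show()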

Sub-unit 4.2: Data Pipeline Automation and Reporting (4 hours)

●​ Topics:
○​ Automating the Data Pipeline: Building a full end-to-end ETL pipeline to
automate data processing tasks
○​ Reporting: Generating insights from the processed data and creating reports
or dashboards
○​ Final Presentation: Preparing the final project report and presentation for
stakeholders
●​ Real-World Assignment:
○​ Task: Complete the data pipeline project by setting up automation, generating
reports, and visualizing insights in a dashboard (e.g., using Tableau or Power
BI).
○​ Deliverables: An automated ETL pipeline with a final report or dashboard
showcasing the insights derived from the data.
●​ Test:
○​ Practical: Present a fully functional data pipeline project with automated
processing and real-time insights.

Assessment and Certification:

1.​ Hands-On Projects: Real-world assignments like web scraping, data transformation,
and building automated ETL pipelines.
2.​ Final Project: Build a complete data pipeline that automates the collection,
transformation, and reporting of insights from a large dataset.
3.​ Exams:
○​ Practical: Build a full data pipeline, including collecting, cleaning, and
transforming data, and generating reports.
○​ Theory: Multiple-choice and short-answer questions on data gathering,
cleaning, and engineering techniques.
Tools and Technologies Covered:

●​ Languages: Python, SQL


●​ Libraries: Pandas, NumPy, BeautifulSoup, Scrapy, Airflow, Matplotlib, Seaborn
●​ Tools: Apache Kafka, Apache Spark, AWS, Google Cloud, MongoDB Atlas
●​ Other: Jupyter, Tableau, Power BI

This 32-hour course ensures students are equipped to handle the entire data pipeline
process, from data collection and cleaning to advanced engineering and real-time
processing, preparing them for real-world data engineering challenges.

Please note: The proposed curricula are only indicative, and will be modified to fit the requirements of
the students at the institution. Estimated program delivery durations may vary based on client
approvals and the actual length of the course.
