Course Outline: Hadoop and Spark for Big Data and Data Science

This 20-week course covers Hadoop and Spark for big data and data science. It introduces key components such as HDFS, YARN, Sqoop, Hive, Impala, Flume, and Spark, and teaches how to use these tools to store, process, and analyze large datasets in parallel. The course also covers machine learning with Spark MLlib, deep learning with BigDL, and deploying Spark applications on cloud platforms.
Week-2: Introduction to Hadoop and the Hadoop Ecosystem
- Problems with Traditional Large-Scale Systems
- Hadoop
- Data Storage and Ingest
- Data Processing
- Data Analysis and Exploration
- Other Ecosystem Tools
- Homework Lab: Set Up Hadoop
Week-3: Hadoop Architecture and Hadoop Distributed File System (HDFS)
- Distributed Processing on a Cluster
- Storage: HDFS Architecture
- Storage: Using HDFS
- Homework Lab: Access HDFS with Command Line and Hue
- Resource Management: YARN Architecture
- Resource Management: Working with YARN
- Homework Lab: Run a YARN Job
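HDFS stores each file as a sequence of fixed-size blocks (128 MB by default) replicated across DataNodes, which is what makes parallel processing of one file possible. A minimal sketch of that block layout in plain Python; the function name and constants here are illustrative, not part of any HDFS API:

```python
# Sketch: how HDFS would split a file into fixed-size blocks.
# BLOCK_SIZE and REPLICATION mirror common defaults; block_layout
# is a made-up helper for illustration only.

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the usual HDFS default
REPLICATION = 3                 # default replication factor

def block_layout(file_size: int, block_size: int = BLOCK_SIZE) -> list:
    """Return the size of each block a file of `file_size` bytes occupies."""
    full, last = divmod(file_size, block_size)
    return [block_size] * full + ([last] if last else [])

# A 300 MB file needs three blocks: two full ones and a 44 MB tail,
# each stored REPLICATION times across the cluster.
sizes = block_layout(300 * 1024 * 1024)
print(len(sizes), sizes[-1] // (1024 * 1024))  # 3 44
```

Each block can then be processed by an independent task, which is the basis of the data locality discussed later in the course.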
Week-4: Importing Relational Data with Apache Sqoop
- Sqoop Overview
- Basic Imports and Exports
- Limiting Results
- Improving Sqoop's Performance
- Sqoop 2
- Homework Lab: Import Data from MySQL Using Sqoop

Week-5: Introduction to Impala and Hive
- Introduction to Impala and Hive
- Why Use Impala and Hive
- Querying Data with Impala and Hive
- Comparing Impala and Hive to Traditional Databases
Week-6: Modeling and Managing Data with Impala and Hive
- Data Storage Overview
- Creating Databases and Tables
- Loading Data into Tables
- HCatalog
- Impala Metadata Caching
- Homework Lab: Create and Populate Tables in Impala or Hive
Week-7: Data Formats
- File Formats
- Avro Schemas
- Avro Schema Evolution
- Using Avro with Impala, Hive, and Sqoop
- Using Parquet with Impala, Hive, and Sqoop
- Compression
- Homework Lab: Select a Format for a Data File
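Avro schema evolution works because readers can resolve data written with an older schema against a newer one, as long as added fields carry defaults. A small illustrative schema (the record and field names are made up for this example): the `referrer` field was added after `user_id` and `url`, and its `null` default lets a new reader consume old records that lack it.

```json
{
  "type": "record",
  "name": "WebEvent",
  "fields": [
    {"name": "user_id", "type": "long"},
    {"name": "url", "type": "string"},
    {"name": "referrer", "type": ["null", "string"], "default": null}
  ]
}
```

Removing a field is handled symmetrically: a new reader simply ignores values its schema does not mention.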
Week-8: Data File Partitioning
- Partitioning Overview
- Partitioning in Impala and Hive
- Conclusion
- Homework Lab: Partition Data in Impala or Hive
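Hive- and Impala-style partitioning lays data out as one directory per partition-key value (for example `sales/year=2024/month=3/`), so queries that filter on the key only scan matching directories. A plain-Python sketch of that layout, assuming a hypothetical `sales` table partitioned by `year` and `month`:

```python
# Sketch: Hive/Impala-style partition pruning relies on a
# directory-per-key-value layout. Table and column names are made up.

from collections import defaultdict

def partition_path(table, record, keys):
    """Build the directory a record lands in, e.g. sales/year=2024/month=3."""
    parts = "/".join("{}={}".format(k, record[k]) for k in keys)
    return "{}/{}".format(table, parts)

records = [
    {"year": 2024, "month": 3, "amount": 10.0},
    {"year": 2024, "month": 4, "amount": 7.5},
    {"year": 2024, "month": 3, "amount": 2.0},
]

# Group records by partition directory, as a dynamic-partition insert would.
by_dir = defaultdict(list)
for r in records:
    by_dir[partition_path("sales", r, ["year", "month"])].append(r)

print(sorted(by_dir))  # ['sales/year=2024/month=3', 'sales/year=2024/month=4']
```

A query with `WHERE month = 3` would then read only the first directory, which is the payoff the lab is meant to demonstrate.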
Week-9: Capturing Data with Apache Flume
- What is Apache Flume?
- Basic Flume Architecture
- Flume Sources
- Flume Sinks
- Flume Channels
- Flume Configurations
- Homework Lab: Collect Web Server Logs with Flume
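A Flume agent is wired together entirely in a properties file naming its sources, channels, and sinks. A minimal sketch for the web-server-logs lab, assuming an agent named `agent1` that tails an Apache access log (the log path and HDFS path are placeholders):

```
# Sketch of a Flume agent config: exec source -> memory channel -> HDFS sink.
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/httpd/access_log
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type = memory

agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /flume/weblogs
agent1.sinks.sink1.channel = ch1
```

The source pushes events into the channel, which buffers them until the sink writes them to HDFS; swapping the memory channel for a file channel trades throughput for durability.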
Week-10: Spark Basics
- What is Apache Spark?
- Using the Spark Shell
- RDDs (Resilient Distributed Datasets)
- Functional Programming in Spark
- Homework Lab:
  - View the Spark Documentation
  - Explore RDDs Using the Spark Shell
  - Use RDDs to Transform a Dataset
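A central RDD idea is laziness: transformations like `filter` and `map` only describe a computation, and nothing runs until an action such as `collect()` or `count()`. Plain-Python generators show the same behavior; this is an analogy to the concept, not the Spark API:

```python
# Sketch: RDD transformations are lazy; nothing runs until an action.
# Python generators give the same deferred-execution shape.

lines = ["INFO start", "ERROR disk full", "INFO done", "ERROR timeout"]

# "Transformations": build a lazy pipeline; no work happens yet.
errors = (line for line in lines if line.startswith("ERROR"))   # like rdd.filter(...)
messages = (line.split(" ", 1)[1] for line in errors)           # like rdd.map(...)

# "Action": consuming the pipeline finally triggers the computation,
# the way collect() would in Spark.
result = list(messages)
print(result)  # ['disk full', 'timeout']
```

In Spark this laziness is what lets the scheduler fuse a chain of transformations into a single pass over the data.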
Week-11: Working with RDDs in Spark
- Creating RDDs
- Other General RDD Operations
- Homework Lab: Process Data Files with Spark
Week-12: Aggregating Data with Pair RDDs
- Key-Value Pair RDDs
- MapReduce
- Other Pair RDD Operations
- Homework Lab: Use Pair RDDs to Join Two Datasets
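The two pair-RDD operations the lab leans on are `reduceByKey` (combine all values sharing a key) and `join` (pair up values from two datasets by key). Their semantics can be mimicked with plain dicts; the functions below are an illustrative analogy, not the Spark API:

```python
# Sketch: what reduceByKey and an inner join do to (key, value) pairs,
# written in plain Python for clarity. Sample data is made up.

from collections import defaultdict

def reduce_by_key(pairs, fn):
    """Fold together every value that shares a key, like rdd.reduceByKey(fn)."""
    out = {}
    for k, v in pairs:
        out[k] = fn(out[k], v) if k in out else v
    return sorted(out.items())

def join(left, right):
    """Inner join: every (k, v) pairs with every (k, w) sharing the key."""
    by_key = defaultdict(list)
    for k, w in right:
        by_key[k].append(w)
    return sorted((k, (v, w)) for k, v in left for w in by_key[k])

hits = [("home", 1), ("cart", 1), ("home", 1)]
print(reduce_by_key(hits, lambda a, b: a + b))  # [('cart', 1), ('home', 2)]

users = [("u1", "alice"), ("u2", "bob")]
orders = [("u1", 40), ("u1", 2), ("u2", 7)]
print(join(users, orders))
```

In Spark both operations trigger a shuffle so that equal keys land in the same partition, which is why they are the expensive steps in a pair-RDD pipeline.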
Week-13: Writing and Deploying Spark Applications
- Spark Application vs. Spark Shell
- Creating the SparkContext
- Building a Spark Application (Scala and Java)
- Running a Spark Application
- The Spark Application Web UI
- Homework Lab: Write and Run a Spark Application
- Configure Spark Properties
- Logging
- Homework Lab: Configure a Spark Application
Week-14: Parallel Processing in Spark
- Review: Spark on a Cluster
- RDD Partitions
- Partitioning of File-Based RDDs
- HDFS and Data Locality
- Executing Parallel Operations
- Stages and Tasks
- Homework Lab: View Jobs and Stages in the Spark Application UI
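The core of parallel processing in Spark is that an RDD is split into partitions and one task runs per partition, with a final combine step over the per-partition results. A sequential plain-Python sketch of that decomposition (the helper name is made up; real Spark uses a hash partitioner with the same idea):

```python
# Sketch: one task per partition, then a combine step.
# Runs sequentially here; Spark would run the per-partition
# work as parallel tasks on executors.

def hash_partition(items, n):
    """Assign each item to one of n partitions by hash, like Spark's HashPartitioner."""
    parts = [[] for _ in range(n)]
    for x in items:
        parts[hash(x) % n].append(x)
    return parts

data = list(range(100))
partitions = hash_partition(data, 4)

# One "task" per partition computes a partial result...
partial_sums = [sum(p) for p in partitions]
# ...and a final combine step merges them, like a reduce.
total = sum(partial_sums)
print(len(partitions), total)  # 4 4950
```

Because each partial sum depends only on its own partition, the tasks need no coordination until the combine, which is what makes the operation scale.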
Week-15/16: Common Patterns in Spark Data Processing
- Common Spark Use Cases
- Iterative Algorithms in Spark
- Graph Processing and Analysis
- Machine Learning Example: k-means
- Homework Lab: Iterative Processing in Spark
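k-means is the canonical iterative algorithm here: each pass re-reads the same dataset to reassign points and move the centers, which is exactly the access pattern Spark's RDD caching accelerates. A plain-Python sketch of that loop on one-dimensional points (toy data and the `kmeans_1d` name are made up for illustration):

```python
# Sketch: the iterative assign/update loop of k-means. In Spark,
# `points` would be a cached RDD re-traversed on every iteration.

def kmeans_1d(points, centers, iterations=10):
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (keeping the old center if a cluster came up empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

points = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5]
print(kmeans_1d(points, centers=[0.0, 5.0]))  # converges near [1.0, 10.0]
```

Without caching, each iteration would recompute the input RDD from source, so the speedup from `cache()` grows with the number of iterations.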
Week-17: Spark SQL and DataFrames
- Spark SQL and the SQL Context
- Creating DataFrames
- Transforming and Querying DataFrames
- Saving DataFrames
- DataFrames and RDDs
- Comparing Spark SQL, Impala, and Hive-on-Spark
- Homework Lab: Use Spark SQL for ETL
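The ETL lab has a simple shape: load raw rows, filter and aggregate with SQL, and save the cleaned result as a new table. Since a runnable example needs no cluster, here is that shape with the stdlib `sqlite3` module standing in for a SparkSession; the table and column names are invented, and in Spark SQL the same query would run via `spark.sql(...)` or a DataFrame `filter`/`groupBy` chain:

```python
# Sketch of the extract-transform-load pattern from the lab,
# using sqlite3 in place of Spark SQL so it runs standalone.

import sqlite3

conn = sqlite3.connect(":memory:")

# Extract: land the raw data in a staging table.
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount REAL, status TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)",
                 [(1, 10.0, "ok"), (2, -1.0, "bad"), (3, 5.5, "ok")])

# Transform + load: filter out invalid rows, aggregate, and
# persist the result, as saving a DataFrame would in Spark.
conn.execute("""
    CREATE TABLE clean_orders AS
    SELECT status, COUNT(*) AS n, SUM(amount) AS total
    FROM raw_orders
    WHERE amount > 0
    GROUP BY status
""")

print(conn.execute("SELECT * FROM clean_orders").fetchall())  # [('ok', 2, 15.5)]
```

The point of the comparison topics above is that this declarative style lets the engine (Catalyst in Spark's case) optimize the plan instead of the programmer hand-tuning RDD operations.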
Week-18: Running Machine Learning Algorithms Using Spark MLlib
- Machine Learning with Spark
- Preparing Data for Machine Learning
- Building a Linear Regression Model
- Evaluating a Linear Regression Model
- Visualizing a Linear Regression Model
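To make the build-and-evaluate steps concrete before reaching for MLlib, simple linear regression has a closed-form least-squares solution that fits in a few lines. A plain-Python sketch on a toy dataset (the data and `fit_line` helper are made up; MLlib's `LinearRegression` distributes this same computation over an RDD or DataFrame):

```python
# Sketch: closed-form least squares for y = slope * x + intercept,
# then evaluation by mean squared error, on toy data.

def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]       # exactly y = 2x, so the fit is perfect
slope, intercept = fit_line(xs, ys)

# Evaluate: mean squared error of the fitted model on the data.
mse = sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(slope, intercept, mse)
```

On real data the evaluation would use a held-out test set rather than the training points, which is the distinction the evaluation topic above covers.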
Week-19: BigDL Distributed Deep Learning on Apache Spark
- What is Deep Learning?
- What is BigDL?
- Why Use BigDL
- Installing and Building BigDL
- BigDL Examples
Week-20: Working on Spark in the Cloud
- Spark implementation in Databricks
- Spark implementation in Cloudera
- Spark implementation in Amazon Web Services