Course Outline: Hadoop and Spark for Big Data and Data Science

This course outline covers Hadoop and Spark for big data and data science over 20 weeks. It introduces key concepts like HDFS, YARN, Sqoop, Hive, Impala, Flume and Spark. It teaches how to use these tools to store, process and analyze large datasets in parallel. The course also covers machine learning using Spark MLlib, deep learning with BigDL and deploying Spark applications on cloud platforms.

Uploaded by

Yusra Shaikh
Copyright
© All Rights Reserved

Course Outline:

Hadoop and Spark for Big Data and Data Science


By Data Science Studio

Week-1: Introduction to Big Data and Data Science


 What is Big Data
 What is Data Science

Week-2: Introduction to Hadoop and the Hadoop Ecosystem


 Problems with Traditional Large Scale Systems
 Hadoop
 Data Storage and Ingest
 Data Processing
 Data Analysis and Exploration
 Other Ecosystem Tools
 Homework Lab: Setup Hadoop

Week-3: Hadoop Architecture and Hadoop Distributed File System (HDFS)


 Distributed Processing on a Cluster
 Storage: HDFS architecture
 Storage: Using HDFS
 Homework Lab: Access HDFS with Command Line and Hue
 Resource Management: YARN Architecture
 Resource Management: Working with YARN
 Homework Lab: Run a YARN Job
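One way to get a feel for the HDFS architecture topics above is a back-of-the-envelope calculation of how a file is stored; a minimal sketch in plain Python (128 MB blocks and 3x replication are common defaults, not fixed requirements):

```python
import math

def hdfs_storage(file_size_bytes, block_size=128 * 1024**2, replication=3):
    """Estimate how many HDFS blocks a file occupies and the raw
    storage consumed once every block is replicated across the cluster."""
    blocks = math.ceil(file_size_bytes / block_size)
    raw_bytes = file_size_bytes * replication
    return blocks, raw_bytes

# A 1 GiB file with default settings: 8 blocks, 3 GiB of raw storage.
blocks, raw = hdfs_storage(1024**3)
```

The block count also matters later in the course: it bounds how many map tasks or Spark partitions can read the file in parallel.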

Week-4: Importing Relational Data with Apache Sqoop


 Sqoop Overview
 Basic Imports and Exports
 Limiting Results
 Improving Sqoop’s Performance
 Sqoop 2
 Homework Lab: Import Data from MySQL Using Sqoop
Week-5: Introduction to Impala and Hive
 Introduction to Impala and Hive
 Why use Impala and Hive
 Querying Data with Impala and Hive
 Comparing Impala and Hive to Traditional Databases

Week-6: Modeling and Managing Data with Impala and Hive


 Data Storage Overview
 Creating Databases and Tables
 Loading Data into Tables
 HCatalog
 Impala Metadata Caching
 Homework Lab: Create and Populate Tables in Impala or Hive

Week-7: Data Formats


 File Formats
 Avro Schemas
 Avro Schema Evolution
 Using Avro with Impala, Hive and Sqoop
 Using Parquet with Impala, Hive and Sqoop
 Compression
 Homework Lab: Select a Format for a Data File
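An Avro schema is simply a JSON document, which is what makes schema evolution tractable; a minimal sketch (the record and field names here are invented for illustration):

```python
import json

# A hypothetical record schema for web-server log events.
schema = {
    "type": "record",
    "name": "LogEvent",
    "fields": [
        {"name": "ip", "type": "string"},
        {"name": "timestamp", "type": "long"},
        # A union with "null" plus a default makes the field optional --
        # adding a field this way is a backward-compatible schema change.
        {"name": "referrer", "type": ["null", "string"], "default": None},
    ],
}
print(json.dumps(schema, indent=2))
```

The same schema file can be handed to Sqoop on import and registered with Impala or Hive tables, which is why the outline treats Avro alongside those tools.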

Week-8: Data File Partitioning


 Partitioning Overview
 Partitioning in Impala and Hive
 Conclusion
 Homework Lab: Partition Data in Impala or Hive
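Partitioning in Impala and Hive maps partition-column values onto a directory layout, so queries that filter on those columns can skip whole directories. A plain-Python sketch of that layout (table and column names are illustrative):

```python
from collections import defaultdict

def partition_path(table, row, partition_cols):
    """Build the Hive-style directory path (col=value/...) a row lands in."""
    parts = "/".join(f"{c}={row[c]}" for c in partition_cols)
    return f"{table}/{parts}"

rows = [
    {"id": 1, "year": 2023, "month": 1},
    {"id": 2, "year": 2023, "month": 2},
    {"id": 3, "year": 2023, "month": 1},
]
layout = defaultdict(list)
for r in rows:
    layout[partition_path("logs", r, ["year", "month"])].append(r["id"])
# A query with WHERE year = 2023 AND month = 1 would only read
# the logs/year=2023/month=1 directory.
```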

Week-9: Capturing Data with Apache Flume


 What is Apache Flume?
 Basic Flume Architecture
 Flume Sources
 Flume Sinks
 Flume Channels
 Flume Configurations
 Homework Lab: Collect Web Server Logs with Flume
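A Flume configuration is a properties file that wires one agent's sources, channels, and sinks together; a minimal sketch for the web-server-log lab above (the agent name and file paths are invented):

```properties
# One agent (a1) tailing a web-server log into HDFS.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/httpd/access_log
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/weblogs
a1.sinks.k1.channel = c1
```

The source produces events, the channel buffers them, and the sink drains them, which is exactly the source/channel/sink split the bullets above introduce.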

Week-10: Spark Basics


 What is Apache Spark?
 Using the Spark Shell
 RDDs (Resilient Distributed Datasets)
 Functional Programming in Spark
 Homework Lab:
o View the Spark Documentation
o Explore RDDs Using Spark Shell
o Use RDDs to Transform a Dataset
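The functional-programming style Spark's RDD API uses can be previewed without a cluster; here plain Python stands in for an RDD (the Spark chain in the comment is shape only, not runnable here):

```python
# RDD-style transformation chain, written against an ordinary list.
# In the Spark shell the same chain would be lazy and distributed.
lines = ["error: disk full", "info: ok", "error: timeout"]

errors = list(
    map(lambda s: s.upper(),
        filter(lambda s: s.startswith("error"), lines))
)
# Spark equivalent (shape only):
#   sc.textFile(path).filter(...).map(...).collect()
```

The key habit is the same in both settings: build the result by composing small, side-effect-free functions rather than mutating the data in place.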

Week-11: Working with RDDs in Spark


 Creating RDDs
 Other General RDD Operations
 Homework Lab: Process Data Files with Spark

Week-12: Aggregating Data with Pair RDDs


 Key-value Pair RDDs
 MapReduce
 Other Pair RDD Operations
 Homework Lab: Use Pair RDDs to Join Two Datasets
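The two pair-RDD operations this week leans on, reduceByKey and join, can be sketched in plain Python (these are stand-ins for the Spark operations, written to show the semantics, not the distributed execution):

```python
from collections import defaultdict
from itertools import product

def reduce_by_key(pairs, fn):
    """Plain-Python stand-in for Spark's reduceByKey."""
    acc = {}
    for k, v in pairs:
        acc[k] = fn(acc[k], v) if k in acc else v
    return sorted(acc.items())

def join(left, right):
    """Plain-Python stand-in for an inner join of two pair RDDs:
    every (k, l) pairs with every (k, r) sharing the same key."""
    lmap, rmap = defaultdict(list), defaultdict(list)
    for k, v in left:
        lmap[k].append(v)
    for k, v in right:
        rmap[k].append(v)
    return sorted(
        (k, (lv, rv))
        for k in lmap.keys() & rmap.keys()
        for lv, rv in product(lmap[k], rmap[k])
    )

counts = reduce_by_key([("a", 1), ("b", 1), ("a", 1)], lambda x, y: x + y)
joined = join([("a", 1), ("b", 2)], [("a", "x"), ("c", "y")])
```

reduce_by_key with addition is the classic word-count pattern, which is why the outline pairs it with MapReduce.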

Week-13: Writing and Deploying Spark Applications


 Spark Application vs. Spark Shell
 Creating the SparkContext
 Building a Spark Application (Scala and Java)
 Running a Spark Application
 The Spark Application Web UI
 Homework Lab: Write and Run a Spark Application
 Configure Spark Properties
 Logging
 Homework Lab: Configure a Spark Application

Week-14: Parallel Processing in Spark


 Review: Spark on a Cluster
 RDD Partitions
 Partitioning of File Based RDDs
 HDFS and Data Locality
 Executing Parallel Operations
 Stages and Tasks
 Homework Lab: View Jobs and Stages in the Spark Application UI
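How a file-based RDD becomes parallel work can be sketched in plain Python: the input is split into partitions, and each partition becomes one task in a stage (a stand-in for Spark's behavior, ignoring HDFS block boundaries and locality):

```python
def make_partitions(records, num_partitions):
    """Split the input into roughly equal partitions."""
    size, rem = divmod(len(records), num_partitions)
    parts, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < rem else 0)
        parts.append(records[start:end])
        start = end
    return parts

parts = make_partitions(list(range(10)), 3)
# Each partition would be one task; tasks run in parallel,
# one per available executor core.
results = [sum(p) for p in parts]
```

In real Spark jobs the partition count for a file-based RDD defaults to the number of underlying HDFS blocks, which ties this back to data locality.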

Week-15: Spark RDD Persistence


 RDD Lineage
 RDD Persistence Overview
 Distributed Persistence
 Homework Lab: Persist an RDD
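The relationship between lineage and persistence can be sketched in a few lines of plain Python (a toy model, not Spark's implementation): without persist, every action recomputes the dataset from its lineage; after persist, actions are served from the cached copy.

```python
class LazyDataset:
    """Toy model of an RDD: recompute from the lineage on every
    collect() unless a persisted copy exists."""
    def __init__(self, compute):
        self._compute = compute    # the lineage: how to rebuild the data
        self._cache = None
        self.computations = 0      # visible cost of not persisting

    def persist(self):
        if self._cache is None:
            self._cache = self.collect()
        return self

    def collect(self):
        if self._cache is not None:
            return self._cache
        self.computations += 1
        return self._compute()

ds = LazyDataset(lambda: [x * x for x in range(5)])
ds.collect(); ds.collect()    # recomputed twice
ds.persist()
ds.collect(); ds.collect()    # now served from the cache
```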

Week-16: Common Patterns in Spark Data Processing


 Common Spark Use Cases
 Iterative Algorithms in Spark
 Graph Processing and Analysis
 Machine Learning
 Example: k-means
 Homework Lab: Iterative Processing in Spark
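The k-means example above is iterative: assign each point to its nearest center, move each center to the mean of its points, repeat. A minimal one-dimensional sketch in plain Python (Spark's MLlib version distributes the same loop, with the points cached in memory across iterations):

```python
def kmeans_1d(points, centers, iterations=10):
    """Minimal 1-D k-means: nearest-center assignment, then
    recompute each center as the mean of its cluster."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

centers = kmeans_1d([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [0.0, 5.0])
```

Iterative jobs like this are exactly where RDD persistence (Week-15) pays off: the input is re-read on every pass unless it is cached.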

Week-17: Spark SQL and DataFrames


 Spark SQL and the SQL Context
 Creating DataFrames
 Transforming and Querying DataFrames
 Saving DataFrames
 DataFrames and RDDs
 Comparing Spark SQL, Impala and Hive-on-Spark
 Homework Lab: Use Spark SQL for ETL
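The query side of the DataFrame workflow is standard SQL; as a stand-in for Spark SQL so the example runs without a cluster, the same aggregation is shown here against sqlite3 (table and column names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("alice", 10.0), ("bob", 5.0), ("alice", 7.5)])

# In Spark SQL this would be spark.sql(...) over a registered DataFrame.
rows = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY total DESC"
).fetchall()
```

The difference in Spark is execution, not syntax: the same GROUP BY runs as a distributed shuffle over DataFrame partitions.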

Week-18: Running Machine Learning Algorithms Using Spark MLlib


 Machine Learning with Spark
 Preparing Data for Machine Learning
 Building a Linear Regression Model
 Evaluating a Linear Regression Model
 Visualizing a Linear Regression Model
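The model being built and evaluated above has a small closed form in the single-feature case; a plain-Python sketch of ordinary least squares (MLlib fits the same kind of model, but over distributed feature vectors):

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])  # data from y = 2x + 1
```

Evaluating the model then amounts to comparing slope * x + intercept against held-out ys, e.g. with mean squared error.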

Week-19: BigDL Distributed Deep Learning on Apache Spark


 What is Deep Learning
 What is BigDL
 Why use BigDL
 Installing and Building BigDL
 BigDL examples

Week-20: Working on Spark in the Cloud


 Spark implementation in Databricks
 Spark implementation in Cloudera
 Spark implementation in Amazon Web Services
