Course Outline: Hadoop and Spark for Big Data and Data Science

This 20-week course covers Hadoop and Spark for big data and data science. It introduces key components such as HDFS, YARN, Sqoop, Hive, Impala, Flume, and Spark, and teaches how to use these tools to store, process, and analyze large datasets in parallel. The course also covers machine learning with Spark MLlib, deep learning with BigDL, and deploying Spark applications on cloud platforms.
Week-2: Introduction to Hadoop and the Hadoop Ecosystem
- Problems with Traditional Large-Scale Systems
- Hadoop
- Data Storage and Ingest
- Data Processing
- Data Analysis and Exploration
- Other Ecosystem Tools
- Homework Lab: Set Up Hadoop
Week-3: Hadoop Architecture and Hadoop Distributed File System (HDFS)
- Distributed Processing on a Cluster
- Storage: HDFS Architecture
- Storage: Using HDFS
- Homework Lab: Access HDFS with Command Line and Hue
- Resource Management: YARN Architecture
- Resource Management: Working with YARN
- Homework Lab: Run a YARN Job
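HDFS stores each file as a sequence of fixed-size blocks (128 MB by default) replicated across DataNodes, which is what makes parallel processing of one file possible. A minimal sketch of that block layout in plain Python; the function name and constants here are illustrative, not part of any HDFS API:

```python
# Sketch: how HDFS would split a file into fixed-size blocks.
# BLOCK_SIZE and REPLICATION mirror common defaults; block_layout
# is a made-up helper for illustration only.

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the usual HDFS default
REPLICATION = 3                 # default replication factor

def block_layout(file_size: int, block_size: int = BLOCK_SIZE) -> list:
    """Return the size of each block a file of `file_size` bytes occupies."""
    full, last = divmod(file_size, block_size)
    return [block_size] * full + ([last] if last else [])

# A 300 MB file needs three blocks: two full ones and a 44 MB tail,
# each stored REPLICATION times across the cluster.
sizes = block_layout(300 * 1024 * 1024)
print(len(sizes), sizes[-1] // (1024 * 1024))  # 3 44
```

Each block can then be processed by an independent task, which is the basis of the data locality discussed later in the course.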
Week-4: Importing Relational Data with Apache Sqoop
- Sqoop Overview
- Basic Imports and Exports
- Limiting Results
- Improving Sqoop's Performance
- Sqoop 2
- Homework Lab: Import Data from MySQL Using Sqoop

Week-5: Introduction to Impala and Hive
- Introduction to Impala and Hive
- Why Use Impala and Hive
- Querying Data with Impala and Hive
- Comparing Impala and Hive to Traditional Databases
Week-6: Modeling and Managing Data with Impala and Hive
- Data Storage Overview
- Creating Databases and Tables
- Loading Data into Tables
- HCatalog
- Impala Metadata Caching
- Homework Lab: Create and Populate Tables in Impala or Hive
Week-7: Data Formats
- File Formats
- Avro Schemas
- Avro Schema Evolution
- Using Avro with Impala, Hive, and Sqoop
- Using Parquet with Impala, Hive, and Sqoop
- Compression
- Homework Lab: Select a Format for a Data File
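Avro schema evolution works because readers can resolve data written with an older schema against a newer one, as long as added fields carry defaults. A small illustrative schema (the record and field names are made up for this example): the `referrer` field was added after `user_id` and `url`, and its `null` default lets a new reader consume old records that lack it.

```json
{
  "type": "record",
  "name": "WebEvent",
  "fields": [
    {"name": "user_id", "type": "long"},
    {"name": "url", "type": "string"},
    {"name": "referrer", "type": ["null", "string"], "default": null}
  ]
}
```

Removing a field is handled symmetrically: a new reader simply ignores values its schema does not mention.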
Week-8: Data File Partitioning
- Partitioning Overview
- Partitioning in Impala and Hive
- Conclusion
- Homework Lab: Partition Data in Impala or Hive
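Hive- and Impala-style partitioning lays data out as one directory per partition-key value (for example `sales/year=2024/month=3/`), so queries that filter on the key only scan matching directories. A plain-Python sketch of that layout, assuming a hypothetical `sales` table partitioned by `year` and `month`:

```python
# Sketch: Hive/Impala-style partition pruning relies on a
# directory-per-key-value layout. Table and column names are made up.

from collections import defaultdict

def partition_path(table, record, keys):
    """Build the directory a record lands in, e.g. sales/year=2024/month=3."""
    parts = "/".join("{}={}".format(k, record[k]) for k in keys)
    return "{}/{}".format(table, parts)

records = [
    {"year": 2024, "month": 3, "amount": 10.0},
    {"year": 2024, "month": 4, "amount": 7.5},
    {"year": 2024, "month": 3, "amount": 2.0},
]

# Group records by partition directory, as a dynamic-partition insert would.
by_dir = defaultdict(list)
for r in records:
    by_dir[partition_path("sales", r, ["year", "month"])].append(r)

print(sorted(by_dir))  # ['sales/year=2024/month=3', 'sales/year=2024/month=4']
```

A query with `WHERE month = 3` would then read only the first directory, which is the payoff the lab is meant to demonstrate.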
Week-9: Capturing Data with Apache Flume
- What is Apache Flume?
- Basic Flume Architecture
- Flume Sources
- Flume Sinks
- Flume Channels
- Flume Configurations
- Homework Lab: Collect Web Server Logs with Flume
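A Flume agent is wired together entirely in a properties file naming its sources, channels, and sinks. A minimal sketch for the web-server-logs lab, assuming an agent named `agent1` that tails an Apache access log (the log path and HDFS path are placeholders):

```
# Sketch of a Flume agent config: exec source -> memory channel -> HDFS sink.
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/httpd/access_log
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type = memory

agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /flume/weblogs
agent1.sinks.sink1.channel = ch1
```

The source pushes events into the channel, which buffers them until the sink writes them to HDFS; swapping the memory channel for a file channel trades throughput for durability.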
Week-10: Spark Basics
- What is Apache Spark?
- Using the Spark Shell
- RDDs (Resilient Distributed Datasets)
- Functional Programming in Spark
- Homework Lab:
  - View the Spark Documentation
  - Explore RDDs Using the Spark Shell
  - Use RDDs to Transform a Dataset
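A central RDD idea is laziness: transformations like `filter` and `map` only describe a computation, and nothing runs until an action such as `collect()` or `count()`. Plain-Python generators show the same behavior; this is an analogy to the concept, not the Spark API:

```python
# Sketch: RDD transformations are lazy; nothing runs until an action.
# Python generators give the same deferred-execution shape.

lines = ["INFO start", "ERROR disk full", "INFO done", "ERROR timeout"]

# "Transformations": build a lazy pipeline; no work happens yet.
errors = (line for line in lines if line.startswith("ERROR"))   # like rdd.filter(...)
messages = (line.split(" ", 1)[1] for line in errors)           # like rdd.map(...)

# "Action": consuming the pipeline finally triggers the computation,
# the way collect() would in Spark.
result = list(messages)
print(result)  # ['disk full', 'timeout']
```

In Spark this laziness is what lets the scheduler fuse a chain of transformations into a single pass over the data.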
Week-11: Working with RDDs in Spark
- Creating RDDs
- Other General RDD Operations
- Homework Lab: Process Data Files with Spark
Week-12: Aggregating Data with Pair RDDs
- Key-Value Pair RDDs
- MapReduce
- Other Pair RDD Operations
- Homework Lab: Use Pair RDDs to Join Two Datasets
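The two pair-RDD operations the lab leans on are `reduceByKey` (combine all values sharing a key) and `join` (pair up values from two datasets by key). Their semantics can be mimicked with plain dicts; the functions below are an illustrative analogy, not the Spark API:

```python
# Sketch: what reduceByKey and an inner join do to (key, value) pairs,
# written in plain Python for clarity. Sample data is made up.

from collections import defaultdict

def reduce_by_key(pairs, fn):
    """Fold together every value that shares a key, like rdd.reduceByKey(fn)."""
    out = {}
    for k, v in pairs:
        out[k] = fn(out[k], v) if k in out else v
    return sorted(out.items())

def join(left, right):
    """Inner join: every (k, v) pairs with every (k, w) sharing the key."""
    by_key = defaultdict(list)
    for k, w in right:
        by_key[k].append(w)
    return sorted((k, (v, w)) for k, v in left for w in by_key[k])

hits = [("home", 1), ("cart", 1), ("home", 1)]
print(reduce_by_key(hits, lambda a, b: a + b))  # [('cart', 1), ('home', 2)]

users = [("u1", "alice"), ("u2", "bob")]
orders = [("u1", 40), ("u1", 2), ("u2", 7)]
print(join(users, orders))
```

In Spark both operations trigger a shuffle so that equal keys land in the same partition, which is why they are the expensive steps in a pair-RDD pipeline.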
Week-13: Writing and Deploying Spark Applications
- Spark Application vs. Spark Shell
- Creating the SparkContext
- Building a Spark Application (Scala and Java)
- Running a Spark Application
- The Spark Application Web UI
- Homework Lab: Write and Run a Spark Application
- Configure Spark Properties
- Logging
- Homework Lab: Configure a Spark Application
Week-14: Parallel Processing in Spark
- Review: Spark on a Cluster
- RDD Partitions
- Partitioning of File-Based RDDs
- HDFS and Data Locality
- Executing Parallel Operations
- Stages and Tasks
- Homework Lab: View Jobs and Stages in the Spark Application UI
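The core of parallel processing in Spark is that an RDD is split into partitions and one task runs per partition, with a final combine step over the per-partition results. A sequential plain-Python sketch of that decomposition (the helper name is made up; real Spark uses a hash partitioner with the same idea):

```python
# Sketch: one task per partition, then a combine step.
# Runs sequentially here; Spark would run the per-partition
# work as parallel tasks on executors.

def hash_partition(items, n):
    """Assign each item to one of n partitions by hash, like Spark's HashPartitioner."""
    parts = [[] for _ in range(n)]
    for x in items:
        parts[hash(x) % n].append(x)
    return parts

data = list(range(100))
partitions = hash_partition(data, 4)

# One "task" per partition computes a partial result...
partial_sums = [sum(p) for p in partitions]
# ...and a final combine step merges them, like a reduce.
total = sum(partial_sums)
print(len(partitions), total)  # 4 4950
```

Because each partial sum depends only on its own partition, the tasks need no coordination until the combine, which is what makes the operation scale.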
Week-15/16: Common Patterns in Spark Data Processing
- Common Spark Use Cases
- Iterative Algorithms in Spark
- Graph Processing and Analysis
- Machine Learning Example: k-means
- Homework Lab: Iterative Processing in Spark
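k-means is the canonical iterative algorithm here: each pass re-reads the same dataset to reassign points and move the centers, which is exactly the access pattern Spark's RDD caching accelerates. A plain-Python sketch of that loop on one-dimensional points (toy data and the `kmeans_1d` name are made up for illustration):

```python
# Sketch: the iterative assign/update loop of k-means. In Spark,
# `points` would be a cached RDD re-traversed on every iteration.

def kmeans_1d(points, centers, iterations=10):
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (keeping the old center if a cluster came up empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

points = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5]
print(kmeans_1d(points, centers=[0.0, 5.0]))  # converges near [1.0, 10.0]
```

Without caching, each iteration would recompute the input RDD from source, so the speedup from `cache()` grows with the number of iterations.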
Week-17: Spark SQL and DataFrames
- Spark SQL and the SQL Context
- Creating DataFrames
- Transforming and Querying DataFrames
- Saving DataFrames
- DataFrames and RDDs
- Comparing Spark SQL, Impala, and Hive-on-Spark
- Homework Lab: Use Spark SQL for ETL
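The ETL lab has a simple shape: load raw rows, filter and aggregate with SQL, and save the cleaned result as a new table. Since a runnable example needs no cluster, here is that shape with the stdlib `sqlite3` module standing in for a SparkSession; the table and column names are invented, and in Spark SQL the same query would run via `spark.sql(...)` or a DataFrame `filter`/`groupBy` chain:

```python
# Sketch of the extract-transform-load pattern from the lab,
# using sqlite3 in place of Spark SQL so it runs standalone.

import sqlite3

conn = sqlite3.connect(":memory:")

# Extract: land the raw data in a staging table.
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount REAL, status TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)",
                 [(1, 10.0, "ok"), (2, -1.0, "bad"), (3, 5.5, "ok")])

# Transform + load: filter out invalid rows, aggregate, and
# persist the result, as saving a DataFrame would in Spark.
conn.execute("""
    CREATE TABLE clean_orders AS
    SELECT status, COUNT(*) AS n, SUM(amount) AS total
    FROM raw_orders
    WHERE amount > 0
    GROUP BY status
""")

print(conn.execute("SELECT * FROM clean_orders").fetchall())  # [('ok', 2, 15.5)]
```

The point of the comparison topics above is that this declarative style lets the engine (Catalyst in Spark's case) optimize the plan instead of the programmer hand-tuning RDD operations.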
Week-18: Running Machine Learning Algorithms Using Spark MLlib
- Machine Learning with Spark
- Preparing Data for Machine Learning
- Building a Linear Regression Model
- Evaluating a Linear Regression Model
- Visualizing a Linear Regression Model
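To make the build-and-evaluate steps concrete before reaching for MLlib, simple linear regression has a closed-form least-squares solution that fits in a few lines. A plain-Python sketch on a toy dataset (the data and `fit_line` helper are made up; MLlib's `LinearRegression` distributes this same computation over an RDD or DataFrame):

```python
# Sketch: closed-form least squares for y = slope * x + intercept,
# then evaluation by mean squared error, on toy data.

def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]       # exactly y = 2x, so the fit is perfect
slope, intercept = fit_line(xs, ys)

# Evaluate: mean squared error of the fitted model on the data.
mse = sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(slope, intercept, mse)
```

On real data the evaluation would use a held-out test set rather than the training points, which is the distinction the evaluation topic above covers.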
Week-19: BigDL Distributed Deep Learning on Apache Spark
- What is Deep Learning?
- What is BigDL?
- Why Use BigDL
- Installing and Building BigDL
- BigDL Examples
Week-20: Working on Spark in the Cloud
- Spark implementation in Databricks
- Spark implementation in Cloudera
- Spark implementation in Amazon Web Services