Detailed Big Data and Hadoop Notes

The document provides detailed notes on Big Data and Hadoop, covering topics such as Big Data Analytics, the history and ecosystem of Hadoop, and the Hadoop Distributed File System (HDFS). It also discusses MapReduce job anatomy, various Hadoop ecosystem tools like Pig, Hive, and HBase, and introduces data analytics concepts including supervised and unsupervised learning. Additionally, it highlights IBM's integration of Hadoop into enterprise environments for enhanced data management.


Detailed Exam Notes: Big Data and Hadoop

Unit I: Introduction to Big Data and Hadoop

1. Big Data Analytics:

- Big Data refers to datasets that are too large or complex to process using traditional methods.

- Big Data Analytics involves analyzing such datasets to uncover hidden patterns, correlations, and insights.

- Types of Data:

- Structured: Tabular data with rows and columns (e.g., databases).

- Semi-structured: Data with some structure, like JSON or XML.

- Unstructured: Data with no predefined format (e.g., images, videos, emails).
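The structured/semi-structured distinction above can be made concrete with a small sketch using Python's standard library (CSV standing in for a database table, JSON for a self-describing record; the data values are made up for illustration):

```python
import csv
import io
import json

# Structured: tabular rows with a fixed schema (CSV standing in for a table).
table = "id,name\n1,Alice\n2,Bob\n"
rows = list(csv.DictReader(io.StringIO(table)))

# Semi-structured: each JSON record carries its own, possibly irregular, structure.
record = json.loads('{"id": 3, "name": "Carol", "tags": ["admin", "beta"]}')

print(rows[0]["name"])   # Alice
print(record["tags"])    # ['admin', 'beta']
```

Note how the CSV rows all share one schema, while the JSON record can hold nested fields (like the `tags` list) that a flat table cannot.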

2. History of Hadoop:

- Hadoop was inspired by Google's MapReduce and GFS (Google File System).

- Doug Cutting and Mike Cafarella created Hadoop, which became an open-source Apache framework.

- Yahoo played a major role in the development and adoption of Hadoop.

3. Hadoop Ecosystem:

- Comprises tools that work together to process and analyze Big Data.

- Core components: HDFS (storage), MapReduce (processing).

- Supporting tools: Hive, Pig, HBase, Sqoop, Flume, Oozie, and Zookeeper.

4. IBM Big Data Strategy:

- IBM InfoSphere BigInsights integrates Hadoop into enterprise environments for better data management.

- Provides advanced tools like text analytics, machine learning, and enterprise-grade security.

Unit II: HDFS (Hadoop Distributed File System)

1. HDFS Concepts:

- HDFS is a distributed storage system designed to store very large datasets across multiple nodes.

- Data is divided into blocks (default size: 128 MB in Hadoop 2.x and later; 64 MB in Hadoop 1.x) and stored across a cluster of nodes.

- Features include fault tolerance (via block replication, default factor 3), high throughput, and scalability.
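The block concept above is easy to quantify: a file occupies as many blocks as its size requires, with the last block allowed to be partial. A minimal sketch (assuming the 128 MB default):

```python
BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size in Hadoop 2.x+ (128 MB)

def num_blocks(file_size_bytes: int) -> int:
    """Number of HDFS blocks a file occupies; the last block may be partial."""
    return -(-file_size_bytes // BLOCK_SIZE)  # ceiling division

# A 300 MB file spans three blocks: 128 MB + 128 MB + 44 MB.
print(num_blocks(300 * 1024 * 1024))  # 3
```

Note that a partial last block only consumes as much disk as its actual data, so many small files waste NameNode metadata rather than disk space.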

2. Data Ingestion:

- Flume: Used for collecting, aggregating, and moving large amounts of log data into HDFS.

- Sqoop: Transfers data between HDFS and relational databases like MySQL.

3. Hadoop I/O:

- Compression: Reduces the size of data to save storage and improve performance.

- Serialization: Converts data into a format that can be stored or transmitted (e.g., Avro, Thrift).
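The serialize-then-compress pipeline above can be sketched with standard-library stand-ins (JSON and gzip here; Hadoop itself typically uses Avro, Thrift, or Writables with codecs like Snappy, but the round-trip is the same idea):

```python
import gzip
import json

record = {"user": "alice", "clicks": 42}  # hypothetical record for illustration

# Serialization: convert the in-memory object into bytes.
serialized = json.dumps(record).encode("utf-8")

# Compression: shrink the bytes before storing or transmitting them.
compressed = gzip.compress(serialized)

# Round-trip: decompress, then deserialize, to recover the original object.
restored = json.loads(gzip.decompress(compressed).decode("utf-8"))
print(restored == record)  # True
```

In Hadoop the choice of serialization format matters for splittability and schema evolution, which is why Avro is often preferred over plain JSON.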

Unit III: MapReduce

1. Anatomy of MapReduce Job:

- Splits input data into smaller chunks.

- Mapper processes chunks in parallel and generates key-value pairs.

- Reducer combines and aggregates intermediate outputs from mappers.

2. Shuffle and Sort:

- Sorts mapper outputs by key and distributes them to reducers.
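The map, shuffle/sort, and reduce phases above can be simulated in a few lines of plain Python (a word-count sketch, not Hadoop's actual API; on a real cluster the map calls run in parallel and the shuffle moves data across the network):

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input split.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Reduce phase: aggregate all values that arrived for one key.
    return (word, sum(counts))

lines = ["Big Data", "big data big"]

# Each line is processed independently (in parallel on a real cluster).
pairs = [kv for line in lines for kv in mapper(line)]

# Shuffle and sort: group intermediate pairs by key before reducing.
pairs.sort(key=itemgetter(0))
result = dict(reducer(k, (v for _, v in g)) for k, g in groupby(pairs, key=itemgetter(0)))

print(result)  # {'big': 3, 'data': 2}
```

The sort step is what guarantees each reducer sees all values for a given key together, which is exactly what the framework's shuffle provides.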

3. Job Scheduling:

- Ensures tasks are executed efficiently.

- Types of schedulers: FIFO (First In First Out), Fair Scheduler, Capacity Scheduler.

Unit IV: Hadoop Ecosystem Tools

1. Pig:

- High-level scripting platform for data transformation and analysis.

- Uses Pig Latin, a dataflow scripting language that is simpler than writing MapReduce jobs in Java.

- Suitable for tasks like ETL (Extract, Transform, Load).

2. Hive:

- A data warehouse infrastructure on top of Hadoop.

- HiveQL allows querying data using an SQL-like syntax.

- Integrates with HDFS for large-scale data analysis.

3. HBase:

- A NoSQL database built on top of HDFS, providing real-time read/write access.

- Stores data in column families, which makes random lookups and sparse, wide tables faster than they typically are in a row-oriented RDBMS.

Unit V: Data Analytics with R and Machine Learning

1. Supervised Learning:

- Models are trained using labeled data (input-output pairs).

- Examples: Regression (predicting values) and Classification (categorizing data).
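Regression on labeled data can be shown end to end with ordinary least squares for a single feature (a minimal sketch in plain Python; the toy points below follow y = 2x + 1 exactly, so the fitted line recovers those coefficients):

```python
# Labeled training data: inputs xs with known outputs ys.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Ordinary least squares: slope = cov(x, y) / var(x).
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

print(slope, intercept)            # 2.0 1.0
print(slope * 5.0 + intercept)     # predict for x = 5 -> 11.0
```

Classification works the same way conceptually: the model is fit to labeled pairs, then applied to unseen inputs; only the output type (a category rather than a number) differs.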

2. Unsupervised Learning:

- Works on unlabeled data to identify patterns and relationships.

- Examples: Clustering (grouping similar items) and Dimensionality Reduction.
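Clustering can be illustrated with a tiny one-dimensional k-means (a sketch with made-up points and starting centers, not a production algorithm; real k-means also needs a convergence check and careful initialization):

```python
def kmeans_1d(points, centers, iters=10):
    """Tiny 1-D k-means: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    for _ in range(iters):
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        centers = [sum(ps) / len(ps) for ps in clusters.values() if ps]
    return sorted(centers)

# Two obvious groups: values near 1 and values near 10.
centers = kmeans_1d([1.0, 1.2, 0.8, 9.8, 10.0, 10.2], centers=[0.0, 5.0])
```

No labels are provided anywhere: the grouping emerges purely from the distances between the points, which is the defining trait of unsupervised learning.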

3. Collaborative Filtering:

- Used in recommender systems (e.g., Amazon, Netflix).

- Based on user behavior or item similarity to suggest relevant items.
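Item-similarity filtering can be sketched with cosine similarity over a toy ratings matrix (users, items, and ratings below are invented for illustration; real recommenders also handle missing ratings and scale):

```python
import math

# Toy user-item ratings: user -> {item: rating}.
ratings = {
    "ana":  {"A": 5, "B": 4, "C": 1},
    "ben":  {"A": 4, "B": 5, "C": 1},
    "cara": {"A": 1, "B": 1, "C": 5},
}

def item_vector(item):
    # Represent an item as the vector of ratings users gave it (0 if unrated).
    return [ratings[u].get(item, 0) for u in sorted(ratings)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# A and B are rated alike by everyone, so they come out more similar than A and C.
sim_ab = cosine(item_vector("A"), item_vector("B"))
sim_ac = cosine(item_vector("A"), item_vector("C"))
print(sim_ab > sim_ac)  # True
```

A recommender would then suggest item B to someone who liked A, because users who rated A highly tended to rate B highly as well.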
