
Exam-Oriented Notes: Big Data and Hadoop

Unit I: Introduction to Big Data and Hadoop

1. Big Data Analytics:

- Processing large, complex datasets to extract useful patterns and insights.

- Types of Big Data: Structured, Unstructured, and Semi-structured (a small illustration follows this unit's list).

2. History of Hadoop:

- Developed by Doug Cutting and Mike Cafarella.

- Inspired by Google's MapReduce and GFS papers.

3. Hadoop Ecosystem:

- Tools like HDFS, MapReduce, Pig, Hive, HBase, Sqoop, Flume, and Oozie.

4. IBM Big Data Strategy:

- Integrates Hadoop with IBM InfoSphere BigInsights for enterprise data management.
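
To make the three data types in item 1 concrete, here is a minimal Python sketch; the records, field names, and values are hypothetical and chosen only for illustration.

```python
import csv, io, json

# Structured: fixed schema, e.g. a row in a relational table or CSV file.
# (Hypothetical columns: id, name, amount.)
structured_row = next(csv.DictReader(io.StringIO("id,name,amount\n1,Alice,250.0\n")))

# Semi-structured: self-describing but with a flexible schema, e.g. JSON.
semi_structured = json.loads('{"id": 2, "name": "Bob", "tags": ["new", "mobile"]}')

# Unstructured: no predefined schema, e.g. free text from a log or a review.
unstructured = "Great service, but the delivery was two days late."

print(structured_row, semi_structured, unstructured, sep="\n")
```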

Unit II: HDFS (Hadoop Distributed File System)

1. HDFS Concepts:

- Distributed storage system for large datasets.

- Files are divided into large blocks (128 MB by default in recent Hadoop versions) that are replicated across DataNodes for fault tolerance; the NameNode tracks block locations (a conceptual sketch follows this unit's list).

2. Data Ingestion (Flume and Sqoop):

- Flume: Collects and moves large volumes of streaming log and event data into HDFS.

- Sqoop: Transfers structured data between HDFS and relational databases (imports and exports).

3. Hadoop I/O:

- Compression: Reduces storage space and network transfer time (codecs such as gzip, bzip2, and Snappy).

- Serialization: Converts objects into byte streams for storage and transmission (Hadoop uses Writables; Avro is a common alternative).
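
The sketch below is a minimal, hypothetical model of the block idea from item 1: it only computes block boundaries for a given file size and assigns replicas round-robin. Real HDFS placement is rack-aware and decided by the NameNode, so treat this as an illustration of the concept, not the actual algorithm.

```python
# Conceptual model of HDFS block splitting and replication (not real HDFS code).
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, the common default block size
REPLICATION = 3                  # default replication factor

def split_into_blocks(file_size_bytes, nodes):
    """Return a list of (block_index, block_length, replica_nodes) tuples."""
    blocks = []
    offset = 0
    index = 0
    while offset < file_size_bytes:
        length = min(BLOCK_SIZE, file_size_bytes - offset)
        # Hypothetical placement: round-robin over the node list.
        replicas = [nodes[(index + r) % len(nodes)] for r in range(REPLICATION)]
        blocks.append((index, length, replicas))
        offset += length
        index += 1
    return blocks

# A 300 MB file on a four-node cluster becomes three blocks (128 + 128 + 44 MB),
# each stored on three different DataNodes.
for block in split_into_blocks(300 * 1024 * 1024, ["node1", "node2", "node3", "node4"]):
    print(block)
```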

Unit III: MapReduce

1. Anatomy of MapReduce Job:

- The input is split; map tasks process the splits in parallel and emit key-value pairs, which reduce tasks aggregate into the final result (see the word-count sketch after this list).

2. Shuffle and Sort:

- Map output is partitioned by key, sorted, and transferred to the reducers so that each reducer receives all values for its keys.

3. Job Scheduling:

- Ensures efficient cluster utilization using schedulers such as the FIFO, Fair, and Capacity Schedulers.
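
Real MapReduce jobs are usually written in Java (or run through Hadoop Streaming); the Python sketch below only simulates the three phases in memory to make the data flow concrete.

```python
from collections import defaultdict

# Map phase: each "map task" turns its input split into (key, value) pairs.
def map_phase(split):
    return [(word.lower(), 1) for line in split for word in line.split()]

# Shuffle and sort: group all values by key, as the framework does between phases.
def shuffle_and_sort(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

# Reduce phase: each key and its list of values is reduced to a final result.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

splits = [["big data is big"], ["hadoop stores big data"]]   # two input splits
pairs = [pair for split in splits for pair in map_phase(split)]
print(reduce_phase(shuffle_and_sort(pairs)))
# {'big': 3, 'data': 2, 'hadoop': 1, 'is': 1, 'stores': 1}
```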

Unit IV: Hadoop Ecosystem Tools

1. Pig:

- High-level platform for writing data-flow scripts that run as MapReduce jobs.

- Uses the Pig Latin language, which is far more concise than hand-written Java MapReduce (see the comparison sketch after this list).

2. Hive:

- Queries data in HDFS using HiveQL, an SQL-like language.

- Used as a data warehouse layer on top of Hadoop for batch querying and analysis.

3. HBase:

- Column-oriented NoSQL database on top of HDFS for real-time read/write access.

- Provides low-latency random access to very large tables, a workload traditional RDBMSs struggle to scale to.
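
To make the "easier than Java" point concrete, the sketch below does a simple group-and-count in plain Python and shows, in comments, roughly how the same query reads in HiveQL and Pig Latin (assuming a hypothetical words table/relation with a single word column).

```python
from collections import Counter

# HiveQL (hypothetical table `words` with a `word` column):
#   SELECT word, COUNT(*) FROM words GROUP BY word;
#
# Pig Latin (hypothetical relation `words` with field `word`):
#   grouped = GROUP words BY word;
#   counts  = FOREACH grouped GENERATE group, COUNT(words);
#
# The same group-and-count in plain Python, for comparison:
words = ["hive", "pig", "hive", "hbase", "pig", "hive"]
counts = Counter(words)
print(counts)   # Counter({'hive': 3, 'pig': 2, 'hbase': 1})
```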

Unit V: Data Analytics with R and Machine Learning

1. Supervised Learning:

- Uses labeled data (inputs paired with known outputs) to train models.

- Examples: Regression, Classification (see the sketch after this list).

2. Unsupervised Learning:

- Works on unlabeled data to find patterns.

- Examples: Clustering, Dimensionality Reduction.

3. Collaborative Filtering:

- Builds recommender systems by predicting a user's preferences from the preferences of similar users (or similar items).
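
The unit names R as the analytics tool; purely to make the three ideas concrete, here is a standard-library Python sketch with made-up data: a least-squares fit on labeled data (supervised), a tiny 1-D k-means on unlabeled data (unsupervised), and the user-based cosine similarity that simple collaborative filtering builds on.

```python
import math

# --- Supervised learning: fit y = slope*x + intercept from labeled pairs. ---
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.1]                     # labels (made-up values near y = 2x)
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x
print("regression:", round(slope, 2), round(intercept, 2))

# --- Unsupervised learning: 1-D k-means with k=2 on unlabeled points. ---
points = [1.0, 1.2, 0.8, 8.0, 8.5, 7.9]       # no labels; structure must be found
centroids = [points[0], points[3]]            # naive initialization
for _ in range(10):
    clusters = {0: [], 1: []}
    for p in points:
        nearest = min((0, 1), key=lambda c: abs(p - centroids[c]))
        clusters[nearest].append(p)
    centroids = [sum(c) / len(c) for c in clusters.values()]
print("centroids:", [round(c, 2) for c in centroids])

# --- Collaborative filtering: user-based cosine similarity on a ratings matrix. ---
ratings = {                                   # made-up user -> {item: rating}
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob":   {"A": 4, "B": 3, "C": 5},
    "carol": {"A": 1, "B": 5, "C": 2},
}

def cosine(u, v):
    common = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)

# Alice's tastes are closer to Bob's than to Carol's, so Bob's ratings
# would drive recommendations for Alice.
print("sim(alice, bob):  ", round(cosine(ratings["alice"], ratings["bob"]), 3))
print("sim(alice, carol):", round(cosine(ratings["alice"], ratings["carol"]), 3))
```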
