Detailed Big Data and Hadoop Notes
Unit I: Introduction to Big Data
1. Big Data Basics:
- Big Data refers to datasets that are too large or complex to process using traditional methods.
- Big Data Analytics involves analyzing such datasets to uncover hidden patterns, correlations, and insights.
- Types of Data: structured (e.g., relational tables), semi-structured (e.g., XML, JSON), and unstructured (e.g., text, images, video).
2. History of Hadoop:
- Hadoop was inspired by Google's MapReduce and GFS (Google File System).
- Doug Cutting and Mike Cafarella created Hadoop, which became an open-source framework.
3. Hadoop Ecosystem:
- Comprises the core components (HDFS for storage, MapReduce for processing, YARN for resource management) plus tools that work together to process and analyze Big Data.
- Supporting tools: Hive, Pig, HBase, Sqoop, Flume, Oozie, and ZooKeeper.
- IBM InfoSphere BigInsights integrates Hadoop into enterprise environments for better data management.
- Provides advanced tools like text analytics, machine learning, and enterprise-grade security.
Unit II: HDFS (Hadoop Distributed File System)
1. HDFS Concepts:
- HDFS is a distributed storage system designed to store very large datasets across multiple nodes.
- Data is divided into blocks (default size: 128 MB since Hadoop 2.x) and stored across a cluster of nodes; each block is replicated (default factor: 3) for fault tolerance. A client-side sketch follows.
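A minimal sketch of writing and reading an HDFS file through the Hadoop FileSystem Java API; the NameNode URI and file path are illustrative assumptions, and fs.defaultFS would normally come from core-site.xml rather than being set in code.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical cluster
            try (FileSystem fs = FileSystem.get(conf)) {
                Path path = new Path("/user/demo/hello.txt"); // illustrative path
                // Write: the client streams the data to DataNodes block by block.
                try (FSDataOutputStream out = fs.create(path, true)) {
                    out.writeBytes("hello hdfs\n");
                }
                // Read the file back from the cluster.
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(fs.open(path)))) {
                    System.out.println(in.readLine());
                }
            }
        }
    }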
2. Data Ingestion:
- Flume: Used for collecting, aggregating, and moving large amounts of log data into HDFS.
- Sqoop: Transfers data between HDFS and relational databases like MySQL.
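As a sketch, a typical Sqoop import from MySQL into HDFS looks like the command below; the host, database, table, credentials, and target directory are illustrative assumptions.

    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username analyst -P \
      --table orders \
      --target-dir /data/orders \
      --num-mappers 4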
3. Hadoop I/O:
- Compression: Reduces data size to save storage and speed up I/O and network transfer (common codecs: gzip, bzip2, Snappy).
- Serialization: Converts data into a format that can be stored or transmitted (e.g., Avro, Thrift).
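A minimal Java sketch of Avro serialization and deserialization; the "User" record schema and its single field are hypothetical, defined inline for illustration.

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.*;
    import org.apache.avro.io.DatumReader;
    import org.apache.avro.io.DatumWriter;

    public class AvroExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical schema, defined inline for illustration.
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\"," +
                "\"fields\":[{\"name\":\"name\",\"type\":\"string\"}]}");
            GenericRecord user = new GenericData.Record(schema);
            user.put("name", "Alice");

            // Serialize the record into a compact binary container file.
            File file = new File("users.avro");
            DatumWriter<GenericRecord> dw = new GenericDatumWriter<>(schema);
            try (DataFileWriter<GenericRecord> writer = new DataFileWriter<>(dw)) {
                writer.create(schema, file);
                writer.append(user);
            }

            // Deserialize it back; the schema travels with the file.
            DatumReader<GenericRecord> dr = new GenericDatumReader<>(schema);
            try (DataFileReader<GenericRecord> reader = new DataFileReader<>(file, dr)) {
                for (GenericRecord rec : reader) System.out.println(rec.get("name"));
            }
        }
    }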
4. Job Scheduling:
- Types of schedulers: FIFO (First In, First Out), Fair Scheduler, and Capacity Scheduler; a sample Capacity Scheduler configuration is sketched below.
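As a sketch, the Capacity Scheduler is configured in capacity-scheduler.xml (and enabled via yarn.resourcemanager.scheduler.class in yarn-site.xml); the queue names and capacity percentages below are illustrative assumptions.

    <!-- capacity-scheduler.xml: split cluster capacity between two queues. -->
    <configuration>
      <property>
        <name>yarn.scheduler.capacity.root.queues</name>
        <value>default,analytics</value> <!-- illustrative queue names -->
      </property>
      <property>
        <name>yarn.scheduler.capacity.root.default.capacity</name>
        <value>70</value> <!-- percent of cluster resources -->
      </property>
      <property>
        <name>yarn.scheduler.capacity.root.analytics.capacity</name>
        <value>30</value>
      </property>
    </configuration>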
Unit IV: Hadoop Ecosystem Tools
1. Pig:
- A high-level platform for analyzing large datasets; scripts written in its Pig Latin language are compiled into MapReduce jobs.
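A minimal Pig Latin sketch that totals bytes transferred per user; the input path and field layout (tab-delimited, the PigStorage default) are illustrative assumptions.

    -- Load a hypothetical tab-delimited log file.
    logs = LOAD '/data/logs.txt' AS (user:chararray, bytes:long);
    grouped = GROUP logs BY user;
    -- Sum the bytes within each user's bag of records.
    totals = FOREACH grouped GENERATE group AS user, SUM(logs.bytes) AS total_bytes;
    STORE totals INTO '/data/user_totals';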
2. Hive:
- A data warehouse system that provides an SQL-like query language (HiveQL) over data stored in HDFS; queries are translated into MapReduce jobs.
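A minimal HiveQL sketch; the table, columns, and input path are illustrative assumptions.

    -- Define a table over tab-delimited data (hypothetical layout).
    CREATE TABLE page_views (user_id INT, url STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
    -- Load a file already in HDFS into the table.
    LOAD DATA INPATH '/data/page_views.tsv' INTO TABLE page_views;
    -- Top 10 most-viewed URLs; Hive compiles this into MapReduce jobs.
    SELECT url, COUNT(*) AS views
    FROM page_views
    GROUP BY url
    ORDER BY views DESC
    LIMIT 10;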
3. HBase:
- A column-oriented NoSQL database built on top of HDFS; data is organized by column family, enabling fast random, real-time reads and writes on very large tables (unlike batch-oriented HDFS access or row-oriented RDBMS storage).
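A minimal sketch of the HBase Java client API; the table name "users" and column family "info" are illustrative assumptions, and the table is presumed to already exist.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("users"))) {
                // Write one cell: row key "u1", column info:name.
                Put put = new Put(Bytes.toBytes("u1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                              Bytes.toBytes("Alice"));
                table.put(put);
                // Random read by row key.
                Result r = table.get(new Get(Bytes.toBytes("u1")));
                byte[] name = r.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
                System.out.println(Bytes.toString(name));
            }
        }
    }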
Unit V: Machine Learning
1. Supervised Learning:
- Models are trained on labeled data to predict outcomes for new inputs (e.g., classification, regression).
2. Unsupervised Learning:
- Models discover structure in unlabeled data (e.g., clustering, dimensionality reduction).
3. Collaborative Filtering:
- Recommends items to a user based on the preferences of similar users (or similar items); a toy sketch follows this list.
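To make collaborative filtering concrete, here is a toy, self-contained Java sketch of user-based filtering with cosine similarity; the users, items, and ratings are invented, and a real deployment would typically use a library such as Apache Mahout on Hadoop.

    import java.util.*;

    public class UserBasedCF {
        // Cosine similarity between two users' rating vectors
        // (missing ratings are treated as zero).
        static double cosine(Map<String, Double> a, Map<String, Double> b) {
            double dot = 0, na = 0, nb = 0;
            for (Map.Entry<String, Double> e : a.entrySet()) {
                Double other = b.get(e.getKey());
                if (other != null) dot += e.getValue() * other;
                na += e.getValue() * e.getValue();
            }
            for (double v : b.values()) nb += v * v;
            return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
        }

        public static void main(String[] args) {
            // Toy user -> (item -> rating) data; all values are invented.
            Map<String, Map<String, Double>> ratings = new HashMap<>();
            ratings.put("alice", Map.of("m1", 5.0, "m2", 3.0, "m3", 4.0));
            ratings.put("bob",   Map.of("m1", 5.0, "m2", 3.0, "m4", 5.0));
            ratings.put("carol", Map.of("m2", 1.0, "m3", 2.0, "m4", 4.0));

            String target = "alice";
            Map<String, Double> targetRatings = ratings.get(target);
            // Score items the target has not seen, weighting each other
            // user's rating by that user's similarity to the target.
            Map<String, Double> scores = new HashMap<>();
            for (Map.Entry<String, Map<String, Double>> u : ratings.entrySet()) {
                if (u.getKey().equals(target)) continue;
                double sim = cosine(targetRatings, u.getValue());
                for (Map.Entry<String, Double> item : u.getValue().entrySet()) {
                    if (!targetRatings.containsKey(item.getKey())) {
                        scores.merge(item.getKey(), sim * item.getValue(), Double::sum);
                    }
                }
            }
            // Print candidate recommendations, best first.
            scores.entrySet().stream()
                  .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                  .forEach(e -> System.out.println(e.getKey() + " -> " + e.getValue()));
        }
    }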