
Big Data (BCS061/BCDS-601/KOE-097)

Unit – 5: Hadoop Ecosystem Frameworks, Pig, Hive, HBase

Edushine Classes
Follow Us
Download Notes : https://rzp.io/rzp/JV7zlavG
https://telegram.me/rrsimtclasses/

🐷 What is Pig? (in Hadoop)


Pig is a data flow language used to analyze big data in Hadoop.
It uses a simple language called Pig Latin, which is easier than writing Java MapReduce code.
✅ Why Use Pig?
• It helps process huge data sets.
• It reduces coding time (just like SQL is easier than full programming).
• It converts your code into MapReduce jobs automatically.
⚙ Execution Modes of Pig :
Pig can run in 2 modes – Local Mode (runs on a single machine and uses the local file system, good for testing) and MapReduce Mode (runs on a Hadoop cluster and uses HDFS).

➡ You can choose the mode using the command:


pig -x local // for local mode
pig -x mapreduce // for Hadoop cluster

🌟 Features of Pig
i. Easy to Learn – Uses Pig Latin, similar to SQL.
ii. Handles Big Data – Good for analyzing huge datasets.
iii. Extensible – You can write your own functions (called UDFs).
iv. Automatically Converts to MapReduce – No need to write complex code.
v. Supports Joins, Filters, Grouping – Like SQL operations.
vi. Error Handling – Provides good debugging and error messages.
Pig is a tool to process big data using Pig Latin.
It runs in local or MapReduce mode and makes data handling easy and fast in Hadoop.

🐷 Pig Latin vs SQL (Database) :
• Pig Latin is a procedural data-flow language; SQL is declarative.
• Pig Latin works on files in HDFS; SQL works on tables in an RDBMS.
• In Pig, a schema is optional; in SQL, a schema is mandatory.
• Pig Latin suits step-by-step transformation of large, semi-structured data; SQL suits structured, transactional data.




🐷💻 What is Grunt in Pig? (Short Note)


• Grunt is the command-line interface (CLI) of Pig.
• It’s like a place where you type Pig commands and run them step by step.
✅ What You Can Do in Grunt:
• Write and run Pig Latin commands
• Load, filter, join, and process data
• See outputs and debug easily
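For example, a short Grunt session might look like this (file and field names are illustrative):
grunt> data = LOAD 'file.txt' USING PigStorage(',') AS (name:chararray, age:int);
grunt> adults = FILTER data BY age >= 18;
grunt> DUMP adults;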

✅ Syntax and Semantics of Pig Latin :


✅ Syntax of Pig Latin
Pig Latin is a data flow language. Its syntax defines how we write statements to process
data step by step.
It includes commands like:
1. LOAD – To load data from HDFS
data = LOAD 'file.txt' USING PigStorage(',') AS (name:chararray, age:int);
2. FILTER – To select rows based on a condition
adults = FILTER data BY age >= 18;
3. FOREACH…GENERATE – To select specific columns
names = FOREACH adults GENERATE name;
4. GROUP – To group records
grouped = GROUP data BY age;
5. JOIN – To combine two datasets
joined = JOIN A BY id, B BY id;

6. STORE/DUMP – To save or display the result


DUMP names;
STORE names INTO 'output';

✅ Semantics of Pig Latin


Semantics means the meaning of the Pig Latin statements. Each line is a step in the data
flow and describes how data moves and is processed.
Example :
data = LOAD 'students.csv' AS (name, marks);
passed = FILTER data BY marks >= 33;
DUMP passed;
Meaning:
• Load student data
• Select only those who passed
• Show the result on screen
Pig Latin has a simple syntax and clear semantics, making it easy to process large data in Hadoop. It supports step-by-step data flow, similar to SQL but more flexible for big data.

✅ What is a UDF in Pig?


A User Defined Function (UDF) in Pig is a custom function created by the user to perform
operations that are not available in built-in functions.
Pig has many built-in functions, but if you need something special (like custom string or
math logic), you can create your own.
✅ Language Used:
• UDFs are usually written in Java
• They can also be written in Python, Ruby, or JavaScript
✅ Example Use:
Let’s say you want to convert names to uppercase but there’s no built-in function:
You can write a UDF like ToUpper() and use it in Pig like:
Example :
REGISTER myudfs.jar;
data = LOAD 'file.txt' AS (name:chararray);
upper_names = FOREACH data GENERATE ToUpper(name);
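For reference, here is a minimal sketch of what such a UDF could look like in Java, using Pig's standard EvalFunc API (the class and file names are illustrative):
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Converts the first input field to uppercase; returns null for empty input
public class ToUpper extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null)
            return null;
        return ((String) input.get(0)).toUpperCase();
    }
}
The compiled class is packed into a jar (myudfs.jar above) and made available to Pig with REGISTER.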

✅ Data Processing Operators in Pig


Pig Latin provides several data processing operators that help in analyzing and transforming
large datasets efficiently. These operators allow step-by-step data processing similar to SQL
but are more suitable for parallel processing in Hadoop.
🔹 1. LOAD
Used to load data from a file or HDFS into a relation.
🔹 2. FILTER
Used to select records that meet a specific condition.
🔹 3. FOREACH…GENERATE
Used to perform operations on each record and generate new output.
🔹 4. GROUP
Used to group records based on the value of a specific field.
🔹 5. JOIN
Used to join two or more relations based on a common key.

🔹 6. ORDER
Used to sort the data based on one or more fields.
🔹 7. DISTINCT
Used to remove duplicate records from a dataset.
🔹 8. LIMIT
Used to return a specified number of rows.
🔹 9. DUMP
Used to display the result on the console.
🔹 10. STORE
Used to save the result into a file or directory in HDFS.
These operators are essential for performing tasks like filtering, grouping, joining, and
storing data in big data applications using Pig.
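As an illustration, several of these operators can be chained in one script (relation and field names are assumptions):
-- load, filter, sort, and keep the top rows of a dataset
data   = LOAD 'students.txt' USING PigStorage(',') AS (id:int, name:chararray, marks:int);
passed = FILTER data BY marks >= 33;
sorted = ORDER passed BY marks DESC;
top5   = LIMIT sorted 5;
DUMP top5;
STORE top5 INTO 'top_students';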

✅ Apache Hive and Its Architecture


🔹 What is Hive?
Hive is a data warehouse tool built on top of Hadoop. It helps in reading, writing, and
managing large datasets using HiveQL (a SQL-like language). It converts HiveQL queries
into MapReduce jobs for processing.

🏗 Architecture of Hive:
1. User Interfaces:
Used to interact with Hive.
Examples:
• Web UI
• Hive Command Line
• HDInsight
2. Meta Store:
• Stores metadata (info about tables, columns, data types).
• Helps Hive know where and how the data is stored in HDFS.
3. HiveQL Process Engine:
• Receives queries written in HiveQL.
• Checks the syntax and passes the query to the execution engine.

4. Execution Engine:
• Converts queries into MapReduce jobs.
• Executes them on the Hadoop cluster.
5. HDFS or HBase Storage:
• Hive stores actual data in HDFS or HBase.
• It just processes queries over this stored data.
Hive lets you run SQL-like queries on big data stored in HDFS. It uses components like
Metastore, HiveQL engine, and Execution engine to turn your queries into results.

✍ Working of Hive with Hadoop (Step-by-Step)


When a user runs a HiveQL query, this is what happens:

🔹 1. Interface (Step 1 & 10):


The user writes the query using Hive Command Line, Web UI, or other interfaces.
🔹 2. Driver (Steps 2, 6, 9):
The driver receives the query and manages the full process:
• Sends the query to the compiler
• Monitors the execution
• Returns results to the user
🔹 3. Compiler (Steps 3 & 5):
The compiler checks the query for errors and converts it into a logical plan.
It also asks the Metastore for table info.
🔹 4. Metastore (Step 4):
Stores metadata (data about data), like table names, columns, data types, location in HDFS.
🔹 5. Execution Engine (Steps 7, 7.1, 8):
The query is passed to the Execution Engine, which converts it into MapReduce jobs.

🔹 6. Hadoop Framework (MapReduce + HDFS):


• MapReduce processes the data
• HDFS provides the data from DataNodes
• Once processed, results are sent back to the Hive Execution Engine
🔹 7. Final Result (Step 9 & 10):
The result is collected by the Driver and shown to the user.

Hive converts your SQL-like query into MapReduce jobs, runs them using Hadoop, gets the
results from HDFS, and gives you the answer — just like a smart translator between SQL and big
data.

📄 Short Note: Apache Hive Installation :


1. Install Java and Hadoop
• Make sure Java and Hadoop are installed and working properly.
• Set environment variables for both.
2. Download Hive
• Go to the official Hive website and download the Hive software.
• Extract the files and place them in a folder like /usr/local/hive.
3. Set Environment Variables
• Add the Hive path to the system using .bashrc or .bash_profile.
4. Create Directories in HDFS
• Make folders /tmp and /user/hive/warehouse in HDFS.
• Give permissions using Hadoop commands.
5. Initialize Metastore
• Use the Derby database (default) and run the command to initialize the schema:
schematool -initSchema -dbType derby

6. Start Hive
• Type hive in the terminal to open the Hive shell and start writing HiveQL queries.
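For steps 3–5, the commands might look like this (paths are illustrative):
# add Hive to the environment (e.g. in .bashrc)
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin

# create the HDFS directories Hive needs and grant write permission
hdfs dfs -mkdir -p /tmp /user/hive/warehouse
hdfs dfs -chmod g+w /tmp /user/hive/warehouse

# initialize the default Derby metastore, then start the shell
schematool -initSchema -dbType derby
hive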
✅ Hive Shell :
Hive Shell is a command-line tool where we write and run Hive queries.
• It looks like a terminal screen where we type HiveQL commands.
• It is used to create tables, load data, and run queries on big data stored in HDFS.
📝 Example:
You open Hive shell by typing hive in the terminal. Then you can write:
SELECT * FROM student;
✅ Hive Services :
Hive has several services that help it work smoothly. Main services are:
1. Driver
Manages query execution and keeps track of its progress.

2. Compiler
Checks your Hive query and converts it into a MapReduce job.
3. Metastore
Stores information (metadata) about Hive tables like names, columns, types, etc.
4. Execution Engine
Runs the query and fetches the result using MapReduce.

✅ What is Hive Metastore?


• Hive Metastore is like a library catalog for Hive.
• It stores all the information about Hive tables—like their names, columns, data types,
where data is stored, etc.
📌 Think of it as a database about your data.

Hive Metastore is a service that stores metadata about Hive tables, columns, data types, and HDFS locations. It helps Hive know how and where the data is stored.

✅ Comparison: Hive vs Traditional Database
• Hive is built for batch analytics on huge datasets in HDFS; a traditional database handles transactional (OLTP) workloads.
• Hive uses schema-on-read; a traditional database uses schema-on-write.
• Hive queries run as MapReduce jobs with high latency; database queries respond in real time.
• Hive has limited support for updates and deletes; a traditional database supports them fully.



✅ 1. What is HiveQL?
HiveQL (Hive Query Language) is a SQL-like language used to interact with Hive.
It helps to create tables, insert data, and run queries on large datasets stored in HDFS.
📌 Example:
SELECT name FROM students WHERE marks > 80;

✅ 2. What is a Hive Table?


A Hive table is like a virtual table where data is stored in HDFS.
It has rows and columns just like in SQL.
📝 Types:
i. Managed Table: Hive manages both data and metadata.
ii. External Table: Hive manages only metadata. Data remains outside.
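As a sketch, the two types could be created like this (table and column names are assumptions):
-- Managed table: Hive owns both the data and the metadata
CREATE TABLE students (name STRING, marks INT);

-- External table: Hive tracks only the metadata; the data stays at the given path
CREATE EXTERNAL TABLE students_ext (name STRING, marks INT)
LOCATION '/user/data/students';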

✅ 3. What is Partition in Hive?


Partition means dividing a table into smaller parts based on column values.
Helps in faster query performance by scanning only required parts.
📌 Example:
Partition a sales table by year:
• PARTITIONED BY (year INT)
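In a full table definition, the clause might appear like this (the other column names are assumptions):
-- each year's rows are stored in a separate partition directory in HDFS
CREATE TABLE sales (item STRING, amount DOUBLE)
PARTITIONED BY (year INT);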
✅ 4. What is Bucketing in Hive?
Bucketing further divides data inside a partition into equal-sized files (buckets) based on the
hash function.
Helps in faster joins and sampling.
📌 Example:
• CLUSTERED BY (student_id) INTO 4 BUCKETS;
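In a complete definition, it might look like this (column names are assumptions):
-- rows are hashed on student_id into 4 bucket files
CREATE TABLE students (student_id INT, name STRING)
CLUSTERED BY (student_id) INTO 4 BUCKETS;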

✅ 5. Storage Formats in Hive


Hive supports multiple file formats for storing data, such as TextFile (the default), SequenceFile, RCFile, ORC, Parquet, and Avro.

✅ 6. Sorting in Hive
• Sorting means arranging data in ascending or descending order.
• Done using ORDER BY (a total ordering of all data through a single reducer) or SORT BY (orders data only within each reducer).
📌 Example: SELECT * FROM student ORDER BY marks DESC;

✅ 7. Aggregating in Hive
Aggregation means using functions like COUNT, SUM, AVG, MAX, MIN to summarize data.
📌 Example: SELECT AVG(marks) FROM student;

✅ 8. Joins in Hive
Joins are used to combine rows from two or more tables based on a related column.
📌 Types:
• INNER JOIN – returns matching rows
• LEFT OUTER JOIN – returns all from left + matches from right
• RIGHT OUTER JOIN – returns all from right + matches from left
• FULL OUTER JOIN – all rows from both tables
Example :
SELECT s.name, m.marks
FROM students s
JOIN marks m ON s.id = m.student_id;

✅ 9. Subqueries in Hive
A subquery is a query inside another query.
It helps in filtering, grouping, or complex logic.
📌 Example:
SELECT name FROM student
WHERE marks > (SELECT AVG(marks) FROM student);

✅ What is HBase?
• HBase is a NoSQL database that runs on top of Hadoop.
• It is used to store and manage very large data (billions of rows) in a table format, just like an
Excel sheet — but distributed across many machines.
• It works well for real-time read and write of big data.
📌 Think of it as a giant Excel sheet spread across many computers!
✨ Features of HBase :
• Horizontally scalable – tables are automatically split into regions across servers.
• Real-time, random read/write access to big data.
• Strongly consistent reads and writes.
• Fault tolerant – data is stored on HDFS with replication.

✅ HBase Data Model :


HBase stores data in tables, just like SQL — but the structure is different and more
flexible.
📦 Basic Structure of HBase:

✅ HBase Data Model Components :
• Table – a collection of rows.
• Row Key – the unique identifier of each row.
• Column Family – a group of related columns, defined at table creation.
• Column Qualifier – the actual column name inside a family.
• Cell – the intersection of a row and a column; it stores the value.
• Timestamp – a version number attached to every cell value.

✅ Client Options for Interacting with HBase Cluster


There are many ways to interact with an HBase cluster:
1. HBase Shell – This is a command-line tool that lets us run commands to create
tables, insert data, read data, and manage the database easily.
2. Java API – Developers can use Java programming to connect with HBase and
perform read/write operations in their programs.
3. REST API – HBase can be accessed using web URLs, which is helpful for web
applications and services.
4. Thrift API – It allows other languages like Python, PHP, and C++ to connect with
HBase.
5. MapReduce – Hadoop's MapReduce can be used to process data stored in HBase
in large batches.
6. Hive Integration – Hive can be used to write SQL-like queries (HiveQL) on HBase
tables for easier data analysis.
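For example, a short HBase Shell session might look like this (table, column family, and values are illustrative):
create 'student', 'info'                      # table with one column family
put 'student', '1001', 'info:name', 'Priya'   # write a cell
get 'student', '1001'                         # read one row by Row Key
scan 'student'                                # read all rows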

✅ Difference between HBase and RDBMS :
• HBase is a column-oriented NoSQL store with a flexible schema; an RDBMS is row-oriented with a fixed schema.
• HBase scales horizontally across commodity machines; an RDBMS usually scales vertically.
• HBase has no built-in SQL, joins, or transactions; an RDBMS supports all of these.
• HBase suits very large, sparse datasets; an RDBMS suits structured, transactional data.



✅ Schema Design in HBase :


In HBase, designing the schema means deciding how to organize your data in tables. But
it’s very different from SQL databases.
• HBase is schema-less for columns — you only need to define column families, not
individual columns.
• Each row is identified by a Row Key — it should be unique and well-designed (like a roll
number or user ID).
• Column families group related columns (like student:name, student:marks).
• It’s important to group data that is usually accessed together into the same column
family.
• Avoid putting too many column families because each one is stored separately, which
slows down performance.
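A minimal sketch of such a design in the HBase Shell (table, family, and key names are assumptions):
# row key = roll number; related columns grouped into two families
create 'student', 'personal', 'academic'
put 'student', 'roll_1003', 'personal:name', 'Priya'
put 'student', 'roll_1003', 'academic:marks', '91'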



✅ What is Indexing in HBase?


In HBase, data is stored and searched based on Row Keys only.
That means:
If you know the Row Key, data retrieval is very fast.
But if you want to search by some other column, like "name" or "city", it becomes slow —
because HBase doesn't create indexes on those columns by default.

✅ What is Advanced Indexing?


Advanced Indexing means creating a secondary index (extra structure) to make searching
faster by non-key columns.
This helps you search HBase tables like SQL-style queries:
• Search by name, email, or age, not just Row Key.
Big Data(BCS061/BCDS-601

✅ Example :
Suppose you have an HBase table Student:

If you want:
"Find student whose Name = Priya"
➡️ This is slow, because HBase will check each row one by one (called a full scan).
We can create a Secondary Index Table:

Now:
First, you search in the index table using "Priya" → it gives you 1003.
Then, go to the main table with 1003 → get full student data.
✅ Faster than full table scan.
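In HBase Shell terms, the two-step lookup might look like this (both table layouts are assumptions):
# the index table maps Name -> Row Key of the main table
get 'student_name_idx', 'Priya'   # returns the row key 1003
get 'student', '1003'             # fetch the full student row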
✅ Short Note on ZooKeeper and Its Role in Monitoring a Cluster
ZooKeeper is a tool used in Hadoop and HBase systems to manage and coordinate
different machines (nodes) in a cluster.
It helps in:
• Tracking node status: ZooKeeper keeps an eye on which servers are active and which
are down.
• Leader election: If the main/master server fails, ZooKeeper helps to choose a new
leader automatically.
• Communication: It helps all nodes in the cluster talk to each other smoothly.
• Fail recovery: When a server fails, ZooKeeper informs the system so it can recover
quickly.

• ZooKeeper makes sure that the cluster runs smoothly, with less downtime and better
coordination.
✅ IBM Big Data Strategy :
IBM's Big Data strategy focuses on helping businesses use their data in a smart way to make
better decisions, faster.
IBM believes that Big Data is not just about collecting a lot of data, but about using that
data to get useful insights.
✅ Key Points of IBM’s Big Data Strategy:
1. Volume, Variety, Velocity:
IBM handles all types of data – big in size, different in format (text, video, etc.), and
coming at high speed.
2. Unified Platform:
IBM provides a complete platform where you can store, manage, analyze, and visualize
your data in one place.

3. Infosphere BigInsights:
IBM offers this tool to process and analyze Big Data using Hadoop technology.
4. Big SQL:
You can use SQL queries to analyze big data easily, even if it’s stored in Hadoop.
5. Security and Governance:
IBM ensures that data is safe, secure, and managed properly, with proper rules.
6. Integration with AI and Cloud:
IBM connects Big Data with AI (Watson) and Cloud to provide real-time intelligence
and smart decisions.

✅ 1. InfoSphere (by IBM)


InfoSphere is a set of IBM tools that helps in:
• Collecting, managing, and analyzing big data.
• It makes sure data is clean, organized, and ready to be used in analytics.
• It supports data integration, data quality, and data governance.
📌 In Easy Words:
InfoSphere is IBM’s tool to manage big data properly so companies can trust and use their
data easily.
✅ 2. BigInsights
BigInsights is IBM’s platform for working with Big Data using Hadoop.
• It is built on Apache Hadoop but has extra features like better security, analytics, and a
user-friendly interface.
• Helps to process large data and get useful results.
• Includes tools for developers, data scientists, and business users.

📌 In Easy Words:
BigInsights is IBM’s software that adds more power and features to Hadoop for better
big data processing.

✅ 3. BigSheets
BigSheets is a tool in BigInsights that looks like Excel but works on Big Data.
• It allows users to analyze large datasets without coding.
• You can filter, sort, group, and visualize big data using an easy spreadsheet-style interface.
• Great for business users who don’t know programming.
📌 In Easy Words:
BigSheets is like Excel for Big Data. It helps non-technical people explore and analyze big data
in a simple way.

✅ What is BigSQL?
BigSQL is a tool by IBM that lets you use SQL queries to work with Big Data stored in Hadoop.
• Just like we use SQL for normal databases (like MySQL, Oracle),
• with BigSQL, we can write the same SQL queries to read data from Hadoop (HDFS), Hive, or HBase.
📌 In Easy Words:
BigSQL helps you use familiar SQL language to work with huge data stored in big data systems
like Hadoop.
✅ Key Features of BigSQL
• ✅ Works with standard SQL
• ✅ Can access data from Hive, HDFS, HBase
• ✅ Faster and more efficient than using Hive alone
• ✅ Supports joins, subqueries, sorting, grouping
• ✅ Provides security and governance features

✅ How BigSQL Works?


1. 📝 You write SQL queries, like:
SELECT * FROM customers WHERE city = 'Lucknow';
2. ⚙ BigSQL takes your SQL and translates it into commands that Hadoop can understand.
3. 🗃 It fetches data from different big data sources like HDFS, Hive tables, or HBase.
4. ⚡ It processes the data using a powerful engine (faster than normal Hive).
5. 📄 It returns results just like a normal SQL database does.


Thank You….
