Big Data Unit 5 (Easy Notes ) Edushine Classes
Big Data (BCS061/BCDS-601)
🌟 Features of Pig
i. Easy to Learn – Uses Pig Latin, similar to SQL.
ii. Handles Big Data – Good for analyzing huge datasets.
iii. Extensible – You can write your own functions (called UDFs).
iv. Automatically Converts to MapReduce – Pig turns your script into MapReduce jobs, so you don't write MapReduce code by hand.
v. Supports Joins, Filters, Grouping – Like SQL operations.
vi. Error Handling – Provides good debugging and error messages.
Pig is a tool for processing big data with scripts written in Pig Latin.
It runs in local mode (one machine, local files) or MapReduce mode (on a Hadoop cluster), and makes data processing in Hadoop simpler to write.
🔹 6. ORDER
Used to sort the data based on one or more fields.
🔹 7. DISTINCT
Used to remove duplicate records from a dataset.
🔹 8. LIMIT
Used to return a specified number of rows.
🔹 9. DUMP
Used to display the result on the console.
🔹 10. STORE
Used to save the result into a file or directory in HDFS.
These operators are essential for performing tasks like filtering, grouping, joining, and
storing data in big data applications using Pig.
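To get a feel for operators 6 to 10, here is a small Python sketch that imitates what each one does. It is not Pig Latin, the sample records are made up, and an in-memory buffer stands in for an HDFS directory.

```python
import csv
import io

# Hypothetical (name, marks) records; not taken from the notes.
records = [("Asha", 72), ("Ravi", 91), ("Asha", 72), ("Meena", 85)]

# DISTINCT: remove duplicate records (first occurrence is kept here).
distinct = list(dict.fromkeys(records))

# ORDER ... BY marks DESC: sort the data on one field.
ordered = sorted(distinct, key=lambda r: r[1], reverse=True)

# LIMIT 2: return only the first two rows.
top2 = ordered[:2]

# DUMP: display the result on the console.
for row in top2:
    print(row)

# STORE: save the result; a string buffer plays the role of an HDFS file.
sink = io.StringIO()
csv.writer(sink, lineterminator="\n").writerows(top2)
stored = sink.getvalue()
```

In real Pig each step would be one statement on a relation, and nothing runs until a DUMP or STORE asks for output.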
🏗 Architecture of Hive:
1. User Interfaces:
Used to interact with Hive.
Examples:
• Web UI
• Hive Command Line
• HDInsight
2. Metastore:
• Stores metadata (info about tables, columns, data types).
• Helps Hive know where and how the data is stored in HDFS.
3. HiveQL Process Engine:
• Receives queries written in HiveQL.
• Checks the syntax and passes the query to the execution engine.
4. Execution Engine:
• Converts queries into MapReduce jobs.
• Executes them on the Hadoop cluster.
5. HDFS or HBase Storage:
• Hive stores actual data in HDFS or HBase.
• It just processes queries over this stored data.
Hive lets you run SQL-like queries on big data stored in HDFS. It uses components like
Metastore, HiveQL engine, and Execution engine to turn your queries into results.
Hive converts your SQL-like query into MapReduce jobs, runs them on Hadoop, reads the
results from HDFS, and returns the answer. It works like a translator between SQL and big
data.
6. Start Hive
• Type hive in terminal to open Hive shell and start writing HiveQL queries.
✅ Hive Shell :
Hive Shell is a command-line tool where we write and run Hive queries.
• It looks like a terminal screen where we type HiveQL commands.
• It is used to create tables, load data, and run queries on big data stored in HDFS.
📝 Example:
You open Hive shell by typing hive in the terminal. Then you can write:
SELECT * FROM student;
✅ Hive Services :
Hive has several services that help it work smoothly. Main services are:
1. Driver
Manages query execution and keeps track of its progress.
2. Compiler
Checks your Hive query and converts it into a MapReduce job.
3. Metastore
Stores information (metadata) about Hive tables like names, columns, types, etc.
4. Execution Engine
Runs the query and fetches the result using MapReduce.
✅ 1. What is HiveQL?
HiveQL (Hive Query Language) is a SQL-like language used to interact with Hive.
It helps to create tables, insert data, and run queries on large datasets stored in HDFS.
📌 Example:
SELECT name FROM students WHERE marks > 80;
✅ 6. Sorting in Hive
• Sorting means arranging data in ascending or descending order.
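• Done using ORDER BY (one fully sorted result) or SORT BY (sorts the rows inside each reducer, so the overall output is only partly sorted).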
📌 Example: SELECT * FROM student ORDER BY marks DESC;
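The difference between ORDER BY and SORT BY is easier to see with a tiny Python sketch. In Hive, ORDER BY gives one globally sorted result, while SORT BY only sorts the rows that each reducer receives. The marks and the split into two "reducers" below are assumptions made for illustration.

```python
# Hypothetical marks; the two-way split below is also just an assumption.
marks = [55, 91, 72, 85, 60, 78]

# ORDER BY marks DESC: one global sort over all rows.
order_by = sorted(marks, reverse=True)

# SORT BY marks DESC: each reducer sorts only the rows it received.
reducers = [marks[0::2], marks[1::2]]  # naive round-robin partitioning
sort_by = [sorted(part, reverse=True) for part in reducers]

# Reading the reducer outputs one after another is not globally sorted.
combined = [m for part in sort_by for m in part]
```

Here `order_by` is fully sorted, while `combined` is sorted only inside each reducer's part.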
✅ 7. Aggregating in Hive
Aggregation means using functions like COUNT, SUM, AVG, MAX, MIN to summarize data.
📌 Example: SELECT AVG(marks) FROM student;
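As a quick check of what these functions return, the same five aggregates can be computed in plain Python. The marks are made-up sample values.

```python
# Hypothetical marks column from a student table.
marks = [72, 91, 85, 60]

summary = {
    "COUNT": len(marks),            # number of rows
    "SUM": sum(marks),              # total of all marks
    "AVG": sum(marks) / len(marks), # average marks
    "MAX": max(marks),              # highest marks
    "MIN": min(marks),              # lowest marks
}
```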
✅ 8. Joins in Hive
Joins are used to combine rows from two or more tables based on a related column.
📌 Types:
INNER JOIN – returns matching rows
LEFT OUTER JOIN – returns all from left + match from right
RIGHT OUTER JOIN – returns all from right + match from left
FULL OUTER JOIN – all rows from both tables
Example :
SELECT s.name, m.marks
FROM students s
JOIN marks m ON s.id = m.student_id;
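The four join types can be imitated in Python for intuition. This is only a sketch: the rows are made up, and the dictionary lookups assume each id appears once per table.

```python
# Hypothetical tables: students(id, name) and marks(student_id, marks).
students = [(1, "Asha"), (2, "Ravi"), (3, "Meena")]
marks = [(1, 72), (2, 91), (4, 60)]

marks_by_id = dict(marks)     # student_id -> marks
names_by_id = dict(students)  # id -> name

# INNER JOIN: only ids present in both tables.
inner = [(name, marks_by_id[sid]) for sid, name in students if sid in marks_by_id]

# LEFT OUTER JOIN: every student; marks is None when there is no match.
left_outer = [(name, marks_by_id.get(sid)) for sid, name in students]

# RIGHT OUTER JOIN: every marks row; name is None when there is no match.
right_outer = [(names_by_id.get(sid), m) for sid, m in marks]

# FULL OUTER JOIN: all left rows plus the right rows that had no match.
full_outer = left_outer + [(None, m) for sid, m in marks if sid not in names_by_id]
```

Meena has no marks row and student_id 4 has no student row, so they show up only in the outer joins, with None where Hive would return NULL.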
✅ 9. Subqueries in Hive
A subquery is a query inside another query.
It helps in filtering, grouping, or complex logic.
📌 Example:
SELECT name FROM student
WHERE marks > (SELECT AVG(marks) FROM student);
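The query above works in two steps: the inner query runs first and gives one value, then the outer query uses that value as its filter. A minimal Python sketch of the same logic, with made-up rows:

```python
# Hypothetical student(name, marks) rows.
student = [("Asha", 72), ("Ravi", 91), ("Meena", 85), ("Kiran", 60)]

# Inner query: SELECT AVG(marks) FROM student
avg_marks = sum(m for _, m in student) / len(student)

# Outer query: SELECT name FROM student WHERE marks > <inner result>
above_avg = [name for name, m in student if m > avg_marks]
```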
✅ What is HBase?
• HBase is a NoSQL database that runs on top of Hadoop.
• It is used to store and manage very large data (billions of rows) in a table format, just like an
Excel sheet — but distributed across many machines.
• It works well for real-time read and write of big data.
📌 Think of it as a giant Excel sheet spread across many computers!
✨ Features of HBase :
✅ Example :
Suppose you have an HBase table Student:
If you want to:
"Find the student whose Name = Priya"
➡️ This is slow because HBase can look up rows quickly only by row key. Name is a column value, so HBase must check each row one by one (called a full scan).
We can create a Secondary Index Table:
Now:
First, you search in the index table using "Priya" → it gives you 1003.
Then, go to the main table with 1003 → get full student data.
✅ Faster than full table scan.
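The two-step lookup can be sketched with Python dictionaries. This is not the HBase API. Only the pair "Priya" → 1003 comes from the example above; the other rows, the Course column, and the assumption that names are unique are made up.

```python
# Main table: row key -> row. Only 1003/Priya is from the notes.
student = {
    1001: {"Name": "Asha", "Course": "BCA"},
    1002: {"Name": "Ravi", "Course": "MCA"},
    1003: {"Name": "Priya", "Course": "B.Tech"},
}

# Secondary index table: Name -> row key of the main table.
name_index = {row["Name"]: key for key, row in student.items()}

def full_scan(name):
    """Check rows one by one; return (row key, rows checked)."""
    checked = 0
    for key, row in student.items():
        checked += 1
        if row["Name"] == name:
            return key, checked
    return None, checked

def indexed_lookup(name):
    """Step 1: index table gives the row key. Step 2: read the main table."""
    key = name_index.get(name)
    if key is None:
        return None, None
    return key, student[key]

scan_key, rows_checked = full_scan("Priya")
index_key, index_row = indexed_lookup("Priya")
```

Both paths find row 1003, but the scan had to touch every row before it, while the indexed path does two direct lookups no matter how big the table is. The cost is that the index table must be updated whenever the main table changes.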
✅ Short Note on ZooKeeper and Its Role in Monitoring a Cluster
ZooKeeper is a tool used in Hadoop and HBase systems to manage and coordinate
different machines (nodes) in a cluster.
It helps in:
• Tracking node status: ZooKeeper keeps an eye on which servers are active and which
are down.
• Leader election: If the main/master server fails, ZooKeeper helps to choose a new
leader automatically.
• Coordination: Nodes share small pieces of status and configuration through ZooKeeper, so the whole cluster stays in agreement.
• Failure recovery: When a server fails, ZooKeeper informs the system so it can recover
quickly.
• ZooKeeper makes sure that the cluster runs smoothly, with less downtime and better
coordination.
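Node tracking and leader election can be pictured with a toy Python model. This is not the ZooKeeper API: the class, its method names, and the node names are all hypothetical. It follows one common election idea, where each node gets a sequence number when it registers and the live node with the lowest number is the leader.

```python
class ToyCoordinator:
    """Toy in-memory stand-in for a coordination service (not ZooKeeper itself)."""

    def __init__(self):
        self._next_seq = 0
        self._live = {}  # node name -> sequence number given at registration

    def register(self, node):
        # A node joins the cluster and receives the next sequence number.
        self._live[node] = self._next_seq
        self._next_seq += 1

    def mark_down(self, node):
        # The node failed or lost its session, so it is no longer live.
        self._live.pop(node, None)

    def live_nodes(self):
        # Live nodes, oldest registration first.
        return sorted(self._live, key=self._live.get)

    def leader(self):
        # The live node with the lowest sequence number leads.
        nodes = self.live_nodes()
        return nodes[0] if nodes else None


cluster = ToyCoordinator()
for name in ["master-1", "master-2", "master-3"]:
    cluster.register(name)

first_leader = cluster.leader()
cluster.mark_down("master-1")   # the current leader fails
new_leader = cluster.leader()   # the next live node takes over
```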
✅ IBM Big Data Strategy :
IBM's Big Data strategy focuses on helping businesses use their data in a smart way to make
better decisions, faster.
IBM believes that Big Data is not just about collecting a lot of data, but about using that
data to get useful insights.
✅ Key Points of IBM’s Big Data Strategy:
1. Volume, Variety, Velocity:
IBM handles all types of data – big in size, different in format (text, video, etc.), and
coming at high speed.
2. Unified Platform:
IBM provides a complete platform where you can store, manage, analyze, and visualize
your data in one place.
3. InfoSphere BigInsights:
IBM offers this tool to process and analyze Big Data using Hadoop technology.
4. Big SQL:
You can use SQL queries to analyze big data easily, even if it’s stored in Hadoop.
5. Security and Governance:
IBM ensures that data is safe, secure, and managed properly, with proper rules.
6. Integration with AI and Cloud:
IBM connects Big Data with AI (Watson) and Cloud to provide real-time intelligence
and smart decisions.
📌 In Easy Words:
BigInsights is IBM’s software that adds more power and features to Hadoop for better
big data processing.
✅ 3. BigSheets
BigSheets is a tool in BigInsights that looks like Excel but works on Big Data.
• It allows users to analyze large datasets without coding.
• You can filter, sort, group, and visualize big data using an easy spreadsheet-style interface.
• Great for business users who don’t know programming.
📌 In Easy Words:
BigSheets is like Excel for Big Data. It helps non-technical people explore and analyze big data
in a simple way.
✅ What is BigSQL?
BigSQL is a tool by IBM that lets you use SQL queries to work with Big Data stored in Hadoop.
• We use SQL for normal databases (like MySQL, Oracle).
• With BigSQL, we can write the same SQL queries to read data from Hadoop (HDFS), Hive, or
HBase.
📌 In Easy Words:
BigSQL helps you use familiar SQL language to work with huge data stored in big data systems
like Hadoop.
✅ Key Features of BigSQL
• ✅ Works with standard SQL
• ✅ Can access data from Hive, HDFS, HBase
• ✅ Faster and more efficient than using Hive alone
• ✅ Supports joins, subqueries, sorting, grouping
• ✅ Provides security and governance features