BDA LAB RECORD
EXP NO:01 Downloading and Installing Hadoop
Date:
Aim:
To download and install Hadoop 3.2.1 on Windows 10 and configure it as a single node cluster.
Procedure:
1. Prerequisites
First, we need to make sure that the following prerequisites are installed:
1. Java 8 runtime environment (JRE): Hadoop 3 requires a Java 8 installation.
The first step is to download the Hadoop binaries from the official website (the binary package hadoop-3.2.1.tar.gz).
After finishing the file download, we should unpack the package using 7zip in two steps. First, we should extract the hadoop-3.2.1.tar.gz archive; then, we should unpack the resulting tar file into the installation directory (E:\hadoop-env\hadoop-3.2.1 in this guide).
The tar file extraction may take some minutes to finish. In the end, you may see some warnings about symbolic link creation; just ignore these warnings since they are not relevant on Windows.
Since we are installing Hadoop 3.2.1, we should also download the matching Windows native binaries (winutils) and copy them into the "hadoop-3.2.1\bin" directory.
After installing Hadoop and its prerequisites, we should configure the environment variables that define the Java and Hadoop default paths (JAVA_HOME and HADOOP_HOME).
To edit environment variables, go to Control Panel > System and Security > System (or right-click > Properties on the My Computer icon) and click on the "Advanced system settings" link.
In the "Environment Variables" dialog, press the "New" button to add a new variable.
Note: In this guide, we will add user variables since we are configuring Hadoop for a single
user. If you are looking to configure Hadoop for multiple users, you can define System
variables instead.
Now, we should edit the PATH variable to add the Java and Hadoop binaries paths, as shown below.
Now, let's open PowerShell and try to run the following command:
hadoop -version
In this example, since the JAVA_HOME path contains spaces, I received the following error:
JAVA_HOME is incorrectly set
After replacing "Program Files" with "Progra~1" (the short 8.3 form of the directory name), we closed and reopened PowerShell and tried the same command. As shown in the screenshot below, it runs without errors.
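As a quick optional sanity check, a few lines of Python can confirm that the variables are visible to new processes and that JAVA_HOME contains no spaces (this check is an extra illustration, not a required step):

import os

# Check the two variables configured above
for name in ("JAVA_HOME", "HADOOP_HOME"):
    value = os.environ.get(name)
    if value is None:
        print(f"{name} is not set")
    elif " " in value:
        # Paths with spaces (e.g. C:\Program Files\...) break the Hadoop scripts;
        # use the short 8.3 form such as C:\Progra~1\... instead
        print(f"{name} contains spaces: {value}")
    else:
        print(f"{name} looks fine: {value}")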
Once the environment variables are set, we should configure the Hadoop cluster by editing the following four configuration files:
1. %HADOOP_HOME%\etc\hadoop\hdfs-site.xml
2. %HADOOP_HOME%\etc\hadoop\core-site.xml
3. %HADOOP_HOME%\etc\hadoop\mapred-site.xml
4. %HADOOP_HOME%\etc\hadoop\yarn-site.xml
4.1. HDFS site configuration
As we know, Hadoop is built using a master-slave paradigm. Before altering the HDFS
configuration file, we should create a directory to store all master node (name node) data and
another one to store data (data node). In this example, we created the following directories:
E:\hadoop-env\hadoop-3.2.1\data\dfs\namenode
E:\hadoop-env\hadoop-3.2.1\data\dfs\datanode
Now, we should add the following properties into the <configuration></configuration> element of the hdfs-site.xml file:
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///E:/hadoop-env/hadoop-3.2.1/data/dfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///E:/hadoop-env/hadoop-3.2.1/data/dfs/datanode</value>
</property>
Note that we have set the replication factor to 1 since we are creating a single node cluster.
Now, we should configure the name node URL in the <configuration></configuration> element of the core-site.xml file (in this setup, the default file system points to hdfs://localhost:9000).
In the same way, the MapReduce framework and YARN settings should be added into the <configuration></configuration> elements of the mapred-site.xml and yarn-site.xml files.
After finishing the configuration, let's try to format the name node using the following command:
hdfs namenode -format
Due to a bug in the Hadoop 3.2.1 release, you will receive the following error:
2020-04-17 22:04:01,503 ERROR namenode.NameNode: Failed to start namenode.
java.lang.UnsupportedOperationException
    at java.nio.file.Files.setPosixFilePermissions(Files.java:2044)
    at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.clearDirectory(Storage.java:452)
    at org.apache.hadoop.hdfs.server.namenode.NNStorage.format(NNStorage.java:591)
    at org.apache.hadoop.hdfs.server.namenode.NNStorage.format(NNStorage.java:613)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.format(FSImage.java:188)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:1206)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1649)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1759)
2020-04-17 22:04:01,511 INFO util.ExitUtil: Exiting with status 1: java.lang.UnsupportedOperationException
2020-04-17 22:04:01,518 INFO namenode.NameNode: SHUTDOWN_MSG:
This issue will be solved within the next release. For now, you can fix it temporarily by replacing the hadoop-hdfs-3.2.1.jar file located in %HADOOP_HOME%\share\hadoop\hdfs with a patched build of the same jar: rename the original jar in %HADOOP_HOME%\share\hadoop\hdfs as a backup, then copy the patched jar into that same directory.
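A small, hypothetical sketch of that rename-and-copy step in Python is shown below; the location of the downloaded patched jar is an assumption and must be adjusted to wherever you saved the file:

import os
import shutil

hadoop_home = os.environ.get("HADOOP_HOME", r"E:\hadoop-env\hadoop-3.2.1")
hdfs_lib_dir = os.path.join(hadoop_home, "share", "hadoop", "hdfs")
patched_jar = r"E:\downloads\hadoop-hdfs-3.2.1.jar"   # assumed path of the patched jar

# Keep the original jar as a backup, then drop in the patched one
original_jar = os.path.join(hdfs_lib_dir, "hadoop-hdfs-3.2.1.jar")
shutil.move(original_jar, original_jar + ".bak")
shutil.copy(patched_jar, hdfs_lib_dir)
print("Replaced", original_jar)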
Now, if we re-execute the format command (run the command prompt or PowerShell as administrator), the name node is formatted successfully.
We can then start the HDFS service (name node and data node) using the ./start-dfs.cmd script; its output is as follows:
Next, we must start the Hadoop Yarn service using the following command:
./start-yarn.cmd
Two command prompt windows will open (one for the resource manager and one for the node
manager) as follows:
Figure 20— Node manager and Resource manager command prompt windows
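As an optional extra check, the Hadoop web UIs can be probed to confirm that the services are up; the port numbers below are assumptions based on the Hadoop 3.x defaults (9870 for the name node, 8088 for the resource manager):

import urllib.request

# Default web UI ports for Hadoop 3.x; adjust if you changed them
endpoints = {
    "HDFS NameNode UI": "http://localhost:9870",
    "YARN ResourceManager UI": "http://localhost:8088",
}

for name, url in endpoints.items():
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            print(f"{name}: reachable (HTTP {response.status})")
    except Exception as error:
        print(f"{name}: not reachable ({error})")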
Result:
Thus Hadoop was downloaded, installed, and configured on Windows, and the HDFS and YARN services were successfully executed.
EXP NO:02 Hadoop Implementation of file management tasks, such as adding files and directories, retrieving files and deleting files
AIM:-
To implement the following file management tasks in Hadoop:
Adding files and directories
Retrieving files
Deleting Files
DESCRIPTION:-
HDFS is a scalable distributed filesystem designed to scale to petabytes of data while running on top of the underlying filesystem of the operating system. HDFS keeps track of where the data resides in a network by associating the name of its rack (or network switch) with the dataset. This allows Hadoop to efficiently schedule tasks to those nodes that contain data, or which are nearest to it, optimizing bandwidth utilization. Hadoop provides a set of command line utilities that work similarly to the Linux file commands and serve as your primary interface with HDFS. We're going to have a look into HDFS by interacting with it from the command line.
We will take a look at the most common file management tasks in Hadoop, which include:
Adding files and directories to HDFS
Retrieving files from HDFS to the local filesystem
Deleting files from HDFS
ALGORITHM:-
SYNTAX AND COMMANDS TO ADD, RETRIEVE AND DELETE DATA FROM HDFS
Step-1: Adding Files and Directories to HDFS
Before you can run Hadoop programs on data stored in HDFS, you'll need to put the data into HDFS first. Let's create a directory and put a file in it. HDFS has a default working directory of /user/$USER, where $USER is your login username. This directory isn't automatically created for you, though, so let's create it with the mkdir command. For the purpose of illustration, we use chuck. You should substitute your username in the example commands.
hadoop fs -mkdir /user/chuck
hadoop fs -put example.txt
hadoop fs -put example.txt /user/chuck
Step-2: Retrieving Files from HDFS
The Hadoop get command copies files from HDFS back to the local filesystem, while the cat command displays a file's contents. To view example.txt, we can run the following command:
hadoop fs -cat example.txt
Step-3: Deleting Files from HDFS
hadoop fs -rm example.txt
The command for creating a directory in HDFS is "hdfs dfs -mkdir /lendicse".
Adding a directory is done through the command "hdfs dfs -put lendi_english/".
Step-4: Copying Data from NFS to HDFS
Copying from a local directory is done through the command "hdfs dfs -copyFromLocal /home/lendi/Desktop/shakes/glossary /lendicse/".
View the file by using the command "hdfs dfs -cat /lendi_english/glossary".
The command for listing items in Hadoop is "hdfs dfs -ls hdfs://localhost:9000/".
The command for deleting files is "hdfs dfs -rm -r /kartheek".
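The same file management tasks can also be scripted. The sketch below simply drives the hdfs dfs commands shown above from Python via subprocess; the paths are the illustrative ones used in this experiment:

import subprocess

def run(command):
    # shell=True lets the OS resolve hdfs/hdfs.cmd from the PATH
    print("$", command)
    subprocess.run(command, shell=True, check=True)

# Adding a directory and a file to HDFS
run("hdfs dfs -mkdir -p /user/chuck")
run("hdfs dfs -put example.txt /user/chuck")

# Retrieving (viewing) the file from HDFS
run("hdfs dfs -cat /user/chuck/example.txt")

# Deleting the file from HDFS
run("hdfs dfs -rm /user/chuck/example.txt")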
Output:
Result:
Thus the file management tasks in Hadoop, such as adding, retrieving and deleting files, have been successfully completed.
EXP NO:03 Implementation of Matrix Multiplication with Hadoop Map Reduce
Date:
Aim:
To implement Matrix Multiplication with Hadoop Map Reduce.
Algorithm:
- Create a list `input_data` that contains strings representing each element of the input matrices in the form `matrix,i,j,value`, where `matrix` identifies the matrix ('A' or 'B'), `i` is the row index, `j` is the column index, and `value` is the element value.
- For each element in `input_data`, split it into `matrix`, `i`, `j`, and `value`.
- If the element belongs to `matrix_A`, emit intermediate key-value pairs for all possible result columns (one pair per column of `matrix_B`).
- If the element belongs to `matrix_B`, emit intermediate key-value pairs for all possible result rows (one pair per row of `matrix_A`).
- Initialize an empty `result_matrix` with dimensions `rows_A` x `cols_B`, filled with zeros.
- For each element in the mapped data, split it into `i`, `j`, `matrix_id`, `index`, and `value`.
- If the element belongs to `matrix_A`, store its value for the corresponding result cell, keyed by its column index.
- If the element belongs to `matrix_B`, multiply its value with the matching `matrix_A` value and add the partial product to the corresponding result cell.
- Call the `mapper` function with the `input_data` list as input to obtain the mapped data.
- Call the `reducer` function with the mapped data as input to obtain the reduced data.
- For each element in the reduced data, split it into `i`, `j`, and `value`.
- Create an empty `result_matrix` with dimensions `rows_A` x `cols_B`, filled with zeros.
- For each element in `results`, set the corresponding element in `result_matrix` to the value.
9. Print the result:
- Print each row in `result_matrix` to display the final matrix multiplication result.
Program:
# Input matrices in Python lists format (Replace these with your actual matrices)
matrix_A = [
    [1, 2],
    [3, 4]
]
matrix_B = [
    [5, 6],
    [7, 8]
]
rows_A, cols_A = len(matrix_A), len(matrix_A[0])
rows_B, cols_B = len(matrix_B), len(matrix_B[0])
if cols_A != rows_B:
    raise ValueError("Number of columns in matrix A must be equal to the number of rows in matrix B.")
# Represent every element as a "matrix,i,j,value" record, as it would appear in an input file
input_data = [
    f"A,{i},{j},{matrix_A[i][j]}" for i in range(rows_A) for j in range(cols_A)
] + [
    f"B,{i},{j},{matrix_B[i][j]}" for i in range(rows_B) for j in range(cols_B)
]

# Mapper script for matrix multiplication
def mapper(input_data):
    for record in input_data:
        matrix, i, j, value = record.strip().split(',')
        if matrix == 'A':
            # A[i][j] contributes to every result cell (i, k)
            for k in range(cols_B):
                yield f"{i},{k},{matrix},{j},{value}"
        else:
            # B[i][j] contributes to every result cell (k, j)
            for k in range(rows_A):
                yield f"{k},{j},{matrix},{i},{value}"

# Reducer script for matrix multiplication
def reducer(mapped_data):
    # Group the partial values of each result cell (i, j) by matrix and index
    cells = {}
    for record in mapped_data:
        i, j, matrix_id, index, value = record.strip().split(',')
        cell = cells.setdefault((i, j), {'A': {}, 'B': {}})
        cell[matrix_id][index] = float(value)
    # Multiply matching A and B values and sum them for every result cell
    for (i, j), cell in cells.items():
        value = sum(a * cell['B'].get(k, 0.0) for k, a in cell['A'].items())
        yield f"{i},{j},{value}"

# Execute the Map and Reduce phases
mapped_data = list(mapper(input_data))
reduced_data = reducer(mapped_data)

# Collect the results into an empty result matrix
result_matrix = [[0.0 for _ in range(cols_B)] for _ in range(rows_A)]
for line in reduced_data:
    i, j, value = line.strip().split(',')
    result_matrix[int(i)][int(j)] = float(value)
# Print the result
for row in result_matrix:
    print(row)
Output:
[19.0, 22.0]
[43.0, 50.0]
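To cross-check the MapReduce result, the same product can be computed directly with a plain triple loop; this verification sketch is not part of the MapReduce job itself:

def multiply(A, B):
    # Straightforward in-memory matrix multiplication for verification
    rows, inner, cols = len(A), len(B), len(B[0])
    C = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                C[i][j] += A[i][k] * B[k][j]
    return C

print(multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19.0, 22.0], [43.0, 50.0]]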
Result:
Thus the Matrix Multiplication program with Hadoop Map Reduce was executed successfully.
EXP NO:04 Run a basic Word Count Map Reduce program to understand Map Reduce Paradigm
Date:
Aim:
To run a basic Word Count Map Reduce program to understand Map Reduce Paradigm
Algorithm:
1. Initialize the input text data `text_data` (Replace this with your own text data).
- For each word in the `text_data`, create a key-value pair where the word is the key and the value is a count of 1.
- For each (word, count) pair in the mapped data, accumulate the count per word in a `word_count_dict` dictionary.
- Return a list of tuples containing the word and its corresponding count from the `word_count_dict`.
4. Execute the Map phase (Mapper):
- Call the `mapper` function with the `text_data` as input to obtain the mapped data.
5. Execute the Reduce phase (Reducer):
- Call the `reducer` function with the mapped data as input to obtain the reduced data.
6. Print the results:
- For each word and its count in the reduced data, print the word and count in the format:
"{word}: {count}".
Program:
# Input text data (Replace this with your own text data)
text_data = '''
MapReduce is a programming paradigm for processing and generating large datasets with a parallel, distributed algorithm on a cluster.
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
Map takes a set of data and converts it into another set of data, where individual elements are broken down into key-value pairs.
The Reduce task takes the output of the Map as an input and combines those data tuples into a smaller set of tuples.
'''
# Mapper: emit a (word, 1) pair for every word in the text
def mapper(text_data):
    words = text_data.strip().split()
    word_count_pairs = [(word, 1) for word in words]
    return word_count_pairs

# Reducer: sum the counts for every word
def reducer(mapped_data):
    word_count_dict = {}
    for word, count in mapped_data:
        if word in word_count_dict:
            word_count_dict[word] += count
        else:
            word_count_dict[word] = count
    return word_count_dict.items()

# Execute the Map and Reduce phases
mapped_data = mapper(text_data)
reduced_data = reducer(mapped_data)

# Print each word and its count
for word, count in reduced_data:
    print(f"{word}: {count}")
Output:
MapReduce: 2
is: 1
a: 5
programming: 1
paradigm: 1
for: 1
processing: 1
and: 4
generating: 1
large: 1
datasets: 1
with: 1
parallel,: 1
distributed: 1
algorithm: 2
on: 1
cluster.: 1
The: 2
contains: 1
two: 1
important: 1
tasks,: 1
namely: 1
Map: 3
Reduce.: 1
takes: 2
set: 3
of: 4
data: 2
converts: 1
it: 1
into: 3
another: 1
data,: 1
where: 1
individual: 1
elements: 1
are: 1
broken: 1
down: 1
key-value: 1
pairs.: 1
Reduce: 1
task: 1
the: 2
output: 1
as: 1
an: 1
input: 1
combines: 1
those: 1
tuples: 1
smaller: 1
tuples.: 1
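The same counts can be cross-checked with Python's built-in collections.Counter; the snippet below is a verification sketch meant to be appended to the program above, reusing the same text_data and whitespace splitting as the mapper:

from collections import Counter

# Count the words exactly as the mapper/reducer pair does
counts = Counter(text_data.strip().split())
for word, count in counts.items():
    print(f"{word}: {count}")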
Result:
Thus a basic Word Count Map Reduce program was executed successfully.
EXP NO:05 Installation of Hive along with practice examples.
Date:
Aim:
To install Apache Hive on Windows along with practice examples.
Procedure:
1.1. 7zip
In order to extract tar.gz archives, you should install the 7zip tool.
1.2. Hadoop cluster
To install Apache Hive, you must have a Hadoop cluster installed and running: you can refer to the previously published step-by-step guide to install Hadoop 3.2.1 on Windows 10.
1.3. Apache Derby
In addition, Apache Hive requires a relational database to create its Metastore (where all metadata will be stored). In this guide, we will use the Apache Derby database.
Since we have Java 8 installed, we must install the Apache Derby 10.14.2.0 version (the latest release compatible with Java 8).
Once downloaded, we must extract the content of the db-derby-10.14.2.0-bin.tar.gz archive twice (using 7zip: the first time we extract the .tar.gz file, the second time we extract the .tar file) into the desired installation directory. Since in the previous guide we installed Hadoop within the "E:\hadoop-env\" directory, we will extract Derby into the "E:\hadoop-env\db-derby-10.14.2.0\" directory.
1.4. Cygwin
Since some Hive 3.1.2 tools (such as schematool) are not compatible with Windows, we will need the Cygwin tool to run some Linux commands.
2. Downloading Apache Hive
From the Apache Hive downloads page, we should download the apache-hive-3.1.2-bin.tar.gz file.
When the file download is complete, we should extract (twice, as mentioned above) the content of the apache-hive-3.1.2-bin.tar.gz archive into the installation directory, since we decided to use "E:\hadoop-env\" as the installation directory for all technologies used in the previous guide.
3. Setting environment variables
After extracting the files, we should define the following user environment variables:
HIVE_HOME: "E:\hadoop-env\apache-hive-3.1.2\"
DERBY_HOME: "E:\hadoop-env\db-derby-10.14.2.0\"
HIVE_LIB: "%HIVE_HOME%\lib"
HIVE_BIN: "%HIVE_HOME%\bin"
HADOOP_USER_CLASSPATH_FIRST: "true"
Figure 5 — Adding HIVE_HOME user variable
Now, we should edit the Path user variable to add the following paths:
%HIVE_BIN%
%DERBY_HOME%\bin
Figure 6 — Editing path environment variable
4. Configuring Hive
First, we should copy all the *.jar files from the Derby libraries directory (E:\hadoop-env\db-derby-10.14.2.0\lib). Then, we should paste them within the Hive libraries directory (E:\hadoop-env\apache-hive-3.1.2\lib).
Figure 8 — Paste Derby libraries within Hive libraries directory
Next, within the Hive configuration directory (E:\hadoop-env\apache-hive-3.1.2\conf), we should create a new file "hive-site.xml" and paste the following XML code within this file:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby://localhost:1527/metastore_db;create=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.ClientDriver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>hive.server2.enable.doAs</name>
    <value>true</value>
    <description>Enable user impersonation for HiveServer2</description>
  </property>
  <property>
    <name>hive.server2.authentication</name>
    <value>NONE</value>
    <description>Client authentication types. NONE: no authentication check; LDAP: LDAP/AD based authentication; KERBEROS: Kerberos/GSSAPI authentication; CUSTOM: custom authentication provider (use with property hive.server2.custom.authentication.class)</description>
  </property>
  <property>
    <name>datanucleus.autoCreateTables</name>
    <value>True</value>
  </property>
</configuration>
5. Starting Services
To start Apache Hive, open the command prompt utility as administrator. Then, start the
Hadoop services using start-dfs and start-yarn commands (as illustrated in the Hadoop
installation guide).
Then, we should start the Derby network server on the localhost using the following
command:
E:\hadoop-env\db-derby-10.14.2.0\bin\StartNetworkServer -h 0.0.0.0
Now, let's open a command prompt tool and go to the Hive binaries directory (E:\hadoop-env\apache-hive-3.1.2\bin). The Hive 3.1.2 release does not ship the Windows *.cmd scripts needed to run Hive from the command prompt (these were only provided with the Hive 2.x versions). To get things working, we should download the necessary *.cmd files and place them under the corresponding Hive directories.
You can download all *.cmd files from the following GitHub repository:
https://github.com/HadiFadl/Hive-cmd
Now, if we try to execute the "hive" command, we will receive the following error:
Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1338)
at org.apache.hadoop.mapred.JobConf.setJar(JobConf.java:518)
at org.apache.hadoop.mapred.JobConf.setJarByClass(JobConf.java:536)
at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:430)
at org.apache.hadoop.hive.conf.HiveConf.initialize(HiveConf.java:5141)
at org.apache.hadoop.hive.conf.HiveConf.<init>(HiveConf.java:5104)
at org.apache.hive.beeline.HiveSchemaTool.<init>(HiveSchemaTool.java:96)
at org.apache.hive.beeline.HiveSchemaTool.main(HiveSchemaTool.java:1473)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:318)
at org.apache.hadoop.util.RunJar.main(RunJar.java:232)
This error is thrown due to a Bug mentioned in the following Hive issue link: HIVE-22718.
As mentioned in the comments, this issue can be solved by replacing the guava-19.0.jar stored in the Hive libraries directory (E:\hadoop-env\apache-hive-3.1.2\lib) with the newer guava jar found in "E:\hadoop-env\hadoop-3.2.1\share\hadoop\hdfs\lib".
Note: This file is also uploaded to the GitHub repository mentioned above.
Now, if we run hive command again, then Apache Hive will start successfully.
7. Initializing Hive
After ensuring that Apache Hive has started successfully, we may still not be able to run any HiveQL command. This is because the Metastore is not initialized yet; besides, the HiveServer2 service must be running.
To initialize the Metastore, we need to use the schematool utility, which is not compatible with Windows. To solve this problem, we will use the Cygwin utility, which allows executing Linux commands. First, we should create the following two directories:
E:\cygdrive
C:\cygdrive
Now, open the command prompt as administrator and execute the following commands:
mklink /J E:\cygdrive\e\ E:\
mklink /J C:\cygdrive\c\ C:\
These symbolic links are needed to work with Cygwin utility properly since Java may cause
some problems.
Open Cygwin utility and execute the following commands to define the environment
variables:
export HADOOP_HOME='/cygdrive/e/hadoop-env/hadoop-3.2.1'
export PATH=$PATH:$HADOOP_HOME/bin
export HIVE_HOME='/cygdrive/e/hadoop-env/apache-hive-3.1.2'
export PATH=$PATH:$HIVE_HOME/bin
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HIVE_HOME/lib/*.jar
We can add these lines to the "~/.bashrc" file so that we don't need to write them each time we open Cygwin.
After defining these variables, we should initialize the Metastore from Cygwin by running the schematool utility (for the Derby metastore: schematool -dbType derby -initSchema). We should leave the Derby server command prompt open, and open a new one where we should start the HiveServer2 service.
In the project we are working on, we need to execute HiveQL statement from SQL Server
Integration Services which can access Hive from the WebHCat server.
To start the WebHCat server, we should open the Cygwin utility and execute the following
command:
$HIVE_HOME/hcatalog/sbin/webhcat_server.sh start
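Once HiveServer2 is up, the practice examples can also be driven from Python. The sketch below is only an illustration: it assumes the PyHive package is installed, that HiveServer2 listens on the default port 10000 with the NONE authentication configured above, and the table name is made up for the exercise:

from pyhive import hive   # pip install pyhive

# Connect to HiveServer2 (authentication is NONE, as set in hive-site.xml)
connection = hive.connect(host="localhost", port=10000)
cursor = connection.cursor()

# Create a small practice table, add two rows, and read them back
cursor.execute("CREATE TABLE IF NOT EXISTS students (id INT, name STRING)")
cursor.execute("INSERT INTO TABLE students VALUES (1, 'John'), (2, 'Jane')")
cursor.execute("SELECT * FROM students")
for row in cursor.fetchall():
    print(row)

connection.close()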
Output:
Result:
Thus the installation of Hive along with practice examples was completed successfully.
EXP NO:06 Installation of HBase, installing Thrift along with practice examples
Date:
Aim:
To install HBase and Thrift along with practice examples.
Procedure:
Prerequisite
Install Java JDK - You can download it from this link: https://www.oracle.com/java/technologies/downloads/
The Java Development Kit (JDK) is a cross-platform software development environment
that includes tools and libraries for creating Java-based software applications and
applets.
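To confirm that the JDK is installed and available on the PATH, an optional quick check can be run from Python:

import subprocess

# 'java -version' prints the installed Java version (to stderr) if the JDK is on the PATH
subprocess.run("java -version", shell=True, check=True)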
Output:
Result:
EXP NO:07 Practice importing and exporting data from various databases
Date:
Aim:
To practice importing and exporting data from various databases such as Cassandra, HDFS, Hive, and HBase.
Procedure:
To practice importing and exporting data from various databases, you'll need to set up the
required software and databases on your system. Below are the software requirements and
steps to get started with importing and exporting data from each database:
1. Cassandra:
- Install Cassandra on your system. You can download it from the Apache Cassandra
website and follow the installation instructions for your operating system.
- Use the `cqlsh` command-line tool to interact with Cassandra and create keyspaces, tables, and data.
2. Hadoop:
- Install Hadoop on your system. You can download it from the Apache Hadoop website and follow the installation instructions for your operating system.
- Use the `hadoop fs` command-line tool to interact with HDFS and perform file operations.
3. Java:
- Install Java Development Kit (JDK) on your system. You can download it from the Oracle
or OpenJDK website and follow the installation instructions for your operating system.
4. Pig:
- Install Pig on your system. You can download it from the Apache Pig website and follow the installation instructions for your operating system.
5. Hive:
- Install Hive on your system. You can download it from the Apache Hive website and follow the installation instructions for your operating system.
- Start the Hive service using the `hive` command in the terminal.
6. HBase:
- Install HBase on your system. You can download it from the Apache HBase website and follow the installation instructions for your operating system.
Once you have set up the required software and services, you can start practicing importing
and exporting data from these databases. Here are some common tasks to practice:
- Import data into Cassandra from CSV files or other data sources using `cqlsh` or Python's
`cassandra-driver`.
- Export data from Cassandra to CSV files or other formats using `cqlsh` or Python's
`cassandra-driver`.
- Import data into HDFS from local files or other data sources using the `hadoop fs`
command.
- Export data from HDFS to local files or other formats using the `hadoop fs` command.
- Use Pig to perform data transformations and analysis on data stored in HDFS.
- Create Hive tables to manage structured data and perform SQL-like queries on the data.
- Import data into Hive tables from CSV files or other data sources.
- Import data into HBase from CSV files or other data sources using Java or Python.
- Export data from HBase to CSV files or other formats using Java or Python.
Program:
from cassandra.cluster import Cluster   # pip install cassandra-driver
from hdfs import InsecureClient         # pip install hdfs
from pyhive import hive                 # pip install pyhive

# Connect to Cassandra and insert sample rows
cluster = Cluster(['127.0.0.1'])
session = cluster.connect()
session.execute("CREATE KEYSPACE IF NOT EXISTS test WITH replication = "
                "{'class': 'SimpleStrategy', 'replication_factor': 1}")
session.execute("CREATE TABLE IF NOT EXISTS test.users (id uuid PRIMARY KEY, name text, age int)")
session.execute("INSERT INTO test.users (id, name, age) VALUES (uuid(), 'John Doe', 30)")
session.execute("INSERT INTO test.users (id, name, age) VALUES (uuid(), 'Jane Smith', 25)")
cluster.shutdown()

# Connect to HDFS over WebHDFS (adjust the URL and user to your cluster)
hdfs_client = InsecureClient('http://localhost:9870', user='your_username')
hdfs_client.upload('/user/your_username/data', 'local_file.txt')
hdfs_client.download('/user/your_username/data/local_file.txt', 'downloaded_file.txt')
files = hdfs_client.list('/user/your_username/data')
print(files)

# Connect to Hive (HiveServer2), create a table and query it
conn = hive.connect(host='localhost', port=10000)
cursor = conn.cursor()
cursor.execute("CREATE TABLE IF NOT EXISTS users (id INT, name STRING, age INT)")
cursor.execute("SELECT * FROM users")
results = cursor.fetchall()
print(results)
conn.close()
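For the HBase import and export tasks listed in the procedure, one possible approach from Python is the happybase package. This is a hedged sketch rather than part of the program above: it assumes the HBase Thrift server is running on localhost, and the 'cf' column family and the file names are illustrative:

import csv
import happybase   # pip install happybase; requires the HBase Thrift server to be running

connection = happybase.Connection('localhost')

# Create the 'products' table with a single column family if it does not exist yet
if b'products' not in connection.tables():
    connection.create_table('products', {'cf': dict()})
table = connection.table('products')

# Import: load every row of products.csv into HBase, keyed by product_id
with open('products.csv') as infile:
    for row in csv.DictReader(infile):
        table.put(row['product_id'].encode(), {
            b'cf:product_name': row['product_name'].encode(),
            b'cf:category': row['category'].encode(),
            b'cf:price': row['price'].encode(),
        })

# Export: scan the table back out and write it to a CSV file
with open('products_export.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(['product_id', 'product_name', 'category', 'price'])
    for key, data in table.scan():
        writer.writerow([key.decode(), data[b'cf:product_name'].decode(),
                         data[b'cf:category'].decode(), data[b'cf:price'].decode()])

connection.close()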
Output:
1. Sample output for importing data into Cassandra:
Assume you have a CSV file named `users.csv` with the following data:
```
id,name,age
1,John Doe,30
2,Jane Smith,25
3,Bob Johnson,40
```
After importing the data into a Cassandra table named `users`, the output of a `SELECT * FROM users;` query in `cqlsh` looks like:
```
id | name | age
----+------------+-----
1 | John Doe | 30
2 | Jane Smith | 25
3 | Bob Johnson| 40
```
2. Sample output for exporting data from Cassandra:
After exporting the `users` table from Cassandra to a CSV file named `users_export.csv`, the exported file contains:
```
id,name,age
1,John Doe,30
2,Jane Smith,25
3,Bob Johnson,40
```
3. Sample output for importing data into HDFS:
Assume you have a local file named `data.txt` with the following data:
```
Hello, world!
```
After importing the `data.txt` file into HDFS at the path `/user/your_username/data.txt`, you can view its contents with `hadoop fs -cat /user/your_username/data.txt`:
```
Hello, world!
```
4. Sample output for exporting data from HDFS:
After exporting (downloading) the file from HDFS back to the local filesystem, the downloaded file contains:
```
Hello, world!
```
5. Sample output for using Pig for data transformations:
Assume you have a CSV file named `sales.csv` with the following data:
```
product,quantity,price
Apple,5,1.20
Banana,3,0.80
Orange,2,0.90
```
You can use Pig to load this data, compute the total value of each product (quantity times price), and store the results. A Pig Latin script along the following lines produces the output below:
```
sales = LOAD 'sales.csv' USING PigStorage(',') AS (product:chararray, quantity:int, price:double);
totals = FOREACH sales GENERATE product, quantity * price AS total;
STORE totals INTO 'sales_totals' USING PigStorage(',');
```
The stored results contain:
```
Apple,6.0
Banana,2.4
Orange,1.8
```
6. Sample output for using Hive to perform SQL-like queries:
Assume you have a Hive table named `employees` with the following data:
```
id,name,department,salary
1,John Doe,HR,50000
2,Jane Smith,Engineering,60000
3,Bob Johnson,Finance,55000
```
For example, a query such as:
```
SELECT name, salary FROM employees WHERE salary > 52000;
```
returns:
```
Jane Smith     60000
Bob Johnson    55000
```
7. Sample output for importing data into HBase:
Assume you have a CSV file named `products.csv` with the following data:
```
product_id,product_name,category,price
101,Widget A,Electronics,25.99
102,Widget B,Electronics,19.99
103,Widget C,Home,12.99
```
After importing the data into an HBase table named `products`, you can use the HBase shell to verify the imported rows:
```
hbase(main):001:0> scan 'products'
ROW COLUMN+CELL
3 row(s)
```
8. Sample output for exporting data from HBase:
After exporting the data from the `products` table in HBase to a CSV file, the exported file contains:
```
product_id,product_name,category,price
101,Widget A,Electronics,25.99
102,Widget B,Electronics,19.99
103,Widget C,Home,12.99
```
Result:
Thus, importing and exporting data from various databases was executed successfully.