2016 09 05 Raspberry Pi Hadoop Setup v1
Installing Hadoop 2.7.3, Hive 2.1.0, Scala 2.11.8, and Spark 2.0 on Raspberry Pi Cluster of 3 Nodes
NOTES
1. Please follow the instruction PARTS in order because the results are
   cumulative (i.e. PART I, then PART II, then PART III, then PART IV, then
   PART V). PARTS III, IV and V are optional.
2. I am using 3 Raspberry Pi 3 boards with an 8-port switch. They each have a
   32 GB micro SD card (you have to buy this separately) and a case (also
   bought separately). They also each come with 1 GB RAM (not upgradable).
   They also have wireless capability built-in, so you may try it without the
   8-port switch, but I'm choosing wired.
6. If you get stuck, you might try these websites for reference, though they
   seem to have errors:
   http://scn.sap.com/community/bi-platform/blog/2015/04/25/a-hadoop-data-lab-project-on-raspberry-pi--part-14
   http://scn.sap.com/community/bi-platform/blog/2015/05/03/a-haddop-data-lab-project-on-raspberry-pi--part-24
   http://scn.sap.com/community/bi-platform/blog/2015/07/10/a-hadoop-data-lab-project-on-raspberry-pi--part-44
   http://www.widriksson.com/raspberry-pi-hadoop-cluster/
   http://spark.apache.org/docs/latest/spark-standalone.html
Part I: Basic Setup
1. Download the Raspbian Jessie OS disk image. (I didn't use the lite
   version, though you could try it to save disk space; I'm not sure whether
   you would then have to install Java or other components that the lite
   version leaves out.)
2. Burn the disk image to a micro SD card using Win32 Disk Imager (Windows)
4. SSH into the Raspberry Pi using Putty (you have to find out what IP address
   it was given using a network scanning tool; I used one I put on my phone).
   The default username is "pi" and the password is "raspberry".
(to save in nano, press CTRL-X, then press y, and then hit enter; I won't
repeat this for nano editing in the future)
-Type "sudo nano /etc/hosts" and delete everything then enter the
following:
127.0.0.1 localhost
192.168.0.110 node1
Make sure that is all that is in that file and no other items exist
such as ipv6, etc.
-Type "sudo nano /etc/hostname" and make sure the file contains only:
node1
9. Type "java -version" and make sure you have the correct java version. I
am using java version "1.8.0_64" i.e. Java 8. If you don't have the
correct version, type "sudo apt-get install oracle-java8-jdk". You might
have multiple Java versions installed. You can use the command "sudo
update-alternatives --config java" to select the correct one.
10. Now, we set up a group and user that will be used for Hadoop. We also
make the user a superuser.
-Type "sudo addgroup hadoop"
-Type "sudo adduser --ingroup hadoop hduser"
-Type "sudo adduser hduser sudo"
11. Next, we create a RSA key pair so that the master node can log into
slave nodes through ssh without password. This will be used later when we
have multiple slave nodes.
-Type "su hduser"
-Type "mkdir ~/.ssh"
-Type "ssh-keygen -t rsa -P """
-Type "cat ~/.ssh/id_rsa.pub > ~/.ssh/authorized_keys"
-Verify by typing "ssh localhost"
13. Login as hduser and make sure you can access the Internet (note that
    Putty should now use 192.168.0.110 to access the raspberry pi).
-Type "ping www.cnn.com"
-Press CTRL-C when finished.
If you can't access the Internet something is wrong with your network setup
(probably you aren't hooked up to a router, you misspelled something, or your
Internet isn't working).
Part II: Hadoop 2.7.3 / YARN Installation: Single Node Cluster
http://apache.cs.utah.edu/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
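The download and extraction commands are not shown above; a typical sequence
(using the mirror URL above and unpacking to /opt/hadoop, the location the
rest of this guide assumes) looks like this:
-Type "cd ~"
-Type "wget http://apache.cs.utah.edu/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz"
-Type "sudo tar -xzf hadoop-2.7.3.tar.gz -C /opt"
-Type "sudo mv /opt/hadoop-2.7.3 /opt/hadoop"
-Type "sudo chown -R hduser:hadoop /opt/hadoop"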
8. Now that we have hadoop, we have to configure it before it can launch its
daemons (i.e. namenode, secondary namenode, datanode, resourcemanager, and
nodemanager). Make sure you are logged in as hduser.
-Type "su hduser"
10. Many configuration files for Hadoop and its daemons are located in the
    /opt/hadoop/etc/hadoop folder. We will edit some of these files for
    configuration purposes. Note, there are a lot of configuration parameters
    to explore.
    -Type "cd /opt/hadoop/etc/hadoop"
    -Type "sudo nano hadoop-env.sh"
    -Edit the existing JAVA_HOME line so it reads:
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
    -Edit the existing HADOOP_HEAPSIZE line (uncomment it if necessary) so it
     reads:
export HADOOP_HEAPSIZE=250
    The default is 1000 MB of heap per daemon launched by Hadoop, but we
    are dealing with the limited memory of the Raspberry Pi (1 GB).
11. Now we will edit the core Hadoop configuration in core-site.xml.
-Type "sudo nano core-site.xml"
-Add the following properties between the <configuration> and
</configuration> tags.
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/hdfs/tmp</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://node1:54310</value>
</property>
</configuration>
12. Now edit the hdfs (hadoop file system) configuration in hdfs-site.xml.
-Type "sudo nano hdfs-site.xml"
-Add the following properties between the <configuration> and
</configuration> tags.
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
13. Now edit the YARN (Yet Another Resource Negotiator) configuration in
yarn-site.xml.
-Type "sudo nano hdfs-site.xml"
-Add the following properties between the <configuration> and
</configuration> tags.
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>node1</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>1024</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>4</value>
</property>
</configuration>
14. Now edit the MapReduce configuration in mapred-site.xml. If the file does
    not exist yet, copy it from the template first.
    -Type "sudo cp mapred-site.xml.template mapred-site.xml" (only if needed)
    -Type "sudo nano mapred-site.xml"
    -Add the following properties between the <configuration> and
     </configuration> tags.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
18. Create a location for hdfs (see core-site.xml) and format hdfs.
-Type "sudo mkdir -p /hdfs/tmp"
-Type "sudo chown hduser:hadoop /hdfs/tmp"
-Type "sudo chmod 750 /hdfs/tmp"
-Type "hadoop namenode -format"
19. Start Hadoop (hdfs) and YARN (resource scheduler). Ignore any warning
    messages that may occur (most are harmless, e.g. the warning that the
    native-hadoop library could not be loaded, since the bundled native code
    was not built for the Pi's platform).
-Type "cd ~"
-Type "start-dfs.sh"
-Type "start-yarn.sh"
20. Check that the Hadoop and YARN daemons are running.
    -Type "jps"
    You should see NameNode, SecondaryNameNode, DataNode, ResourceManager,
    NodeManager, and Jps listed.
If you don't see ResourceManager and NodeManager, something is probably set
up incorrectly in .bashrc, yarn-site.xml, or mapred-site.xml.
21. You can test a calculation using the examples provided in the
    distribution. Here we put a local file into hdfs. Then we execute a Java
    program that counts the frequency of words in that file (now located on
    hdfs). Then we grab the output from hdfs and put it on the local computer.
-Type "hadoop fs -copyFromLocal /opt/hadoop/LICENSE.txt /license.txt"
-Type "hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-
examples-2.7.3.jar wordcount /license.txt /license-out.txt"
-Type "hadoop fs -copyToLocal /license-out.txt"
-Type "more ~/license-out.txt/part-r-00000"
Here you can see the output that counts the frequency of words in the
LICENSE.txt file.
22. You can view the setup in your Windows browser by following these URLs.
NAMENODE INFORMATION
http://192.168.0.110:50070
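The YARN resource manager also has a web UI (on its default port):
RESOURCE MANAGER INFORMATION
http://192.168.0.110:8088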
23. There are a lot of commands to explore (there are also "hdfs dfs"
    commands, which are the newer preferred way to work with the file system;
    the older "hadoop fs" form still works). Here are a few to try out:
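For example (these use the /license.txt and /license-out.txt paths from step
21; any hdfs path will do):
    -Type "hadoop fs -ls /" (list the root of hdfs)
    -Type "hadoop fs -cat /license-out.txt/part-r-00000" (print the wordcount
     output straight from hdfs)
    -Type "hadoop fs -rm -r /license-out.txt" (delete the output directory)
    -Type "hdfs dfs -ls /" (the same listing using the newer hdfs command)
    -Type "hdfs dfsadmin -report" (summary of hdfs capacity and datanodes)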
Part III: Hadoop 2.7.3 / YARN Installation: Multi-Node Cluster
1. On node1, login as hduser. Update the hosts file so it lists all three
   nodes.
   -Type "sudo nano /etc/hosts" and delete everything then enter the
    following:
127.0.0.1 localhost
192.168.0.110 node1
192.168.0.111 node2
192.168.0.112 node3
Make sure that is all that is in that file and no other items exist
such as ipv6, etc.
5. Now we will clone the single node we created onto 2 other SD cards for the
other two raspberry pis. Then we will change the configuration for each to
setup the cluster. Node 1 will be the master node. Nodes 2 and 3 will be
the slave nodes.
6. We will now copy the node1 32 GB micro SD card to the other two blank SD
cards.
-Unplug the raspberry pi from power.
-Remove the SD card from the raspberry pi.
-Using a micro SD card reader and Win32 Disk Imager, "READ" the SD card
to an .img file on your Windows computer (you can choose any name for the
.img file, like node1.img). Warning: this file will be approximately 32 GB,
so make sure you have room for it on your Windows computer.
-After the image is created, put your node1 micro SD card back into the
original raspberry pi. Get your other two blank micro SD cards for the
other two raspberry pis and "WRITE" the node1 image you just created to
them one at a time.
-After writing the images, put the micro SD cards back into their respective
raspberry pis and set them aside for now.
7. Now plug the raspberry pi you want for node2 into the network and power
   it up (it should be the only one attached to the network switch). Login
   to it as hduser using Putty.
8. Since this card is a clone of node1, change this pi's identity to node2:
   set its hostname to node2 and its static IP address to 192.168.0.111
   (mirroring the settings you used on node1), then reboot.
   -Type "sudo reboot"
9. Now plug the raspberry pi you want for node3 into the network and power
   it up (node2 and node3 should be the only ones attached to the network
   switch). Login to it as hduser using Putty.
10. As with node2, change this pi's identity to node3: set its hostname to
    node3 and its static IP address to 192.168.0.112, then reboot.
    -Type "sudo reboot"
11. Now attach node1 to the network switch and power it up. Login to node1
(192.168.0.110) using Putty as hduser. You should now see 192.168.0.110,
192.168.0.111, and 192.168.0.112 on your network.
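Because node2 and node3 were cloned from node1, the RSA key created in Part I
step 11 is already in their authorized_keys files, so passwordless ssh from
node1 should already work (Hadoop needs this). You can verify it as follows;
accept the host key fingerprint the first time and type "exit" to come back:
-Type "ssh node2"
-Type "exit"
-Type "ssh node3"
-Type "exit"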
node2
-Type "sudo nano masters"
-Edit the file so it only contains the following:
node1
-Type "sudo reboot"
24. Test Hadoop and YARN (see if the daemons are running on each node).
    -Type "jps"
    On node1 you should at least see NameNode, SecondaryNameNode, and
    ResourceManager; on node2 and node3 you should see DataNode and
    NodeManager.
25. You can view the setup in your Windows browser by following these URLs.
NAMENODE INFORMATION
http://192.168.0.110:50070
Part IV: Hive 2.1.0 Installation
1. Here we will install Hive on node1. Hive only needs to be installed on
the master node.
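The download and extraction commands are not shown above; a typical sequence,
assuming the apache-hive-2.1.0-bin package is unpacked to /opt/hive to mirror
the /opt layout used for Hadoop, looks like this:
-Type "cd ~"
-Type "wget https://archive.apache.org/dist/hive/hive-2.1.0/apache-hive-2.1.0-bin.tar.gz"
-Type "sudo tar -xzf apache-hive-2.1.0-bin.tar.gz -C /opt"
-Type "sudo mv /opt/apache-hive-2.1.0-bin /opt/hive"
-Type "sudo chown -R hduser:hadoop /opt/hive"
You would then add /opt/hive/bin to hduser's PATH in .bashrc, the same way as
for Hadoop.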
6. Log back into node1 as hduser. We shall now start up hdfs and yarn
services and make some directories.
-Type "start-dfs.sh"
-Type "start-yarn.sh"
8. On node1, you can start the hive command line interface (cli).
-Type "hive"
Part V: Spark 2.0 Installation
1. Here we will install Spark (Standalone Mode) on node1, node2, and node3.
   Then we will configure each node separately. node1 will be the master node
   for Spark, and node2 and node3 will be the slave nodes. Before we install
   Spark, we will install Scala on node1.
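The download and extraction steps for Spark are not shown above; a typical
sequence, assuming the spark-2.0.0-bin-hadoop2.7 package is unpacked to
/opt/spark (the location used in step 9 below), looks like this on each node:
-Type "cd ~"
-Type "wget https://archive.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz"
-Type "sudo tar -xzf spark-2.0.0-bin-hadoop2.7.tgz -C /opt"
-Type "sudo mv /opt/spark-2.0.0-bin-hadoop2.7 /opt/spark"
-Type "sudo chown -R hduser:hadoop /opt/spark"
(The Scala 2.11.8 installation on node1 is not shown here; note that the Spark
package already bundles the Scala runtime it needs, so a separate Scala
install is only required for compiling your own applications.)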
9. On node3 we configure Spark. Login to node3 as hduser.
-Type "cd ~"
-Type "sudo nano .bashrc"
Add the following to bottom of .bashrc file
export PATH=$PATH:/opt/spark/bin
-Type "sudo reboot"
12. You can check the spark monitoring website as well in a web browser at
http://192.168.0.110:8080.
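To confirm the cluster accepts work, you can open a Spark shell against the
standalone master and run a trivial job (the master URL assumes node1 and the
default port 7077):
-Type "/opt/spark/bin/spark-shell --master spark://node1:7077"
At the scala> prompt, type "sc.parallelize(1 to 1000).sum()" which should
return 500500.0, and the running application will show up on the 8080
monitoring page. Type ":quit" to leave the shell.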