Introduction:
This write-up contains detailed instructions for building a low-cost, high-performance Big Data cluster using the Raspberry Pi 4.
"Big Data" is the buzzword in the industry nowadays, and inspired by the knowledge-sharing sessions of Prof. Saurabh in our EPDT program, I decided to take up this project.
The aim is to get familiar with Hadoop, Spark, and cluster computing without a big investment of time or money.
We will use Raspberry Pi 4 boards to build a networked cluster of three nodes communicating through a network switch, install HDFS, and run distributed Spark processing jobs via YARN across the whole cluster. This write-up is a full-featured introduction to the hardware and software involved in setting up a Hadoop and Spark cluster, and the approach scales to any number or size of machines.
We will cover:
- Individual Raspberry Pi-4 setup with Ubuntu Server LTS 20.04 Installation
- Physical Cluster Setup
- Cluster Setup – Public Key SSH Authentication, Static IP, Host/Hostnames Configuration
- Hadoop Installation – Single Node and Multi-Node; Hadoop 3.2.2
- Spark Installation – Spark Jobs via YARN and the Spark Shell; Spark 3.0.1
Hardware components:
| Qty | Item | Configuration |
|-----|------|---------------|
| 3 | Raspberry Pi 4 | 8 GB RAM |
| 4 | Cat 6 LAN cable | 1 foot each |
| 3 | Power adaptor or multiport USB adaptor | 4 USB ports – 5.2 V, min. 2.5 A |
| 3 | SD cards | Class 10, high speed |
| 1 | 4-port Ethernet switch | 4-port Gigabit switch hub |
Cluster Configuration:
CPU (each node): Quad-core Cortex-A72 (ARM v8) 64-bit SoC @ 1.5 GHz
RAM: 8 GB × 3 = 24 GB LPDDR4
Storage: 64 GB × 3 = 192 GB
Source:
- A Data Science/Big Data Laboratory by Pier Taranti
- Build Raspberry Pi Hadoop/Spark Cluster from Scratch by Henry Liang
- How to Install and Set Up a 3-Node Hadoop Cluster by Linode (Contributions from Florent Houbart)
- Install, Configure, and Run Spark on Top of a Hadoop YARN Cluster
- Apache Hadoop Documentation
- Apache Spark Documentation
Physical Cluster Setup:
1. Cost-effective Server Rack
To start the setup, mount your Raspberry Pis on a stack of standoff spacers. You can find these spacers on any e-commerce site.
2. Network Connection
I used a wired Ethernet connection from my router connected via a 4-port TP-link switch for my setup. Further, each RPi-4 is connected to the switch via a 1-foot cat-6 LAN cable.
3. Power Supply
There are several options to configure the power supply –
- Use a multi-port, high-power USB power adaptor capable of 5.2 V and 2.5–3 A on each port.
- Use the official Raspberry Pi Foundation power supply to power each node individually. This is my preferred way, as the RPi 4 is power-sensitive when running at high performance.
4. Individual Raspberry Pi Setup
4.1. Ubuntu Server LTS 20.04 Installation
Use the Raspberry Pi Imager to write Ubuntu Server LTS 20.04 64-bit to each Pi.
4.2. Pi Configuration
We will SSH into each RPi and set up some basic configurations. If you find it difficult to get the IP address of each Pi, try one of the following methods.
1. Install the Fing mobile application and connect to your Wi-Fi network; it will show the list of connected devices named Ubuntu along with their IP addresses.
2. Log in to your router and check the connected-devices list for entries named Ubuntu.
3. Download and install an SSH client of your choice; I prefer PuTTY for its simplicity and versatility.
Note: Plug-in one Pi at a time; finish the setup configuration before moving to the next Pi.
Default credentials –
User name: ubuntu
Password: ubuntu
Once you're connected, you'll be prompted to change the default password. Use the same updated password on each RPi; it should be something secure yet easy to recall.
Ensure that the Pi's time is synchronized using the following command:
timedatectl status
This should return output like:
ubuntu@pi01:~$ timedatectl status
               Local time: Fri 2021-08-06 23:26:52 IST
           Universal time: Fri 2021-08-06 17:56:52 UTC
                 RTC time: n/a
                Time zone: Asia/Kolkata (IST, +0530)
System clock synchronized: yes
              NTP service: active
          RTC in local TZ: no
If the system clock is synchronized and the NTP service is active, you’re good to go.
Otherwise, use the following commands to configure it:
timedatectl list-timezones – shows the list of available time zones.
sudo timedatectl set-timezone Asia/Kolkata – if you are in India; otherwise select your time zone accordingly.
To update the latest configuration of Ubuntu, run the following commands to finish the individual configuration:
sudo apt update
sudo apt upgrade
sudo reboot
If you get a cache lock error after the update command, interrupt the process and reboot the Pi, then try again:
Ctrl+C
sudo reboot
Cluster Setup
If you can connect to all your RPis with the new password, we are good to proceed with the cluster setup. Here we will set up static IPs, the hosts/hostname files, and public-key SSH authentication so that each Pi can communicate with the others without a password, using encrypted keys.
1. Static IP Setup
The following steps will need to be done on each Pi.
Ubuntu Server LTS 20.04 uses Netplan for network configuration, which means editing a few YAML files.
SSH into the Pi and find the name of your network interface by running:
ip a
The output information should look like this:
ubuntu@pi01:~$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether dc:a6:32:f8:a5:ce brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.xx/24 brd 192.168.0.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::dea6:32ff:fef8:a5ce/64 scope link
       valid_lft forever preferred_lft forever
3: wlan0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether dc:a6:32:f8:a5:cf brd ff:ff:ff:ff:ff:ff
The network interface name here is eth0. Keep this in mind, or note these details down somewhere for future reference.
Next, we will use nano to edit the configuration files.
First, disable automatic network configuration by creating the following file:
sudo nano /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg
All you need to add to the newly created file is:
network: {config: disabled}
Now we will set up the static IP by editing the 50-cloud-init.yaml file. Use the following command:
sudo nano /etc/netplan/50-cloud-init.yaml
The basic template to set a static IP is:
network:
    ethernets:
        {Network Interface Name}:
            dhcp4: false
            addresses: [{Specific IP Address}/24]
            gateway4: {Gateway Address}
            nameservers:
                addresses: [{Gateway Address}, 8.8.8.8]
    version: 2
My configuration file looked like so: (X being the last digit of the specific IP address for each Pi; 192.168.0.116, 192.168.0.115, 192.168.0.114, etc.)
GNU nano 4.8                /etc/netplan/50-cloud-init.yaml
# This file is generated from information provided by the datasource. Changes
# to it will not persist across an instance reboot. To disable cloud-init's
# network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
network:
    ethernets:
        eth0:
            dhcp4: false
            addresses: [192.168.0.xx/24]
            gateway4: 192.168.0.1
            nameservers:
                addresses: [192.168.0.xx, 8.8.8.8]
    version: 2
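To avoid hand-editing this file on every node, the per-node configuration can be generated with a small loop. This is only a sketch: the addresses 192.168.0.114–116, the gateway 192.168.0.1, and the eth0 interface name are assumptions taken from the example above; substitute your own.

```shell
# Sketch: emit a 50-cloud-init.yaml body for each node.
# IPs, gateway, and interface name are assumptions -- adjust to your network.
gen_netplan() {
    # $1 = static IP, $2 = gateway address
    cat <<EOF
network:
    ethernets:
        eth0:
            dhcp4: false
            addresses: [$1/24]
            gateway4: $2
            nameservers:
                addresses: [$2, 8.8.8.8]
    version: 2
EOF
}

for ip in 192.168.0.114 192.168.0.115 192.168.0.116; do
    echo "# --- netplan for node at $ip ---"
    gen_netplan "$ip" 192.168.0.1
done
```

Each emitted section can then be placed into /etc/netplan/50-cloud-init.yaml on the matching node before running sudo netplan apply.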
After editing the file, apply the settings by using the following commands:
sudo netplan apply
This command will make your SSH session hang. Open a new terminal session and SSH into the Pi using the new IP address from the configuration above.
Then reboot the Pi and confirm the static IP address is set correctly by using :
ip a
2. Hosts/Hostname Configuration
The following steps will need to be done on each Pi.
To get all the nodes in the cluster to talk to each other flawlessly, we need to configure the hosts and hostname files with each RPi's specific information.
First, we’ll SSH into the RPi and update the hostname file by using:
sudo nano /etc/hostname
The hostname file should look like so (X being the last digit of the specific IP address for each Pi):
pi0X
For example, the hostname file for pi01 will look like this:
pi01
Next, we'll update the hosts file using the following command:
sudo nano /etc/hosts
The hosts file should look like so after editing:
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
192.168.0.xxx pi01
192.168.0.xxx pi02
192.168.0.xxx pi03
Make sure to delete the localhost 127.0.0.1 line from the file.
Addressing is simple:
{IP Address} {hostname}
Please note, the hostname file will be different for each RPi, but the hosts file should be identical on all of them.
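As a quick sanity check of the "{IP Address} {hostname}" format, the three entries can be generated from a list of IPs. The addresses below are placeholders; use your own static IPs.

```shell
# Sketch: print /etc/hosts entries in "{IP Address} {hostname}" form.
# The IP addresses here are placeholders -- substitute your static IPs.
print_hosts_entries() {
    i=1
    for ip in "$@"; do
        printf '%s pi0%d\n' "$ip" "$i"
        i=$((i + 1))
    done
}

print_hosts_entries 192.168.0.114 192.168.0.115 192.168.0.116
```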
Once all are configured, reboot the RPis and move on to SSH key authentication.
3. SSH Key Authentication Configuration
Perform these steps on the master Pi only, until directed otherwise.
First, edit the ssh config file on the master RPi using the following command:
nano ~/.ssh/config
Add all nodes to the config file, including the Host, User, and Hostname for each Pi.
The template for adding nodes to the config file is:
Host piXX
    User ubuntu
    Hostname {IP Address}
My config file looked like this after adding all of my nodes to the file:
Host pi01
    User ubuntu
    Hostname 192.168.0.xxx
Host pi02
    User ubuntu
    Hostname 192.168.0.xxx
Host pi03
    User ubuntu
    Hostname 192.168.0.xxx
Now, create an SSH key pair on the Pi using:
ssh-keygen -t rsa -b 4096
Press Enter through all the prompts, so that the key pair is saved in the default .ssh directory and remains passwordless.
The output should look similar to this:
Your identification has been saved in /your_home/.ssh/id_rsa
Your public key has been saved in /your_home/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:/hk7MJ5n5aiqdfTVUZr+2Qt+qCiS7BIm5Iv0dxrc3ks user@host
The key's randomart image is:
+---[RSA 4096]----+
| .|
| + |
| + |
| . o . |
|o S . o |
| + o. .oo. .. .o|
|o = oooooEo+ ...o|
|.. o *o+=.*+o....|
| =+=ooB=o.... |
+----[SHA256]-----+
Repeat the ssh-keygen step on the worker nodes, RPi 2 and RPi 3.
Then use the following command on all Pis (including master RPi 1) to copy the public keys into Pi 1’s authorized key list:
ssh-copy-id pi01
Finally, you can copy Pi 1's configuration files to the rest of the Pis using the following commands.
Note: the commands below need to run only on the master RPi 01, and XX stands for the specific digit identifier of each worker RPi (RPi 02, RPi 03).
scp ~/.ssh/authorized_keys piXX:~/.ssh/authorized_keys
scp ~/.ssh/config piXX:~/.ssh/config
Now you should be able to SSH into any node from any RPi without providing a password.
If not, please refer to the configuration setup again and resolve it before you proceed further.
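A quick way to check all nodes at once is a loop over the hostnames from the config file. The sketch below is a dry run that only echoes the commands; remove the echo to execute them for real (BatchMode=yes makes ssh fail instead of prompting if key authentication is broken). The pi01–pi03 names are the ones configured above.

```shell
# Sketch: dry-run check of passwordless SSH to every node.
# Remove the leading "echo" to actually run the SSH commands.
check_ssh() {
    for node in pi01 pi02 pi03; do
        # BatchMode=yes: fail rather than fall back to a password prompt
        echo ssh -o BatchMode=yes "$node" hostname
    done
}

check_ssh
```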
4. Cluster Setup using bashrc
This step starts with editing the .bashrc file to create some custom functions for ease of use.
On the master Pi, we’ll first edit the ~/.bashrc file:
nano ~/.bashrc
Within this file, add the following code to the bottom of the file:
# Hadoop cluster management functions

# list what other nodes are in the cluster
function cluster-other-nodes {
    grep "pi" /etc/hosts | awk '{print $2}' | grep -v $(hostname)
}

# execute a command on all nodes in the cluster
function cluster-cmd {
    for node in $(cluster-other-nodes); do
        echo $node;
        ssh $node "$@";
    done
    cat /etc/hostname; $@
}

# reboot all nodes in the cluster
function cluster-reboot {
    cluster-cmd sudo reboot now
}

# shutdown all nodes in the cluster
function cluster-shutdown {
    cluster-cmd sudo shutdown now
}

# copy a file to the same path on all other nodes
function cluster-scp {
    for node in $(cluster-other-nodes); do
        echo "${node} copying...";
        cat $1 | ssh $node "sudo tee $1" > /dev/null 2>&1;
    done
    echo 'all files copied successfully'
}

# start yarn and dfs on cluster
function cluster-start-hadoop {
    start-dfs.sh;
    start-yarn.sh
}

# stop yarn and dfs on cluster
function cluster-stop-hadoop {
    stop-dfs.sh;
    stop-yarn.sh
}
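To see what the grep/awk pipeline inside cluster-other-nodes actually extracts, here is the same pipeline run against a sample hosts file, with the local hostname simulated as pi01 (the sample contents mirror the hosts file built earlier; on a real node the function reads /etc/hosts and uses $(hostname) instead).

```shell
# Sketch: the cluster-other-nodes pipeline on a sample hosts file.
# "pi01" stands in for $(hostname) on the master node.
sample_hosts='192.168.0.114 pi01
192.168.0.115 pi02
192.168.0.116 pi03'

other_nodes() {
    # keep lines mentioning "pi", take the hostname column, drop ourselves
    echo "$sample_hosts" | grep "pi" | awk '{print $2}' | grep -v pi01
}

other_nodes
# prints pi02 and pi03, one per line
```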
Source the .bashrc file on the master Pi:
source ~/.bashrc
Now use the following command to copy the .bashrc to all the worker nodes:
cluster-scp ~/.bashrc
Lastly, run the following command to source the .bashrc file on all nodes:
cluster-cmd source ~/.bashrc
You now have a functioning cluster computer.
To start running parallel processing tasks, we’ll install Hadoop first and then Spark.
Hadoop 3.2.2 Installation
1. Install Java 8
To install Java 8 on each node, use the following command:
cluster-cmd sudo apt install openjdk-8-jdk
Then, because we haven’t done it in a while, use the following command to reboot the Pis:
cluster-reboot
After everything is rebooted, SSH into the master Pi and run the following command to verify Java was installed correctly:
cluster-cmd java -version
2. Hadoop Single Node Installation
Perform these steps only on the master Pi (RPi 01) until directed otherwise.
First, use wget to download the Hadoop 3.2.2 binary onto the master RPi:
wget https://downloads.apache.org/hadoop/common/hadoop-3.2.2/hadoop-3.2.2.tar.gz
Next, extract the tar file and move the binary to the /opt directory by using the following command:
sudo tar -xvf hadoop-3.2.2.tar.gz -C /opt/ && cd /opt
Change the name of the directory from hadoop-3.2.2 to hadoop:
sudo mv hadoop-3.2.2 hadoop
Change the permissions on the directory.
sudo chown ubuntu:ubuntu -R /opt/hadoop
Set up the .profile, .bashrc, and hadoop-env.sh environment variables. First, edit .profile to add the Hadoop binaries to PATH:
nano ~/.profile
Add the following line:
PATH=/opt/hadoop/bin:/opt/hadoop/sbin:$PATH
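A small helper can confirm whether a directory already sits on a PATH-style string before (or after) editing the file. This is a generic sketch; /opt/hadoop/bin is the assumed install location from the steps above, so adjust it if your layout differs.

```shell
# Sketch: check whether a directory appears in a PATH-like string.
# /opt/hadoop/bin is an assumption based on the /opt install above.
path_contains() {
    # $1 = PATH-like string, $2 = directory to look for
    case ":$1:" in
        *":$2:"*) echo "yes" ;;
        *)        echo "no"  ;;
    esac
}

path_contains "/opt/hadoop/bin:/opt/hadoop/sbin:/usr/bin" /opt/hadoop/bin
# prints: yes
```

On a live shell you would call it as path_contains "$PATH" /opt/hadoop/bin after sourcing the profile.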
Edit the .bashrc file by
nano ~/.bashrc
append the following environmental variables at the bottom of the file.
# path and options for java
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-arm64

# path and options for Hadoop
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
Then source the .bashrc file to ensure it updates.
source ~/.bashrc
Next, set the value of JAVA_HOME in /opt/hadoop/etc/hadoop/hadoop-env.sh. You’ll have to scroll down to find the correct line.
The line will be commented out. Uncomment the line and add the correct path to the variable.
It should look like this:
...
# The java implementation to use. By default, this environment
# variable is REQUIRED on ALL platforms except OS X!
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-arm64
...
Setup the core-site.xml and hdfs-site.xml
Use the following command to edit the core-site.xml file.
nano /opt/hadoop/etc/hadoop/core-site.xml
It should look like so after editing:
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://pi01:9000</value>
    </property>
</configuration>
Then edit the hdfs-site.xml file:
nano /opt/hadoop/etc/hadoop/hdfs-site.xml
After editing:
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
Test MapReduce
Format the NameNode (caution: all data will be deleted!):
hdfs namenode -format
Start the NameNode and DataNode:
start-dfs.sh
Make the required directories to run MapReduce jobs:
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/ubuntu
Copy input files into the distributed filesystem:
hdfs dfs -mkdir input
hdfs dfs -put /opt/hadoop/etc/hadoop/*.xml input
Run the test example:
hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar grep input output 'dfs[a-z.]+'
View the output on the distributed file system:
hdfs dfs -cat output/*
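It helps to know what this example actually computes: the job extracts every string matching the regex dfs[a-z.]+ from the input files and counts the matches. The same counting can be sketched locally with plain shell tools; this is a stand-in for the concept, not how Hadoop executes the job, and the sample XML snippet below is an assumption rather than real job input.

```shell
# Sketch: a local stand-in for what the grep MapReduce example computes.
# sample_xml is an assumed fragment resembling the Hadoop config files.
sample_xml='<name>dfs.replication</name>
<value>1</value>
<name>dfs.namenode.name.dir</name>'

count_dfs_matches() {
    # extract every match of dfs[a-z.]+ and count each distinct string
    echo "$sample_xml" | grep -oE 'dfs[a-z.]+' | sort | uniq -c
}

count_dfs_matches
```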
When done, stop the NameNode and DataNode:
stop-dfs.sh
Test YARN
Configure the following parameters in the configuration files:
/opt/hadoop/etc/hadoop/mapred-site.xml:
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.application.classpath</name>
        <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
    </property>
</configuration>
/opt/hadoop/etc/hadoop/yarn-site.xml:
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>
Start ResourceManager and NodeManager:
start-yarn.sh
Start NameNode and DataNode:
start-dfs.sh
Test if all daemons are running:
jps
You should see output similar to this after running the jps command (process IDs will differ):
5616 SecondaryNameNode
5760 Jps
5233 NameNode
4674 NodeManager
5387 DataNode
4524 ResourceManager
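Rather than eyeballing the list, the jps output can be scanned for each required daemon. The sketch below runs against a sample of the output shown above; on a live node you would replace the sample with the real output of jps.

```shell
# Sketch: verify all expected Hadoop daemons appear in jps output.
# sample_jps mimics the output above; on a real node use: sample_jps=$(jps)
sample_jps='5616 SecondaryNameNode
5233 NameNode
4674 NodeManager
5387 DataNode
4524 ResourceManager'

check_daemons() {
    for d in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
        # anchor on " name<end-of-line>" so NameNode does not match SecondaryNameNode
        if echo "$sample_jps" | grep -q " ${d}\$"; then
            echo "$d: running"
        else
            echo "$d: MISSING"
        fi
    done
}

check_daemons
```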