Introduction:
This write-up contains detailed instructions for building a low-cost, high-performance Big Data cluster using the Raspberry Pi 4.
"Big Data" is the buzzword in the industry nowadays, and inspired by the knowledge-sharing sessions of Prof. Saurabh in our EPDT program, I decided to take up this project.
The aim is to get familiar with Hadoop, Spark, and cluster computing without a big investment of time or money.
We will use Raspberry Pi 4 boards to build a networked cluster of three nodes communicating through a network switch, install HDFS, and run distributed Spark processing jobs via YARN across the whole cluster. This write-up is a full-featured introduction to the hardware and software involved in setting up a Hadoop and Spark cluster, and the approach scales to any number or size of machines.
We will cover:
- Individual Raspberry Pi-4 setup with Ubuntu Server LTS 20.04 Installation
- Physical Cluster Setup
- Cluster Setup – Public Key SSH Authentication, Static IP, Host/Hostnames Configuration
- Hadoop Installation – Single Node and Multi-Node; Hadoop 3.2.2
- Spark Installation – Spark Jobs via YARN and the Spark Shell; Spark 3.0.1
Hardware components:
| Qty | Item | Configuration |
|-----|------|---------------|
| 3 | Raspberry Pi 4 | 8 GB RAM |
| 4 | Cat 6 LAN cable | 1 foot each |
| 3 | Power adaptor or multiport USB adaptor | 4 USB ports – 5.2 V, min. 2.5 A |
| 3 | SD cards | Class 10, high speed |
| 1 | 4-port Ethernet switch | 4-port Gigabit switch hub |
Cluster Configuration:
CPU (each node): Quad-core Cortex-A72 (ARM v8) 64-bit SoC @ 1.5 GHz
RAM: 8 GB × 3 = 24 GB LPDDR4
Storage: 64 GB × 3 = 192 GB
Source:
- A Data Science/Big Data Laboratory by Pier Taranti
- Build Raspberry Pi Hadoop/Spark Cluster from Scratch by Henry Liang
- How to Install and Set Up a 3-Node Hadoop Cluster by Linode (Contributions from Florent Houbart)
- Install, Configure, and Run Spark on Top of a Hadoop YARN Cluster
- Apache Hadoop Documentation
- Apache Spark Documentation
Physical Cluster Setup:
1. Cost-effective Server Rack
To start the setup, mount your Raspberry Pis on a stack of standoff spacers. You can find these spacers on any e-commerce site.
2. Network Connection
I used a wired Ethernet connection from my router connected via a 4-port TP-link switch for my setup. Further, each RPi-4 is connected to the switch via a 1-foot cat-6 LAN cable.
3. Power Supply
There are several options to configure the power supply –
- Use a multi-port, high-power USB power adaptor capable of 5.2 V and 2.5–3 A on each port.
- Use the official Raspberry Pi Foundation power supply to power each node individually. This is my preferred way, as the RPi 4 is power-sensitive when running at high performance.
4. Individual Raspberry Pi Setup
4.1. Ubuntu Server LTS 20.04 Installation
Use the Raspberry Pi Imager to write Ubuntu Server LTS 20.04 64-bit to each Pi.
4.2. Pi Configuration
We will SSH into each RPi and set up some basic configurations. If you find it difficult to get the IP address of each Pi, try one of the following methods.
1. Install the Fing mobile application and connect to your Wi-Fi network; it will show the list of connected devices named Ubuntu along with their IP addresses.
2. Log in to your router and check the connected-devices list for entries named Ubuntu.
3. Download and install an SSH client of your choice; I prefer PuTTY for its simplicity and versatility.
Note: Plug-in one Pi at a time; finish the setup configuration before moving to the next Pi.
Default credentials –
User name: ubuntu
Password: ubuntu
Once you're connected, you'll be prompted to change the default password. Use the same updated password on each RPi; it should be something secure yet easy to recall.
Ensure that the Pi's time is synchronized using the following command:
timedatectl status
This should return output like:
ubuntu@pi01:~$ timedatectl status
               Local time: Fri 2021-08-06 23:26:52 IST
           Universal time: Fri 2021-08-06 17:56:52 UTC
                 RTC time: n/a
                Time zone: Asia/Kolkata (IST, +0530)
System clock synchronized: yes
              NTP service: active
          RTC in local TZ: no
If the system clock is synchronized and the NTP service is active, you’re good to go.
Otherwise, use the following commands to configure it:
timedatectl list-timezones – shows the list of available time zones.
sudo timedatectl set-timezone Asia/Kolkata – if you are in India; otherwise select your time zone accordingly.
To update the latest configuration of Ubuntu, run the following commands to finish the individual configuration:
sudo apt update
sudo apt upgrade
sudo reboot
If you get a cache lock error after the update command, interrupt the process and reboot the Pi, then try again:
Ctrl+C
sudo reboot
Cluster Setup
If you can connect to all your RPis with the new password, we are good to proceed with the cluster setup. Here we will set up static IPs, the hosts/hostname files, and public-key SSH authentication so that each Pi can communicate with the others without a password, using encrypted keys.
1. Static IP Setup
The following steps will need to be done on each Pi.
Ubuntu Server LTS 20.04 uses Netplan for network configuration, which means editing a few YAML files.
SSH into the Pi and find the name of your network interface by running:
ip a
The output information should look like this:
ubuntu@pi01:~$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether dc:a6:32:f8:a5:ce brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.xx/24 brd 192.168.0.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::dea6:32ff:fef8:a5ce/64 scope link
       valid_lft forever preferred_lft forever
3: wlan0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether dc:a6:32:f8:a5:cf brd ff:ff:ff:ff:ff:ff
The network interface name here is eth0. Keep this in mind, or note these details down somewhere for future reference.
Next, we will use nano to edit the configuration files.
First, disable automatic network configuration by creating the following file:
sudo nano /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg
All you need to add to the newly created file is:
network: {config: disabled}
Now we will set up the static IP by editing the 50-cloud-init.yaml file. Use the following command:
sudo nano /etc/netplan/50-cloud-init.yaml
The basic template to set a static IP is:
network:
    ethernets:
        {Network Interface Name}:
            dhcp4: false
            addresses: [{Specific IP Address}/24]
            gateway4: {Gateway Address}
            nameservers:
                addresses: [{Gateway Address}, 8.8.8.8]
    version: 2
My configuration file looked like so: (X being the last digit of the specific IP address for each Pi; 192.168.0.116, 192.168.0.115, 192.168.0.114, etc.)
GNU nano 4.8                /etc/netplan/50-cloud-init.yaml
# This file is generated from information provided by the datasource. Changes
# to it will not persist across an instance reboot. To disable cloud-init's
# network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
network:
    ethernets:
        eth0:
            dhcp4: false
            addresses: [192.168.0.xx/24]
            gateway4: 192.168.0.1
            nameservers:
                addresses: [192.168.0.xx, 8.8.8.8]
    version: 2
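To avoid hand-editing this file on every node, the per-node configuration can be generated with a small loop. This is only a sketch: the addresses 192.168.0.114–116, the gateway 192.168.0.1, and the eth0 interface name are assumptions taken from the example above; substitute your own.

```shell
# Sketch: emit a 50-cloud-init.yaml body for each node.
# IPs, gateway, and interface name are assumptions -- adjust to your network.
gen_netplan() {
    # $1 = static IP, $2 = gateway address
    cat <<EOF
network:
    ethernets:
        eth0:
            dhcp4: false
            addresses: [$1/24]
            gateway4: $2
            nameservers:
                addresses: [$2, 8.8.8.8]
    version: 2
EOF
}

for ip in 192.168.0.114 192.168.0.115 192.168.0.116; do
    echo "# --- netplan for node at $ip ---"
    gen_netplan "$ip" 192.168.0.1
done
```

Each emitted section can then be placed into /etc/netplan/50-cloud-init.yaml on the matching node before running sudo netplan apply.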
After editing the file, apply the settings by using the following commands:
sudo netplan apply
This command will make your SSH session hang. Open a new terminal session and SSH into the Pi using the new IP address from the configuration above.
Then reboot the Pi and confirm the static IP address is set correctly by using :
ip a
2. Hosts/Hostname Configuration
The following steps will need to be done on each Pi.
To get all the nodes in the cluster to talk to each other flawlessly, we need to configure the hosts and hostname files with each RPi's specific information.
First, we’ll SSH into the RPi and update the hostname file by using:
sudo nano /etc/hostname
The hostname file should look like so (X being the last digit of the specific IP address for each Pi):
pi0X
For example, the hostname file for pi01 will look like this:
pi01
Next, we'll update the hosts file using the following command:
sudo nano /etc/hosts
The hosts file should look like so after editing:
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
192.168.0.xxx pi01
192.168.0.xxx pi02
192.168.0.xxx pi03
Make sure to delete the localhost 127.0.0.1 line from the file.
Addressing is simple:
{IP Address} {hostname}
Please note, the hostname file will be different for each RPi, but the hosts file should be identical on all of them.
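As a quick sanity check of the "{IP Address} {hostname}" format, the three entries can be generated from a list of IPs. The addresses below are placeholders; use your own static IPs.

```shell
# Sketch: print /etc/hosts entries in "{IP Address} {hostname}" form.
# The IP addresses here are placeholders -- substitute your static IPs.
print_hosts_entries() {
    i=1
    for ip in "$@"; do
        printf '%s pi0%d\n' "$ip" "$i"
        i=$((i + 1))
    done
}

print_hosts_entries 192.168.0.114 192.168.0.115 192.168.0.116
```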
Once all are configured, reboot the RPis and move on to SSH key authentication.
3. SSH Key Authentication Configuration
Perform these steps on the master Pi only, until directed otherwise.
First, edit the ssh config file on the master RPi using the following command:
nano ~/.ssh/config
Add all nodes to the config file, including the Host, User, and Hostname for each Pi.
The template for adding nodes to the config file is:
Host piXX
    User ubuntu
    Hostname {IP Address}
My config file looked like this after adding all of my nodes to the file:
Host pi01
    User ubuntu
    Hostname 192.168.0.xxx
Host pi02
    User ubuntu
    Hostname 192.168.0.xxx
Host pi03
    User ubuntu
    Hostname 192.168.0.xxx
Now, create an SSH key pair on the Pi using:
ssh-keygen -t rsa -b 4096
Press Enter through all the prompts, so that the key pair is saved in the default .ssh directory and remains passwordless.
The output should look similar to this:
Your identification has been saved in /your_home/.ssh/id_rsa
Your public key has been saved in /your_home/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:/hk7MJ5n5aiqdfTVUZr+2Qt+qCiS7BIm5Iv0dxrc3ks user@host
The key's randomart image is:
+---[RSA 4096]----+
| .|
| + |
| + |
| . o . |
|o S . o |
| + o. .oo. .. .o|
|o = oooooEo+ ...o|
|.. o *o+=.*+o....|
| =+=ooB=o.... |
+----[SHA256]-----+
Repeat the ssh-keygen step on the worker nodes, RPi 2 and RPi 3.
Then use the following command on all Pis (including master RPi 1) to copy the public keys into Pi 1’s authorized key list:
ssh-copy-id pi01
Finally, you can copy Pi 1's configuration files to the rest of the Pis using the following commands.
Note: the commands below need to run only on the master RPi 01, and XX stands for the specific digit identifier of each worker RPi (RPi 02, RPi 03).
scp ~/.ssh/authorized_keys piXX:~/.ssh/authorized_keys
scp ~/.ssh/config piXX:~/.ssh/config
Now you should be able to SSH into any node from any RPi without providing a password.
If not, please refer to the configuration setup again and resolve it before you proceed further.
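A quick way to check all nodes at once is a loop over the hostnames from the config file. The sketch below is a dry run that only echoes the commands; remove the echo to execute them for real (BatchMode=yes makes ssh fail instead of prompting if key authentication is broken). The pi01–pi03 names are the ones configured above.

```shell
# Sketch: dry-run check of passwordless SSH to every node.
# Remove the leading "echo" to actually run the SSH commands.
check_ssh() {
    for node in pi01 pi02 pi03; do
        # BatchMode=yes: fail rather than fall back to a password prompt
        echo ssh -o BatchMode=yes "$node" hostname
    done
}

check_ssh
```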
4. Cluster Setup using bashrc
This step starts with editing the .bashrc file to create some custom functions for ease of use.
On the master Pi, we’ll first edit the ~/.bashrc file:
nano ~/.bashrc
Within this file, add the following code to the bottom of the file:
# Hadoop cluster management functions

# list what other nodes are in the cluster
function cluster-other-nodes {
    grep "pi" /etc/hosts | awk '{print $2}' | grep -v $(hostname)
}

# execute a command on all nodes in the cluster
function cluster-cmd {
    for node in $(cluster-other-nodes); do
        echo $node;
        ssh $node "$@";
    done
    cat /etc/hostname; $@
}

# reboot all nodes in the cluster
function cluster-reboot {
    cluster-cmd sudo reboot now
}

# shutdown all nodes in the cluster
function cluster-shutdown {
    cluster-cmd sudo shutdown now
}

# copy a file to the same path on all other nodes
function cluster-scp {
    for node in $(cluster-other-nodes); do
        echo "${node} copying...";
        cat $1 | ssh $node "sudo tee $1" > /dev/null 2>&1;
    done
    echo 'all files copied successfully'
}

# start yarn and dfs on cluster
function cluster-start-hadoop {
    start-dfs.sh;
    start-yarn.sh
}

# stop yarn and dfs on cluster
function cluster-stop-hadoop {
    stop-dfs.sh;
    stop-yarn.sh
}
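To see what the grep/awk pipeline inside cluster-other-nodes actually extracts, here is the same pipeline run against a sample hosts file, with the local hostname simulated as pi01 (the sample contents mirror the hosts file built earlier; on a real node the function reads /etc/hosts and uses $(hostname) instead).

```shell
# Sketch: the cluster-other-nodes pipeline on a sample hosts file.
# "pi01" stands in for $(hostname) on the master node.
sample_hosts='192.168.0.114 pi01
192.168.0.115 pi02
192.168.0.116 pi03'

other_nodes() {
    # keep lines mentioning "pi", take the hostname column, drop ourselves
    echo "$sample_hosts" | grep "pi" | awk '{print $2}' | grep -v pi01
}

other_nodes
# prints pi02 and pi03, one per line
```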
Source the .bashrc file on the master Pi:
source ~/.bashrc
Now use the following command to copy the .bashrc to all the worker nodes:
cluster-scp ~/.bashrc
Lastly, run the following command to source the .bashrc file on all nodes:
cluster-cmd source ~/.bashrc
You now have a functioning cluster computer.
To start running parallel processing tasks, we’ll install Hadoop first and then Spark.
Hadoop 3.2.2 Installation
1. Install Java 8
To install Java 8 on each node, use the following command:
cluster-cmd sudo apt install openjdk-8-jdk
Then, because we haven’t done it in a while, use the following command to reboot the Pis:
cluster-reboot
After everything is rebooted, SSH into the master Pi and run the following command to verify Java was installed correctly:
cluster-cmd java -version
2. Hadoop Single Node Installation
Perform these steps only on the master Pi (RPi 01) until directed otherwise.
First, use wget to download the Hadoop 3.2.2 binary onto the master RPi:
wget https://downloads.apache.org/hadoop/common/hadoop-3.2.2/hadoop-3.2.2.tar.gz
Next, extract the tar file and move the binary to the /opt directory by using the following command:
sudo tar -xvf hadoop-3.2.2.tar.gz -C /opt/ && cd /opt
Change the name of the directory from hadoop-3.2.2 to hadoop:
sudo mv hadoop-3.2.2 hadoop
Change the permissions on the directory.
sudo chown ubuntu:ubuntu -R /opt/hadoop
Set up the .profile, .bashrc, and hadoop-env.sh environment variables. First, edit .profile to add the Hadoop binaries to PATH:
nano ~/.profile
Add the following line:
PATH=/opt/hadoop/bin:/opt/hadoop/sbin:$PATH
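A small helper can confirm whether a directory already sits on a PATH-style string before (or after) editing the file. This is a generic sketch; /opt/hadoop/bin is the assumed install location from the steps above, so adjust it if your layout differs.

```shell
# Sketch: check whether a directory appears in a PATH-like string.
# /opt/hadoop/bin is an assumption based on the /opt install above.
path_contains() {
    # $1 = PATH-like string, $2 = directory to look for
    case ":$1:" in
        *":$2:"*) echo "yes" ;;
        *)        echo "no"  ;;
    esac
}

path_contains "/opt/hadoop/bin:/opt/hadoop/sbin:/usr/bin" /opt/hadoop/bin
# prints: yes
```

On a live shell you would call it as path_contains "$PATH" /opt/hadoop/bin after sourcing the profile.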
Edit the .bashrc file by
nano ~/.bashrc
append the following environmental variables at the bottom of the file.
# path and options for java
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-arm64

# path and options for Hadoop
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
Then source the .bashrc file to ensure it updates.
source ~/.bashrc
Next, set the value of JAVA_HOME in /opt/hadoop/etc/hadoop/hadoop-env.sh. You’ll have to scroll down to find the correct line.
The line will be commented out. Uncomment the line and add the correct path to the variable.
It should look like this:
...
# The java implementation to use. By default, this environment
# variable is REQUIRED on ALL platforms except OS X!
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-arm64
...
Setup the core-site.xml and hdfs-site.xml
Use the following command to edit the core-site.xml file.
nano /opt/hadoop/etc/hadoop/core-site.xml
It should look like so after editing:
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://pi01:9000</value>
    </property>
</configuration>
Then edit the hdfs-site.xml file:
nano /opt/hadoop/etc/hadoop/hdfs-site.xml
After editing:
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
Test MapReduce
Format the NameNode (caution: all data will be deleted!):
hdfs namenode -format
Start the NameNode and DataNode:
start-dfs.sh
Make the required directories to run MapReduce jobs:
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/ubuntu
Copy input files into the distributed filesystem:
hdfs dfs -mkdir input
hdfs dfs -put /opt/hadoop/etc/hadoop/*.xml input
Run the test example:
hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar grep input output 'dfs[a-z.]+'
View the output on the distributed file system:
hdfs dfs -cat output/*
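It helps to know what this example actually computes: the job extracts every string matching the regex dfs[a-z.]+ from the input files and counts the matches. The same counting can be sketched locally with plain shell tools; this is a stand-in for the concept, not how Hadoop executes the job, and the sample XML snippet below is an assumption rather than real job input.

```shell
# Sketch: a local stand-in for what the grep MapReduce example computes.
# sample_xml is an assumed fragment resembling the Hadoop config files.
sample_xml='<name>dfs.replication</name>
<value>1</value>
<name>dfs.namenode.name.dir</name>'

count_dfs_matches() {
    # extract every match of dfs[a-z.]+ and count each distinct string
    echo "$sample_xml" | grep -oE 'dfs[a-z.]+' | sort | uniq -c
}

count_dfs_matches
```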
When done, stop the NameNode and DataNode:
stop-dfs.sh
Test YARN
Configure the following parameters in the configuration files:
/opt/hadoop/etc/hadoop/mapred-site.xml:
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.application.classpath</name>
        <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
    </property>
</configuration>
/opt/hadoop/etc/hadoop/yarn-site.xml:
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>
Start ResourceManager and NodeManager:
start-yarn.sh
Start NameNode and DataNode:
start-dfs.sh
Test if all daemons are running:
jps
You should see output similar to this after running the jps command (process IDs will differ):
5616 SecondaryNameNode
5760 Jps
5233 NameNode
4674 NodeManager
5387 DataNode
4524 ResourceManager
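Rather than eyeballing the list, the jps output can be scanned for each required daemon. The sketch below runs against a sample of the output shown above; on a live node you would replace the sample with the real output of jps.

```shell
# Sketch: verify all expected Hadoop daemons appear in jps output.
# sample_jps mimics the output above; on a real node use: sample_jps=$(jps)
sample_jps='5616 SecondaryNameNode
5233 NameNode
4674 NodeManager
5387 DataNode
4524 ResourceManager'

check_daemons() {
    for d in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
        # anchor on " name<end-of-line>" so NameNode does not match SecondaryNameNode
        if echo "$sample_jps" | grep -q " ${d}\$"; then
            echo "$d: running"
        else
            echo "$d: MISSING"
        fi
    done
}

check_daemons
```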