Instructions for Building Your Orange Pi Zero Compute Cluster
July 25, 2017

Directions by Sam Holt, Drew Meaux, Jacob Roth, Yin Song, & Dave Toth

Phase 1: Download and install all necessary software.
 
For both nodes, do these steps:
  1. Download the operating system for the boards. We used Armbian_5.30_Orangepizero_Debian_jessie_default_3.4.113 from https://dl.armbian.com/orangepizero/. Extract the disk image from the file and flash the image to a microSD card for each node. We used Etcher on our Linux system, but you can use dd on Linux, Win32DiskImager on Windows, or other options.
  2. Hook the nodes up to your network and boot them. They'll get IP addresses from DHCP, so you'll need a way to discover which addresses they received: wireshark works, checking your wireless router's list of DHCP clients is easier at home, the Fing app for iPads/iPhones can find them quickly, or you can ping-scan the subnet (see the sketch after this list). Ssh to the nodes with the login root and the password 1234. Change the password to orangepi, then create a user orangepi with password orangepi. That's not the most secure setup, but our goal here is just to give you reproducible instructions; you can change the passwords to something else you like. Now reboot the boards with reboot.
  3. ssh to both boards as orangepi
  4. sudo apt-get update
  5. sudo chmod 755 /var/log/
  6. sudo apt-get install g++ gfortran nfs-common nfs-kernel-server rpcbind slurm-llnl
  7. wget http://www.mpich.org/static/downloads/3.2/mpich-3.2.tar.gz
  8. wget https://computing.llnl.gov/tutorials/mpi/samples/C/mpi_hello.c
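
If you are flashing and hunting for the boards from a Linux command line, here is a minimal sketch of steps 1 and 2. The device name /dev/sdX, the exact .img filename, and the 192.168.1.0/24 subnet are assumptions; substitute your own values, and double-check the device with lsblk before running dd.

    # Flash the extracted Armbian image to the microSD card (destroys whatever is on /dev/sdX).
    sudo dd if=Armbian_5.30_Orangepizero_Debian_jessie_default_3.4.113.img of=/dev/sdX bs=4M conv=fsync
    # After the boards boot, ping-scan the LAN to spot the two new DHCP addresses.
    sudo nmap -sn 192.168.1.0/24
    # Then log in to each board (address placeholder below).
    ssh root@<address found above>
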
Phase 2: Set static IP addresses and hostnames.
 
For both nodes, do these steps:
  1. (as superuser) edit the /etc/hosts file so it has only these 3 lines in it:
     
    127.0.0.1      localhost
    192.168.1.100      top-master
    192.168.1.101      bottom-slave

  2. (as superuser) edit the /etc/network/interfaces file, adding these 4 lines at the end for the top node:
     
    auto eth0
    iface eth0 inet static
    address 192.168.1.100
    netmask 255.255.255.0

     
    and these 4 lines at the end for the bottom node:
     
    auto eth0
    iface eth0 inet static
    address 192.168.1.101
    netmask 255.255.255.0

     
    Ignore the 4 lines of comments in the file:
     
    # This file intentionally left blank
    #
    # All interfaces are handled by network-manager, use nmtui or nmcli on
    # server/headless images or the "Network Manager" GUI on desktop images

  3. (as superuser) edit the /etc/hostname file so the top node has top-master in it and the bottom node has bottom-slave in it.

  4. Power the nodes off with sudo poweroff.
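
Once the nodes come back up on their own switch at the start of Phase 3, a quick sanity check of the static addressing and hostnames (standard commands, no assumptions beyond the addresses above):

    hostname                  # should print top-master or bottom-slave
    ip addr show eth0         # should show 192.168.1.100 or 192.168.1.101
    ping -c 3 top-master      # both names should resolve via /etc/hosts and reply
    ping -c 3 bottom-slave
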
Phase 3: Configure NFS
  1. Hook the nodes up to their own switch (not connected to the Internet) where the only other computer on the switch has a static IP address (but not 192.168.1.100 or 192.168.1.101). Boot the nodes and that other computer. Ssh to the nodes from that other computer, logging in as user orangepi.
  2. For both nodes, run these commands:
     
    sudo mkdir /sharedFiles
    sudo chown orangepi:orangepi /sharedFiles
     
  3. On the top-master node, (as superuser) edit the /etc/exports file, adding the line:
     
    /sharedFiles *(rw,sync,no_root_squash,no_subtree_check)
     
    Then restart the nfs server: sudo service nfs-kernel-server restart
     
  4. On the bottom-slave node, (as superuser) edit the /etc/fstab file, adding the line:
     
    top-master:/sharedFiles      /sharedFiles      nfs nfsvers=3,_netdev 0 0
     
  5. Reboot top-master, wait for the top node to boot up, and then reboot bottom-slave (if the master node doesn't come up before the slave one, you'll likely have NFS issues).
  6. Make sure NFS is working by having each node create a file in /sharedFiles and then ensuring both nodes see those files when they each ls /sharedFiles.
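
One way to do the check in step 6 (the file names here are arbitrary):

    # On top-master:
    touch /sharedFiles/created-on-master
    # On bottom-slave:
    touch /sharedFiles/created-on-slave
    # On both nodes; both files should appear in both listings:
    ls -l /sharedFiles
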
Phase 4: Configure MPI
 
On both nodes, run these commands:
  1. tar xzf mpich-3.2.tar.gz
  2. cd mpich-3.2
  3. ./configure --prefix=/home/orangepi/mpich-install 2>&1 | tee c.txt
  4. make 2>&1 | tee m.txt
  5. make install 2>&1 | tee mi.txt
  6. PATH=/home/orangepi/mpich-install/bin:$PATH ; export PATH
  7. cd /home/orangepi/mpich-install/bin
  8. sudo cp * /usr/bin/
  9. cd /usr/bin
  10. sudo rm mpic++ mpiexec mpif77 mpif90 mpirun
  11. sudo ln -s /home/orangepi/mpich-install/bin/mpicxx ./mpic++
  12. sudo ln -s /home/orangepi/mpich-install/bin/mpiexec.hydra ./mpiexec
  13. sudo ln -s /home/orangepi/mpich-install/bin/mpifort ./mpif77
  14. sudo ln -s /home/orangepi/mpich-install/bin/mpifort ./mpif90
  15. sudo ln -s /home/orangepi/mpich-install/bin/mpiexec.hydra ./mpirun
  16. cd
  17. ssh-keygen -t rsa (accept the default file location and use an empty passphrase)
  18. cd .ssh
  19. cat id_rsa.pub >> authorized_keys
  20. On top-master, run:
    ssh-copy-id bottom-slave
  21. On bottom-slave run:
    ssh-copy-id top-master
    On both nodes run
    cd
  22. On top-master:
    cp mpi_hello.c /sharedFiles
    cd /sharedFiles and create a file called machines that contains the following lines:
     
    top-master:4
    bottom-slave:4
     
    Compile the mpi_hello.c file with mpicc mpi_hello.c -o hellompi
    Run the program with mpirun -n 8 -f ./machines ./hellompi
    You should get 8 lines of output with 4 referring to top-master and 4 referring to bottom-slave.
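
Two optional follow-ups to this phase, sketched assuming the paths used above. Step 6 only sets PATH for the current shell, so you may want to make it permanent, and it is worth confirming passwordless ssh in both directions before suspecting MPI if hellompi hangs.

    # Make the MPICH PATH permanent for the orangepi user (run on both nodes).
    echo 'export PATH=/home/orangepi/mpich-install/bin:$PATH' >> /home/orangepi/.bashrc
    # Confirm passwordless ssh both ways (no password prompt should appear).
    ssh bottom-slave hostname      # run this one from top-master
    ssh top-master hostname        # run this one from bottom-slave
    # A lighter test than hellompi: launch hostname across both nodes.
    mpiexec -f /sharedFiles/machines -n 8 hostname
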
Phase 5: Configure SLURM
 
  1. You'll need a slurm.conf file. There are two ways to get one.
    1. Download one of ours and use it.
       
      Here's one for the Orange Pi Zero with 256 MB of RAM. You need to extract the slurm.conf file from the zip file here.
      Here's one for the Orange Pi Zero with 512 MB of RAM. You need to extract the slurm.conf file from the zip file here.
       
      Once you have downloaded the file, move it to /etc/slurm-llnl/slurm.conf.

    2. Make your own. The Orange Pi Zero boards have no web browser, so do the following steps on a system that has both a web browser and slurm installed, and then copy the resulting file to /etc/slurm-llnl/slurm.conf on top-master.
       
      Open this form with a web browser: /usr/share/doc/slurmctld/slurm-wlm-configurator.easy.html
       
      Run lscpu on one of the nodes to see its CPU information, then fill in these options (a sketch of the key resulting slurm.conf lines appears after these steps):
       
      Control Machines
      ControlMachine: top-master
      ControlAddr: 192.168.1.100
       
      Compute Machines
      Note: since our cluster only has 2 nodes, we're making both nodes workers. On a normal cluster, there is a dedicated node that you log into to submit jobs, and it doesn't function as a worker.
      NodeName: bottom-slave,top-master
      NodeAddr: 192.168.1.101,192.168.1.100
      CPUs: 4
      Sockets: 1
      CoresPerSocket: 4
      ThreadsPerCore: 1
      RealMemory: 494 (You can find this value by running free -m and recording the total value. Our Orange Pi Zero with 512 MB of RAM reported 494; our Orange Pi Zero with 256 MB of RAM reported 241.)
       
      Resource Selection
      SelectType: Cons_res
      SelectTypeParameters: CR_CPU
       
      Submit the form.
      Copy the output text and as superuser, paste it into a file named /etc/slurm-llnl/slurm.conf.
      Then find this line: SelectType=select/cons_res
      and beneath it, add this line: SelectTypeParameters=CR_CPU
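
      Whichever way you obtain it, the lines in the finished slurm.conf that matter most for this cluster should look roughly like these (values taken from the 512 MB entries above; this is a sketch, not a complete file):

      ControlMachine=top-master
      ControlAddr=192.168.1.100
      SelectType=select/cons_res
      SelectTypeParameters=CR_CPU
      NodeName=top-master,bottom-slave NodeAddr=192.168.1.100,192.168.1.101 CPUs=4 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=494 State=UNKNOWN
      PartitionName=debug Nodes=top-master,bottom-slave Default=YES MaxTime=INFINITE State=UP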

  2. On top-master
     
    Create a munge key
    sudo /usr/sbin/create-munge-key
    Select yes to overwrite the current munge key
    sudo chown orangepi:orangepi /etc/munge/munge.key
     
    Create a slurm key
    sudo openssl genrsa -out /etc/slurm-llnl/slurm.key 1024
     
    Create a slurm certificate
    sudo openssl rsa -in /etc/slurm-llnl/slurm.key -pubout -out /etc/slurm-llnl/slurm.cert
     
    These two commands copy slurm.conf, slurm.key, slurm.cert, and munge.key from the master to the worker's home directory.
    sudo scp /etc/slurm-llnl/* orangepi@bottom-slave:/home/orangepi/
    sudo scp /etc/munge/munge.key orangepi@bottom-slave:/home/orangepi/

  3. On bottom-slave
     
    sudo mv /home/orangepi/slurm* /etc/slurm-llnl/
    sudo mv /home/orangepi/munge.key /etc/munge/munge.key
     
  4. On top-master
     
    sudo chmod g-w /var/log
    sudo chown slurm:slurm /var/lib/slurm-llnl/*
    sudo chmod 755 /var/lib/slurm-llnl/*
    sudo chown orangepi:orangepi /etc/munge/munge.key

  5. Make sure all the nodes are time-synchronized. Use date to check each node's time. To synchronize all of them, do this on bottom-slave: sudo date --set="$(ssh top-master date)"
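
    A quick way to compare the two clocks from top-master, before and after synchronizing (standard commands, no assumptions):

    date; ssh bottom-slave date      # the two timestamps should agree to within a second or so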

  6. Test SLURM: On all the nodes:
     
    sudo chown munge:munge /etc/munge/munge.key
    sudo /etc/init.d/munge start
    sudo systemctl enable slurmctld
    sudo systemctl enable slurmd
    sudo systemctl start munge
    sudo /etc/init.d/slurmctld start
     
    On workers (that includes top-master and bottom-slave in our 2-node cluster):
     
    sudo /etc/init.d/slurmd start
     
    On all the nodes:
     
    sudo systemctl enable munge.service
     
    On top-master
     
    Go to /home/orangepi and create a file testslurm.sh
    cd /home/orangepi
     
    Put these lines in testslurm.sh
     
    #!/bin/bash
    #SBATCH -p debug
    #SBATCH -n 1
    #SBATCH -t 12:00:00
    #SBATCH -J slurmJob
    /sharedFiles/slurmTest.sh
     
    Now we make the program that will be run by our test. Create the file /sharedFiles/slurmTest.sh and put these lines in it:
     
    #!/bin/bash
    sleep 30
    hostname >> /home/orangepi/slurmOut.txt
     
    Make the file executable: sudo chmod +x /sharedFiles/slurmTest.sh
     
    Now run these 8 identical commands in quick succession:
    sbatch testslurm.sh
    sbatch testslurm.sh
    sbatch testslurm.sh
    sbatch testslurm.sh
    sbatch testslurm.sh
    sbatch testslurm.sh
    sbatch testslurm.sh
    sbatch testslurm.sh
     
    Once they have completed, open /home/orangepi/slurmOut.txt on each node; it should contain that node's hostname 4 times (the 8 jobs are split evenly across the 2 nodes, 4 per node).
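
    While the 8 jobs are queued and running, you can watch them from top-master with the standard SLURM status commands (no assumptions beyond the node and partition names above):

    sinfo                    # both nodes should appear in the debug partition
    squeue                   # shows the slurmJob entries still pending or running
    scontrol show nodes      # per-node detail if something looks wrong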

NOTES

When rebooting, reboot top-master first and, when it comes back up, reboot bottom-slave.
When powering off, power off bottom-slave and then top-master.
When powering on, power on top-master and then bottom-slave.


SLURM Troubleshooting Tips

Some common Q&A for slurm can be found here: https://slurm.schedmd.com/troubleshoot.html
When you are done, stop everything properly; otherwise the SLURM daemons may misbehave the next time you start the cluster. Run these commands on all the nodes:
sudo /etc/init.d/munge stop
sudo /etc/init.d/slurmctld stop
sudo /etc/init.d/slurmd stop
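
If jobs sit in the queue or a node shows as down, a few checks worth trying. The munge round-trip test is the standard one from the SLURM documentation; the log locations below are where the Debian slurm-llnl packages typically put them, so adjust the paths if your slurm.conf says otherwise.

munge -n | unmunge                       # munge credentials work locally
munge -n | ssh bottom-slave unmunge      # keys match across nodes (run from top-master)
sinfo                                    # look for nodes stuck in a down or drained state
sudo scontrol update NodeName=bottom-slave State=RESUME    # return a downed node to service
tail /var/log/slurm-llnl/slurmd.log      # worker daemon log
tail /var/log/slurm-llnl/slurmctld.log   # controller log (on top-master)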