1/2-day cluster tutorial

PDF of the slides: here

Solutions: solutions.tgz. Look in the source code to see how the jobs are supposed to be submitted.

Accessing and copying data

The frontal nodes (access1-cp and access2-cp) have Internet access; the compute nodes do not.

Never run any intensive computation on the frontal nodes.

The SIC cluster is convenient because, unlike on most clusters, the home directory is your normal home directory. Data is stored on a scratch partition, /services/scratch, which is visible from both the frontal nodes and the compute nodes.

Please create the directory

/services/scratch/YOUR_TEAM_NAME/YOUR_LOGIN
for the data related to the exercises.

If you are root on your INRIA machine, the scratch space can be mounted with (replace lear with your team name):

mount ral-nas5.inrialpes.fr:/vol/ral_scratch/scratch/lear /mnt/0

Assignments

Source of the assignments: assignments.tgz.

You can use scp (see the SSH section below) or just right-click "Copy link" and wget that link from the command line, e.g.

wget http://sed.inrialpes.fr/~douze/cluster_tutorial/assignments.tgz

Each assignment consists of the following steps:

  1. unpack the code and data
  2. use an interactive session (oarsub -I) to run the small version of the assignment
  3. estimate the total runtime for the large version
  4. split the large version into 1 to 3 minute OAR jobs. Figure out how to pass parameters from oarsub to the code.
  5. validate the job's code in the interactive session
  6. launch the series of jobs
  7. keep an eye on the run
  8. write and run code to merge the job results
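
Steps 3, 4 and 6 can be sketched as follows. This is only an illustration under stated assumptions, not the method used in solutions.tgz: the cost model (runtime proportional to size raised to some exponent), the 120-second target and the wrapper script name run_slice.sh are all hypothetical. The commands are printed rather than executed, so the sketch can be tried without OAR.

```python
import math

def estimate_runtime(small_seconds, small_size, large_size, exponent=1.0):
    """Extrapolate the measured small-case runtime to the large case,
    assuming the cost grows as size**exponent (an assumption to check)."""
    return small_seconds * (large_size / small_size) ** exponent

def plan_jobs(n_units, total_seconds, target_job_seconds=120.0):
    """Cut n_units work units into contiguous [start, end) ranges so that
    each job lasts roughly target_job_seconds."""
    n_jobs = max(1, min(n_units, math.ceil(total_seconds / target_job_seconds)))
    bounds = [j * n_units // n_jobs for j in range(n_jobs + 1)]
    return list(zip(bounds[:-1], bounds[1:]))

def oarsub_commands(ranges, script="./run_slice.sh"):
    """Build one submission command per range. The script name is
    hypothetical: it stands for a wrapper that takes START and END."""
    return ['oarsub "%s %d %d"' % (script, start, end) for start, end in ranges]

# Example: 3 images took 90 s, the large case has 30 images (linear cost).
total = estimate_runtime(90.0, small_size=3, large_size=30)
for command in oarsub_commands(plan_jobs(30, total)):
    print(command)
```

Each job receives its range as command-line arguments, which is one simple way to pass parameters from oarsub to the code. For the matrix product, an exponent of 3 is a more plausible starting point than 1, since the triple loop does a number of operations proportional to the cube of the matrix size.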

C

Compute the product of two square matrices stored in C (row-major) order and loaded from files. The code uses a triple loop (never do this in practice, use BLAS!).

Versions:

Split the computation into slices. Each task computes a part of the matrix. The merging code stacks the slices.
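
The slicing idea can be illustrated independently of the C code. The sketch below is written in Python for brevity and assumes a split by rows of the result, which is only one possible choice; the function names are hypothetical and the matrices are plain lists of lists rather than files.

```python
def matmul_rows(a, b, row_start, row_end):
    """Compute rows row_start..row_end-1 of the product a*b with the
    naive triple loop. This is the work of one task."""
    n_inner = len(b)
    n_cols = len(b[0])
    out = []
    for i in range(row_start, row_end):
        row = [0.0] * n_cols
        for k in range(n_inner):
            a_ik = a[i][k]
            for j in range(n_cols):
                row[j] += a_ik * b[k][j]
        out.append(row)
    return out

def merge_slices(slices):
    """Stack row slices, given in increasing row order, into one matrix."""
    merged = []
    for rows in slices:
        merged.extend(rows)
    return merged

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
# Two tasks, one row each, then the merge step.
parts = [matmul_rows(a, b, 0, 1), matmul_rows(a, b, 1, 2)]
print(merge_slices(parts))
```

Because each row of the result depends only on the corresponding row of the first matrix, the tasks are independent and the merge is a plain concatenation.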

Bonus: combine with multithreading (#pragma omp parallel for).

Data files for C: data_c.tgz

Matlab

Run an extremely slow circle detector on a set of images.

Matlab is not available on the cluster (it would consume too many licenses anyway). Workarounds:

Small case: process 3 images. Large case: process 30.

Turn the Matlab script into a function that takes parameters.

Passing parameters to Octave: an m-file is a script; its command-line parameters are available as a cell array from argv(), see this documentation.

Passing parameters to Matlab: two cases:

Data files for Matlab/octave: data_matlab.tgz

Python

A program that processes a set of text files extracted from PDFs to construct the sparse document-word matrix.

Three stages:

  1. Collect all words (does a pass over the documents)
  2. Select words to make a dictionary (remove too frequent and infrequent words)
  3. Build matrix (second pass over documents)
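
The three stages above can be sketched as follows. This is a minimal illustration, not the assignment code: whitespace tokenization, the frequency thresholds, the function names and the representation of the sparse matrix as (document, word, count) triplets are all assumptions made for the example.

```python
from collections import Counter

def collect_words(docs):
    """Stage 1: count in how many documents each word occurs."""
    doc_freq = Counter()
    for text in docs:
        doc_freq.update(set(text.lower().split()))
    return doc_freq

def select_dictionary(doc_freq, n_docs, min_docs=2, max_fraction=0.5):
    """Stage 2: keep words that are neither too rare nor too frequent,
    and give each kept word a column index."""
    kept = sorted(w for w, c in doc_freq.items()
                  if min_docs <= c <= max_fraction * n_docs)
    return {word: column for column, word in enumerate(kept)}

def build_matrix(docs, dictionary, first_doc=0):
    """Stage 3: sparse matrix as (doc index, word index, count) triplets.
    first_doc lets a job handle a sub-range of the documents."""
    triplets = []
    for i, text in enumerate(docs, start=first_doc):
        counts = Counter(w for w in text.lower().split() if w in dictionary)
        triplets.extend((i, dictionary[w], c) for w, c in sorted(counts.items()))
    return triplets

docs = ["the cat sat", "the dog sat", "the cat ran", "the bird"]
dictionary = select_dictionary(collect_words(docs), len(docs))
print(dictionary)
print(build_matrix(docs, dictionary))
```

Stage 3 only reads the dictionary, so it can run on any sub-range of the documents independently; concatenating the triplets of consecutive sub-ranges gives the same result as a single pass.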

Cases: small = 2700 files, large = 17000 files.

Just parallelize stage 3, reusing the dictionary from the small case.

Data files for Python:

Quick commands

Below are commands typically used on clusters that you can copy/paste and adapt to your needs.

SSH

SSH is the main tool to access the frontal node and copy data.

From an INRIA machine (including bastion.inrialpes.fr)

Replace douze with your login.

ssh douze@access1-cp 

To copy a file (.bashrc) to a directory on the frontal node (/tmp)

scp .bashrc douze@access1-cp:/tmp

To mount a directory from the cluster (/services/scratch/lear/douze) on a local mount point (/mnt/cluster_scratch), use sshfs (available on Linux, and on the Mac via e.g. port install):

sshfs douze@access1-cp:/services/scratch/lear/douze /mnt/cluster_scratch

From elsewhere

This includes the visitor network. You can use bastion as a proxy: add the option
-o ProxyCommand="ssh douze@bastion.inrialpes.fr -W access1-cp:22 "
to all ssh, scp and sshfs commands.

Examples:

ssh -o ProxyCommand="ssh douze@bastion.inrialpes.fr -W access1-cp:22 " douze@localhost

Copy data to the cluster

scp -o ProxyCommand="ssh douze@bastion.inrialpes.fr -W access1-cp:22 " .bashrc douze@localhost:/tmp

Mount directory from cluster

sshfs -o ProxyCommand="ssh douze@bastion.inrialpes.fr -W access1-cp:22 " douze@localhost:/services/scratch/lear/douze /mnt/cluster_scratch

To access the cluster's web server, which hosts the Monika and Gantt tools, from outside, make an ssh tunnel:

ssh douze@bastion.inrialpes.fr -L 8080:visu-cp.inrialpes.fr:80

Then point your browser to the tunneled connection: http://localhost:8080/monika.

Within the cluster

OAR controls ssh traffic between nodes, so you should use the following wrappers:

Quick OAR recap

The job management system is OAR. Crash course:

See the man pages or the online documentation for more info.

Quick Bash syntax

Automating cluster jobs is a lot easier with a minimal knowledge of the shell. In the transcript below, ~ $ is the prompt, not bash code :-)
~ $ # setting a variable
~ $ a=3
~ $ # accessing the value of a variable
~ $ echo $a
3
~ $ # doing computations (only on integers)
~ $ echo $(( (a+7)/2 ))
5
~ $ # loop with fixed bounds
~ $ for i in {0..4}; do
 echo $i
done
0
1
2
3
4
~ $ # loop with variable bounds and step
~ $ for((b = a;b >= 0; b--)); do
 echo $b
done
3
2
1
0
~ $ # quote expansion rules
~ $ echo X${a}Y
X3Y
~ $ # without braces, $aY is read as the variable aY, which is not set
~ $ echo X$aY
X
~ $ # double quote does expand
~ $ echo "X$a"
X3
~ $ # single quote does not expand
~ $ echo 'X$a    b'
X$a    b
~ $ # env displays variables known to subprocesses
~ $ env
MANPATH=/opt/local/share/man:
....
DISPLAY=/tmp/launch-2Q1sSj/org.macosforge.xquartz:0
_=/usr/bin/env
~ $ env | grep '^a=' # at this point subprocesses like env do not know about a
~ $ # export a variable to make it known to subprocesses
~ $ export a
~ $ env | grep '^a='
a=3
~ $ # important variables
~ $ # directories where to search for commands
~ $ echo $PATH
/opt/local/bin:/opt/local/sbin:/opt/local/bin:/opt/local/sbin:/opt/local/bin:/opt/local/sbin:/Users/matthijs/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/opt/X11/bin
~ $ # directories where to search for dynamic libraries (.so files). Empty = use default /lib64 /usr/lib64 etc.
~ $ echo $LD_LIBRARY_PATH

Shell commands can be put in a shell script to be run with bash script_name.sh. Other useful commands for shell scripts:

screen

GNU screen is a terminal multiplexer. It lets you leave programs running when you disconnect from an ssh session, and gives you tabs in text mode.

Crash course:

# start screen
screen

# connect to running screen
screen -x

# make tabs visible 
cat > ~/.screenrc << EOF
caption always
caption string "%{kw}%-w%{wr}%n %t%{-}%+w"

EOF

Keyboard shortcuts within screen always start with Ctrl-A.

More information about bash: info bash or online.

Useful links

INRIA Grenoble's production cluster documentation: here