This page has mostly migrated to Gitlab. Information on this page may be outdated.
The IGR has a small number of Linux servers for use by group members. You can view the current usage of the IGR compute nodes from inside the Uni firewall at http://wiay.astro.gla.ac.uk:3000 (username igrguest, password K3lv1n). There are machines with large amounts of RAM, multiple GPUs, and high core counts, which are suitable for most student projects, testing purposes, or development work as a step to very large scale analyses. LIGO-Virgo-KAGRA members also have access to the LIGO Data-Grid and in particular the UK's Hawk computing cluster located at Cardiff, which is more suitable for very large scale jobs that require thousands of CPU-hours or more.
This page documents some common tips for using these resources. In particular we describe things like conda environments, condor pools, and containers, which you might encounter as part of a project.
New users may log in with their GUID using `ssh GUID@wiay.astro.gla.ac.uk` . Before you use your GUID for the first time, you must visit https://it.physics.gla.ac.uk/identity and self-enrol your GUID in the system. There are also old style “astro accounts” which many of the staff have, but these are in a separate system.
Users can log into individual machines, but for any serious long-running jobs they are encouraged to use the condor pool, which can be accessed via wiay.astro.gla.ac.uk. This helps to ensure that the resources are used fairly and that misbehaving jobs do not crash a machine for other users. Condor is also used on the LIGO DataGrid for large scale work, so it's a useful skill to know.
Our machines are maintained primarily by Norman Gray and Jamie Scott. If you think a machine is down and needs rebooting, or to request an increase to your disk quota, please email phas-it@glasgow.ac.uk.
For discussion of computing with the IGR resources, or for generic computing discussion amongst IGR group members, please use the Data-Computing Teams Channel.
You will probably find that running python programs which you're developing on one of the shared compute machines like deimos or wiay is difficult, because the system-wide versions of python libraries are unlikely to match the ones your code needs.
Virtual environments are a Python feature which allow you to isolate your code from the rest of the system, so that you can always run your code against the versions of the packages which you expect. For the remainder of this section I'll assume that you're using Python 3 (I strongly advise that if you can you should write Python 3 rather than Python 2; if you're stuck in a situation where you need Python 2 then you'll need to read the section on virtual environments in Python 2 at the bottom of this page instead).
If your system default python is a version of Python 2, then replace `python` with `python3` in the examples below.
Assuming you're running on Python3.3 or later (which you can check by running
$ python --version
[where I type lines starting with a $ it indicates the remainder of that line should be typed into the terminal, without the $ sign]) then you can run
$ python -m venv /path/to/new/virtual/environment
As to the location of the venv, I keep a directory in my home directory at ~daniel/.virtualenvs/<MACHINE> for each machine I run on, so I'd keep a deimos keras environment in e.g.
/home/daniel/.virtualenvs/deimos/keras
To run code in your virtual environment you need to activate it. In my above example I'd do this by running
$ source /home/daniel/.virtualenvs/deimos/keras/bin/activate
but you should replace your path accordingly.
Once you've activated your virtual environment you can run python scripts in the normal way, but they'll now be isolated from the rest of the machine (this is good, as the versions of python libraries on the machine probably don't match what you need, and can't easily be changed for other reasons). This means you'll need to reinstall packages such as keras and tensorflow-gpu. You can do this (once your venv is activated) by running
$ pip install numpy keras tensorflow-gpu
and so-on.
It can be helpful to keep a list of the packages which you need to run your code in a file called “requirements.txt” in your project; python can help you make this. If you have a working virtual environment with all of the packages you need you can run
$ pip freeze
to produce a list of all the packages installed in the virtual environment.
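For example, a minimal sketch of this workflow (the file name requirements.txt is only a convention) is to redirect the output to a file, and later use that file to rebuild an equivalent environment:
$ pip freeze > requirements.txt
$ pip install -r requirements.txt
The second command would typically be run inside a freshly created virtual environment, for instance on a different machine.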
If you're working on LALSuite virtual environments can be especially helpful, as they'll allow you to isolate different branches and versions, and allow you to work on unstable code without jeopardising the stability of your main LALSuite installation. If you just need the latest LALSuite installation you can run
$ pip install lalsuite
However, if you're installing from source you'll need to tell Make where to install the files to. You can do this by setting the prefix when you run each ./configure stage:
$ ./configure --prefix=$VIRTUAL_ENV
where $VIRTUAL_ENV is an environment variable which is set when your virtual environment is active (so you'll need to run your installation after activating the virtual environment). You can find a script which handles this process on Github here, although it is now slightly out of date, as glue and pylal have been moved out of the main source tree for LALSuite.
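As a rough sketch of the from-source build (assuming you have already cloned the lalsuite repository and activated your virtual environment; the venv path here is just an example, and whether you need the ./00boot step depends on the LALSuite version you have checked out):
$ source /home/daniel/.virtualenvs/deimos/lalsuite/bin/activate
$ cd lalsuite
$ ./00boot
$ ./configure --prefix=$VIRTUAL_ENV
$ make
$ make install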
An alternative to (and generalisation of) virtual environments are provided by conda. Conda environments can reproduce a full setup, somewhere between a virtual env and a container (see below).
Conda is installed on wiay, and is available with the conda command. Some environments are available at /data/wiay/conda_envs/, but you can also clone and create your own, as shown below. You can see the list of available environments using conda env list.
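For example, you could clone one of the shared environments into your own environment area, or create a fresh one from scratch (the environment names below are just placeholders):
conda create --name my-nessai --clone /data/wiay/conda_envs/nessai_test
conda create --name my-env python=3.9 numpy
conda activate my-env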
The LIGO Scientific Collaboration publishes conda environments with the latest releases of standard packages such as gwpy, lalsuite, pycbc, astropy, ligo.skymap and so on. LIGO standard conda environments are also available via cvmfs at the path /cvmfs/oasis.opensciencegrid.org/ligo/sw/conda/envs/. To have conda automatically look in this path, I have the following in my ~/.condarc file:
jveitch@wiay:~$ cat .condarc
auto_activate_base: false
envs_dirs:
  - /home/jveitch/.conda/envs
  - /cvmfs/oasis.opensciencegrid.org/ligo/sw/conda/envs/
  - /data/wiay/conda_envs
pkgs_dirs:
  - /home/jveitch/.conda/pkgs/
For the latest stable release of the International GW Network software stack with python 3.8 I can then use
conda activate igwn-py38
While virtual environments can substantially ease the process of running code on machines like deimos, it is sometimes easier to turn to a technology known as “containerisation”, which allows code to run inside self-contained environments that isolate it (almost) completely from the underlying operating system. This can make deploying an analysis to a large cluster environment, such as the LIGO Data Grid, easier, as you don't need to rebuild your virtual environment for each architecture you run on.
The IGR machines support two different containerisation techniques: docker, and singularity (where it's possible you should aim to use Singularity, which does not require root-level permission to execute code, and is therefore more widely supported on HPC platforms).
In order to run code in a container you need an “image”: effectively a virtual operating system in which the code's execution takes place. To produce one you either need to write a script which tells the container software how to build the image (a Dockerfile for Docker, or a recipe for Singularity), or find and download a suitable pre-built image from a container registry. LIGO runs such a registry, alongside Dockerhub and SingularityHub, which are respectively the standard registries for Docker and Singularity images. For all three of these registries, images are automatically rebuilt to reflect the most up-to-date state of a given codebase, but older tagged releases are normally available as well, in a similar way to a tagged git repository.
If you need to use docker on any of the IGR machines you'll need to be given permission to do this (you need to be added to the docker group on these machines). If you can manage without docker-specific features, please try to use Singularity.
Singularity is a containerisation programme which is designed for use in High Performance Computing, and which allows containers to run without the need for a daemon with elevated privileges, and the associated security concerns which docker introduces.
Singularity images can be constructed from docker images, or directly from Singularity recipes, which are akin to Dockerfiles. To build singularity images you need to be a member of the singularity group on wiay; ask Jamie or John V if you need to be added.
Name | Purpose | Location | CUDA Support |
---|---|---|---|
machine-learning:latest | GPU-accelerated machine-learning applications | /data/wiay/containers/machine-learning:latest | Y |
LIGO Open Data is available via cvmfs, which is set up on wiay. To access open data with gw_data_find, the public server URL is datafind.ligo.org:443, which is set by default for all users on wiay. LIGO Collaboration-private data is technically available but can be a bit troublesome to authenticate with. To do so, you first need to run
ligo-proxy-init albert.einstein
where albert.einstein is replaced with your LIGO username. The frame files are then found at /cvmfs/ligo.osgstorage.org/frames.
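As an illustrative gw_data_find query for open data (the frame type and GPS interval below are only examples, not values prescribed by this page):
gw_data_find --observatory H --type H1_GWOSC_O2_4KHZ_R1 --gps-start-time 1187008880 --gps-end-time 1187008980 --url-type file
This returns the locations of the frame files covering that interval, which you can then read with your preferred tool.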
For admins, the documentation for setting up cvmfs with LIGO authentication is located here: LIGO CVMFS setup info
HTCondor, or “condor”, is a job scheduling system which is used to run computing jobs on shared cluster resources. We currently have a small condor “pool” within the IGR, and much larger pools exist within the LSC.
When you submit a job to condor your program gets added to a queue, and when resources are available it is sent to an available compute “node” to run. This means you have less interactive control of your program, but also means that you can set up a large number of jobs and just wait for them to be completed. It also helps with the sharing of resources between members of the group.
If you have a job which is likely to take a long time to run (hours or more), you should definitely consider submitting it via condor rather than trying to run it directly on the machine.
Machine | Submit Node | Compute Node | Cores | Memory | GPUs |
---|---|---|---|---|---|
wiay | yes | yes | 44 | 128 GB | 3 |
deimos | no | yes | 20 | 110 GB | 2 |
muck | no | yes | 16 | 128 GB | 1 |
serenity | no | yes | 6 | 80 GB | 2 |
hermes | no | yes | 12 | 64 GB | 0 |
In order to run a job through condor you need to provide some metadata, which is done in the form of a condor “subfile”. Here's an example, which would run a program called “run_tensorflow.sh”.
# Typical submit file options
universe = vanilla
log = $(Cluster).$(Process).log
error = $(Cluster).$(Process).err
output = $(Cluster).$(Process).out

# Fill in with your own script, arguments and input files
# Note that you don't need to transfer any software
executable = run_tensorflow.sh
arguments =
transfer_input_files =

# Resource requirements
request_cpus = 1
request_memory = 2GB
request_disk = 4GB

# Number of jobs
queue 1
The top block of code tells condor where the output of the program should be directed to (so standard error is written to the file specified in error, and standard output to the file in output), and where execution-based logs (log) should be written.
The next section tells condor how to run the program; you tell it the executable, which arguments to pass to it, and which files need to be copied from the machine which you submit the job from.
Then we need to tell condor what our program will need to run, in terms of memory, cpu requirements, and disk space. This allows condor to hold your program back until there are sufficient available resources to run it.
The final block contains the “queue” command, telling condor to add this job to the queue.
If you're running code in a python virtual environment it can be helpful to have a wrapper script to launch this when the job goes to the compute node. For example:
#!/bin/bash
# A script to set up a specific environment and to run a python script in it
#
# Activate the virtual env
source /home/daniel/.virtualenvs/wiay/heron-stable/bin/activate

# Run the python script (and any further arguments) passed on the command line
python "$@"
You can then set this script as the executable in the submit file (remember to make it executable with chmod +x <script.sh>), and provide the python script and its arguments in the arguments field of the submit file, e.g.
executable = shim.sh
arguments = run_tensorflow.py more arguments
If you are running a python job in a conda environment, the best way to use condor is to make the python executable from the conda env the executable that condor knows about, and pass the script as an argument.
If your conda environment is in /data/wiay/conda_envs/nessai_test and you want to run testpython.py, the submit file would have lines like
executable = /data/wiay/conda_envs/nessai_test/bin/python
arguments = testpython.py
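Putting this together with the resource requests from the earlier template, a complete submit file might look something like the sketch below (the log file names and resource values are placeholders to adapt to your job):
universe = vanilla
log = $(Cluster).$(Process).log
error = $(Cluster).$(Process).err
output = $(Cluster).$(Process).out

executable = /data/wiay/conda_envs/nessai_test/bin/python
arguments = testpython.py
transfer_input_files = testpython.py

request_cpus = 1
request_memory = 2GB
request_disk = 4GB

queue 1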
Now that you've written your submit file, you'll need to submit your job to the condor pool. To do this you'll need to run
condor_submit submit_file.sub
on one of the submit machines (see above) where submit_file.sub is the name of the submit file you wrote.
You can check the status of your jobs by running
condor_q
on any submit machine.
If your jobs are running they will have a “R” status code. If they have status “I” it means they are idle, and probably waiting for a compute slot to become available.
If any of your jobs have a “H” status code, this means they are held for some reason, and will not run.
To investigate your job's status further, use the condor_q -better-analyse command, which will explain in words why a held job cannot run.
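For example, to get an explanation for a particular held job (the job ID 1234.0 is a placeholder; use the ID reported by condor_q):
condor_q -better-analyse 1234.0
Once you have fixed the underlying problem you can remove the held job with condor_rm and resubmit it.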
Condor is configured to allow you to request a GPU for your job.
You'll need to add an extra directive, request_gpus = 1, to your sub file to ensure that your job only runs when a GPU is available. So by adding
# Resource requirements
request_gpus = 1
to the submit file above you can require that condor waits until a machine with a GPU is available.
If you require a specific GPU type, or other constraints on the kind of GPU, then you can express this with the require_gpus submit file option.
For example to request a V100 with 32GB of RAM,
request_gpus = 1
require_gpus = (DeviceName == "Tesla V100-PCIE-32GB")
To require a card with at least 20 GB of RAM:
request_gpus = 1
require_gpus = (GlobalMemoryMb >= 20000)
Alternatively you can require that a particular card has some amount of _available_ memory (e.g. 10 GB) with
require_gpus = (GlobalMemoryMb - (MemoryUsage?:0) > 10000 )
This is helpful when there are other jobs running on a GPU that are not under condor control, and is preferable to the previous option if you do not need the entire card's RAM.
To see the other GPU properties that condor allows you to specify, run /usr/lib/condor/libexec/condor_gpu_discovery -extra -mixed; the attributes listed in the square-bracketed output are the ones you can constrain on. For example, if you need a server-grade GPU rather than a consumer-grade one, you can tell condor not to give you a “GeForce”-branded consumer card with
require_gpus = (!regexp(".*GeForce.*",DeviceName))
In order to ensure that your job runs in a singularity container you'll need to add a few lines to your submit file.
# Singularity settings
+SingularityImage = "path_to_image"
Requirements = HAS_SINGULARITY == True
where you replace the path_to_image with the location of the image you wish to use.
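For example, to use the machine-learning image listed in the table above (assuming that image contains the software your job needs; substitute your own image path otherwise):
# Singularity settings
+SingularityImage = "/data/wiay/containers/machine-learning:latest"
Requirements = HAS_SINGULARITY == True
# This image has CUDA support, so you will usually also want to request a GPU
request_gpus = 1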
If you're running on a system where you have sudo access you can follow the instructions here.
Python 2 doesn't support virtual environments out-of-the-box in the way that Python 3 does, so you'll need to install a package to do this for you. (You should check if the system you're using already has this installed by running
$ which virtualenv
If the response is blank you'll need to follow the installation instructions; otherwise you can skip ahead to making a virtual environment.)
If you don't have sudo access you'll need to install this locally by running
$ pip install --user virtualenv
You should now be able to make virtual environments by running
$ virtualenv /path/to/new/virtual/environment
and then following along with the instructions above for Python 3 virtual environments.
To find the version of cuDNN, run
less /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
where the path to cudnn.h should correspond to the CUDA installation reported by
which nvcc
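To check the CUDA toolkit and driver versions themselves, the standard commands below should work on the GPU machines (assuming the NVIDIA tools are on your path):
nvcc --version
nvidia-smi
nvcc --version reports the toolkit version, while nvidia-smi shows the driver version and the current GPU usage.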
Currently this section is something we don't need to worry about, but if we end up with a more heterogeneous GPU setup in the future where different machines need different driver versions we'll need to think about this.
In order to run portable GPU code under singularity it is necessary to load the drivers from the compute node (in the case that the compute nodes have heterogeneous hardware). Running interactively, this is straightforward: the standard IGR cuda image expects the drivers to be bound in /nvlib and /nvbin. So, if you need version 390.87 of the nvidia driver you can bind this into the image like so:
cd /scratch/aries/daniel/
singularity shell -B nvidia390d87:/nvlib,nvidia390d87:/nvbin cuda.simg
Ideally we would set up the singularity configuration on each machine to bind the correct drivers. There are details about this here: http://gpucomputing.shef.ac.uk/education/creating_gpu_singularity/, which amount to adding
bind path = ~/mynvdriver:/nvbin
bind path = ~/mynvdriver:/nvlib
to the singularity configuration.