Python on the BMRC Cluster
Quick Links
- Pre-installed python software
- Jupyter Notebook for remote interactive Python coding
- Creating and managing your own python virtual environments
- Using Conda
USING PYTHON ON THE BMRC CLUSTER
The principal method for using Python on the BMRC cluster is to load one of our pre-installed software modules. To see which versions of Python are available run (noting the capital letter):
module avail Python
Our pre-installed Python modules include a number of common packages. To see which packages are included run e.g.:
module whatis Python/3.7.4-GCCcore-8.3.0
and then check the Extensions list.
In addition to Python itself, we have a number of auxiliary Python modules which can be loaded in order to access other widely used packages. For example, scipy, numpy and pandas are available through the SciPy-bundle-... modules. To see which versions of SciPy-bundle-... are available run:
module avail SciPy-bundle
To find a SciPy-bundle module that is compatible with your chosen Python module, check the Python version noted in the name and the toolchain. For example, SciPy-bundle/2019.10-foss-2019b-Python-3.7.4 is compatible with Python/3.7.4-GCCcore-8.3.0, because
- they use the same version of Python, and
- they have compatible toolchains because the GCCcore-8.3.0 toolchain is part of the foss-2019b toolchain (to verify this, use module show foss/2019b).
If in doubt, simply try to load both modules together - if they are incompatible, an error will be reported.
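For example, to check that the two modules named above are compatible, you could try loading them together:
module load Python/3.7.4-GCCcore-8.3.0
module load SciPy-bundle/2019.10-foss-2019b-Python-3.7.4
If the modules were incompatible, the second load would report an error.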
Jupyter Notebook for Remote Python Coding
Jupyter Notebook is a popular Python-based application that allows you to edit and run Python code remotely through a web browser. Here's how to use it.
- Login to cluster1 or cluster2.
- If you plan to use Jupyter Notebook only with software available through our modules system, then proceed to step 3. Otherwise, if you need to install your own Python packages, you must set up your own Python virtual environments as described in the section below. You will need to create two Python virtual environments, one for Skylake and one for Ivybridge (as explained in that section). Inside each virtual environment, manually install Jupyter Notebook using pip install --force notebook.
- While logged in to cluster1 (aka rescomp1) or cluster2 (aka rescomp2), start an interactive cluster session, e.g. srun -p short --pty bash. Make a note of which node is running your interactive session by using the hostname -s command or checking your prompt.
- If you are using only modules, then at this point you can load the IPython module:
module load IPython/7.9.0-foss-2019b-Python-3.7.4
Alternatively, if you are using your own Python virtual environment, you need to load the same Python module that you used to create your virtual environment and then activate your virtual environment. Run echo $MODULE_CPU_TYPE to see whether you are currently on a skylake or ivybridge host and then activate the appropriate Python virtual environment.
- Start Jupyter Notebook: jupyter notebook --no-browser --ip=*
After running this command, you will see several lines of text appear on screen. The last few lines will look as below - you need only look at the line which begins http://127.0.0.1...
To access the notebook, open this file in a browser:
file:///[....].local/share/jupyter/runtime/nbserver-16902-open.html
Or copy and paste one of these URLs:
http://<interactive_host_name>:8888/?token=59836e245b9bc3ba915d3d7ab31f3fc15f257972ed5c5ea3
or http://127.0.0.1:8888/?token=59836e245b9bc3ba915d3d7ab31f3fc15f257972ed5c5ea3
Note that your own port number and token may differ from those shown here, and <interactive_host_name> will be the full version of your interactive hostname, e.g. compc001.hpc.in.ox.ac.uk, that you discovered above. In the following instructions, make sure to use the information shown on your own screen.
- At this point, you need to create a tunnelled connection from your own computer to your interactive session. First take note of the port number, which could be 8888 as shown above or another number.
- Then open a new terminal window ON YOUR OWN COMPUTER (i.e. not on rescomp1 or rescomp2) and create an SSH tunnel following this template:
ssh -L 8888:interactivehostname:8888 username@cluster1.bmrc.ox.ac.uk
Remember to use your own port number in place of 8888, as well as your own interactive hostname (the short version is sufficient) and your own username.
NB1 If you are also running Jupyter Notebook locally on your own computer, it is likely that port 8888 on your own computer is already in use. If so, change the first port number (before interactivehostname) to e.g. 9999. Make sure that the second port number (after interactivehostname) matches the port reported by Jupyter when you started it.
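For example, assuming your interactive host is compc001 (as in the example hostname above), Jupyter reported port 8888 and you choose 9999 as your local port, the tunnel command would look like:
ssh -L 9999:compc001:8888 username@cluster1.bmrc.ox.ac.uk
Substitute your own hostname, port numbers and username as appropriate.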
NB2 If you have configured your local SSH client to re-use a single SSH connection (i.e. if you have configured "ControlMaster auto" or similar in your ~/.ssh/config) then you should create your SSH tunnel via cluster2 instead of cluster1.
- After running the tunnel command, your terminal will appear to be logged into cluster1 and (invisibly to you) an additional connection now exists between your computer and your interactive host. Now open a web browser on your own computer and copy the line from your own terminal corresponding to the http://127.0.0.1... line above.
NB If you changed your local port number when creating the SSH tunnel, you will also need to change the port number in the http://127.0.0.1... URL.
- Paste the newly copied line into your web browser and Jupyter Notebook will appear.
- To close down, click the Quit button in the top right of Jupyter notebook and then close all your terminal windows.
Python Virtual Environments and Local Packages
In most cases, if you require software or Python packages that are not yet installed on the cluster, it is best to email us to request them. When sending software requests, please include sufficient information: the software name, its homepage or download page, and whether you wish to use it in conjunction with any other particular software modules.
In some cases, however, you may wish to try out software packages or install them for testing purposes. In these cases, installing your own packages via a Python virtual environment may be the best way.
On the BMRC cluster, we recommend the use of Python virtual environments in preference to other ways of handling multiple Python installations. A Python virtual environment provides you with a local copy of Python over which you have full control, including which packages to install.
The need for dual virtual environments
At any one time the BMRC cluster comprises computers with different generations of CPU architecture. Currently, these fall into two groups. Our C and D nodes, as well as rescomp3, use Ivybridge-compatible CPUs, while our E and F nodes, as well as cluster1 and cluster2, use Skylake CPUs. Software built for Skylake will not run on Ivybridge, while software built for Ivybridge will run on Skylake but will not take advantage of the newer CPU's capabilities. For this reason, we in fact maintain two separate libraries for our pre-installed software, one for Ivybridge and one for Skylake, although this is normally invisible to the user because our system automatically chooses which software build to make available when you load something. When creating and managing your own environments, however, you will need to manage this yourself.
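For example, you can check which CPU type applies to your current host by inspecting the MODULE_CPU_TYPE environment variable (also used in the job script example further below):
echo $MODULE_CPU_TYPE
This prints skylake on cluster1, cluster2 and the E and F nodes, and ivybridge on rescomp3 and the C and D nodes.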
Creating and managing your own python virtual environments
Here is an example of how to create and manage your own python virtual environments. Using this method, you create local package libraries on disk. Once configured, you can then install or remove packages using e.g. pip as you wish.
In order to ensure that your code will work across all cluster nodes (whether those nodes use Ivybridge or Skylake CPUs), the overall goal is to create two near-identical local package libraries, one for Skylake CPUs and one for Ivybridge CPUs, and to select the correct one automatically when needed.
- First log in to either rescomp1 or rescomp2, which use Skylake CPUs. Use module avail Python to list the available versions, choose a suitable version of Python, e.g. Python/3.7.4-GCCcore-8.3.0, and then run module load Python/3.7.4-GCCcore-8.3.0 to load it.
- We will assume you wish to create a Python virtual environment called projectA. First, find a suitable place on disk to store all your Python virtual environments, e.g. /well/<group>/users/<username>/python/. Create this directory before continuing and then cd into it.
- Once inside your python directory, run
python -m venv projectA-skylake
This will create a new python virtual environment in the projectA-skylake sub-folder. Once this is created, you must activate it before using it by running
source projectA-skylake/bin/activate
Notice that your shell prompt changes to reflect the virtual environment. Once it is activated, you can proceed to install software, e.g. by using pip search XYZ to search for packages and then pip install XYZ to install them. Repeat the process to install all the packages you need.
- Once you have installed all the packages you need in projectA-skylake, run pip freeze > requirements.txt. This will put a list of all your installed packages and their versions into the file requirements.txt. We will use this file to recreate this environment for Ivybridge.
- Run deactivate to deactivate your projectA-skylake environment and then ssh to rescomp3. Note that you can only reach rescomp3 by first logging into rescomp1 or rescomp2 and then typing ssh rescomp3.
- Once logged into rescomp3, load the same Python module you previously loaded on rescomp1-2, e.g. module load Python/3.7.4-GCCcore-8.3.0. Note that our system automatically takes care to load the Ivybridge build of this software now that you are on rescomp3.
- cd to your python folder (i.e. the parent folder in which projectA-skylake is located) and now create a second virtual environment by running
python -m venv projectA-ivybridge
Once this is created, activate it by running
source projectA-ivybridge/bin/activate
- With the projectA-ivybridge environment activated, you can install all the same packages that were previously installed into the Skylake environment by running pip install -r /path/to/requirements.txt, i.e. using the requirements.txt file you created earlier. Once pip has finished installing all the packages from requirements.txt, run deactivate to deactivate your current Python environment.
- You now have two identical Python virtual environments, one built for Skylake and the other built for Ivybridge.
Now that you have two identical environments, one for Ivybridge and one for Skylake, it only remains to choose the correct one to activate in your job submission scripts. To do that, you can copy or adapt the following sample submission script:
#!/bin/bash
# note that you must load whichever main Python module you used to create your virtual environments before activating the virtual environment
module load Python/3.7.4-GCCcore-8.3.0
# Activate the ivybridge or skylake version of your python virtual environment
# NB The environment variable MODULE_CPU_TYPE will evaluate to ivybridge or skylake as appropriate
source /path/to/projectA-${MODULE_CPU_TYPE}/bin/activate
# continue to use your python venv as normal
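As a usage sketch (the script name is illustrative), you could save the above as e.g. submit_projectA.sh and submit it to the short partition with:
sbatch -p short submit_projectA.sh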
Conda, Anaconda and Miniconda
For all software requirements, we strongly recommend that you make use of our software modules and the methods for handling Python virtual environments described above. While conda is often useful on personal machines, it is usually a poorer fit for a cluster environment.
Where use of conda is preferred, we recommend that you make use of the supplied Anaconda modules rather than installing conda yourself. You can see which versions are available by running
module avail Anaconda
As with python virtual environments, special handling is required for conda as further described below.
Conda Configuration
Using conda without further configuration is likely to run immediately into problems, which can be solved with the configuration below.
By default, conda will store your environments and downloaded packages in your home directory under ~/.conda; this will quickly cause your home directory to run out of space, so conda needs to be configured to store these files in your group folder instead.
As with python virtual environments (described above), there will also be issues of CPU compatibility (ivybridge vs skylake).
The configuration described below is intended to address both of the above problems.
- Login to cluster1 and create a dedicated conda folder in your group home folder with subdirectories for packages and environments. NB replace group and username with your own group and username
mkdir /well/group/users/username/conda
- Create the file ~/.condarc containing the following configuration.
NB1 indented lines are indented two spaces
NB2 Replace group and username with your own group and username
channels:
- conda-forge
- bioconda
- defaults
pkgs_dirs:
- /well/group/users/username/conda/${MODULE_CPU_TYPE}/pkgs
envs_dirs:
- /well/group/users/username/conda/${MODULE_CPU_TYPE}/envs
- Before activating and using a conda environment, you must initialise conda itself. You can do this either in the shell or in your job submission scripts, as follows (using the Anaconda3/2022.05 module as an example):
module load Anaconda3/2022.05
eval "$(conda shell.bash hook)" - In order to allow your conda environments to work well on both Ivybridge and Skylake CPUs, you must create identical conda environments using both cluster1 or cluster2 (for skylake) and rescomp3 (for ivybridge). The condarc file above allows you to re-use the same environment name, but it is essential that you create one environment for skylake and one for ivybridge. The basic workflow would be:
Starting on cluster1 or cluster2...
module load Anaconda3/2022.05
eval "$(conda shell.bash hook)"
conda create -n myproject python=3
conda install....
Then repeat the above on rescomp3.
- When submitting job scripts that rely on conda, remember to load and activate conda within your script using:
module load Anaconda3/2022.05
eval "$(conda shell.bash hook)"
conda activate..
python ...
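Putting these pieces together, a minimal conda job script might look like the sketch below (the environment name myproject matches the example above, and analysis.py is an illustrative placeholder for your own program):
#!/bin/bash
# Load the Anaconda module and initialise conda for this shell
module load Anaconda3/2022.05
eval "$(conda shell.bash hook)"
# Activate the environment created on both Skylake and Ivybridge above
conda activate myproject
# Run your own Python program (illustrative name)
python analysis.py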