Python on the BMRC Cluster
Using Python on the BMRC Cluster
The principal method for using Python on the BMRC cluster is to load one of our pre-installed software modules. To see which versions of Python are available run (noting the capital letter):
module avail Python
Our pre-installed Python modules include a number of common packages. To see which packages are included run e.g.:
module show Python/3.11.3-GCCcore-12.3.0
and then check the Extensions list.
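If you want a quick way to check whether a particular package appears in that Extensions list, you could filter the output. Note that setuptools here is just an example package name, and that on Lmod-based systems module output typically goes to stderr, hence the 2>&1 redirection:
module show Python/3.11.3-GCCcore-12.3.0 2>&1 | grep -i setuptools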
In addition to Python itself, we have a number of auxiliary Python modules which can be loaded in order to access other widely used packages. For example, scipy, numpy and pandas are available through the SciPy-bundle-... modules. To see which versions of SciPy-bundle-... are available run:
module avail SciPy-bundle
To find a SciPy-bundle module that is compatible with your chosen Python module, check the Python version noted in the name and the toolchain. For example, SciPy-bundle/2023.07-gfbf-2023a is compatible with Python/3.11.3-GCCcore-12.3.0, because
- they use the same version of Python, and
- they have compatible toolchains because the GCCcore-12.3.0 toolchain is part of the gfbf-2023a toolchain (to verify this, use module show gfbf/2023a).
Another useful module is Python-bundle-PyPI/2023.06-GCCcore-12.3.0.
If in doubt, simply try to load both modules together - if they are incompatible, an error will be reported.
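For example, using the module versions discussed above, the following loads the pair together and confirms that the bundled packages import:
module load Python/3.11.3-GCCcore-12.3.0
module load SciPy-bundle/2023.07-gfbf-2023a
python -c "import numpy, pandas; print(numpy.__version__, pandas.__version__)"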
Jupyter Notebook for Remote Python Coding
Jupyter Notebook is a popular Python-based application that allows you to edit and run Python code remotely through a web browser. BMRC recommends using Open OnDemand to run Jupyter notebooks.
The need for multiple virtual environments
At any one time the BMRC cluster comprises computers with different generations of CPU architecture. Currently, we have Intel CPUs of the Skylake, Cascadelake, Icelake and Sapphire Rapids architectures, and AMD CPUs of the Rome architecture (and hopefully the Turin architecture in the near future). The different CPU architectures have slightly different instruction sets, and software builds tuned for one architecture might not run on another. By default, compilers tend to tune for the architecture on which they are running.
Within a particular vendor's products, it is generally the case that newer architectures are backwards-compatible with older ones, so software built for skylake will run on sapphire_rapids but the reverse might not be true. However, software built for skylake will not take advantage of the additional instructions available on sapphire_rapids, so it may not perform as well as it could on the newer nodes. Software builds are less likely to be compatible across CPUs from different vendors, whose instruction sets might share only a much older common ancestor (e.g. Intel and AMD), or use different instructions altogether (e.g. Arm). Currently, BMRC targets skylake when building centrally installed modules, because the output is compatible with the newer Intel CPUs. The AMD nodes and older Intel nodes require separate builds. If you want your jobs to run on any node in the cluster then you may need multiple architecture-specific builds and a way to select the correct one for the node at run time.
The different CPU types can be targeted using the '--constraint' argument to sbatch and srun. See the 'Submitting to specific host groups or nodes' section of the cluster documentation for more information. BMRC sets two environment variables to help identify the nature of the nodes. MODULE_CPU_TYPE reports the architecture-specific component of the default MODULEPATH set on the particular node. BMRC_GCC_ARCH_NATIVE reports the architecture that the latest version of GCC would use by default, or if passed '-march=native', on the particular node. The former is useful as a marker for significantly different architectures, and in most cases provides sufficient distinction for your needs. The latter gives you a marker for the most appropriate compiler tuning for the node, which might give a performance boost to your application.
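For example, a quick way to see these markers on a particular node type is to request it explicitly and print the variables (using icelake here purely as an illustration):
srun --pty --constraint=icelake bash
echo "MODULE_CPU_TYPE=${MODULE_CPU_TYPE} BMRC_GCC_ARCH_NATIVE=${BMRC_GCC_ARCH_NATIVE}"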
BMRC's next compute refresh is likely to add a significant number of AMD Turin CPUs. A separate module tree will be necessary to support that architecture, along with new values for MODULE_CPU_TYPE and BMRC_GCC_ARCH_NATIVE. BMRC endeavours to hide some of the complexity that arises as a result, but if you are creating and managing your own environments and you want to run on both the Intel and AMD nodes then you will need your own architecture-specific builds.
Creating and managing your own Python virtual environments
Here is an example of how to create and manage your own Python virtual environments for multiple architectures. In this example, a project projectA requires a Python package that you know gains a significant performance improvement from instruction set extensions introduced with icelake. As such, you want one venv specifically for icelake and another venv for older CPUs, for which the base level is skylake.
- Log in to cluster1, which has skylake CPUs: MODULE_CPU_TYPE=skylake and BMRC_GCC_ARCH_NATIVE=skylake-avx512. The base build will be done on this node.
- Find a suitable place on disk to store all your Python virtual environments, e.g. /well/<group>/users/<username>/python-venvs/. Create this directory before continuing and then cd into it.
- Use module avail Python to list and choose a suitable version of Python e.g. Python/3.11.3-GCCcore-12.3.0 and then module load Python/3.11.3-GCCcore-12.3.0 to load it.
- Whilst in your python directory, run:
python -m venv projectA-${MODULE_CPU_TYPE}
This will create a new python virtual environment in the projectA-skylake sub-folder. Once this is created, you must activate it before using it by running:
source projectA-skylake/bin/activate
Notice that your shell prompt changes to reflect the active virtual environment. Once it is activated, you can install packages with pip install XYZ. (Note that pip search is no longer supported by PyPI, so browse https://pypi.org to find package names.) Repeat this for all the packages you need.
- Once you have installed all the packages you need in projectA-skylake, run pip freeze > requirements.txt. This will put a list of all your installed packages and their versions into the file requirements.txt. We will use this file to recreate the environment for other architectures.
- Run deactivate to deactivate your projectA-skylake environment.
- Connect to an icelake node with srun --pty --constraint=icelake bash. On this node, check that MODULE_CPU_TYPE=skylake and BMRC_GCC_ARCH_NATIVE=icelake.
- Load the same Python module you previously loaded on cluster1, e.g. module load Python/3.11.3-GCCcore-12.3.0.
- cd to your python folder (i.e. the parent folder in which projectA-skylake is located) and now create a second virtual environment by running
python -m venv projectA-${BMRC_GCC_ARCH_NATIVE}
This will create a new python virtual environment in the projectA-icelake sub-folder. Once this is created, activate it by running
source projectA-${BMRC_GCC_ARCH_NATIVE}/bin/activate
- With the projectA-icelake environment activated, you can install the same packages that were previously installed into your skylake venv by running pip install -r requirements.txt, i.e. using the requirements.txt file you created earlier. Once pip has finished installing all the packages from requirements.txt, run deactivate to deactivate the environment.
- You now have two identical python virtual environments, one built for icelake and one for skylake.
Now that you have the two venvs, you should activate the correct one in your job submission scripts. That can be done as follows:
#!/bin/bash
# NB you must load the Python module from which your venvs were derived
module load Python/3.11.3-GCCcore-12.3.0
# Activate the architecture-appropriate version of your venv
venv="projectA"
venvs_dir="/path/to/python-venvs/"
if [ -f "${venvs_dir}/${venv}-${BMRC_GCC_ARCH_NATIVE}/bin/activate" ] ; then
source "${venvs_dir}/${venv}-${BMRC_GCC_ARCH_NATIVE}/bin/activate"
elif [ -f "${venvs_dir}/${venv}-${MODULE_CPU_TYPE}/bin/activate" ]; then
source "${venvs_dir}/${venv}-${MODULE_CPU_TYPE}/bin/activate"
else
echo "Failed to identify suitable venv on $(hostname -s): MODULE_CPU_TYPE=${MODULE_CPU_TYPE}; BMRC_GCC_ARCH_NATIVE=${BMRC_GCC_ARCH_NATIVE}" 1>&2
exit 1
fi
# Do work with your venv as normal
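If you want to steer a particular submission to a particular architecture, you can combine this script with the --constraint option described above, for example (jobscript.sh is a placeholder for your own script name):
sbatch --constraint=icelake jobscript.sh
sbatch --constraint=skylake jobscript.sh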
Conda: Miniforge3, Anaconda and Miniconda
For all software requirements, we strongly recommend that you make use of our software modules and the methods for handling Python virtual environments described above. While conda is often useful on personal machines, it is generally a poorer fit for a cluster environment.
Conda has traditionally been associated with the Anaconda and Miniconda software packages. However, the use of both Anaconda itself and the Anaconda-provided package repositories requires the purchase of a software license for business use, and the University does not have one. The use of Miniconda without a license is allowed provided you actively remove 'defaults' from the list of channels, as the inclusion of that channel is a Miniconda default. See Anaconda's FAQ, license terms and pricing for more information. Please note that (at the time of writing) whilst the FAQ includes research as a license-exempt use, the license itself makes clear that the exemption for educational entities only covers use for a curriculum-based course.
Where use of conda is preferred, we recommend that you make use of the supplied Miniforge3 modules rather than installing conda yourself. Miniforge3 does not include the 'defaults' channel in its default configuration. You can see which versions of Miniforge3 are available by running:
module avail Miniforge3
Please note that the license requirement covers the 'defaults' channel, whether you are using Miniforge3 or Miniconda.
As with python virtual environments, special handling is required for conda as further described below.
Conda Configuration
Using conda without any configuration is likely to run immediately into problems, which can be solved using the configuration below.
By default, conda will store your environments and downloaded packages in your home directory under ~/.conda; this will quickly cause your home directory to run out of space, so conda needs to be configured to store these files in your group folder.
As with python virtual environments (described above), there may also be issues with CPU compatibility.
The configuration described below is intended to address both of the above problems.
- Log in to cluster1 and create a dedicated conda folder in your group home folder, with subdirectories for packages and environments. NB replace group and username with your own group and username.
mkdir /well/group/users/username/conda
- Create the file ~/.condarc containing the following configuration.
NB1 indented lines are indented two spaces
NB2 Replace group and username with your own group and username
channels:
  - conda-forge
  - bioconda
pkgs_dirs:
  - /well/group/users/username/conda/${MODULE_CPU_TYPE}/pkgs
envs_dirs:
  - /well/group/users/username/conda/${MODULE_CPU_TYPE}/envs
- Before activating and using a conda environment, you must initialise conda itself. You can do this either in the shell or in your sbatch scripts, as follows (using the Miniforge3/24.1.2-0 module as an example):
module load Miniforge3/24.1.2-0
eval "$(conda shell.bash hook)" - If you wish to use the same environment on nodes with different settings of MODULE_CPU_TYPE, you will need to create identical conda environments using nodes with the relevant MODULE_CPU_TYPE settings, e.g. cluster1 for skylake and epyc001 for epyc. The condarc file above allows you to re-use the same environment name. The basic workflow would be:
Starting on cluster1:
module load Miniforge3/24.1.2-0
eval "$(conda shell.bash hook)"
conda create -n myproject python=3
conda install....
Then repeat the above in an interactive srun session on epyc001.
- When submitting job scripts that rely on conda, remember to load and activate conda within your script so that MODULE_CPU_TYPE is evaluated in the context of the execution node rather than the job submission node (a fuller sketch appears at the end of this section), using:
module load Miniforge3/24.1.2-0
eval "$(conda shell.bash hook)"
conda activate..
python ...
- If you have a situation similar to the venv example above, where MODULE_CPU_TYPE is insufficiently specific and you want to use BMRC_GCC_ARCH_NATIVE instead, simply replace the one with the other in the instructions above, though you may have to create environments for all relevant values of BMRC_GCC_ARCH_NATIVE. You could also use soft links in the /well/group/users/username/conda/ folder to map some values of MODULE_CPU_TYPE and/or BMRC_GCC_ARCH_NATIVE to a common base directory, giving a similar effect to the venv example, though in the long run it is probably easier just to populate each directory properly.
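Putting the pieces together, a minimal sketch of a conda-based job script might look like the following. The myproject environment is the one created in the workflow above, and my_analysis.py is simply a placeholder for your own code:
#!/bin/bash
# Load and initialise conda on the execution node so that MODULE_CPU_TYPE
# is evaluated there, not on the submission node
module load Miniforge3/24.1.2-0
eval "$(conda shell.bash hook)"
# Activate the environment created for this node's architecture
conda activate myproject
# Run your own code (placeholder script name)
python my_analysis.py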