Introduction To AlphaFold RCS 2022

This document provides an introduction to running AlphaFold for 3D protein structure prediction on the Grace supercomputer. It discusses the resources and limitations of AlphaFold, the database files available on Grace, and different methods for running AlphaFold including via Google Colab, ChimeraX, or directly on Grace GPU nodes. It also covers visualizing results and example job scripts for Grace.


Introduction to AlphaFold for 3D Protein

Structure Prediction on Grace


Texas A&M Research Computing Symposium
May 25, 2022

High Performance Research Computing | hprc.tamu.edu 1


AlphaFold for 3D Protein Structure
Prediction on Grace
● Resources and Limitations
● Database Files
● Running AlphaFold
○ Google Colab
○ ChimeraX + Google Colab
○ Grace GPU or non-GPU nodes
● Visualization of Results
○ job resource usage
○ view predictions in ChimeraX
○ plotting pLDDT values
● Alternative Workflows
https://hprc.tamu.edu/wiki/SW:AlphaFold

High Performance Research Computing | hprc.tamu.edu 2


Resource Limitations
● AlphaFold
○ currently AlphaFold can only utilize one GPU
○ about 90% of processing is done on CPU when using
DeepMind's workflow
● AlphaFold on HPRC Grace
○ sometimes GPU not detected on certain nodes
● AlphaFold in Google Colab (web browser or ChimeraX app)
○ no guarantee of available resources in Colab
○ runs as a Jupyter notebook on Google Colab cloud servers
■ 12GB RAM max
■ a notebook can run for up to 12 hours per day
● 24 hours per day with Colab Pro ($9.99/month)
■ not suitable for large predictions

High Performance Research Computing | hprc.tamu.edu 3


AlphaFold Databases on Grace
/scratch/data/bio/alphafold/2.2.0
├── bfd
│   ├── [1.4T] bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata
│   ├── [1.7G] bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex
│   ├── [ 16G] bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata
│   ├── [1.6G] bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex
│   ├── [304G] bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata
│   └── [124M] bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex
├── mgnify
│   └── [ 64G] mgy_clusters_2018_12.fa
├── mgnify_2019_05
│   └── [271G] mgy_proteins.fa
├── params
│   ├── [ 18K] LICENSE
│   ├── [356M] params_model_1_multimer_v2.npz
│   ├── [356M] params_model_1.npz
│   ├── [356M] params_model_1_ptm.npz
│   ├── [356M] params_model_2_multimer_v2.npz
│   ├── [356M] params_model_2.npz
│   ├── [356M] params_model_2_ptm.npz
│   ├── [356M] params_model_3_multimer_v2.npz
│   ├── [354M] params_model_3.npz
│   ├── [355M] params_model_3_ptm.npz
│   ├── [356M] params_model_4_multimer_v2.npz
│   ├── [354M] params_model_4.npz
│   ├── [355M] params_model_4_ptm.npz
│   ├── [356M] params_model_5_multimer_v2.npz
│   ├── [354M] params_model_5.npz
│   └── [355M] params_model_5_ptm.npz
├── pdb70
│   ├── [ 410] md5sum
│   ├── [ 53G] pdb70_a3m.ffdata
│   ├── [2.0M] pdb70_a3m.ffindex
│   ├── [6.6M] pdb70_clu.tsv
│   ├── [ 21M] pdb70_cs219.ffdata
│   ├── [1.5M] pdb70_cs219.ffindex
│   ├── [3.2G] pdb70_hhm.ffdata
│   ├── [1.8M] pdb70_hhm.ffindex
│   └── [ 19M] pdb_filter.dat
├── pdb_mmcif
│   ├── [9.4M] mmcif_files
│   └── [141K] obsolete.dat
├── pdb_seqres
│   └── [218M] pdb_seqres.txt
├── small_bfd
│   └── [ 17G] bfd-first_non_consensus_sequences.fasta
├── uniclust30
│   ├── [ 87G] uniclust30_2018_08
│   └── [206G] uniclust30_2021_03
├── uniprot
│   └── [104G] uniprot.fasta
└── uniref90
    └── [ 62G] uniref90.fasta

total size: 2.6TB
number of files: 183,000+
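To confirm that the shared databases are visible before submitting a job, a minimal sketch (the path is the one listed above):

ls /scratch/data/bio/alphafold/2.2.0             # top-level database directories
ls -lh /scratch/data/bio/alphafold/2.2.0/params  # model parameter (.npz) files and their sizes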

High Performance Research Computing | hprc.tamu.edu 4


Resources for Running AlphaFold

● Run as a Jupyter Notebook on Google Colab in a web browser
● Run as a Jupyter Notebook on Google Colab in ChimeraX
○ ChimeraX Interactive App on HPRC Grace Portal
■ https://portal-grace.hprc.tamu.edu
● Run as a Slurm job script on Grace

https://hprc.tamu.edu/wiki/SW:AlphaFold

High Performance Research Computing | hprc.tamu.edu 5


ColabFold AlphaFold2 Jupyter Notebook
Enter an amino acid sequence
NLYIQWLKDGGPSSGRPPPS

High Performance Research Computing | hprc.tamu.edu 6


ChimeraX
● Can be used to visualize protein structures
● Can be launched using the Grace portal
○ portal-grace.hprc.tamu.edu
● Can be used to run AlphaFold using the daily build version (2022.02.22+)
○ uses Google Colab with limited resources

High Performance Research Computing | hprc.tamu.edu 7


● Launch a ChimeraX job on the Grace portal (portal-grace.hprc.tamu.edu)
○ select Node Type: ANY
○ AlphaFold runs on Google Colab GPUs, so a non-GPU node can be used for running ChimeraX
● ChimeraX will be used later for the following
○ run AlphaFold in Google Colab
○ visualize results from an AlphaFold job on Grace

High Performance Research Computing | hprc.tamu.edu 8


Running AlphaFold on Grace
● Can be run as a job script requesting one GPU
● shared databases are available: 2.6TB total size
benchmark    monomer_ptm (98 aa)    multimer (98 & 73 aa)
A100         1 hour 26 minutes      4 hours 49 minutes
RTX 6000     2 hours 39 minutes     4 hours 44 minutes
T4           2 hours 35 minutes     4 hours 45 minutes
CPU only     2 hours 50 minutes     6 hours 11 minutes

● Can be run in ChimeraX from the Grace portal


○ using the ChimeraX AlphaFold option
○ all processing done on Google cloud servers
■ monomer_ptm (98 aa) on GPU = 1 hour 14 minutes
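The benchmark above compares the GPU types available on Grace; the example job scripts later in this presentation request an A100 with the first line below. The gres names for the other GPU types are assumptions and should be checked against the HPRC wiki before use.

#SBATCH --gres=gpu:a100:1    # one A100, as used in the example scripts
##SBATCH --gres=gpu:rtx:1    # assumed gres name for an RTX 6000 node
##SBATCH --gres=gpu:t4:1     # assumed gres name for a T4 node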
High Performance Research Computing | hprc.tamu.edu 9
Finding AlphaFold template job scripts
using GCATemplates on Grace
● Genomic Computational Analysis
Templates have example input data so
you can run the script for demo purposes
mkdir $SCRATCH/af2demo
cd $SCRATCH/af2demo
gcatemplates
● Type s for search then enter alphafold to
search for the alphafold 2.2.0 template
script and select the reduced_dbs script
● Review the script and submit it; the job takes about 30 minutes to complete
sbatch run_alphafold_2.2.0_reduced_dbs_monomer_ptm_grace.sh
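After submitting, the demo job can be watched with standard Slurm commands; a minimal sketch (the stdout file name assumes the job name alphafold and the stdout.%x.%j pattern used in the example scripts later in this presentation):

squeue -u $USER                     # job state and the node it is running on
tail -f stdout.alphafold.<jobID>    # follow the job output once it starts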

High Performance Research Computing | hprc.tamu.edu 10


Monitoring AlphaFold GPUs Found
● Check to make sure that AlphaFold can detect GPUs
○ wait a few minutes after the job starts
○ search for the text "No GPU/TPU found, falling back to CPU."
■ grep CPU stderr*
● If the job did not detect GPUs
○ find the compute node name in the NodeList column
■ sacct -j jobID
○ cancel the job
■ scancel jobID
○ add a line in your job script to ignore the compute node
■ #SBATCH --exclude=g016
○ submit your updated job script
○ send an email to the HPRC helpdesk with the node name
[email protected]
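Put together, the check-and-resubmit steps look like this (a minimal sketch; every command comes from the bullets above, and g016 is just the example node name):

grep CPU stderr*       # look for "No GPU/TPU found, falling back to CPU."
sacct -j <jobID>       # find the compute node name in the NodeList column
scancel <jobID>        # cancel the job if it fell back to CPU
# then add to the job script before resubmitting:
# #SBATCH --exclude=g016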

High Performance Research Computing | hprc.tamu.edu 11


AlphaFold with
ChimeraX + Google Colab

High Performance Research Computing | hprc.tamu.edu 12


Maximize ChimeraX Window

Right click and select Maximize if the ChimeraX window is off the screen

High Performance Research Computing | hprc.tamu.edu 13


AlphaFold with ChimeraX
● Launch ChimeraX using the HPRC Grace portal
● Select the AlphaFold Structure Prediction option

High Performance Research Computing | hprc.tamu.edu 14


AlphaFold with ChimeraX
● Enter an amino acid sequence
NLYIQWLKDGGPSSGRPPPS
○ or copy it to the clipboard first, then paste it into the Sequence field
● Click Predict
● A Google Colab page will
start and prompt you for
your Google login
● Login to your Google
account to begin processing
● Prediction completes in
about 1 hour
● Not ideal for large prediction
jobs
High Performance Research Computing | hprc.tamu.edu 15
AlphaFold Grace Job Scripts

High Performance Research Computing | hprc.tamu.edu 16


Example AlphaFold Job Script
● multimer
○ the pdb_seqres and uniprot databases are required for the multimer preset
● AlphaFold can only use one GPU, so reserve half the CPU and memory resources so another job can use the other GPU
○ Grace compute nodes have 360GB of available memory and 48 cores

#!/bin/bash
#SBATCH --job-name=alphafold         # job name
#SBATCH --time=2-00:00:00            # max job run time dd-hh:mm:ss
#SBATCH --ntasks-per-node=1          # tasks (commands) per compute node
#SBATCH --cpus-per-task=24           # CPUs (threads) per command
#SBATCH --mem=180G                   # total memory per node
#SBATCH --gres=gpu:a100:1            # request 1 A100 GPU
#SBATCH --output=stdout.%x.%j        # save stdout to file
#SBATCH --error=stderr.%x.%j         # save stderr to file

module purge

export SINGULARITYENV_TF_FORCE_UNIFIED_MEMORY=1
export SINGULARITYENV_XLA_PYTHON_CLIENT_MEM_FRACTION=4.0

DOWNLOAD_DIR=/scratch/data/bio/alphafold/2.2.0

# run gpustats in the background (&) to monitor gpu usage in order to create a graph later
gpustats &

singularity exec --nv /sw/hprc/sw/bio/containers/alphafold/alphafold_2.2.0.sif \
  python /app/alphafold/run_alphafold.py \
  --use_gpu_relax \
  --data_dir=$DOWNLOAD_DIR \
  --uniref90_database_path=$DOWNLOAD_DIR/uniref90/uniref90.fasta \
  --mgnify_database_path=$DOWNLOAD_DIR/mgnify/mgy_clusters_2018_12.fa \
  --uniclust30_database_path=$DOWNLOAD_DIR/uniclust30/uniclust30_2021_03/UniRef30_2021_03 \
  --bfd_database_path=$DOWNLOAD_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
  --model_preset=multimer \
  --pdb_seqres_database_path=$DOWNLOAD_DIR/pdb_seqres/pdb_seqres.txt \
  --uniprot_database_path=$DOWNLOAD_DIR/uniprot/uniprot.fasta \
  --template_mmcif_dir=$DOWNLOAD_DIR/pdb_mmcif/mmcif_files \
  --obsolete_pdbs_path=$DOWNLOAD_DIR/pdb_mmcif/obsolete.dat \
  --max_template_date=2022-1-1 \
  --db_preset=full_dbs \
  --output_dir=out_alphafold \
  --fasta_paths=$DOWNLOAD_DIR/example_data/T1083_T1084_multimer.fasta

# run gpustats to create a graph of gpu usage for this job
gpustats

High Performance Research Computing | hprc.tamu.edu 17


Example AlphaFold Job Script
● monomer
○ the pdb70 database is required for the monomer preset
● monomer_ptm
○ will produce pTM scores that can be graphed using AlphaPickle
● AlphaFold can only use one GPU, so reserve half the CPU and memory resources so another job can use the other GPU

#!/bin/bash
#SBATCH --job-name=alphafold         # job name
#SBATCH --time=2-00:00:00            # max job run time dd-hh:mm:ss
#SBATCH --ntasks-per-node=1          # tasks (commands) per compute node
#SBATCH --cpus-per-task=24           # CPUs (threads) per command
#SBATCH --mem=180G                   # total memory per node
#SBATCH --gres=gpu:a100:1            # request 1 A100 GPU
#SBATCH --output=stdout.%x.%j        # save stdout to file
#SBATCH --error=stderr.%x.%j         # save stderr to file

module purge

export SINGULARITYENV_TF_FORCE_UNIFIED_MEMORY=1
export SINGULARITYENV_XLA_PYTHON_CLIENT_MEM_FRACTION=4.0

DOWNLOAD_DIR=/scratch/data/bio/alphafold/2.2.0

# run gpustats in the background (&) to monitor gpu usage in order to create a graph later
gpustats &

singularity exec --nv /sw/hprc/sw/bio/containers/alphafold/alphafold_2.2.0.sif \
  python /app/alphafold/run_alphafold.py \
  --use_gpu_relax \
  --data_dir=$DOWNLOAD_DIR \
  --uniref90_database_path=$DOWNLOAD_DIR/uniref90/uniref90.fasta \
  --mgnify_database_path=$DOWNLOAD_DIR/mgnify/mgy_clusters_2018_12.fa \
  --uniclust30_database_path=$DOWNLOAD_DIR/uniclust30/uniclust30_2021_03/UniRef30_2021_03 \
  --bfd_database_path=$DOWNLOAD_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
  --model_preset=monomer \
  --pdb70_database_path=$DOWNLOAD_DIR/pdb70/pdb70 \
  --template_mmcif_dir=$DOWNLOAD_DIR/pdb_mmcif/mmcif_files \
  --obsolete_pdbs_path=$DOWNLOAD_DIR/pdb_mmcif/obsolete.dat \
  --max_template_date=2022-1-1 \
  --db_preset=full_dbs \
  --output_dir=out_alphafold \
  --fasta_paths=$DOWNLOAD_DIR/example_data/T1083.fasta

# run gpustats to create a graph of gpu usage for this job
gpustats

High Performance Research Computing | hprc.tamu.edu 18


Example AlphaFold Job Script
● monomer + reduced_dbs
○ the small_bfd database is required for the reduced_dbs preset
● the small_bfd database is a subset of BFD, generated by taking the first non-consensus sequence from every cluster in BFD
● AlphaFold can only use one GPU, so reserve half the CPU and memory resources so another job can use the other GPU

#!/bin/bash
#SBATCH --job-name=alphafold         # job name
#SBATCH --time=2-00:00:00            # max job run time dd-hh:mm:ss
#SBATCH --ntasks-per-node=1          # tasks (commands) per compute node
#SBATCH --cpus-per-task=24           # CPUs (threads) per command
#SBATCH --mem=180G                   # total memory per node
#SBATCH --gres=gpu:a100:1            # request 1 A100 GPU
#SBATCH --output=stdout.%x.%j        # save stdout to file
#SBATCH --error=stderr.%x.%j         # save stderr to file

module purge

export SINGULARITYENV_TF_FORCE_UNIFIED_MEMORY=1
export SINGULARITYENV_XLA_PYTHON_CLIENT_MEM_FRACTION=4.0

DOWNLOAD_DIR=/scratch/data/bio/alphafold/2.2.0

# run gpustats in the background (&) to monitor gpu usage in order to create a graph later
gpustats &

singularity exec --nv /sw/hprc/sw/bio/containers/alphafold/alphafold_2.2.0.sif \
  python /app/alphafold/run_alphafold.py \
  --use_gpu_relax \
  --data_dir=$DOWNLOAD_DIR \
  --uniref90_database_path=$DOWNLOAD_DIR/uniref90/uniref90.fasta \
  --mgnify_database_path=$DOWNLOAD_DIR/mgnify/mgy_clusters_2018_12.fa \
  --small_bfd_database_path=$DOWNLOAD_DIR/small_bfd/bfd-first_non_consensus_sequences.fasta \
  --model_preset=monomer \
  --pdb70_database_path=$DOWNLOAD_DIR/pdb70/pdb70 \
  --template_mmcif_dir=$DOWNLOAD_DIR/pdb_mmcif/mmcif_files \
  --obsolete_pdbs_path=$DOWNLOAD_DIR/pdb_mmcif/obsolete.dat \
  --max_template_date=2022-1-1 \
  --db_preset=reduced_dbs \
  --output_dir=out_alphafold \
  --fasta_paths=$DOWNLOAD_DIR/example_data/T1083.fasta

# run gpustats to create a graph of gpu usage for this job
gpustats

High Performance Research Computing | hprc.tamu.edu 19


Unified Memory
#!/bin/bash
#SBATCH --job-name=alphafold # job name
#SBATCH --time=2-00:00:00 # max job run time dd-hh:mm:ss
#SBATCH --ntasks-per-node=1 # tasks (commands) per compute node
#SBATCH --cpus-per-task=24 # CPUs (threads) per command
#SBATCH --mem=180G # total memory per node
#SBATCH --gres=gpu:a100:1 # request 1 A100 GPU
#SBATCH --output=stdout.%x.%j # save stdout to file
#SBATCH --error=stderr.%x.%j # save stderr to file

export SINGULARITYENV_TF_FORCE_UNIFIED_MEMORY=1
export SINGULARITYENV_XLA_PYTHON_CLIENT_MEM_FRACTION=4.0

● unified memory can be used to request more than just the total GPU
memory for the JAX step in AlphaFold
○ A100 GPU has 40GB memory
○ GPU total memory (40 GB) * XLA_PYTHON_CLIENT_MEM_FRACTION (4.0) = 160 GB of unified memory
○ XLA_PYTHON_CLIENT_MEM_FRACTION default = 0.9
● this example script has 160 GB of unified memory
○ 40 GB from A100 GPU + 120 GB DDR from motherboard
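The same two exports can be tuned to request a different amount of unified memory; a minimal sketch (the fraction value 2.0 is just an illustrative choice, not taken from the slides):

export SINGULARITYENV_TF_FORCE_UNIFIED_MEMORY=1
export SINGULARITYENV_XLA_PYTHON_CLIENT_MEM_FRACTION=2.0   # 40 GB GPU memory * 2.0 = 80 GB unified memory on an A100 node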

High Performance Research Computing | hprc.tamu.edu 20


AlphaFold Results
Visualization

High Performance Research Computing | hprc.tamu.edu 21


Review GPU usage for a Job
The gpustats command monitors GPU resource usage and creates a graph

#!/bin/bash
#SBATCH --job-name=my_gpu_job
#SBATCH --time=1-00:00:00
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=24
#SBATCH --mem=180G
#SBATCH --gres=gpu:a100:1
#SBATCH --output=stdout.%x.%j
#SBATCH --error=stderr.%x.%j

module purge

# run gpustats in the background (&) to monitor gpu usage


gpustats &

my_alphafold_command

# run gpustats to create a graph of gpu usage for this job


gpustats

When the job is complete, log in with the ssh -X option and view the graph of GPU usage created by gpustats with the eog command: eog stats_gpu.3411850.png
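For example (a minimal sketch; the login hostname is an assumption, and the file name matches the job ID used in the seff example on the next slide):

ssh -X [email protected]     # -X enables X11 forwarding so eog can open a window
eog stats_gpu.3411850.png           # view the GPU usage graph created by gpustats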

High Performance Research Computing | hprc.tamu.edu 22


Review CPU usage for a Job
The seff command displays CPU resource usage

seff 3411850

Job ID: 3411850


Cluster: grace
User/Group: netid/netid
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 24
CPU Utilized: 06:24:58
CPU Efficiency: 2.70% of 9-21:48:24 core-walltime
Job Wall-clock time: 09:54:31
Memory Utilized: 54.59 GB
Memory Efficiency: 30.33% of 180.00 GB

usage stats are not accurate until the job is complete

High Performance Research Computing | hprc.tamu.edu 23


Visualize Results with ChimeraX

Click the minimize button to return to ChimeraX, or click Reconnect and then minimize

High Performance Research Computing | hprc.tamu.edu 24


Visualize AlphaFold Google Colab Results
with ChimeraX

High Performance Research Computing | hprc.tamu.edu 25


View PDB Structure if Available

Type open followed by the PDB ID 1L2Y

High Performance Research Computing | hprc.tamu.edu 26


Visualize AlphaFold Grace Results
with ChimeraX

High Performance Research Computing | hprc.tamu.edu 27


AlphaFold Confidence Metrics

High Performance Research Computing | hprc.tamu.edu 28


Visualize AlphaFold .pkl Results

pLDDT confidence key:
> 90 = Very high
70 - 90 = Confident
50 - 70 = Low
< 50 = Very low

eog output_dir/protein_dir/ranked_0_pLDDT.png

you may get different results compared to the image above when using reduced_dbs

High Performance Research Computing | hprc.tamu.edu 29


Visualize AlphaFold PAE Results (monomer_ptm)
● A low Predicted Aligned Error (PAE) value indicates higher confidence in the accuracy of the relative positions
● Must use monomer_ptm or multimer as the model_preset to create the PAE image
● The colour at position (x, y) indicates AlphaFold's expected position error at residue x, when the predicted and true structures are aligned on residue y
(PAE plot axes: Scored residue vs. Aligned residue)

eog output_dir/protein_dir/ranked_0_PAE.png
https://alphafold.ebi.ac.uk

High Performance Research Computing | hprc.tamu.edu 30


Evaluating Models

(images of the five ranked models: ranked_0, ranked_1, ranked_2, ranked_3, ranked_4)

See which model has the top rank based on pLDDT score:

cat out_IL2Y_reduced_dbs_monomer_ptm/IL2Y/ranking_debug.json

"plddts": {
    "model_1_ptm_pred_0": 94.16585478746399,
    "model_2_ptm_pred_0": 94.64120852328334,
    "model_3_ptm_pred_0": 89.94980057627299,
    "model_4_ptm_pred_0": 77.53515668415058,
    "model_5_ptm_pred_0": 88.40610380463586
},
"order": [
    "model_2_ptm_pred_0",    <-- ranked_0
    "model_1_ptm_pred_0",
    "model_3_ptm_pred_0",
    "model_5_ptm_pred_0",
    "model_4_ptm_pred_0"
]
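To print just the top-ranked model name from the command line, a minimal sketch (assumes python3 is available; the path matches the cat command above):

python3 -c 'import json; print(json.load(open("out_IL2Y_reduced_dbs_monomer_ptm/IL2Y/ranking_debug.json"))["order"][0])'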

High Performance Research Computing | hprc.tamu.edu 31


AlphaFold Workflow
Alternatives

High Performance Research Computing | hprc.tamu.edu 32


ParallelFold
● ParallelFold (ParaFold) breaks the AlphaFold workflow into two steps
○ processing of the three CPU steps in parallel
○ processing of the GPU step
● The parallel portion of the CPU steps is not implemented yet, resulting in similar or longer runtimes than the DeepMind approach
○ the first three CPU steps (jackhmmer, jackhmmer and HHblits) are meant to run as three separate processes in parallel, but this parallelization is not implemented yet
○ when the CPU processing is parallelized, it will require 21 cores
■ 8 cores for each of the two jackhmmer steps
■ 5 cores for the HHblits step
● These steps may be implemented in parallel in AlphaFold soon
https://github.com/Zuricho/ParallelFold
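If the CPU and GPU stages are eventually run as separate Slurm jobs, the GPU job can be chained to the CPU job with a dependency; a minimal sketch (cpu_stage.sh and gpu_stage.sh are hypothetical script names, not part of ParaFold):

cpu_jobid=$(sbatch --parsable cpu_stage.sh)             # MSA/template search on CPU cores
sbatch --dependency=afterok:$cpu_jobid gpu_stage.sh     # model inference on a GPU node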

High Performance Research Computing | hprc.tamu.edu 33


Databases and References

High Performance Research Computing | hprc.tamu.edu 34


DeepMind and EMBL’s
European Bioinformatics
Institute (EMBL-EBI) have
partnered to create
AlphaFold DB to make
these predictions freely
available to the scientific
community.

High Performance Research Computing | hprc.tamu.edu 35


References

Arnold, M. J. (2021) AlphaPickle doi.org/10.5281/zenodo.5708709

High Performance Research Computing | hprc.tamu.edu 36
