Philly To AML Migration: Prerequisites
If you are looking for a collection of general information, go to the Cognition AML Resources List.
The example project for this tutorial can be found here.
Author: Baoguang Shi  Date: 10 Jun 2019
Overview
This page provides a tutorial for setting up distributed training jobs on AzureML using PyTorch 1.1 (or higher). Although AzureML provides comprehensive docs and tutorials, the information
about distributed training is somewhat scattered and multiple solutions exist. This tutorial aims to provide a solution that is flexible enough for heavy, large-scale training jobs. To this end, we have
the following considerations:
• We will be using custom Docker images that we build ourselves instead of AzureML's built-in Estimators. Although the built-in PyTorch Estimator also allows installing extra Conda
packages, it might not be a good solution for preparing complex training environments that, for example, require specific CUDA/NCCL versions with certain features or bug fixes.
• We will make inter-node communication efficient by leveraging InfiniBand and Horovod. InfiniBand allows much faster communication than Ethernet (note: InfiniBand is not yet enabled
as of 5/30/2019 but will be soon, according to the Philly team); Horovod is a well-optimized distributed backend that implements efficient synchronization.
• We will handle the storage and reading of large datasets by using Azure Blob Storage properly.
Prerequisites
Before continuing, the following requirements need to be met:
• Python 3.6 or later on dev machine
• AzureML SDK for Python, installed on dev machine, for submitting jobs
• An Azure subscription and an AzureML workspace with GPU compute. Without GPU compute, you can still run AzureML jobs on your local machine, but only non-distributed ones.
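To quickly verify that the AzureML SDK is available on your dev machine, you can print its version:
import azureml.core
print(azureml.core.VERSION)  # prints the installed AzureML SDK version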
Preparing the Docker Image
The training environment is defined by a custom Docker image, built from a Dockerfile. You are free to add other packages or choose other versions. But be aware:
• It is important to install the correct NCCL version, which is specified in the format "$(nccl-version)+$(cuda-version)". Simply running "apt install libnccl2" may end up installing an
incompatible NCCL version that causes hidden issues.
• Not all combinations of CUDA and Python versions have an official PyTorch install package. The supported combinations can be found on the PyTorch installation page.
Next, use pip to install Python packages, including PyTorch.
Again, you can install other Python packages based on your needs, but make sure to include the azureml-sdk package. We have it both on our dev machine and inside the Docker image. Inside the
Docker image, azureml-sdk provides AzureML runtime features such as logging accuracy numbers to a dashboard on Azure.
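For example, inside the training script you can log metrics through the SDK; a minimal sketch (the metric name and value are illustrative):
from azureml.core.run import Run

run = Run.get_context()        # handle to the current AzureML run
run.log('val_accuracy', 0.93)  # appears as a chart on the run's dashboard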
We use Horovod to enable distributed training. Horovod depends on OpenMPI, so we need to install OpenMPI first. OpenMPI initiates communication between containers over ssh, which
needs to be configured in the image to allow this.
The Dockerfile is based on https://github.com/horovod/horovod/blob/master/Dockerfile.
Having finished writing the Dockerfile, use docker build in the containing directory to build the image.
If no errors occur, the created Docker image is now on your machine. AzureML needs to pull that image to run jobs, so we need to push it to a remote container registry. In this
tutorial, we will push the image to Azure Container Registry (ACR).
docker push <registry>/<repo>:<tag>
After the push, it may take another few minutes before the image can be accessed by AzureML.
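To let AzureML pull from a private registry, describe the registry in a ContainerRegistry object. A minimal sketch, where the registry address is a placeholder and the password is read from an environment variable you set yourself:
import os
from azureml.core.container_registry import ContainerRegistry

image_registry_details = ContainerRegistry()
image_registry_details.address = '<registry>.azurecr.io'      # placeholder registry address
image_registry_details.username = '<registry>'                # ACR admin user name
image_registry_details.password = os.environ['ACR_PASSWORD']  # placeholder; never hard-code this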
Here, the password can be found in the "Access keys" tab in your ACR service. You can use tools such as passpy to safely cache this password on your dev machine. Do not save
credentials in source code!
This image_registry_details object will later be passed as a parameter to the constructor of Estimator. To read more about custom Docker images, this AzureML doc provides more
details.
Submitting Jobs
Workspace and Compute Target
A training job runs on a compute target in a workspace. You can create the workspace and compute target on the Azure Portal. See this tutorial for more.
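Alternatively, a GPU compute target can be created from the SDK. A minimal sketch, assuming workspace has already been loaded (see below) and with an illustrative VM size, cluster name, and node count:
from azureml.core.compute import AmlCompute, ComputeTarget

config = AmlCompute.provisioning_configuration(
    vm_size='Standard_NC24s_v3',  # illustrative GPU SKU
    max_nodes=4)
compute_target = ComputeTarget.create(workspace, 'gpu-cluster', config)
compute_target.wait_for_completion(show_output=True)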
Then, to connect to the workspace and the compute target, run
from azureml.core import Workspace

workspace = Workspace.from_config()  # reads the workspace config.json on the dev machine
compute_target = workspace.compute_targets[compute_name]
Data storage
AzureML supports a variety of data storage services through its Datastore wrapper. We will use Azure Blob Storage for reading large datasets and writing experiment outputs. Other storage
types can be used similarly. To begin with, you need a Blob Storage account and a container; this page gives a nice tutorial. Also, it is strongly recommended that you pack your dataset
images into a compact binary file. Options include LMDB, the .tsv format, and many others. Since Blob Storage struggles with reading millions of small images, using a packed dataset greatly
reduces the data loading overhead. In our group, we use Protocol Buffers to serialize a sample of any structure (image, annotations, file name, etc.) into a binary string and store it in
LMDB.
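As a rough illustration of the packing step, a minimal sketch where Sample is a hypothetical protobuf message, dataset_index is your own listing of (image path, annotation) pairs, and the output path is a placeholder:
import lmdb
from sample_pb2 import Sample  # hypothetical message generated from your own .proto

env = lmdb.open('train.lmdb', map_size=1 << 40)  # generous map size for a large dataset
with env.begin(write=True) as txn:
    for idx, (image_path, annotation) in enumerate(dataset_index):
        sample = Sample()
        sample.file_name = image_path
        sample.annotation = annotation
        with open(image_path, 'rb') as f:
            sample.image = f.read()
        txn.put(str(idx).encode(), sample.SerializeToString())  # key -> serialized sample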
Suppose we have created the training and validation LMDBs and uploaded them to Blob Storage. Next, we need to register the datastore with the workspace so that it can be accessed
during training.
from azureml.core import Datastore

datastore = Datastore.register_azure_blob_container(
    workspace,
    datastore_name=datastore_id,
    container_name=container_name,
    account_name=storage_account,
    account_key=account_key)
The datastore_id is a name you choose yourself. The registration only needs to be done once per storage account; the account key will be safely managed by AzureML. Afterwards, use the
datastore_id to retrieve the datastore from the workspace.
datastore = workspace.datastores[datastore_id]
Call datastore.path(dir_path).as_mount() to get the mount point in the run. dir_path is the path of a Blob Storage directory. If your LMDB is stored as train_data/lmdb/train.lmdb in Blob Storage, you can pass
train_data/lmdb as dir_path.
as_mount() returns a data reference that expands to an environment variable, e.g. "$AZUREML_DATAREFERENCE_xxxx", referring to the mount point on the remote machine. Since we are still on the dev machine,
this reference needs to be passed as a parameter to the training script.
The training script will receive the path to the mounted directory, which contains train.lmdb. When the training script reads train.lmdb for the first time, the whole file will be
automatically downloaded to the local disk to minimize data loading overhead.
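Putting this together, the parameter passing might look roughly like this (the argument name --data-dir and the Blob Storage path are placeholders):
# On the dev machine, when building script_params for the Estimator:
script_params = {
    '--data-dir': datastore.path('train_data/lmdb').as_mount(),
}

# Inside the training script (entry_script):
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--data-dir', type=str)
args = parser.parse_args()
train_lmdb = os.path.join(args.data_dir, 'train.lmdb')  # mounted Blob Storage directory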
It is recommended that you have your Blob Storage account at the same location as your AzureML workspace. This greatly increases data transfer speed and reduces unnecessary network
cost.
Submit
Having set up the workspace, compute target, and data storage, we can now submit the job.
from azureml.core import Experiment
from azureml.train.estimator import Estimator

experiment = Experiment(workspace, name=experiment_name)
estimator = Estimator(
    source_directory=local_source_dir,
    compute_target=compute_target,
    entry_script=entry_script_name,
    script_params=script_params,
    node_count=node_count,
    distributed_training=mpi_configuration,
    custom_docker_image=image_name,
    environment_variables={
        'NCCL_DEBUG': 'INFO',
        'NCCL_IB_DISABLE': '0'
    },
    image_registry_details=image_registry_details,
    shm_size='16G',
    user_managed=True)
In the constructor of Estimator we need to specify all the necessary information about this job, specifically:
• source_directory: path to the directory on your dev machine. Source code under this directory will be uploaded.
• compute_target: the compute target to use (see above)
• entry_script: the name of the entry script, e.g. "main.py"
• node_count: the number of nodes to use. It depends on the VM type and the total number of GPUs you want. If each node has 4 GPUs and we need 16 GPUs
in total, then node_count should be 4.
• distributed_training: an object specifying how many processes to run per node. Typically, we run one process per GPU, so it should be set to the number of GPUs per node (docs); see the sketch after this list.
• custom_docker_image: name of the docker image, in "{repo}:{tag}" format.
• image_registry_details: a ContainerRegistry object describing the registry details. (docs)
• environment_variables: environment variables set in the run. We set NCCL_DEBUG=INFO so that NCCL outputs details such as which device is used for communication (InfiniBand
or Ethernet), the ring-reduce topology, etc., which are useful for debugging. NCCL_IB_DISABLE may be set to 1 by default; setting it to 0 enables InfiniBand when it is available.
• shm_size: the default shared memory size is often insufficient and will crash training. We set it to 16 GB here. (more on this)
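The mpi_configuration object used above can be created roughly as follows (4 processes per node is just an example; match it to the number of GPUs on your VM type):
from azureml.core.runconfig import MpiConfiguration

mpi_configuration = MpiConfiguration()
mpi_configuration.process_count_per_node = 4  # one process per GPU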
Finally, call experiment.submit to submit the job. Optionally, you can specify tags with a dictionary of key-value pairs, such as {'learning_rate': 0.1}, to organize the runs.
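For example (the tag is just illustrative):
run = experiment.submit(estimator, tags={'learning_rate': 0.1})
run.wait_for_completion(show_output=True)  # stream logs until the run finishes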
There is a caveat though. If you would like to clip gradients by L2 norm, the training loop needs a small change:
for batch in batches:
    optimizer.zero_grad()
    ...
    loss.backward()
    optimizer.synchronize()  # make sure Horovod's allreduce has finished before touching the gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step(synchronize=False)  # step without triggering another synchronization
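For context, the optimizer above is assumed to be wrapped by Horovod's DistributedOptimizer. A minimal sketch of the usual setup at the top of the training script, where build_model() and the learning rate are placeholders:
import horovod.torch as hvd
import torch

hvd.init()                               # one process per GPU
torch.cuda.set_device(hvd.local_rank())  # pin each process to its own GPU

model = build_model().cuda()             # build_model() is a placeholder
optimizer = torch.optim.SGD(model.parameters(), lr=0.1 * hvd.size())  # scale lr by world size

# wrap the optimizer so gradients are averaged across all processes
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# make sure every process starts from the same weights and optimizer state
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)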
Afterword
In this tutorial, we have seen how to prepare the training environment, how to store and load large datasets efficiently, and how to submit jobs. There are some other cool features of AzureML that
we haven't covered yet. For example, AzureML's pipeline feature allows one to stitch together different training stages, automating a complex multi-stage training protocol.
Also, keep in mind that this tutorial provides only one solution to distributed training on AzureML. This solution is intended for both research-level and product-level training that requires high
flexibility and reliability. Many steps in this tutorial can be replaced by alternatives, e.g. you can use PyTorch's DataParallel if you only need single-node training. In practice, you are
encouraged to develop the approach that fits your project best.