Philly To AML Migration: Prerequisites
If you are looking for a collection of general information, go to the Cognition AML Resources List.
The example project for this tutorial can be found here.
Author: Baoguang Shi  Date: 10 Jun 2019
Overview
This page provides a tutorial for setting up distributed training jobs on AzureML using PyTorch 1.1 (or higher). Although AzureML provides comprehensive docs and tutorials, the information
about distributed training is somewhat scattered and multiple solutions exist. This tutorial aims to provide a solution that is flexible enough for heavy, large-scale training jobs. To this end, we have
the following considerations:
• We will be using custom Docker images that we build ourselves instead of AzureML's built-in Estimators. Although the built-in PyTorch Estimator also allows installing extra Conda
packages, it might not be a good solution for preparing complex training environments that, for example, require specific CUDA/NCCL versions with certain features or bug fixes.
• We will make inter-node communication efficient by leveraging InfiniBand and Horovod. InfiniBand allows much faster communication than Ethernet (note: InfiniBand is not yet enabled
as of 5/30/2019 but will be soon, according to the Philly team); Horovod is a well-optimized distributed backend that implements efficient synchronization.
• We will handle the storage and reading of large datasets by using Azure Blob Storage properly.
Prerequisites
Before continuing, the following requirements need to be met:
• Python 3.6 or later on dev machine
• AzureML SDK for Python, installed on dev machine, for submitting jobs
• An Azure subscription and an AzureML workspace with GPU compute. Without GPU compute, you can still run AzureML jobs on your local machine, but only non-distributed ones.
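To quickly verify that the AzureML SDK is available on your dev machine, you can print its version:
import azureml.core
print(azureml.core.VERSION)  # prints the installed AzureML SDK version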
Preparing the Docker Image
The training environment is defined by a custom Docker image, built from a Dockerfile. You are free to add other packages or choose other versions. But be aware:
• It is important to install the correct NCCL version, which is specified in the format "$(nccl-version)+$(cuda-version)". Simply running "apt install libnccl2" may end up installing an
incompatible NCCL version that causes hidden issues.
• Not all combinations of CUDA and Python versions have an official PyTorch install package. The supported combinations can be found on the PyTorch installation page.
Next, use pip to install Python packages, including PyTorch.
Again, you can install other Python packages based on your needs, but make sure to include the azureml-sdk package. We have it both on our dev machine and inside the Docker image. Inside the
Docker image, azureml-sdk provides AzureML runtime features such as logging accuracy numbers to a dashboard on Azure.
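For example, inside the training script you can log metrics through the SDK; a minimal sketch (the metric name and value are illustrative):
from azureml.core.run import Run

run = Run.get_context()        # handle to the current AzureML run
run.log('val_accuracy', 0.93)  # appears as a chart on the run's dashboard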
We use Horovod to enable distributed training. Horovod depends on OpenMPI, so we need to install OpenMPI first. OpenMPI initiates communication between containers over ssh, which
needs to be configured in the image to allow this.
The Dockerfile is based on https://github.com/horovod/horovod/blob/master/Dockerfile.
Having finished writing the Dockerfile, use docker build in the containing directory to build the image.
If no errors occur, the created Docker image is now on your machine. AzureML needs to pull that image to run jobs, so we need to push it to a remote container registry. In this
tutorial, we will push the image to Azure Container Registry (ACR).
docker push <registry>/<repo>:<tag>
After the push, it may take another few minutes before the image can be accessed by AzureML.
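To let AzureML pull from a private registry, describe the registry in a ContainerRegistry object. A minimal sketch, where the registry address is a placeholder and the password is read from an environment variable you set yourself:
import os
from azureml.core.container_registry import ContainerRegistry

image_registry_details = ContainerRegistry()
image_registry_details.address = '<registry>.azurecr.io'      # placeholder registry address
image_registry_details.username = '<registry>'                # ACR admin user name
image_registry_details.password = os.environ['ACR_PASSWORD']  # placeholder; never hard-code this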
Here, the password can be found in the "Access keys" tab in your ACR service. You can use tools such as passpy to safely cache this password on your dev machine. Do not save
credentials in source code!
This image_registry_details object will later be passed as a parameter to the constructor of Estimator. To read more about custom Docker images, this AzureML doc provides more
details.
Submitting Jobs
Workspace and Compute Target
A training job runs on a compute target in a workspace. You can create the workspace and compute target on the Azure Portal. See this tutorial for more.
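Alternatively, a GPU compute target can be created from the SDK. A minimal sketch, assuming workspace has already been loaded (see below) and with an illustrative VM size, cluster name, and node count:
from azureml.core.compute import AmlCompute, ComputeTarget

config = AmlCompute.provisioning_configuration(
    vm_size='Standard_NC24s_v3',  # illustrative GPU SKU
    max_nodes=4)
compute_target = ComputeTarget.create(workspace, 'gpu-cluster', config)
compute_target.wait_for_completion(show_output=True)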
Then, to connect to the workspace and the compute target, run
from azureml.core import Workspace

workspace = Workspace.from_config()  # reads the workspace config.json on the dev machine
compute_target = workspace.compute_targets[compute_name]
Data storage
AzureML supports a variety of data storage services through its Datastore wrapper. We will use Azure Blob Storage for reading large datasets and writing experiment outputs. Other storage
types can be used similarly. To begin with, you need a Blob Storage account and a container; this page gives a nice tutorial. Also, it is strongly recommended that you pack your dataset
images into a compact binary file. Options include LMDB, the .tsv format, and many others. Since Blob Storage struggles with reading millions of small images, using a packed dataset greatly
reduces the data loading overhead. In our group, we use Protocol Buffers to serialize a sample of any structure (image, annotations, file name, etc.) into a binary string and store it in
LMDB.
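As a rough illustration of the packing step, a minimal sketch where Sample is a hypothetical protobuf message, dataset_index is your own listing of (image path, annotation) pairs, and the output path is a placeholder:
import lmdb
from sample_pb2 import Sample  # hypothetical message generated from your own .proto

env = lmdb.open('train.lmdb', map_size=1 << 40)  # generous map size for a large dataset
with env.begin(write=True) as txn:
    for idx, (image_path, annotation) in enumerate(dataset_index):
        sample = Sample()
        sample.file_name = image_path
        sample.annotation = annotation
        with open(image_path, 'rb') as f:
            sample.image = f.read()
        txn.put(str(idx).encode(), sample.SerializeToString())  # key -> serialized sample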
Suppose we have created the training and validation LMDBs and uploaded them to Blob Storage. Next, we need to register the datastore with the workspace so that it can be accessed
during training.
from azureml.core import Datastore

datastore = Datastore.register_azure_blob_container(
    workspace,
    datastore_name=datastore_id,
    container_name=container_name,
    account_name=storage_account,
    account_key=account_key)
The datastore_id is a name you choose yourself. The registration only needs to be done once per storage account; the account key will be safely managed by AzureML. Afterwards, use the
datastore_id to retrieve the datastore from the workspace.
datastore = workspace.datastores[datastore_id]
Call datastore.path(dir_path).as_mount() to get the mount point in the run. dir_path is the path of a Blob Storage directory. If your LMDB is stored as train_data/lmdb/train.lmdb in Blob Storage, you can pass
train_data/lmdb as dir_path.
as_mount() returns a data reference that expands to an environment variable, e.g. "$AZUREML_DATAREFERENCE_xxxx", referring to the mount point on the remote machine. Since we are still on the dev machine,
this reference needs to be passed as a parameter to the training script.
The training script will receive the path to the mounted directory, which contains train.lmdb. When the training script reads train.lmdb for the first time, the whole file will be
automatically downloaded to the local disk to minimize data loading overhead.
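Putting this together, the parameter passing might look roughly like this (the argument name --data-dir and the Blob Storage path are placeholders):
# On the dev machine, when building script_params for the Estimator:
script_params = {
    '--data-dir': datastore.path('train_data/lmdb').as_mount(),
}

# Inside the training script (entry_script):
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--data-dir', type=str)
args = parser.parse_args()
train_lmdb = os.path.join(args.data_dir, 'train.lmdb')  # mounted Blob Storage directory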
It is recommended that you have your Blob Storage account at the same location as your AzureML workspace. This greatly increases data transfer speed and reduces unnecessary network
cost.
Submit
Having set up the workspace, compute target, and data storage, we can now submit the job.
from azureml.core import Experiment
from azureml.train.estimator import Estimator

experiment = Experiment(workspace, name=experiment_name)
estimator = Estimator(
    source_directory=local_source_dir,
    compute_target=compute_target,
    entry_script=entry_script_name,
    script_params=script_params,
    node_count=node_count,
    distributed_training=mpi_configuration,
    custom_docker_image=image_name,
    environment_variables={
        'NCCL_DEBUG': 'INFO',
        'NCCL_IB_DISABLE': '0'
    },
    image_registry_details=image_registry_details,
    shm_size='16G',
    user_managed=True)
In the constructor of Estimator we need to specify all the necessary information about this job, specifically:
• source_directory: path to the directory on your dev machine. Source code under this directory will be uploaded.
• compute_target: the compute target to use (see above)
• entry_script: the name of the entry script, e.g. "main.py"
• node_count: the number of nodes to use. It depends on the VM type and the total number of GPUs you want. If each node has 4 GPUs and we need 16 GPUs
in total, then node_count should be 4.
• distributed_training: an object specifying how many processes to run per node. Typically, we run one process per GPU, so it should be set to the number of GPUs per node (docs); see the sketch after this list.
• custom_docker_image: name of the docker image, in "{repo}:{tag}" format.
• image_registry_details: a ContainerRegistry object describing the registry details. (docs)
• environment_variables: environment variables set in the run. We set NCCL_DEBUG=INFO so that NCCL outputs details such as which device is used for communication (InfiniBand
or Ethernet), the ring-reduce topology, etc., which are useful for debugging. NCCL_IB_DISABLE may be set to 1 by default; setting it to 0 enables InfiniBand when it is available.
• shm_size: the default shared memory size is often insufficient and will crash training. We set it to 16 GB here. (more on this)
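The mpi_configuration object used above can be created roughly as follows (4 processes per node is just an example; match it to the number of GPUs on your VM type):
from azureml.core.runconfig import MpiConfiguration

mpi_configuration = MpiConfiguration()
mpi_configuration.process_count_per_node = 4  # one process per GPU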
Finally, call experiment.submit to submit the job. Optionally, you can specify tags with a dictionary of key-value pairs, such as {'learning_rate': 0.1}, to organize the runs.
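For example (the tag is just illustrative):
run = experiment.submit(estimator, tags={'learning_rate': 0.1})
run.wait_for_completion(show_output=True)  # stream logs until the run finishes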
There is a caveat though. If you would like to clip gradients by L2 norm, the training loop needs a small change:
for batch in batches:
    optimizer.zero_grad()
    ...
    loss.backward()
    optimizer.synchronize()  # make sure Horovod's allreduce has finished before touching the gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step(synchronize=False)  # step without triggering another synchronization
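For context, the optimizer above is assumed to be wrapped by Horovod's DistributedOptimizer. A minimal sketch of the usual setup at the top of the training script, where build_model() and the learning rate are placeholders:
import horovod.torch as hvd
import torch

hvd.init()                               # one process per GPU
torch.cuda.set_device(hvd.local_rank())  # pin each process to its own GPU

model = build_model().cuda()             # build_model() is a placeholder
optimizer = torch.optim.SGD(model.parameters(), lr=0.1 * hvd.size())  # scale lr by world size

# wrap the optimizer so gradients are averaged across all processes
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# make sure every process starts from the same weights and optimizer state
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)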
Afterword
In this tutorial, we have seen how to prepare the training environment, how to store and load large datasets efficiently, and how to submit jobs. There are some other cool features of AzureML that
we haven't covered yet. For example, AzureML's pipeline feature allows one to stitch together different training stages, automating a complex multi-stage training protocol.
Also, keep in mind that this tutorial provides only one solution to distributed training on AzureML. This solution is intended for both research-level and product-level training that requires high
flexibility and reliability. Many steps in this tutorial can be replaced by alternatives, e.g. you can use PyTorch's DataParallel if you only need single-node training. In practice, you are
encouraged to develop the approach that fits your project best.