Chapter 12 (Distributing TensorFlow) Fa21-Bse-036
Some operations also have multithreaded kernels that can use the intra-op
thread pool to split a single computation across multiple threads on the same
device.
Controlling Parallelism
You can control the number of threads in the inter-op and intra-op thread pools
by setting the inter_op_parallelism_threads and intra_op_parallelism_threads
options in the session configuration. This lets you fine-tune the parallelism to
match the characteristics of your hardware and workload.
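For example, here is a minimal sketch using the TensorFlow 1.x API assumed throughout this chapter; the thread counts are placeholders to tune for your machine:

```python
import tensorflow as tf

# Hypothetical thread counts; tune them to your CPU core count and workload.
config = tf.ConfigProto(
    inter_op_parallelism_threads=4,   # threads that run independent ops in parallel
    intra_op_parallelism_threads=8)   # threads used inside multithreaded kernels

with tf.Session(config=config) as sess:
    result = sess.run(tf.reduce_sum(tf.random_uniform((1000, 1000))))
```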
Multiple Devices Across Multiple Servers
Defining a Cluster
To run a TensorFlow graph across multiple servers, you first need to define a cluster. A
cluster is composed of one or more TensorFlow servers, called tasks, typically spread
across several machines. Each task belongs to a job, which is a named group of tasks
with a common role, such as storing model parameters or performing computations.
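As a sketch, a cluster definition might look like the following (the machine names, ports, and job layout are hypothetical):

```python
import tensorflow as tf

# Hypothetical cluster: one "ps" job (parameter server) and a "worker" job with two tasks.
cluster_spec = tf.train.ClusterSpec({
    "ps":     ["machine-a.example.com:2221"],
    "worker": ["machine-a.example.com:2222",
               "machine-b.example.com:2222"],
})

# Each task launches its own server, identified by its job name and task index.
server = tf.train.Server(cluster_spec, job_name="worker", task_index=0)
server.join()  # block this process and serve requests from the other tasks
```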
Placing Operations
You can use device blocks to pin operations on any device managed by any task in the
cluster, specifying the job name, task index, device type, and device index.
TensorFlow also provides the replica_device_setter() function to automatically
distribute variables across parameter servers in a round-robin fashion.
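A short sketch of both placement styles (the cluster layout is assumed, not defined here):

```python
import tensorflow as tf

# Pin an operation on a specific task's device in the cluster.
with tf.device("/job:ps/task:0/cpu:0"):
    a = tf.Variable(1.0, name="a")

# replica_device_setter() spreads variables across the "ps" tasks round-robin,
# while non-variable operations stay on the worker's default device.
with tf.device(tf.train.replica_device_setter(ps_tasks=2)):
    b = tf.Variable(2.0, name="b")   # -> /job:ps/task:0
    c = tf.Variable(3.0, name="c")   # -> /job:ps/task:1
    s = a + b + c                    # -> /job:worker (default)
```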
Sharing State
In a distributed setup, variable state is managed by resource containers located on the
cluster, not by individual sessions. This allows multiple sessions to seamlessly share
the same variables, even if they are connected to different servers. TensorFlow also
provides queues and readers that can be shared across sessions to enable asynchronous
data loading.
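The sketch below illustrates the idea, assuming two servers belonging to the same cluster (the gRPC addresses are hypothetical):

```python
import tensorflow as tf

x = tf.Variable(0.0, name="x")
increment_x = tf.assign_add(x, 1.0)

# Both sessions connect to servers of the same cluster, so they share the same
# resource container and therefore the same variable x.
with tf.Session("grpc://machine-a.example.com:2222") as sess1:
    sess1.run(x.initializer)
    sess1.run(increment_x)

with tf.Session("grpc://machine-b.example.com:2222") as sess2:
    print(sess2.run(x))  # prints 1.0: the update made by the first session is visible
```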
Efficient Data Loading
Preloading the Data
For datasets that can fit in memory, you can preload the training data into a variable and use
that variable in your graph. This ensures the data is only transferred once from the client to the
cluster, rather than being repeatedly loaded and fed through placeholders.
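A minimal sketch of this pattern (the dataset here is a random stand-in):

```python
import tensorflow as tf
import numpy as np

training_data = np.random.rand(10000, 20).astype(np.float32)  # stand-in dataset

# The variable is neither trainable nor saved (collections=[]); it simply holds
# the data on the cluster so it is transferred from the client only once.
data_init = tf.placeholder(tf.float32, shape=(10000, 20))
data_var = tf.Variable(data_init, trainable=False, collections=[], name="training_data")

with tf.Session() as sess:
    sess.run(data_var.initializer, feed_dict={data_init: training_data})
    # data_var can now be used by the rest of the graph without re-feeding it.
```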
Reading the Data Directly from the Graph
For larger datasets, reader operations let the graph read the training data directly from the
filesystem, without the data ever passing through the client. This allows you to build a data
loading pipeline that runs in parallel with the training computations.
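A small sketch using a TextLineReader on a hypothetical CSV file with two features and an integer target per line:

```python
import tensorflow as tf

filename_queue = tf.train.string_input_producer(["my_data.csv"])  # hypothetical file
reader = tf.TextLineReader(skip_header_lines=1)
key, value = reader.read(filename_queue)            # reads one line per evaluation
x1, x2, target = tf.decode_csv(value, record_defaults=[[0.], [0.], [0]])
features = tf.stack([x1, x2])

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    print(sess.run([features, target]))  # the graph reads the file itself; nothing is fed by the client
    coord.request_stop()
    coord.join(threads)
```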
Multithreaded Readers
To further improve data loading throughput, you can use TensorFlow's Coordinator and
QueueRunner classes to manage multiple threads that simultaneously read from multiple files
and push the data into a shared queue.
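A sketch of this pattern, with one reader per thread feeding a shared queue (the filenames are hypothetical):

```python
import tensorflow as tf

filenames = ["data_1.csv", "data_2.csv", "data_3.csv"]  # hypothetical files
filename_queue = tf.train.string_input_producer(filenames)

# Shared queue that every reader thread pushes raw lines into.
line_queue = tf.FIFOQueue(capacity=1000, dtypes=[tf.string])

def read_and_enqueue():
    # One reader per thread, all pulling filenames from the same filename queue.
    reader = tf.TextLineReader(skip_header_lines=1)
    key, value = reader.read(filename_queue)
    return line_queue.enqueue([value])

enqueue_ops = [read_and_enqueue() for _ in range(3)]
queue_runner = tf.train.QueueRunner(line_queue, enqueue_ops)

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    tf.train.start_queue_runners(sess=sess, coord=coord)                  # filename-queue threads
    threads = queue_runner.create_threads(sess, coord=coord, start=True)  # three reader threads
    print(sess.run(line_queue.dequeue()))
    coord.request_stop()
    coord.join(threads)
```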
Convenience Functions
TensorFlow provides several convenience functions, such as string_input_producer() and
shuffle_batch(), that create the necessary queues and queue runners for you, greatly reducing
the boilerplate needed to build an input pipeline.
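For instance, a sketch combining both helpers (the CSV file and its format are assumptions):

```python
import tensorflow as tf

# string_input_producer() builds the filename queue and its queue runner for us.
filename_queue = tf.train.string_input_producer(["my_data.csv"])  # hypothetical file
reader = tf.TextLineReader(skip_header_lines=1)
key, value = reader.read(filename_queue)
x1, x2 = tf.decode_csv(value, record_defaults=[[0.], [0.]])

# shuffle_batch() builds the shuffling queue, its enqueue ops and queue runner,
# and returns an op that dequeues shuffled mini-batches.
features_batch = tf.train.shuffle_batch([tf.stack([x1, x2])], batch_size=32,
                                        capacity=1000, min_after_dequeue=100,
                                        num_threads=4)
```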
One Neural Network per Device
The simplest way to parallelize is to run one complete neural network per
device, either on the same machine or across multiple machines in a cluster. This is
perfect for hyperparameter tuning or serving a high volume of queries.
In-Graph Replication
For parallelizing the training of a large ensemble of neural networks, you can create a
single graph containing all the networks, each placed on a different device, plus the
computations needed to aggregate the individual predictions.
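A minimal sketch of in-graph replication for a two-member ensemble (it assumes two GPUs and uses a toy network as a stand-in):

```python
import tensorflow as tf

def build_network(x, scope):
    # Stand-in for one ensemble member: a tiny two-layer network.
    with tf.variable_scope(scope):
        hidden = tf.layers.dense(x, 32, activation=tf.nn.relu)
        return tf.layers.dense(hidden, 1)

x = tf.placeholder(tf.float32, shape=(None, 10))

# One graph: each network pinned to its own device, plus an aggregation op.
predictions = []
for i in range(2):                       # assumes two GPUs are available
    with tf.device("/gpu:%d" % i):
        predictions.append(build_network(x, "network_%d" % i))

ensemble_prediction = tf.reduce_mean(tf.stack(predictions), axis=0)
```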
Between-Graph Replication
Alternatively, you can create separate graphs for each neural network and coordinate
their execution using queues, with one client handling the input distribution and
another aggregating the outputs.
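One side of such a setup might look like this sketch: a client hosts a queue on the cluster and pushes instances into it, while the other clients open the same queue through its shared_name (the device and address are hypothetical):

```python
import tensorflow as tf

# Client A (its own graph): hosts an input queue on the cluster and fills it.
# The other clients, each running a separate graph, open the very same queue
# simply by creating a queue with the same shared_name.
with tf.device("/job:ps/task:0"):
    input_queue = tf.FIFOQueue(capacity=10, dtypes=[tf.float32], shapes=[[10]],
                               name="input_queue", shared_name="shared_input_queue")

instance = tf.placeholder(tf.float32, shape=[10])
enqueue_instance = input_queue.enqueue([instance])

with tf.Session("grpc://machine-a.example.com:2222") as sess:  # hypothetical address
    sess.run(enqueue_instance, feed_dict={instance: [0.0] * 10})
```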
Scalable Performance
By leveraging the distributed capabilities of TensorFlow, you can achieve near-linear
scaling as you add more devices and servers.
Coordinating Asynchronous Computations
Queues are a natural way to coordinate asynchronous computations, such as loading data in the
background while training a model. Queues allow you to decouple the data
pipeline from the training pipeline, improving overall throughput.
Controlling Dependencies
Adding control dependencies between operations can help you postpone the execution of memory-
intensive or communication-heavy computations until they are truly needed, allowing other operations
to run in parallel and improving resource utilization.
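A tiny sketch of the idea (the ops themselves are arbitrary placeholders):

```python
import tensorflow as tf

a = tf.random_uniform((1000, 1000))
b = tf.random_uniform((1000, 1000))

cheap = a + b                      # lightweight op that can run right away

# Postpone the memory- and compute-heavy matmul until the cheap op has
# finished, leaving resources free for other operations in the meantime.
with tf.control_dependencies([cheap]):
    heavy = tf.matmul(a, b)
```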
Managing State
The distributed nature of TensorFlow's resource containers allows you to seamlessly share variables,
queues, and other stateful objects across multiple sessions, simplifying the coordination of your
distributed computations.
Leveraging Coordinators
TensorFlow's Coordinator and QueueRunner classes make it easier to manage the lifecycle of
asynchronous threads, ensuring they start and stop gracefully and avoiding deadlocks or other
concurrency issues.
Achieving Scalable Performance
Technique | Benefits
Distributing computations across multiple devices | Reduces training time for large neural networks; allows exploring larger hyperparameter spaces
Efficient data loading pipelines | Ensures data is available when needed, without becoming a bottleneck
Coordinating asynchronous computations | Improves resource utilization; enables overlapping of data loading and training
Leveraging TensorFlow's distributed capabilities | Provides a scalable and flexible framework for building high-performance machine learning applications
Conclusion
Unlocking the Power of Distributed Computing:
By mastering the techniques for distributing TensorFlow computations across
devices and servers, you can unlock the true potential of your hardware
resources and tackle much larger and more complex machine learning
problems. Whether it's speeding up the training of neural networks, exploring
a wider range of hyperparameters, or serving high volumes of queries, the
distributed capabilities of TensorFlow provide a powerful and flexible
foundation for building scalable, high-performance applications.