Commit 6bda8aa

mrry authored and martinwicke committed
Applying editorial changes to the distributed how-to. (tensorflow#1920)
Change: 119605636
1 parent 31ea3db commit 6bda8aa

File tree

1 file changed: +86 -75 lines changed
  • tensorflow/g3doc/how_tos/distributed


tensorflow/g3doc/how_tos/distributed/index.md

Lines changed: 86 additions & 75 deletions
@@ -8,37 +8,42 @@ writing TensorFlow programs.
 ## Hello distributed TensorFlow!
 
 This tutorial assumes that you are using a TensorFlow nightly build. You
-can test your installation by starting a local server as follows:
+can test your installation by starting and using a local server as follows:
 
 ```shell
 # Start a TensorFlow server as a single-process "cluster".
 $ python
 >>> import tensorflow as tf
 >>> c = tf.constant("Hello, distributed TensorFlow!")
 >>> server = tf.train.Server.create_local_server()
->>> sess = tf.Session(server.target)
+>>> sess = tf.Session(server.target)  # Create a session on the server.
 >>> sess.run(c)
 'Hello, distributed TensorFlow!'
 ```
 
 The
 [`tf.train.Server.create_local_server()`](../../api_docs/train.md#Server.create_local_server)
-method creates a single-process cluster.
+method creates a single-process cluster, with an in-process server.
 
 ## Create a cluster
 
-Most clusters have multiple tasks, divided into one or more jobs. To create a
-cluster with multiple processes or machines:
+A TensorFlow "cluster" is a set of "tasks" that participate in the distributed
+execution of a TensorFlow graph. Each task is associated with a TensorFlow
+"server", which contains a "master" that can be used to create sessions, and a
+"worker" that executes operations in the graph. A cluster can also be divided
+into one or more "jobs", where each job contains one or more tasks.
 
-1. **For each process or machine** in the cluster, run a TensorFlow program to
-   do the following:
+To create a cluster, you start one TensorFlow server per task in the cluster.
+Each task typically runs on a different machine, but you can run multiple tasks
+on the same machine (e.g. to control different GPU devices). In each task, do
+the following:
 
-   1. **Create a `tf.train.ClusterSpec`**, which describes all of the tasks
-      in the cluster. This should be the same in each process.
+1. **Create a `tf.train.ClusterSpec`** that describes all of the tasks
+   in the cluster. This should be the same for each task.
 
-   1. **Create a `tf.train.Server`**, passing the `tf.train.ClusterSpec` to
-      the constructor, and identifying the local process with a job name
-      and task index.
+2. **Create a `tf.train.Server`**, passing the `tf.train.ClusterSpec` to
+   the constructor, and identifying the local task with a job name
+   and task index.

4348

4449
### Create a `tf.train.ClusterSpec` to describe the cluster
@@ -71,28 +76,29 @@ tf.train.ClusterSpec({
 </tr>
 </table>
 
-### Create a `tf.train.Server` instance in each process
+### Create a `tf.train.Server` instance in each task
 
 A [`tf.train.Server`](../../api_docs/python/train.md#Server) object contains a
-set of local devices, and a
-[`tf.Session`](../../api_docs/python/client.md#Session) target that can
-participate in a distributed computation. Each server belongs to a particular
-cluster (specified by a `tf.train.ClusterSpec`), and corresponds to a particular
-task in a named job. The server can communicate with any other server in the
-same cluster.
+set of local devices, a set of connections to other tasks in its
+`tf.train.ClusterSpec`, and a
+["session target"](../../api_docs/python/client.md#Session) that can use these
+to perform a distributed computation. Each server is a member of a specific
+named job and has a task index within that job. A server can communicate with
+any other server in the cluster.
 
-For example, to define and instantiate servers running on `localhost:2222` and
-`localhost:2223`, run the following snippets in different processes:
+For example, to launch a cluster with two servers running on `localhost:2222`
+and `localhost:2223`, run the following snippets in two different processes on
+the local machine:
 
 ```python
 # In task 0:
-cluster = tf.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
-server = tf.GrpcServer(cluster, job_name="local", task_index=0)
+cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
+server = tf.train.Server(cluster, job_name="local", task_index=0)
 ```
 ```python
 # In task 1:
-cluster = tf.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
-server = tf.GrpcServer(cluster, job_name="local", task_index=1)
+cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
+server = tf.train.Server(cluster, job_name="local", task_index=1)
 ```
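
Once both snippets above are running, a client in either process (or in a third
process on the same machine) can use one of the servers as a session target to
run a graph that spans both tasks. A minimal sketch, assuming the two local
servers above are still running:

```python
import tensorflow as tf

# Pin one constant to each task in the "local" job.
with tf.device("/job:local/task:0"):
  a = tf.constant(1.0)
with tf.device("/job:local/task:1"):
  b = tf.constant(2.0)
c = a + b

# Connecting to either server works: its master service coordinates the work
# across both tasks in the cluster.
with tf.Session("grpc://localhost:2222") as sess:
  print(sess.run(c))  # ==> 3.0
```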
 
 **Note:** Manually specifying these cluster specifications can be tedious,
@@ -137,45 +143,44 @@ applying gradients).
 
 ## Replicated training
 
-A common training configuration ("data parallel training") involves multiple
-tasks in a `worker` job training the same model, using shared parameters hosted
-in a one or more tasks in a `ps` job. Each task will typically run on a
-different machine. There are many ways to specify this structure in TensorFlow,
-and we are building libraries that will simplify the work of specifying a
-replicated model. Possible approaches include:
-
-* Building a single graph containing one set of parameters (in `tf.Variable`
-  nodes pinned to `/job:ps`), and multiple copies of the "model" pinned to
-  different tasks in `/job:worker`. Each copy of the model can have a different
-  `train_op`, and one or more client threads can call `sess.run(train_ops[i])`
-  for each worker `i`. This implements *asynchronous* training.
-
-  This approach uses a single `tf.Session` whose target is one of the workers in
-  the cluster.
-
-* As above, but where the gradients from all workers are averaged. See the
-  [CIFAR-10 multi-GPU trainer](https://www.tensorflow.org/code/tensorflow/models/image/cifar10/cifar10_multi_gpu_train.py)
-  for an example of this form of replication. This implements *synchronous*
-  training.
-
-* The "distributed trainer" approach uses multiple graphs&mdash;one per
-  worker&mdash;where each graph contains one set of parameters (pinned to
-  `/job:ps`) and one copy of the model (pinned to a particular
-  `/job:worker/task:i`). The "container" mechanism is used to share variables
-  between different graphs: when each variable is constructed, the optional
-  `container` argument is specified with the same value in each copy of the
-  graph. For large models, this can be more efficient, because the overall graph
-  is smaller.
-
-  This approach uses multiple `tf.Session` objects: one per worker process,
-  where the `target` of each is the address of a different worker. The
-  `tf.Session` objects can all be created in a single Python client, or you can
-  use multiple Python clients to better distribute the trainer load.
+A common training configuration, called "data parallelism," involves multiple
+tasks in a `worker` job training the same model on different mini-batches of
+data, updating shared parameters hosted in one or more tasks in a `ps`
+job. All tasks typically run on different machines. There are many ways to
+specify this structure in TensorFlow, and we are building libraries that will
+simplify the work of specifying a replicated model. Possible approaches include:
+
+* **In-graph replication.** In this approach, the client builds a single
+  `tf.Graph` that contains one set of parameters (in `tf.Variable` nodes pinned
+  to `/job:ps`); and multiple copies of the compute-intensive part of the model,
+  each pinned to a different task in `/job:worker`.
+
+* **Between-graph replication.** In this approach, there is a separate client
+  for each `/job:worker` task, typically in the same process as the worker
+  task. Each client builds a similar graph containing the parameters (pinned to
+  `/job:ps` as before using
+  [`tf.train.replica_device_setter()`](../../api_docs/train.md#replica_device_setter)
+  to map them deterministically to the same tasks); and a single copy of the
+  compute-intensive part of the model, pinned to the local task in
+  `/job:worker`.
+
+* **Asynchronous training.** In this approach, each replica of the graph has an
+  independent training loop that executes without coordination. It is compatible
+  with both forms of replication above.
+
+* **Synchronous training.** In this approach, all of the replicas read the same
+  values for the current parameters, compute gradients in parallel, and then
+  apply them together. It is compatible with in-graph replication (e.g. using
+  gradient averaging as in the
+  [CIFAR-10 multi-GPU trainer](https://www.tensorflow.org/code/tensorflow/models/image/cifar10/cifar10_multi_gpu_train.py)),
+  and between-graph replication (e.g. using the
+  `tf.train.SyncReplicasOptimizer`).
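
To make the distinction between the two forms of replication concrete, the
following is a hedged sketch of in-graph replication combined with asynchronous
training. It assumes a hypothetical cluster with one `ps` task and two `worker`
tasks whose servers are already running; the toy loss, the optimizer, and the
784-dimensional inputs are illustrative placeholders, not part of this
document's example code:

```python
import tensorflow as tf

# In-graph replication: a single client builds one graph that contains the
# shared parameters on /job:ps and one model copy per /job:worker task.
with tf.device("/job:ps/task:0"):
  weights = tf.Variable(tf.zeros([784, 10]))
  biases = tf.Variable(tf.zeros([10]))

# One copy of the compute-intensive part per worker task, each with its own
# input placeholder and train op.
inputs, train_ops = [], []
for i in range(2):
  with tf.device("/job:worker/task:%d" % i):
    x = tf.placeholder(tf.float32, [None, 784])
    loss = tf.reduce_mean(tf.square(tf.matmul(x, weights) + biases))  # toy loss
    inputs.append(x)
    train_ops.append(tf.train.GradientDescentOptimizer(0.5).minimize(loss))

# For asynchronous training, a single client session (whose target is any
# server in the cluster) can then drive the replicas, e.g. one thread per
# worker calling sess.run(train_ops[i], feed_dict={inputs[i]: batch_i}).
```

Between-graph replication, by contrast, builds only the local worker's copy of
the model in each client, as the trainer skeleton below does.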
 
 ### Putting it all together: example trainer program
 
-The following code shows the skeleton of a distributed trainer program. It
-includes the code for the parameter server and worker processes.
+The following code shows the skeleton of a distributed trainer program,
+implementing **between-graph replication** and **asynchronous training**. It
+includes the code for the parameter server and worker tasks.
 
 ```python
 import tensorflow as tf
@@ -197,10 +202,13 @@ def main(_):
   ps_hosts = FLAGS.ps_hosts.split(",")
   worker_hosts = FLAGS.worker_hosts.split(",")
 
+  # Create a cluster from the parameter server and worker hosts.
   cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
+
+  # Create and start a server for the local task.
   server = tf.train.Server(cluster,
                            job_name=FLAGS.job_name,
-                           task_index=task_index)
+                           task_index=FLAGS.task_index)
 
   if FLAGS.job_name == "ps":
     server.join()
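
The remainder of this skeleton (the worker branch) falls outside the hunk
above. For orientation only, here is a rough, hypothetical sketch of how a
between-graph worker branch is typically written with this API; the
`build_loss()` helper, the optimizer, and the `tf.train.Supervisor` settings
are placeholders rather than the document's exact code:

```python
import tensorflow as tf

def build_loss():
  # Hypothetical stand-in for real model-building code.
  w = tf.Variable(tf.zeros([10]))
  return tf.reduce_sum(tf.square(w - 1.0))

def run_worker(cluster, server, task_index):
  # Between-graph replication: variables are pinned to /job:ps, everything
  # else to this worker task.
  with tf.device(tf.train.replica_device_setter(
      worker_device="/job:worker/task:%d" % task_index,
      cluster=cluster)):
    loss = build_loss()
    global_step = tf.Variable(0, name="global_step", trainable=False)
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=global_step)
    init_op = tf.initialize_all_variables()

  # The Supervisor handles initialization and checkpointing; task 0 acts as
  # the "chief".
  sv = tf.train.Supervisor(is_chief=(task_index == 0), init_op=init_op,
                           global_step=global_step)
  sess = sv.prepare_or_wait_for_session(server.target)

  # Asynchronous training: each worker runs its own loop, uncoordinated with
  # the other workers.
  while not sv.should_stop():
    _, step = sess.run([train_op, global_step])
```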
@@ -290,31 +298,33 @@ $ python trainer.py \
 </dd>
 <dt>Cluster</dt>
 <dd>
-A TensorFlow cluster comprises one or more TensorFlow servers, divided into
-a set of named jobs, which in turn comprise lists of tasks. A cluster is
-typically dedicated to a particular high-level objective, such as training a
-neural network, using many machines in parallel.
+A TensorFlow cluster comprises one or more "jobs", each divided into lists
+of one or more "tasks". A cluster is typically dedicated to a particular
+high-level objective, such as training a neural network, using many machines
+in parallel. A cluster is defined by a `tf.train.ClusterSpec` object.
 </dd>
 <dt>Job</dt>
 <dd>
 A job comprises a list of "tasks", which typically serve a common
 purpose. For example, a job named `ps` (for "parameter server") typically
 hosts nodes that store and update variables; while a job named `worker`
 typically hosts stateless nodes that perform compute-intensive tasks.
-The tasks in a job typically run on different machines.
+The tasks in a job typically run on different machines. The set of job roles
+is flexible: for example, a `worker` may maintain some state.
 </dd>
 <dt>Master service</dt>
 <dd>
-An RPC service that provides remote access to a set of distributed
-devices. The master service implements the <code>tensorflow::Session</code>
-interface, and is responsible for coordinating work across one or more
-"worker services".
+An RPC service that provides remote access to a set of distributed devices,
+and acts as a session target. The master service implements the
+<code>tensorflow::Session</code> interface, and is responsible for
+coordinating work across one or more "worker services". All TensorFlow
+servers implement the master service.
 </dd>
 <dt>Task</dt>
 <dd>
-A task typically corresponds to a single TensorFlow server process,
-belonging to a particular "job" and with a particular index within that
-job's list of tasks.
+A task corresponds to a specific TensorFlow server, and typically
+corresponds to a single process. A task belongs to a particular "job" and is
+identified by its index within that job's list of tasks.
 </dd>
 <dt>TensorFlow server</dt>
 <dd>
@@ -326,6 +336,7 @@ $ python trainer.py \
 An RPC service that executes parts of a TensorFlow graph using its local
 devices. A worker service implements <a href=
 "https://www.tensorflow.org/code/tensorflow/core/protobuf/worker_service.proto"
-><code>worker_service.proto</code></a>.
+><code>worker_service.proto</code></a>. All TensorFlow servers implement the
+worker service.
 </dd>
 </dl>
