@@ -8,37 +8,42 @@ writing TensorFlow programs.
## Hello distributed TensorFlow!

This tutorial assumes that you are using a TensorFlow nightly build. You
- can test your installation by starting a local server as follows:
+ can test your installation by starting and using a local server as follows:

```shell
# Start a TensorFlow server as a single-process "cluster".
$ python
>>> import tensorflow as tf
>>> c = tf.constant("Hello, distributed TensorFlow!")
>>> server = tf.train.Server.create_local_server()
- >>> sess = tf.Session(server.target)
+ >>> sess = tf.Session(server.target)  # Create a session on the server.
>>> sess.run(c)
'Hello, distributed TensorFlow!'
```

The
[`tf.train.Server.create_local_server()`](../../api_docs/train.md#Server.create_local_server)
- method creates a single-process cluster.
+ method creates a single-process cluster, with an in-process server.

## Create a cluster

- Most clusters have multiple tasks, divided into one or more jobs. To create a
- cluster with multiple processes or machines:
+ A TensorFlow "cluster" is a set of "tasks" that participate in the distributed
+ execution of a TensorFlow graph. Each task is associated with a TensorFlow
+ "server", which contains a "master" that can be used to create sessions, and a
+ "worker" that executes operations in the graph. A cluster can also be divided
+ into one or more "jobs", where each job contains one or more tasks.

- 1. **For each process or machine** in the cluster, run a TensorFlow program to
- do the following:
+ To create a cluster, you start one TensorFlow server per task in the cluster.
+ Each task typically runs on a different machine, but you can run multiple tasks
+ on the same machine (e.g. to control different GPU devices). In each task, do
+ the following:

- 1. **Create a `tf.train.ClusterSpec`**, which describes all of the tasks
- in the cluster. This should be the same in each process.
+ 1. **Create a `tf.train.ClusterSpec`** that describes all of the tasks
+ in the cluster. This should be the same for each task.

- 1. **Create a `tf.train.Server`**, passing the `tf.train.ClusterSpec` to
- the constructor, and identifying the local process with a job name
- and task index.
+ 2. **Create a `tf.train.Server`**, passing the `tf.train.ClusterSpec` to
+ the constructor, and identifying the local task with a job name
+ and task index.


### Create a `tf.train.ClusterSpec` to describe the cluster
@@ -71,28 +76,29 @@ tf.train.ClusterSpec({
</tr>
</table>

- ### Create a `tf.train.Server` instance in each process
+ ### Create a `tf.train.Server` instance in each task

A [`tf.train.Server`](../../api_docs/python/train.md#Server) object contains a
- set of local devices, and a
- [`tf.Session`](../../api_docs/python/client.md#Session) target that can
- participate in a distributed computation. Each server belongs to a particular
- cluster (specified by a `tf.train.ClusterSpec`), and corresponds to a particular
- task in a named job. The server can communicate with any other server in the
- same cluster.
+ set of local devices, a set of connections to other tasks in its
+ `tf.train.ClusterSpec`, and a
+ ["session target"](../../api_docs/python/client.md#Session) that can use these
+ to perform a distributed computation. Each server is a member of a specific
+ named job and has a task index within that job. A server can communicate with
+ any other server in the cluster.

- For example, to define and instantiate servers running on `localhost:2222` and
- `localhost:2223`, run the following snippets in different processes:
+ For example, to launch a cluster with two servers running on `localhost:2222`
+ and `localhost:2223`, run the following snippets in two different processes on
+ the local machine:

```python
# In task 0:
- cluster = tf.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
- server = tf.GrpcServer(cluster, job_name="local", task_index=0)
+ cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
+ server = tf.train.Server(cluster, job_name="local", task_index=0)
```
```python
# In task 1:
- cluster = tf.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
- server = tf.GrpcServer(cluster, job_name="local", task_index=1)
+ cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
+ server = tf.train.Server(cluster, job_name="local", task_index=1)
```

**Note:** Manually specifying these cluster specifications can be tedious,
@@ -137,45 +143,44 @@ applying gradients).

## Replicated training

- A common training configuration ("data parallel training") involves multiple
- tasks in a `worker` job training the same model, using shared parameters hosted
- in a one or more tasks in a `ps` job. Each task will typically run on a
- different machine. There are many ways to specify this structure in TensorFlow,
- and we are building libraries that will simplify the work of specifying a
- replicated model. Possible approaches include:
-
- * Building a single graph containing one set of parameters (in `tf.Variable`
- nodes pinned to `/job:ps`), and multiple copies of the "model" pinned to
- different tasks in `/job:worker`. Each copy of the model can have a different
- `train_op`, and one or more client threads can call `sess.run(train_ops[i])`
- for each worker `i`. This implements *asynchronous* training.
-
- This approach uses a single `tf.Session` whose target is one of the workers in
- the cluster.
-
- * As above, but where the gradients from all workers are averaged. See the
- [CIFAR-10 multi-GPU trainer](https://www.tensorflow.org/code/tensorflow/models/image/cifar10/cifar10_multi_gpu_train.py)
- for an example of this form of replication. This implements *synchronous*
- training.
-
- * The "distributed trainer" approach uses multiple graphs&mdash;one per
- worker&mdash;where each graph contains one set of parameters (pinned to
- `/job:ps`) and one copy of the model (pinned to a particular
- `/job:worker/task:i`). The "container" mechanism is used to share variables
- between different graphs: when each variable is constructed, the optional
- `container` argument is specified with the same value in each copy of the
- graph. For large models, this can be more efficient, because the overall graph
- is smaller.
-
- This approach uses multiple `tf.Session` objects: one per worker process,
- where the `target` of each is the address of a different worker. The
- `tf.Session` objects can all be created in a single Python client, or you can
- use multiple Python clients to better distribute the trainer load.
+ A common training configuration, called "data parallelism," involves multiple
+ tasks in a `worker` job training the same model on different mini-batches of
+ data, updating shared parameters hosted in one or more tasks in a `ps`
+ job. All tasks typically run on different machines. There are many ways to
+ specify this structure in TensorFlow, and we are building libraries that will
+ simplify the work of specifying a replicated model. Possible approaches include:
+
+ * **In-graph replication.** In this approach, the client builds a single
+ `tf.Graph` that contains one set of parameters (in `tf.Variable` nodes pinned
+ to `/job:ps`); and multiple copies of the compute-intensive part of the model,
+ each pinned to a different task in `/job:worker`.
+
+ * **Between-graph replication.** In this approach, there is a separate client
+ for each `/job:worker` task, typically in the same process as the worker
+ task. Each client builds a similar graph containing the parameters (pinned to
+ `/job:ps` as before using
+ [`tf.train.replica_device_setter()`](../../api_docs/train.md#replica_device_setter)
+ to map them deterministically to the same tasks); and a single copy of the
+ compute-intensive part of the model, pinned to the local task in
+ `/job:worker` (see the sketch after this list).
+
+ * **Asynchronous training.** In this approach, each replica of the graph has an
+ independent training loop that executes without coordination. It is compatible
+ with both forms of replication above.
+
+ * **Synchronous training.** In this approach, all of the replicas read the same
+ values for the current parameters, compute gradients in parallel, and then
+ apply them together. It is compatible with in-graph replication (e.g. using
+ gradient averaging as in the
+ [CIFAR-10 multi-GPU trainer](https://www.tensorflow.org/code/tensorflow/models/image/cifar10/cifar10_multi_gpu_train.py)),
+ and between-graph replication (e.g. using the
+ `tf.train.SyncReplicasOptimizer`).

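As a rough sketch of the between-graph replication approach above (the port numbers, variable names, and shapes here are illustrative assumptions, not taken from the original text), each worker-task client can pin its parameters to the `ps` job with `tf.train.replica_device_setter()` and build the rest of its model on the local worker task:

```python
# A minimal sketch of between-graph replication for a two-task cluster on the
# local machine; addresses and variable shapes are chosen for illustration only.
import tensorflow as tf

cluster = tf.train.ClusterSpec({"ps": ["localhost:2222"],
                                "worker": ["localhost:2223"]})

# Each worker task runs its own copy of this client; this one is
# /job:worker/task:0.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# replica_device_setter() places variables on /job:ps and leaves the remaining
# ops on the local worker task.
with tf.device(tf.train.replica_device_setter(
    worker_device="/job:worker/task:0", cluster=cluster)):
  weights = tf.Variable(tf.zeros([784, 10]), name="weights")
  biases = tf.Variable(tf.zeros([10]), name="biases")
```

The example trainer program in the next section wraps this pattern in a full training loop.
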
### Putting it all together: example trainer program

- The following code shows the skeleton of a distributed trainer program. It
- includes the code for the parameter server and worker processes.
+ The following code shows the skeleton of a distributed trainer program,
+ implementing **between-graph replication** and **asynchronous training**. It
+ includes the code for the parameter server and worker tasks.

```python
import tensorflow as tf
@@ -197,10 +202,13 @@ def main(_):
  ps_hosts = FLAGS.ps_hosts.split(",")
  worker_hosts = FLAGS.worker_hosts.split(",")

+   # Create a cluster from the parameter server and worker hosts.
  cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
+
+   # Create and start a server for the local task.
  server = tf.train.Server(cluster,
                           job_name=FLAGS.job_name,
-                          task_index=task_index)
+                          task_index=FLAGS.task_index)

  if FLAGS.job_name == "ps":
    server.join()
@@ -290,31 +298,33 @@ $ python trainer.py \
</dd>
<dt>Cluster</dt>
<dd>
- A TensorFlow cluster comprises one or more TensorFlow servers, divided into
- a set of named jobs, which in turn comprise lists of tasks. A cluster is
- typically dedicated to a particular high-level objective, such as training a
- neural network, using many machines in parallel.
+ A TensorFlow cluster comprises one or more "jobs", each divided into lists
+ of one or more "tasks". A cluster is typically dedicated to a particular
+ high-level objective, such as training a neural network, using many machines
+ in parallel. A cluster is defined by a `tf.train.ClusterSpec` object.
</dd>
<dt>Job</dt>
<dd>
A job comprises a list of "tasks", which typically serve a common
purpose. For example, a job named `ps` (for "parameter server") typically
hosts nodes that store and update variables; while a job named `worker`
typically hosts stateless nodes that perform compute-intensive tasks.
- The tasks in a job typically run on different machines.
+ The tasks in a job typically run on different machines. The set of job roles
+ is flexible: for example, a `worker` may maintain some state.
</dd>
<dt>Master service</dt>
<dd>
- An RPC service that provides remote access to a set of distributed
- devices. The master service implements the <code>tensorflow::Session</code>
- interface, and is responsible for coordinating work across one or more
- "worker services".
+ An RPC service that provides remote access to a set of distributed devices,
+ and acts as a session target (see the example after this glossary). The master
+ service implements the <code>tensorflow::Session</code> interface, and is
+ responsible for coordinating work across one or more "worker services". All
+ TensorFlow servers implement the master service.
</dd>
<dt>Task</dt>
<dd>
- A task typically corresponds to a single TensorFlow server process,
- belonging to a particular "job" and with a particular index within that
- job's list of tasks.
+ A task corresponds to a specific TensorFlow server, and typically
+ corresponds to a single process. A task belongs to a particular "job" and is
+ identified by its index within that job's list of tasks.
</dd>
<dt>TensorFlow server</dt>
<dd>
@@ -326,6 +336,7 @@ $ python trainer.py \
An RPC service that executes parts of a TensorFlow graph using its local
devices. A worker service implements <a href=
"https://www.tensorflow.org/code/tensorflow/core/protobuf/worker_service.proto"
- ><code>worker_service.proto</code></a>.
+ ><code>worker_service.proto</code></a>. All TensorFlow servers implement the
+ worker service.
</dd>
</dl>
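To make the "session target" role of the master service concrete, here is a small illustrative sketch (the address is an assumption and presumes a TensorFlow server is already listening there, for example one started as shown earlier in this tutorial):

```python
# Illustrative only: assumes a TensorFlow server is already serving on
# localhost:2222 in another process; server.target works the same way.
import tensorflow as tf

# The "grpc://host:port" string names that server's master service, and is
# used as the session target.
with tf.Session("grpc://localhost:2222") as sess:
  print(sess.run(tf.constant("Hello from the master service!")))
```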