docs/stable/rpc/distributed_autograd.html (+11 −11)
@@ -334,12 +334,12 @@
 <span id="id1"></span><h1>Distributed Autograd Design<a class="headerlink" href="#distributed-autograd-design" title="Permalink to this headline">¶</a></h1>
 <p>This note will present the detailed design for distributed autograd and walk
 through the internals of the same. Make sure you’re familiar with
-<a class="reference internal" href="../notes/autograd.html#autograd-mechanics"><span class="std std-ref">Autograd mechanics</span></a> and the <a class="reference internal" href="rpc.html#distributed-rpc-framework"><span class="std std-ref">Distributed RPC Framework</span></a> before
+<a class="reference internal" href="../notes/autograd.html#autograd-mechanics"><span class="std std-ref">Autograd mechanics</span></a> and the <a class="reference internal" href="../rpc.html#distributed-rpc-framework"><span class="std std-ref">Distributed RPC Framework</span></a> before
 proceeding.</p>
 <div class="section" id="background">
 <h2>Background<a class="headerlink" href="#background" title="Permalink to this headline">¶</a></h2>
 <p>Let’s say you have two nodes and a very simple model partitioned across two
-nodes. This can be implemented using <a class="reference internal" href="rpc.html#module-torch.distributed.rpc" title="torch.distributed.rpc"><code class="xref py py-mod docutils literal notranslate"><span class="pre">torch.distributed.rpc</span></code></a> as follows:</p>
+nodes. This can be implemented using <a class="reference internal" href="../rpc.html#module-torch.distributed.rpc" title="torch.distributed.rpc"><code class="xref py py-mod docutils literal notranslate"><span class="pre">torch.distributed.rpc</span></code></a> as follows:</p>
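The patched paragraph above leads into a code listing that this diff does not display. For orientation only, a minimal sketch of such a two-worker partitioned forward pass might look like the following; the worker names, tensor shapes, and init_rpc bootstrapping are illustrative assumptions, not content of this patch:

import torch
import torch.distributed.rpc as rpc

# Assumes a peer process has called rpc.init_rpc("worker1", rank=1, world_size=2)
# and that MASTER_ADDR/MASTER_PORT are set for both processes.
rpc.init_rpc("worker0", rank=0, world_size=2)

t1 = torch.rand((3, 3), requires_grad=True)
t2 = torch.rand((3, 3), requires_grad=True)

# Run the first part of the model remotely on worker1 over RPC ...
t3 = rpc.rpc_sync("worker1", torch.add, args=(t1, t2))

# ... and finish the computation locally on worker0.
t4 = torch.rand((3, 3), requires_grad=True)
loss = torch.mul(t3, t4).sum()

rpc.shutdown()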
⋮
 <h2>Distributed Autograd Context<a class="headerlink" href="#distributed-autograd-context" title="Permalink to this headline">¶</a></h2>
 <p>Each forward and backward pass that uses distributed autograd is assigned a
-unique <a class="reference internal" href="rpc.html#torch.distributed.autograd.context" title="torch.distributed.autograd.context"><code class="xref py py-class docutils literal notranslate"><span class="pre">torch.distributed.autograd.context</span></code></a> and this context has a
+unique <a class="reference internal" href="../rpc.html#torch.distributed.autograd.context" title="torch.distributed.autograd.context"><code class="xref py py-class docutils literal notranslate"><span class="pre">torch.distributed.autograd.context</span></code></a> and this context has a
 globally unique <code class="docutils literal notranslate"><span class="pre">autograd_context_id</span></code>. This context is created on each node
⋮
 before we have the opportunity to run the optimizer. This is similar to
 calling <a class="reference internal" href="../autograd.html#torch.autograd.backward" title="torch.autograd.backward"><code class="xref py py-meth docutils literal notranslate"><span class="pre">torch.autograd.backward()</span></code></a> multiple times locally. In order to
 provide a way of separating out the gradients for each backward pass, the
-gradients are accumulated in the <a class="reference internal" href="rpc.html#torch.distributed.autograd.context" title="torch.distributed.autograd.context"><code class="xref py py-class docutils literal notranslate"><span class="pre">torch.distributed.autograd.context</span></code></a>
+gradients are accumulated in the <a class="reference internal" href="../rpc.html#torch.distributed.autograd.context" title="torch.distributed.autograd.context"><code class="xref py py-class docutils literal notranslate"><span class="pre">torch.distributed.autograd.context</span></code></a>
 for each backward pass.</p></li>
 <li><p>During the forward pass we store the <code class="docutils literal notranslate"><span class="pre">send</span></code> and <code class="docutils literal notranslate"><span class="pre">recv</span></code> functions for
 each autograd pass in this context. This ensures we hold references to the
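As a hedged illustration of what this hunk describes (not shown in the diff itself): a backward pass is scoped to a torch.distributed.autograd context roughly as follows, assuming an RPC group with a peer named "worker1" has been initialized as in the earlier sketch:

import torch
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc

# Assumes rpc.init_rpc(...) has already been run and a peer "worker1" exists.
# Each `with` block opens a fresh context with its own autograd_context_id,
# so gradients from separate backward passes are kept apart.
with dist_autograd.context() as context_id:
    t1 = torch.rand((3, 3), requires_grad=True)
    t2 = torch.rand((3, 3), requires_grad=True)
    # The send/recv autograd functions recorded for this RPC are stored in
    # the context identified by context_id.
    loss = rpc.rpc_sync("worker1", torch.add, args=(t1, t2)).sum()
    # Gradients are accumulated in this context, not on the Tensors' .grad.
    dist_autograd.backward(context_id, [loss])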
@@ -524,7 +524,7 @@ <h3>Computing dependencies<a class="headerlink" href="#computing-dependencies" t
 <a class="reference internal" href="#distributed-autograd-context">Distributed Autograd Context</a>. The gradients are stored in a
 <code class="docutils literal notranslate"><span class="pre">Dict[Tensor,</span> <span class="pre">Tensor]</span></code>, which is basically a map from Tensor to its
 associated gradient and this map can be retrieved using the
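The method name is truncated by the diff above; in current PyTorch this map is returned by torch.distributed.autograd.get_gradients(context_id) (library knowledge, not part of this patch). A small self-contained sketch, using a single-process RPC group purely so it can run on its own:

import os
import torch
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc

# Single-worker RPC group so the example runs in one process.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
rpc.init_rpc("worker0", rank=0, world_size=1)

t = torch.rand((3, 3), requires_grad=True)
with dist_autograd.context() as context_id:
    loss = (t * 2).sum()
    dist_autograd.backward(context_id, [loss])
    # Dict[Tensor, Tensor]: maps each Tensor that received a gradient to the
    # gradient accumulated for this particular context (this backward pass).
    grads = dist_autograd.get_gradients(context_id)
    print(grads[t])  # the gradient lives in the context rather than in t.grad

rpc.shutdown()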
⋮
-<li><p>Takes a list of remote parameters (<a class="reference internal" href="rpc.html#torch.distributed.rpc.RRef" title="torch.distributed.rpc.RRef"><code class="xref py py-class docutils literal notranslate"><span class="pre">RRef</span></code></a>) to
+<li><p>Takes a list of remote parameters (<a class="reference internal" href="../rpc.html#torch.distributed.rpc.RRef" title="torch.distributed.rpc.RRef"><code class="xref py py-class docutils literal notranslate"><span class="pre">RRef</span></code></a>) to
 optimize. These could also be local parameters wrapped within a local
⋮
 <li><p>Takes a <a class="reference internal" href="../optim.html#torch.optim.Optimizer" title="torch.optim.Optimizer"><code class="xref py py-class docutils literal notranslate"><span class="pre">Optimizer</span></code></a> class as the local
 optimizer to run on all distinct <code class="docutils literal notranslate"><span class="pre">RRef</span></code> owners.</p></li>
 <li><p>The distributed optimizer creates an instance of the local <code class="docutils literal notranslate"><span class="pre">Optimizer</span></code> on
 each of the worker nodes and holds an <code class="docutils literal notranslate"><span class="pre">RRef</span></code> to them.</p></li>
⋮
 the distributed optimizer uses RPC to remotely execute all the local
 optimizers on the appropriate remote workers. A distributed autograd
 <code class="docutils literal notranslate"><span class="pre">context_id</span></code> must be provided as input to
-<a class="reference internal" href="rpc.html#torch.distributed.optim.DistributedOptimizer.step" title="torch.distributed.optim.DistributedOptimizer.step"><code class="xref py py-meth docutils literal notranslate"><span class="pre">torch.distributed.optim.DistributedOptimizer.step()</span></code></a>. This is used
+<a class="reference internal" href="../rpc.html#torch.distributed.optim.DistributedOptimizer.step" title="torch.distributed.optim.DistributedOptimizer.step"><code class="xref py py-meth docutils literal notranslate"><span class="pre">torch.distributed.optim.DistributedOptimizer.step()</span></code></a>. This is used
 by local optimizers to apply gradients stored in the corresponding
 context.</p></li>
 <li><p>If multiple concurrent distributed optimizers are updating the same
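Putting the items in this hunk together, a hedged end-to-end sketch (not part of the patch): parameters are passed as RRefs (here local parameters wrapped in local RRefs, which the text above notes is also allowed), a local optimizer class is supplied, and step() receives the context_id:

import os
import torch
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc
import torch.optim as optim
from torch.distributed.optim import DistributedOptimizer

# Single-worker RPC group so the sketch runs in one process; in a real model
# these RRefs would point at parameters owned by remote workers.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
rpc.init_rpc("worker0", rank=0, world_size=1)

w = torch.rand((3, 3), requires_grad=True)
b = torch.rand(3, requires_grad=True)
param_rrefs = [rpc.RRef(w), rpc.RRef(b)]

# Instantiates optim.SGD on each distinct RRef owner and keeps an RRef to
# every one of those local optimizers.
dist_optim = DistributedOptimizer(optim.SGD, param_rrefs, lr=0.05)

with dist_autograd.context() as context_id:
    loss = (torch.rand((3, 3)).mm(w) + b).sum()
    dist_autograd.backward(context_id, [loss])
    # step() needs the context_id so each local optimizer can look up the
    # gradients accumulated in the corresponding context.
    dist_optim.step(context_id)

rpc.shutdown()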