# MegEngine is Licensed under the Apache License, Version 2.0 (the "License")
#
# Copyright (c) 2014-2021 Megvii Inc. All rights reserved.
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
=
= None
return
= ,
r"""GradManager computes gradients or, more generally, vector-Jacobian products by reverse-mode
automatic differentiation (a.k.a. back propagation).
Reverse mode autodiff normally reuses many intermediate tensors for best computation efficiency.
In a read-eval-print-loop (REPL) environment, however, it is impossible to know how the user
would take gradients later, and thus which tensors to keep. To solve this problem, the user must
somehow declare beforehand which gradient could possibly be taken. With GradManager, users are
required to call the :meth:`attach` method on a tensor if they want to take gradients with
respect to it later. Furthermore, any computation on a tensor before it is attached is
completely ignored from the autodiff perspective, so :meth:`attach` must be called before any
computation that needs differentiation.
For example, the following symbolic differentiation code
.. code-block::
x = get_x()
y = f(x)
dy = ones_like(y)
dx = vjp(y, x, dy) # vector-Jacobian product
can be rewritten using GradManager for a REPL environment as
.. code-block::
with GradManager() as gm:
x = get_x()
gm.attach(x) # must be placed before any computation on x that needs differentiation
y = f(x)
dy = ones_like(y)
gm.backward(y, dy) # doesn't need x, already known via attach()
dx = x.grad # backward() saves result to .grad attribute
A more realistic example of training a neural network would be like
.. code-block::
gm = GradManager()
gm.attach(model.parameters())
for data in dataset:
with gm:
loss = model(data)
gm.backward(loss)
# gradients w.r.t. parameters are accumulated into their .grad attributes
You can also use the ``record()`` and ``release()`` methods instead of a ``with`` context:
.. code-block::
gm = GradManager()
gm.attach(model.parameters())
for data in dataset:
gm.record()
loss = model(data)
gm.backward(loss)
# backward() will clear recorded history and free resources
# call release() if backward() is not called
# gm.release()
For your convenience, GradManager may (not must) be reused. As shown in the examples, you
only need to attach a tensor once and GradManager will remember it afterwards.
However, a single GradManager can record only one computation history at a time. To run
multiple differentiations simultaneously or perform higher-order differentiation, create
as many GradManagers as you need.
.. note::
Mutable tensors introduce ambiguities when doing symbolic differentiation: which version
of the tensor are we referring to? For attached tensors, GradManager resolves this
ambiguity by "snapshotting" them on first encounter, either on :meth:`record` (or on entering
a with statement) if the tensor is attached before :meth:`record`, or on :meth:`attach` if
GradManager is already recording. Attached tensors will then be interpreted as their
snapshotted version for differentiation purpose. The same ambiguity on the first parameter
of :meth:`backward` is simply resolved by using the latest version.
Typically, in data parallel, we would like to average the gradients across
processes. Users will finally get the averaged gradients if an "AllReduce"
callback is registered as follows:
.. code-block::
import megengine.distributed as dist
gm = GradManager()
gm.attach(model.parameters(), callbacks=dist.make_allreduce_cb("MEAN"))
"""
= # id(Tensor) -> AttachSpec
= False
= None
=
=
r"""Return attached tensor list from :meth:`attach`."""
return
r"""Instruct GradManager to track operations on tensors, so that gradients with respect
to those tensors could be evaluated later.
:meth:`attach` also accepts a list of callbacks, which will be called with the tensor and
its gradient during :meth:`backward`. The signature of callbacks should look like:
.. code-block::
def callback(tensor: Tensor, grad: Tensor) -> Tensor:
...
# returned grad is passed to subsequent callbacks
# and finally accumulated to the .grad attribute of tensor
return grad
:meth:`attach` calls with overlapping tensors will result in their callbacks concatenated,
independently for each tensor. For example,
.. code-block::
gm.attach([x, y], callbacks=[f])
gm.attach([y], callbacks=[g])
is equivalent to
.. code-block::
gm.attach([x], callbacks=[f])
gm.attach([y], callbacks=[f, g])
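The concatenation rule above can be checked with a small pure-Python sketch. ``CallbackRegistry`` is a hypothetical stand-in written for illustration, not MegEngine's implementation; it only assumes that callbacks are stored per tensor, keyed by ``id(tensor)``, and appended in attach order.

```python
# Hypothetical stand-in for illustration only; this is not MegEngine's GradManager.
class CallbackRegistry:
    def __init__(self):
        # id(tensor) -> list of callbacks, in the order they were attached
        self._callbacks = {}

    def attach(self, tensors, callbacks=()):
        for t in tensors:
            # A repeated attach extends the per-tensor list instead of replacing it.
            self._callbacks.setdefault(id(t), []).extend(callbacks)

    def callbacks_for(self, tensor):
        return list(self._callbacks.get(id(tensor), []))


def f(tensor, grad):
    return grad


def g(tensor, grad):
    return grad


x, y = object(), object()  # placeholders standing in for tensors

a = CallbackRegistry()
a.attach([x, y], callbacks=[f])
a.attach([y], callbacks=[g])

b = CallbackRegistry()
b.attach([x], callbacks=[f])
b.attach([y], callbacks=[f, g])

# Both registries end up with the same per-tensor callback lists.
same_for_x = a.callbacks_for(x) == b.callbacks_for(x)
same_for_y = a.callbacks_for(y) == b.callbacks_for(y)
```

Under this model, attaching the same tensors and callbacks on every iteration would keep extending the lists, which is the growth the next paragraph warns about.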
The effect of :meth:`attach` will persist across multiple uses of the GradManager. When
reusing a GradManager, it is likely a mistake to call :meth:`attach` on the same set of
tensors and callbacks repeatedly, which may grow the callback list indefinitely.
.. note::
When reusing a GradManager, it is sometimes desirable to attach temporary tensors each
time, e.g. for computing gradients of inputs of a neural network. GradManager tries to
accommodate such usages by holding weak references to attached tensors. Most of the
time, this should be enough to prevent resource leaks. Unfortunately, there are still
some pitfalls left:
- Callbacks should not hold strong references, directly or indirectly, to attached
tensors. Any strong reference, including those from callbacks, will prevent
garbage collection (even by the cycle collector!) of an attached tensor, until
the GradManager object is garbage collected.
Please also note that GradManager might hold additional strong references to attached
tensors when it is in use. This note only covers potential resource leaks across
multiple uses of a GradManager, which is unrelated to whether resources are released
promptly within a single use.
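The callback pitfall in this note can be illustrated with plain Python weak references. This is only a sketch of the general mechanism on CPython: ``FakeTensor``, ``registry`` and the two callback factories are hypothetical names, and the code does not model GradManager's internals.

```python
import gc
import weakref


class FakeTensor:
    # Hypothetical placeholder; not a MegEngine Tensor.
    pass


def make_plain_callback():
    def callback(tensor, grad):
        return grad
    return callback


def make_capturing_callback(captured):
    def callback(tensor, grad):
        _ = captured  # the closure keeps a strong reference to `captured`
        return grad
    return callback


# The registry holds only a weak reference to each tensor, plus its callback.
registry = []

t1 = FakeTensor()
ref1 = weakref.ref(t1)
registry.append((ref1, make_plain_callback()))
del t1
gc.collect()
collected_without_capture = ref1() is None  # the tensor could be freed

t2 = FakeTensor()
ref2 = weakref.ref(t2)
registry.append((ref2, make_capturing_callback(t2)))
del t2
gc.collect()
collected_with_capture = ref2() is None  # the callback still keeps it alive
```

In this sketch the second tensor stays alive for as long as ``registry`` holds its callback, which mirrors the advice that callbacks should not reference the tensors they are attached to.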
Args:
tensors: tensor or list of tensors to track
callbacks: callback or list of callbacks
"""
=
=
=
=
=
=
del
=
=
=
return
assert ,
assert ,
=
= is None
=
=
return
return
r"""Compute gradients (or vector-Jacobian product) for all attached tensors, accumulate to
corresponding .grad attribute, and release resources along the way.
:meth:`backward` computes the vector-Jacobian product :math:`dx_j = \sum_{i} dy_i J_{ij}`
where :math:`J_{ij} = \partial y_i / \partial x_j` is the Jacobian matrix between vector variables :math:`y`
and :math:`x`, with all vectors involved represented as a list of tensors, in the sense of
direct sums (or flatten-and-concatenate). :math:`y` and :math:`dy` are passed as the first
and second parameter respectively, whereas :math:`x` is directly taken from the list of
all attached tensors. The result :math:`dx` is also not returned. Instead, it is directly
accumulated into the .grad attribute of matching attached tensors (a.k.a. :math:`x`). This
can be done unambiguously since :math:`dx` as a list of tensors has the same structure as
:math:`x`.
If :math:`y` is a scalar and :math:`dy` is chosen to be 1, the vector-Jacobian product
yields the gradient of :math:`y` with respect to :math:`x` as a special case. In that case,
you will be able to omit the :math:`dy` parameter and :meth:`backward` will automatically
use 1 for it and compute the gradient.
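As a worked instance of the formula :math:`dx_j = \sum_{i} dy_i J_{ij}`, the sketch below evaluates the product by hand in pure Python. The helper ``vjp`` and the example function :math:`y = (x_0 x_1, x_0 + x_1)` are assumptions made for illustration; the Jacobian is written out manually rather than obtained from MegEngine.

```python
def vjp(jacobian, dy):
    # dx_j = sum_i dy_i * J_ij, with rows indexed by y_i and columns by x_j
    n_cols = len(jacobian[0])
    return [
        sum(dy[i] * jacobian[i][j] for i in range(len(dy)))
        for j in range(n_cols)
    ]


x0, x1 = 2.0, 3.0

# y = (x0 * x1, x0 + x1), so J = [[x1, x0], [1, 1]]
J = [[x1, x0], [1.0, 1.0]]
dx = vjp(J, [1.0, 1.0])

# Scalar special case: y = x0 * x1 with dy = 1 gives the ordinary gradient (x1, x0)
J_scalar = [[x1, x0]]
grad = vjp(J_scalar, [1.0])
```

Here ``dx`` combines both rows of the Jacobian weighted by ``dy``, while ``grad`` is the single-row case that corresponds to omitting the :math:`dy` parameter.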
:meth:`backward` consumes all resources held by this GradManager and releases them in the
process of this call. When the call successfully finishes, the GradManager will be put back
to an inactive state.
Args:
y: tensor or list of tensors
dy: tensor or list of tensors. Defaults to 1 if y is scalar
"""
global
=
=
assert is not None
# These checks should be consistent with GradScaler's
=
=
=
=
=
=
=
=
= and
=
+=
=
r"""Start recording operations
After this call, you will be able to call :meth:`backward`.
"""
=
= True
=
=
return
=
=
# NOTE: overrides the previous callback when called several times
r"""Stop recording operations and release resources kept for gradient computation
After this call, you will not be able to call :meth:`backward`.
"""
= None
= False
=
return
return
return NotImplemented
=
=
=
return NotImplemented
return
=
=
assert is not None
assert is None