[inductor] dont reuse buffers if it affects peak (#145883) #159530
base: gh/v0i0/4/base
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159530
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure) As of commit 45698e2 with merge base 2507ae6. UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "topic: not user facing"
torch/_inductor/codegen/wrapper.py
Outdated
) * get_dtype_size(self.node.get_dtype())
if free_line_scheduler_node >= self_scheduler_node:
    return False
peak_memory_in_range = max(
After we reuse a buffer, we need to update the memory of the nodes in its reuse window.
torch/_inductor/codegen/wrapper.py
Outdated
    peak_memory_per_scheduler_node[free_line_scheduler_node:self_scheduler_node]
)
This will potentially be O(n^2), because at each node we iterate through O(n) nodes.
Is there an O(n log n) solution we could use? From looking around a bit, maybe https://en.wikipedia.org/wiki/Fenwick_tree? Note: I haven't looked especially closely at this yet.
If we can't figure out an O(n log n) solution, we could also do a sliding window, or add other heuristics like disallowing buffer reuse for tensors above a certain size.
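For reference, a hypothetical sketch of the O(n^2) pattern being flagged here (names are illustrative, not the PR's actual code):

```python
# Hypothetical sketch of the O(n^2) pattern under discussion: each reuse
# decision scans the whole window between the freeing node and the current
# node, so n decisions over O(n)-long windows cost O(n^2) in total.
def peak_in_range_naive(peak_per_node, lo, hi):
    # O(hi - lo) work per call
    return max(peak_per_node[lo:hi])

peaks = [3, 7, 2, 9, 4]
assert peak_in_range_naive(peaks, 1, 4) == 9
```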
Looks good, a couple of questions about the tree. Would you mind doing one dashboard run? I believe we expect to see memory improvements in the timm benchmark.
if lazy_node is not None:
    # Apply lazy update to current node
nit: early return instead of nesting ?
self.size = 1
while self.size < self.n:
    self.size *= 2
self.size *= 2
Is it worth describing the data layout of the tree?
[1 ... n] base values, then [1 ... n//2], [n//2 ... n//4]?
The opposite (summaries first, then base values). Will add a comment.
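A minimal sketch of the layout being described, assuming the usual 1-indexed implicit-heap arrangement (summaries first, base values last); all names here are illustrative:

```python
# Minimal layout sketch: tree[1] is the root, node i's children are
# 2*i and 2*i + 1, and the leaves (base values) occupy [size, 2*size).
size = 4  # leaf count, padded up to a power of two
values = [3, 1, 4, 1]
tree = [0] * (2 * size)
tree[size : size + len(values)] = values        # base values go last
for node in range(size - 1, 0, -1):             # summaries fill earlier slots
    tree[node] = max(tree[2 * node], tree[2 * node + 1])
assert tree[1] == 4   # root summarizes all leaves
assert tree[2] == 3   # left half: max(3, 1)
```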
# Initialize tree and lazy arrays
self.tree = [identity_element] * self.size
self.lazy: list[Optional[T]] = [None] * self.size
nit: describe what the lazy array will do ?
added a comment
self._build(values, right_child, mid + 1, end)

# Update current node with summary of children
self.tree[node] = self.summary_op(self.tree[left_child], self.tree[right_child])
Nice to use the identity element instead of having to handle special cases.
done
Oh, I meant: I think this is what the code is already doing, right?
) * get_dtype_size(self.node.get_dtype())
if self.should_reuse_buffer(free_line, size):
    free_line.is_reused = True
    self.wrapper.estimate_peak.update_peak_between(free_line, self)
So I guess the difference is: it used to be that queries were O(n), and now just updates are O(n). Is that correct?
Both are O(log n). Consider the binary tree of nodes and an interval between two nodes: to process the interval, we look at the paths from the tree root down to the left and right boundaries of the interval (2 * log n nodes). Updates modify those nodes and their direct neighbors within the interval; queries read those nodes and potentially push lazy updates down to the direct neighbors. So both are 2 * 2 * log2(n).
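To make this concrete, here is a generic lazy-propagation segment tree sketch (range add, range max). The class name and the choice of add/max as update/summary ops are illustrative assumptions, not the PR's actual implementation:

```python
# Generic lazy segment tree sketch: both update and query recurse down
# at most the two root-to-boundary paths of the interval, so each is
# O(log n). Leaves sit at [size, 2*size); node i's children are 2i, 2i+1.
class LazySegTree:
    def __init__(self, values):
        self.n = len(values)
        self.size = 1
        while self.size < self.n:
            self.size *= 2
        self.tree = [0] * (2 * self.size)
        self.lazy = [0] * (2 * self.size)   # pending adds not yet pushed down
        for i, v in enumerate(values):
            self.tree[self.size + i] = v
        for i in range(self.size - 1, 0, -1):
            self.tree[i] = max(self.tree[2 * i], self.tree[2 * i + 1])

    def _push(self, node):
        # Push a pending add down to both children.
        if self.lazy[node]:
            for child in (2 * node, 2 * node + 1):
                self.tree[child] += self.lazy[node]
                self.lazy[child] += self.lazy[node]
            self.lazy[node] = 0

    def update(self, lo, hi, delta, node=1, start=0, end=None):
        # Add delta to values[lo:hi].
        if end is None:
            end = self.size
        if hi <= start or end <= lo:
            return
        if lo <= start and end <= hi:
            self.tree[node] += delta
            self.lazy[node] += delta
            return
        self._push(node)
        mid = (start + end) // 2
        self.update(lo, hi, delta, 2 * node, start, mid)
        self.update(lo, hi, delta, 2 * node + 1, mid, end)
        self.tree[node] = max(self.tree[2 * node], self.tree[2 * node + 1])

    def query(self, lo, hi, node=1, start=0, end=None):
        # Max of values[lo:hi].
        if end is None:
            end = self.size
        if hi <= start or end <= lo:
            return float("-inf")
        if lo <= start and end <= hi:
            return self.tree[node]
        self._push(node)
        mid = (start + end) // 2
        return max(
            self.query(lo, hi, 2 * node, start, mid),
            self.query(lo, hi, 2 * node + 1, mid, end),
        )

t = LazySegTree([3, 1, 4, 1])
t.update(1, 3, 10)          # values are now [3, 11, 14, 1]
assert t.query(0, 4) == 14
assert t.query(1, 2) == 11
```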
    return

mid = (start + end) // 2
left_child = 2 * node
Thinking aloud: do we need both _push_lazy and build? I guess build could be replaced by iteratively applying _push_lazy to each element in values?
I'd keep it as is; replacing build with update_range/push_lazy turns it from O(n) into O(n log n).
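A back-of-the-envelope cost model for this trade-off (pure arithmetic; the helper names are made up for illustration):

```python
# The bottom-up build visits each of the ~2n tree slots once (O(n)),
# while building via n point updates walks a root-to-leaf path of
# length log2(n) for each element (O(n log n)).
import math

def build_visits(n):
    return 2 * n

def update_visits(n):
    return n * int(math.log2(n))

assert build_visits(1024) == 2048
assert update_visits(1024) == 10240  # 5x more work already at n = 1024
```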
left_child = 2 * node
right_child = 2 * node + 1

# Propagate to children
lazy_left_child = self.lazy[left_child]
if lazy_left_child is None:
    self.lazy[left_child] = lazy_node
else:
    self.lazy[left_child] = self.update_op(lazy_left_child, lazy_node)

lazy_right_child = self.lazy[right_child]
if lazy_right_child is None:
    self.lazy[right_child] = lazy_node
else:
    self.lazy[right_child] = self.update_op(lazy_right_child, lazy_node)
nit: for loop over left/right?
done
lazy_left_child = self.lazy[left_child]
if lazy_left_child is None:
    self.lazy[left_child] = value
else:
    self.lazy[left_child] = self.update_op(lazy_left_child, value)

lazy_right_child = self.lazy[right_child]
if lazy_right_child is None:
    self.lazy[right_child] = value
else:
    self.lazy[right_child] = self.update_op(lazy_right_child, value)
nit: for loop? Also, this seems the same as lines 87-104 above. Refactor?
Dashboard run here: https://github.com/pytorch/pytorch/actions/runs/16895400336
self.overall_peak_memory, peak_by_scheduler_node = estimate_peak_memory(
    V.graph.scheduler.nodes,
    {},
Sorry, one last thing: names_to_freeable_bufs is important for the backward, where we need to know which activations will deallocate in order to have an accurate memory estimate.
Stack from ghstack (oldest at bottom):
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben