
Avoid potential deadlocks in host allocator #159352


Open · wants to merge 4 commits into base: gh/guangyey/174/base

Conversation

guangyey (Collaborator) commented Jul 29, 2025

Stack from ghstack (oldest at bottom):

Motivation

This PR fixes a potential deadlock in the host allocator.
When calling event->record(stream), the record_stream implementation may acquire the Python GIL.
In places such as

void record_stream(
    std::optional<std::vector<EventPool::Event>>& events,
    CUDAStream stream) override {
  auto event = create_event_internal(stream.device_index());
  event->record(stream);
  events->push_back(std::move(event));
}
and
void record_stream(
    std::optional<std::vector<XPUEvent>>& events,
    XPUStream stream) override {
  XPUEvent event;
  event.record(stream);
  events->push_back(std::move(event));
}
record_stream is invoked while the allocator lock is held.

To prevent deadlocks, we must ensure a consistent lock order:
GIL → Allocator Lock.
If one thread acquires in the reverse order (Allocator Lock → GIL) while another follows the forward order, each thread ends up waiting on the lock the other holds.
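
To make the hazard concrete, here is a minimal standalone sketch of the lock-order inversion (the mutexes gil and alloc_mutex are illustrative stand-ins, not PyTorch symbols):

#include <mutex>
#include <thread>

std::mutex gil;         // stands in for the Python GIL
std::mutex alloc_mutex; // stands in for the host-allocator lock

void python_thread() {
  std::lock_guard<std::mutex> g(gil);         // running Python code: GIL held
  std::lock_guard<std::mutex> a(alloc_mutex); // e.g. freeing pinned memory: GIL -> allocator
}

void backend_thread() {
  std::lock_guard<std::mutex> a(alloc_mutex); // processing events: allocator lock held
  std::lock_guard<std::mutex> g(gil);         // tracing into Python: allocator -> GIL
}

int main() {
  // If each thread grabs its first lock before the other releases it,
  // both block forever on their second lock: a deadlock.
  std::thread t1(python_thread), t2(backend_thread);
  t1.join();
  t2.join();
}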

pytorch-bot (bot) commented Jul 29, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159352

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 89aa54d with merge base 05c19d1:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

block->event_count_ += events->size();
// Move out streams to avoid holding the mutex during event recording
streams = std::move(block->streams_);
block->streams_.clear();
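
For context, the shape of the fix around this hunk is roughly the following sketch (Block, Stream, allocator_mutex_, and process_events are simplified illustrative names, not the exact PR code); the review comments below concern the block->streams_.clear() line in the hunk above:

#include <mutex>
#include <vector>

struct Stream {};                        // stand-in for a device stream
struct Block { std::vector<Stream> streams_; };

std::mutex allocator_mutex_;             // hypothetical allocator lock

void record_stream(Stream&) { /* may acquire the GIL in the real code */ }

void process_events(Block* block) {
  std::vector<Stream> streams;
  {
    std::lock_guard<std::mutex> lock(allocator_mutex_);
    streams = std::move(block->streams_); // take ownership under the lock
    block->streams_.clear();              // moved-from but valid; now guaranteed empty
  }                                       // allocator lock released here
  for (auto& s : streams) {
    record_stream(s);                     // GIL may be taken; no allocator lock held
  }
}

int main() {
  Block b;
  b.streams_.resize(2);
  process_events(&b);
}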
Collaborator commented:

The moved-from block->streams_ is used again on the very next line.

Collaborator commented:

This is actually fine per the spec: std::move leaves the source in a valid but unspecified state, so clear() is safe to call.
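
A tiny standalone example of the rule being cited here (a moved-from standard container is in a valid but unspecified state, and member functions without preconditions, such as clear(), remain safe to call):

#include <cassert>
#include <utility>
#include <vector>

int main() {
  std::vector<int> src{1, 2, 3};
  std::vector<int> dst = std::move(src); // src: valid but unspecified state
  src.clear();                           // no preconditions, always well-defined
  assert(src.empty());                   // guaranteed after clear()
  assert(dst.size() == 3);               // contents now live in dst
}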

guangyey requested a review from albanD July 29, 2025 08:38
guangyey added 2 commits July 29, 2025 15:42
[ghstack-poisoned]
[ghstack-poisoned]
Skylion007 previously approved these changes Jul 29, 2025

Skylion007 dismissed their stale review July 29, 2025 16:57

Didn't mean to approve, just comment

EikanWang (Collaborator) commented:

@guangyey, could you help rebase the stack to make the CI signal green?

guangyey (Collaborator, Author) commented Aug 8, 2025

@guangyey, could you help rebase the stack to make the CI signal green?

OK.

[ghstack-poisoned]
guangyey added a commit that referenced this pull request Aug 11, 2025
ghstack-source-id: 17abfd7
Pull Request resolved: #159352
guangyey requested a review from ezyang August 11, 2025 02:44
[ghstack-poisoned]
albanD (Collaborator) commented Aug 11, 2025

Can you clarify when the GIL would get grabbed in this case?
I feel like event recording is a common thing to do in our backend, and it is done without much care for the Python world.

guangyey (Collaborator, Author) commented Aug 11, 2025

Can you clarify when the GIL would get grabbed in this case?
I feel like event recording is a common thing to do in our backend, and it is done without much care for the Python world.

@albanD Take CUDA as an example. Here, the GIL gets grabbed via trace_gpu_event_record:

const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace();
if (C10_UNLIKELY(interp)) {
  (*interp)->trace_gpu_event_record(
      at::kCUDA,
      reinterpret_cast<uintptr_t>(event_),
      reinterpret_cast<uintptr_t>(stream.stream()));
}

which forwards to CONCRETE_GPU_TRACE:

void trace_gpu_event_record(
    at::DeviceType device_type,
    uintptr_t event,
    uintptr_t stream) const override {
  CONCRETE_GPU_TRACE(device_type, "EventRecordCallbacks", event, stream);
}

whose expansion then tries to acquire the GIL:
#define CONCRETE_GPU_TRACE(device_type, func_name, ...)                  \
  at::impl::MaybeSetTLSOnEntryGuard guard;                               \
  if (Py_IsInitialized()) {                                              \
    pybind11::gil_scoped_acquire gil;                                    \
    try {                                                                \
      /* Masquerade hip as cuda because hip uses `torch.cuda` module. */ \
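
Putting the pieces together, the problematic call chain under the old code looks roughly like this (a reconstruction for illustration, not verbatim PyTorch source):

// Backend thread, old behavior:
//   allocator:    std::lock_guard<std::mutex> lock(mutex_); // allocator lock taken
//   allocator:    record_stream(events, stream);            // still under the lock
//   event:          event->record(stream);                  // fires the GPUTrace hook
//   trace hook:       trace_gpu_event_record(...);          // expands CONCRETE_GPU_TRACE
//   macro:              pybind11::gil_scoped_acquire gil;   // blocks until the GIL is free
//
// Python thread, concurrently: holds the GIL and calls back into the
// allocator (e.g. releasing pinned host memory), blocking on mutex_.
// Each thread now waits on the lock the other holds: deadlock.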

Labels
ciflow/trunk · ciflow/xpu · open source · topic: not user facing
Projects
None yet

6 participants