fix: prevent EDEADLK self-join in ~CurlHttpOperation on async-thread destruction#1481
Open
bmehta001 wants to merge 7 commits into
Open
fix: prevent EDEADLK self-join in ~CurlHttpOperation on async-thread destruction#1481bmehta001 wants to merge 7 commits into
bmehta001 wants to merge 7 commits into
Conversation
Under -Werror on Linux/macOS, the modules-repo CI (build-posix-latest-exp)
has been failing for ~2 weeks with:
config-default.h:36: error: 'HAVE_MAT_LIVEEVENTINSPECTOR' macro redefined
[-Werror,-Wmacro-redefined]
config-default.h:37: error: 'HAVE_MAT_PRIVACYGUARD' macro redefined
tests/functests/CMakeLists.txt and tests/unittests/CMakeLists.txt add
-DHAVE_MAT_LIVEEVENTINSPECTOR / -DHAVE_MAT_PRIVACYGUARD on the command
line when BUILD_LIVEEVENTINSPECTOR / BUILD_PRIVACYGUARD (default YES) and
the respective module dir exists. The three config-default headers then
redefined them unconditionally, which is fatal under -Werror (added by
microsoft#1415).
Wrapping the two defines in #ifndef in all three config-default*.h
headers preserves all existing behavior:
- Without command-line -D: macros get defined here as before.
- With command-line -D: header skips the redefinition, no warning.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…destruction
The modules-repo CI test ECSClientFuncTests.GetConfigs (and every
test in the ECSClientFuncTests suite) crashed on Linux/macOS with:
terminate called after throwing an instance of 'std::system_error'
what(): Resource deadlock avoided
Aborted (core dumped)
Root cause
==========
SendAsync() runs Send() + the user callback on a std::async worker
thread. The callback owns a strong ref to CurlHttpOperation, so when
it releases the last ref the ~CurlHttpOperation destructor runs on
the async thread itself.
libstdc++'s std::future<>::~future implicitly calls
_Async_state_impl::~_Async_state_impl, which calls _M_complete_async
-> _M_join via std::call_once. On the async thread that's a self-join;
call_once throws std::system_error(EDEADLK). Because the throw escapes
a noexcept destructor, terminate() aborts the process. A try/catch
around the future cannot rescue this — destructors of std::future are
noexcept.
Fix
===
Move the future onto a detached helper thread before its destructor
runs. The helper is by definition NOT the async thread (we'd only be
on the async thread if its work already finished), so the implicit
join completes immediately. On the common path (destruction from the
caller thread) it costs one short-lived thread spawn that exits in
microseconds.
Verified locally with sister + modules linked: all 113 FuncTests pass,
including all 25 ECSClientFuncTests (which include the formerly-fatal
GetConfigs).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ession)
Code review found that the previous fix detached the async future's join for
EVERY destruction. That removed the cross-thread lifetime guarantee the old
result.wait() provided: when the operation is destroyed from another thread
while the async Send() is still running, the destructor would proceed to
curl_easy_cleanup()/ReleaseResponse() and destroy the by-reference request body
while the worker thread is still using them -> use-after-free.
Restore the guarantee while keeping the EDEADLK self-join fix:
- Record the async task's thread id (atomic) when SendAsync's task starts.
- In the destructor, compare std::this_thread::get_id():
* self-join (destroyed from within our own async callback, e.g. EraseRequest
drops the last reference): the work is necessarily complete, so defer the
future's join to a detached helper thread instead of joining on this (the
async) thread, avoiding EDEADLK.
* cross-thread: result.wait() to keep the curl handle, response buffer and
by-reference request body alive until the async Send() finishes.
- Heap-allocate the deferred future first so a rare std::thread spawn failure
leaks the already-finished future rather than self-joining (EDEADLK) or
letting std::system_error escape this noexcept destructor (std::terminate).
- Refresh the stale HttpClient_Curl.cpp lifetime comment.
Logic validated with a standalone C++11 repro under AddressSanitizer: the
cross-thread path waits (no UAF) and the self-join path does not deadlock.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor
There was a problem hiding this comment.
Pull request overview
This PR addresses a Linux/macOS crash caused by std::future’s implicit join throwing on self-join when ~CurlHttpOperation runs on the same async thread created by std::async.
Changes:
- Adds tracking of the async task’s thread id to detect destructor execution on the async thread.
- Updates
~CurlHttpOperationto avoid self-join by deferringstd::futuredestruction to a detached helper thread. - Updates comments in the curl HTTP client to document the lifetime/join behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| lib/http/HttpClient_Curl.hpp | Adds async thread id tracking and modifies destructor logic to avoid std::future self-join aborts. |
| lib/http/HttpClient_Curl.cpp | Updates documentation comment describing the lifetime guarantees of CurlHttpOperation across async send. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…row-new failure - Add #include <new> so std::nothrow is not relied on transitively (review). - If new (std::nothrow) returns nullptr (OOM), result stays valid and would self-join (EDEADLK) at end of the noexcept dtor; abort() as a last resort instead of falling through to that, per review. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…fix lifetime comments - Replace std::atomic<std::thread::id> (not guaranteed supported across standard libraries) with a plain std::thread::id published via an std::atomic<bool> flag using release/acquire ordering. - Correct the lifetime comments: the operation's last shared_ptr is held by the owning CurlHttpRequest (via SetOperation), not by EraseRequest (which only removes the raw id from m_requests). The self-join occurs when the async callback leads to that request being destroyed on the async thread (OnHttpResponse -> EventsUploadContext::clear()). Re-validated the wait-vs-detach logic with a standalone C++11 repro under AddressSanitizer + UBSan. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… reset flag on reuse - Reword the self-join comment: in that case Send() has returned (we are in its callback) but the async task itself has not yet returned (the destructor runs inside it), so the deferred helper's ~future join completes only after this destructor unwinds. Avoids implying the async task is already finished. - Reset m_asyncThreadIdSet to false at the start of SendAsync so self-join detection stays correct if the operation were ever reused (it is single-use today: one SendAsync per request). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
modules-repo(cpp_client_telemetry_modules) CI on Linux + macOS crashes during the firstECSClientFuncTeststest with:Affects every PR that gets past compile (e.g. modules #277).
Root cause
The callback fired from the async thread is the last owner of the
CurlHttpOperation. When it releases itsshared_ptr,~CurlHttpOperationruns on that same async thread.~CurlHttpOperationthen destroys itsstd::future<long> resultmember. libstdc++'s~future->~_Async_state_implcalls_M_complete_async->_M_joinviastd::call_once. Joining the async thread from the async thread is a self-join —call_oncethrowsstd::system_error("Resource deadlock avoided"). Because the throw escapes a noexcept destructor,terminate()aborts the process.A
try/catcharound the future or even around an explicitresult.wait()cannot rescue this:std::future's destructor itself isnoexcept, so the throw lands at the noexcept boundary before any usercatchcan engage.GDB confirms (modules CI reproduction, Ubuntu 24.04, libstdc++ 13):
Fix
Move the future onto a detached helper thread before it can destruct on the async thread. The helper is by definition not the async thread (we'd only be on the async thread if its work has already completed, which is the only way
~CurlHttpOperationcould run there), so the implicit join completes in microseconds.On the common path — destruction from the caller thread — the helper thread spawn is also nearly free: the implicit join sees the work is done and returns immediately.
Verification
Verified on Linux GCC 13 with sister-main + modules-#277 linked (the exact configuration that crashed):
ECSClientFuncTests.GetConfigsaborts at first test in suite.ECSClientFuncTestspass; all 113FuncTestspass ([ PASSED ] 113 tests).Why this hasn't been seen on sister CI
The crashing path is
ECSClientFuncTests, which is in the modules repo (lib/modules/exp/tests/functests/). Sister CI doesn't buildlib/modules, so the crash is invisible there. It's only exposed when sister is linked with modules (i.e. the modules-repo CI configuration). That's why master modules CI has been red on Linux/macOS for ~50 consecutive runs — the redefine bug (#1473, merged) was masking this one. Once #1473 cleared, this surfaced as the next gate.Risk: low. One short-lived thread per
~CurlHttpOperationinvocation whenresultis still valid. Pure additive — no API change, no behavior change for already-joined futures.