Skip to content

Fix runtime data races, memory leak, and shutdown safety#1429

Open
bmehta001 wants to merge 25 commits into
microsoft:mainfrom
bmehta001:bhamehta/runtime-fixes
Open

Fix runtime data races, memory leak, and shutdown safety#1429
bmehta001 wants to merge 25 commits into
microsoft:mainfrom
bmehta001:bhamehta/runtime-fixes

Conversation

@bmehta001

@bmehta001 bmehta001 commented Apr 28, 2026

Copy link
Copy Markdown
Contributor

Fixes runtime thread-safety, shutdown-safety, and response-lifetime issues that affect normal SDK operation. Split out from #1415 per reviewer request so runtime behavior changes stay separate from CI/build/test fixes.

Fix HTTP cancellation and callback tracking

HttpClient_Apple.mm

  • Scope cancellation to the current request’s m_dataTask instead of canceling every task on the shared static NSURLSession.
  • Fix the CancelAllRequests() wait loop so m_requests.empty() is checked while holding m_requestsMtx.

HttpClientManager.cpp

  • Fix the cancelAllRequests() wait loop so m_httpCallbacks.empty() is checked while holding m_httpCallbacksMtx.

Fix HTTP response lifetime on abort/network-failure paths

HttpResponseDecoder.cpp

  • Preserve ctx->httpResponse through requestAborted(ctx) and temporaryNetworkFailure(ctx) so downstream storage/statistics handlers can still read status and headers.
  • Avoid leaking aborted/network-failure responses by keeping ownership with EventsUploadContext, whose clear() path deletes the response.

HttpResponseDecoderTests.cpp

  • Add regression coverage that aborted and network-failure decode routes still receive the response object with result/status intact.

Fix WorkerThread shutdown safety

WorkerThread.cpp

  • Add an explicit shutdown gate so late Queue() calls are rejected once teardown starts.
  • Enqueue the shutdown sentinel only once under the queue lock.
  • Clean up pending queued/timer tasks only after a successful join(), and avoid cleanup after detach() because the worker may still be running.
  • Log join/detach failures instead of silently swallowing all exceptions.

Make TransmissionPolicyManager scheduling consistently mutex-guarded

TransmissionPolicyManager.cpp / .hpp

  • Guard m_isUploadScheduled, m_runningLatency, and m_scheduledUploadTime consistently with m_scheduledUploadMutex.
  • Avoid holding m_scheduledUploadMutex across potentially blocking cancellation during stop/shutdown.
  • Keep force/zero-delay no-wait cancellation under the scheduler mutex so a competing delayed schedule cannot suppress an immediate upload.
  • Use an explicit std::chrono::milliseconds value in the bandwidth-controller reschedule path.

TransmissionPolicyManagerTests.cpp

  • Add regression coverage for the force/zero-delay scheduling race.

Fix Logger static-destruction-order crash

Logger.cpp

  • Remove destructor logging from Logger::~Logger() because it can run after logging infrastructure has already been destroyed, causing crashes during static teardown.

Known parity gap (separate repo, follow-up): AIHttpResponseDecoder::handleDecode in the lib/modules submodule (lib/modules/azmon/AIHttpResponseDecoder.cpp:105,118) still sets ctx->httpResponse = nullptr; on the aborted/failure paths — the same response-lifetime leak fixed here in lib/http/HttpResponseDecoder.cpp. It lives in a different repository (the lib/modules submodule), so it must be fixed there and pulled in via a submodule bump; it is out of scope for this PR.

bmehta001 and others added 4 commits April 29, 2026 14:41
- HttpClient_Apple: scope Cancel() to m_dataTask only instead of
  blanket-cancelling every task on the shared session. Fix torn read
  on m_requests.empty() in CancelAllRequests spin loop.
- HttpClientManager: fix torn read on m_httpCallbacks.empty() in
  cancelAllRequests spin loop — read under lock.
- HttpResponseDecoder: add missing delete ctx->httpResponse before
  nullptr in Abort and RetryNetwork paths (memory leak).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Only delete queued tasks after successful join (not after detach,
  where the thread may still access them — undefined behavior)
- Replace catch(...) with std::system_error and std::exception handlers
  that log error code and message
- Log pending queue sizes in both join and detach paths

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Both variables are read and written from different threads during
normal upload scheduling. Declare as std::atomic to eliminate data
races per the C++ memory model. Add .load() for variadic LOG_TRACE
calls. Add comment explaining why unlocked stores in uploadAsync
are safe.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove LOG_TRACE from Logger destructor — it triggers a crash on iOS
simulator when the recursive_mutex used by logging has already been
destroyed during static destruction.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bmehta001 bmehta001 force-pushed the bhamehta/runtime-fixes branch from 3f0289c to de46cb2 Compare April 29, 2026 21:42
Reject new worker-thread tasks once shutdown starts so queue cleanup
cannot race with late producers, and move the TPM scheduled-upload
state back under a single mutex so latency/next-upload decisions stay
consistent without mixed atomic and mutex access.

Files changed:
- lib/pal/WorkerThread.cpp
- lib/tpm/TransmissionPolicyManager.cpp
- lib/tpm/TransmissionPolicyManager.hpp

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR targets runtime correctness in the SDK by addressing thread-safety issues, shutdown safety, and a per-request memory leak in HTTP response handling.

Changes:

  • Tighten HTTP request cancellation scoping and fix torn reads in request/callback tracking loops.
  • Fix a SimpleHttpResponse leak on aborted/network-failure decode paths.
  • Rework worker-thread shutdown behavior to avoid unsafe queue cleanup after detach() and improve error logging.
  • Refactor TransmissionPolicyManager scheduling state synchronization (mutex + new cancelUploadTaskLocked() helper).

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
lib/tpm/TransmissionPolicyManager.hpp Changes upload-scheduling state fields and adds a locked cancellation helper declaration.
lib/tpm/TransmissionPolicyManager.cpp Moves upload-scheduling state access under a mutex and adjusts cancellation/scheduling flow.
lib/pal/WorkerThread.cpp Makes shutdown/join behavior safer and improves exception handling/logging during join/detach.
lib/http/HttpResponseDecoder.cpp Deletes ctx->httpResponse on Abort/RetryNetwork paths to prevent leaks.
lib/http/HttpClient_Apple.mm Limits cancellation to the instance’s task and fixes a torn read in the shutdown wait loop.
lib/http/HttpClientManager.cpp Fixes a torn read in the shutdown wait loop by locking around empty-check.
lib/api/Logger.cpp Removes destructor logging to avoid iOS static-destruction-order crash.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread lib/tpm/TransmissionPolicyManager.cpp Outdated
Comment thread lib/tpm/TransmissionPolicyManager.hpp
Comment thread lib/tpm/TransmissionPolicyManager.cpp Outdated
Keep the scheduled-upload state mutex-based, but stop holding
m_scheduledUploadMutex across DeferredCallbackHandle::Cancel so
shutdown and pause paths do not block uploadAsync behind the same
lock. While touching the path, use std::chrono::milliseconds for the
bandwidth-controller reschedule call so ENABLE_BW_CONTROLLER builds
cleanly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bmehta001 added a commit to bmehta001/cpp_client_telemetry that referenced this pull request May 1, 2026
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bmehta001 bmehta001 self-assigned this May 3, 2026
@bmehta001 bmehta001 requested a review from Copilot May 3, 2026 05:46

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread lib/http/HttpClientManager.cpp Outdated
Comment thread lib/pal/WorkerThread.cpp Outdated
Comment thread lib/pal/WorkerThread.cpp
Comment thread lib/tpm/TransmissionPolicyManager.cpp Outdated
Comment thread lib/tpm/TransmissionPolicyManager.cpp Outdated

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread lib/tpm/TransmissionPolicyManager.cpp Outdated
Comment thread lib/http/HttpResponseDecoder.cpp Outdated
Comment thread lib/http/HttpResponseDecoder.cpp Outdated
bmehta001 and others added 6 commits May 4, 2026 07:26
Keep forced upload scheduling atomic around no-wait cancellation and preserve HTTP responses until downstream abort/network-failure handlers finish.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When scheduleUpload is called with force=true (or zero delay) and the
previously scheduled upload task is currently executing on the worker,
the no-wait cancel returns false and m_isUploadScheduled stays set. The
existing m_isUploadScheduled check then skipped scheduling a new task,
silently dropping the requested latency for force-scheduled profile
changes.

Propagate the requested latency to m_runningLatency under the same
mutex when this race occurs. uploadAsync re-reads m_runningLatency
inside its own LOCKGUARD, so a task that hasn't yet entered that
critical section will pick up the new latency. If uploadAsync has
already cleared m_isUploadScheduled (past its LOCKGUARD), the existing
fallthrough at line 184 schedules a fresh task with the new latency.

Add a regression test using a fake dispatcher whose Cancel always
returns false, asserting that a force-scheduled call updates
m_runningLatency without enqueueing a duplicate task.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Use the existing LOCKGUARD helper because scheduled upload cancellation does not need movable lock ownership.

Consolidate the duplicated Issue 388 cancellation note so the PR keeps the remaining limitation documented without repeating the same TODO.

Files changed:

- lib/tpm/TransmissionPolicyManager.cpp

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace tight current-time assertions with a direct comparison against the original delayed schedule time.

This keeps coverage for the forced immediate upload race while reducing timing sensitivity in CI.

Files changed:

- tests/unittests/TransmissionPolicyManagerTests.cpp

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Restore the existing Issue 388 wording in the remaining cancellation comment while keeping the duplicated helper comment removed.

Files changed:

- lib/tpm/TransmissionPolicyManager.cpp

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bmehta001 bmehta001 requested a review from Copilot May 11, 2026 16:57

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Comment thread lib/tpm/TransmissionPolicyManager.cpp Outdated
Comment thread lib/pal/WorkerThread.cpp Outdated
Comment thread tests/unittests/TransmissionPolicyManagerTests.cpp
Fix printf-style logging arguments for scheduled upload delays and queued worker task pointers.

Ensure the blocking cancel test releases the dispatcher before failing so async futures cannot hang the test runner.

Files changed:

- lib/pal/WorkerThread.cpp

- lib/tpm/TransmissionPolicyManager.cpp

- tests/unittests/TransmissionPolicyManagerTests.cpp

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bmehta001 bmehta001 force-pushed the bhamehta/runtime-fixes branch from 4178021 to d357260 Compare May 12, 2026 23:33
@bmehta001

bmehta001 commented May 20, 2026

Copy link
Copy Markdown
Contributor Author

Added a unit test for the issue from #1437 we are resolving with 100% CPU spin so that this PR can be verified before/after (it's somewhat contrived but does the job):

 ./out/tests/unittests/UnitTests \
   --gtest_filter=HttpClientManagerSpinReproTests.CancelAllRequestsDoesNotSpinWhileResponseCallbackHoldsCallbackLock

It uses the real HttpClientManager response path. The test blocks a downstream requestDone sink while HttpClientManager::onHttpResponse() is holding m_httpCallbacksMtx and before it removes the callback from m_httpCallbacks.

Before this PR, cancelAllRequests() ignored that mutex and yield-spun on m_httpCallbacks.empty() while the callback list is still non-empty. With this PR, cancelAllRequests() checks the callback list under the same mutex, so it blocks on the mutex instead of burning CPU.

Local validation on macOS arm64:

  • bhamehta/repro-1437-spin-main-20260520: repro fails as expected; the cancel thread consumed ~240ms CPU during a 250ms blocked response window.
  • bhamehta/repro-1437-spin-runtime-fixes-20260520: same repro passes.

This is a deterministic unit-level reproduction of the spin mechanism, not an end-to-end customer scenario, but it exercises the actual HttpClientManager::onHttpResponse() and cancelAllRequests() interaction fixed by this PR.

bmehta001 added a commit to bmehta001/cpp_client_telemetry that referenced this pull request May 22, 2026
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bmehta001 bmehta001 requested a review from Copilot May 28, 2026 17:23

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated no new comments.

bmehta001 and others added 3 commits June 3, 2026 18:50
# Conflicts:
#	lib/http/HttpClientManager.cpp
#	lib/http/HttpClient_Apple.mm
Addresses review feedback on microsoft#1429: scheduleTask() returned a
DeferredCallbackHandle holding the task pointer even when WorkerThread::Queue()
had already deleted the task during shutdown. Cancel() only pointer-compares
today, but the stale pointer is fragile -- a reused heap address could make
Cancel() match and cancel the wrong task (ABA), any future deref would be a
use-after-free, and the caller got no signal that scheduling was dropped.

- ITaskDispatcher: add a non-pure virtual QueueWithResult(Task*) reporting
  whether the task was accepted. The default delegates to Queue() and returns
  true, so existing/third-party dispatchers are unaffected (no signature change).
- WorkerThread: implement QueueWithResult (returns false on the shutdown-drop
  path); Queue() now delegates to it.
- scheduleTask(): return an empty DeferredCallbackHandle when the task was not
  queued, so no dangling pointer is retained and Cancel() is a safe no-op.
- Test ScheduleTaskReturnsNoOpHandleWhenTaskDropped verifies the handle is a
  no-op and the dispatcher's Cancel() is never invoked with a freed pointer.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Comment thread lib/tpm/TransmissionPolicyManager.cpp Outdated
Comment thread tests/unittests/TaskDispatcherCAPITests.cpp Outdated
…ame)

- TransmissionPolicyManager.cpp:166: the WAIT LOG_TRACE used %llu with a
  uint64_t 'delta'. On LP64 uint64_t is unsigned long, which mismatches %llu's
  unsigned long long in varargs (technically UB). Cast delta to
  unsigned long long. (m_runningLatency is an EventLatency enum -> promotes to
  int, so %d is correct.)
- TaskDispatcherCAPITests.cpp: the new ScheduleTaskReturnsNoOpHandleWhenTaskDropped
  test used the TaskDispatcherTests suite; rename to the file's existing
  TaskDispatcherCAPITests suite for consistency/discoverability.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated no new comments.

bmehta001 and others added 2 commits June 10, 2026 18:30
…e ABI)

Code review found that inserting the new virtual QueueWithResult between Queue
and Cancel shifted Cancel's vtable slot. ITaskDispatcher is a public,
client-implementable extension point (LogManagerProvider/config accept a custom
std::shared_ptr<ITaskDispatcher>), so a client compiled against the old header
but linked to a newer SDK binary (or vice versa) would dispatch Cancel through
the wrong slot -> undefined behavior.

Move QueueWithResult to the end of the interface (after Cancel) so the
pre-existing virtuals Join/Queue/Cancel keep their slots and only the brand-new
method (which old callers never invoke) adds a slot. No behavior change; the
default still delegates to Queue().

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

Comment thread lib/include/public/ITaskDispatcher.hpp Outdated
…ty in vtable comment

Reword the QueueWithResult doc comment: adding a virtual still grows the vtable
and the SDK gives no general C++ ABI guarantee, so the comment no longer claims
"binary compatibility". It now states the narrower, accurate property -- placing
the new method after Cancel keeps the existing virtuals' slot indices stable so
old call sites are not dispatched through the wrong slot.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

Comment thread lib/pal/WorkerThread.cpp

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated no new comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

100% CPU spin on CancelAllRequests

4 participants