fix(code): unwedge stuck cloud message queues #2088

Draft
VojtechBartos wants to merge 1 commit into main from posthog-code/fix-cloud-queue-wedge

@VojtechBartos
Member

Problem

Users on cloud runs report messages getting permanently stuck in the "queued" state, with no automatic recovery. This was reproduced from a real session log: SSE streams dropped repeatedly with error: 'terminated', the cloud-task watcher exhausted its 5-attempt reconnect budget, and every subsequent follow-up just sat in the queue. The regression was introduced in #1905 (Gate B was added without an SSE-reconnect kick), and #2060 only patched the success path.

What's going on

SessionService.sendCloudPrompt has three queue gates. Two of them, together with the status branch of handleCloudTaskUpdate, could strand a queue forever:

  • Gate B (status !== "connected") — queued the message but never tried to bring the SSE stream back, so no turn_complete ever arrived to trigger a drain. Fix: when status is disconnected or error, fire-and-forget retryCloudTaskWatch(taskId) (already used by the manual "Retry" button). This is the change that actually unwedges the user-reported scenario.
  • Gate A (cloudStatus !== "in_progress") — set isPromptPending: true so the boot-time UI could show a spinner, but the flag was only cleared by turn_complete. A missed turn_complete left the flag stuck and sendQueuedCloudMessages's own !isPromptPending guard then blocked the drain. Fix: drop the eager isPromptPending: true write. The flag now means only "an actual prompt is in flight."
  • handleCloudTaskUpdate status branch — explicitly skipped auto-flush on cloudStatus → in_progress to avoid racing the agent's initial clientConnection.prompt(). That race only exists before run_started flips status to "connected". Fix: if a status update with in_progress arrives and session.status === "connected" and the queue is non-empty, schedule a drain. sendQueuedCloudMessages still bails on isPromptPending, preserving the original race protection.
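
The three fixes above can be sketched as a small, self-contained model. This is an illustration under stated assumptions, not the actual SessionService code: the class name `SessionServiceSketch`, the `"connecting"` and `"not_started"` values, the order in which Gate A and Gate B are checked, the one-message-per-drain behaviour, and the `reconnectKicks` / `delivered` recording arrays are all hypothetical. The third gate is not described here, so it is omitted. The "scheduled" drain is run synchronously to keep the sketch short.

```typescript
// Simplified model of the queue gates; NOT the real SessionService.
// "connecting" and "not_started" are hypothetical values added for illustration.
type SessionStatus = "connecting" | "connected" | "disconnected" | "error";
type CloudStatus = "not_started" | "in_progress";

interface Session {
  status: SessionStatus;
  cloudStatus: CloudStatus;
  isPromptPending: boolean; // after the fix: true only while a prompt is in flight
  queue: string[];
}

class SessionServiceSketch {
  taskId: string;
  session: Session;
  reconnectKicks: string[] = []; // task ids passed to retryCloudTaskWatch
  delivered: string[] = []; // prompts handed to the agent

  constructor(taskId: string, session: Session) {
    this.taskId = taskId;
    this.session = session;
  }

  // Stand-in for the real reconnect call; here it only records the attempt.
  retryCloudTaskWatch(taskId: string): void {
    this.reconnectKicks.push(taskId);
  }

  sendCloudPrompt(text: string): void {
    const s = this.session;
    // Gate A: the run is not in progress yet. Queue the message, but do not
    // set isPromptPending (the eager write is what the fix removes).
    if (s.cloudStatus !== "in_progress") {
      s.queue.push(text);
      return;
    }
    // Gate B: the SSE stream is not connected. Queue the message and, if the
    // stream is down for good, kick a reconnect so turn_complete can arrive.
    if (s.status !== "connected") {
      s.queue.push(text);
      if (s.status === "disconnected" || s.status === "error") {
        this.retryCloudTaskWatch(this.taskId); // fire-and-forget in the real code
      }
      return;
    }
    this.deliver(text);
  }

  // Drain guard: never send while a prompt is already in flight.
  sendQueuedCloudMessages(): void {
    const s = this.session;
    if (s.isPromptPending) return;
    const next = s.queue.shift(); // assumption: one queued message per drain
    if (next !== undefined) this.deliver(next);
  }

  handleCloudTaskUpdate(cloudStatus: CloudStatus): void {
    const s = this.session;
    s.cloudStatus = cloudStatus;
    // Status branch: drain only once the session is already connected, so the
    // agent's initial prompt cannot be raced.
    if (cloudStatus === "in_progress" && s.status === "connected" && s.queue.length > 0) {
      this.sendQueuedCloudMessages();
    }
  }

  handleTurnComplete(): void {
    this.session.isPromptPending = false;
    this.sendQueuedCloudMessages();
  }

  private deliver(text: string): void {
    this.session.isPromptPending = true;
    this.delivered.push(text);
  }
}

// Usage: a follow-up sent after the watcher gave up now kicks a reconnect.
const stuck = new SessionServiceSketch("task-1", {
  status: "error",
  cloudStatus: "in_progress",
  isPromptPending: false,
  queue: [],
});
stuck.sendCloudPrompt("follow-up");
console.log(stuck.reconnectKicks, stuck.session.queue); // one kick, one queued message
```

In this model the message stays queued after the kick. It is drained once the reconnected stream delivers a turn_complete (handleTurnComplete) or a status update arrives while the session is connected.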

Tests

Five new vitest cases cover each path; the full code-app suite (1150 tests) passes.


Created with PostHog Code

Cloud follow-ups got permanently stuck in the local queue when the SSE
watcher exhausted its reconnect budget — Gate B in sendCloudPrompt
queued the message but never restored the SSE stream, so no
turn_complete ever arrived to drain. Two adjacent holes in Gate A and
the cloudStatus handler could also strand a queue on a missed
turn_complete. Each fix is a few lines, all in SessionService.

Generated-By: PostHog Code
Task-Id: 3de9f10b-b668-45c9-8688-eb94b3260be5