fix: terminate infinite retry loop in LoadSkillResourceTool on RESOURCE_NOT_FOUND #5651
Open
Raman369AI wants to merge 1 commit into google:main
Commit: fix: terminate infinite retry loop in LoadSkillResourceTool on RESOURCE_NOT_FOUND

When `load_skill_resource` returned `RESOURCE_NOT_FOUND`, the LLM treated it as a transient soft error and retried the same path indefinitely, producing runaway invocations and large API bills. Two complementary guards are added:

1. Code (`LoadSkillResourceTool.run_async`): an invocation-scoped retry guard tracks already-attempted `(skill, path)` pairs in `tool_context.state` under a `temp:_adk_skill_resource_failed_paths_<id>` key. The `temp:` prefix prevents persistence to durable session storage; the `invocation_id` suffix prevents leakage across invocations on in-memory session backends (where temp keys are not auto-cleared). A second call on the same path within the same invocation returns `RESOURCE_NOT_FOUND_FATAL` with an explicit stop instruction, giving the LLM an unambiguous terminal signal.

2. Prompt (`DEFAULT_SKILL_SYSTEM_INSTRUCTION`): adds a rule prohibiting retrying the same path after any error, and a scope boundary clarifying that `load_skill_resource` is for skill-bundled files only, not for runtime user input (a common source of hallucinated paths).

Neither guard alone is sufficient: a code-only stop produces confusing downstream behavior when the LLM has no semantic understanding of why to stop; a prompt-only guard can be ignored under imperfect system prompts. Both layers are required for defense-in-depth.

Tests cover: repeat-failure escalation, distinct-path soft errors not escalating, cross-invocation isolation with shared session state, and the `temp:` prefix on tracking keys.
Fixes #5652
## Summary
Closes a latent defect in `SkillToolset` that lets the LLM enter an unbounded retry loop when `load_skill_resource` returns `RESOURCE_NOT_FOUND`. Because the only existing backstop is `RunConfig.max_llm_calls` (default 500), a single hallucinated path can quietly burn the entire per-invocation call budget on the same failing tool call before the framework intervenes — and `max_llm_calls` is a global cap on legitimate reasoning, not a defense against this specific failure mode.

This fix adds invocation-scoped termination to the tool itself, plus a complementary system-prompt rule, so the framework no longer relies on a perfect upstream prompt to avoid unbounded loops on a known error path.
## Why this matters

This is a structural problem, not an edge case:

- The `load_skill` response intentionally omits a manifest of available files — that is the agentskills.io progressive-disclosure contract, and it is correct for token economy. But it means the LLM must infer paths from prose instructions inside `SKILL.md`. Inferred paths are routinely wrong, even with strong models.
- `RESOURCE_NOT_FOUND` is returned as a structured soft string. From the model's perspective it looks transient and recoverable, exactly like a flaky network error — so it retries the same path. There is no terminal signal in the current implementation.
- Nothing in `SkillToolset` distinguishes "first miss" from "500th miss". Every retry is treated identically. The loop terminates only when `max_llm_calls` (default 500) is hit, by which point the entire budget has been spent on one wrong path.
- Nothing draws a boundary between skill-bundled files (the intended scope of `load_skill_resource`) and runtime user inputs (e.g., a PDF the user is processing). A model that has just been asked to analyze a runtime document will sometimes route that document through `load_skill_resource`, hit `RESOURCE_NOT_FOUND`, and loop on it — even though the file was never a skill resource to begin with.

The combination — no manifest, soft error, no terminal signal, no scope boundary — means the loop is reachable by ordinary use of the feature, not just adversarial inputs. Any production user with imperfect prompts is exposed.
## What changed

Two layers in one file (`src/google/adk/tools/skill_toolset.py`), plus tests.

### 1. Code: invocation-scoped retry guard in `LoadSkillResourceTool.run_async`

Failed `(skill, path)` pairs are tracked on `tool_context.state` under the key `temp:_adk_skill_resource_failed_paths_<invocation_id>`:

- The `temp:` prefix uses ADK's existing convention so the value is trimmed from the persisted event delta and never reaches durable session storage.
- The `<invocation_id>` suffix ensures correctness on in-memory session backends as well, where `temp:` keys are added to `session.state` and are not auto-cleared between invocations. Without the suffix, a path that legitimately failed in invocation A would spuriously hit the fatal path on its first attempt in invocation B.

Behavior:

- First failure for a given `(skill, path)` within an invocation → returns `RESOURCE_NOT_FOUND` (unchanged).
- A second call on the same pair within the same invocation → returns the `RESOURCE_NOT_FOUND_FATAL` error code, with an explicit "do not retry — report the error and stop" message in the error string. This gives the LLM an unambiguous terminal signal.

The guard is bounded (one entry per unique missing path), invocation-scoped, and adds no overhead on the success path.
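The guard described above can be sketched roughly as follows. This is an illustrative reduction, not the merged ADK code: `record_and_classify_miss`, the plain-`dict` state, and the exact error strings are assumptions for the sketch, and the real logic lives inside `LoadSkillResourceTool.run_async`.

```python
# Sketch of the invocation-scoped retry guard (names are illustrative).

RESOURCE_NOT_FOUND = "RESOURCE_NOT_FOUND"
RESOURCE_NOT_FOUND_FATAL = "RESOURCE_NOT_FOUND_FATAL"


def record_and_classify_miss(
    state: dict, invocation_id: str, skill: str, path: str
) -> str:
  """Returns the error code for a missing resource, escalating on repeats."""
  # temp: prefix -> trimmed from the persisted event delta;
  # invocation_id suffix -> no leakage across invocations on
  # in-memory backends where temp: keys are not auto-cleared.
  key = f"temp:_adk_skill_resource_failed_paths_{invocation_id}"
  failed = state.setdefault(key, set())
  if (skill, path) in failed:
    # Second miss on the same (skill, path) within this invocation:
    # emit the terminal signal so the LLM stops retrying.
    return RESOURCE_NOT_FOUND_FATAL
  failed.add((skill, path))
  return RESOURCE_NOT_FOUND
```

Note the guard only grows by one entry per unique missing path and is never consulted on the success path, matching the bounded-overhead claim above.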
### 2. Prompt: two additions to `_DEFAULT_SKILL_SYSTEM_INSTRUCTION`

- A retry rule: "If `load_skill_resource` returns any error, do not retry the same path. Report the error to the user and stop."
- A scope boundary: `load_skill_resource` is only for skill-bundled files in `references/`, `assets/`, or `scripts/` — not for runtime user input. This addresses the confounding failure mode above.
## Why both layers

Defense-in-depth. Code-only termination produces confusing downstream behavior when the LLM has no semantic understanding of why the tool stopped responding the way it expects. Prompt-only termination relies on the LLM following the rule, which imperfect upstream prompts can override or contradict. Together they are robust.
## Considered and rejected

- Relying on `max_llm_calls`: it is a global cap on legitimate reasoning, not a targeted defense against this failure mode.
- An `after_tool_callback` workaround: it would live outside `SkillToolset`; the framework still ships with the loop.
- Adding an `available_resources` manifest to the L2 `load_skill` response: this would cut against the progressive-disclosure contract that keeps the response cheap in tokens.
- A `list_skill_resources` tool.
## Test plan

New tests in `tests/unittests/tools/test_skill_toolset.py`:

- `test_load_resource_repeated_failure_escalates_to_fatal` — second call on the same missing path returns `RESOURCE_NOT_FOUND_FATAL` with an explicit stop instruction.
- `test_load_resource_different_paths_each_soft_fail` — distinct missing paths each return the soft error (no over-eager cross-path escalation).
- `test_load_resource_failures_isolated_per_invocation` — proves the `<invocation_id>` scoping: failures from invocation A do not escalate the first attempt in invocation B even when sharing the same session state dict.
- `test_load_resource_failed_paths_use_temp_prefix` — proves the `temp:` prefix invariant so tracking keys are never persisted to durable storage.
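A minimal self-contained sketch of what the first test asserts, with `LoadSkillResourceTool` and its ADK plumbing replaced by a stub; all names here are illustrative, not the real test code:

```python
# Stub mimicking the described guard: soft error first, fatal on repeat.
class FakeResourceTool:

  def __init__(self, invocation_id: str):
    self.state = {}  # stand-in for tool_context.state
    self.invocation_id = invocation_id

  def load(self, skill: str, path: str) -> str:
    key = f"temp:_adk_skill_resource_failed_paths_{self.invocation_id}"
    failed = self.state.setdefault(key, set())
    if (skill, path) in failed:
      return ("RESOURCE_NOT_FOUND_FATAL: do not retry this path; "
              "report the error to the user and stop.")
    failed.add((skill, path))
    return "RESOURCE_NOT_FOUND"


def test_repeated_failure_escalates_to_fatal():
  tool = FakeResourceTool("inv-1")
  # First miss: unchanged soft error.
  assert tool.load("pdf_skill", "references/missing.md") == "RESOURCE_NOT_FOUND"
  # Second miss on the same path: escalates to the terminal signal.
  assert tool.load("pdf_skill", "references/missing.md").startswith(
      "RESOURCE_NOT_FOUND_FATAL")
```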
Verification:

- `pyink --check` clean on both modified files.
- `isort --check-only` clean on both modified files.
- `mypy src/google/adk/tools/skill_toolset.py` reports the same 17 pre-existing errors as `main`; zero new errors introduced.
## Backwards compatibility

- First failure on a path: same `RESOURCE_NOT_FOUND` error code, same error string. Existing callers and tests that key off this code see no difference.
- The new `RESOURCE_NOT_FOUND_FATAL` code is purely additive.
- The tracking key carries the `temp:` prefix and is therefore guaranteed not to leak into persisted session storage.