
fix: terminate infinite retry loop in LoadSkillResourceTool on RESOURCE_NOT_FOUND#5651

Open
Raman369AI wants to merge 1 commit into google:main from Raman369AI:fix/skill-resource-not-found-infinite-loop

Conversation


@Raman369AI Raman369AI commented May 10, 2026

Fixes #5652

Summary

Closes a latent defect in SkillToolset that lets the LLM enter an unbounded retry loop when load_skill_resource returns RESOURCE_NOT_FOUND. Because the only existing backstop is RunConfig.max_llm_calls (default 500), a single hallucinated path can quietly burn the entire per-invocation call budget on the same failing tool call before the framework intervenes — and max_llm_calls is a global cap on legitimate reasoning, not a defense against this specific failure mode.

This fix adds invocation-scoped termination to the tool itself, plus a complementary system-prompt rule, so the framework no longer relies on a perfect upstream prompt to avoid unbounded loops on a known error path.

Why this matters

This is a structural problem, not an edge case:

  • The L2 load_skill response intentionally omits a manifest of available files — that is the agentskills.io progressive-disclosure contract, and it is correct for token economy. But it means the LLM must infer paths from prose instructions inside SKILL.md. Inferred paths are routinely wrong, even with strong models.
  • RESOURCE_NOT_FOUND is returned as a structured soft string. From the model's perspective it looks transient and recoverable, exactly like a flaky network error — so it retries the same path. There is no terminal signal in the current implementation.
  • Nothing in SkillToolset distinguishes "first miss" from "500th miss". Every retry is treated identically. The loop terminates only when max_llm_calls (default 500) is hit, by which point the entire budget has been spent on one wrong path.
  • Confounding scope ambiguity: the default system prompt does not draw a boundary between skill-bundled files (the legitimate target of load_skill_resource) and runtime user inputs (e.g., a PDF the user is processing). A model that has just been asked to analyze a runtime document will sometimes route that document through load_skill_resource, hit RESOURCE_NOT_FOUND, and loop on it — even though the file was never a skill resource to begin with.

The combination — no manifest, soft error, no terminal signal, no scope boundary — means the loop is reachable by ordinary use of the feature, not just adversarial inputs. Any production user with imperfect prompts is exposed.

What changed

Two layers in one file (src/google/adk/tools/skill_toolset.py), plus tests:

1. Code: invocation-scoped retry guard in LoadSkillResourceTool.run_async

Failed (skill, path) pairs are tracked on tool_context.state under the key:

temp:_adk_skill_resource_failed_paths_<invocation_id>
  • The temp: prefix uses ADK's existing convention so the value is trimmed from the persisted event delta and never reaches durable session storage.
  • The <invocation_id> suffix ensures correctness on in-memory session backends as well, where temp: keys are added to session.state and are not auto-cleared between invocations. Without the suffix, a path that legitimately failed in invocation A would spuriously hit the fatal path on its first attempt in invocation B.

Behavior:

  • First failure on a given (skill, path) within an invocation → returns RESOURCE_NOT_FOUND (unchanged).
  • Second failure on the same pair within the same invocation → returns the new RESOURCE_NOT_FOUND_FATAL error code, with an explicit "do not retry — report the error and stop" message in the error string. This gives the LLM an unambiguous terminal signal.

The guard is bounded (one entry per unique missing path), invocation-scoped, and adds no overhead on the success path.
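
The guard's behavior can be sketched as follows. This is a standalone, simplified illustration: record_failure is a hypothetical helper name, and the real logic is inline in LoadSkillResourceTool.run_async against ADK's tool_context.state rather than a plain dict.

```python
# Simplified sketch of the invocation-scoped retry guard described above.
# Names (record_failure) and the use of a plain dict are illustrative only.

RESOURCE_NOT_FOUND = "RESOURCE_NOT_FOUND"
RESOURCE_NOT_FOUND_FATAL = "RESOURCE_NOT_FOUND_FATAL"


def record_failure(state: dict, invocation_id: str, skill: str, path: str) -> str:
    """Return the error code for a failed (skill, path) resource lookup.

    First failure of a pair within an invocation yields the soft code;
    a repeat of the same pair within the same invocation yields the
    fatal code, which carries the terminal "stop" signal for the LLM.
    """
    # temp: prefix keeps the entry out of durable session storage;
    # the invocation_id suffix isolates invocations from each other.
    key = f"temp:_adk_skill_resource_failed_paths_{invocation_id}"
    failed: set = state.setdefault(key, set())
    pair = (skill, path)
    if pair in failed:
        return RESOURCE_NOT_FOUND_FATAL
    failed.add(pair)
    return RESOURCE_NOT_FOUND
```

Note the guard only ever runs on the failure path, so successful resource loads never touch the tracking key.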

2. Prompt: two additions to _DEFAULT_SKILL_SYSTEM_INSTRUCTION

  • No-retry rule: "If load_skill_resource returns any error, do not retry the same path. Report the error to the user and stop."
  • Scope boundary added to rule 3: clarifies that load_skill_resource is only for skill-bundled files in references/, assets/, or scripts/ — not for runtime user input. This addresses the confounding failure mode above.
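
For illustration, the two additions might read roughly as below. The exact wording lives in _DEFAULT_SKILL_SYSTEM_INSTRUCTION in the diff and may differ; the constant names here are hypothetical.

```python
# Hypothetical sketch of the two prompt additions; exact wording is in
# _DEFAULT_SKILL_SYSTEM_INSTRUCTION in skill_toolset.py.

NO_RETRY_RULE = (
    "If load_skill_resource returns any error, do not retry the same path. "
    "Report the error to the user and stop."
)

SCOPE_BOUNDARY = (
    "Use load_skill_resource only for files bundled with a skill under "
    "references/, assets/, or scripts/. Never pass runtime user inputs "
    "(e.g. a document the user is asking you to process) to this tool."
)
```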

Why both layers

Defense-in-depth. Code-only termination produces confusing downstream behavior when the LLM has no semantic understanding of why the tool stopped responding the way it expects. Prompt-only termination relies on the LLM following the rule, which imperfect upstream prompts can override or contradict. Together they are robust.

Considered and rejected

| Alternative | Why not |
| --- | --- |
| Tighten or default-lower max_llm_calls | Caps the agent's overall reasoning budget; punishes legitimate long-running tasks; does not address the specific defect |
| Symptomatic after_tool_callback workaround | Pushes the fix onto every user of SkillToolset; the framework still ships with the loop |
| Add a full available_resources manifest to the L2 load_skill response | Defeats the lazy-loading / token-saving design that the Skills spec is built around |
| Introduce a new list_skill_resources tool | Violates the L1→L2→L3 progressive disclosure contract from agentskills.io |
| Include available paths on the fatal response | Re-introduces the manifest cost; contradicts the "stop" semantic the fatal code is meant to enforce |

Test plan

New tests in tests/unittests/tools/test_skill_toolset.py:

  • test_load_resource_repeated_failure_escalates_to_fatal — second call on the same missing path returns RESOURCE_NOT_FOUND_FATAL with an explicit stop instruction.
  • test_load_resource_different_paths_each_soft_fail — distinct missing paths each return the soft error (no over-eager cross-path escalation).
  • test_load_resource_failures_isolated_per_invocation — proves the <invocation_id> scoping: failures from invocation A do not escalate the first attempt in invocation B even when sharing the same session state dict.
  • test_load_resource_failed_paths_use_temp_prefix — proves the temp: prefix invariant so tracking keys are never persisted to durable storage.
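
As a standalone illustration of the last invariant, a minimal version of the prefix check might look like this. tracking_key is a hypothetical helper mirroring the key format described above; the real test inspects the key that LoadSkillResourceTool writes to tool_context.state.

```python
# Minimal sketch of the temp:-prefix invariant test described above.

def tracking_key(invocation_id: str) -> str:
    # Key format from this PR: the temp: prefix keeps the entry out of
    # durable session storage; the invocation_id suffix scopes it.
    return f"temp:_adk_skill_resource_failed_paths_{invocation_id}"


def test_failed_paths_use_temp_prefix():
    key = tracking_key("inv-123")
    assert key.startswith("temp:")   # never persisted to durable storage
    assert key.endswith("inv-123")   # isolated per invocation
```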

Verification:

  • All 81 pre-existing tests still pass; 85 total, 0 regressions.
  • pyink --check clean on both modified files.
  • isort --check-only clean on both modified files.
  • mypy src/google/adk/tools/skill_toolset.py reports the same 17 pre-existing errors as main; zero new errors introduced.

Backwards compatibility

  • The first-failure behavior is unchanged: same RESOURCE_NOT_FOUND error code, same error string. Existing callers and tests that key off this code see no difference.
  • The new RESOURCE_NOT_FOUND_FATAL code is purely additive.
  • The new state key uses the temp: prefix and is therefore guaranteed not to leak into persisted session storage.

…CE_NOT_FOUND

When load_skill_resource returned RESOURCE_NOT_FOUND, the LLM treated it
as a transient soft error and retried the same path indefinitely,
producing runaway invocations and large API bills. Two complementary
guards are added:

1. Code (LoadSkillResourceTool.run_async): an invocation-scoped retry
   guard tracks already-attempted (skill, path) pairs in
   tool_context.state under a temp:_adk_skill_resource_failed_paths_<id>
   key. The temp: prefix prevents persistence to durable session
   storage; the invocation_id suffix prevents leakage across invocations
   on in-memory session backends (where temp keys are not auto-cleared).
   A second call on the same path within the same invocation returns
   RESOURCE_NOT_FOUND_FATAL with an explicit stop instruction, giving
   the LLM an unambiguous terminal signal.

2. Prompt (DEFAULT_SKILL_SYSTEM_INSTRUCTION): adds a rule prohibiting
   retrying the same path after any error, and a scope boundary
   clarifying that load_skill_resource is for skill-bundled files only,
   not for runtime user input (a common source of hallucinated paths).

Neither guard alone is sufficient: a code-only stop produces confusing
downstream behavior when the LLM has no semantic understanding of why
to stop; a prompt-only guard can be ignored under imperfect system
prompts. Both layers are required for defense-in-depth.

Tests cover: repeat-failure escalation, distinct-path soft errors not
escalating, cross-invocation isolation with shared session state, and
the temp: prefix on tracking keys.


Development

Successfully merging this pull request may close these issues.

[Bug] LoadSkillResourceTool retries RESOURCE_NOT_FOUND indefinitely; default max_llm_calls=500 is the only backstop
