feat: per-layer package attribution (opt-in) by ashokn1 · Pull Request #793 · snyk/snyk-docker-plugin

ashokn1 · 2026-04-18T22:32:55Z

Summary

Adds the ability to attribute each OS package to the specific image layer that first introduced it, following the same approach as Trivy's `Layer { DiffID, Digest }`. Attribution is opt-in via a new `layer-attribution` plugin option so there is no performance impact on existing callers.

How layer attribution is computed

Docker images are built from an ordered stack of layers. Each layer is a filesystem delta produced by one Dockerfile instruction. When a package manager installs or removes packages, it rewrites its database in full (e.g. `/lib/apk/db/installed`, `/var/lib/dpkg/status`). This property makes diff-based attribution possible: if you parse the package DB from each layer in isolation and compare successive snapshots, you can pinpoint exactly which layer introduced (or removed) each package.
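As a minimal sketch of the snapshot comparison described above: given the package set parsed from one DB-writing layer and the set from the next, two filters yield the additions and removals. The names `diffSnapshots` and `SnapshotDiff` are hypothetical and are not taken from the plugin.

```typescript
// Hypothetical helper; names are illustrative, not the plugin's API.
type PackageKey = string; // assumed "name@version" format

interface SnapshotDiff {
  added: PackageKey[];
  removed: PackageKey[];
}

// Compare two successive package-DB snapshots.
function diffSnapshots(
  previous: Set<PackageKey>,
  current: Set<PackageKey>,
): SnapshotDiff {
  const added = Array.from(current).filter((key) => !previous.has(key));
  const removed = Array.from(previous).filter((key) => !current.has(key));
  return { added, removed };
}

const baseLayer = new Set(["libc6@2.35-0ubuntu3", "curl@7.81.0"]);
const nextLayer = new Set(["libc6@2.35-0ubuntu3", "nginx@1.18.0"]);
console.log(diffSnapshots(baseLayer, nextLayer));
// → { added: [ 'nginx@1.18.0' ], removed: [ 'curl@7.81.0' ] }
```

Because the whole database file is rewritten in each layer that touches it, a package unchanged between the two snapshots appears in neither list.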

Algorithm (`lib/analyzer/layer-attribution.ts`)

1. History alignment. The image config's `history` array contains one entry per Dockerfile instruction, some marked `empty_layer: true` (metadata instructions like `ENV`, `LABEL`, `EXPOSE` that produce no filesystem delta). These are filtered out to produce an aligned array where index `i` maps to `rootFsLayers[i]` and its instruction text.
2. Per-layer parse. For each layer in order, the package DB is read from that layer's file map alone, not the merged view used for the normal scan. Two cases are distinguished:
   - DB file absent in the layer (e.g. a `COPY` or `WORKDIR` instruction): `parseLayerPackages` returns `null` and the layer is skipped entirely. `previousPkgs` is left unchanged.
   - DB file present but empty (e.g. `apk del $(apk info)`): returns an empty `Set`. This is treated as "all packages removed" and the layer is recorded.
3. Set diff. Each DB-writing layer's package set is diffed against the previous one:
   - Keys in `currentPkgs` not in `previousPkgs` → added (`packages[]`)
   - Keys in `previousPkgs` not in `currentPkgs` → removed (`removedPackages[]`)

   A `LayerAttributionEntry` is emitted for any layer with at least one addition or removal. The `pkgLayerMap` records the layer where each `name@version` key first appeared.
4. Multi-manager support. `computeLayerAttribution` is called once per unique `AnalysisType` (APK, APT, RPM, Chisel). Results are cached by type so duplicate entries (APT regular + APT distroless, RPM BDB + RPM SQLite) share one parse pass and reuse the cached `pkgLayerMap`. Entries from all managers are merged per-layer by `mergeLayerAttributionEntries`.
5. Package annotation. Each `AnalyzedPackage` is stamped with `layerIndex` and `layerDiffId` by looking up its key in `pkgLayerMap`. These propagate to dep-graph node labels via `lib/dependency-tree/index.ts`.
6. Fact emission. `lib/response-builder.ts` assembles the entries into a `layerPackageAttribution` fact on the OS scan result.
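The history-alignment, per-layer parse, and set-diff steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `lib/analyzer/layer-attribution.ts`: the per-layer parse is replaced by a pre-computed array (`null` for "DB absent", an empty `Set` for "DB present but empty"), and `alignHistory`, `attributeLayers`, `HistoryEntry`, and `AttributionEntry` are hypothetical names.

```typescript
// All names below are illustrative stand-ins, not the plugin's real signatures.
interface HistoryEntry {
  created_by?: string;
  empty_layer?: boolean;
}

interface AttributionEntry {
  layerIndex: number;
  diffID: string;
  instruction?: string;
  packages: string[];
  removedPackages?: string[];
}

// Drop metadata-only history entries so index i lines up with rootFsLayers[i].
function alignHistory(history: HistoryEntry[]): Array<string | undefined> {
  return history
    .filter((h) => !h.empty_layer)
    .map((h) => h.created_by?.trim() || undefined);
}

// perLayerPkgs[i] is null when layer i does not contain the package DB,
// and an empty Set when the DB is present but lists no packages.
function attributeLayers(
  rootFsLayers: string[],
  perLayerPkgs: Array<Set<string> | null>,
  history: HistoryEntry[],
): { entries: AttributionEntry[]; pkgLayerMap: Map<string, number> } {
  const instructions = alignHistory(history);
  const entries: AttributionEntry[] = [];
  const pkgLayerMap = new Map<string, number>();
  let previousPkgs = new Set<string>();
  const layerCount = Math.min(rootFsLayers.length, perLayerPkgs.length);

  for (let i = 0; i < layerCount; i++) {
    const currentPkgs = perLayerPkgs[i];
    if (currentPkgs === null) continue; // DB untouched: previousPkgs stays as-is

    const added = Array.from(currentPkgs).filter((k) => !previousPkgs.has(k));
    const removed = Array.from(previousPkgs).filter((k) => !currentPkgs.has(k));

    for (const key of added) {
      if (!pkgLayerMap.has(key)) pkgLayerMap.set(key, i); // first appearance wins
    }

    if (added.length > 0 || removed.length > 0) {
      const entry: AttributionEntry = {
        layerIndex: i,
        diffID: rootFsLayers[i],
        packages: added,
      };
      const instruction = instructions[i];
      if (instruction !== undefined) entry.instruction = instruction;
      if (removed.length > 0) entry.removedPackages = removed;
      entries.push(entry);
    }
    previousPkgs = currentPkgs;
  }
  return { entries, pkgLayerMap };
}
```

Keeping `null` distinct from an empty `Set` is what lets a `COPY` layer pass through silently while an `apk del` of everything still produces an entry.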

Output

New fact (`layerPackageAttribution`):
```json
{
  "type": "layerPackageAttribution",
  "data": [
    {
      "layerIndex": 0,
      "diffID": "sha256:abc...",
      "instruction": "FROM ubuntu:22.04",
      "packages": ["libc6@2.35-0ubuntu3", "curl@7.81.0"]
    },
    {
      "layerIndex": 2,
      "diffID": "sha256:ghi...",
      "digest": "sha256:def...",
      "instruction": "RUN apt-get install -y nginx",
      "packages": ["nginx@1.18.0"],
      "removedPackages": ["curl@7.81.0"]
    }
  ]
}
```
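For consumers, the fact shape above could be typed roughly as follows. The field names mirror the JSON example; which fields are optional is an assumption inferred from that example (the first entry has no `digest` or `removedPackages`), and the authoritative definitions are the ones added to `lib/facts.ts`. `findIntroducingLayer` is a hypothetical helper, not part of the plugin.

```typescript
// Assumed shape, inferred from the example fact; see lib/facts.ts for the real types.
interface LayerAttributionEntry {
  layerIndex: number;
  diffID: string;
  digest?: string;
  instruction?: string;
  packages: string[]; // "name@version" keys
  removedPackages?: string[]; // "name@version" keys
}

interface LayerPackageAttributionFact {
  type: "layerPackageAttribution";
  data: LayerAttributionEntry[];
}

// Hypothetical consumer helper: return the first entry whose `packages`
// list contains the given "name@version" key.
function findIntroducingLayer(
  fact: LayerPackageAttributionFact,
  pkgKey: string,
): LayerAttributionEntry | undefined {
  return fact.data.find((entry) => entry.packages.indexOf(pkgKey) !== -1);
}

const exampleFact: LayerPackageAttributionFact = {
  type: "layerPackageAttribution",
  data: [
    {
      layerIndex: 0,
      diffID: "sha256:abc...",
      instruction: "FROM ubuntu:22.04",
      packages: ["libc6@2.35-0ubuntu3", "curl@7.81.0"],
    },
    {
      layerIndex: 2,
      diffID: "sha256:ghi...",
      digest: "sha256:def...",
      instruction: "RUN apt-get install -y nginx",
      packages: ["nginx@1.18.0"],
      removedPackages: ["curl@7.81.0"],
    },
  ],
};

console.log(findIntroducingLayer(exampleFact, "nginx@1.18.0")?.instruction);
// → RUN apt-get install -y nginx
```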

New dep-graph node labels (additive alongside existing `dockerLayerId`):
```json
"labels": {
  "dockerLayerId": "UnVOIGFwdC1nZXQ...",
  "layerDiffId": "sha256:ghi...",
  "layerIndex": "2"
}
```
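A minimal sketch of how such labels might be derived from an annotated package. The function name `buildLayerLabelsSketch` and the `LayerAnnotated` type are hypothetical, and the body is an assumption rather than the code in `lib/dependency-tree/index.ts`. The one detail taken from the example above is that label values are strings, so `layerIndex` is stringified.

```typescript
// Hypothetical input type: the two optional fields this PR adds to a package.
interface LayerAnnotated {
  layerIndex?: number;
  layerDiffId?: string;
}

// Build only the labels that are actually known. The explicit undefined
// checks matter: a truthiness check would silently drop layerIndex 0.
function buildLayerLabelsSketch(pkg: LayerAnnotated): Record<string, string> {
  const labels: Record<string, string> = {};
  if (pkg.layerDiffId !== undefined) {
    labels.layerDiffId = pkg.layerDiffId;
  }
  if (pkg.layerIndex !== undefined) {
    labels.layerIndex = String(pkg.layerIndex);
  }
  return labels;
}

console.log(buildLayerLabelsSketch({ layerIndex: 2, layerDiffId: "sha256:ghi..." }));
// → { layerDiffId: 'sha256:ghi...', layerIndex: '2' }
```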

Edge cases

| Scenario | Behaviour |
| --- | --- |
| Layer doesn't touch the package DB | Skipped; `previousPkgs` unchanged for next diff |
| DB file present but empty (`apk del` all packages) | Recorded as `removedPackages`; package set reset to empty |
| Package deleted then reinstalled (different version) | Deletion layer records `removedPackages`; reinstall layer records new version in `packages` |
| `rootFsLayers` shorter than `orderedLayers` | Loop capped at `Math.min(...)` |
| Scratch image / no package DB | Returns empty entries; fact omitted |
| No history / empty history | Instructions omitted from entries; diff still runs |
| Multiple managers with same `AnalysisType` | Single parse pass, cached `pkgLayerMap` reused |

Changes

| File | Change |
| --- | --- |
| `lib/extractor/types.ts` | Add `orderedLayers?: ExtractedLayers[]` (optional) to `ExtractionResult` |
| `lib/extractor/index.ts` | Populate `orderedLayers` only when `layer-attribution` is enabled (avoids holding all per-layer buffers unconditionally) |
| `lib/facts.ts` | Add `LayerAttributionEntry`, `LayerPackageAttributionFact` |
| `lib/types.ts` | Add `"layer-attribution"` to `PluginOptions`; `"layerPackageAttribution"` to `FactType` |
| `lib/analyzer/types.ts` | Add `layerIndex?`, `layerDiffId?` to `AnalyzedPackage`; `layerPackageAttribution?` to `StaticAnalysis` |
| `lib/analyzer/layer-attribution.ts` | New: `computeLayerAttribution()`, `mergeLayerAttributionEntries()` |
| `lib/analyzer/static-analyzer.ts` | Call attribution (gated on option); cache results by `AnalysisType`; annotate packages |
| `lib/dependency-tree/index.ts` | Propagate `layerDiffId`/`layerIndex` to `DepTreeDep.labels` |
| `lib/response-builder.ts` | Emit `LayerPackageAttributionFact` |
| `test/lib/analyzer/layer-attribution.spec.ts` | New: 19 unit tests (APK + APT, including deletion/reinstall/empty-DB scenarios) |
| `test/harness/run.ts` | New: CLI harness wrapping `scan()` for manual testing |
| `test/system/docker.spec.ts` | Fix stale assertions: remove fragile sha256 round-trip comparison; fix invalid-image-name test to exercise the 404 path |
| `test/system/plugin.spec.ts` | Update nginx:1.19.0 manifest layer digests (re-published on Docker Hub with different compression) |

Test plan

`npx jest test/lib/analyzer/layer-attribution.spec.ts` — all 19 unit tests pass
`npm run test:system` — all archive-based system tests pass; remaining failures are unauthenticated Docker Hub rate limits or require a live Docker daemon (pre-existing, unrelated to this PR)
Manual verification against real images via `npx ts-node test/harness/run.ts --layer-attribution`

🤖 Generated with Claude Code

Introduces `computeLayerAttribution` in `lib/analyzer/layer-attribution.ts` and wires it through the full pipeline. Enabled with `--layer-attribution`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Memory: make orderedLayers optional in ExtractionResult; only populate it when layer-attribution option is enabled, avoiding holding all per-layer file buffers unconditionally
- Performance: cache computeLayerAttribution results by AnalysisType so duplicate manager types (APT regular + distroless, RPM BDB + SQLite) share a single expensive layer-parsing pass
- Clarity: add JSDoc to buildHistoryInstructions explaining why it differs from getUserInstructionLayersFromConfig (all-layers vs user-layers)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- docker.spec.ts: remove fragile sha256 checksum comparison in the hello-world round-trip test; Docker's tar format varies across versions so the normalised checksums no longer match the fixture. Existence of the output file is still verified.
- docker.spec.ts: change 'someImage' (uppercase → HTTP 400) to a valid lowercase name so the "image doesn't exist" test exercises the intended 404 code path ("not found") rather than a name-validation error.
- plugin.spec.ts: update nginx:1.19.0 manifest layer digests; the compressed layer blobs were re-published on Docker Hub with different compression, changing the manifest digests while the image config (and therefore imageId) remained the same.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…key generation

- Thread osRelease through computeLayerAttribution → parseLayerPackages so aptAnalyze uses the same normalization as the main analysis path, preventing pkgLayerMap lookup misses on distros where osRelease affects package version strings (e.g. Ubuntu epoch stripping)
- Pass redHatRepositories through the same chain so rpmAnalyze and mapRpmSqlitePackages receive the same repository list as the main path
- Fix RPM SQLite branch: SQLite packages now go through mapRpmSqlitePackages (sync helper matching the main path) instead of being merged into the rpmAnalyze call; BDB+NDB and SQLite results are combined in a single Set
- Update computeLayerAttribution call in static-analyzer.ts to supply the already-computed osRelease and redHatRepositories
- Update unit tests to pass the new required parameters

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- [sev 8] static-analyzer: document cache assumption: analyzers sharing an AnalysisType parse the same DB format, so a single pass covers all of them; add comment explaining why allEntries.push is cache-miss-only
- [sev 7] layer-attribution: buildHistoryInstructions now returns Array<string | undefined> using `?.trim() || undefined` so empty and whitespace-only created_by values are treated as absent; tighten `if (instruction)` to `if (instruction !== undefined)` to make the intent explicit
- [sev 6] layer-attribution: add comment to RPM branch clarifying that BDB/NDB and SQLite paths are independent and intentionally use separate analyzers to match the main analysis path
- [sev 6] dependency-tree: extract buildLayerLabels() helper used by both the tooFrequentDeps path and buildTreeRecursive, eliminating the duplicate inline label-building blocks and the inconsistent freqLabels variable name
- [sev 5] static-analyzer: set attributionCache to an empty Map on error so subsequent results of the same AnalysisType skip recomputation instead of triggering O(n) retry attempts for a broken type
- [sev 4] facts: add JSDoc to LayerAttributionEntry.packages and removedPackages documenting the "name@version" key format
- [sev 3] harness: remove startsWith("--") guard from next() so option values that begin with "--" (e.g. passwords) are accepted

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
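The caching behaviour described in the sev 8 and sev 5 items can be sketched as below. This is an assumption-laden illustration, not the code in `static-analyzer.ts`: `getPkgLayerMap` is a hypothetical wrapper, the cache key is a plain string standing in for `AnalysisType`, and `compute` stands in for the per-type attribution pass.

```typescript
type PkgLayerMap = Map<string, number>; // "name@version" → layer index

// Hypothetical wrapper: look up the cached map for a type, computing it at
// most once. On failure an empty Map is cached so later results of the same
// type skip recomputation instead of retrying a broken pass.
function getPkgLayerMap(
  cache: Map<string, PkgLayerMap>,
  analysisType: string,
  compute: () => PkgLayerMap,
): { pkgLayerMap: PkgLayerMap; cacheHit: boolean } {
  const cached = cache.get(analysisType);
  if (cached !== undefined) {
    return { pkgLayerMap: cached, cacheHit: true };
  }

  let pkgLayerMap: PkgLayerMap;
  try {
    pkgLayerMap = compute();
  } catch (_err) {
    pkgLayerMap = new Map(); // remember the failure as "no attribution"
  }
  cache.set(analysisType, pkgLayerMap);
  return { pkgLayerMap, cacheHit: false };
}
```

A caller would push attribution entries only when `cacheHit` is `false`, which matches the "cache-miss-only" note in the first item.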

SteveShani · 2026-04-23T05:42:03Z

@ashokn1 I think there's a critical gap. App vulnerabilities are missing.

ashokn1 · 2026-04-25T17:02:38Z

> @ashokn1 I think there's a critical gap. App vulnerabilities are missing.

@SteveShani are you saying that the layers don't include app vulnerabilities, or that the basic app vulnerability scan is broken? The latter should be fine.

…nifest digests

imageLayers was populated from manifestLayers (compressed digest in the OCI manifest), which varies by Docker storage backend: local Docker 29.1.3 caches compressed layers and embeds compressed digests, while CI's newer containerd stores uncompressed layers and embeds DiffIDs. This caused all system-test snapshot mismatches in CI.

Switch to rootFsLayers (rootfs.diff_ids from image config), which are stable uncompressed content hashes regardless of Docker version or storage format. Fall back to manifestLayers only when rootFsLayers is unavailable.

Also update all system-test snapshots to DiffID values, update the hard-coded hashes in plugin.spec.ts (nginx:1.19.0), and fix image-layers.spec.ts which was documenting this inconsistency as a known bug; it now asserts equality.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
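The fallback rule in this commit message amounts to a one-line selection. In the sketch below, `selectImageLayers` is a hypothetical name, and treating an empty `rootFsLayers` array as "unavailable" is an assumption; the commit message does not say how unavailability is detected.

```typescript
// Hypothetical helper: prefer DiffIDs from the image config, and fall back to
// manifest digests only when no DiffIDs are available.
// Assumption: an empty or missing rootFsLayers array counts as unavailable.
function selectImageLayers(
  rootFsLayers: string[] | undefined,
  manifestLayers: string[],
): string[] {
  return rootFsLayers !== undefined && rootFsLayers.length > 0
    ? rootFsLayers
    : manifestLayers;
}

console.log(selectImageLayers(["sha256:diff1"], ["sha256:blob1"]));
// → [ 'sha256:diff1' ]
```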

snyk-pr-review-bot · 2026-04-25T18:42:07Z

PR Reviewer Guide 🔍

🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Incorrect Cache Key 🟠 [major]

The `attributionCache` uses `AnalysisType` as its unique key. However, for Debian-based Distroless images, the system runs both `aptAnalyze` (reading `/var/lib/dpkg/status`) and `aptDistrolessAnalyze` (reading `/var/lib/dpkg/status.d/`). Both return `AnalysisType.Apt`. Since `aptAnalyze` runs first, the cache will be populated with the standard status file results. When `aptDistrolessAnalyze` runs, it will trigger a cache hit and return the standard attribution, failing to attribute any packages found in the `status.d/` directory (which the standard status file does not contain).

    if (attributionCache.has(result.AnalyzeType)) {
      // Cache hit: all analyzers that share an AnalysisType (e.g. aptAnalyze
      // and aptDistrolessAnalyze both return AnalysisType.Apt) parse the same
      // underlying package DB format, so a single parse pass covers all of them.
      // Entries were already pushed on the first pass; we only need the pkgLayerMap.
      pkgLayerMap = attributionCache.get(result.AnalyzeType)!;

High Memory Pressure 🟠 [major]

When `layer-attribution` is enabled, `extractImageContent` returns the `orderedLayers` array, which contains the unmerged file content of every layer. For large container images (e.g., ML images or heavy Windows images) with many layers, this holds multiple copies of large package databases or configuration files in memory simultaneously. Unlike the primary analysis, which merges layers into a single `extractedLayers` object to save space, this opt-in path may trigger Heap Out-Of-Memory (OOM) errors in memory-constrained environments like CI runners.

    orderedLayers: isTrue(options?.["layer-attribution"])
      ? archiveContent.layers
      : undefined,
📚 Repository Context Analyzed

This review considered 81 relevant code sections from 11 files (average relevance: 0.94):
- lib/analyzer/static-analyzer.ts (lines 301-400)
- lib/response-builder.ts (2 segments, lines 301-400)
- lib/dockerfile/instruction-parser.ts (2 segments, lines 1-100)
- lib/scan.ts (3 segments, lines 201-287)
- lib/dockerfile/index.ts (lines 1-61)
- lib/index.ts (lines 1-49)
- lib/static.ts (lines 1-78)
- ... and 4 more file(s)

SteveShani · 2026-04-26T07:57:07Z

> > @ashokn1 I think there's a critical gap. App vulnerabilities are missing.
>
> @SteveShani are you saying that the layers don't include app vulnerabilities, or that the basic app vulnerability scan is broken? The latter should be fine.

The gap I meant is about this PR’s feature: we only add per-layer package attribution for OS (image) packages. We don’t yet attribute application deps (e.g. npm) to an image layer, so from a "which Dockerfile layer / diff id brought this in?" perspective, app vulns don’t get that extra signal yet.

mtstanley-snyk · 2026-04-28T20:14:42Z

> > > @ashokn1 I think there's a critical gap. App vulnerabilities are missing.
> >
> > @SteveShani are you saying that the layers don't include app vulnerabilities, or that the basic app vulnerability scan is broken? The latter should be fine.
>
> The gap I meant is about this PR’s feature: we only add per-layer package attribution for OS (image) packages. We don’t yet attribute application deps (e.g. npm) to an image layer, so from a "which Dockerfile layer / diff id brought this in?" perspective, app vulns don’t get that extra signal yet.

@SteveShani shouldn't we be able to extend the core concept this PR uses for OS packages to application SCA? There's definitely more complexity involved, though 🤔. For example, when scanning node modules we don't parse a single file to determine the dep graph; we traverse all the directories and reconstruct the dep graph from the nested package.json files. We could potentially merge the previous layers and recompute the dep graph each time we find a file change within the node modules directory, or something similar.

ashokn1 · 2026-04-29T09:57:28Z

@mtstanley-snyk this is possible, I believe. I was thinking we would do it as a second add-on after this feature.

ashokn1 requested review from a team as code owners April 18, 2026 22:32

ashokn1 requested a review from mtstanley-snyk April 18, 2026 22:32