
gh-149584: Fix excessive overhead in the Tachyon profiler regarding the cache behavior #149649

Open

pablogsal wants to merge 5 commits into python:main from pablogsal:speedups

Conversation


@pablogsal pablogsal commented May 10, 2026

Some ideas after the discussion in the issue with @maurycy. The profiler was spending too much time on repeated remote-memory bookkeeping, full remote page reads for small fixed-size structs, repeated remote writes of unchanged frame-cache state, and Python object allocation churn on steady-state frame-cache hits.

This PR improves the profiler by:

  • reading hot interpreter/thread/frame structs with exact remote reads instead of page-cache reads
  • tracking the number of live page-cache entries so cache clear/search only scans the used prefix
  • batching predicted interpreter/thread/frame reads with process_vm_readv() on Linux
  • reusing cached FrameInfo and thread id objects when frame-cache hits dominate

The last_profiled_frame remote-write suppression is already present in current upstream/main, so this branch keeps that baseline behavior and builds on top of it.
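The write-suppression baseline mentioned above can be sketched as "only push `last_profiled_frame` to the target when the value actually changed". The class and callback names below are made up for illustration; the real code does a remote process write rather than calling a Python function.

```python
# Hypothetical sketch of remote-write suppression: track the last value
# written per thread state and skip the remote write when it is unchanged.
class RemoteFrameMarker:
    def __init__(self, writer):
        self.writer = writer       # callable(addr, value) doing the remote write
        self.last_written = {}     # thread state addr -> last frame addr written

    def mark(self, tstate_addr, frame_addr):
        if self.last_written.get(tstate_addr) == frame_addr:
            return False           # unchanged: suppress the remote write
        self.writer(tstate_addr, frame_addr)
        self.last_written[tstate_addr] = frame_addr
        return True

writes = []
marker = RemoteFrameMarker(lambda addr, value: writes.append((addr, value)))
assert marker.mark(0x10, 0xAA) is True    # first write goes through
assert marker.mark(0x10, 0xAA) is False   # repeat is suppressed
assert marker.mark(0x10, 0xBB) is True    # changed value is written
assert writes == [(0x10, 0xAA), (0x10, 0xBB)]
```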

Benchmark

Benchmarked with:

./python Tools/inspection/benchmark_external_inspection.py --duration 3

For the per-commit measurements, I used the same benchmark workload in quiet mode with cache_frames=True and all_threads=True.

| Step | Work rate | Sample rate | Avg call time | Incremental win |
|---|---|---|---|---|
| upstream/main baseline | 382,593/s | 363,963/s | 2.614 us | - |
| Exact reads for hot structs | 549,741/s | 514,642/s | 1.819 us | +167,148/s (+43.7%) |
| Track live page-cache entries | 747,946/s | 688,282/s | 1.337 us | +198,205/s (+36.1%) |
| Batch predicted remote reads | 897,703/s | 806,084/s | 1.114 us | +149,756/s (+20.0%) |
| Reuse profiler result objects | 1,187,314/s | 1,047,698/s | 0.842 us | +289,611/s (+32.3%) |

Final benchmark using the benchmark script:

| Metric | Baseline | Final | Change |
|---|---|---|---|
| Work rate | 382,593/s | 1,180,434/s | 3.08x faster |
| Sample rate | 363,963/s | 1,038,965/s | 2.85x faster |
| Avg call time | 2.614 us | 0.85 us | -67.5% |
| Success rate | 99.67% | 99.84% | +0.17 pp |

Final benchmark output:

Average call time:   0.85 us
Work rate:           1,180,434.2 calls/sec
Sample rate:         1,038,964.9 samples/sec
Success rate:        99.84%

pablogsal added 2 commits May 10, 2026 19:31
Use exact remote reads for interpreter state, thread state, and
interpreter frame structs instead of pulling full remote pages into the
profiler page cache. This matches the core change from
python#149585.
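The difference between a page-cache read and an exact read can be sketched with a toy "remote memory" buffer. The function names and the 16-byte struct are made up for illustration; the point is that the exact read transfers only the struct's bytes instead of the whole enclosing 4 KiB page.

```python
import struct

PAGE_SIZE = 4096

# Stand-in for the target's memory: a flat buffer indexed by "address",
# with a fake 16-byte struct (two little-endian u64s) placed at 5000.
remote_memory = bytearray(3 * PAGE_SIZE)
struct.pack_into("<QQ", remote_memory, 5000, 0xDEAD, 0xBEEF)

def read_remote(addr, size):
    # In the real profiler this is a remote process read; here it is a slice.
    return bytes(remote_memory[addr:addr + size])

def read_via_page_cache(addr, size):
    # Old behavior (sketch): pull the whole enclosing page, then slice.
    page_start = addr & ~(PAGE_SIZE - 1)
    page = read_remote(page_start, PAGE_SIZE)          # 4096 bytes transferred
    return page[addr - page_start:addr - page_start + size]

def read_exact(addr, size):
    # New behavior (sketch): read only the bytes of the struct.
    return read_remote(addr, size)                     # 16 bytes transferred

# Both paths return the same struct; the exact read just moves far less data.
assert read_via_page_cache(5000, 16) == read_exact(5000, 16)
assert struct.unpack("<QQ", read_exact(5000, 16)) == (0xDEAD, 0xBEEF)
```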

The profiler clears the page cache between samples, so live entries are
always packed at the front. Track the live count and only clear/search
that prefix instead of scanning all 1024 slots on the hot path.
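Because the cache is cleared between samples, live entries are always packed at slots `[0, used)`, so a live-entry counter is enough to bound both lookup and clear. A minimal sketch (class and field names are hypothetical, and the real table lives in C):

```python
MAX_ENTRIES = 1024

class PageCache:
    """Sketch of a fixed-size cache whose live entries are packed at the front."""

    def __init__(self):
        self.entries = [None] * MAX_ENTRIES  # (address, data) pairs
        self.used = 0                        # number of live entries

    def insert(self, address, data):
        if self.used < MAX_ENTRIES:
            self.entries[self.used] = (address, data)
            self.used += 1

    def lookup(self, address):
        # Only scan the used prefix, never all 1024 slots.
        for i in range(self.used):
            addr, data = self.entries[i]
            if addr == address:
                return data
        return None

    def clear(self):
        # Only reset the used prefix.
        for i in range(self.used):
            self.entries[i] = None
        self.used = 0

cache = PageCache()
cache.insert(0x1000, b"page-a")
cache.insert(0x2000, b"page-b")
assert cache.lookup(0x1000) == b"page-a"
cache.clear()
assert cache.used == 0 and cache.lookup(0x1000) is None
```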
@pablogsal
Member Author

@maurycy do you mind reviewing this PR?

pablogsal added 3 commits May 10, 2026 20:22
Use the frame cache to predict the next thread state and top frame
address, then batch interpreter/thread/frame reads with process_vm_readv
when profiling a Linux target. Reuse prefetched frame buffers in the
frame walker when the prediction is valid.
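The batching itself lives in the profiler's C code; the following ctypes sketch only illustrates the syscall it relies on, by reading two disjoint ranges from our own process in a single `process_vm_readv` call (Linux-only; `batch_read` and the request format are made up for illustration):

```python
import ctypes
import ctypes.util
import os
import sys

class iovec(ctypes.Structure):
    _fields_ = [("iov_base", ctypes.c_void_p), ("iov_len", ctypes.c_size_t)]

def batch_read(pid, requests):
    """requests: list of (remote_address, size); returns one bytes per request."""
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    n = len(requests)
    local_iov = (iovec * n)()
    remote_iov = (iovec * n)()
    buffers = []
    for i, (addr, size) in enumerate(requests):
        buf = ctypes.create_string_buffer(size)
        buffers.append(buf)
        local_iov[i].iov_base = ctypes.cast(buf, ctypes.c_void_p)
        local_iov[i].iov_len = size
        remote_iov[i].iov_base = addr
        remote_iov[i].iov_len = size
    # One syscall services all the reads instead of one syscall per struct.
    nread = libc.process_vm_readv(pid, local_iov, n, remote_iov, n, 0)
    if nread < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return [buf.raw for buf in buffers]

if sys.platform == "linux":
    data = b"interp-state"
    addr = ctypes.cast(ctypes.c_char_p(data), ctypes.c_void_p).value
    chunks = batch_read(os.getpid(), [(addr, 6), (addr + 7, 5)])
    assert chunks == [b"interp", b"state"]
```

Reading from our own pid keeps the demo self-contained; the profiler passes the target's pid and the predicted interpreter/thread/frame addresses instead.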

Cache the last FrameInfo tuple per code object/instruction offset, reuse
cached thread id objects, and append cached parent frames directly on
full frame-cache hits. This cuts Python allocation churn in the
steady-state profiler path.
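The object-reuse idea can be sketched in a few lines: key the last result by (code object address, instruction offset) so a steady-state cache hit returns the previously built tuple instead of allocating a new one. The function name, cache layout, and tuple shape below are hypothetical simplifications of the real FrameInfo path:

```python
# Hypothetical sketch: memoize the FrameInfo tuple per (code addr, offset).
_frame_info_cache = {}

def make_frame_info(code_addr, instr_offset, filename, funcname, lineno):
    key = (code_addr, instr_offset)
    cached = _frame_info_cache.get(key)
    if cached is not None:
        return cached                      # no new allocation on the hot path
    info = (filename, lineno, funcname)    # built once per (code, offset)
    _frame_info_cache[key] = info
    return info

a = make_frame_info(0x1000, 24, "app.py", "work", 17)
b = make_frame_info(0x1000, 24, "app.py", "work", 17)
assert a is b                              # the second call reuses the cached tuple
assert a == ("app.py", 17, "work")
```

Since a sampled program spends most of its time in the same few frames, this turns the common case into a dictionary hit with zero new Python objects.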