
gh-149584: Fix excessive overhead in the Tachyon profiler regarding the cache behavior #149649

Open

pablogsal wants to merge 5 commits into python:main from pablogsal:speedups

Conversation


@pablogsal pablogsal commented May 10, 2026

Some ideas after the discussion in the issue with @maurycy. The profiler was spending too much time on repeated remote-memory bookkeeping, full remote page reads for small fixed-size structs, repeated remote writes of unchanged frame-cache state, and Python object allocation churn on steady-state frame-cache hits.

This PR improves the profiler by:

  • reading hot interpreter/thread/frame structs with exact remote reads instead of page-cache reads
  • tracking the number of live page-cache entries so cache clear/search only scans the used prefix
  • batching predicted interpreter/thread/frame reads with process_vm_readv() on Linux
  • reusing cached FrameInfo and thread id objects when frame-cache hits dominate

The last_profiled_frame remote-write suppression is already present in current upstream/main, so this branch keeps that baseline behavior and builds on top of it.
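The write-suppression baseline mentioned above can be sketched as "only push `last_profiled_frame` to the target when the value actually changed". The class and callback names below are made up for illustration; the real code does a remote process write rather than calling a Python function.

```python
# Hypothetical sketch of remote-write suppression: track the last value
# written per thread state and skip the remote write when it is unchanged.
class RemoteFrameMarker:
    def __init__(self, writer):
        self.writer = writer       # callable(addr, value) doing the remote write
        self.last_written = {}     # thread state addr -> last frame addr written

    def mark(self, tstate_addr, frame_addr):
        if self.last_written.get(tstate_addr) == frame_addr:
            return False           # unchanged: suppress the remote write
        self.writer(tstate_addr, frame_addr)
        self.last_written[tstate_addr] = frame_addr
        return True

writes = []
marker = RemoteFrameMarker(lambda addr, value: writes.append((addr, value)))
assert marker.mark(0x10, 0xAA) is True    # first write goes through
assert marker.mark(0x10, 0xAA) is False   # repeat is suppressed
assert marker.mark(0x10, 0xBB) is True    # changed value is written
assert writes == [(0x10, 0xAA), (0x10, 0xBB)]
```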

Benchmark

Benchmarked with:

./python Tools/inspection/benchmark_external_inspection.py --duration 3

For the per-commit measurements, I used the same benchmark workload in quiet mode with cache_frames=True and all_threads=True.

| Step | Work rate | Sample rate | Avg call time | Incremental win |
|---|---|---|---|---|
| upstream/main baseline | 382,593/s | 363,963/s | 2.614 us | - |
| Exact reads for hot structs | 549,741/s | 514,642/s | 1.819 us | +167,148/s (+43.7%) |
| Track live page-cache entries | 747,946/s | 688,282/s | 1.337 us | +198,205/s (+36.1%) |
| Batch predicted remote reads | 897,703/s | 806,084/s | 1.114 us | +149,756/s (+20.0%) |
| Reuse profiler result objects | 1,187,314/s | 1,047,698/s | 0.842 us | +289,611/s (+32.3%) |

Final benchmark using the benchmark script:

| Metric | Baseline | Final | Change |
|---|---|---|---|
| Work rate | 382,593/s | 1,180,434/s | 3.08x faster |
| Sample rate | 363,963/s | 1,038,965/s | 2.85x faster |
| Avg call time | 2.614 us | 0.85 us | -67.5% |
| Success rate | 99.67% | 99.84% | +0.17 pp |

Final benchmark output:

Average call time:   0.85 us
Work rate:           1,180,434.2 calls/sec
Sample rate:         1,038,964.9 samples/sec
Success rate:        99.84%

pablogsal added 2 commits May 10, 2026 19:31
Use exact remote reads for interpreter state, thread state, and
interpreter frame structs instead of pulling full remote pages into the
profiler page cache. This matches the core change from
python#149585.
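The difference between a page-cache read and an exact read can be sketched with a toy "remote memory" buffer. The function names and the 16-byte struct are made up for illustration; the point is that the exact read transfers only the struct's bytes instead of the whole enclosing 4 KiB page.

```python
import struct

PAGE_SIZE = 4096

# Stand-in for the target's memory: a flat buffer indexed by "address",
# with a fake 16-byte struct (two little-endian u64s) placed at 5000.
remote_memory = bytearray(3 * PAGE_SIZE)
struct.pack_into("<QQ", remote_memory, 5000, 0xDEAD, 0xBEEF)

def read_remote(addr, size):
    # In the real profiler this is a remote process read; here it is a slice.
    return bytes(remote_memory[addr:addr + size])

def read_via_page_cache(addr, size):
    # Old behavior (sketch): pull the whole enclosing page, then slice.
    page_start = addr & ~(PAGE_SIZE - 1)
    page = read_remote(page_start, PAGE_SIZE)          # 4096 bytes transferred
    return page[addr - page_start:addr - page_start + size]

def read_exact(addr, size):
    # New behavior (sketch): read only the bytes of the struct.
    return read_remote(addr, size)                     # 16 bytes transferred

# Both paths return the same struct; the exact read just moves far less data.
assert read_via_page_cache(5000, 16) == read_exact(5000, 16)
assert struct.unpack("<QQ", read_exact(5000, 16)) == (0xDEAD, 0xBEEF)
```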

The profiler clears the page cache between samples, so live entries are
always packed at the front. Track the live count and only clear/search
that prefix instead of scanning all 1024 slots on the hot path.
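Because the cache is cleared between samples, live entries are always packed at slots `[0, used)`, so a live-entry counter is enough to bound both lookup and clear. A minimal sketch (class and field names are hypothetical, and the real table lives in C):

```python
MAX_ENTRIES = 1024

class PageCache:
    """Sketch of a fixed-size cache whose live entries are packed at the front."""

    def __init__(self):
        self.entries = [None] * MAX_ENTRIES  # (address, data) pairs
        self.used = 0                        # number of live entries

    def insert(self, address, data):
        if self.used < MAX_ENTRIES:
            self.entries[self.used] = (address, data)
            self.used += 1

    def lookup(self, address):
        # Only scan the used prefix, never all 1024 slots.
        for i in range(self.used):
            addr, data = self.entries[i]
            if addr == address:
                return data
        return None

    def clear(self):
        # Only reset the used prefix.
        for i in range(self.used):
            self.entries[i] = None
        self.used = 0

cache = PageCache()
cache.insert(0x1000, b"page-a")
cache.insert(0x2000, b"page-b")
assert cache.lookup(0x1000) == b"page-a"
cache.clear()
assert cache.used == 0 and cache.lookup(0x1000) is None
```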
@pablogsal
Member Author

@maurycy do you mind reviewing this PR?

pablogsal added 3 commits May 10, 2026 20:22
Use the frame cache to predict the next thread state and top frame
address, then batch interpreter/thread/frame reads with process_vm_readv
when profiling a Linux target. Reuse prefetched frame buffers in the
frame walker when the prediction is valid.
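The batching itself lives in the profiler's C code; the following ctypes sketch only illustrates the syscall it relies on, by reading two disjoint ranges from our own process in a single `process_vm_readv` call (Linux-only; `batch_read` and the request format are made up for illustration):

```python
import ctypes
import ctypes.util
import os
import sys

class iovec(ctypes.Structure):
    _fields_ = [("iov_base", ctypes.c_void_p), ("iov_len", ctypes.c_size_t)]

def batch_read(pid, requests):
    """requests: list of (remote_address, size); returns one bytes per request."""
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    n = len(requests)
    local_iov = (iovec * n)()
    remote_iov = (iovec * n)()
    buffers = []
    for i, (addr, size) in enumerate(requests):
        buf = ctypes.create_string_buffer(size)
        buffers.append(buf)
        local_iov[i].iov_base = ctypes.cast(buf, ctypes.c_void_p)
        local_iov[i].iov_len = size
        remote_iov[i].iov_base = addr
        remote_iov[i].iov_len = size
    # One syscall services all the reads instead of one syscall per struct.
    nread = libc.process_vm_readv(pid, local_iov, n, remote_iov, n, 0)
    if nread < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return [buf.raw for buf in buffers]

if sys.platform == "linux":
    data = b"interp-state"
    addr = ctypes.cast(ctypes.c_char_p(data), ctypes.c_void_p).value
    chunks = batch_read(os.getpid(), [(addr, 6), (addr + 7, 5)])
    assert chunks == [b"interp", b"state"]
```

Reading from our own pid keeps the demo self-contained; the profiler passes the target's pid and the predicted interpreter/thread/frame addresses instead.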

Cache the last FrameInfo tuple per code object/instruction offset, reuse
cached thread id objects, and append cached parent frames directly on
full frame-cache hits. This cuts Python allocation churn in the
steady-state profiler path.
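The object-reuse idea can be sketched in a few lines: key the last result by (code object address, instruction offset) so a steady-state cache hit returns the previously built tuple instead of allocating a new one. The function name, cache layout, and tuple shape below are hypothetical simplifications of the real FrameInfo path:

```python
# Hypothetical sketch: memoize the FrameInfo tuple per (code addr, offset).
_frame_info_cache = {}

def make_frame_info(code_addr, instr_offset, filename, funcname, lineno):
    key = (code_addr, instr_offset)
    cached = _frame_info_cache.get(key)
    if cached is not None:
        return cached                      # no new allocation on the hot path
    info = (filename, lineno, funcname)    # built once per (code, offset)
    _frame_info_cache[key] = info
    return info

a = make_frame_info(0x1000, 24, "app.py", "work", 17)
b = make_frame_info(0x1000, 24, "app.py", "work", 17)
assert a is b                              # the second call reuses the cached tuple
assert a == ("app.py", 17, "work")
```

Since a sampled program spends most of its time in the same few frames, this turns the common case into a dictionary hit with zero new Python objects.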