gh-149584: Fix excessive overhead in the Tachyon profiler regarding the cache behavior #149649
Open

pablogsal wants to merge 5 commits into python:main
Conversation
Use exact remote reads for interpreter state, thread state, and interpreter frame structs instead of pulling full remote pages into the profiler page cache. This matches the core change from python#149585.
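The difference between the two read strategies can be sketched in Python. This is an illustrative model, not the actual C implementation: `RemoteReader`, `read_page`, and `read_exact` are hypothetical names, and the byte counter stands in for the real cross-process transfer cost.

```python
PAGE_SIZE = 4096

class RemoteReader:
    """Hypothetical sketch: fetch remote memory either by pulling the
    whole enclosing page into a cache, or with an exact-size read."""

    def __init__(self, remote_memory: bytes):
        self.remote_memory = remote_memory  # stands in for the target process
        self.bytes_transferred = 0

    def read_page(self, addr: int) -> bytes:
        # Old behavior: always transfer the full page containing `addr`.
        page_start = addr & ~(PAGE_SIZE - 1)
        data = self.remote_memory[page_start:page_start + PAGE_SIZE]
        self.bytes_transferred += len(data)
        return data

    def read_exact(self, addr: int, size: int) -> bytes:
        # New behavior: transfer only the bytes the struct actually needs.
        data = self.remote_memory[addr:addr + size]
        self.bytes_transferred += size
        return data

mem = bytes(range(256)) * 64  # 16 KiB of fake remote memory
reader = RemoteReader(mem)

page = reader.read_page(5000)         # transfers a full 4096-byte page
exact = reader.read_exact(5000, 120)  # transfers just 120 bytes
```

For small fixed-size structs the exact read moves roughly two orders of magnitude fewer bytes per access, which is the core of the change borrowed from python#149585.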
The profiler clears the page cache between samples, so live entries are always packed at the front. Track the live count and only clear/search that prefix instead of scanning all 1024 slots on the hot path.
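The live-prefix idea can be shown with a small sketch. The class and field names here are illustrative, not the actual C struct fields; the point is that because the cache is fully cleared between samples, occupied slots are always packed at the front, so clearing and lookup only touch the first `live` slots.

```python
class PageCache:
    """Sketch (illustrative names) of the live-count optimization."""

    SLOTS = 1024

    def __init__(self):
        self.entries = [None] * self.SLOTS
        self.live = 0  # number of occupied slots, packed at the front

    def insert(self, page_addr, data):
        # New entries always go at the end of the live prefix.
        if self.live < self.SLOTS:
            self.entries[self.live] = (page_addr, data)
            self.live += 1

    def lookup(self, page_addr):
        # Hot path: scan only the live prefix, not all 1024 slots.
        for i in range(self.live):
            if self.entries[i][0] == page_addr:
                return self.entries[i][1]
        return None

    def clear(self):
        # Between samples: drop only the occupied prefix.
        for i in range(self.live):
            self.entries[i] = None
        self.live = 0

cache = PageCache()
cache.insert(0x1000, b"interp page")
cache.insert(0x2000, b"frame page")
```

With only a handful of live entries per sample, this turns a fixed 1024-slot scan into a scan of a few slots.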
pablogsal (Member, Author): @maurycy do you mind reviewing this PR?
Use the frame cache to predict the next thread state and top frame address, then batch interpreter/thread/frame reads with process_vm_readv when profiling a Linux target. Reuse prefetched frame buffers in the frame walker when the prediction is valid.
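The prediction-plus-batching flow can be modeled abstractly. This is a minimal sketch, not the real C code: `BatchedSampler` and its methods are hypothetical, and `batched_read` mimics a single `process_vm_readv()` call carrying one iovec per struct, while the cold path must chase pointers with dependent reads.

```python
class BatchedSampler:
    """Illustrative model: when the previous sample's cache correctly
    predicts the thread-state and top-frame addresses, all three structs
    are fetched in one batched read instead of three dependent ones."""

    STRUCT_SIZE = 64  # arbitrary stand-in for the real struct sizes

    def __init__(self, remote_memory: bytes):
        self.remote_memory = remote_memory
        self.syscalls = 0

    def batched_read(self, requests):
        # One "syscall" services every (addr, size) request, like
        # process_vm_readv with multiple iovecs.
        self.syscalls += 1
        return [self.remote_memory[a:a + s] for a, s in requests]

    def sample(self, interp_addr, tstate_addr, frame_addr, prediction):
        s = self.STRUCT_SIZE
        if prediction == (tstate_addr, frame_addr):
            # Hot path: prediction valid, fetch all three structs at once.
            return self.batched_read(
                [(interp_addr, s), (tstate_addr, s), (frame_addr, s)])
        # Cold path: each read depends on a pointer found in the previous
        # one, so the structs must be fetched one syscall at a time.
        out = []
        for addr in (interp_addr, tstate_addr, frame_addr):
            out += self.batched_read([(addr, s)])
        return out

sampler = BatchedSampler(bytes(4096))
cold = sampler.sample(0, 256, 512, prediction=None)       # 3 syscalls
hot = sampler.sample(0, 256, 512, prediction=(256, 512))  # 1 syscall
```

In the steady state the prediction almost always holds, so the per-sample syscall count drops from three to one.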
Cache the last FrameInfo tuple per code object/instruction offset, reuse cached thread id objects, and append cached parent frames directly on full frame-cache hits. This cuts Python allocation churn in the steady-state profiler path.
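The allocation-churn fix amounts to memoizing the result object per key. A minimal sketch under assumed names (`FrameInfoCache` and its fields are hypothetical, and a plain tuple stands in for the real `FrameInfo`):

```python
class FrameInfoCache:
    """Sketch: reuse the same FrameInfo object for a given
    (code object address, instruction offset) instead of allocating a
    fresh tuple on every sample."""

    def __init__(self):
        self._cache = {}
        self.allocations = 0  # counts actual object constructions

    def get(self, code_addr, instr_offset, filename, funcname, lineno):
        key = (code_addr, instr_offset)
        info = self._cache.get(key)
        if info is None:
            # Miss: build the FrameInfo once and remember it.
            info = (filename, lineno, funcname)
            self._cache[key] = info
            self.allocations += 1
        return info

cache = FrameInfoCache()
first = cache.get(0x7f00, 10, "app.py", "work", 42)
second = cache.get(0x7f00, 10, "app.py", "work", 42)
```

When the same frames appear sample after sample, every hit returns the identical object, so the steady-state path allocates nothing.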
Some ideas after the discussion in the issue with @maurycy. The profiler was spending too much time on repeated remote-memory bookkeeping, full remote page reads for small fixed-size structs, repeated remote writes of unchanged frame-cache state, and Python object allocation churn on steady-state frame-cache hits.
This PR improves the profiler by:
- batching remote reads with `process_vm_readv()` on Linux
- reusing cached `FrameInfo` and thread id objects when frame-cache hits dominate

The `last_profiled_frame` remote-write suppression is already present in current `upstream/main`, so this branch keeps that baseline behavior and builds on top of it.

Benchmark
Benchmarked with:
For the per-commit measurements, I used the same benchmark workload in quiet mode with `cache_frames=True` and `all_threads=True`, against the `upstream/main` baseline.

Final benchmark using the benchmark script:
Final benchmark output:
_remote_debugging: reading whole pages over and over #149584