fix(postgresql-proxy): prevent SSL COPY stalls by draining nonblocking reads #11
f8e9f34 to 8e50151
…g reads Co-authored-by: GitHub Copilot <copilot@github.com>
2a9574a to e7d5df8
Co-authored-by: GitHub Copilot <copilot@github.com>
e7d5df8 to 1c51600
pinzon left a comment
So basically, the SSL handshake is now done incrementally in the event loop instead of blocking on accept, and once it's complete we keep draining the SSL buffer until it's empty so data doesn't get stranded. Right?
Thank you for also adding a test workflow; now we can test changes here without relying solely on LS. I'll leave the final say to @bentsku or @cloutierMat.
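The incremental handshake described in the comment above could look roughly like the sketch below. It is not the PR's code: `handshake_step` is a hypothetical helper name, and it assumes the socket was wrapped with the handshake deferred and is in non-blocking mode.

```python
import selectors
import ssl


def handshake_step(sock):
    """Advance a non-blocking TLS handshake by one step.

    Returns (done, events). done is True once the handshake has finished;
    otherwise events is the selector mask to wait for before retrying.
    (Hypothetical helper, not taken from the PR.)
    """
    try:
        sock.do_handshake()
    except ssl.SSLWantReadError:
        # The TLS layer needs more bytes from the peer before it can continue.
        return False, selectors.EVENT_READ
    except ssl.SSLWantWriteError:
        # The TLS layer has bytes to send but the socket is not writable yet.
        return False, selectors.EVENT_WRITE
    # Handshake complete: go back to waiting for application data.
    return True, selectors.EVENT_READ
```

In an event loop, the accepted socket would typically be wrapped with `context.wrap_socket(conn, server_side=True, do_handshake_on_connect=False)`, registered with the selector, and passed to `handshake_step` each time it becomes ready, until `done` is True. That way a slow client only delays its own connection rather than the whole loop.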
cloutierMat left a comment
This PR does so many things that it is hard to review properly. I think the issue is much simpler than this, and there is a lot of rewriting of ssl/socket functionality that is most likely handled properly by the respective libraries.
Maybe it is also tackling many other issues I don't know of, but a bit of research revealed this Stack Overflow post.
My understanding is that self.listen calls selector.select, but there might still be decrypted data buffered in the SSL socket, and select would never see it, so it would be left unread. There might be a smarter way to solve this, but using the sample in the post I could fix the SSL hanging issue by changing service_connection to handle the pending bytes reported by sock.pending():
    if recv_data:
        LOG.debug('%s received data:\n%s', conn.name, recv_data)
        conn.received(recv_data)
        # excerpt from https://docs.python.org/3/library/ssl.html#ssl-nonblocking
        # Conversely, since the SSL layer has its own framing, a SSL socket may still have data available
        # for reading without select() being aware of it. Therefore, you should first call SSLSocket.recv()
        # to drain any potentially available data, and then only block on a select() call if still necessary.
        while isinstance(sock, ssl.SSLSocket) and sock.pending() > 0:
            extra = sock.recv(min(4096, sock.pending()))
            if extra:
                LOG.debug('%s received pending SSL data:\n%s', conn.name, extra)
                conn.received(extra)

There might still be something cleaner we can do by calling sock.recv in listen, but I think this PR goes a bit overboard and makes a lot of changes whose impact would be hard to measure.
I love the addition of a pipeline and cleaner testing, but I would also argue that it would be easier to review and comment on if we separated it into multiple PRs.
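As a standalone illustration of the drain step suggested in this comment, here is a minimal sketch. `drain_pending` and `on_data` are hypothetical names, and the `isinstance` check is swapped for duck typing on `pending()` so the helper can be exercised without a real TLS connection.

```python
def drain_pending(sock, on_data, chunk_size=4096):
    """Read bytes the SSL layer has already decrypted but select() cannot see.

    Returns the number of bytes handed to on_data.
    (Hypothetical helper, not taken from the PR or the proxy.)
    """
    # Plain sockets have no pending(); only ssl.SSLSocket exposes it.
    pending = getattr(sock, "pending", None)
    total = 0
    while pending is not None and pending() > 0:
        extra = sock.recv(min(chunk_size, pending()))
        if not extra:
            # Nothing came back despite pending() > 0: stop rather than spin.
            break
        on_data(extra)
        total += len(extra)
    return total
```

Called right after the regular `recv` in the read handler, for example `drain_pending(sock, conn.received)`, this empties the already-decrypted buffer before control returns to `select()`.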
Motivation
Fixes UNC-470: Postgres proxy could hang for sslmode=require sessions running large psql -f / COPY-heavy workloads.
Root cause
service_connection() did not reliably make forward progress for non-blocking SSL traffic under heavy load. Additionally, SSL negotiation in the accept path could block the entire selector loop.
Changes (postgresql_proxy/proxy.py)
EVENT_READ cycle (instead of one-shot).
SSLWantReadError SSLWantWriteError SSLWantReadError during send (not enable)
BlockingIOError (kernel send buffer full) as backpressure, not connection closure.
_flush_outgoing() and _set_write_interest().
Expected impact
Prevents SSL COPY-session stalls, improves forward progress for large dump-like workloads, and eliminates selector blocking during new client acceptance.
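The backpressure behaviour listed under Changes can be sketched as follows. This is an illustration under assumptions, not the PR's `_flush_outgoing()`: the function name, the `bytearray` queue, and the boolean return value are all hypothetical.

```python
import ssl


def flush_outgoing(sock, outgoing):
    """Send as much of `outgoing` (a bytearray) as the socket accepts right now.

    Returns True if bytes are still queued, meaning the caller should keep
    EVENT_WRITE registered for this socket; False once the queue is empty.
    (Hypothetical sketch, not the PR's implementation.)
    """
    while outgoing:
        try:
            sent = sock.send(outgoing)
        except (BlockingIOError, ssl.SSLWantWriteError):
            # Send buffer full: treat as backpressure, not as a closed connection.
            return True
        if sent == 0:
            # Defensive: no progress was made, so retry on the next writable event.
            return True
        # Drop only the bytes that were actually written.
        del outgoing[:sent]
    return False
```

The caller would enable write interest in the selector while this returns True and disable it once it returns False, so a slow reader on one side pauses that connection instead of tearing it down.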
Test coverage