fix(postgresql-proxy): prevent SSL COPY stalls by draining nonblocking reads#11

Open
nik-localstack wants to merge 2 commits into master from unc-472-postgresql-proxy-drain-nonblocking-ssl-reads-to-prevent-copy

@nik-localstack nik-localstack commented May 7, 2026

Motivation

Fixes UNC-470: Postgres proxy could hang for sslmode=require sessions running large psql -f / COPY-heavy workloads.

Root cause

service_connection() did not reliably make forward progress for non-blocking SSL traffic under heavy load. Additionally, SSL negotiation in the accept path could block the entire selector loop.

Changes (postgresql_proxy/proxy.py)

  • Drain readable bytes in a loop per EVENT_READ cycle (instead of one-shot).
  • Handle SSL non-blocking states correctly:
    • Break on SSLWantReadError
    • Enable write interest on SSLWantWriteError
    • Disable write interest on SSLWantReadError during send (not enable)
  • Treat BlockingIOError (kernel send buffer full) as backpressure, not connection closure.
  • Move PostgreSQL SSLRequest detection from blocking accept-path to non-blocking selector loop to prevent accept stalls.
  • Explicit outbound flush path with _flush_outgoing() and _set_write_interest().
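The read and write handling listed above can be sketched roughly as follows. This is a minimal illustration, not the code in postgresql_proxy/proxy.py: the helper names drain_readable and flush_outgoing, the string return values, and the on_data callback are assumptions made for the example, and the caller is assumed to translate the results into selector interest changes.

```python
import ssl


def drain_readable(sock, on_data, chunk_size=4096):
    """Read from a non-blocking socket until it would block (hypothetical helper).

    Returns "open" when nothing more can be read right now, "closed" on EOF,
    and "want_write" when the TLS layer must write before it can read again.
    """
    while True:
        try:
            data = sock.recv(chunk_size)
        except ssl.SSLWantReadError:
            # No complete TLS record available yet: stop and wait for EVENT_READ.
            return "open"
        except ssl.SSLWantWriteError:
            # TLS needs to send first: the caller should enable write interest.
            return "want_write"
        except BlockingIOError:
            # Plain socket with nothing left to read.
            return "open"
        if not data:
            return "closed"
        on_data(data)


def flush_outgoing(sock, pending):
    """Send as much of the bytearray `pending` as possible (hypothetical helper).

    Returns True if write interest should stay enabled, False otherwise.
    """
    while pending:
        try:
            sent = sock.send(pending)
        except ssl.SSLWantReadError:
            # TLS is waiting for inbound data; polling for writability would spin.
            return False
        except (ssl.SSLWantWriteError, BlockingIOError):
            # Send buffer is full: this is backpressure, not a closed connection.
            return True
        del pending[:sent]
    return False
```

A caller would keep unsent bytes in a per-connection bytearray and toggle EVENT_WRITE on the selector according to the value returned by flush_outgoing.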

Expected impact

Prevents SSL COPY-session stalls, improves forward progress for large dump-like workloads, and eliminates selector blocking during new client acceptance.

Test coverage

  • Added basic test coverage and a specific test, test_psql_ssl_file_batch_stress_no_hang, for the UNC-470 scenario
  • Added support for testing in CI

@nik-localstack nik-localstack self-assigned this May 7, 2026
@nik-localstack nik-localstack force-pushed the unc-472-postgresql-proxy-drain-nonblocking-ssl-reads-to-prevent-copy branch 7 times, most recently from f8e9f34 to 8e50151 Compare May 8, 2026 11:49
…g reads

Co-authored-by: GitHub Copilot <copilot@github.com>
@nik-localstack nik-localstack force-pushed the unc-472-postgresql-proxy-drain-nonblocking-ssl-reads-to-prevent-copy branch 3 times, most recently from 2a9574a to e7d5df8 Compare May 8, 2026 11:58
Co-authored-by: GitHub Copilot <copilot@github.com>
@nik-localstack nik-localstack force-pushed the unc-472-postgresql-proxy-drain-nonblocking-ssl-reads-to-prevent-copy branch from e7d5df8 to 1c51600 Compare May 8, 2026 12:00
@nik-localstack nik-localstack marked this pull request as ready for review May 8, 2026 12:15
Member

@pinzon pinzon left a comment

So basically, the SSL handshake is now done incrementally in the event loop instead of blocking on accept, and once it's complete we keep draining the SSL buffer until it's empty so data doesn't get stranded. Right?
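For readers following along, the incremental approach could look something like the sketch below. It is not taken from the PR: classify_startup and handshake_step are hypothetical names, and the sketch assumes the TLS socket was created with ssl.SSLContext.wrap_socket(..., server_side=True, do_handshake_on_connect=False) on a non-blocking socket. The SSLRequest layout (length 8, request code 80877103) comes from the PostgreSQL wire protocol.

```python
import ssl
import struct

# PostgreSQL SSLRequest: Int32 length (8) followed by Int32 code 80877103.
SSL_REQUEST = struct.pack("!ii", 8, 80877103)


def classify_startup(buffer):
    """Classify the first bytes a client sent (hypothetical helper).

    Returns "need_more" until 8 bytes are buffered, "ssl_request" if they
    form an SSLRequest, and "other" for any other startup packet.
    """
    if len(buffer) < len(SSL_REQUEST):
        return "need_more"
    if buffer[: len(SSL_REQUEST)] == SSL_REQUEST:
        return "ssl_request"
    return "other"


def handshake_step(tls_sock):
    """Advance a non-blocking TLS handshake by one step (hypothetical helper).

    Returns "done" when the handshake has completed, otherwise the event
    ("want_read" or "want_write") the selector should wait for next.
    """
    try:
        tls_sock.do_handshake()
    except ssl.SSLWantReadError:
        return "want_read"
    except ssl.SSLWantWriteError:
        return "want_write"
    return "done"
```

The event loop would call handshake_step each time the socket becomes ready and only start proxying data once it returns "done", so a slow client never holds up the accept path.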

Thank you for also adding a test workflow. Now we can test changes here without relying solely on LS. I'll leave the final say to @bentsku or @cloutierMat

Member

@cloutierMat cloutierMat left a comment

This PR does so many things that it is hard to review properly. I think the issue is much simpler than this, and there is a lot of rewriting of ssl/socket functionality that is most likely already handled properly by the respective libraries.

Maybe it is also tackling other issues I don't know of, but a bit of research turned up this Stack Overflow post.

My understanding here is that self.listen calls selector.select, but decrypted data may still be buffered inside the SSL socket, where select cannot see it, so it is never read. There might be a smarter way to solve this, but using the sample in the post I could fix the SSL hanging issue by changing service_connection to handle the pending bytes reported by sock.pending():

                if recv_data:
                    LOG.debug('%s received data:\n%s', conn.name, recv_data)
                    conn.received(recv_data)
                    # excerpt from https://docs.python.org/3/library/ssl.html#ssl-nonblocking
                    # Conversely, since the SSL layer has its own framing, a SSL socket may still have data available
                    # for reading without select() being aware of it. Therefore, you should first call SSLSocket.recv()
                    # to drain any potentially available data, and then only block on a select() call if still necessary.
                    while isinstance(sock, ssl.SSLSocket) and sock.pending() > 0:
                        extra = sock.recv(min(4096, sock.pending()))
                        if extra:
                            LOG.debug('%s received pending SSL data:\n%s', conn.name, extra)
                            conn.received(extra)

There might still be something cleaner we can do by calling sock.recv in listen, but I think this PR is kinda overboard and makes a lot of changes whose impact would be hard to measure.
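One possible shape for handling this on the listen side, purely as an illustration: before blocking in select(), check whether any registered TLS socket already holds decrypted bytes. The function name ready_sockets and the idea of returning a list of sockets are assumptions made for this sketch, not something the proxy currently does.

```python
import selectors
import ssl


def ready_sockets(selector, timeout=None):
    """Return sockets that should be serviced now (hypothetical helper).

    TLS sockets with already-decrypted bytes are included even though
    select() would not report them as readable.
    """
    buffered = [
        key.fileobj
        for key in selector.get_map().values()
        if isinstance(key.fileobj, ssl.SSLSocket) and key.fileobj.pending() > 0
    ]
    # Do not block if some plaintext is already waiting to be consumed.
    events = selector.select(0 if buffered else timeout)
    ready = list(buffered)
    for key, _mask in events:
        if key.fileobj not in ready:
            ready.append(key.fileobj)
    return ready
```

This keeps the per-connection read path unchanged, at the cost of scanning every registered socket on each loop iteration.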

I love the addition of a pipeline and cleaner testing, but I would also argue that it would be easier to review and comment on if we separated it into multiple PRs.
