Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: listener retries with exponential backoff #3536

Merged
merged 1 commit into from
May 20, 2024

Conversation

steve-chavez
Copy link
Member

@steve-chavez steve-chavez commented May 19, 2024

Continues #3533

Now it works like:

$ postgrest-with-postgresql-16 --replica -f test/spec/fixtures/load.sql postgrest-run

19/May/2024:20:24:28 -0500: Retrying listening for notifications in 1 seconds...
19/May/2024:20:24:28 -0500: Failed listening for notifications on the "pgrst" channel. ERROR:  cannot execute LISTEN during recovery
19/May/2024:20:24:30 -0500: Retrying listening for notifications in 2 seconds...
19/May/2024:20:24:30 -0500: Failed listening for notifications on the "pgrst" channel. ERROR:  cannot execute LISTEN during recovery
19/May/2024:20:24:34 -0500: Retrying listening for notifications in 4 seconds...
19/May/2024:20:24:34 -0500: Failed listening for notifications on the "pgrst" channel. ERROR:  cannot execute LISTEN during recovery

Limitation

Once the listen channel is recovered, the retry status is not reset. So if the last backoff was 4 seconds, the next time recovery kicks in the backoff will be 8 seconds. If a recovery reaches 32 seconds, the backoff will stay there.

This is a problem with the interaction between hasql-notifications and retry. The Hasql.Notifications.waitForNotifications uses a forever loop that only finishes when it throws an exception, retry recovers on an exception and succeeds (here it restarts the retry status) when the function finishes .

I've left this as a TODO for now, it's still better than retrying without backoff.

@steve-chavez steve-chavez marked this pull request as draft May 19, 2024 20:14
@steve-chavez
Copy link
Member Author

steve-chavez commented May 19, 2024

Really puzzled by this error that only happens on CI, not locally:

def set_statement_timeout(postgrest, role, milliseconds):
        """Set the statement timeout for the given role.
        For this to work reliably with low previous timeout settings,
        use a postgrest instance that doesn't use the affected role."""
    
        response = postgrest.session.post(
            "/rpc/set_statement_timeout", data={"role": role, "milliseconds": milliseconds}
        )
>       assert response.text == ""
E       assert '{"code":"57P01","details":null,"hint":null,"message":"terminating connection due to administrator command"}' == ''
E         
E         + {"code":"57P01","details":null,"hint":null,"message":"terminating connection due to administrator command"}

Whenever io tests fail, they're really hard to debug.

@steve-chavez
Copy link
Member Author

steve-chavez commented May 19, 2024

It's like something is calling pg_terminate_backend. AFAICT there's no relationship between statement_timeout and LISTEN.

The only place that's done is here:

postgrest/test/io/test_io.py

Lines 1220 to 1222 in 1ad6dc6

os.system(
f'psql -d {defaultenv["PGDATABASE"]} -U {defaultenv["PGUSER"]} -h {defaultenv["PGHOST"]} --set ON_ERROR_STOP=1 -a -c "SELECT terminate_pgrst()"'
)

Managed to reproduce the error locally, the only related db logs:

2024-05-19 21:16:48.888 UTC [908423] LOG:  statement: SELECT terminate_pgrst()
2024-05-19 21:16:48.889 UTC [907910] FATAL:  terminating connection due to administrator command
2024-05-19 21:16:48.889 UTC [908419] FATAL:  terminating connection due to administrator command
2024-05-19 21:16:48.889 UTC [908420] FATAL:  terminating connection due to administrator command

@steve-chavez
Copy link
Member Author

steve-chavez commented May 19, 2024

It's like something is calling pg_terminate_backend

That was it, I just skipped this test:

postgrest/test/io/test_io.py

Lines 1212 to 1226 in 4416050

def test_fail_with_automatic_recovery_disabled_and_terminated_using_query(defaultenv):
"Should fail without retries when automatic recovery is disabled and pg_terminate_backend(pid) is called"
env = {
**defaultenv,
"PGRST_DB_POOL_AUTOMATIC_RECOVERY": "false",
}
with run(env=env) as postgrest:
os.system(
f'psql -d {defaultenv["PGDATABASE"]} -U {defaultenv["PGUSER"]} -h {defaultenv["PGHOST"]} --set ON_ERROR_STOP=1 -a -c "SELECT terminate_pgrst()"'
)
exitCode = wait_until_exit(postgrest)
assert exitCode == 1

And the tests no longer fail.

The thing that has changed in this PR is that the listener no longer waits for the connectionWorker (since the listener now has its own backoff), so looks like the pg_terminate_backend was affecting the LISTEN connection too.


I think this can be fixed by isolating terminate_pgrst with a particular application name.

Fixed on #3537

Also corrects the admin ready response which now considers the listener
state.
@steve-chavez steve-chavez merged commit 3cf5656 into PostgREST:main May 20, 2024
24 of 25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant