Fix and add test case for issue 7791 #7795

hlinnaka · 2024-05-17T15:53:14Z

This is a test case to repro issue #7791, and a fix.

Sometimes this fails with this assertion, which seems consistent with my hypothesis at #7791 (comment):

PG:TRAP: failed Assert("last_written_lsn <= result.request_lsn"), File: "/home/heikki/git-sandbox/neon//pgxn/neon/pagestore_smgr.c", Line: 1533, PID: 1337078
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(ExceptionalCondition+0xa2)[0x563ab48992d0]
/home/heikki/git-sandbox/neon/pg_install/v16/lib/postgresql/neon.so(+0x1219a)[0x7f6ed17e119a]
/home/heikki/git-sandbox/neon/pg_install/v16/lib/postgresql/neon.so(+0x13d77)[0x7f6ed17e2d77]
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(smgrread+0x36)[0x563ab46d13cb]
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(+0x5cd69e)[0x563ab468769e]
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(ReadBufferExtended+0x12b)[0x563ab4686c7f]
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(heapgetpage+0x392)[0x563ab41eb8d0]
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(+0x132374)[0x563ab41ec374]
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(heap_getnextslot+0x49)[0x563ab41ecaac]
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(+0x3daa15)[0x563ab4494a15]
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(+0x3daadd)[0x563ab4494add]
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(+0x39ac9d)[0x563ab4454c9d]
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(ExecScan+0x73)[0x563ab4454d12]
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(+0x3dab3c)[0x563ab4494b3c]
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(+0x396b93)[0x563ab4450b93]
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(+0x38a620)[0x563ab4444620]
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(+0x38d3d4)[0x563ab44473d4]
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(standard_ExecutorRun+0x1d5)[0x563ab4444c93]
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(ExecutorRun+0x59)[0x563ab4444abb]
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(+0x621594)[0x563ab46db594]
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(PortalRun+0x26c)[0x563ab46db246]
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(+0x61c200)[0x563ab46d6200]
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(PostgresMain+0x910)[0x563ab46d94ea]
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(+0x550ded)[0x563ab460aded]
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(+0x55073e)[0x563ab460a73e]
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(+0x54cdd9)[0x563ab4606dd9]
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(PostmasterMain+0x1377)[0x563ab460674e]
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(main+0x3c4)[0x563ab44caeec]
/lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x7f6ed105d24a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f6ed105d305]
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(_start+0x2a)[0x563ab418ba3a]
PG:2024-05-17 15:51:46.872 GMT [1337066] LOG:  server process (PID 1337078) was terminated by signal 6: Aborted
PG:2024-05-17 15:51:46.872 GMT [1337066] DETAIL:  Failed process was running: SELECT counter FROM test

But other times this hits the same assertion in the startup process:

PG:TRAP: failed Assert("last_written_lsn <= result.request_lsn"), File: "/home/heikki/git-sandbox/neon//pgxn/neon/pagestore_smgr.c", Line: 1533, PID: 1337339
postgres: startup recovering 000000010000000000000001(ExceptionalCondition+0xa2)[0x55c79dcde2d0]
/home/heikki/git-sandbox/neon/pg_install/v16/lib/postgresql/neon.so(+0x1219a)[0x7f8f1517e19a]
/home/heikki/git-sandbox/neon/pg_install/v16/lib/postgresql/neon.so(+0x13d77)[0x7f8f1517fd77]
postgres: startup recovering 000000010000000000000001(smgrread+0x36)[0x55c79db163cb]
postgres: startup recovering 000000010000000000000001(+0x5cd69e)[0x55c79dacc69e]
postgres: startup recovering 000000010000000000000001(ReadBufferWithoutRelcache+0x85)[0x55c79dacbd7e]
postgres: startup recovering 000000010000000000000001(XLogReadBufferExtended+0xf7)[0x55c79d6d8dcc]
postgres: startup recovering 000000010000000000000001(XLogReadBufferForRedoExtended+0x3a0)[0x55c79d6d8c5b]
postgres: startup recovering 000000010000000000000001(XLogReadBufferForRedo+0x36)[0x55c79d6d8882]
/home/heikki/git-sandbox/neon/pg_install/v16/lib/postgresql/neon_rmgr.so(+0x354c)[0x7f8f1538354c]
/home/heikki/git-sandbox/neon/pg_install/v16/lib/postgresql/neon_rmgr.so(+0x27e9)[0x7f8f153827e9]
postgres: startup recovering 000000010000000000000001(+0x1d4160)[0x55c79d6d3160]
postgres: startup recovering 000000010000000000000001(PerformWalRecovery+0x3de)[0x55c79d6d2cd1]
postgres: startup recovering 000000010000000000000001(StartupXLOG+0x934)[0x55c79d6bf9a5]
postgres: startup recovering 000000010000000000000001(StartupProcessMain+0xe5)[0x55c79da51530]
postgres: startup recovering 000000010000000000000001(AuxiliaryProcessMain+0x194)[0x55c79da457d9]
postgres: startup recovering 000000010000000000000001(+0x551275)[0x55c79da50275]
postgres: startup recovering 000000010000000000000001(PostmasterMain+0x1331)[0x55c79da4b708]
postgres: startup recovering 000000010000000000000001(main+0x3c4)[0x55c79d90feec]
/lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x7f8f1465d24a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f8f1465d305]
postgres: startup recovering 000000010000000000000001(_start+0x2a)[0x55c79d5d0a3a]
PG:2024-05-17 15:52:17.847 GMT [1337334] LOG:  startup process (PID 1337339) was terminated by signal 6: Aborted

github-actions · 2024-05-17T16:34:32Z

3102 tests run: 2975 passed, 0 failed, 127 skipped (full report)

Code coverage* (full report)

functions: 31.3% (6413 of 20481 functions)
lines: 48.0% (49303 of 102647 lines)

* collected from Rust tests only

_{The comment gets automatically updated with the latest test results
1a3337c at 2024-05-22T12:58:52.541Z :recycle:}

andreasscherbaum · 2024-05-21T15:48:43Z

Blocked by minor version upgrade (should be solved by now).

hlinnaka · 2024-05-21T21:39:24Z

Ok, the test is now passing with this fix, and the fix from #7803. I verified that #7794 now also passes this test.

#7794 does more or less the same as this. The differences are:

This PR updates the last-written LSN also when ShutdownRequestPending. I think that's needed for correctness, although in practice it would be difficult to get into trouble because of it.
This PR includes more refactoring of neon_wallog_page(), mostly for readability. It also makes an effort to update the last-written LSN if a page with zero LSN is evicted in standby, and prints a LOG message in that.
I originally moved the XLogWaitForReplayOf() call into neon_get_request_lsns() in this PR, but I reverted that. I think it works either way, so let's avoid the code churn.

save-buffer

I think the fix overall looks good to me. But it leaves me questioning the design of locking pages and applying changes one at a time. I guess that's inside Postgres, not really under our control

I'd think that we'd just want these replays to happen atomically across all pages. In other words, what makes more sense to me is to lock all the pages first, then apply all changes, and then I think this sort of problem just isn't possible by construction? Definitely could be missing something though.

pgxn/neon/pagestore_smgr.c

hlinnaka · 2024-05-22T06:57:41Z

I think the fix overall looks good to me. But it leaves me questioning the design of locking pages and applying changes one at a time. I guess that's inside Postgres, not really under our control

I'd think that we'd just want these replays to happen atomically across all pages. In other words, what makes more sense to me is to lock all the pages first, then apply all changes, and then I think this sort of problem just isn't possible by construction? Definitely could be missing something though.

Postgres redo functions do two-phase locking of the pages, ie. they don't release any of the locks until all the pages have been locked and modified. So the changes do appear atomic to any other backend that's looking.

We are violating that in neon with the hack that we skip the WAL replay of pages that are not currently in the buffer cache. That's why we need the XLogWaitForReplayOf() call. That feels risky to me. I don't see any immediate bug with that, but I think it would be better if the redo process inserted an entry to the buffer manager for the page, without doing the I/O to read it in, and held the lock on the buffer as usual until the whole record has been replayed. That would require more invasive changes to the Postgres buffer manager code, however.

hlinnaka · 2024-05-22T06:59:38Z

I originally moved the XLogWaitForReplayOf() call into neon_get_request_lsns() in this PR, but I reverted that. I think it works either way, so let's avoid the code churn.

Sorry, I wrote that comment before I had actually pushed that change to this branch. Pushed it now.

The code was working correctly, but was incorrectly using Buffer for a 0-based index into the BufferDesc array.

Don't set last-written LSN of a page when the record is replayed, only when the page is evicted from cache. For comparison, we don't update the last-written LSN on every page modification on the primary either, only when the page is evicted. Do update the last-written LSN when the page update is skipped in WAL redo, however. In neon_get_request_lsns(), don't be surprised if the last-written LSN is equal to the record being replayed. Use the LSN of the record being replayed as the request LSN in that case. Add a long comment explaining how that can happen. In neon_wallog_page, update last-written LSN also when Shutdown has been requested. We might still fetch and evict pages for a while, after shutdown has been requested, so we better continue to do that correctly. Enable the check that we don't evict a page with zero LSN also in standby, but make it a LOG message instead of PANIC Fixes issue #7791

pgxn/neon/pagestore_smgr.c

Don't set last-written LSN of a page when the record is replayed, only when the page is evicted from cache. For comparison, we don't update the last-written LSN on every page modification on the primary either, only when the page is evicted. Do update the last-written LSN when the page update is skipped in WAL redo, however. In neon_get_request_lsns(), don't be surprised if the last-written LSN is equal to the record being replayed. Use the LSN of the record being replayed as the request LSN in that case. Add a long comment explaining how that can happen. In neon_wallog_page, update last-written LSN also when Shutdown has been requested. We might still fetch and evict pages for a while, after shutdown has been requested, so we better continue to do that correctly. Enable the check that we don't evict a page with zero LSN also in standby, but make it a LOG message instead of PANIC Fixes issue #7791

hlinnaka mentioned this pull request May 17, 2024

Set result.not_modified_since as minimum of last_written_lsn and replay lsn #7794

Closed

5 tasks

hlinnaka force-pushed the fix-7791 branch from 14af870 to 1a7d055 Compare May 17, 2024 22:44

knizhnik mentioned this pull request May 18, 2024

Bad request: invalid request with request LSN 0/B0AFDD0 and not_modified_since 0/B0AFE80 #7791

Closed

hlinnaka force-pushed the fix-7791 branch from 1a7d055 to f3602ca Compare May 21, 2024 16:39

hlinnaka changed the base branch from main to fix-zenith_test_evict May 21, 2024 16:40

hlinnaka changed the title ~~Repro test case for issue 7791~~ Fix and add test case for issue 7791 May 21, 2024

hlinnaka force-pushed the fix-7791 branch from d618f37 to 57ea49e Compare May 21, 2024 21:08

hlinnaka marked this pull request as ready for review May 21, 2024 21:39

hlinnaka requested review from a team as code owners May 21, 2024 21:39

hlinnaka requested review from save-buffer and arssher and removed request for a team May 21, 2024 21:39

save-buffer reviewed May 21, 2024

View reviewed changes

pgxn/neon/pagestore_smgr.c Outdated Show resolved Hide resolved

hlinnaka requested review from save-buffer and knizhnik May 22, 2024 07:00

knizhnik approved these changes May 22, 2024

View reviewed changes

Base automatically changed from fix-zenith_test_evict to main May 22, 2024 11:26

Fix confusion between 1-based Buffer and 0-based index (#7825)

0aa9549

The code was working correctly, but was incorrectly using Buffer for a 0-based index into the BufferDesc array.

hlinnaka force-pushed the fix-7791 branch from 3bc2715 to 445450c Compare May 22, 2024 11:36

hlinnaka enabled auto-merge (rebase) May 22, 2024 11:37

arssher approved these changes May 22, 2024

View reviewed changes

pgxn/neon/pagestore_smgr.c Outdated Show resolved Hide resolved

pgxn/neon/pagestore_smgr.c Outdated Show resolved Hide resolved

pgxn/neon/pagestore_smgr.c Outdated Show resolved Hide resolved

hlinnaka disabled auto-merge May 22, 2024 11:54

hlinnaka force-pushed the fix-7791 branch from 445450c to 1a3337c Compare May 22, 2024 11:57

hlinnaka merged commit 9217564 into main May 22, 2024
55 checks passed

hlinnaka deleted the fix-7791 branch May 22, 2024 15:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix and add test case for issue 7791 #7795

Fix and add test case for issue 7791 #7795

hlinnaka commented May 17, 2024 •

edited

github-actions bot commented May 17, 2024 •

edited

andreasscherbaum commented May 21, 2024

hlinnaka commented May 21, 2024

save-buffer left a comment •

edited

hlinnaka commented May 22, 2024

hlinnaka commented May 22, 2024 •

edited

Fix and add test case for issue 7791 #7795

Fix and add test case for issue 7791 #7795

Conversation

hlinnaka commented May 17, 2024 • edited

github-actions bot commented May 17, 2024 • edited

3102 tests run: 2975 passed, 0 failed, 127 skipped (full report)

Code coverage* (full report)

andreasscherbaum commented May 21, 2024

hlinnaka commented May 21, 2024

save-buffer left a comment • edited

Choose a reason for hiding this comment

hlinnaka commented May 22, 2024

hlinnaka commented May 22, 2024 • edited

hlinnaka commented May 17, 2024 •

edited

github-actions bot commented May 17, 2024 •

edited

save-buffer left a comment •

edited

hlinnaka commented May 22, 2024 •

edited