Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix and add test case for issue 7791 #7795

Merged
merged 2 commits into from
May 22, 2024
Merged

Fix and add test case for issue 7791 #7795

merged 2 commits into from
May 22, 2024

Conversation

hlinnaka
Copy link
Contributor

@hlinnaka hlinnaka commented May 17, 2024

This is a test case to repro issue #7791, and a fix.

Sometimes this fails with this assertion, which seems consistent with my hypothesis at #7791 (comment):

PG:TRAP: failed Assert("last_written_lsn <= result.request_lsn"), File: "/home/heikki/git-sandbox/neon//pgxn/neon/pagestore_smgr.c", Line: 1533, PID: 1337078​
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(ExceptionalCondition+0xa2)[0x563ab48992d0]​
/home/heikki/git-sandbox/neon/pg_install/v16/lib/postgresql/neon.so(+0x1219a)[0x7f6ed17e119a]
​/home/heikki/git-sandbox/neon/pg_install/v16/lib/postgresql/neon.so(+0x13d77)[0x7f6ed17e2d77]
​postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(smgrread+0x36)[0x563ab46d13cb]​
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(+0x5cd69e)[0x563ab468769e]
​postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(ReadBufferExtended+0x12b)[0x563ab4686c7f]​
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(heapgetpage+0x392)[0x563ab41eb8d0]​
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(+0x132374)[0x563ab41ec374]
​postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(heap_getnextslot+0x49)[0x563ab41ecaac]​
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(+0x3daa15)[0x563ab4494a15]​
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(+0x3daadd)[0x563ab4494add]​
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(+0x39ac9d)[0x563ab4454c9d]​
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(ExecScan+0x73)[0x563ab4454d12]​
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(+0x3dab3c)[0x563ab4494b3c]​
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(+0x396b93)[0x563ab4450b93]​
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(+0x38a620)[0x563ab4444620]
​postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(+0x38d3d4)[0x563ab44473d4]​
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(standard_ExecutorRun+0x1d5)[0x563ab4444c93]​
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(ExecutorRun+0x59)[0x563ab4444abb]​
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(+0x621594)[0x563ab46db594]​
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(PortalRun+0x26c)[0x563ab46db246]​
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(+0x61c200)[0x563ab46d6200]​
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(PostgresMain+0x910)[0x563ab46d94ea]​
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(+0x550ded)[0x563ab460aded]​
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(+0x55073e)[0x563ab460a73e]​
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(+0x54cdd9)[0x563ab4606dd9]​
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(PostmasterMain+0x1377)[0x563ab460674e]​
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(main+0x3c4)[0x563ab44caeec]​
/lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x7f6ed105d24a]​
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f6ed105d305]​
postgres: cloud_admin postgres 127.0.0.1(43964) SELECT(_start+0x2a)[0x563ab418ba3a]
PG:2024-05-17 15:51:46.872 GMT [1337066] LOG:  server process (PID 1337078) was terminated by signal 6: Aborted
PG:2024-05-17 15:51:46.872 GMT [1337066] DETAIL:  Failed process was running: SELECT counter FROM test

But other times this hits the same assertion in the startup process:

PG:TRAP: failed Assert("last_written_lsn <= result.request_lsn"), File: "/home/heikki/git-sandbox/neon//pgxn/neon/pagestore_smgr.c", Line: 1533, PID: 1337339​
postgres: startup recovering 000000010000000000000001(ExceptionalCondition+0xa2)[0x55c79dcde2d0]​
/home/heikki/git-sandbox/neon/pg_install/v16/lib/postgresql/neon.so(+0x1219a)[0x7f8f1517e19a]​
/home/heikki/git-sandbox/neon/pg_install/v16/lib/postgresql/neon.so(+0x13d77)[0x7f8f1517fd77]​
postgres: startup recovering 000000010000000000000001(smgrread+0x36)[0x55c79db163cb]​
postgres: startup recovering 000000010000000000000001(+0x5cd69e)[0x55c79dacc69e]​
postgres: startup recovering 000000010000000000000001(ReadBufferWithoutRelcache+0x85)[0x55c79dacbd7e]​
postgres: startup recovering 000000010000000000000001(XLogReadBufferExtended+0xf7)[0x55c79d6d8dcc]​
postgres: startup recovering 000000010000000000000001(XLogReadBufferForRedoExtended+0x3a0)[0x55c79d6d8c5b]​
postgres: startup recovering 000000010000000000000001(XLogReadBufferForRedo+0x36)[0x55c79d6d8882]​
/home/heikki/git-sandbox/neon/pg_install/v16/lib/postgresql/neon_rmgr.so(+0x354c)[0x7f8f1538354c]​
/home/heikki/git-sandbox/neon/pg_install/v16/lib/postgresql/neon_rmgr.so(+0x27e9)[0x7f8f153827e9]​
postgres: startup recovering 000000010000000000000001(+0x1d4160)[0x55c79d6d3160]​
postgres: startup recovering 000000010000000000000001(PerformWalRecovery+0x3de)[0x55c79d6d2cd1]​
postgres: startup recovering 000000010000000000000001(StartupXLOG+0x934)[0x55c79d6bf9a5]​
postgres: startup recovering 000000010000000000000001(StartupProcessMain+0xe5)[0x55c79da51530]​
postgres: startup recovering 000000010000000000000001(AuxiliaryProcessMain+0x194)[0x55c79da457d9]​
postgres: startup recovering 000000010000000000000001(+0x551275)[0x55c79da50275]​
postgres: startup recovering 000000010000000000000001(PostmasterMain+0x1331)[0x55c79da4b708]​
postgres: startup recovering 000000010000000000000001(main+0x3c4)[0x55c79d90feec]​
/lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x7f8f1465d24a]​
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f8f1465d305]​
postgres: startup recovering 000000010000000000000001(_start+0x2a)[0x55c79d5d0a3a]
PG:2024-05-17 15:52:17.847 GMT [1337334] LOG:  startup process (PID 1337339) was terminated by signal 6: Aborted

Copy link

github-actions bot commented May 17, 2024

3102 tests run: 2975 passed, 0 failed, 127 skipped (full report)


Code coverage* (full report)

  • functions: 31.3% (6413 of 20481 functions)
  • lines: 48.0% (49303 of 102647 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
1a3337c at 2024-05-22T12:58:52.541Z :recycle:

@andreasscherbaum
Copy link
Contributor

Blocked by minor version upgrade (should be solved by now).

@hlinnaka hlinnaka changed the base branch from main to fix-zenith_test_evict May 21, 2024 16:40
@hlinnaka hlinnaka changed the title Repro test case for issue 7791 Fix and add test case for issue 7791 May 21, 2024
@hlinnaka
Copy link
Contributor Author

Ok, the test is now passing with this fix, and the fix from #7803. I verified that #7794 now also passes this test.

#7794 does more or less the same as this. The differences are:

  • This PR updates the last-written LSN also when ShutdownRequestPending. I think that's needed for correctness, although in practice it would be difficult to get into trouble because of it.

  • This PR includes more refactoring of neon_wallog_page(), mostly for readability. It also makes an effort to update the last-written LSN if a page with zero LSN is evicted in standby, and prints a LOG message in that.

  • I originally moved the XLogWaitForReplayOf() call into neon_get_request_lsns() in this PR, but I reverted that. I think it works either way, so let's avoid the code churn.

@hlinnaka hlinnaka marked this pull request as ready for review May 21, 2024 21:39
@hlinnaka hlinnaka requested review from a team as code owners May 21, 2024 21:39
@hlinnaka hlinnaka requested review from save-buffer and arssher and removed request for a team May 21, 2024 21:39
Copy link
Contributor

@save-buffer save-buffer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the fix overall looks good to me. But it leaves me questioning the design of locking pages and applying changes one at a time. I guess that's inside Postgres, not really under our control

I'd think that we'd just want these replays to happen atomically across all pages. In other words, what makes more sense to me is to lock all the pages first, then apply all changes, and then I think this sort of problem just isn't possible by construction? Definitely could be missing something though.

pgxn/neon/pagestore_smgr.c Outdated Show resolved Hide resolved
@hlinnaka
Copy link
Contributor Author

I think the fix overall looks good to me. But it leaves me questioning the design of locking pages and applying changes one at a time. I guess that's inside Postgres, not really under our control

I'd think that we'd just want these replays to happen atomically across all pages. In other words, what makes more sense to me is to lock all the pages first, then apply all changes, and then I think this sort of problem just isn't possible by construction? Definitely could be missing something though.

Postgres redo functions do two-phase locking of the pages, ie. they don't release any of the locks until all the pages have been locked and modified. So the changes do appear atomic to any other backend that's looking.

We are violating that in neon with the hack that we skip the WAL replay of pages that are not currently in the buffer cache. That's why we need the XLogWaitForReplayOf() call. That feels risky to me. I don't see any immediate bug with that, but I think it would be better if the redo process inserted an entry to the buffer manager for the page, without doing the I/O to read it in, and held the lock on the buffer as usual until the whole record has been replayed. That would require more invasive changes to the Postgres buffer manager code, however.

@hlinnaka
Copy link
Contributor Author

hlinnaka commented May 22, 2024

I originally moved the XLogWaitForReplayOf() call into neon_get_request_lsns() in this PR, but I reverted that. I think it works either way, so let's avoid the code churn.

Sorry, I wrote that comment before I had actually pushed that change to this branch. Pushed it now.

Base automatically changed from fix-zenith_test_evict to main May 22, 2024 11:26
The code was working correctly, but was incorrectly using Buffer for a
0-based index into the BufferDesc array.
hlinnaka added a commit that referenced this pull request May 22, 2024
Don't set last-written LSN of a page when the record is replayed, only
when the page is evicted from cache. For comparison, we don't update
the last-written LSN on every page modification on the primary either,
only when the page is evicted. Do update the last-written LSN when the
page update is skipped in WAL redo, however.

In neon_get_request_lsns(), don't be surprised if the last-written LSN
is equal to the record being replayed. Use the LSN of the record being
replayed as the request LSN in that case. Add a long comment
explaining how that can happen.

In neon_wallog_page, update last-written LSN also when Shutdown has
been requested. We might still fetch and evict pages for a while,
after shutdown has been requested, so we better continue to do that
correctly.

Enable the check that we don't evict a page with zero LSN also in
standby, but make it a LOG message instead of PANIC

Fixes issue #7791
@hlinnaka hlinnaka enabled auto-merge (rebase) May 22, 2024 11:37
pgxn/neon/pagestore_smgr.c Outdated Show resolved Hide resolved
pgxn/neon/pagestore_smgr.c Outdated Show resolved Hide resolved
pgxn/neon/pagestore_smgr.c Outdated Show resolved Hide resolved
@hlinnaka hlinnaka disabled auto-merge May 22, 2024 11:54
Don't set last-written LSN of a page when the record is replayed, only
when the page is evicted from cache. For comparison, we don't update
the last-written LSN on every page modification on the primary either,
only when the page is evicted. Do update the last-written LSN when the
page update is skipped in WAL redo, however.

In neon_get_request_lsns(), don't be surprised if the last-written LSN
is equal to the record being replayed. Use the LSN of the record being
replayed as the request LSN in that case. Add a long comment
explaining how that can happen.

In neon_wallog_page, update last-written LSN also when Shutdown has
been requested. We might still fetch and evict pages for a while,
after shutdown has been requested, so we better continue to do that
correctly.

Enable the check that we don't evict a page with zero LSN also in
standby, but make it a LOG message instead of PANIC

Fixes issue #7791
@hlinnaka hlinnaka merged commit 9217564 into main May 22, 2024
55 checks passed
@hlinnaka hlinnaka deleted the fix-7791 branch May 22, 2024 15:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants