-
Notifications
You must be signed in to change notification settings - Fork 359
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(pageserver): duplicate layers can cause corruption #7799
Conversation
3180 tests run: 3041 passed, 0 failed, 139 skipped (full report)Code coverage* (full report)
* collected from Rust tests only The comment gets automatically updated with the latest test results
b48c21a at 2024-06-04T16:24:14.201Z :recycle: |
I once made it into a panic and then it did panic in prod 🤣 having generation numbers might help reduce the chances of such things happening but I don't feel confident enough to bring this panic back again unless we can assign each layer file a unique name. https://neondb.slack.com/archives/C04PSBP2SAF/p1689153285026749 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should keep:
- the second test case
- the noreplace
- aborting on L0 -> L1 compaction returning L0
I think we should remove:
- aborting on the situation of the first case
- noreplace removes the dangeriousness
I think we should add:
- removal of the tempfile for which the rename failed
- we should also remove the other files but that is non-trivial
I think we should find and reopen my issue where I described we should not make the tempfiles into Layer
before L0 -> L1 all are ready. But that intertwines with the "is layer duplicate" check, so it's not that trivial.
But the link+unlink or rename_noreplace is a good change.
I think an easy way to hit the problem shown by the second test case is disk full; we shouldn't abort on it.
What it actually does nowadays is test the RFC-27. It pre-dates implementation of RFC-27.
We already had/have that for deltas: neon/pageserver/src/tenant/storage_layer/delta_layer.rs Lines 656 to 665 in 6ff7429
But not for images. (Check callers of I think we can move that handling code from |
…en image & delta before this patch, only deltas would get cleaned up but images benefit similarly, so, share the code this could be a preliminary PR, but, finish() failure is subject of this PR, so, it fits in
Pushed a commit that does it. @koivunej I think this addresses all the concerns you raised. Regarding
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed review comments.
We don't have strict convention regarding test_
prefix in front of tests.
I personally prefer the test_ prefix due to how "show callers" works with rust-analyzer + VScode.
If you care deeply about it, let's have a docs/
PR or RFC, discuss it with the team, and unify the entire code base in one go. (I would of course oppose the removal of the test_
prefix).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to the rewrite test changes
… new where appropriate
I decided to revert the changes that made the code abort: b48c21a This PR achieves its goal without doing the aborts. Ready to merge IMO, let's see what the test suite run produces. |
fixes #7790 (duplicating most of the issue description here for posterity)
Background
From the time before always-authoritative
index_part.json
, we had to handle duplicate layers. See the RFC for an illustration of how duplicate layers could happen:neon/docs/rfcs/027-crash-consistent-layer-map-through-index-part.md
Lines 41 to 50 in a8e6d25
As of #5198 , we should not be exposed to that problem anymore.
Problem 1
We still have
However, the test in the test suite doesn't use the failpoint to induce a crash that could legitimately happen in production.
What is does instead is to return early with an
Ok()
, so that the code in Pageserver that handles duplicate layers (item 1) actually gets exercised.That "return early" would be a bug in the routine if it happened in production.
So, the tests in the test suite are tests for their own sake, but don't serve to actually regress-test any production behavior.
Problem 2
Further, if production code did (it nowawdays doesn't!) create a duplicate layer, the code in Pageserver that handles the condition (item 1 above) is too little and too late:
struct Layer
; that's good.Solution
RENAME_NOREPLACE
to prevent over-rename the file during.finish()
, bail with an error if it happens (Problem 2)struct Layer
at hand.duplicated L1 layer
log message anymore. Different code paths may still reach there, and an earlier version of this PR replaced that message with an abort(), but, that's an orthogonal change.duplicated L1
message. However, these greps were stale as the line should not have been logged ever since RFC-27 was implemented. Remove those greps..finish()
between image layers and delta layers (drive-by cleanup)image_layer_rewrite
was overwriting layer files in place. Fix the test.duplicated L1
message that should have been addressed by the changes that implement RFC-27. Remove those checks.Future Work
This PR adds a new failure scenario that was previously "papered over" by the overwriting of layers:
finish()
ed successfully.put_value()
..finish()
fails because A already exists on disk.The failure in step 5 is new with this PR, and it causes the compaction to get stuck.
Before, it would silently overwrite the file and "successfully" complete the second iteration.
The mitigation for this is to
/reset
the tenant.