
fix(pageserver): duplicate layers can cause corruption #7799

Merged
merged 15 commits into main on Jun 4, 2024

Conversation

Contributor

@problame problame commented May 17, 2024

fixes #7790 (duplicating most of the issue description here for posterity)

Background

From the time before always-authoritative index_part.json, we had to handle duplicate layers. See the RFC for an illustration of how duplicate layers could happen:

The implications of the above are primarily problematic for compaction.
Specifically, the part of it that compacts L0 layers into L1 layers.
Remember that compaction takes a set of L0 layers and reshuffles the delta records in them into L1 layer files.
Once the L1 layer files are written to disk, it atomically removes the L0 layers from the layer map and adds the L1 layers to the layer map.
It then deletes the L0 layers locally and schedules an upload of the L1 layers and an updated index part.
If we crash before deleting L0s, but after writing out L1s, the next compaction after restart will re-digest the L0s and produce new L1s.
This means the compaction after restart will **overwrite** the previously written L1s.
Currently we also schedule an S3 upload of the overwritten L1.
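
To make the ordering concrete, here is a minimal, self-contained Rust sketch of the steps described above. All names (LayerFile, LayerMap, compact_level0) are illustrative stand-ins, not the actual pageserver types:

```rust
use std::collections::BTreeSet;

#[derive(Clone, PartialEq, Eq, PartialOrd, Ord)]
struct LayerFile(String);

#[derive(Default)]
struct LayerMap {
    layers: BTreeSet<LayerFile>,
}

impl LayerMap {
    /// Atomically (under one lock in the real code) swap the L0s for the new L1s.
    fn replace(&mut self, l0s: &[LayerFile], l1s: &[LayerFile]) {
        for l0 in l0s {
            self.layers.remove(l0);
        }
        for l1 in l1s {
            self.layers.insert(l1.clone());
        }
    }
}

fn compact_level0(map: &mut LayerMap, l0s: Vec<LayerFile>) {
    // 1. Reshuffle the delta records of the L0 layers into new L1 layer files
    //    on disk (elided; represented here by fresh names).
    let l1s = vec![LayerFile("l1-000".into()), LayerFile("l1-001".into())];

    // 2. Atomically update the layer map: remove the L0s, add the L1s.
    map.replace(&l0s, &l1s);

    // A crash HERE leaves the L1 files on disk while the L0 files still exist,
    // so the next compaction after restart re-digests the same L0s and
    // overwrites the previously written L1s.

    // 3. Delete the L0 layer files locally (elided).
    // 4. Schedule upload of the L1 files and an updated index_part.json (elided).
}
```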

As of #5198, we should not be exposed to that problem anymore.

Problem 1

We still have

  1. code in Pageserver that handles duplicate layers
  2. a test in the test suite that demonstrates the problem using a failpoint

However, the test in the test suite doesn't use the failpoint to induce a crash that could legitimately happen in production.
What it does instead is return early with an Ok(), so that the code in Pageserver that handles duplicate layers (item 1) actually gets exercised.

That "return early" would be a bug in the routine if it happened in production.
So, the tests in the test suite exist for their own sake, but don't actually regression-test any production behavior.

Problem 2

Further, if production code did (it nowadays doesn't!) create a duplicate layer, the code in Pageserver that handles the condition (item 1 above) is too little, too late:

  • the code handles it by discarding the newer struct Layer; that's good.
  • however, on disk, we have already overwritten the old with the new layer file
  • the fact that we do it atomically doesn't matter because ...
  • if the new layer file is not bit-identical, then we have a cache coherency problem
    • PS PageCache block cache: caches the old bit pattern
    • blob_io offsets stored in variables are based on the pre-overwrite bit pattern / offsets
      • => reading from the new file at these offsets might yield different data than before (see the illustration after this list)
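
To illustrate that last point, a minimal, self-contained example (not pageserver code; the file name and layout are made up) of how an offset recorded against the old file contents reads different bytes once the file has been overwritten in place:

```rust
use std::fs;
use std::io::{Read, Seek, SeekFrom};

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("layer-demo");

    // "Old" layer file: the blob "foo" starts at offset 4.
    fs::write(&path, b"hdr:foo")?;
    let recorded_offset = 4u64; // cached in memory, e.g. by a blob_io reader

    // The file is overwritten in place with a differently laid out "new" layer.
    fs::write(&path, b"header:foo")?;

    // Reading through the stale offset now yields different bytes than before.
    let mut f = fs::File::open(&path)?;
    f.seek(SeekFrom::Start(recorded_offset))?;
    let mut buf = [0u8; 3];
    f.read_exact(&mut buf)?;
    assert_ne!(&buf, b"foo"); // we read "er:" instead of "foo"
    Ok(())
}
```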

Solution

  • Remove the test suite code pertaining to Problem 1
  • Move & rename the test suite code that actually tests RFC-27 crash-consistent layer maps.
  • Remove the Pageserver code that handles duplicate layers too late (Problem 1)
  • Use RENAME_NOREPLACE to prevent overwriting the layer file during the rename in .finish(); bail with an error if a file with that name already exists (Problem 2). A sketch follows after this list.
    • This bailing prevents the caller from even trying to insert into the layer map, as they never get a struct Layer at hand.
    • This means compaction won't reach the duplicated L1 layer log message anymore. Different code paths may still reach it; an earlier version of this PR replaced that message with an abort(), but that's an orthogonal change.
      • Some tests grepped for that duplicated L1 message. However, these greps were stale, as the line should not have been logged ever since RFC-27 was implemented. Remove those greps.
  • Share the logic to clean up after a failed .finish() between image layers and delta layers (drive-by cleanup)
    • This exposed that the test image_layer_rewrite was overwriting layer files in place. Fix the test.
  • Some tests had stale checks/greps for the duplicated L1 message that should have been addressed by the changes that implement RFC-27. Remove those checks.
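
A minimal sketch of the RENAME_NOREPLACE idea, assuming the nix crate's renameat2 wrapper on Linux with an Option<RawFd>-style signature (the exact signature varies between nix versions); the function name and error handling are illustrative, not the merged pageserver code:

```rust
use std::path::Path;

use anyhow::{bail, Context};
use nix::fcntl::{renameat2, RenameFlags};

/// Move a freshly written temp layer file into its final place without ever
/// clobbering an existing file (hypothetical helper, Linux-only).
fn rename_noreplace(temp_path: &Path, final_path: &Path) -> anyhow::Result<()> {
    match renameat2(
        None,
        temp_path,
        None,
        final_path,
        RenameFlags::RENAME_NOREPLACE,
    ) {
        Ok(()) => Ok(()),
        // EEXIST means a layer file with this name is already on disk:
        // refuse to overwrite it and surface an error, so the caller never
        // gets a struct Layer for a duplicate.
        Err(nix::errno::Errno::EEXIST) => bail!(
            "layer file {} already exists, refusing to overwrite",
            final_path.display()
        ),
        Err(e) => Err(e).with_context(|| {
            format!(
                "renameat2(RENAME_NOREPLACE) {} -> {}",
                temp_path.display(),
                final_path.display()
            )
        }),
    }
}
```

An alternative with the same effect, mentioned in the review below, is hard link + unlink.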

Future Work

This PR adds a new failure scenario that was previously "papered over" by the overwriting of layers:

  1. Start a compaction that will produce 3 layers: A, B, C
  2. Layer A is finish()ed successfully.
  3. Layer B fails mid-way at some put_value().
  4. Compaction bails out, sleeps 20s.
  5. Some disk space gets freed in the meantime.
  6. Compaction wakes from sleep, another iteration starts, it attempts to write Layer A again. But the .finish() fails because A already exists on disk.

The failure in step 6 is new with this PR, and it causes the compaction to get stuck.
Before, it would silently overwrite the file and "successfully" complete the second iteration.

The mitigation for this is to /reset the tenant.

@problame problame requested a review from a team as a code owner May 17, 2024 16:33
@problame problame requested a review from koivunej May 17, 2024 16:33
@problame problame self-assigned this May 17, 2024

github-actions bot commented May 17, 2024

3180 tests run: 3041 passed, 0 failed, 139 skipped (full report)


Flaky tests (1)

Postgres 14

  • test_tenant_delete_smoke: release

Code coverage* (full report)

  • functions: 31.5% (6580 of 20893 functions)
  • lines: 48.5% (50979 of 105136 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
b48c21a at 2024-06-04T16:24:14.201Z :recycle:

Member

skyzh commented May 17, 2024

I once made it into a panic and then it did panic in prod 🤣. Having generation numbers might help reduce the chances of such things happening, but I don't feel confident enough to bring this panic back again unless we can assign each layer file a unique name. https://neondb.slack.com/archives/C04PSBP2SAF/p1689153285026749

@koivunej koivunej self-assigned this May 20, 2024
Contributor

@koivunej koivunej left a comment


I think we should keep:

  • the second test case
  • the noreplace
  • aborting on L0 -> L1 compaction returning L0

I think we should remove:

  • aborting on the situation of the first case
    • noreplace removes the danger

I think we should add:

  • removal of the tempfile for which the rename failed
    • we should also remove the other files but that is non-trivial

I think we should find and reopen my issue where I described that we should not turn the tempfiles into Layer before all of the L0 -> L1 results are ready. But that intertwines with the "is layer duplicate" check, so it's not that trivial.

But the link+unlink or rename_noreplace is a good change.

I think an easy way to hit the problem shown by the second test case is disk full; we shouldn't abort on it.

What it actually does nowadays is test RFC-27.

It pre-dates the implementation of RFC-27.
@problame
Contributor Author

I think we should add:
removal of the tempfile for which the rename failed
we should also remove the other files but that is non-trivial

We already had/have that for deltas:

```rust
let result = inner.finish(key_end, timeline, ctx).await;
// The delta layer files can sometimes be really large. Clean them up.
if result.is_err() {
    tracing::warn!(
        "Cleaning up temporary delta file {temp_path} after error during writing"
    );
    if let Err(e) = std::fs::remove_file(&temp_path) {
        tracing::warn!("Error cleaning up temporary delta layer file {temp_path}: {e:?}")
    }
}
```

But not for images.

(Check callers of finish_creating()).

I think we can move that handling code from delta_layer.rs to layer.rs so that both delta and image benefit. WDYT?
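
For illustration, a minimal sketch of what such a shared cleanup helper in layer.rs could look like; the function name and signature are assumptions, not the merged code:

```rust
use std::path::Path;

/// Remove the temporary layer file if writing it failed; shared between the
/// delta and image layer writers (hypothetical signature).
fn cleanup_temp_file_on_error<T, E>(temp_path: &Path, result: Result<T, E>) -> Result<T, E> {
    if result.is_err() {
        tracing::warn!(
            "Cleaning up temporary layer file {} after error during writing",
            temp_path.display()
        );
        if let Err(e) = std::fs::remove_file(temp_path) {
            tracing::warn!(
                "Error cleaning up temporary layer file {}: {e:?}",
                temp_path.display()
            );
        }
    }
    result
}
```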

…en image & delta

before this patch, only deltas would get cleaned up
but images benefit similarly, so, share the code

this could be a preliminary PR, but, finish() failure is subject of this
PR, so, it fits in
@problame
Contributor Author

Pushed a commit that does it.

@koivunej I think this addresses all the concerns you raised.

Regarding abort() yes or no, I don't feel too strongly about it. In general, duplicates in the layer map are probably fine, i.e., lookup would do the right thing if they're present.
It's just that right now,

  1. it can't be modelled safely because, until recently with the -v1-... suffix, there was a bijection descriptor <==> file name, so the file system would not allow the duplicates and the layer map was used as a check
  2. legacy compaction isn't supposed to produce them, and hence
  3. we probably don't have good coverage of layer map with duplicates / overlapping keys.

Contributor Author

@problame problame left a comment


Addressed review comments.
We don't have a strict convention regarding the test_ prefix in front of tests.
I personally prefer the test_ prefix because of how "show callers" works with rust-analyzer + VS Code.
If you care deeply about it, let's have a docs/ PR or RFC, discuss it with the team, and unify the entire code base in one go. (I would of course oppose the removal of the test_ prefix.)

@problame problame requested a review from koivunej May 28, 2024 08:57
Contributor

@jcsp jcsp left a comment


+1 to the rewrite test changes

@problame problame enabled auto-merge (squash) May 28, 2024 11:35
@koivunej koivunej removed their assignment Jun 3, 2024
@problame problame changed the title fix(pageserver): abort on duplicate layers, before doing damage fix(pageserver): duplicate layers can cause corruption Jun 4, 2024
@problame
Contributor Author

problame commented Jun 4, 2024

I decided to revert the changes that made the code abort: b48c21a

This PR achieves its goal without doing the aborts.

Ready to merge IMO, let's see what the test suite run produces.

@problame problame merged commit 17116f2 into main Jun 4, 2024
56 of 57 checks passed
@problame problame deleted the problame/abort-on-duplicate-layers branch June 4, 2024 16:16
Development

Successfully merging this pull request may close these issues.

re-assess severity of duplicate layers: nowadays it cannot happen & we should panic/abort() early if they do