-
Notifications
You must be signed in to change notification settings - Fork 359
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proxy release 2024-05-16 #7776
Proxy release 2024-05-16 #7776
Conversation
Improves the tiered compaction tests: * Adds a new test that is a simpler version of the ignored `test_many_updates_for_single_key` test. * Reduces the amount of data that `test_many_updates_for_single_key` processes to make it execute more quickly. * Adds logging support.
## Problem The main point of this PR is to get rid of `python-jose` and `ecdsa` packages as transitive dependencies through `moto`. They have a bunch of open vulnerabilities[1][2][3] (which don't affect us directly), but it's nice not to have them at all. - [1] GHSA-wj6h-64fc-37mp - [2] GHSA-6c5p-j8vq-pqhj - [3] GHSA-cjwg-qfpm-7377 ## Summary of changes - Update `moto` from 4.1.2 to 5.0.6 - Update code to accommodate breaking changes in `moto_server`
…ServerConf` (#7642) Before this PR, `neon_local` would store a copy of a subset of the initial `pageserver.toml` in its `.neon/config`, e.g, `listen_pg_addr`. That copy is represented as `struct PageServerConf`. This copy was used to inform e.g., `neon_local endpoint` and other commands that depend on Pageserver about which port to connect to. The problem with that scheme is that the duplicated information in `.neon/config` can get stale if `pageserver.toml` is changed. This PR fixes that by eliminating populating `struct PageServerConf` from the `pageserver.toml`s. The `[[pageservers]]` TOML table in the `.neon/config` is obsolete. As of this PR, `neon_local` will fail to start and print an error informing about this change. Code-level changes: - Remove the `--pg-version` flag, it was only used for some checks during `neon_local init` - Remove the warn-but-continue behavior for when auth key creation fails but auth keys are not required. It's just complexity that is unjustified for a tool like `neon_local`. - Introduce a type-system-level distinction between the runtime state and the two (!) toml formats that are almost the same but not quite. - runtime state: `struct PageServerConf`, now without `serde` derives - toml format 1: the state in `.neon/config` => `struct OnDiskState` - toml format 2: the `neon_local init --config TMPFILE` that, unlike `struct OnDiskState`, allows specifying `pageservers` - Remove `[[pageservers]]` from the `struct OnDiskState` and load the data from the individual `pageserver.toml`s instead.
Fixes flaky test `test_gc_of_remote_layers`, which was failing because of the `Nothing to GC` pageserver log. I looked into the fails, it seems that backround `gc_loop` sometimes started GC for initial tenant, which wasn't configured to disable GC. The fix is to not create initial tenant with enabled gc at all. Fixes #7538
## Problem We currently have no way to see what the current LSN of a compute its, and in case of read replicas, we don't know what the difference in LSNs is. ## Summary of changes Adds these metrics
The test utils should only be used during tests. Users should not be able to create this extension on their own. Signed-off-by: Alex Chi Z <chi@neon.tech>
- Rename "filename" types which no longer map directly to a filename (LayerFileName -> LayerName) - Add a -v1- part to local layer paths to smooth the path to future updates (we anticipate a -v2- that uses checksums later) - Rename methods that refer to the string-ized version of a LayerName to no longer be called "filename" - Refactor reconcile() function to use a LocalLayerFileMetadata type that includes the local path, rather than carrying local path separately in a tuple and unwrap()'ing it later.
## Problem See #6714, #6967 ## Summary of changes Completely ignore page header when comparing VM pages. ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
For aux file keys (v1 or v2) the vectored read path does not return an error when they're missing. Instead they are omitted from the resulting btree (this is a requirement, not a bug). Skip updating the metric in these cases to avoid infinite results.
…(updates AWS SDKs) (#7664) Before this PR, using the AWS SDK profile feature for running against minio didn't work because * our SDK versions were too old and didn't include awslabs/aws-sdk-rust#1060 and * we didn't massage the s3 client config builder correctly. This PR * udpates all the AWS SDKs we use to, respectively, the latest version I could find on crates.io (Is there a better process?) * changes the way remote_storage constructs the S3 client, and * documents how to run the test suite against real S3 & local minio. Regarding the changes to `remote_storage`: if one reads the SDK docs, it is clear that the recommended way is to use `aws_config::from_env`, then customize. What we were doing instead is to use the `aws_sdk_s3` builder directly. To get the `local-minio` in the added docs working, I needed to update both the SDKs and make the changes to the `remote_storage`. See the commit history in this PR for details. Refs: * byproduct: smithy-lang/smithy-rs#3633 * follow-up on deprecation: #7665 * follow-up for scrubber S3 setup: #7667
## Problem Various performance test cases were destabilized by the recent upgrade of `reqwest`, because it changes an error string. Examples: - https://neon-github-public-dev.s3.amazonaws.com/reports/main/9005532594/index.html#testresult/3f984e471a9029a5/ - https://neon-github-public-dev.s3.amazonaws.com/reports/main/9005532594/index.html#testresult/8bd0f095fe0402b7/ The performance tests suffer from this more than most tests, because they churn enough data that the pageserver is still trying to contact the storage controller while it is shut down at the end of tests. ## Summary of changes s/Connection refused/error sending request/
…cheduling optimization (#7673) ## Problem Storage controller was using a zero layer count in SecondaryProgress as a proxy for "not initialized". However, in tenants with zero timelines (a legitimate state), the layer count remains zero forever. This caused #7583 to destabilize the storage controller scale test, which creates lots of tenants, some of which don't get any timelines. ## Summary of changes - Use a None mtime instead of zero layer count to determine if a SecondaryProgress should be ignored. - Adjust the test to use a shorter heatmap upload period to let it proceed faster while waiting for scheduling optimizations to complete.
This PR does two things: First, it fixes a bug with tiered compaction's k-merge implementation. It ignored the lsn of a key during ordering, so multiple updates of the same key could be read in arbitrary order, say from different layers. For example there is layers `[(a, 2),(b, 3)]` and `[(a, 1),(c, 2)]` in the heap, they might return `(a,2)` and `(a,1)`. Ultimately, this change wasn't enough to fix the ordering issues in #7296, in other words there is likely still bugs in the k-merge. So as the second thing, we switch away from the k-merge to an in-memory based one, similar to #4839, but leave the code around to be improved and maybe switched to later on. Part of #7296
…ts/csharp/npgsql (#7680)
## Problem #7637 breaks forward compat test. On commit ea531d4. https://neon-github-public-dev.s3.amazonaws.com/reports/main/8988324349/index.html ``` test_create_snapshot 2024-05-07T16:03:11.331883Z INFO version: git-env:ea531d448eb65c4f58abb9ef7d8cd461952f7c5f failpoints: true, features: ["testing"] launch_timestamp: 2024-05-07 16:03:11.316131763 UTC build_tag: build_tag-env:5159 test_forward_compatibility 2024-05-07T16:07:02.310769Z INFO version: git-env:ea531d448eb65c4f58abb9ef7d8cd461952f7c5f failpoints: true, features: ["testing"] launch_timestamp: 2024-05-07 16:07:02.294676183 UTC build_tag: build_tag-env:5159 ``` The forward compatibility test is actually using the same tag as the current build. The commit before that, https://neon-github-public-dev.s3.amazonaws.com/reports/main/8988126011/index.html ``` test_create_snapshot 2024-05-07T15:47:21.900796Z INFO version: git-env:2dbd1c1ed5cd0458933e8ffd40a9c0a5f4d610b8 failpoints: true, features: ["testing"] launch_timestamp: 2024-05-07 15:47:21.882784185 UTC build_tag: build_tag-env:5158 test_forward_compatibility 2024-05-07T15:50:48.828733Z INFO version: git-env:c4d7d5982553d2cf66634d1fbf85d95ef44a6524 failpoints: true, features: ["testing"] launch_timestamp: 2024-05-07 15:50:48.816635176 UTC build_tag: build_tag-env:release-5434 ``` This pull request patches the bin path so that the new neon_local will use the old binary. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem There is no global per-ep rate limiter in proxy. ## Summary of changes * Return global per-ep rate limiter back. * Rename weak compute rate limiter (the cli flags were not used anywhere, so it's safe to rename).
## Problem This caused a variation of the stats bug fixed by #7662. That PR also fixed this case, but we still shouldn't make redundant get calls. ## Summary of changes - Only call get in the create image layers loop at the end of a range if some keys have been accumulated
## Problem Move from aws based arm64 runners to bare-metal based ## Summary of changes Changes in GitHub action workflows where `runs-on: arm64` used. More parallelism added, build time for `neon with extra platform builds` workflow reduced from 45m to 25m
We didn't check permission in `"/v1/failpoints"` endpoint, it means that everyone with per-tenant token could modify the failpoints. This commit fixes that.
We had accidentally left two endpoints for `tenant`: `/synthetic_size` and `/size`. Size had the more extensive description but has returned 404 since renaming. Remove the `/size` in favor of the working one and describe the `text/html` output.
in addition to layer names, expand the input vocabulary to recognize lines in the form of: ${kind}:${lsn} where: - kind in `gc_cutoff` or `branch` - lsn is accepted in Lsn display format (x/y) or hex (as used in layer names) gc_cutoff and branch have different colors.
While switching to use nextest with the repository in f28bdb6, we had not noticed that it doesn't yet support running doctests. Run the doc tests before other tests.
The old test based on the immutable `target_file_size` that was a parameter to the function. It makes no sense to go further once `current_level_target_height` has reached `u64::MAX`, as lsn's are u64 typed. In practice, we should only run into this if there is a bug, as the practical lsn range usually ends much earlier. Testing on `target_file_size` makes less sense, it basically implements an invocation mode that turns off the looping and only runs one iteration of it. @hlinnaka agrees that `current_level_target_height` is better here. Part of #7554
Split checkpoint_stats into two separate metrics: checkpoints_req and checkpoints_timed Fixes commit 21e1a49 --------- Co-authored-by: Peter Bendel <peterbendel@neon.tech>
The first implementation #7456 did not include `index_part.json` changes in an attempt to keep amount of changes down. Tracks the historic reparentings and earlier detach in `index_part.json`. - `index_part.json` receives a new field `lineage: Lineage` - `Lineage` is queried through RemoteTimelineClient during basebackup, creating `PREV LSN: none` for the invalid prev record lsn just as it would had been created for a newly created timeline - as `struct IndexPart` grew, it is now boxed in places Cc: #6994
In timeline detach ancestor tests there is no way to really be sure that there were no subtle off-by one bugs. One such bug is demoed and reverted. Add verifying fullbackup is equal before and after detaching ancestor. Fullbackup is expected to be equal apart from `zenith.signal`, which is known to be good because endpoint can be started without the detached branch receiving writes.
The test has been flaky since 2024-04-11 for unknown reason, and the logging was off. Fix the logging and raise the limit a bit. The problematic ratio reproduces with pg14 and added sleep (not included) but not on pg15. The new ratio abs diff limit works for all inspected examples. Cc: #7536
Per [evidence] the timeline ancestor detach tests can panic while shutting down on vectored get validation. Allow the error because tenant is restarted twice in the test. [evidence]: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-7708/9058185709/index.html#suites/a1c2be32556270764423c495fad75d47/d444f7e5c0a18ce9
## Problem `report-benchmarks-failures` job is triggered for any failure in the CI pipeline, but we need it to be triggered only for failed `benchmarks` job ## Summary of changes - replace `failure()` with `needs.benchmarks.result == 'failure'` in the condition
A test for #7684. This pull request checks if the pageserver version we specified is the one actually running by comparing the git hash in forward compatibility tests. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem Shards with number >0 could hang waiting for `await_initial_logical_size`, as we don't calculate logical size on these shards. This causes them to hold onto semaphore units and starve other tenants out from proceeding with warmup activation. That doesn't hurt availability (we still have on-demand activation), but it does mean that some background tasks like consumption metrics would omit some tenants. ## Summary of changes - Skip waiting for logical size calculation on shards >0 - Upgrade unexpected code paths to use debug_assert!(), which acts as an implicit regression test for this issue, and make the info() one into a warn()
The [2.31.0 release](https://github.com/rui314/mold/releases/tag/v2.31.0) of mold includes a 10% speed improvement for binaries with a lot of debug info. As we have such, it might be useful to update mold to the latest release. The jump is from 2.4.0 to 2.31.0, but it's not been many releases in between as the version number was raised by the mold maintainers to 2.30.0 after 2.4.1 [to avoid confusion for some tools](https://github.com/rui314/mold/releases/tag/v2.30.0).
…#7747) This PR allows setting the `PAGESERVER_DEFAULT_TENANT_CONFIG_COMPACTION_ALGORITHM` env var to override the `tenant_config.compaction_algorithm` field in the initial `pageserver.toml` for all tests. I tested manually that this works by halting a test using pdb and inspecting the `effective_config` in the tenant status managment API. If the env var is set, the tests are parametrized by the `kind` tag field, allowing to do a matrix build in CI and let Allure summarize everything in a nice report. If the env var is not set, the tests are not parametrized. So, merging this PR doesn't cause problems for flaky test detection. In fact, it doesn't cause any runtime change if the env var is not set. There are some tests in the test suite that set used to override the entire tenant_config using `NeonEnvBuilder.pageserver_config_override`. Since config overrides are merged non-recursively, such overrides that don't specify `kind = ` cause a fallback to pageserver's built-in `DEFAULT_COMPACTION_ALGORITHM`. Such cases can be found using ``` ["']tenant_config\s*[='"] ``` We'll deal with these tests in a future PR. closes #7555
…#7740) ## Problem There are not enough arm runners and jobs in `neon-extra-builds` workflow take about the same amount of time on a small-arm runner as on large-arm. ## Summary of changes - Switch `neon-extra-builds` workflow from `large-arm64` to `small-arm64` runners
…file refs (#7752) ## Problem This is historical baggage from when the pageserver could be run with local disk only: we had a bunch of places where we had to treat remote storage as optional. Closes: #6890 ## Changes - Remove Option<> around remote storage (in #7722 we made remote storage clearly mandatory) - Remove code for deleting old metadata files: they're all gone now. - Remove other references to metadata files when loading directories, as none exist. I checked last 14 days of logs for "found legacy metadata", there are no instances.
They have merged our PR nical/rust_debug#4 but they haven't released a new crate version yet. refs #7763
refs #7753 This PR is step (1) of removing sync walredo from Pageserver. Changes: * Remove the sync impl * If sync is configured, warn! and use async instead * Remove the metric that exposes `kind` * Remove the tenant status API that exposes `kind` Future Work ----------- After we've released this change to prod and are sure we won't roll back, we will 1. update the prod Ansible to remove the config flag from the prod pageserver.toml. 2. remove the remaining `kind` code in pageserver These two changes need no release inbetween. See #7753 for details.
Adds a test that is a reproducer for many tiered compaction bugs, both ones that have since been fixed as well as still unfxied ones: * (now fixed) #7296 * #7707 * #7759 * Likely also #7244 but I haven't tried that. The key ordering bug can be reproduced by switching to `merge_delta_keys` instead of `merge_delta_keys_buffered`, so reverting a big part of #7661, although it only sometimes reproduces (30-50% of cases). part of #7554
Add performance regress test for on-demand download throughput. Closes #7146 Co-authored-by: Christian Schwarz <christian@neon.tech> Co-authored-by: Alexander Bayandin <alexander@neon.tech>
…_policy (#7769) Signed-off-by: Alex Chi Z <chi@neon.tech>
3060 tests run: 2933 passed, 0 failed, 127 skipped (full report)Code coverage* (full report)
* collected from Rust tests only The comment gets automatically updated with the latest test results
c6d5ff9 at 2024-05-16T06:49:18.333Z :recycle: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM as I don't think proxy release should be blocked on / affected by storage changes.
There's no codeowner checks for release PRs ;) |
Reviewed for changelog. |
Proxy release 2024-05-16
Please merge this Pull Request using 'Create a merge commit' button