Session Cleanup+Simplifications #7315

Open
wants to merge 8 commits into main from refactor-server-and-bk

Conversation

sipsma
Contributor

@sipsma sipsma commented May 8, 2024

Another part of the effort to support #6916 while also doing general cruft cleanup and setup for various upcoming efforts.

This changeset focuses on making sessions + associated state management simpler:

  1. More comprehensible+centralized state management
    • Rather than being spread all over the place and tied together in random places, all of the state associated with a given session now lives in a daggerSession object, and all of the state associated with a given client in a session lives in a daggerClient object
    • The code is also a lot more structured and "boring" in terms of locking/mutating state/etc. It's not a Rube Goldberg machine anymore
    • The whole "pre-register a nested client's state before it calls" dance, which was a fountain of confusion and bugs, is gone.
      • e.g. a bug was reported recently with use of terminal and nested sessions that was caused by this registration, but this PR had already accidentally fixed it, so there's just a commit with test coverage here
  2. No more insane gRPC tunneling; the engine API is just an HTTP server now
    • GraphQL HTTP requests are just that; we don't have to tunnel them through gRPC streams
    • session attachables are still gRPC based, but over a hijacked http conn (as opposed to a gRPC stream embedded in another gRPC stream)
    • This allowed us to move off the Session method from buildkit's upstream controller interface
    • That in turn let us delete huge chunks of complicated code around handling conns (i.e. engine/server/conn.go) and means we no longer need to be paranoid about gRPC max message limits in as many places
    • This also allowed us to enable connection re-use for requests from the engine client to the engine server
  3. The overall engine-wide state (mostly various buildkit+containerd entities) is also centralized now rather than spread confusingly amongst many files, which is slightly tangential but supported the above efforts.

Details

Objects + state + naming

  • Server - formerly known as BuildkitController
    • This is not to be confused with the thing previously called Server, which was really more like the session state (and was thus confusing)
    • All the "global" state for various buildkit+containerd entities like snapshotters, various cache dbs, the solver, worker+executor, etc. Also top level state for which sessions currently exist
      • There's a lot in there, but I personally much prefer it in one place rather than spread all over.
    • Serves an HTTP API for gql queries, session attachables and shutdown, with requests scoped to the client based on clientMetadata (which is now sent in an http header). Code.
    • Still implements the BuildkitController API since we do have some reliance on ListWorker at least, though we are free to change any+all of those to core API (i.e. gql) calls at any point (just got a bit too out of scope here)
  • daggerSession and daggerClient
    • Basically what it says on the tin: the session-wide state for each session and the client-specific state for each client in a session
    • Does state tracking with an enum of possible states like uninitialized, initialized, deleted (not complicated enough to go full-on state machine, but this still makes it all more obvious and easy to follow I think, especially when it comes to locking the state for mutations)
    • I moved all the state that used to co-exist in core.Query and buildkit.Client to be in these structs too, so there are fewer places to look+think about
    • One notable thing gone is ClientCallContext - instead of trying to register all of that we're "stateless" in that the module+function-call metadata for a client is just plumbed through ExecutionMetadata+ClientMetadata, following the request path rather than both that and a pre-registration side-channel
      • e.g. here's where the executor supplies the ClientMetadata it was plumbed in the requests made by the nested client
    • The logic for deciding when to end a session is now just "when the main client caller has no more active connections", at which point the session state is torn down and released. This is done by just incrementing and decrementing that count at the beginning/end of each http request (sketched just below)
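
As a rough sketch (illustrative names and fields only, not the actual dagger code), the state-enum plus connection-counting approach described above could look something like this:

```go
package server

import "sync"

// sessionState is a small enum rather than a full state machine.
type sessionState int

const (
	sessionStateUninitialized sessionState = iota
	sessionStateInitialized
	sessionStateRemoved
)

// ClientMetadata stands in for the per-client metadata now sent in an HTTP
// header (fields are illustrative).
type ClientMetadata struct {
	ClientID  string
	SessionID string
}

// daggerClient is the client-specific state within a session (illustrative subset).
type daggerClient struct {
	id       string
	metadata ClientMetadata
}

// daggerSession is the session-wide state (illustrative subset).
type daggerSession struct {
	mu    sync.Mutex
	state sessionState

	clients map[string]*daggerClient

	// number of in-flight requests from the main client; when this drops to
	// zero the session is torn down and released
	mainClientConns int
}

// trackMainClientConn is called at the start of each HTTP request from the main
// client; the returned release func is deferred so the count is decremented when
// the request ends, and the session is removed once no connections remain.
func (s *daggerSession) trackMainClientConn(remove func(*daggerSession)) (release func()) {
	s.mu.Lock()
	s.mainClientConns++
	s.mu.Unlock()
	return func() {
		s.mu.Lock()
		s.mainClientConns--
		done := s.mainClientConns == 0 && s.state == sessionStateInitialized
		if done {
			s.state = sessionStateRemoved
		}
		s.mu.Unlock()
		if done {
			remove(s) // tear down and release all session state
		}
	}
}
```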

HTTP/2 Usage

  • I updated all the HTTP servers we create to explicitly support HTTP/2 (with h2c, aka no TLS) while also supporting HTTP/1 clients (a minimal h2c sketch follows this list)
  • The main motivation here was:
    • We wanted to get rid of the gRPC tunneling (for simplicity, performance, detaching from the BuildkitController.Session API and its associated complications, etc.)
    • But that meant that every time the http client needed to add to the connection pool it would have to invoke the connhelper (rather than open a new gRPC stream), which is an expensive operation for e.g. docker-container, kube-pod, etc. (it spawns a subprocess)
    • HTTP/2 solves that problem via stream multiplexing; Go's HTTP/2 client by default only needs a pool of 2 conns (one for reqs, one for resps) and can just multiplex everything from there
      • I briefly looked at HTTP/3 since just sending udp packets back and forth would be even simpler conceptually, but it's still too immature+low-level in the go ecosystem
  • This seems to have worked pretty seamlessly, other than one gotcha I hit where only the typescript tests using node 18 were erroring out (fix with details here)
  • There is also still a need to serve some gRPC APIs for the few remaining buildkit controller APIs we use and OTel, which is done via gRPC http handlers
    • The docs on that suggest there are some missing advanced gRPC features (e.g. BFD, big frame detection, which is a performance optimization), but none of them have been obviously relevant to our use case. In the utterly worst-case scenario, there is a fallback option of serving http + grpc on separate listeners, but we're avoiding that complication unless proven 100% necessary
    • These can also be migrated to pure graphql/plain-http APIs as desired
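
For reference, here's a minimal sketch of serving HTTP/2 over cleartext (h2c) on the same listener that still accepts HTTP/1 clients, using golang.org/x/net/http2/h2c. The real engine wiring differs; the handler, route, and address here are placeholders:

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"golang.org/x/net/http2"
	"golang.org/x/net/http2/h2c"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/query", func(w http.ResponseWriter, r *http.Request) {
		// r.Proto is "HTTP/2.0" for h2c clients and "HTTP/1.1" for plain HTTP/1 clients
		fmt.Fprintf(w, "served over %s\n", r.Proto)
	})

	// h2c.NewHandler serves cleartext HTTP/2 (no TLS) on the same listener while
	// still accepting HTTP/1 requests from clients that don't speak HTTP/2.
	srv := &http.Server{
		Addr:    ":8080",
		Handler: h2c.NewHandler(mux, &http2.Server{}),
	}
	log.Fatal(srv.ListenAndServe())
}
```

On the client side, one common way to speak h2c with Go's http2.Transport is to set AllowHTTP and override DialTLSContext to return a plaintext conn; that multiplexing is what lets the pool stay at a couple of conns instead of invoking the connhelper per request.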

Session Attachables

  • As mentioned above, session attachables no longer require 2 layers of gRPC tunnels; instead there's just a /sessionAttachables http endpoint, which the server hijacks and uses as a raw conn for establishing the gRPC streams
  • That "hijack and invert client/server relationship" process involves a small dance in order to be robust against accidentally mixing/overlapping http+gRPC traffic, which can, unsurprisingly, confuse the computer
    • Client-side and server-side implementation, with comments explaining. Basically just an http req/resp + a 1-byte ack to synchronize the switch to gRPC (see the sketch after this list)
    • I wanted to use upstream buildkit's builtin SessionManager.HandleHTTP method, which is somewhat similar, but it didn't handle the switch from http->grpc synchronously and was resulting in data getting mixed sometimes
  • A nice side effect of this in combination with the session state simplifications is that we no longer need to do the whole "retry making a request to verify the session is working"
    • Instead, if we successfully connect these session attachables, we can know the session as a whole was successfully initialized and we can unblock client connect, returning to the caller
    • That had the further nice side effect of reducing the possibility of race conditions when requesting the caller session in various server-side APIs
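
A simplified sketch of the server side of that handshake (names like serveGRPCOnConn are hypothetical, and the ack byte's value and direction are assumptions based on the description above; the actual linked implementation differs):

```go
package server

import (
	"context"
	"net"
	"net/http"
)

// serveSessionAttachables sketches the hijack-and-invert flow: the client requests
// /sessionAttachables (over HTTP/1 so the conn can be hijacked), the server hijacks
// the conn, writes a response plus a single ack byte, and only then does gRPC
// traffic start flowing over the raw conn.
func serveSessionAttachables(serveGRPCOnConn func(context.Context, net.Conn)) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		hj, ok := w.(http.Hijacker)
		if !ok {
			http.Error(w, "hijacking not supported", http.StatusInternalServerError)
			return
		}
		conn, buf, err := hj.Hijack()
		if err != nil {
			return
		}

		// Minimal HTTP response followed by a 1-byte ack so the client knows
		// exactly when this conn stops being HTTP and becomes a raw gRPC conn,
		// preventing the two protocols from ever overlapping on the wire.
		buf.WriteString("HTTP/1.1 200 OK\r\n\r\n")
		buf.WriteByte(0x01)
		if err := buf.Flush(); err != nil {
			conn.Close()
			return
		}

		// From here on the roles are inverted: the engine acts as the gRPC
		// client, dialing the session attachables served by the CLI over conn.
		serveGRPCOnConn(r.Context(), conn)
	}
}
```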

@sipsma
Contributor Author

sipsma commented May 20, 2024

Ended up spinning out the support for serving nested execs from the executor here (it ultimately just became removal of the shim entirely). Coming back here now to finish up the rest of this refactor on top of that.

@sipsma sipsma force-pushed the refactor-server-and-bk branch 2 times, most recently from 0c8f915 to 1878470 on May 24, 2024 02:50
@sipsma sipsma force-pushed the refactor-server-and-bk branch 5 times, most recently from dd3c007 to a615f02 on June 4, 2024 19:22
@sipsma sipsma force-pushed the refactor-server-and-bk branch 2 times, most recently from 6f2b173 to 55a8e04 on June 5, 2024 05:59
@sipsma sipsma mentioned this pull request Jun 5, 2024
cmd/engine/main.go (outdated review thread, resolved)
@sipsma sipsma force-pushed the refactor-server-and-bk branch 5 times, most recently from 9542792 to 77406e4 on June 6, 2024 20:55
@sipsma sipsma added this to the v0.11.7 milestone Jun 6, 2024
@sipsma sipsma force-pushed the refactor-server-and-bk branch 2 times, most recently from 927d700 to 2780ed7 on June 7, 2024 20:19
@sipsma sipsma modified the milestones: v0.11.7, next Jun 7, 2024
Signed-off-by: Erik Sipsma <erik@sipsma.dev>
Signed-off-by: Erik Sipsma <erik@sipsma.dev>
The --experimental-privileged-nesting flag was broken when used with
terminal due to a panic around registering clients.

This was fixed by commits before this one, which completely removed the
need to register clients; this commit just backfills the test coverage.

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
@sipsma sipsma force-pushed the refactor-server-and-bk branch 2 times, most recently from 632daea to 22deba6 on June 8, 2024 00:12
@sipsma sipsma changed the title WIP refactor of server/sessions/buildkit-interfaces Session Cleanup+Simplifications Jun 8, 2024
@sipsma sipsma marked this pull request as ready for review June 8, 2024 01:31
@sipsma sipsma requested review from vito and jedevc June 8, 2024 01:35
Comment on lines +33 to +34
// TODO: is it safe to update the json name or do we need cloud coordination?
SessionID string `json:"server_id,omitempty"`
Contributor Author


This is a question for @vito @aluzzardi, not sure if we can just change the name of this json field and cloud will be fine or if it's more involved.

It doesn't actually matter that much, mostly aesthetics, but I guess discrepancies like this will add to confusion over the long term.

@sipsma
Contributor Author

sipsma commented Jun 8, 2024

Hit a test failure flake I haven't seen previously:

services_test.go:478: 
        	Error Trace:	/app/core/integration/services_test.go:478
        	Error:      	"input: container.from.withServiceBinding.withExec.sync resolve: start d8dj7hiquom5i (aliased as www): health check errored: checking for port 8080/tcp: namespace for kjvnbk4mnlt2v514715uk730l not found in running state\n" does not contain "start d8dj7hiquom5i (aliased as www): exited:"

Will presume it's this PR's fault until proven otherwise, may be missing some synchronization of services somewhere? Or could be the service exiting prematurely for unrelated reasons?

@sipsma
Contributor Author

sipsma commented Jun 8, 2024

@vito FYI I think I may have hit the theoretical flake you described in the OTEL PR:

    client_test.go:104: 
        	Error Trace:	/app/core/integration/client_test.go:104
        	Error:      	Not equal: 
        	            	expected: 1
        	            	actual  : 0
        	Test:       	TestClientMultiSameTrace

(tangential feature request - easier to copy logs from the cloud traces output 😄)

Does that look like it may be the flake you were imagining? This PR obviously changes all kinds of timing of almost everything, so not sure if it's just that or a legit issue.

@vito
Contributor

vito commented Jun 8, 2024

Does that look like it may be the flake you were imagining? This PR obviously changes all kinds of timing of almost everything, so not sure if it's just that or a legit issue.

Yep, that's the one. Sorry about that! I have an idea of how to fix it but it's a bit tricky. The problem is that trace and log data arrives independently, and we can't know whether a span has logs until we see logs for it for the first time, after which point we wait until EOF. But the test can still flake if we don't see the start of the logs before calling Close(). There's an `echo hey; sleep 0.5` to try to counteract that; I suppose we could bump that sleep, but it might still flake under load.

In terms of an actual fix, I'm thinking we would need to set an attribute on the span to indicate that logs or at least an EOF should be consumed for it, and update the draining logic accordingly. But the problem is we don't control the span creation. We can man-in-the-middle it, and look for spans starting with exec I suppose, similar to how we already man-in-the-middle the [internal] prefix into an attribute.

sipsma and others added 5 commits June 10, 2024 01:40
Signed-off-by: Erik Sipsma <erik@sipsma.dev>
These values are prohibited in HTTP/2+, but since we are proxying
requests in dagger session/listen that may be coming from HTTP/1
clients, we need to be sure to clear them out before forwarding to the
engine server, which is now HTTP/2.

Before this, I was seeing failures in just the typescript SDK tests that
used node 18, which apparently set `Connection: close` and caused all
sorts of strange behavior in the go http client.

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
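
As a hedged illustration of that fix (not the actual dagger session/listen proxy code), clearing hop-by-hop headers such as Connection before forwarding an HTTP/1 client's request to an HTTP/2 server could look like:

```go
package proxy

import (
	"net/http"
	"strings"
)

// Hop-by-hop headers are connection-specific and prohibited in HTTP/2.
var hopByHopHeaders = []string{
	"Connection",
	"Keep-Alive",
	"Proxy-Connection",
	"Te",
	"Trailer",
	"Transfer-Encoding",
	"Upgrade",
}

// stripHopByHop removes hop-by-hop headers (including any additional headers
// named in the Connection header itself) before a request received from an
// HTTP/1 client is forwarded over an HTTP/2 connection.
func stripHopByHop(h http.Header) {
	for _, v := range h.Values("Connection") {
		for _, name := range strings.Split(v, ",") {
			if name = strings.TrimSpace(name); name != "" {
				h.Del(name)
			}
		}
	}
	for _, name := range hopByHopHeaders {
		h.Del(name)
	}
}
```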
I realized that since FunctionCall includes arbitrary arg literal values
and arbitrary parent object field literal vals, it can theoretically get
almost arbitrarily large. This is a problem with HTTP headers since they
have a max size (Go default is 1 MB, though it can be raised).

Fortunately, there's no actual need to include this in HTTP headers; it
was mildly convenient, but since function call clients are served
directly from the executor in the engine process, we can just provide
the metadata to the session server *alongside* the http requests rather
than stuffing it into the requests themselves.

Another possibility would be to move it to the body one way or another,
but this approach was simpler overall than that.

Included an integ test for coverage of this (fails before this commit,
passes now).

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
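
A minimal sketch of the general idea (the type and helper names here are hypothetical, not dagger's actual API): since the nested client's requests are handled in-process, the potentially large metadata can ride along on the request's context instead of being serialized into headers.

```go
package server

import (
	"context"
	"net/http"
)

// ctxKey is an unexported key type for context values (illustrative).
type ctxKey struct{}

// ExecutionMetadata stands in for the function-call metadata that could exceed
// HTTP header size limits (Go's http.Server default is 1 MB) if serialized there.
type ExecutionMetadata struct {
	FunctionCallJSON []byte // potentially very large arg/parent-object literals
}

// WithExecutionMetadata attaches the metadata to the request's context; since the
// nested client is served directly from the executor in the engine process, this
// never has to cross the wire as a header.
func WithExecutionMetadata(r *http.Request, md *ExecutionMetadata) *http.Request {
	return r.WithContext(context.WithValue(r.Context(), ctxKey{}, md))
}

// ExecutionMetadataFrom retrieves the metadata on the serving side of the in-process call.
func ExecutionMetadataFrom(ctx context.Context) (*ExecutionMetadata, bool) {
	md, ok := ctx.Value(ctxKey{}).(*ExecutionMetadata)
	return md, ok
}
```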
Signed-off-by: Alex Suraci <alex@dagger.io>
I was still having to wait 5 minutes for OTEL logs to drain on some test
runs that involved service tunneling. I believe there was a race
condition where the tunnel code could end up writing logs after they
were closed, which seems to create a leak and cause us to have to wait 5
minutes before a timeout is hit.

I also saw a few other places where otel logs were created and not
closed, so added closes there too.

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
@sipsma
Contributor Author

sipsma commented Jun 10, 2024

Hit a test failure flake I haven't seen previously:

services_test.go:478: 
        	Error Trace:	/app/core/integration/services_test.go:478
        	Error:      	"input: container.from.withServiceBinding.withExec.sync resolve: start d8dj7hiquom5i (aliased as www): health check errored: checking for port 8080/tcp: namespace for kjvnbk4mnlt2v514715uk730l not found in running state\n" does not contain "start d8dj7hiquom5i (aliased as www): exited:"

Will presume it's this PR's fault until proven otherwise, may be missing some synchronization of services somewhere? Or could be the service exiting prematurely for unrelated reasons?

Can repro this one locally, think I know what's happening:

cc @vito to double check this adds up to you too

If that's correct, I think this is technically independent of this PR and more likely just getting triggered now due to timing differences introduced here. But need to double check and will attempt a fix in the interest of not increasing probability of flakes here.

@sipsma
Copy link
Contributor Author

sipsma commented Jun 10, 2024

On a positive note, with the extra telemetry draining fix commits appended to this PR, I'm seeing full engine tests take as little as 9 minutes, which is the fastest I've seen in a very long time 🎉

@sipsma
Contributor Author

sipsma commented Jun 11, 2024

Hit a test failure flake I haven't seen previously:

services_test.go:478: 
        	Error Trace:	/app/core/integration/services_test.go:478
        	Error:      	"input: container.from.withServiceBinding.withExec.sync resolve: start d8dj7hiquom5i (aliased as www): health check errored: checking for port 8080/tcp: namespace for kjvnbk4mnlt2v514715uk730l not found in running state\n" does not contain "start d8dj7hiquom5i (aliased as www): exited:"

Will presume it's this PR's fault until proven otherwise, may be missing some synchronization of services somewhere? Or could be the service exiting prematurely for unrelated reasons?

Can repro this one locally, think I know what's happening:

If that's correct, I think this is technically independent of this PR and more likely just getting triggered now due to timing differences introduced here. But need to double check and will attempt a fix in the interest of not increasing probability of flakes here.

Found and fixed that flake here #7610
