Session Cleanup+Simplifications #7315

Open
wants to merge 8 commits into main from refactor-server-and-bk

Conversation

sipsma
Contributor

@sipsma sipsma commented May 8, 2024

Another part of the effort to support #6916 while also doing general cruft cleanup and setup for various upcoming efforts.

This changeset focuses on making sessions + associated state management simpler:

  1. More comprehensible+centralized state management
    • Rather than being spread all over the place and tied together in random places, all of the state associated with a given session now lives in a daggerSession object, and all of the state associated with a given client in a session lives in a daggerClient object
    • The code is also a lot more structured and "boring" in terms of locking/mutating state/etc. It's not a Rube Goldberg machine anymore
    • The whole "pre-register a nested client's state before it calls" dance, which was a fountain of confusion and bugs, is gone.
      • e.g. a bug was reported recently with use of terminal and nested sessions that was caused by this registration, but this PR had already accidentally fixed it, so there's just a commit with test coverage here
  2. No more insane gRPC tunneling; the engine API is just an HTTP server now
    • GraphQL HTTP requests are just that; we don't have to tunnel them through gRPC streams
    • session attachables are still gRPC based, but over a hijacked http conn (as opposed to a gRPC stream embedded in another gRPC stream)
    • This allowed us to move off the Session method from buildkit's upstream controller interface
    • That in turn let us delete huge chunks of complicated code around handling conns (i.e. engine/server/conn.go) and means we no longer need to be paranoid about gRPC max message limits in as many places
    • This also allowed us to enable connection re-use for requests from the engine client to the engine server
  3. The overall engine-wide state (mostly various buildkit+containerd entities) is also centralized now rather than spread confusingly amongst many files, which is slightly tangential but supported the above efforts.

Details

Objects + state + naming

  • Server - formerly known as BuildkitController
    • This is not to be confused with the thing previously called Server, which was really more like the session state (and was thus confusing)
    • All the "global" state for various buildkit+containerd entities like snapshotters, various cache dbs, the solver, worker+executor, etc. Also top level state for which sessions currently exist
      • There's a lot in there, but I personally much prefer it in one place rather than spread all over.
    • Serves an HTTP API for gql queries, session attachables and shutdown, with requests scoped to the client based on clientMetadata (which is now sent in an http header). Code.
    • Still implements the BuildkitController API since we do have some reliance on ListWorker at least, though we are free to change any+all of those to core API (i.e. gql) calls at any point (just got a bit too out of scope here)
  • daggerSession and daggerClient
    • Basically what it says on the tin: the session-wide state for each session and the client-specific state for each client in a session
    • Does state tracking with an enum of possible states like uninitialized, initialized, deleted (not complicated enough to go full-on state machine, but this still makes it all more obvious and easy to follow I think, especially when it comes to locking the state for mutations)
    • I moved all the state that used to co-exist in core.Query and buildkit.Client to be in these structs too, so there are fewer places to look+think about
    • One notable thing gone is ClientCallContext - instead of trying to register all of that we're "stateless" in that the module+function-call metadata for a client is just plumbed through ExecutionMetadata+ClientMetadata, following the request path rather than both that and a pre-registration side-channel
      • e.g. here's where the executor supplies the ClientMetadata it was plumbed in the requests made by the nested client
    • The logic for deciding when to end a session is now just "when the main client caller has no more active connections", at which point the session state is torn down and released. This is done by just incrementing and decrementing that count at the beginning/end of each http request (sketched just below)
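
As a rough sketch (illustrative names and fields only, not the actual dagger code), the state-enum plus connection-counting approach described above could look something like this:

```go
package server

import "sync"

// sessionState is a small enum rather than a full state machine.
type sessionState int

const (
	sessionStateUninitialized sessionState = iota
	sessionStateInitialized
	sessionStateRemoved
)

// ClientMetadata stands in for the per-client metadata now sent in an HTTP
// header (fields are illustrative).
type ClientMetadata struct {
	ClientID  string
	SessionID string
}

// daggerClient is the client-specific state within a session (illustrative subset).
type daggerClient struct {
	id       string
	metadata ClientMetadata
}

// daggerSession is the session-wide state (illustrative subset).
type daggerSession struct {
	mu    sync.Mutex
	state sessionState

	clients map[string]*daggerClient

	// number of in-flight requests from the main client; when this drops to
	// zero the session is torn down and released
	mainClientConns int
}

// trackMainClientConn is called at the start of each HTTP request from the main
// client; the returned release func is deferred so the count is decremented when
// the request ends, and the session is removed once no connections remain.
func (s *daggerSession) trackMainClientConn(remove func(*daggerSession)) (release func()) {
	s.mu.Lock()
	s.mainClientConns++
	s.mu.Unlock()
	return func() {
		s.mu.Lock()
		s.mainClientConns--
		done := s.mainClientConns == 0 && s.state == sessionStateInitialized
		if done {
			s.state = sessionStateRemoved
		}
		s.mu.Unlock()
		if done {
			remove(s) // tear down and release all session state
		}
	}
}
```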

HTTP/2 Usage

  • I updated all the HTTP servers we create to explicitly support HTTP/2 (with h2c, aka no TLS) while also supporting HTTP/1 clients (a minimal h2c sketch follows this list)
  • The main motivation here was:
    • We wanted to get rid of the gRPC tunneling (for simplicity, performance, detaching from the BuildkitController.Session API and its associated complications, etc.)
    • But that meant that every time the http client needed to add to the connection pool it would have to invoke the connhelper (rather than open a new gRPC stream), which is an expensive operation for e.g. docker-container, kube-pod, etc. (it spawns a subprocess)
    • HTTP/2 solves that problem via stream multiplexing; Go's HTTP/2 client by default only needs a pool of 2 conns (one for reqs, one for resps) and can just multiplex everything from there
      • I briefly looked at HTTP/3 since just sending udp packets back and forth would be even simpler conceptually, but it's still too immature+low-level in the go ecosystem
  • This seems to have worked pretty seamlessly, other than one gotcha I hit where only the typescript tests using node 18 were erroring out (fix with details here)
  • There is also still a need to serve some gRPC APIs for the few remaining buildkit controller APIs we use and OTel, which is done via gRPC http handlers
    • The docs on that suggest there are some missing advanced gRPC features (e.g. BFD, big frame detection, which is a performance optimization), but none of them have been obviously relevant to our use case. In the utterly worst-case scenario, there is a fallback option of serving http + grpc on separate listeners, but we're avoiding that complication unless proven 100% necessary
    • These can also be migrated to pure graphql/plain-http APIs as desired
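
For reference, here's a minimal sketch of serving HTTP/2 over cleartext (h2c) on the same listener that still accepts HTTP/1 clients, using golang.org/x/net/http2/h2c. The real engine wiring differs; the handler, route, and address here are placeholders:

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"golang.org/x/net/http2"
	"golang.org/x/net/http2/h2c"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/query", func(w http.ResponseWriter, r *http.Request) {
		// r.Proto is "HTTP/2.0" for h2c clients and "HTTP/1.1" for plain HTTP/1 clients
		fmt.Fprintf(w, "served over %s\n", r.Proto)
	})

	// h2c.NewHandler serves cleartext HTTP/2 (no TLS) on the same listener while
	// still accepting HTTP/1 requests from clients that don't speak HTTP/2.
	srv := &http.Server{
		Addr:    ":8080",
		Handler: h2c.NewHandler(mux, &http2.Server{}),
	}
	log.Fatal(srv.ListenAndServe())
}
```

On the client side, one common way to speak h2c with Go's http2.Transport is to set AllowHTTP and override DialTLSContext to return a plaintext conn; that multiplexing is what lets the pool stay at a couple of conns instead of invoking the connhelper per request.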

Session Attachables

  • As mentioned above, session attachables no longer require 2 layers of gRPC tunnels; instead there's just a /sessionAttachables http endpoint, which the server hijacks and uses as a raw conn for establishing the gRPC streams
  • That "hijack and invert client/server relationship" process involves a small dance in order to be robust against accidentally mixing/overlapping http+gRPC traffic, which can, unsurprisingly, confuse the computer
    • Client-side and server-side implementation, with comments explaining. Basically just an http req/resp + a 1-byte ack to synchronize the switch to gRPC (see the sketch after this list)
    • I wanted to use upstream buildkit's builtin SessionManager.HandleHTTP method, which is somewhat similar, but it didn't handle the switch from http->grpc synchronously and was resulting in data getting mixed sometimes
  • A nice side effect of this in combination with the session state simplifications is that we no longer need to do the whole "retry making a request to verify the session is working"
    • Instead, if we successfully connect these session attachables, we can know the session as a whole was successfully initialized and we can unblock client connect, returning to the caller
    • That had the further nice side effect of reducing the possibility of race conditions when requesting the caller session in various server-side APIs
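
A simplified sketch of the server side of that handshake (names like serveGRPCOnConn are hypothetical, and the ack byte's value and direction are assumptions based on the description above; the actual linked implementation differs):

```go
package server

import (
	"context"
	"net"
	"net/http"
)

// serveSessionAttachables sketches the hijack-and-invert flow: the client requests
// /sessionAttachables (over HTTP/1 so the conn can be hijacked), the server hijacks
// the conn, writes a response plus a single ack byte, and only then does gRPC
// traffic start flowing over the raw conn.
func serveSessionAttachables(serveGRPCOnConn func(context.Context, net.Conn)) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		hj, ok := w.(http.Hijacker)
		if !ok {
			http.Error(w, "hijacking not supported", http.StatusInternalServerError)
			return
		}
		conn, buf, err := hj.Hijack()
		if err != nil {
			return
		}

		// Minimal HTTP response followed by a 1-byte ack so the client knows
		// exactly when this conn stops being HTTP and becomes a raw gRPC conn,
		// preventing the two protocols from ever overlapping on the wire.
		buf.WriteString("HTTP/1.1 200 OK\r\n\r\n")
		buf.WriteByte(0x01)
		if err := buf.Flush(); err != nil {
			conn.Close()
			return
		}

		// From here on the roles are inverted: the engine acts as the gRPC
		// client, dialing the session attachables served by the CLI over conn.
		serveGRPCOnConn(r.Context(), conn)
	}
}
```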

@sipsma
Contributor Author

sipsma commented May 20, 2024

Ended up spinning out the support for serving nested execs from the executor here (it ultimately just became removal of the shim entirely). Coming back here now to finish up the rest of this refactor on top of that.

@sipsma sipsma force-pushed the refactor-server-and-bk branch 2 times, most recently from 0c8f915 to 1878470 on May 24, 2024 02:50
@sipsma sipsma force-pushed the refactor-server-and-bk branch 5 times, most recently from dd3c007 to a615f02 on June 4, 2024 19:22
@sipsma sipsma force-pushed the refactor-server-and-bk branch 2 times, most recently from 6f2b173 to 55a8e04 on June 5, 2024 05:59
@sipsma sipsma mentioned this pull request Jun 5, 2024
cmd/engine/main.go (outdated review thread, resolved)
@sipsma sipsma force-pushed the refactor-server-and-bk branch 5 times, most recently from 9542792 to 77406e4 on June 6, 2024 20:55
@sipsma sipsma added this to the v0.11.7 milestone Jun 6, 2024
@sipsma sipsma force-pushed the refactor-server-and-bk branch 2 times, most recently from 927d700 to 2780ed7 on June 7, 2024 20:19
@sipsma sipsma modified the milestones: v0.11.7, next Jun 7, 2024
Signed-off-by: Erik Sipsma <erik@sipsma.dev>
Signed-off-by: Erik Sipsma <erik@sipsma.dev>
The --experimental-privileged-nesting flag was broken when used with
terminal due to a panic around registering clients.

This was fixed by commits before this one, which completely removed the
need to register clients; this commit just backfills the test coverage.

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
@sipsma sipsma force-pushed the refactor-server-and-bk branch 2 times, most recently from 632daea to 22deba6 on June 8, 2024 00:12
@sipsma sipsma changed the title WIP refactor of server/sessions/buildkit-interfaces Session Cleanup+Simplifications Jun 8, 2024
@sipsma sipsma marked this pull request as ready for review June 8, 2024 01:31
@sipsma sipsma requested review from vito and jedevc June 8, 2024 01:35
Comment on lines +33 to +34
// TODO: is it safe to update the json name or do we need cloud coordination?
SessionID string `json:"server_id,omitempty"`
Contributor Author


This is a question for @vito @aluzzardi, not sure if we can just change the name of this json field and cloud will be fine or if it's more involved.

It doesn't actually matter that much, mostly aesthetics, but I guess discrepancies like this will add to confusion over the long term.

@sipsma
Contributor Author

sipsma commented Jun 8, 2024

Hit a test failure flake I haven't seen previously:

services_test.go:478: 
        	Error Trace:	/app/core/integration/services_test.go:478
        	Error:      	"input: container.from.withServiceBinding.withExec.sync resolve: start d8dj7hiquom5i (aliased as www): health check errored: checking for port 8080/tcp: namespace for kjvnbk4mnlt2v514715uk730l not found in running state\n" does not contain "start d8dj7hiquom5i (aliased as www): exited:"

Will presume it's this PR's fault until proven otherwise, may be missing some synchronization of services somewhere? Or could be the service exiting prematurely for unrelated reasons?

@sipsma
Contributor Author

sipsma commented Jun 8, 2024

@vito FYI I think I may have hit the theoretical flake you described in the OTEL PR:

    client_test.go:104: 
        	Error Trace:	/app/core/integration/client_test.go:104
        	Error:      	Not equal: 
        	            	expected: 1
        	            	actual  : 0
        	Test:       	TestClientMultiSameTrace

(tangential feature request - easier to copy logs from the cloud traces output 😄)

Does that look like it may be the flake you were imagining? This PR obviously changes all kinds of timing of almost everything, so not sure if it's just that or a legit issue.

@vito
Contributor

vito commented Jun 8, 2024

Does that look like it may be the flake you were imagining? This PR obviously changes all kinds of timing of almost everything, so not sure if it's just that or a legit issue.

Yep, that's the one. Sorry about that! I have an idea of how to fix it but it's a bit tricky. The problem is that trace and log data arrives independently, and we can't know whether a span has logs until we see logs for it for the first time, after which point we wait until EOF. But the test can still flake if we don't see the start of the logs before calling Close(). There's an `echo hey; sleep 0.5` to try to counteract that; I suppose we could bump that sleep, but it might still flake under load.

In terms of an actual fix, I'm thinking we would need to set an attribute on the span to indicate that logs or at least an EOF should be consumed for it, and update the draining logic accordingly. But the problem is we don't control the span creation. We can man-in-the-middle it, and look for spans starting with exec I suppose, similar to how we already man-in-the-middle the [internal] prefix into an attribute.

sipsma and others added 5 commits June 10, 2024 01:40
Signed-off-by: Erik Sipsma <erik@sipsma.dev>
These values are prohibited in HTTP/2+, but since we are proxying
requests in dagger session/listen that may be coming from HTTP/1
clients, we need to be sure to clear them out before forwarding to the
engine server, which is now HTTP/2.

Before this, I was seeing failures in just the typescript SDK tests that
used node 18, which apparently set `Connection: close` and caused all
sorts of strange behavior in the go http client.

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
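
As a hedged illustration of that fix (not the actual dagger session/listen proxy code), clearing hop-by-hop headers such as Connection before forwarding an HTTP/1 client's request to an HTTP/2 server could look like:

```go
package proxy

import (
	"net/http"
	"strings"
)

// Hop-by-hop headers are connection-specific and prohibited in HTTP/2.
var hopByHopHeaders = []string{
	"Connection",
	"Keep-Alive",
	"Proxy-Connection",
	"Te",
	"Trailer",
	"Transfer-Encoding",
	"Upgrade",
}

// stripHopByHop removes hop-by-hop headers (including any additional headers
// named in the Connection header itself) before a request received from an
// HTTP/1 client is forwarded over an HTTP/2 connection.
func stripHopByHop(h http.Header) {
	for _, v := range h.Values("Connection") {
		for _, name := range strings.Split(v, ",") {
			if name = strings.TrimSpace(name); name != "" {
				h.Del(name)
			}
		}
	}
	for _, name := range hopByHopHeaders {
		h.Del(name)
	}
}
```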
I realized that since FunctionCall includes arbitrary arg literal values
and arbitrary parent object field literal vals, it can theoretically get
almost arbitrarily large. This is a problem with HTTP headers since they
have a max size (Go default is 1 MB, though it can be raised).

Fortunately, there's no actual need to include this in HTTP headers; it
was mildly convenient, but since function call clients are served
directly from the executor in the engine process, we can just provide
the metadata to the session server *alongside* the http requests rather
than stuffing it into the requests themselves.

Another possibility would be to move it to the body one way or another,
but this approach was simpler overall than that.

Included an integ test for coverage of this (fails before this commit,
passes now).

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
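
A minimal sketch of the general idea (the type and helper names here are hypothetical, not dagger's actual API): since the nested client's requests are handled in-process, the potentially large metadata can ride along on the request's context instead of being serialized into headers.

```go
package server

import (
	"context"
	"net/http"
)

// ctxKey is an unexported key type for context values (illustrative).
type ctxKey struct{}

// ExecutionMetadata stands in for the function-call metadata that could exceed
// HTTP header size limits (Go's http.Server default is 1 MB) if serialized there.
type ExecutionMetadata struct {
	FunctionCallJSON []byte // potentially very large arg/parent-object literals
}

// WithExecutionMetadata attaches the metadata to the request's context; since the
// nested client is served directly from the executor in the engine process, this
// never has to cross the wire as a header.
func WithExecutionMetadata(r *http.Request, md *ExecutionMetadata) *http.Request {
	return r.WithContext(context.WithValue(r.Context(), ctxKey{}, md))
}

// ExecutionMetadataFrom retrieves the metadata on the serving side of the in-process call.
func ExecutionMetadataFrom(ctx context.Context) (*ExecutionMetadata, bool) {
	md, ok := ctx.Value(ctxKey{}).(*ExecutionMetadata)
	return md, ok
}
```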
Signed-off-by: Alex Suraci <alex@dagger.io>
I was still having to wait 5 minutes for OTEL logs to drain on some test
runs that involved service tunneling. I believe there was a race
condition where the tunnel code could end up writing logs after they
were closed, which seems to create a leak and cause us to have to wait 5
minutes before a timeout is hit.

I also saw a few other places where otel logs were created and not
closed, so added closes there too.

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
@sipsma
Contributor Author

sipsma commented Jun 10, 2024

Hit a test failure flake I haven't seen previously:

services_test.go:478: 
        	Error Trace:	/app/core/integration/services_test.go:478
        	Error:      	"input: container.from.withServiceBinding.withExec.sync resolve: start d8dj7hiquom5i (aliased as www): health check errored: checking for port 8080/tcp: namespace for kjvnbk4mnlt2v514715uk730l not found in running state\n" does not contain "start d8dj7hiquom5i (aliased as www): exited:"

Will presume it's this PR's fault until proven otherwise, may be missing some synchronization of services somewhere? Or could be the service exiting prematurely for unrelated reasons?

Can repro this one locally, think I know what's happening:

cc @vito to double check this adds up to you too

If that's correct, I think this is technically independent of this PR and more likely just getting triggered now due to timing differences introduced here. But need to double check and will attempt a fix in the interest of not increasing probability of flakes here.

@sipsma
Copy link
Contributor Author

sipsma commented Jun 10, 2024

On a positive note, with the extra telemetry draining fix commits appended to this PR, I'm seeing full engine tests take as little as 9 minutes, which is the fastest I've seen in a very long time 🎉

@sipsma
Contributor Author

sipsma commented Jun 11, 2024

Hit a test failure flake I haven't seen previously:

services_test.go:478: 
        	Error Trace:	/app/core/integration/services_test.go:478
        	Error:      	"input: container.from.withServiceBinding.withExec.sync resolve: start d8dj7hiquom5i (aliased as www): health check errored: checking for port 8080/tcp: namespace for kjvnbk4mnlt2v514715uk730l not found in running state\n" does not contain "start d8dj7hiquom5i (aliased as www): exited:"

Will presume it's this PR's fault until proven otherwise, may be missing some synchronization of services somewhere? Or could be the service exiting prematurely for unrelated reasons?

Can repro this one locally, think I know what's happening:

If that's correct, I think this is technically independent of this PR and more likely just getting triggered now due to timing differences introduced here. But need to double check and will attempt a fix in the interest of not increasing probability of flakes here.

Found and fixed that flake here #7610
