🐞 Dagger CLI v0.11.4 deadlocks when running dagger call --source=.:default test all
#7387
Comments
Just linking to another client deadlock here: #7048. Might not be the same, but TL;DR - if the span doesn't ever complete, then the client will deadlock.
You don't happen to have a dagger cloud run or an engine SIGQUIT result, do you? Without those, it's difficult to see what the engine is trying to do that's taking so long, or whether there's a runaway span somewhere. Glancing through the stack trace, there's one specific goroutine that stands out: goroutine 1 is stuck executing a query - it looks like everything else is just progress waiting for it to terminate (there's also a goroutine leak for the streaming output, which should be fixed as part of #7272 when it gets removed entirely).
Today I was running testdev on our CI and it looks like it got stuck (possibly a deadlock?) for about 40 minutes until finally exiting: Checking the runner pod, I could see the process was still up and the dagger engines (within the runner container) were up too: Not sure if it's useful, but here are the logs of the engine in the node: I don't have a link to dagger cloud - can one be dug up from the logs? The run was this one: https://github.com/dagger/dagger/actions/runs/9084221778/job/25002579112. Raw logs: https://pipelinesghubeus2.actions.githubusercontent.com/QzgPDIDyToE3J2EXFsF53FDrKXVmSNMaCDdKFJvQQISWafQ8Xj/_apis/pipelines/1/runs/126051/signedlogcontent/22?urlExpires=2024-05-15T17%3A48%3A16.9360768Z&urlSigningMethod=HMACV1&urlSignature=Unk2A%2F5EyznUgRLhgdfM39O0F8C1mzMXqHtg5vlf4Z4%3D
I'm able to make the engine misbehave by doing some stress testing:
code:

```go
package main
import (
"context"
"fmt"
"golang.org/x/sync/errgroup"
)
type Repro struct{}
// Run spawns 10k short-lived containers in parallel to stress the engine
func (m *Repro) Run(ctx context.Context) string {
base := dag.Container().From("alpine:latest")
eg := errgroup.Group{}
for i := 0; i < 10000; i++ {
i := i
eg.Go(func() error {
_, err := base.
WithExec([]string{"echo", fmt.Sprintf("hello %d", i)}).
WithExec([]string{"sleep", "5"}).
Sync(ctx)
return err
})
}
if err := eg.Wait(); err != nil {
panic(err)
}
return "done"
}
```
@jedevc Dagger Cloud trace + GitHub Actions job from last time this deadlock happened in our CI:
I am trying to reproduce locally so that I can get that goroutines dump from the Engine.
@aluzzardi I think that's definitely a legitimate failure, but I think this is likely different: that error message looks quite different from what we see in https://dagger.cloud/dagger/traces/3add32bb1a460bde62b9ccbbfd25d0ca?span=4e373cd85ec633b2&logs#4e373cd85ec633b2:L1542 (love the permalinking for logs in dagger cloud). A couple of notes on the linked trace:
Debugging this isn't much fun either. For the next release, I think we should make sure to get:
Starting with @aluzzardi's repro, I spawned 10k x
I didn't have the patience to wait for the rest. This is what the above looks like in one picture: This looks like a "thundering herd" problem to me. Without being able to look inside BuildKit, I cannot tell which component is stressed. My guess is that there is some sort of worker pool that gets overwhelmed by the 10k containers that need to get spawned. Looking at the disk activity, my assumption is that there is a special goroutine that allocates an fs layer for each container, and that it gets overwhelmed. Maybe @sipsma or @jedevc know more. I am going to see if I can attach some sort of continuous profiler and understand which function blocks the most. cc @thorfour @brancz By the time I finished typing this up, things were back to normal:
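If it really is a worker-pool bottleneck, a concurrency-limited variant of the repro above should degrade far more gracefully, which would help confirm the hypothesis. This is only a sketch of that experiment: it reuses the same module type, imports, and generated `dag` client as the repro, and the limit of 100 is an arbitrary value picked for illustration, not a recommendation.

```go
// RunBounded is a concurrency-limited variant of Run above: the same 10k
// execs, but with at most 100 in flight at a time via errgroup's SetLimit.
func (m *Repro) RunBounded(ctx context.Context) string {
	base := dag.Container().From("alpine:latest")
	eg := errgroup.Group{}
	// Cap in-flight containers; raise/lower this to probe where the engine
	// starts to struggle. Must be set before any eg.Go calls.
	eg.SetLimit(100)
	for i := 0; i < 10000; i++ {
		i := i
		eg.Go(func() error {
			_, err := base.
				WithExec([]string{"echo", fmt.Sprintf("hello %d", i)}).
				WithExec([]string{"sleep", "5"}).
				Sync(ctx)
			return err
		})
	}
	if err := eg.Wait(); err != nil {
		panic(err)
	}
	return "done"
}
```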
My first guess (truly just a hunch though) at the error hit in @aluzzardi's example is that the system is running out of memory and processes are getting OOM killed. This is just based on the repro spinning up thousands of containers and the 137 exit code mentioned in the error.
When I tried to reproduce, @sipsma, I didn't get any OOMs, just a very slow start and slow progress. Most of it happened after
@gerhard if you didn't see the same error as Andrea about exit code 137, then yeah, it's probably not getting OOM killed (which is gonna vary host to host of course). That's just my guess about what happened for him. I'm gonna start taking a look into this now. Yesterday I independently hit an integ test case that hung for 5 minutes while waiting for OTEL to drain (and the 5 minute timeout was just the timeout the SDKs use while waiting for that drain). No clue if it's related to everything else here though; will try to see if I can repro the other behavior here.
Okay yeah the 5 minute hang on OTEL draining repros consistently on main for me:
Funnily if you SIGQUIT on the
Attached the engine container logs, w/
cc @vito I think either way, one outcome of all this should be to not block indefinitely for OTEL to drain (though having a generous timeout makes perfect sense). Gonna try to repro the other hangs reported here. I guess it's possible they are related somehow, but the description of them so far does sound like something else entirely. EDIT: of course, after I posted this and started reading, I saw this comment and issue #7387 (comment). Will mention this over in that issue too.
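For the "don't block indefinitely" part, the usual pattern in the Go OTel SDK is to bound the flush with a context deadline. A minimal sketch of that idea, using the stock go.opentelemetry.io/otel/sdk/trace API rather than Dagger's actual shutdown path:

```go
package main

import (
	"context"
	"log"
	"time"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	// Stand-in for whatever provider/exporter the client really configures.
	tp := sdktrace.NewTracerProvider()

	// Shutdown flushes pending spans, but gives up once the deadline passes
	// instead of blocking the caller forever on a drain that never finishes.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()
	if err := tp.Shutdown(ctx); err != nil {
		log.Printf("telemetry drain did not complete cleanly: %v", err)
	}
}
```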
@gerhard the SIGQUIT stack you attached in the description works in swirly for me; the logs seem to have one stack trace, then logs from the engine running again, and then another SIGQUIT dump. If you trim it down to just the last goroutine dump, it loads quickly. I created a gist you can put as the URL in swirly here: https://gist.githubusercontent.com/sipsma/bfd34d24b3a51c64fafa0eb93abbe5a1/raw/0b3640b1034ae509bdfec9791c985b011b2e85e7/gistfile1.txt Looking at it, the most significant thing that sticks out is that there are actual containers still running that the engine has been waiting on for 53 minutes, which suggests that there are processes (possibly function calls) that are stuck and never exiting. I am trying to repro locally now by just running the tests repeatedly, but if anyone else manages to repro it, execing into the engine container and running
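Not the command that got cut off above, but for anyone who wants a goroutine dump from a running Go process without killing it, one generic (Unix-only) option is to dump stacks on a spare signal. A minimal sketch, assuming SIGUSR1 is free to use; this is not something the engine does today:

```go
package main

import (
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
	"time"
)

func main() {
	// Print all goroutine stacks to stderr whenever SIGUSR1 arrives,
	// without terminating the process (unlike SIGQUIT).
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGUSR1)
	go func() {
		for range sig {
			pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
		}
	}()

	// Stand-in for the real workload.
	time.Sleep(time.Hour)
}
```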
Thanks for helping out @sipsma! I was able to trigger the hang again and capture it. Here is the full CLI log: And this is just the Engine goroutines dump: This is what they look like loaded in Swirly: Screen.Recording.2024-05-20.at.13.32.46.mp4

There are a lot of grpc-related goroutines - is that normal? Anything else that looks odd? I am going to try and use your repro to see if I can trigger this quicker. Before, it was taking up to 50mins. I now have a second host which is able to finish the same test suite in 30mins. Put together, I should be able to repro quicker. This helps:

```shell
dagger-test-all() {
  local parallel="${1:-16}"
  local timeout="${2:-180m}"
  time dagger call --debug --source=".:default" test all --race=true --failfast=true --parallel=$parallel --timeout=$timeout 2>&1 | ts | tee test.all.race.failfast.parallel${parallel}.timeout${timeout}.$(date "+%Y-%m-%d.%H-%M-%S").txt
}
```

I am going to open a PR which makes those flags configurable.
Noticed one issue. tl;dr: telemetry draining is keyed on trace ID, so if you open multiple clients in one trace, closing the second client will hang waiting for the first client's session span, since it's in the same trace ID. (The timeout is 5 minutes.)

Context

I'm doing a bunch of refactoring now that the Go OTel SDK has logging (only HTTP, no gRPC - that'll come up later). Along the way I noticed one of the container integration tests hanging on drain. Looking deeper, I noticed the test does something unusual. First, it connects like normal: dagger/core/integration/container_test.go Line 1124 in da68978
But then in two of the sub-tests it uses dagger/core/integration/container_test.go Lines 1135 to 1144 in da68978
Internally, that means both clients end up in the same trace. So there's a fundamental problem with the telemetry pub/sub: it shouldn't be keyed by trace ID. (Side issue: using the trace ID also meant there was cross-talk; a nested client would see all activity across the trace, instead of only its own progress. Nothing breaks, but it's a bit weird.)

Solution?

Current idea is to just do away with the pub/sub and push telemetry straight to the client over gRPC, similar to how Buildkit attachables work - just gRPC services served on the client conn. Will look into this today. I mentioned earlier that there's no OTLP gRPC client for logging, but I can keep the one we wrote. It's easy to use for the engine (e.g. stdout/stderr log forwarding), we just can't package it for the Go SDK (long story - tl;dr it's a lot of code and our code sharing between the Go SDK and Go modules is limited to 1 package).
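To make the keying problem concrete, here is a toy model of the drain (made-up keys, not the real pub/sub code): keyed by trace ID, a nested client can never finish draining while the outer client's session span is still open, which is exactly the 5-minute hang; keyed per client, it only waits on its own spans.

```go
package main

import (
	"fmt"
	"sync"
)

// drainer is a toy stand-in for the telemetry pub/sub: it just counts spans
// that are still open under a given key.
type drainer struct {
	mu   sync.Mutex
	open map[string]int
}

func (d *drainer) start(key string)  { d.mu.Lock(); d.open[key]++; d.mu.Unlock() }
func (d *drainer) finish(key string) { d.mu.Lock(); d.open[key]--; d.mu.Unlock() }

func (d *drainer) drained(key string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	return d.open[key] == 0
}

func main() {
	d := &drainer{open: map[string]int{}}

	// Keyed by trace ID: the outer client and the nested client share the key.
	d.start("trace-abc")  // outer client's session span
	d.start("trace-abc")  // nested client's session span
	d.finish("trace-abc") // nested client closes its session span...
	// ...but the outer session span is still open, so the nested client waits.
	fmt.Println("nested client drained?", d.drained("trace-abc")) // false

	// Keyed per client instead, each client only waits on its own spans.
	d.start("client-outer")
	d.start("client-nested")
	d.finish("client-nested")
	fmt.Println("nested client drained?", d.drained("client-nested")) // true
}
```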
Thank you @vito for joining the party! That is definitely worth trying out, glad to help test the PR 💪 Just hit this, which is related but different: @sipsma I am able to reliably reproduce it with:

./hack/dev go test -v -count=1 -timeout=1h -run=TestFileServiceSync/triggers_error ./core/integration/

I can see the timeout, and this would explain why some tests take a much longer time than expected to return. I will spend more time debugging here: dagger/sdk/go/internal/engineconn/session.go Lines 165 to 167 in da68978
Anything else that you would like me to help reproduce / share?
@vito if we do away with the pubsub, does that break |
@jedevc Not necessarily; we could export to both the pub/sub and the per-client telemetry pretty easily.
This was discovered as part of:

- dagger call --source=.:default test all is timing out locally - inconsistent behaviour #7339
- Somewhat similar to this, but different, since in this case the host has 16 CPUs and no OCI max parallelism set: --oci-max-parallelism 2 causes deadlocks #6894

What is the issue?
When running dagger call --source=.:default test all locally, the CLI will remain stuck and will need to be manually killed, as mentioned by @jpadams in a private thread. This happens approximately 1 in every 10 runs.

When a timeout is used - it defaults to 30m in our case - this is what the system looks like for that entire period (I have this set to 60m in my local runs):

This was observed with the standard v0.11.4 Engine running in Docker:

Notice that there is no --oci-max-parallelism set.

When we hit this deadlock, sending a SIGQUIT to the CLI returns the following: dagger-cli-v0.11.4-sigquit.txt. To visualise it, https://vito.github.io/swirly/ is most useful - thanks @vito 💪

Highlighting a couple of goroutines that caught my attention:
👆 there are a lot more channels here
👆 there are a lot more channels here
Tip: To visualize all the goroutines, load dagger-cli-v0.11.4-sigquit.txt into https://vito.github.io/swirly/
I was able to SIGQUIT the Engine when it deadlocked: dagger-engine-v0.11.4-sigquit.txt.gz

There were 4202 goroutines, which is a bit too much for https://vito.github.io/swirly/ to handle, but it did load part of it. These looked interesting:

👆 how many can sem.Acquire() at the same time?

👆 All GC workers were idle, which makes me believe that this is not GC related.
Questions
Does anything look suspicious to you @sipsma @vito @jedevc @aluzzardi ?
Anything else that we could add to make debugging easier?
Next steps
- SIGQUIT the Engine, upload the goroutines dump
- SIGQUIT the Engine, upload that goroutines dump

FWIW, in the last 7 days, we have hit this issue 9% of the time across all our CI runs. cc @samalba