
🐞 Dagger CLI v0.11.4 deadlocks when running dagger call --source=.:default test all #7387

Open · 3 of 4 tasks
gerhard opened this issue May 15, 2024 · 18 comments

@gerhard (Member) commented May 15, 2024

This was discovered as part of:

Somewhat similar to this, but different, since in this case the host has 16 CPUs and no OCI max parallelism set:

What is the issue?

When running dagger call --source=.:default test all locally, the CLI remains stuck and needs to be killed manually, as mentioned by @jpadams in a private thread. This happens in approximately 1 in every 10 runs.

When a timeout is used - it defaults to 30m in our case - this is what the system looks like for that entire period (I have it set to 60m in my local runs):
[screenshot]

This was observed with the standard v0.11.4 Engine running in Docker:

/ # ps aux | grep '[e]ngine'
    1 root      9:20 /usr/local/bin/dagger-engine --config /etc/dagger/engine.toml --debug
751327 root      4:07 /usr/local/bin/dagger-engine --config /etc/dagger/engine.toml --network-cidr=10.88.0.0/16 --network-name=dagger-dev

/ # cat /etc/dagger/engine.toml

debug = true
trace = false
insecure-entitlements = ["security.insecure"]

Notice that there is:

  • no --oci-max-parallelism set
  • no custom GC config

When we hit this deadlock, sending a SIGQUIT to the CLI returns the following: dagger-cli-v0.11.4-sigquit.txt. To visualise it, https://vito.github.io/swirly/ is most useful - thanks @vito 💪

Highlighting a couple of goroutines that caught my attention:

[screenshot]
[screenshot]
[screenshot]
👆 there are a lot more channels here
[screenshot]
👆 there are a lot more channels here

Tip

To visualize all the goroutines, load dagger-cli-v0.11.4-sigquit.txt into https://vito.github.io/swirly/
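For anyone trying to capture the same kind of dump, a rough sketch (the pgrep pattern and the stderr redirection are assumptions about how the CLI was invoked):

# run the command with stderr captured, then from another terminal ask the Go
# runtime for a goroutine dump once it hangs
dagger call --source=.:default test all 2> dagger-cli-sigquit.txt
kill -QUIT "$(pgrep -f 'dagger call --source=.:default test all')"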

I was able to SIGQUIT the Engine when it deadlocked: dagger-engine-v0.11.4-sigquit.txt.gz

There were 4202 goroutines, which is a bit too many for https://vito.github.io/swirly/ to handle, but it did load part of them. These looked interesting:

[screenshot]

👆 how many can sem.Acquire() at the same time?
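For context, a toy sketch (not the engine's actual code, and assuming these are golang.org/x/sync/semaphore weighted semaphores) of how Acquire caps concurrency - everything beyond the configured weight parks inside sem.Acquire, which is what these stacks look like:

package main

import (
	"context"
	"fmt"
	"sync"

	"golang.org/x/sync/semaphore"
)

func main() {
	// Only 2 goroutines may hold the semaphore at once; the other 8 block
	// inside sem.Acquire until a slot is released.
	sem := semaphore.NewWeighted(2)
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			if err := sem.Acquire(context.Background(), 1); err != nil {
				return
			}
			defer sem.Release(1)
			fmt.Println("working", i)
		}(i)
	}
	wg.Wait()
}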

[screenshot]

👆 All GC workers were idle, which makes me believe that this is not GC related.

Do you see anything else out of place?

Questions

Does anything look suspicious to you @sipsma @vito @jedevc @aluzzardi ?

Anything else that we could add to make debugging easier?

Next steps


FWIW, in the last 7 days, we have hit this issue 9% of the time across all our CI runs cc @samalba

@gerhard gerhard added the kind/bug Something isn't working label May 15, 2024
@jedevc (Member) commented May 15, 2024

Just linking to another client deadlock here: #7048. It might not be the same, but TL;DR - if the span never completes, then the client will deadlock.
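A minimal sketch of that failure shape (not the actual client code): if Close waits on a signal that only fires when the span ends, and the span never ends, Close never returns:

package main

// Toy model only: spanDone stands in for "the span has ended".
type client struct {
	spanDone chan struct{}
}

// Close blocks until the span ends - with no timeout, a span that never ends
// means Close never returns.
func (c *client) Close() { <-c.spanDone }

func main() {
	c := &client{spanDone: make(chan struct{})}
	// Nothing ever closes spanDone here, so this blocks forever; the Go
	// runtime reports "all goroutines are asleep - deadlock!".
	c.Close()
}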

@jedevc (Member) commented May 15, 2024

You don't happen to have a Dagger Cloud run or an engine SIGQUIT result, do you? Without those, it's difficult to see what the engine is trying to do that's taking so long, or whether there's a runaway span somewhere.

Glancing through the stack trace, there's one specific goroutine that stands out: goroutine 1 is stuck executing a query - it looks like everything else is just progress waiting for it to terminate (there's also a goroutine leak for the streaming output, which should be fixed as part of #7272 when it gets removed entirely).

@matipan (Contributor) commented May 15, 2024

Today I was running testdev on our CI and it looks like it got stuck (possibly a deadlock?) for about 40 minutes before finally exiting:

[screenshot]

Checking the runner pod, I could see the process was still up and the dagger engines (within the runner container) were up too:
[screenshot]

[screenshot]

Not sure if it's useful, but here are the logs of the engine on the node:

[screenshot]

I don't have a link to dagger cloud, can one be dug up from the logs? The run was this one: https://github.com/dagger/dagger/actions/runs/9084221778/job/25002579112. Raw logs: https://pipelinesghubeus2.actions.githubusercontent.com/QzgPDIDyToE3J2EXFsF53FDrKXVmSNMaCDdKFJvQQISWafQ8Xj/_apis/pipelines/1/runs/126051/signedlogcontent/22?urlExpires=2024-05-15T17%3A48%3A16.9360768Z&urlSigningMethod=HMACV1&urlSignature=Unk2A%2F5EyznUgRLhgdfM39O0F8C1mzMXqHtg5vlf4Z4%3D

@aluzzardi (Member) commented:

I'm able to make the engine misbehave by doing some stress testing:

Error: response from query: returned error 502 Bad Gateway: http do: Post "http://dagger/query": rpc error: code = Unavailable desc = error reading from server: command [docker exec -i dagger-engine-63121e921843d412 buildctl dial-stdio] has exited with exit status 137, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=

code:

package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

type Repro struct{}

// Run spawns 10,000 containers in parallel, each running two execs
func (m *Repro) Run(ctx context.Context) string {
	base := dag.Container().From("alpine:latest")

	eg := errgroup.Group{}
	for i := 0; i < 10000; i++ {
		i := i
		eg.Go(func() error {
			_, err := base.
				WithExec([]string{"echo", fmt.Sprintf("hello %d", i)}).
				WithExec([]string{"sleep", "5"}).
				Sync(ctx)
			return err
		})
	}

	if err := eg.Wait(); err != nil {
		panic(err)
	}

	return "done"
}
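Assuming the module above is installed as something like repro, dagger call run should kick it off (the exact invocation depends on how the module is set up); expect it to hammer the engine with 10k container spawns at once.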

@gerhard (Member, Author) commented May 16, 2024

@jedevc Dagger Cloud trace + GitHub Actions job from last time this deadlock happened in our CI:

I am trying to reproduce locally so that I can get that goroutine dump from the Engine.

@jedevc (Member) commented May 16, 2024

@aluzzardi I think that's definitely a legitimate failure, but I think this is likely different; that error message looks quite different from what we see in https://dagger.cloud/dagger/traces/3add32bb1a460bde62b9ccbbfd25d0ca?span=4e373cd85ec633b2&logs#4e373cd85ec633b2:L1542 (love the permalinking for logs in Dagger Cloud).


A couple notes on the linked trace:

  • Failed to connect; retrying... make request: returned error 500 Internal Server Error: client call for vkfcdrid1evyi1rddug25vrgg not found is weird. I haven't seen this before.
  • Interesting that a lot of the failed tests are typescript. A few things could be causing this:

Debugging this isn't much fun either. For the next release, I think we should make sure to get:

@gerhard (Member, Author) commented May 16, 2024

Starting with @aluzzardi's repro, I spawned 10k x sleep 1 in parallel. While the behaviour is completely different from what we have been hunting here, it's even worse:

  1. There is a 30% spike in CPU usage
  2. Not much else happens apart from CPU iowait staying at 6%
  3. ^C the dagger client
  4. There is a smaller CPU usage spike - 6%
  5. CPU iowait remains at the same 6%
  6. CPU iowait spikes to 11%, drops back to 6%, then spikes again to 11%
  7. CPU spikes to 60% - stuff is happening now! BTW, I exited the dagger client a good 5 minutes ago
  8. Engine continues with the high activity

I didn't have the patience to wait for the rest. This is what the above looks like in one picture:
[screenshot]

This looks like a "thundering herd" problem to me. Without being able to look inside BuildKit, I cannot tell which component is stressed. My guess is that there is some sort of worker pool that gets overwhelmed by the 10k containers that need to get spawned. Looking at the disk activity, my assumption is that there is a dedicated goroutine allocating an fs layer for each container, and that it gets overwhelmed. Maybe @sipsma or @jedevc know more. I am going to see if I can attach some sort of continuous profiler and understand which function blocks the most. cc @thorfour @brancz


By the time I finished typing this up, things were back to normal:
[screenshot]

@sipsma (Contributor) commented May 16, 2024

My first guess (truly just a hunch though) at the error hit in @aluzzardi's example is that the system is running out of memory and processes are getting OOM killed. This is just based on the repro spinning up thousands of containers and the 137 exit code. If that is what's happening, starting dmesg -w in one terminal and then running the repro in another should show kernel logs about the OOM killer in the dmesg output.
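Something like this should catch it (a sketch; the exact kernel message wording varies between kernels):

# in one terminal, follow kernel messages while the repro runs in another
sudo dmesg -wT | grep -iE 'out of memory|oom[-_ ]?kill'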

@gerhard (Member, Author) commented May 16, 2024

When I tried to reproduce this, @sipsma, I didn't get any OOMs, just a very slow start and slow progress. Most of it happened after the dagger CLI exited. Happy to pair up synchronously if that will help us make progress with this.

@sipsma (Contributor) commented May 17, 2024

> When I tried to reproduce this, @sipsma, I didn't get any OOMs, just a very slow start and slow progress. Most of it happened after the dagger CLI exited. Happy to pair up synchronously if that will help us make progress with this.

@gerhard if you didn't see the same error as Andrea about exit code 137 then yeah it's probably not getting OOM killed (which is gonna vary host to host of course). That's just my guess about what happened for him.

I'm gonna start taking a look into this now. Yesterday I independently hit an integ test case that hung for 5 minutes while waiting for OTEL to drain (and the 5 minute timeout was just the timeout of SDKs waiting for the dagger session subcommand to exit after Close is called; otherwise it would have hung forever, afaict). Will make sure it still repros on main.

No clue if it's related to everything else here though, will try to see if I can repro the other behavior here.

@sipsma (Contributor) commented May 17, 2024

Okay yeah the 5 minute hang on OTEL draining repros consistently on main for me:

./hack/dev go test -v -count=1 -timeout=1h -run=TestFileServiceSync/triggers_error ./core/integration/

Funnily enough, if you SIGQUIT or otherwise kill the dagger session subprocess at any point after the hang starts but before the 5 minute timeout is hit, the test still passes 😄 This is because we aren't checking the error returned by Close in that test case (or almost any test case, for that matter - it's not something we previously thought we needed to assert on).
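A minimal sketch of what asserting on it could look like (illustrative only, assuming testify-style require and the connect helper the integ tests already use):

package integration

import (
	"testing"

	"github.com/stretchr/testify/require"
)

func TestSomething(t *testing.T) {
	c, ctx := connect(t) // existing test helper
	_ = ctx

	t.Cleanup(func() {
		// A hang or error during Close now fails the test loudly instead of
		// being silently ignored.
		require.NoError(t, c.Close())
	})

	// ... test body ...
}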

SIGQUIT on the dagger session subprocess of the Go SDK client used in the tests shows it stuck on these lines:

  1. https://github.com/sipsma/dagger/blob/3aeea0ea8940db6ea415cf4740c5289a7331a923/engine/client/client.go#L491-L491
  2. https://github.com/sipsma/dagger/blob/3aeea0ea8940db6ea415cf4740c5289a7331a923/engine/client/client.go#L535-L535

Attached the engine container logs, w/ ExtraDebug enabled so it shows OTEL logs, for the full 5 minutes of waiting for the timeout: engine-hang.log

  • The span ID that seems to never end has id 2f421bb5aaf68593
    • e.g. time="2024-05-17T02:02:09Z" level=debug msg="waiting for span" activeSpans=1 draining=true id=2f421bb5aaf68593 immediate=false span="HTTP GET" trace=f3101384332f0374dd87c5270c7453bd
  • If I had to hazard a guess, maybe buildkit or another lib used by buildkit is failing to end a span in an error-case code path? The integ test in question is testing a failure case 🤷‍♂️

cc @vito


Either way, I think one outcome of all this should be to not block indefinitely waiting for OTEL to drain (though having a generous timeout makes perfect sense).
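Roughly, that means bounding the drain with a context deadline - a sketch under the assumption that the drain is exposed as a func taking a context (drain here is a placeholder, not the real API):

package telemetry

import (
	"context"
	"log"
	"time"
)

// drainWithTimeout gives telemetry a generous but finite window to flush
// instead of blocking client shutdown forever.
func drainWithTimeout(drain func(context.Context) error) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()
	if err := drain(ctx); err != nil {
		log.Printf("telemetry did not drain cleanly: %v", err)
	}
}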

Gonna try to repro the other hangs reported here. I guess it's possible they are related somehow, but the description of them so far does sound like something else entirely.


EDIT: of course, after I posted this and started reading, I saw this comment and issue #7387 (comment)

Will mention this over in that issue too.

@sipsma (Contributor) commented May 17, 2024

@gerhard the SIGQUIT stack you attached in the description works in swirly for me; the logs seem to have one stack trace, then logs from the engine running again, and then another SIGQUIT dump. If you trim it down to just the last goroutine dump, it loads quickly. I created a gist you can put as the URL in swirly here: https://gist.githubusercontent.com/sipsma/bfd34d24b3a51c64fafa0eb93abbe5a1/raw/0b3640b1034ae509bdfec9791c985b011b2e85e7/gistfile1.txt

Looking at it, the most significant thing that sticks out is that there are actual containers still running that the engine has been waiting on for 53 minutes.
[screenshot]

This suggests that there are processes (possibly function calls) that are stuck and never exit.

I am trying to repro locally now by just running the tests repeatedly, but if anyone else manages to repro it, execing into the engine container and running ps aux or similar would be helpful to see what's there.

  • This requires finding the actual engine being tested and doing something like sudo nsenter -m -p -t <the pid> ps aux (a rough version is sketched below)
  • But honestly, just ps aux on the outer host would be sufficient too; you'll just need to sift through more processes.
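A rough version of that, assuming a Docker-run dev engine (the container name is a guess - adjust to whatever your engine container is called):

# resolve the engine container's init PID on the host, then list processes in
# its mount + PID namespaces
ENGINE_PID=$(docker inspect --format '{{.State.Pid}}' dagger-engine-dev)
sudo nsenter -m -p -t "$ENGINE_PID" ps aux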

@gerhard (Member, Author) commented May 20, 2024

Thanks for helping out @sipsma!

I was able to trigger the hang again, and capture pstree -w on the Engine: test.all.race.failfast.parallel16.timeout180m.pstree-w.2024-05-18.20-31-24.txt

Here is the full CLI log:
test.all.race.failfast.parallel16.timeout180m.2024-05-18.20-31-24.txt.gz

And this is just the Engine goroutines dump:
test.all.race.failfast.parallel16.timeout180m.sigquit.2024-05-18.20-31-24.txt.gz

This shows 3754 goroutines:
[screenshot]

This is what they look like loaded in Swirly:

Screen.Recording.2024-05-20.at.13.32.46.mp4

There are a lot of grpc-related goroutines - is that normal? Anything else that looks odd?

I am going to try and use your repro to see if I can trigger this more quickly. Before, it was taking up to 50 minutes. I now have a second host which can finish the same test suite in 30 minutes. Put together, I should be able to repro faster.


This helps:

dagger-test-all() {
        local parallel="${1:-16}"
        local timeout="${2:-180m}"
        time dagger call --debug --source=".:default" test all --race=true --failfast=true --parallel=$parallel --timeout=$timeout 2>&1 | ts | tee test.all.race.failfast.parallel${parallel}.timeout${timeout}.$(date "+%Y-%m-%d.%H-%M-%S").txt
}
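For example, dagger-test-all 16 180m runs with 16-way test parallelism and a 180m timeout, and tees the timestamped output into a log file named after those settings.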

I am going to open a PR which makes those flags configurable.

@vito (Contributor) commented May 20, 2024

Noticed one issue.

tl;dr telemetry draining is keyed on trace ID, so if you open multiple clients in one trace, closing the second client will hang waiting for the first client's session span, since it's in the same trace ID. (The timeout is 5 minutes.)

Context

I'm doing a bunch of refactoring now that the Go OTel SDK has logging (only HTTP, no gRPC - that'll come up later). Along the way I noticed TestContainerLabel was taking 10+ minutes - which is especially painful because this test isn't t.Parallel(), so that 10 minutes is all spent watching this test awkwardly pass.

Looking deeper, I noticed the test does something unusual.

First, it connects like normal:

c, ctx := connect(t)

But then in two of the sub-tests it uses testutil.Query (for legacy reasons that can be fixed now):

t.Run("container labels", func(t *testing.T) {
res := struct {
Container struct {
From struct {
Labels []schema.Label
}
}
}{}
err := testutil.Query(

Internally, testutil.Query creates its own connection and calls Close() on it once it's done. Great! Except closing waits for telemetry to end, which will never end, because the other connection has a span hanging out.

So there's a fundamental problem with the telemetry pub/sub; it shouldn't be keyed by TraceID. Using the trace ID as the pub/sub topic was super convenient because that bit is always available [1], but a trace ID is not actually 1:1 with the client.

(Side issue: using the trace ID also meant there was cross-talk; a nested client would see all activity across the trace, instead of only its progress. Nothing breaks, but it's a bit weird.)
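To make the keying problem concrete, a toy sketch (not the real pub/sub) of why draining on the trace ID hangs the second client:

package main

import "fmt"

// Toy model: unfinished spans counted per drain key.
type pubsub struct {
	active map[string]int
}

func (p *pubsub) drained(key string) bool { return p.active[key] == 0 }

func main() {
	ps := &pubsub{active: map[string]int{}}
	traceID := "trace-1" // shared by both clients in the same test

	ps.active[traceID]++ // client A's session span, still open
	ps.active[traceID]++ // client B's span
	ps.active[traceID]-- // client B finishes its work and calls Close...

	// ...but draining by trace ID still sees client A's open span, so B waits
	// out the 5 minute timeout. Keying by client ID would avoid this.
	fmt.Println("drained:", ps.drained(traceID)) // drained: false
}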

Solution?

Current idea is to just do away with the pub/sub and push telemetry straight to the client over gRPC, similar to how Buildkit attachables work - just gRPC services served on the client conn. Will look into this today. I mentioned earlier that there's no OTLP gRPC client for logging, but I can keep the one we wrote. It's easy to use for the engine (e.g. stdout/stderr log forwarding); we just can't package it for the Go SDK (long story - tl;dr it's a lot of code, and our code sharing between the Go SDK and Go modules is limited to 1 package).

Footnotes

  [1] Using a span/log attribute instead would involve a custom processor or something to add it to everything.

@gerhard (Member, Author) commented May 20, 2024

Thank you @vito for joining the party! That is definitely worth trying out, glad to help test the PR 💪

Just hit this, which is related but different:

@sipsma I am able to reliably reproduce:

./hack/dev go test -v -count=1 -timeout=1h -run=TestFileServiceSync/triggers_error ./core/integration/
[screenshot]

I can see the timeout, and this would explain why some tests take a much longer time than expected to return. I will spend more time debugging here:

// Set a long timeout to give time for any cache exports to pack layers up
// which currently has to happen synchronously with the session.
proc.WaitDelay = 300 * time.Second // 5 mins
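For reference, WaitDelay comes from os/exec (Go 1.20+): after the command's Cancel function runs, Wait gives the child at most WaitDelay before force-killing it. A small sketch of that mechanism with a child that ignores the graceful signal (illustrative only, not the SDK's actual code; it uses a short delay so the demo finishes quickly):

package main

import (
	"context"
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	// Child that ignores SIGTERM, standing in for a session that won't exit.
	cmd := exec.CommandContext(ctx, "sh", "-c", `trap "" TERM; sleep 600`)
	cmd.Cancel = func() error { return cmd.Process.Signal(syscall.SIGTERM) }
	cmd.WaitDelay = 3 * time.Second // the SDK uses 300s here

	if err := cmd.Start(); err != nil {
		panic(err)
	}
	cancel() // ask the child to stop...

	start := time.Now()
	err := cmd.Wait() // ...Wait blocks for WaitDelay, then force-kills it
	fmt.Println(time.Since(start).Round(time.Second), err)
}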

Anything else that you would like me to help reproduce / share?

@jedevc (Member) commented May 20, 2024

@vito if we do away with the pubsub, does that break dagger watch? No objections, just wanted to make sure I understand ❤️

@vito (Contributor) commented May 20, 2024

@jedevc Not necessarily; we could export to both the pub/sub and the per-client telemetry pretty easily.

@gerhard (Member, Author) commented May 21, 2024

@jedevc @vito can we move the conversation regarding telemetry draining to?

We can then keep this top-level and close it when running this command no longer hangs in e.g. 20 consecutive runs.

@samalba samalba removed the kind/bug Something isn't working label May 31, 2024