
🐞 Dagger CLI v0.11.4 deadlocks when running dagger call --source=.:default test all #7387

Open · 3 of 4 tasks
gerhard opened this issue May 15, 2024 · 18 comments

@gerhard (Member) commented May 15, 2024

This was discovered as part of:

Somewhat similar to this, but different, since in this case the host has 16 CPUs and no OCI max parallelism set:

What is the issue?

When running dagger call --source=.:default test all locally, the CLI remains stuck and needs to be killed manually, as mentioned by @jpadams in a private thread. This happens in approximately 1 in every 10 runs.

When a timeout is used - it defaults to 30m in our case - this is what the system looks like for that entire period (I have it set to 60m in my local runs):
[screenshot]

This was observed with the standard v0.11.4 Engine running in Docker:

/ # ps aux | grep '[e]ngine'
    1 root      9:20 /usr/local/bin/dagger-engine --config /etc/dagger/engine.toml --debug
751327 root      4:07 /usr/local/bin/dagger-engine --config /etc/dagger/engine.toml --network-cidr=10.88.0.0/16 --network-name=dagger-dev

/ # cat /etc/dagger/engine.toml

debug = true
trace = false
insecure-entitlements = ["security.insecure"]

Notice that there is:

  • no --oci-max-parallelism set
  • no custom GC config

When we hit this deadlock, sending a SIGQUIT to the CLI returns the following: dagger-cli-v0.11.4-sigquit.txt. To visualise it, https://vito.github.io/swirly/ is most useful - thanks @vito 💪

Highlighting a couple of goroutines that caught my attention:

[screenshot]
[screenshot]
[screenshot]
👆 there are a lot more channels here
[screenshot]
👆 there are a lot more channels here

Tip

To visualize all the goroutines, load dagger-cli-v0.11.4-sigquit.txt into https://vito.github.io/swirly/
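For anyone trying to capture the same kind of dump, a rough sketch (the pgrep pattern and the stderr redirection are assumptions about how the CLI was invoked):

# run the command with stderr captured, then from another terminal ask the Go
# runtime for a goroutine dump once it hangs
dagger call --source=.:default test all 2> dagger-cli-sigquit.txt
kill -QUIT "$(pgrep -f 'dagger call --source=.:default test all')"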

I was able to SIGQUIT the Engine when it deadlocked: dagger-engine-v0.11.4-sigquit.txt.gz

There were 4202 goroutines, which is a bit too many for https://vito.github.io/swirly/ to handle, but it did load part of them. These looked interesting:

[screenshot]

👆 how many can sem.Acquire() at the same time?
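For context, a toy sketch (not the engine's actual code, and assuming these are golang.org/x/sync/semaphore weighted semaphores) of how Acquire caps concurrency - everything beyond the configured weight parks inside sem.Acquire, which is what these stacks look like:

package main

import (
	"context"
	"fmt"
	"sync"

	"golang.org/x/sync/semaphore"
)

func main() {
	// Only 2 goroutines may hold the semaphore at once; the other 8 block
	// inside sem.Acquire until a slot is released.
	sem := semaphore.NewWeighted(2)
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			if err := sem.Acquire(context.Background(), 1); err != nil {
				return
			}
			defer sem.Release(1)
			fmt.Println("working", i)
		}(i)
	}
	wg.Wait()
}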

[screenshot]

👆 All GC workers were idle, which makes me believe that this is not GC related.

Do you see anything else out of place?

Questions

Does anything look suspicious to you @sipsma @vito @jedevc @aluzzardi ?

Anything else that we could add to make debugging easier?

Next steps


FWIW, in the last 7 days, we have hit this issue 9% of the time across all our CI runs cc @samalba

@gerhard gerhard added the kind/bug Something isn't working label May 15, 2024
@jedevc (Member) commented May 15, 2024

Just linking to another client deadlock here: #7048. It might not be the same, but TL;DR - if the span never completes, then the client will deadlock.
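A minimal sketch of that failure shape (not the actual client code): if Close waits on a signal that only fires when the span ends, and the span never ends, Close never returns:

package main

// Toy model only: spanDone stands in for "the span has ended".
type client struct {
	spanDone chan struct{}
}

// Close blocks until the span ends - with no timeout, a span that never ends
// means Close never returns.
func (c *client) Close() { <-c.spanDone }

func main() {
	c := &client{spanDone: make(chan struct{})}
	// Nothing ever closes spanDone here, so this blocks forever; the Go
	// runtime reports "all goroutines are asleep - deadlock!".
	c.Close()
}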

@jedevc (Member) commented May 15, 2024

You don't happen to have a Dagger Cloud run or an engine SIGQUIT result, do you? Without those, it's difficult to see what the engine is trying to do that's taking so long, or whether there's a runaway span somewhere.

Glancing through the stack trace, there's one specific goroutine that stands out: goroutine 1 is stuck executing a query - it looks like everything else is just progress waiting for it to terminate (there's also a goroutine leak for the streaming output, which should be fixed as part of #7272 when it gets removed entirely).

@matipan (Contributor) commented May 15, 2024

Today I was running testdev on our CI and it looks like it got stuck (possibly a deadlock?) for about 40 minutes before finally exiting:

[screenshot]

Checking the runner pod, I could see the process was still up and the dagger engines (within the runner container) were up too:
[screenshot]

[screenshot]

Not sure if it's useful, but here are the logs of the engine on the node:

[screenshot]

I don't have a link to dagger cloud, can one be dug up from the logs? The run was this one: https://github.com/dagger/dagger/actions/runs/9084221778/job/25002579112. Raw logs: https://pipelinesghubeus2.actions.githubusercontent.com/QzgPDIDyToE3J2EXFsF53FDrKXVmSNMaCDdKFJvQQISWafQ8Xj/_apis/pipelines/1/runs/126051/signedlogcontent/22?urlExpires=2024-05-15T17%3A48%3A16.9360768Z&urlSigningMethod=HMACV1&urlSignature=Unk2A%2F5EyznUgRLhgdfM39O0F8C1mzMXqHtg5vlf4Z4%3D

@aluzzardi (Member) commented:

I'm able to make the engine misbehave by doing some stress testing:

Error: response from query: returned error 502 Bad Gateway: http do: Post "http://dagger/query": rpc error: code = Unavailable desc = error reading from server: command [docker exec -i dagger-engine-63121e921843d412 buildctl dial-stdio] has exited with exit status 137, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=

code:

package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

type Repro struct{}

// Run spawns 10,000 containers in parallel, each running two execs
func (m *Repro) Run(ctx context.Context) string {
	base := dag.Container().From("alpine:latest")

	eg := errgroup.Group{}
	for i := 0; i < 10000; i++ {
		i := i
		eg.Go(func() error {
			_, err := base.
				WithExec([]string{"echo", fmt.Sprintf("hello %d", i)}).
				WithExec([]string{"sleep", "5"}).
				Sync(ctx)
			return err
		})
	}

	if err := eg.Wait(); err != nil {
		panic(err)
	}

	return "done"
}
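Assuming the module above is installed as something like repro, dagger call run should kick it off (the exact invocation depends on how the module is set up); expect it to hammer the engine with 10k container spawns at once.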

@gerhard (Member, Author) commented May 16, 2024

@jedevc Dagger Cloud trace + GitHub Actions job from last time this deadlock happened in our CI:

I am trying to reproduce locally so that I can get that goroutine dump from the Engine.

@jedevc (Member) commented May 16, 2024

@aluzzardi I think that's definitely a legitimate failure, but I think this is likely different; that error message looks quite different from what we see in https://dagger.cloud/dagger/traces/3add32bb1a460bde62b9ccbbfd25d0ca?span=4e373cd85ec633b2&logs#4e373cd85ec633b2:L1542 (love the permalinking for logs in Dagger Cloud).


A couple notes on the linked trace:

  • Failed to connect; retrying... make request: returned error 500 Internal Server Error: client call for vkfcdrid1evyi1rddug25vrgg not found is weird. I haven't seen this before.
  • Interesting that a lot of the failed tests are typescript. A few things could be causing this:

Debugging this isn't much fun either. For the next release, I think we should make sure to get:

@gerhard (Member, Author) commented May 16, 2024

Starting with @aluzzardi's repro, I spawned 10k x sleep 1 in parallel. While the behaviour is completely different from what we have been hunting here, it's even worse:

  1. There is a 30% spike in CPU usage
  2. Not much else happens apart from CPU iowait staying at 6%
  3. ^C the dagger client
  4. There is a smaller CPU usage spike - 6%
  5. CPU iowait remains at the same 6%
  6. CPU iowait spikes to 11%, drops back to 6%, then spikes again to 11%
  7. CPU spikes to 60% - stuff is happening now! BTW, I exited the dagger client a good 5 minutes ago
  8. Engine continues with the high activity

I didn't have the patience to wait for the rest. This is what the above looks like in one picture:
[screenshot]

This looks like a "thundering herd" problem to me. Without being able to look inside BuildKit, I cannot tell which component is stressed. My guess is that there is some sort of worker pool that gets overwhelmed by the 10k containers that need to get spawned. Looking at the disk activity, my assumption is that there is a dedicated goroutine allocating an fs layer for each container, and that it gets overwhelmed. Maybe @sipsma or @jedevc know more. I am going to see if I can attach some sort of continuous profiler and understand which function blocks the most. cc @thorfour @brancz


By the time I finished typing this up, things were back to normal:
[screenshot]

@sipsma (Contributor) commented May 16, 2024

My first guess (truly just a hunch though) at the error hit in @aluzzardi's example is that the system is running out of memory and processes are getting OOM killed. This is just based on the repro spinning up thousands of containers and the 137 exit code. If that is what's happening, starting dmesg -w in one terminal and then running the repro in another should show kernel logs about the OOM killer in the dmesg output.
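Something like this should catch it (a sketch; the exact kernel message wording varies between kernels):

# in one terminal, follow kernel messages while the repro runs in another
sudo dmesg -wT | grep -iE 'out of memory|oom[-_ ]?kill'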

@gerhard (Member, Author) commented May 16, 2024

When I tried to reproduce this, @sipsma, I didn't get any OOMs, just a very slow start and slow progress. Most of it happened after the dagger CLI exited. Happy to pair up synchronously if that will help us make progress with this.

@sipsma (Contributor) commented May 17, 2024

> When I tried to reproduce this, @sipsma, I didn't get any OOMs, just a very slow start and slow progress. Most of it happened after the dagger CLI exited. Happy to pair up synchronously if that will help us make progress with this.

@gerhard if you didn't see the same error as Andrea about exit code 137 then yeah it's probably not getting OOM killed (which is gonna vary host to host of course). That's just my guess about what happened for him.

I'm gonna start taking a look into this now. Yesterday I independently hit an integ test case that hung for 5 minutes while waiting for OTEL to drain (and the 5 minute timeout was just the timeout of SDKs waiting for the dagger session subcommand to exit after Close is called; otherwise it would have hung forever, afaict). Will make sure it still repros on main.

No clue if it's related to everything else here though, will try to see if I can repro the other behavior here.

@sipsma (Contributor) commented May 17, 2024

Okay yeah the 5 minute hang on OTEL draining repros consistently on main for me:

./hack/dev go test -v -count=1 -timeout=1h -run=TestFileServiceSync/triggers_error ./core/integration/

Funnily enough, if you SIGQUIT or otherwise kill the dagger session subprocess at any point after the hang starts but before the 5 minute timeout is hit, the test still passes 😄 This is because we aren't checking the error returned by Close in that test case (or almost any test case, for that matter - it's not something we previously thought we needed to assert on).
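A minimal sketch of what asserting on it could look like (illustrative only, assuming testify-style require and the connect helper the integ tests already use):

package integration

import (
	"testing"

	"github.com/stretchr/testify/require"
)

func TestSomething(t *testing.T) {
	c, ctx := connect(t) // existing test helper
	_ = ctx

	t.Cleanup(func() {
		// A hang or error during Close now fails the test loudly instead of
		// being silently ignored.
		require.NoError(t, c.Close())
	})

	// ... test body ...
}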

SIGQUIT on the dagger session subprocess of the Go SDK client used in the tests shows it stuck on these lines:

  1. https://github.com/sipsma/dagger/blob/3aeea0ea8940db6ea415cf4740c5289a7331a923/engine/client/client.go#L491-L491
  2. https://github.com/sipsma/dagger/blob/3aeea0ea8940db6ea415cf4740c5289a7331a923/engine/client/client.go#L535-L535

Attached the engine container logs, w/ ExtraDebug enabled so it shows OTEL logs, for the full 5 minutes of waiting for the timeout: engine-hang.log

  • The span ID that seems to never end has id 2f421bb5aaf68593
    • e.g. time="2024-05-17T02:02:09Z" level=debug msg="waiting for span" activeSpans=1 draining=true id=2f421bb5aaf68593 immediate=false span="HTTP GET" trace=f3101384332f0374dd87c5270c7453bd
  • If I had to hazard a guess, maybe buildkit or another lib used by buildkit is failing to end a span in an error-case code path? The integ test in question is testing a failure case 🤷‍♂️

cc @vito


Either way, I think one outcome of all this should be to not block indefinitely waiting for OTEL to drain (though having a generous timeout makes perfect sense).
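Roughly, that means bounding the drain with a context deadline - a sketch under the assumption that the drain is exposed as a func taking a context (drain here is a placeholder, not the real API):

package telemetry

import (
	"context"
	"log"
	"time"
)

// drainWithTimeout gives telemetry a generous but finite window to flush
// instead of blocking client shutdown forever.
func drainWithTimeout(drain func(context.Context) error) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()
	if err := drain(ctx); err != nil {
		log.Printf("telemetry did not drain cleanly: %v", err)
	}
}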

Gonna try to repro the other hangs reported here. I guess it's possible they are related somehow, but the description of them so far does sound like something else entirely.


EDIT: of course, after I posted this and started reading, I saw this comment and issue #7387 (comment)

Will mention this over in that issue too.

@sipsma (Contributor) commented May 17, 2024

@gerhard the SIGQUIT stack you attached in the description works in swirly for me; the logs seem to have one stack trace, then logs from the engine running again, and then another SIGQUIT dump. If you trim it down to just the last goroutine dump, it loads quickly. I created a gist you can put as the URL in swirly here: https://gist.githubusercontent.com/sipsma/bfd34d24b3a51c64fafa0eb93abbe5a1/raw/0b3640b1034ae509bdfec9791c985b011b2e85e7/gistfile1.txt

Looking at it, the most significant thing that sticks out is that there are actual containers still running that the engine has been waiting on for 53 minutes.
[screenshot]

This suggests that there are processes (possibly function calls) that are stuck and never exit.

I am trying to repro locally now by just running the tests repeatedly, but if anyone else manages to repro it, execing into the engine container and running ps aux or similar would be helpful to see what's there.

  • This requires finding the actual engine being tested and doing something like sudo nsenter -m -p -t <the pid> ps aux (a rough version is sketched below)
  • But honestly, just ps aux on the outer host would be sufficient too; you'll just need to sift through more processes.
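A rough version of that, assuming a Docker-run dev engine (the container name is a guess - adjust to whatever your engine container is called):

# resolve the engine container's init PID on the host, then list processes in
# its mount + PID namespaces
ENGINE_PID=$(docker inspect --format '{{.State.Pid}}' dagger-engine-dev)
sudo nsenter -m -p -t "$ENGINE_PID" ps aux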

@gerhard (Member, Author) commented May 20, 2024

Thanks for helping out @sipsma!

I was able to trigger the hang again, and capture pstree -w on the Engine: test.all.race.failfast.parallel16.timeout180m.pstree-w.2024-05-18.20-31-24.txt

Here is the full CLI log:
test.all.race.failfast.parallel16.timeout180m.2024-05-18.20-31-24.txt.gz

And this is just the Engine goroutines dump:
test.all.race.failfast.parallel16.timeout180m.sigquit.2024-05-18.20-31-24.txt.gz

This shows 3754 goroutines:
[screenshot]

This is what they look like loaded in Swirly:

Screen.Recording.2024-05-20.at.13.32.46.mp4

There are a lot of grpc-related goroutines - is that normal? Anything else that looks odd?

I am going to try and use your repro to see if I can trigger this more quickly. Before, it was taking up to 50 minutes. I now have a second host which can finish the same test suite in 30 minutes. Put together, I should be able to repro faster.


This helps:

dagger-test-all() {
        local parallel="${1:-16}"
        local timeout="${2:-180m}"
        time dagger call --debug --source=".:default" test all --race=true --failfast=true --parallel=$parallel --timeout=$timeout 2>&1 | ts | tee test.all.race.failfast.parallel${parallel}.timeout${timeout}.$(date "+%Y-%m-%d.%H-%M-%S").txt
}
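For example, dagger-test-all 16 180m runs with 16-way test parallelism and a 180m timeout, and tees the timestamped output into a log file named after those settings.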

I am going to open a PR which makes those flags configurable.

@vito (Contributor) commented May 20, 2024

Noticed one issue.

tl;dr telemetry draining is keyed on trace ID, so if you open multiple clients in one trace, closing the second client will hang waiting for the first client's session span, since it's in the same trace ID. (The timeout is 5 minutes.)

Context

I'm doing a bunch of refactoring now that the Go OTel SDK has logging (only HTTP, no gRPC - that'll come up later). Along the way I noticed TestContainerLabel was taking 10+ minutes - which is especially painful because this test isn't t.Parallel(), so that 10 minutes is all spent watching this test awkwardly pass.

Looking deeper, I noticed the test does something unusual.

First, it connects like normal:

c, ctx := connect(t)

But then in two of the sub-tests it uses testutil.Query (for legacy reasons that can be fixed now):

t.Run("container labels", func(t *testing.T) {
res := struct {
Container struct {
From struct {
Labels []schema.Label
}
}
}{}
err := testutil.Query(

Internally, testutil.Query creates its own connection and calls Close() on it once it's done. Great! Except closing waits for telemetry to end, which will never end, because the other connection has a span hanging out.

So there's a fundamental problem with the telemetry pub/sub; it shouldn't be keyed by TraceID. Using the trace ID as the pub/sub topic was super convenient because that bit is always available [1], but a trace ID is not actually 1:1 with the client.

(Side issue: using the trace ID also meant there was cross-talk; a nested client would see all activity across the trace, instead of only its progress. Nothing breaks, but it's a bit weird.)
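To make the keying problem concrete, a toy sketch (not the real pub/sub) of why draining on the trace ID hangs the second client:

package main

import "fmt"

// Toy model: unfinished spans counted per drain key.
type pubsub struct {
	active map[string]int
}

func (p *pubsub) drained(key string) bool { return p.active[key] == 0 }

func main() {
	ps := &pubsub{active: map[string]int{}}
	traceID := "trace-1" // shared by both clients in the same test

	ps.active[traceID]++ // client A's session span, still open
	ps.active[traceID]++ // client B's span
	ps.active[traceID]-- // client B finishes its work and calls Close...

	// ...but draining by trace ID still sees client A's open span, so B waits
	// out the 5 minute timeout. Keying by client ID would avoid this.
	fmt.Println("drained:", ps.drained(traceID)) // drained: false
}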

Solution?

Current idea is to just do away with the pub/sub and push telemetry straight to the client over gRPC, similar to how Buildkit attachables work - just gRPC services served on the client conn. Will look into this today. I mentioned earlier that there's no OTLP gRPC client for logging, but I can keep the one we wrote. It's easy to use for the engine (e.g. stdout/stderr log forwarding); we just can't package it for the Go SDK (long story - tl;dr it's a lot of code, and our code sharing between the Go SDK and Go modules is limited to 1 package).

Footnotes

  [1] Using a span/log attribute instead would involve a custom processor or something to add it to everything.

@gerhard (Member, Author) commented May 20, 2024

Thank you @vito for joining the party! That is definitely worth trying out, glad to help test the PR 💪

Just hit this, which is related but different:

@sipsma I am able to reliably reproduce:

./hack/dev go test -v -count=1 -timeout=1h -run=TestFileServiceSync/triggers_error ./core/integration/
[screenshot]

I can see the timeout, and this would explain why some tests take a much longer time than expected to return. I will spend more time debugging here:

// Set a long timeout to give time for any cache exports to pack layers up
// which currently has to happen synchronously with the session.
proc.WaitDelay = 300 * time.Second // 5 mins
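For reference, WaitDelay comes from os/exec (Go 1.20+): after the command's Cancel function runs, Wait gives the child at most WaitDelay before force-killing it. A small sketch of that mechanism with a child that ignores the graceful signal (illustrative only, not the SDK's actual code; it uses a short delay so the demo finishes quickly):

package main

import (
	"context"
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	// Child that ignores SIGTERM, standing in for a session that won't exit.
	cmd := exec.CommandContext(ctx, "sh", "-c", `trap "" TERM; sleep 600`)
	cmd.Cancel = func() error { return cmd.Process.Signal(syscall.SIGTERM) }
	cmd.WaitDelay = 3 * time.Second // the SDK uses 300s here

	if err := cmd.Start(); err != nil {
		panic(err)
	}
	cancel() // ask the child to stop...

	start := time.Now()
	err := cmd.Wait() // ...Wait blocks for WaitDelay, then force-kills it
	fmt.Println(time.Since(start).Round(time.Second), err)
}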

Anything else that you would like me to help reproduce / share?

@jedevc (Member) commented May 20, 2024

@vito if we do away with the pubsub, does that break dagger watch? No objections, just wanted to make sure I understand ❤️

@vito (Contributor) commented May 20, 2024

@jedevc Not necessarily; we could export to both the pub/sub and the per-client telemetry pretty easily.

@gerhard (Member, Author) commented May 21, 2024

@jedevc @vito can we move the conversation regarding telemetry draining to?

We can then keep this top-level and close it when running this command no longer hangs in e.g. 20 consecutive runs.

@samalba samalba removed the kind/bug Something isn't working label May 31, 2024