Egress proxy pod loses connection with controlmap to receive netmap updates #12153
Comments
I think it's actually a timing issue that happened before
Hi @gjhenrique, thank you for the detailed issue description. So far I haven't been able to reproduce it; as you say, it seems like this happens only occasionally. I will keep trying, but if you happen to see this again, would you be able to get logs from the proxy up till the
Slightly similar: #11519. Another thing that would be good to find out, on top of logs/bug report, is how much data gets transferred via the egress proxy.
Hi @irbekrm, thanks for jumping in. I will take a bug report when I see this in the wild. I have the logs previously recorded here:
I was busy with some other topics, and couldn't reply back earlier with an interesting piece of information. When seeing the context deadline exceeded, I saw the packets from the control plane to the host going to the Doing a The
In the broken cases, it registers the packet coming from control plane as part of a NAT connection. Check the part after I have a strong suspicion that Tailscale only activates the Maybe the problem doesn't reproduce in other cases because of the order of packets (when the first packet comes from local to controlplane) or when Next steps would be:
What is the issue?
Hi,
Since we bumped Tailscale from 1.58.2, the egress proxy pods sometimes lose the connection. We receive the dreaded
map response long-poll timed out!
and the machine shows the offline status.
I could pinpoint some facts:
tcpdump
to check the second connection, and it looks like it reuses the first one. Only the client sends PUSH flags, with no response from the controlplane. Probably something weird with the FW blocking those packets.
10.122.1.124.35913 > 3.73.239.57.80: Flags [P.], seq 1:1360, ack 1, win 443, options [nop,nop,TS val 974238051 ecr 2051401776], length 1359: HTTP
10.122.1.124.35913 > 3.73.239.57.80: Flags [P.], seq 1:1360, ack 1, win 443, options [nop,nop,TS val 974238915 ecr 2051401776], length 1359: HTTP
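The two captured segments above carry the same sequence range (1:1360) and the same `ecr` value, which is what a retransmission of unacknowledged data looks like. As a rough sketch, assuming tcpdump lines in the trimmed format shown in this issue, a small Go helper could flag such repeats. The names `segmentRe` and `countRetransmits` are made up for illustration and are not part of tcpdump or Tailscale.

```go
package main

import (
	"fmt"
	"regexp"
)

// segmentRe pulls the flow ("src > dst"), the TCP flags and the seq range
// out of a tcpdump line such as the ones quoted in this issue.
var segmentRe = regexp.MustCompile(`(\S+ > \S+): Flags \[([^\]]+)\], seq (\d+:\d+)`)

// countRetransmits returns, for every "flow seq range" key, how many times
// that segment shows up beyond its first appearance.
func countRetransmits(lines []string) map[string]int {
	seen := map[string]int{}
	for _, l := range lines {
		m := segmentRe.FindStringSubmatch(l)
		if m == nil {
			continue // not a data segment in the expected format
		}
		seen[m[1]+" seq "+m[3]]++
	}
	retrans := map[string]int{}
	for key, n := range seen {
		if n > 1 {
			retrans[key] = n - 1
		}
	}
	return retrans
}

func main() {
	lines := []string{
		"10.122.1.124.35913 > 3.73.239.57.80: Flags [P.], seq 1:1360, ack 1, win 443, options [nop,nop,TS val 974238051 ecr 2051401776], length 1359: HTTP",
		"10.122.1.124.35913 > 3.73.239.57.80: Flags [P.], seq 1:1360, ack 1, win 443, options [nop,nop,TS val 974238915 ecr 2051401776], length 1359: HTTP",
	}
	for key, n := range countRetransmits(lines) {
		fmt.Printf("%s retransmitted %d time(s)\n", key, n)
	}
}
```

Repeated segments in one direction with nothing coming back would be consistent with the reply packets being dropped or misrouted before they reach the socket.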
Turning on the assumption mode:
The DNAT rule that containerboot is injecting somehow impacts the long-polling connection between the client and the controlplane server.
I couldn't find any commits that could pinpoint something breaking between 1.58.2 and 1.60.1, so I assume it's some timing issue where the ongoing connection now gets matched when Tailscale inserts that rule.
An acceptable fix would be if TS created a new connection when direct.go cancels the context. This is not the case, and the process is not "self-healing".
Apparently, the blocking happens on the client, so it's going to be a bit hard for you to investigate only on the server. I'm open for a call if you want it.
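To make the "new connection after the context is cancelled" idea concrete, here is a minimal sketch in Go. It is not Tailscale's code and does not reflect how direct.go is structured. `pollOnce`, `pollWithFreshConn` and `runDemo` are hypothetical names, and the stalled long-poll is simulated rather than sent over the network. The only point is that each attempt gets its own `http.Transport`, so a retry cannot reuse a TCP connection that has gone silent.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// pollOnce stands in for a single long-poll request (hypothetical type).
type pollOnce func(ctx context.Context, c *http.Client) error

// pollWithFreshConn retries a failed poll, giving every attempt a brand-new
// http.Transport so no pooled connection carries over between attempts.
// It returns the number of attempts made.
func pollWithFreshConn(ctx context.Context, maxAttempts int, perTry time.Duration, poll pollOnce) (int, error) {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		tr := &http.Transport{}
		client := &http.Client{Transport: tr}

		tryCtx, cancel := context.WithTimeout(ctx, perTry)
		err := poll(tryCtx, client)
		cancel()                  // aborts anything still in flight for this attempt
		tr.CloseIdleConnections() // drops connections left idle in this transport's pool

		if err == nil {
			return attempt, nil
		}
		lastErr = err
		if ctx.Err() != nil {
			return attempt, ctx.Err() // the caller gave up; stop retrying
		}
	}
	return maxAttempts, fmt.Errorf("giving up after %d attempts: %w", maxAttempts, lastErr)
}

// runDemo simulates two stalled long-polls followed by a healthy one.
func runDemo() (int, error) {
	calls := 0
	fake := func(ctx context.Context, c *http.Client) error {
		calls++
		if calls < 3 {
			<-ctx.Done() // simulate a poll that never gets a response
			return ctx.Err()
		}
		return nil
	}
	return pollWithFreshConn(context.Background(), 5, 20*time.Millisecond, fake)
}

func main() {
	attempts, err := runDemo()
	fmt.Println("attempts:", attempts, "err:", err)
}
```

In the demo the first two attempts time out and the third succeeds. Whether a fresh connection would actually recover in the scenario described here depends on how conntrack treats the new flow, which this sketch does not model.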
Steps to reproduce
Deploy an egress proxy pod with 1.60.1. It's hard to reproduce this issue, so you might need to kill the pod a few times.
Are there any recent changes that introduced the issue?
Not that I'm aware of
OS
Linux
OS version
Ubuntu 22.04
Tailscale version
1.60.2
Other software
No response
Bug report
No response