Multiple connections hang at times #146

Open
danielschonfeld opened this issue Apr 18, 2023 · 4 comments
Labels
bug/possible A possible bug that has not yet been confirmed

Comments

@danielschonfeld

Package version

1.0.20220627

Firmware version

2.0.9-hotfix.4

Device

EdgeRouter Lite / PoE - e100

Issue description

I have two connections going to a host. Occasionally my connection to that host stops working. With tcpdump I can see the packets leaving eth0 on that host, outbound towards my machine, but I never receive them. Normally you'd think the problem is with my machine, but when I delete the WireGuard interfaces and use a new port on the remote machine it starts working again; then I delete the interfaces again, reload my old configuration, and everything works as it should.
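(For reference, a hedged sketch of the kind of capture used to confirm the encrypted packets are actually leaving the remote host; eth0 comes from the description above, while the address and port are placeholders, not the real values.)

```sh
# On the remote host: watch encrypted WireGuard UDP packets leaving the physical
# interface towards my machine. 203.0.113.10 and 56018 are placeholders.
tcpdump -ni eth0 udp and host 203.0.113.10 and port 56018
```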

Also of interest are the following connections shared by these two machines:

My machine = A
Remote Machine = B
Some other machine not mentioned above = C

WireGuard tunnels set up:
A->B
A->C

B->A
B->C

B is the problematic machine with the peculiar behavior described above. I am not sure whether the fact that A and B both share a tunnel to C plays a role here, but it is the only distinguishing feature of this setup: my other tunnels from A, with similar endpoint equipment on the remote side, do not exhibit this problem.

Configuration and log output

No response

danielschonfeld added the bug/possible label on Apr 18, 2023
@dulitz

dulitz commented Apr 19, 2023

While this isn't an exact match for your fact pattern, often when you see packets leaving one machine and not arriving at another, the issue is that some link on the path has a low MTU and fragmentation is not happening.

If changing the port number fixes the issue, that could suggest that multiple routing paths are in use and only one path has the low MTU. The part that isn't a good match for your facts is that switching back to the old port doesn't cause the problem to recur.

Can you make the problem go away by lowering the MTU of the wireguard interface on B? Does ping of large packets (equal to the size of your current MTU) from B to A or C reliably get responses?
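A minimal sketch of those two checks, assuming standard Linux tooling, an interface named wg0, a trial MTU of 1380, and a placeholder address for A; on EdgeOS the interface MTU would be changed through the config tree rather than with ip link:

```sh
# Lower the MTU of the WireGuard interface on B (trial value; assumption).
ip link set dev wg0 mtu 1380

# Send large don't-fragment pings across the underlay from B to A's public
# address (203.0.113.10 is a placeholder). 1472 bytes of ICMP payload fills a
# 1500-byte MTU once the 28 bytes of IP/ICMP headers are added.
ping -M do -s 1472 -c 5 203.0.113.10
```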

@danielschonfeld
Author

> Can you make the problem go away by lowering the MTU of the wireguard interface on B? Does ping of large packets (equal to the size of your current MTU) from B to A or C reliably get responses?

Unfortunately, the problem only starts after a long while of operating fine, so it'll be hard for me to test right away. I will try next time it happens. I can tell you, though, that when it happens pings don't normally work, since the handshake doesn't seem to occur.

@dulitz

dulitz commented Apr 20, 2023

I see. When I mentioned pings, I meant pings to the tunnel endpoint (in the "underlay network"), not pings inside the tunnel.

If you are diagnosing a potential path MTU issue -- and I don't know that's what this is, but I'm suspicious -- you should characterize the path when it's working and then again when it's not, and look for differences. So do a traceroute (outside the tunnel) to show the path in the underlay network that the encrypted packets traverse. Use ping -s to find the largest packet that will pass, and record this info. Then, when it's not working, run the traceroute and the ping -s again, and see whether you get the same path and the same ping size, or not.
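A hedged sketch of that baseline, run from B against A's public underlay address (placeholder below); the size loop is just one simple way to find the largest don't-fragment ping that gets through:

```sh
PEER=203.0.113.10   # placeholder for the peer's public (underlay) address

# Record the underlay path the encrypted packets traverse.
traceroute "$PEER" | tee path-baseline.txt

# Walk payload sizes downward until a don't-fragment ping gets a reply; the first
# passing size approximates the path MTU minus 28 bytes of IP/ICMP headers.
for size in 1472 1452 1432 1412 1392 1372; do
    if ping -M do -s "$size" -c 2 -W 2 "$PEER" >/dev/null 2>&1; then
        echo "largest passing payload: $size bytes"
        break
    fi
done
```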

Good luck.

@danielschonfeld
Author

I have made some progress in gaining insight into this problem. I still don't fully grasp it, but it's not an MTU issue.

It appears that when conntrack opens a translation on the same ports as the listening ports used on both ends, the problem manifests itself.

Concrete example:

Machine A listens on 56018. It is some Linux distro, set to persist the connection, and it hits the endpoint Machine-B:56019.
Machine B listens on 56019. It is a UBNT EdgeOS device, set to persist the connection, and it hits the endpoint Machine-A:56018.
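For concreteness, a rough sketch of how the Machine A side described above could be brought up with standard wg/iproute2 commands; the key, tunnel addresses, hostname, and 25-second keepalive are assumptions, and only the listen ports and endpoint ports come from the description:

```sh
# Machine A (Linux). $B_PUBKEY, the key file, addresses, and hostname are placeholders.
ip link add dev wg0 type wireguard
wg set wg0 listen-port 56018 private-key /etc/wireguard/a.key
wg set wg0 peer "$B_PUBKEY" endpoint machine-b.example:56019 \
    allowed-ips 10.0.0.2/32 persistent-keepalive 25
ip addr add 10.0.0.1/24 dev wg0
ip link set wg0 up

# Machine B (EdgeOS) mirrors this through its config tree: listen port 56019,
# endpoint Machine-A:56018, and persistent keepalive enabled.
```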

If Machine B's conntrack opened the following translation, everything works fine:
udp src=Machine-B dst=Machine-A sport=56019 dport=56018 src=Machine-A dst=Machine-B sport=56019 dport=(random port)

If Machine B's conntrack opened the following translation, it hangs every now and then, and once it hangs it does not recover:
udp src=Machine-B dst=Machine-A sport=56019 dport=56018 src=Machine-A dst=Machine-B sport=56019 dport=56018
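A hedged sketch of how to inspect, and as a workaround flush, the offending entry on Machine B; the grep form is the conservative way to filter, since the exact conntrack filter flags available depend on the conntrack-tools build:

```sh
# List UDP conntrack entries involving the WireGuard ports from the example above.
conntrack -L -p udp 2>/dev/null | grep -E '56018|56019'

# Possible workaround when the tunnel hangs: delete the stale translation so a new
# one is created on the next handshake (assumes --dport filtering is supported).
conntrack -D -p udp --dport 56018
```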
