Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ChannelClosedException when remote end closes connection before idleTimeMs #2104

Open
1 of 2 tasks
hsmade opened this issue Aug 16, 2018 · 5 comments
Open
1 of 2 tasks

Comments

@hsmade
Copy link

hsmade commented Aug 16, 2018

Issue Type:

  • Bug report
  • Feature request

What happened:
When routing thrift, if the remote end closes the connection before the hostConnectionPool.idleTimeMs expires, the next call for that connection fails with a com.twitter.finagle.ChannelClosedException which is bubbled up to the calling service. On top of that, it seems that hostConnectionPool.idleTimeMs is actually between it's value and twice it's value (or even more).

What you expected to happen:
I would expect linkerd to recognise that the connection is closed, before trying to call. Also, I would expect that in this case (when it's recognised), the failure is marked as requeue-able.

I'm not sure if it's possible to (cheaply) ask the connection, that we get from the pool, if is still open before sending data. If so, we can mark it as failed if it's closed and let it requeue. If this is too costly, we should probably focus on why this fails even if the remote end timeout is set to more than twice the idleTimeMs (and of course document the fact that this value might be doubled as per finagles documentation).

How to reproduce it (as minimally and precisely as possible):
Setup a remote service that uses apache's thrift-0.9 and set the clientTimeout to 6 seconds. Setup a router to that service with the idleTimeMs set to 5 seconds. Start tcpdump and do a single call. Notice that linkerd does not close the connection, the remote end does. Do another call and notice the exception.
Next set the clientTimeout to 12 seconds, do some calls, pause for 12 seconds, do some other. Notice that the exception still happens (albeit less often).

Anything else we need to know?:
Here's the inline doc from finagle about the cache ttl (which is in fact the idleTimeMs): https://github.com/twitter/finagle/blob/finagle-7.1.0/finagle-core/src/main/scala/com/twitter/finagle/util/Cache.scala#L16

Environment:

  • linkerd version: 1.3.5
  • thrift: apache thrift 0.9
@dadjeibaah
Copy link
Contributor

Thanks for filing this issue @hsmade and thanks for the detail repro steps! We will take a look.

@hsmade
Copy link
Author

hsmade commented Aug 20, 2018 via email

@adleong
Copy link
Member

adleong commented Aug 21, 2018

Thanks for this report @hsmade. That is a surprising behavior. I'd recommend taking a look in WatermarkPool to see if something funny is going on there.

The way that I would expect this to work is that when Linkerd issues the request, it should see that the WatermarkPool connection queue doesn't have any Services in it that are not in the Closed state. Therefore, it should attempt to create a new connection. You can attach a debugger and see if the service's in WatermarkPool's queue are closed (as they should be, if the connection has been closed by the remote).

Hope this is helpful. I'm happy to chat about this more if you get stuck.

@adleong
Copy link
Member

adleong commented Sep 10, 2018

@hsmade any luck investigating this?

@hsmade
Copy link
Author

hsmade commented Sep 11, 2018 via email

@adleong adleong moved this from Needs investigation to Low Priority (small) in Linkerd 1.x Backlog. See https://github.com/linkerd/linkerd2/blob/main/ROADMAP.md Oct 29, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

No branches or pull requests

3 participants