ChannelClosedException when remote end closes connection before idleTimeMs #2104

hsmade · 2018-08-16T07:15:45Z

Issue Type:

Bug report
Feature request

What happened:
When routing thrift, if the remote end closes the connection before the hostConnectionPool.idleTimeMs expires, the next call for that connection fails with a com.twitter.finagle.ChannelClosedException which is bubbled up to the calling service. On top of that, it seems that hostConnectionPool.idleTimeMs is actually between it's value and twice it's value (or even more).

What you expected to happen:
I would expect linkerd to recognise that the connection is closed, before trying to call. Also, I would expect that in this case (when it's recognised), the failure is marked as requeue-able.

I'm not sure if it's possible to (cheaply) ask the connection, that we get from the pool, if is still open before sending data. If so, we can mark it as failed if it's closed and let it requeue. If this is too costly, we should probably focus on why this fails even if the remote end timeout is set to more than twice the idleTimeMs (and of course document the fact that this value might be doubled as per finagles documentation).

How to reproduce it (as minimally and precisely as possible):
Setup a remote service that uses apache's thrift-0.9 and set the clientTimeout to 6 seconds. Setup a router to that service with the idleTimeMs set to 5 seconds. Start tcpdump and do a single call. Notice that linkerd does not close the connection, the remote end does. Do another call and notice the exception.
Next set the clientTimeout to 12 seconds, do some calls, pause for 12 seconds, do some other. Notice that the exception still happens (albeit less often).

Anything else we need to know?:
Here's the inline doc from finagle about the cache ttl (which is in fact the idleTimeMs): https://github.com/twitter/finagle/blob/finagle-7.1.0/finagle-core/src/main/scala/com/twitter/finagle/util/Cache.scala#L16

Environment:

linkerd version: 1.3.5
thrift: apache thrift 0.9

The text was updated successfully, but these errors were encountered:

dadjeibaah · 2018-08-20T17:25:47Z

Thanks for filing this issue @hsmade and thanks for the detail repro steps! We will take a look.

hsmade · 2018-08-20T17:34:22Z

Np, you can ping me on slack if you need more info. Thx, Wim Fournier

…

On Mon, Aug 20, 2018, 19:26 Dennis Adjei-Baah ***@***.***> wrote: Thanks for filing this issue @hsmade <https://github.com/hsmade> and thanks for the detail repro steps! We will take a look. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#2104 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABFW9bCAix7-sR22O3LDIV4TlqalaDxeks5uSvEogaJpZM4V_Uoh> .

adleong · 2018-08-21T22:06:36Z

Thanks for this report @hsmade. That is a surprising behavior. I'd recommend taking a look in WatermarkPool to see if something funny is going on there.

The way that I would expect this to work is that when Linkerd issues the request, it should see that the WatermarkPool connection queue doesn't have any Services in it that are not in the Closed state. Therefore, it should attempt to create a new connection. You can attach a debugger and see if the service's in WatermarkPool's queue are closed (as they should be, if the connection has been closed by the remote).

Hope this is helpful. I'm happy to chat about this more if you get stuck.

adleong · 2018-09-10T22:20:07Z

@hsmade any luck investigating this?

hsmade · 2018-09-11T07:09:39Z

Sorry, haven't had the time yet. It's on my todo list though.

…

On Tue, Sep 11, 2018, 00:20 Alex Leong ***@***.***> wrote: @hsmade <https://github.com/hsmade> any luck investigating this? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#2104 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABFW9QPdn8nkvQ1eKQKfELmI1pEwMcMsks5uZuWegaJpZM4V_Uoh> .

adleong self-assigned this Aug 21, 2018

adleong added this to Triage in Linkerd 1.x Backlog. See https://github.com/linkerd/linkerd2/blob/main/ROADMAP.md Aug 21, 2018

adleong moved this from Needs investigation to Low Priority (small) in Linkerd 1.x Backlog. See https://github.com/linkerd/linkerd2/blob/main/ROADMAP.md Oct 29, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ChannelClosedException when remote end closes connection before idleTimeMs #2104

ChannelClosedException when remote end closes connection before idleTimeMs #2104

hsmade commented Aug 16, 2018

dadjeibaah commented Aug 20, 2018

hsmade commented Aug 20, 2018 via email

adleong commented Aug 21, 2018

adleong commented Sep 10, 2018

hsmade commented Sep 11, 2018 via email

ChannelClosedException when remote end closes connection before idleTimeMs #2104

ChannelClosedException when remote end closes connection before idleTimeMs #2104

Comments

hsmade commented Aug 16, 2018

dadjeibaah commented Aug 20, 2018

hsmade commented Aug 20, 2018 via email

adleong commented Aug 21, 2018

adleong commented Sep 10, 2018

hsmade commented Sep 11, 2018 via email