Add sufficient deadlines and countermeasures to handle hung node scenario #19688

Open
wants to merge 1 commit into
base: master

shtripat
Contributor

@shtripat shtripat commented May 7, 2024

Community Contribution License

All community contributions in this pull request are licensed to the project maintainers
under the terms of the Apache 2 license.
By creating this pull request I represent that I have the right to license the
contributions to the project maintainers under the Apache 2 license.

Description

Add sufficient deadlines and countermeasures to handle the hung node scenario

Motivation and Context

This PR addresses the hung node scenario by adding sufficient
deadlines and countermeasures for such an eventuality.

How to test this PR?

This PR already adds the relevant tests; they can be reproduced
locally as needed by following the mint automation.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Optimization (provides speedup with no functional changes)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • Fixes a regression (If yes, please add commit-id or PR # here)
  • Unit tests added/updated
  • Internal documentation updated
  • Create a documentation update request here

@harshavardhana force-pushed the pause-container-test branch 4 times, most recently from 118c7b2 to 3d473d9 (May 14, 2024 22:20)
@harshavardhana force-pushed the pause-container-test branch 3 times, most recently from 01aab05 to 23766c5 (May 17, 2024 10:49)
Comment on lines -69 to +71
peersLogOnceIf(context.Background(), err, nodeName)
if xnet.IsNetworkOrHostDown(err, false) {
	network[nodeName] = string(madmin.ItemOffline)
} else if xnet.IsNetworkOrHostDown(err, true) {
	network[nodeName] = "connection attempt timedout"
}
Member

To differentiate between a node that is actually offline and one whose connection attempt timed out.
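
As a rough illustration of the distinction this comment draws, here is a minimal sketch that separates a timed-out connection attempt from a plain connection failure using only the Go standard library. It does not use `xnet.IsNetworkOrHostDown` and makes no claim about that function's rules; `classify`, the status constants, and the two sample errors are hypothetical names introduced for this example.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
)

// Hypothetical status labels for this sketch only.
const (
	statusOnline   = "online"
	statusOffline  = "offline"
	statusTimedOut = "connection attempt timed out"
	statusUnknown  = "unknown"
)

// classify maps an error from a peer call to a status label.
// Network errors that report Timeout() are kept apart from other
// network errors; anything that is not a net.Error is left unclassified.
func classify(err error) string {
	if err == nil {
		return statusOnline
	}
	var nerr net.Error
	if errors.As(err, &nerr) {
		if nerr.Timeout() {
			return statusTimedOut
		}
		return statusOffline
	}
	return statusUnknown
}

// sampleRefused builds a made-up, non-timeout dial failure.
func sampleRefused() error {
	return fmt.Errorf("peer call: %w",
		&net.OpError{Op: "dial", Net: "tcp", Err: errors.New("connection refused")})
}

// sampleTimeout builds a made-up dial failure caused by an expired deadline.
func sampleTimeout() error {
	return fmt.Errorf("peer call: %w",
		&net.OpError{Op: "dial", Net: "tcp", Err: os.ErrDeadlineExceeded})
}

func main() {
	fmt.Println(classify(nil))
	fmt.Println(classify(sampleRefused()))
	fmt.Println(classify(sampleTimeout()))
	fmt.Println(classify(errors.New("bad request")))
}
```

Checking the timeout case first matters here because a timeout is also a network error; testing in the other order would report every hung peer as offline.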

cmd/metrics-resource.go (outdated, resolved)
Comment on lines -106 to -117
cli.DurationFlag{
	Name:   "conn-client-read-deadline",
	Usage:  "custom connection READ deadline for incoming requests",
	Hidden: true,
	EnvVar: "MINIO_CONN_CLIENT_READ_DEADLINE",
},
cli.DurationFlag{
	Name:   "conn-client-write-deadline",
	Usage:  "custom connection WRITE deadline for outgoing requests",
	Hidden: true,
	EnvVar: "MINIO_CONN_CLIENT_WRITE_DEADLINE",
},
Member

Not needed; remove it.

-	gridLogIf(ctx, fmt.Errorf("ws write: %w", err))
+	if !xnet.IsNetworkOrHostDown(err, true) {
+		gridLogIf(ctx, fmt.Errorf("ws write: %w", err))
+	}
 	return
 }
Member

Apply valid deadlines and log only unexpected errors, not repeated network errors.
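
The logging half of this comment can be sketched as a small filter that skips errors a hung or unreachable peer is expected to produce. This is an illustration under stated assumptions, not the PR's code: `isExpectedNetErr` is a hypothetical stand-in for a check such as `xnet.IsNetworkOrHostDown(err, true)` and does not reproduce its rules, and `logUnexpected`, `sampleErrors`, and `countLogged` are names invented for the example.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// isExpectedNetErr is a hypothetical stand-in for a "network or host
// down" check. Here it treats any net.Error or a closed connection
// as expected noise.
func isExpectedNetErr(err error) bool {
	var nerr net.Error
	return errors.As(err, &nerr) || errors.Is(err, net.ErrClosed)
}

// logUnexpected logs err through logf unless it is nil or an expected
// network error. It reports whether a log line was emitted.
func logUnexpected(logf func(format string, args ...any), err error) bool {
	if err == nil || isExpectedNetErr(err) {
		return false
	}
	logf("ws write: %v\n", err)
	return true
}

// sampleErrors returns made-up errors for the demonstration.
func sampleErrors() []error {
	return []error{
		nil,
		&net.OpError{Op: "write", Net: "tcp", Err: errors.New("connection reset by peer")},
		fmt.Errorf("ws write: %w", net.ErrClosed),
		errors.New("invalid frame header"),
	}
}

// countLogged counts how many of errs would be logged.
func countLogged(errs []error) int {
	n := 0
	discard := func(string, ...any) {}
	for _, err := range errs {
		if logUnexpected(discard, err) {
			n++
		}
	}
	return n
}

func main() {
	stdout := func(format string, args ...any) { fmt.Printf(format, args...) }
	for _, err := range sampleErrors() {
		logUnexpected(stdout, err)
	}
	fmt.Println("logged:", countLogged(sampleErrors()))
}
```

Of the four sample errors, only the one that is not network-related reaches the logger, so a peer that stays down does not flood the log with the same message.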

@harshavardhana changed the title from "Added a flow to test one node paused scenario" to "@harshavardhana Add sufficient deadlines and countermeasures to handle hung node scen… 6998367 …ario" May 17, 2024
@harshavardhana changed the title from "@harshavardhana Add sufficient deadlines and countermeasures to handle hung node scen… 6998367 …ario" to "Add sufficient deadlines and countermeasures to handle hung node scenario" May 17, 2024
@harshavardhana force-pushed the pause-container-test branch 4 times, most recently from 12a1a10 to 80daf94 (May 17, 2024 18:05)
…ario

Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
Signed-off-by: Harshavardhana <harsha@minio.io>
Comment on lines -618 to -621
err = conn.SetWriteDeadline(time.Now().Add(connWriteTimeout))
if err != nil {
	return err
}
Contributor

Why is this removed?
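
For context on what a call like the one in question does, here is a minimal sketch of a write deadline guarding against a peer that never reads. It uses `net.Pipe` to simulate the hung peer and makes no claim about why the lines were removed or how the PR sets deadlines elsewhere; `writeWithDeadline`, `hungPeerWrite`, and `demo` are hypothetical helpers introduced for this example.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"time"
)

// writeWithDeadline sets a write deadline on conn and then writes p.
// Without the deadline, Write would block for as long as the peer
// refuses to read.
func writeWithDeadline(conn net.Conn, p []byte, timeout time.Duration) error {
	if err := conn.SetWriteDeadline(time.Now().Add(timeout)); err != nil {
		return err
	}
	_, err := conn.Write(p)
	return err
}

// hungPeerWrite simulates a hung peer: the remote end of an in-memory
// pipe is never read from, so the write can only finish by timing out.
func hungPeerWrite(timeout time.Duration) error {
	local, remote := net.Pipe()
	defer local.Close()
	defer remote.Close()
	return writeWithDeadline(local, []byte("ping"), timeout)
}

// demo reports whether the write to the hung peer failed with a
// deadline error rather than blocking forever.
func demo() bool {
	err := hungPeerWrite(50 * time.Millisecond)
	return errors.Is(err, os.ErrDeadlineExceeded)
}

func main() {
	fmt.Println("deadline exceeded:", demo())
}
```

The deadline is an absolute point in time, so a caller that writes repeatedly on the same connection has to set it again before each write.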

6 participants