Enhance Event Spam Key to Preserve Important Events During Reconciliations #124747

Open
wants to merge 1 commit into
base: master
from

Conversation

l-technicore

@l-technicore l-technicore commented May 8, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

Description:

The current Kubernetes event spam key doesn't distinguish between event types and reasons. As a result, important events are lost during reconcile loops for stuck Kubernetes objects once the --event-ttl (default: 1 hour) has elapsed.

Problem:
  • When a controller encounters errors and retries repeatedly, it can trigger a burst of events for the same object.
  • The event spam filter limits the number of events to prevent flooding etcd.
  • Because all events for an object share one spam key, when several event types occur during a reconcile loop (e.g., Warning and Normal), only the first of them gets recorded once the rate limit is reached.
  • This can mask critical information like the reason for a service sync failure, leading to troubleshooting difficulties.
Example:
  • The default Kubernetes service controller encounters an error during service synchronization.
  • It retries and generates multiple events, including a Normal event with "EnsuringLoadBalancer" reason.
  • Due to the spam filter, only the "EnsuringLoadBalancer" event gets recorded, hiding any Warning events containing the actual error details.
Events:
  Type     Reason                  Age                    From                Message
  ----     ------                  ----                   ----                -------
  Normal   EnsuringLoadBalancer    6m7s (x268 over 22h)   service-controller  Ensuring load balancer
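
The starvation described above can be reproduced with a small, self-contained simulation. This is a sketch, not the client-go implementation: the `Event` struct, `SpamFilter`, `ObjectOnlyKey` and `Simulate` are hypothetical names, the burst of 25 and the one-minute retry interval are assumptions, and only the 5-minute refill comes from this PR's text.

```go
package main

import (
	"fmt"
	"time"
)

// Assumed values for illustration; only the 5-minute refill is stated in the PR.
const (
	burst        = 25
	refillEvery  = 5 * time.Minute
	loopInterval = time.Minute // hypothetical controller retry interval
)

// Event is a hypothetical, trimmed-down stand-in for v1.Event.
type Event struct {
	Component, ObjectUID, Type, Reason string
}

// bucket is an integer token bucket: credit accrues with elapsed time and
// each accepted event costs refillEvery of credit.
type bucket struct {
	credit time.Duration
	last   time.Time
}

// SpamFilter keeps one bucket per spam key.
type SpamFilter struct {
	keyFunc func(Event) string
	buckets map[string]*bucket
}

func NewSpamFilter(keyFunc func(Event) string) *SpamFilter {
	return &SpamFilter{keyFunc: keyFunc, buckets: map[string]*bucket{}}
}

// Allow reports whether the event passes the filter at time now.
func (f *SpamFilter) Allow(e Event, now time.Time) bool {
	key := f.keyFunc(e)
	b, ok := f.buckets[key]
	if !ok {
		b = &bucket{credit: burst * refillEvery, last: now}
		f.buckets[key] = b
	}
	b.credit += now.Sub(b.last)
	if b.credit > burst*refillEvery {
		b.credit = burst * refillEvery
	}
	b.last = now
	if b.credit < refillEvery {
		return false
	}
	b.credit -= refillEvery
	return true
}

// ObjectOnlyKey ignores Type and Reason, like the key this PR describes.
func ObjectOnlyKey(e Event) string { return e.Component + "/" + e.ObjectUID }

// Simulate runs the given number of reconcile loops, each emitting a Normal
// and then a Warning event, and returns when each type was last accepted.
func Simulate(f *SpamFilter, loops int) (lastNormal, lastWarning time.Duration) {
	start := time.Unix(0, 0)
	for i := 0; i < loops; i++ {
		now := start.Add(time.Duration(i) * loopInterval)
		normal := Event{"service-controller", "uid-1", "Normal", "EnsuringLoadBalancer"}
		warning := Event{"service-controller", "uid-1", "Warning", "SyncLoadBalancerFailed"}
		if f.Allow(normal, now) {
			lastNormal = now.Sub(start)
		}
		if f.Allow(warning, now) {
			lastWarning = now.Sub(start)
		}
	}
	return lastNormal, lastWarning
}

func main() {
	n, w := Simulate(NewSpamFilter(ObjectOnlyKey), 600) // 10 hours of retries
	fmt.Println("last Normal accepted after: ", n)
	fmt.Println("last Warning accepted after:", w)
}
```

Under these assumptions the Warning is accepted only while the initial burst lasts. After that, the Normal event emitted first in each loop consumes every refilled token, so the Warning ages out once its TTL passes.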
Solution:

This pull request proposes modifying the default event spam key to include event.Type and event.Reason. This allows the spam filter to differentiate between events based on their type and reason, preventing the loss of crucial information in scenarios like reconcile loops.

Events:
  Type     Reason                  Age                    From                Message
  ----     ------                  ----                   ----                -------
  Normal   EnsuringLoadBalancer    4s (x265 over 21h)     service-controller  Ensuring load balancer
  Warning  SyncLoadBalancerFailed  1s (x256 over 21h)     service-controller  (combined from similar events): Error syncing load balancer: failed to ensure load balancer: xyz reason crucial for debugging.
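
For illustration, the difference between the two keys can be sketched as follows. The `Event` struct is a hypothetical, flattened stand-in for `v1.Event`, and the field list in `ObjectSpamKey` reflects my understanding of what the default key in `events_cache.go` is built from; treat it as an assumption and check the source.

```go
package main

import (
	"fmt"
	"strings"
)

// Event is a hypothetical, flattened stand-in for the v1.Event fields used here.
type Event struct {
	Component, Host                        string // from event.Source
	Kind, Namespace, Name, UID, APIVersion string // from event.InvolvedObject
	Type, Reason                           string
}

// ObjectSpamKey joins only source and involved-object fields (assumed to
// mirror the current default key).
func ObjectSpamKey(e Event) string {
	return strings.Join([]string{
		e.Component, e.Host, e.Kind, e.Namespace, e.Name, e.UID, e.APIVersion,
	}, "")
}

// TypeReasonSpamKey additionally appends Type and Reason, as this PR proposes.
func TypeReasonSpamKey(e Event) string {
	return strings.Join([]string{ObjectSpamKey(e), e.Type, e.Reason}, "")
}

func main() {
	normal := Event{
		Component: "service-controller", Kind: "Service", Namespace: "default",
		Name: "web", UID: "uid-1", APIVersion: "v1",
		Type: "Normal", Reason: "EnsuringLoadBalancer",
	}
	warning := normal
	warning.Type, warning.Reason = "Warning", "SyncLoadBalancerFailed"

	fmt.Println("same key today:     ", ObjectSpamKey(normal) == ObjectSpamKey(warning))
	fmt.Println("same key with PR:   ", TypeReasonSpamKey(normal) == TypeReasonSpamKey(warning))
}
```

With the object-only key both events draw from one rate-limit bucket; with the extended key each (type, reason) pair gets its own.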
Benefits:
  • Improved visibility into controller behaviour during reconcile loops.
  • Easier troubleshooting of service and other object failures.
  • More informative event logs for debugging and analysis.
Possible Approaches:
  • Modify the default spam key behaviour to include event.Type and event.Reason.
  • Introduce overrides for specific upstream controllers (e.g., service controller) to achieve the same outcome. (example)
Impact:

This approach should have minimal impact on etcd storage: the spam filter still rate-limits each distinct (type, reason) combination per object, and the recording frequency for each combination remains limited to one event every 5 minutes.
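
As a rough back-of-the-envelope check, the upper bound on events admitted by the filter scales with the number of distinct spam keys an object's events map to. The sketch below assumes a token bucket with a burst of 25 (an assumption, not stated in this PR) and the 5-minute refill mentioned above; the function names are hypothetical.

```go
package main

import "fmt"

// MaxAccepted returns an upper bound on how many events one token bucket
// admits within a window, given its burst size and refill interval.
func MaxAccepted(burst, refillMinutes, windowMinutes int) int {
	return burst + windowMinutes/refillMinutes
}

// MaxAcceptedPerObject scales the bound by the number of distinct spam keys
// that an object's events map to.
func MaxAcceptedPerObject(distinctKeys, burst, refillMinutes, windowMinutes int) int {
	return distinctKeys * MaxAccepted(burst, refillMinutes, windowMinutes)
}

func main() {
	const burst, refill, window = 25, 5, 60 // assumed burst; minutes
	for _, keys := range []int{1, 2, 5} {
		fmt.Printf("%d key(s): at most %d events in the first hour, %d per hour afterwards\n",
			keys,
			MaxAcceptedPerObject(keys, burst, refill, window),
			MaxAcceptedPerObject(keys, 0, refill, window))
	}
}
```

With one key per object the steady-state bound is 12 events per hour. A controller that emits two (type, reason) pairs doubles it to 24, so the bound grows linearly with the number of distinct reasons.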

This pull request aims to enhance event visibility in Kubernetes by addressing the limitations of the current spam filter during reconcile loops. By differentiating events based on type and reason, we can ensure that critical information is preserved and readily accessible for debugging purposes.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

@k8s-ci-robot
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels May 8, 2024

linux-foundation-easycla bot commented May 8, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: l-technicore / name: Lalit Kumar Singh (fd554eb)

@k8s-ci-robot k8s-ci-robot added the do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label May 8, 2024
@k8s-ci-robot
Contributor

Welcome @l-technicore!

It looks like this is your first PR to kubernetes/kubernetes 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kubernetes has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 8, 2024
@k8s-ci-robot
Contributor

Hi @l-technicore. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 8, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: l-technicore
Once this PR has been reviewed and has the lgtm label, please assign rainbowmango for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@l-technicore l-technicore changed the title Enhance Event Spam Key to Preserve Important Information During Reconciliations Enhance Event Spam Key to Preserve Important Events During Reconciliations May 8, 2024
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels May 8, 2024
@l-technicore l-technicore marked this pull request as ready for review May 8, 2024 11:02
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 8, 2024
Comment on lines +80 to +81
event.Type,
event.Reason,
Member

The spam filter is meant to reduce the spam caused by faulty objects (https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/tools/record/events_cache.go#L128-L129) on any of the events they produce. If we were to distinguish between type and reason, the rate limiting would be less effective than it is today.

Maybe it could be worth looking into making the spam filter aware of the event type so that there are dedicated priority seats available for critical events when there is a burst of both normal and critical events.
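
One possible reading of the "priority seats" idea, sketched under stated assumptions: keep the spam key per object, but give each object a second, smaller bucket that only Warning events may draw from. Everything here is hypothetical (the names, the reserved burst of 5, the shared burst of 25, the one-minute retry interval); it is not an existing client-go feature.

```go
package main

import (
	"fmt"
	"time"
)

// All values below are assumptions for illustration only.
const (
	sharedBurst   = 25
	reservedBurst = 5 // hypothetical number of "priority seats"
	refillEvery   = 5 * time.Minute
	loopInterval  = time.Minute
)

// Event is a hypothetical, trimmed-down stand-in for v1.Event.
type Event struct{ ObjectUID, Type, Reason string }

type bucket struct {
	credit, limit time.Duration
	last          time.Time
}

func newBucket(burst int, now time.Time) *bucket {
	limit := time.Duration(burst) * refillEvery
	return &bucket{credit: limit, limit: limit, last: now}
}

// take accrues credit for the elapsed time and spends one token if available.
func (b *bucket) take(now time.Time) bool {
	b.credit += now.Sub(b.last)
	if b.credit > b.limit {
		b.credit = b.limit
	}
	b.last = now
	if b.credit < refillEvery {
		return false
	}
	b.credit -= refillEvery
	return true
}

// seats holds the two buckets kept per object.
type seats struct{ shared, reserved *bucket }

// PriorityFilter keys only on the object, so the number of buckets does not
// grow with the number of distinct event reasons.
type PriorityFilter struct{ perObject map[string]*seats }

func NewPriorityFilter() *PriorityFilter {
	return &PriorityFilter{perObject: map[string]*seats{}}
}

// Allow lets every event compete for the shared bucket; only Warning events
// may fall back to the reserved one.
func (f *PriorityFilter) Allow(e Event, now time.Time) bool {
	s, ok := f.perObject[e.ObjectUID]
	if !ok {
		s = &seats{newBucket(sharedBurst, now), newBucket(reservedBurst, now)}
		f.perObject[e.ObjectUID] = s
	}
	if s.shared.take(now) {
		return true
	}
	return e.Type == "Warning" && s.reserved.take(now)
}

// Simulate emits a Normal and then a Warning event per reconcile loop and
// returns when each type was last accepted.
func Simulate(f *PriorityFilter, loops int) (lastNormal, lastWarning time.Duration) {
	start := time.Unix(0, 0)
	for i := 0; i < loops; i++ {
		now := start.Add(time.Duration(i) * loopInterval)
		if f.Allow(Event{"uid-1", "Normal", "EnsuringLoadBalancer"}, now) {
			lastNormal = now.Sub(start)
		}
		if f.Allow(Event{"uid-1", "Warning", "SyncLoadBalancerFailed"}, now) {
			lastWarning = now.Sub(start)
		}
	}
	return lastNormal, lastWarning
}

func main() {
	n, w := Simulate(NewPriorityFilter(), 600) // 10 hours of retries
	fmt.Println("last Normal accepted after: ", n)
	fmt.Println("last Warning accepted after:", w)
}
```

In this sketch the per-object rate stays bounded by two buckets regardless of how many reasons a controller emits, while Warning events keep being admitted after the shared burst is exhausted.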

@fedebongio
Contributor

/assign @wojtek-t
/triage accepted

@k8s-ci-robot k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label May 14, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label May 14, 2024
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. kind/bug Categorizes issue or PR as related to a bug. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.

5 participants