Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CPUStats.UsageCoreNanoSeconds is inaccurate on windows - 1/10 of the real value. #124734

Open
qiutongs opened this issue May 8, 2024 · 4 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/windows Categorizes an issue or PR as relevant to SIG Windows.

Comments

@qiutongs
Copy link
Contributor

qiutongs commented May 8, 2024

What happened?

The CPU utilization metrics from Kubelet summary API CPUStats.UsageCoreNanoSeconds is not accurate on Windows. The value is only 1/10 of the real value which I have other VM level metrics as a proof.

summary API CPU utilization metric

image

VM CPU utilization metric

image

What did you expect to happen?

The CPUStats.UsageCoreNanoSeconds should be accurate on Windows too.

How can we reproduce it (as minimally and precisely as possible)?

$ kubectl get --raw "/api/v1/nodes/<NODE>/proxy/stats/summary"

T1

  "cpu": {
   "time": "2023-12-06T21:23:48Z",
   "usageNanoCores": 8000000,
   "usageCoreNanoSeconds": 20985800000000
  },

T2

  "cpu": {
   "time": "2023-12-06T21:24:38Z",
   "usageNanoCores": 8000000,
   "usageCoreNanoSeconds": 20986320000000
  },

(20986320000000 - 20985800000000)ns / 50s = 0.01 s/s
This matches the magnitude of the metric chart.

Anything else we need to know?

This problem doesn't exist on Linux.

Call stack

  • metric is set here by kubelet -
    func (c *StatsClient) createRootContainerInfo() (*cadvisorapiv2.ContainerInfo, error) {
    nodeMetrics, err := c.client.getNodeMetrics()
    if err != nil {
    return nil, err
    }
    var stats []*cadvisorapiv2.ContainerStats
    stats = append(stats, &cadvisorapiv2.ContainerStats{
    Timestamp: nodeMetrics.timeStamp,
    Cpu: &cadvisorapi.CpuStats{
    Usage: cadvisorapi.CpuUsage{
    Total: nodeMetrics.cpuUsageCoreNanoSeconds,
  • the data collected by using windows perf counters -
    cpuValue, err := cpuCounter.getData()
    cpuCores := ProcessorCount()
    if err != nil {
    klog.ErrorS(err, "Unable to get cpu perf counter data")
    return
    }
  • query here -
    cpuQuery = "\\Processor(_Total)\\% Processor Time"

Kubernetes version

$ kubectl version
1.27

Cloud provider

GKE

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@qiutongs qiutongs added the kind/bug Categorizes issue or PR as related to a bug. label May 8, 2024
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label May 8, 2024
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label May 8, 2024
@brianpursley
Copy link
Member

Could usageNanoCores be wrong?

This code caught my attention:

cachePeriodSeconds := uint64(defaultCachePeriod / time.Second)
perfCounterUpdatePeriodSeconds := uint64(perfCounterUpdatePeriod / time.Second)
cpuUsageNanoCores := ((p.cpuUsageCoreNanoSecondsCache.latestValue - p.cpuUsageCoreNanoSecondsCache.previousValue) * perfCounterUpdatePeriodSeconds) / cachePeriodSeconds

usageNanoCores is computed in cri_stats_provider.go, like this:

nanoSeconds := newStats.Timestamp - cachedStats.Timestamp
if nanoSeconds <= 0 {
return nil, fmt.Errorf("zero or negative interval (%v - %v)", newStats.Timestamp, cachedStats.Timestamp)
}
usageNanoCores := uint64(float64(newStats.UsageCoreNanoSeconds.Value-cachedStats.UsageCoreNanoSeconds.Value) /
float64(nanoSeconds) * float64(time.Second/time.Nanosecond))

So wouldn't something like this be equivalent (just going by looking at the code)?

cpuUsageNanoCores := (p.cpuUsageCoreNanoSecondsCache.latestValue - p.cpuUsageCoreNanoSecondsCache.previousValue) / float64(perfCounterUpdatePeriod) * float64(time.Second/time.Nanosecond)

@neolit123
Copy link
Member

/sig windows

you can open discusion with https://github.com/kubernetes/community/tree/master/sig-windows

@k8s-ci-robot k8s-ci-robot added sig/windows Categorizes an issue or PR as relevant to SIG Windows. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 16, 2024
@jsturtevant
Copy link
Contributor

@qiutongs What version of containerd are you using? There is an issue with the way it is collected in 1.27+ and containerd 1.7 #122092

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/windows Categorizes an issue or PR as relevant to SIG Windows.
Projects
Status: No status
Development

No branches or pull requests

5 participants