[Bug]: Child node service reports as active, but seems hung and is stale in Netdata Cloud #16809

Open
asteinlein opened this issue Jan 18, 2024 · 8 comments
Labels
bug needs triage Issues which need to be manually labelled

@asteinlein

Bug description

After a few hours of running, the agent seems to die or stop collecting and reporting data, even though systemctl status netdata reports the state as active.

The log seems to indicate that a child process was killed somehow and that the collectors and replication threads have stopped, but I have no idea why, nor why the agent still appears to be running. When I finally tried to stop the agent with systemctl stop netdata, it hung for a long time before systemd eventually killed it.
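
A minimal sketch of how the "active but shutting down" state described above could be spotted from log text. The function name `netdata_shutdown_state` is hypothetical; the only thing it relies on is the `NETDATA SHUTDOWN:` prefix that appears in the log excerpts in this report.

```shell
# Hypothetical helper: reads log lines on stdin and reports whether a
# Netdata shutdown sequence has been logged.
netdata_shutdown_state() {
    if grep -q 'NETDATA SHUTDOWN:'; then
        echo "shutting-down"
    else
        echo "running"
    fi
}
```

Assuming the daemon's messages show up in the unit status as they did here, something like `systemctl status netdata --no-pager -n 50 | netdata_shutdown_state` could be run periodically. A result of `shutting-down` while `systemctl is-active netdata` still prints `active` would match the symptom in this report.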

Expected behavior

It should continue to run, collect metrics and have the metrics show up in Netdata Cloud.

Steps to reproduce

  1. Install Netdata through the kickstart script
  2. Disable the web interface
  3. Set this node up as a child, reporting data to a parent server
  4. Restart Netdata
  5. Wait for the problem described above to appear
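
The report does not include the actual configuration, so the following is only a sketch of what steps 2 and 3 typically look like. It assumes the child is configured through the `[stream]` section of stream.conf and that the web interface is disabled with `mode = none` under `[web]` in netdata.conf. The function name, the parent address, and the API key are hypothetical placeholders.

```shell
# Hypothetical helper: prints an assumed child-node configuration.
# $1 = parent address (placeholder, e.g. 192.0.2.10:19999)
# $2 = API key accepted by the parent (placeholder UUID)
print_child_config() {
    parent=$1
    api_key=$2
    cat <<EOF
# --- stream.conf (assumed) ---
[stream]
    enabled = yes
    destination = $parent
    api key = $api_key

# --- netdata.conf (assumed) ---
[web]
    mode = none
EOF
}
```

Running `print_child_config 192.0.2.10:19999 11111111-2222-3333-4444-555555555555` prints both fragments so they can be compared against the files under /etc/netdata.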

Installation method

kickstart.sh

System info

Linux [...] 5.4.0-131-generic #147-Ubuntu SMP Fri Oct 14 17:07:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
/etc/lsb-release:DISTRIB_ID=Ubuntu
/etc/lsb-release:DISTRIB_RELEASE=20.04
/etc/lsb-release:DISTRIB_CODENAME=focal
/etc/lsb-release:DISTRIB_DESCRIPTION="Ubuntu 20.04.6 LTS"
/etc/os-release:NAME="Ubuntu"
/etc/os-release:VERSION="20.04.6 LTS (Focal Fossa)"
/etc/os-release:ID=ubuntu
/etc/os-release:ID_LIKE=debian
/etc/os-release:PRETTY_NAME="Ubuntu 20.04.6 LTS"
/etc/os-release:VERSION_ID="20.04"
/etc/os-release:VERSION_CODENAME=focal
/etc/os-release:UBUNTU_CODENAME=focal

Netdata build info

Packaging:
    Netdata Version ____________________________________________ : v1.44.1
    Installation Type __________________________________________ : binpkg-deb
    Package Architecture _______________________________________ : x86_64
    Package Distro _____________________________________________ :  
    Configure Options __________________________________________ :  '--build=x86_64-linux-gnu' '--includedir=${prefix}/include' '--mandir=${prefix}/share/man' '--infodir=${prefix}/share/info' '--disable-silent-rules' '--libdir=${prefix}/lib/x86_64-linux-gnu' '--libexecdir=${prefix}/lib/x86_64-linux-gnu' '--disable-maintainer-mode' '--prefix=/usr' '--sysconfdir=/etc' '--localstatedir=/var' '--libdir=/usr/lib' '--libexecdir=/usr/libexec' '--with-user=netdata' '--with-math' '--with-zlib' '--with-webdir=/var/lib/netdata/www' '--disable-dependency-tracking' 'build_alias=x86_64-linux-gnu' 'CFLAGS=-g -O2 -fdebug-prefix-map=/usr/src/netdata=. -fstack-protector-strong -Wformat -Werror=format-security' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro' 'CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2' 'CXXFLAGS=-g -O2 -fdebug-prefix-map=/usr/src/netdata=. -fstack-protector-strong -Wformat -Werror=format-security'
Default Directories:
    User Configurations ________________________________________ : /etc/netdata
    Stock Configurations _______________________________________ : /usr/lib/netdata/conf.d
    Ephemeral Databases (metrics data, metadata) _______________ : /var/cache/netdata
    Permanent Databases ________________________________________ : /var/lib/netdata
    Plugins ____________________________________________________ : /usr/libexec/netdata/plugins.d
    Static Web Files ___________________________________________ : /var/lib/netdata/www
    Log Files __________________________________________________ : /var/log/netdata
    Lock Files _________________________________________________ : /var/lib/netdata/lock
    Home _______________________________________________________ : /var/lib/netdata
Operating System:
    Kernel _____________________________________________________ : Linux
    Kernel Version _____________________________________________ : 5.4.0-131-generic
    Operating System ___________________________________________ : Ubuntu
    Operating System ID ________________________________________ : ubuntu
    Operating System ID Like ___________________________________ : debian
    Operating System Version ___________________________________ : 20.04.6 LTS (Focal Fossa)
    Operating System Version ID ________________________________ : none
    Detection __________________________________________________ : /etc/os-release
Hardware:
    CPU Cores __________________________________________________ : 32
    CPU Frequency ______________________________________________ : 3000000000
    RAM Bytes __________________________________________________ : 67240386560
    Disk Capacity ______________________________________________ : 1920394248192
    CPU Architecture ___________________________________________ : x86_64
    Virtualization Technology __________________________________ : none
    Virtualization Detection ___________________________________ : systemd-detect-virt
Container:
    Container __________________________________________________ : none
    Container Detection ________________________________________ : systemd-detect-virt
    Container Orchestrator _____________________________________ : none
    Container Operating System _________________________________ : none
    Container Operating System ID ______________________________ : none
    Container Operating System ID Like _________________________ : none
    Container Operating System Version _________________________ : none
    Container Operating System Version ID ______________________ : none
    Container Operating System Detection _______________________ : none
Features:
    Built For __________________________________________________ : Linux
    Netdata Cloud ______________________________________________ : YES
    Health (trigger alerts and send notifications) _____________ : YES
    Streaming (stream metrics to parent Netdata servers) _______ : YES
    Back-filling (of higher database tiers) ____________________ : YES
    Replication (fill the gaps of parent Netdata servers) ______ : YES
    Streaming and Replication Compression ______________________ : YES (zstd lz4 gzip)
    Contexts (index all active and archived metrics) ___________ : YES
    Tiering (multiple dbs with different metrics resolution) ___ : YES (5)
    Machine Learning ___________________________________________ : YES
Database Engines:
    dbengine ___________________________________________________ : YES
    alloc ______________________________________________________ : YES
    ram ________________________________________________________ : YES
    map ________________________________________________________ : YES
    save _______________________________________________________ : YES
    none _______________________________________________________ : YES
Connectivity Capabilities:
    ACLK (Agent-Cloud Link: MQTT over WebSockets over TLS) _____ : YES
    static (Netdata internal web server) _______________________ : YES
    h2o (web server) ___________________________________________ : YES
    WebRTC (experimental) ______________________________________ : NO
    Native HTTPS (TLS Support) _________________________________ : YES
    TLS Host Verification ______________________________________ : YES
Libraries:
    LZ4 (extremely fast lossless compression algorithm) ________ : YES
    ZSTD (fast, lossless compression algorithm) ________________ : YES
    zlib (lossless data-compression library) ___________________ : YES
    Judy (high-performance dynamic arrays and hashtables) ______ : YES (bundled)
    dlib (robust machine learning toolkit) _____________________ : YES (bundled)
    protobuf (platform-neutral data serialization protocol) ____ : YES (system)
    OpenSSL (cryptography) _____________________________________ : YES
    libdatachannel (stand-alone WebRTC data channels) __________ : NO
    JSON-C (lightweight JSON manipulation) _____________________ : YES
    libcap (Linux capabilities system operations) ______________ : NO
    libcrypto (cryptographic functions) ________________________ : YES
    libm (mathematical functions) ______________________________ : YES
    jemalloc ___________________________________________________ : NO
    TCMalloc ___________________________________________________ : NO
Plugins:
    apps (monitor processes) ___________________________________ : YES
    cgroups (monitor containers and VMs) _______________________ : YES
    cgroup-network (associate interfaces to CGROUPS) ___________ : YES
    proc (monitor Linux systems) _______________________________ : YES
    tc (monitor Linux network QoS) _____________________________ : YES
    diskspace (monitor Linux mount points) _____________________ : YES
    freebsd (monitor FreeBSD systems) __________________________ : NO
    macos (monitor MacOS systems) ______________________________ : NO
    statsd (collect custom application metrics) ________________ : YES
    timex (check system clock synchronization) _________________ : YES
    idlejitter (check system latency and jitter) _______________ : YES
    bash (support shell data collection jobs - charts.d) _______ : YES
    debugfs (kernel debugging metrics) _________________________ : YES
    cups (monitor printers and print jobs) _____________________ : YES
    ebpf (monitor system calls) ________________________________ : YES
    freeipmi (monitor enterprise server H/W) ___________________ : YES
    nfacct (gather netfilter accounting) _______________________ : YES
    perf (collect kernel performance events) ___________________ : YES
    slabinfo (monitor kernel object caching) ___________________ : YES
    Xen ________________________________________________________ : NO
    Xen VBD Error Tracking _____________________________________ : NO
    Logs Management ____________________________________________ : YES
Exporters:
    AWS Kinesis ________________________________________________ : NO
    GCP PubSub _________________________________________________ : NO
    MongoDB ____________________________________________________ : YES
    Prometheus (OpenMetrics) Exporter __________________________ : YES
    Prometheus Remote Write ____________________________________ : YES
    Graphite ___________________________________________________ : YES
    Graphite HTTP / HTTPS ______________________________________ : YES
    JSON _______________________________________________________ : YES
    JSON HTTP / HTTPS __________________________________________ : YES
    OpenTSDB ___________________________________________________ : YES
    OpenTSDB HTTP / HTTPS ______________________________________ : YES
    All Metrics API ____________________________________________ : YES
    Shell (use metrics in shell scripts) _______________________ : YES
Debug/Developer Features:
    Trace All Netdata Allocations (with charts) ________________ : NO
    Developer Mode (more runtime checks, slower) _______________ : NO

Additional info

systemctl status netdata when the agent had hung/stopped reporting:

● netdata.service - Real time performance monitoring
     Loaded: loaded (/lib/systemd/system/netdata.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2024-01-18 01:02:41 CET; 19h ago
   Main PID: 1662549 (netdata)
      Tasks: 212 (limit: 76875)
     Memory: 75.8M
     CGroup: /system.slice/netdata.service
             ├─1662549 /usr/sbin/netdata -D -P /var/run/netdata/netdata.pid
             └─1662573 /usr/sbin/netdata --special-spawn-server

jan. 18 06:41:26 dick.e5r.no netdata[1662549]: SERVICE CONTROL: waiting for the following 9 services [ COLLECTORS STREAMING ] to exit: 'P[tc]' (1662950), 'PREDICT' (1662954), 'PD[ebpf]' (1662964), >
jan. 18 06:41:27 dick.e5r.no netdata[1662549]: SIGNAL: reap_child(1662979) killed by signal: 15
jan. 18 06:41:27 dick.e5r.no netdata[1662549]: SERVICE CONTROL: waiting for the following 3 services [ COLLECTORS STREAMING ] to exit: 'PREDICT' (1662954), 'P[cgroups]' (1663044), 'SNDR[dick.e5r.n'>
jan. 18 06:41:28 dick.e5r.no netdata[1662549]: SERVICE CONTROL: waiting for the following 3 services [ COLLECTORS STREAMING ] to exit: 'PREDICT' (1662954), 'P[cgroups]' (1663044), 'SNDR[dick.e5r.n'>
jan. 18 06:41:29 dick.e5r.no netdata[1662549]: SERVICE CONTROL: waiting for the following 3 services [ COLLECTORS STREAMING ] to exit: 'PREDICT' (1662954), 'P[cgroups]' (1663044), 'SNDR[dick.e5r.n'>
jan. 18 06:41:30 dick.e5r.no netdata[1662549]: SERVICE CONTROL: waiting for the following 3 services [ COLLECTORS STREAMING ] to exit: 'PREDICT' (1662954), 'P[cgroups]' (1663044), 'SNDR[dick.e5r.n'>
jan. 18 06:41:30 dick.e5r.no netdata[1662549]: SERVICE CONTROL: the following 3 service(s) [ COLLECTORS STREAMING ] take too long to exit: 'PREDICT' (1662954), 'P[cgroups]' (1663044), 'SNDR[dick.e5>
jan. 18 06:41:30 dick.e5r.no netdata[1662549]: NETDATA SHUTDOWN: in    5060 ms, (TIMEOUT) stop collectors and streaming threads - next: stop replication threads
jan. 18 06:41:30 dick.e5r.no netdata[1662549]: SERVICE CONTROL: waiting for the following 1 services [ REPLICATION ] to exit: 'REPLAY[1]' (1662948)
jan. 18 06:41:30 dick.e5r.no netdata[1662549]: NETDATA SHUTDOWN: in      50 ms, stop replication threads - next: prepare metasync shutdown

I then tried inspecting logs from Netdata, but I'm a bit confused why I'm not seeing the messages reported above. Instead, all I'm seeing with journalctl -u netdata is:

jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: State 'stop-sigterm' timed out. Killing.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457365 (netdata) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457464 (netdata) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457463 (DAEMON_SPAWN) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457468 (netdata) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457822 (DBENGINE) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457824 (UV_WORKER[2]) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457833 (UV_WORKER[173]) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457835 (UV_WORKER[13]) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457837 (UV_WORKER[16]) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457840 (UV_WORKER[14]) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457841 (n/a) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457843 (UV_WORKER[12]) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457844 (n/a) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457845 (UV_WORKER[11]) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457846 (n/a) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457848 (UV_WORKER[35]) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457849 (n/a) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457850 (UV_WORKER[143]) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457851 (n/a) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457852 (UV_WORKER[34]) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457853 (n/a) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457854 (UV_WORKER[136]) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457856 (UV_WORKER[33]) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457857 (UV_WORKER[175]) with signal SIGKILL.
[... REPEAT ...]
[... REPEAT ...]
[... maybe 100s of times ...]
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457878 (n/a) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457879 (n/a) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457880 (UV_WORKER[37]) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457881 (UV_WORKER[47]) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457882 (n/a) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457883 (n/a) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457884 (n/a) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457885 (n/a) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457886 (UV_WORKER[59]) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457887 (n/a) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457888 (n/a) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457889 (UV_WORKER[48]) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457890 (UV_WORKER[41]) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457891 (n/a) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457892 (UV_WORKER[146]) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457893 (n/a) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457894 (UV_WORKER[45]) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457895 (UV_WORKER[44]) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457896 (n/a) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457897 (UV_WORKER[144]) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457898 (n/a) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457899 (n/a) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457900 (n/a) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457901 (n/a) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457902 (UV_WORKER[62]) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457903 (n/a) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457904 (n/a) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457905 (n/a) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2457906 (n/a) with signal SIGKILL.
[... REPEAT ...]
[... REPEAT ...]
[... maybe 100s of times ...]
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2458233 (n/a) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2458775 (n/a) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Killing process 2458808 (n/a) with signal SIGKILL.
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Main process exited, code=killed, status=9/KILL
jan. 18 01:02:12 dick.e5r.no systemd[1]: netdata.service: Failed with result 'timeout'.
jan. 18 01:02:12 dick.e5r.no systemd[1]: Stopped Real time performance monitoring.
jan. 18 01:02:41 dick.e5r.no systemd[1]: Started Real time performance monitoring.
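
Since the SIGKILL lines above repeat hundreds of times, a per-thread summary may be easier to read than the raw dump. This is a sketch only: the function name `summarize_sigkills` is hypothetical, and it assumes the line format shown above (`Killing process PID (NAME) with signal SIGKILL.`). A trailing index such as `[173]` is stripped so all `UV_WORKER` threads are counted together.

```shell
# Hypothetical helper: reads journal lines on stdin and prints one
# "NAME: COUNT" line per killed thread name.
summarize_sigkills() {
    sed -n 's/.*Killing process [0-9][0-9]* (\(.*\)) with signal SIGKILL\.$/\1/p' \
        | sed 's/\[[0-9]*]$//' \
        | sort | uniq -c | awk '{ print $2 ": " $1 }'
}
```

It can be fed directly from the journal, for example `journalctl -u netdata --no-pager | summarize_sigkills`.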

I then tried to stop Netdata with systemctl stop netdata, which hung until systemd eventually killed it:

● netdata.service - Real time performance monitoring
     Loaded: loaded (/lib/systemd/system/netdata.service; enabled; vendor preset: enabled)
     Active: failed (Result: timeout) since Thu 2024-01-18 21:00:04 CET; 47s ago
    Process: 1662549 ExecStart=/usr/sbin/netdata -D $EXTRA_OPTS (code=killed, signal=KILL)
   Main PID: 1662549 (code=killed, signal=KILL)

jan. 18 21:00:03 dick.e5r.no systemd[1]: netdata.service: Killing process 1662947 (n/a) with signal SIGKILL.
jan. 18 21:00:03 dick.e5r.no systemd[1]: netdata.service: Killing process 1662949 (n/a) with signal SIGKILL.
jan. 18 21:00:03 dick.e5r.no systemd[1]: netdata.service: Killing process 1662954 (n/a) with signal SIGKILL.
jan. 18 21:00:03 dick.e5r.no systemd[1]: netdata.service: Killing process 1662956 (n/a) with signal SIGKILL.
jan. 18 21:00:03 dick.e5r.no systemd[1]: netdata.service: Killing process 1662957 (TRAIN[1]) with signal SIGKILL.
jan. 18 21:00:03 dick.e5r.no systemd[1]: netdata.service: Killing process 1662959 (n/a) with signal SIGKILL.
jan. 18 21:00:03 dick.e5r.no systemd[1]: netdata.service: Killing process 1663160 (n/a) with signal SIGKILL.
jan. 18 21:00:04 dick.e5r.no systemd[1]: netdata.service: Main process exited, code=killed, status=9/KILL
jan. 18 21:00:04 dick.e5r.no systemd[1]: netdata.service: Failed with result 'timeout'.
jan. 18 21:00:04 dick.e5r.no systemd[1]: Stopped Real time performance monitoring.
@asteinlein asteinlein added bug needs triage Issues which need to be manually labelled labels Jan 18, 2024
@shodanshok

I experience the very same issue with a netdata parent setup. Some logs:

Jan 19 11:02:07 localhost netdata[3281289]: METRIC: refcount is 0 (zero or negative) during release
Jan 19 11:02:07 localhost netdata[3281289]: /usr/sbin/netdata(+0x9035e)[0x55649c3a035e]
Jan 19 11:02:07 localhost netdata[3281289]: /usr/sbin/netdata(+0x3ce99e)[0x55649c6de99e]
Jan 19 11:02:07 localhost netdata[3281289]: /usr/sbin/netdata(+0x239597)[0x55649c549597]
Jan 19 11:02:07 localhost netdata[3281289]: /usr/sbin/netdata(+0x13b6d6)[0x55649c44b6d6]
Jan 19 11:02:07 localhost netdata[3281289]: /usr/sbin/netdata(+0x13b864)[0x55649c44b864]
Jan 19 11:02:07 localhost netdata[3281289]: /usr/sbin/netdata(+0x8116d)[0x55649c39116d]
Jan 19 11:02:07 localhost netdata[3281289]: /usr/sbin/netdata(+0x81707)[0x55649c391707]
Jan 19 11:02:07 localhost netdata[3281289]: /usr/sbin/netdata(+0x892f1)[0x55649c3992f1]
Jan 19 11:02:07 localhost netdata[3281289]: /usr/sbin/netdata(+0x14e39d)[0x55649c45e39d]
Jan 19 11:02:07 localhost netdata[3281289]: /usr/sbin/netdata(+0x8116d)[0x55649c39116d]
Jan 19 11:02:07 localhost netdata[3281289]: /usr/sbin/netdata(+0x813c7)[0x55649c3913c7]
Jan 19 11:02:07 localhost netdata[3281289]: /usr/sbin/netdata(+0x81e50)[0x55649c391e50]
Jan 19 11:02:07 localhost netdata[3281289]: /usr/sbin/netdata(+0x82efd)[0x55649c392efd]
Jan 19 11:02:07 localhost netdata[3281289]: /usr/sbin/netdata(+0x14ffa9)[0x55649c45ffa9]
Jan 19 11:02:07 localhost netdata[3281289]: /usr/sbin/netdata(+0x117b8b)[0x55649c427b8b]
Jan 19 11:02:07 localhost netdata[3281289]: /usr/sbin/netdata(+0x25a57b)[0x55649c56a57b]
Jan 19 11:02:07 localhost netdata[3281289]: /usr/sbin/netdata(+0x25b8ce)[0x55649c56b8ce]
Jan 19 11:02:07 localhost netdata[3281289]: /usr/sbin/netdata(+0xa3168)[0x55649c3b3168]
Jan 19 11:02:07 localhost netdata[3281289]: /lib64/libc.so.6(+0x9f802)[0x7fdabda9f802]
Jan 19 11:02:07 localhost netdata[3281289]: /lib64/libc.so.6(+0x3f450)[0x7fdabda3f450]
Jan 19 11:02:07 localhost netdata[3281289]: NETDATA SHUTDOWN: initializing shutdown with code 1...
Jan 19 11:02:08 localhost netdata[3281289]: NETDATA SHUTDOWN: next: create shutdown file
Jan 19 11:02:08 localhost netdata[3281289]: NETDATA SHUTDOWN: in       0 ms, create shutdown file - next: dbengine exit mode
Jan 19 11:02:08 localhost netdata[3281289]: NETDATA SHUTDOWN: in       0 ms, dbengine exit mode - next: close webrtc connections
Jan 19 11:02:08 localhost netdata[3281289]: NETDATA SHUTDOWN: in       0 ms, close webrtc connections - next: disable maintenance, new queries, new web requests, new streaming connections and aclk
Jan 19 11:02:08 localhost netdata[3281289]: NETDATA SHUTDOWN: in       0 ms, disable maintenance, new queries, new web requests, new streaming connections and aclk - next: stop replication, exporters, health and web servers thr>
Jan 19 11:02:08 localhost netdata[3281289]: SERVICE CONTROL: waiting for the following 3 services [ WEB_SERVER HEALTH ] to exit: 'HEALTH' (3281433), 'WEB[1]' (3281438), 'WEB[2]' (3281470)
Jan 19 11:02:08 localhost netdata[3281289]: cleaning up...
Jan 19 11:02:08 localhost netdata[3281289]: stopped after 38 connects, 38 disconnects (max concurrent 5), 628 receptions and 926 sends
Jan 19 11:02:08 localhost netdata[3281289]: stopped after 36 connects, 36 disconnects (max concurrent 4), 654 receptions and 882 sends
Jan 19 11:02:08 localhost netdata[3281289]: closing all web server sockets...
Jan 19 11:02:08 localhost netdata[3281289]: all static web threads stopped.
...
Jan 19 11:02:12 localhost netdata[3281289]: SERVICE CONTROL: the following 1 service(s) [ COLLECTORS ] take too long to exit: 'PREDICT' (3281447); giving up on them...
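
The backtrace above only shows raw addresses. As a sketch, the module-relative offsets (the `+0x...` part) can be pulled out so they can be handed to a symbolizer such as `addr2line`. The function name `backtrace_offsets` is hypothetical, and resolving the offsets only gives useful names if the binary or matching debug symbols are available, which this thread does not confirm.

```shell
# Hypothetical helper: reads log lines on stdin and prints the +0x offsets
# of the frames that belong to the binary given as $1.
backtrace_offsets() {
    bin=$1
    grep -F "$bin(+0x" | sed -n 's/.*(+\(0x[0-9a-fA-F]*\)).*/\1/p'
}
```

With the log lines saved to a file, a possible follow-up is `backtrace_offsets /usr/sbin/netdata < trace.log | xargs addr2line -f -C -e /usr/sbin/netdata`, where `trace.log` is a placeholder file name.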

@ilyam8
Member

ilyam8 commented Jan 20, 2024

@shodanshok stable version too? I think the problem is fixed in the latest.

@shodanshok

Yes, I am using the latest stable version (1.44)

@asteinlein
Author

FYI, since I posted the issue I've tried a few options/plugins, and disabling the cgroups plugin seems to have solved it, or at the very least mitigated it. Not enough time has gone by to say for sure, but so far it's been running much longer than it ever did with that plugin enabled.
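
The comment above does not say how the plugin was disabled. A common way is `cgroups = no` in the `[plugins]` section of netdata.conf, and the sketch below assumes that layout. The function name `cgroups_plugin_setting` is hypothetical; it reads a netdata.conf on stdin and prints the configured value, or `unset` if the option is absent.

```shell
# Hypothetical helper: prints the value of "cgroups" under [plugins],
# or "unset" when the option is not present in the input.
cgroups_plugin_setting() {
    awk '
        # Track whether the current section header is [plugins]
        /^[ \t]*\[/ { in_plugins = ($0 ~ /^[ \t]*\[plugins\]/); next }
        in_plugins && /^[ \t]*cgroups[ \t]*=/ {
            sub(/^[^=]*=[ \t]*/, "")   # drop the key and "="
            sub(/[ \t]+$/, "")         # drop trailing whitespace
            print
            found = 1
            exit
        }
        END { if (!found) print "unset" }
    '
}
```

With the default user configuration directory from the build info, this would be run as `cgroups_plugin_setting < /etc/netdata/netdata.conf`.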

@ilyam8
Member

ilyam8 commented Feb 9, 2024

@asteinlein Are you still having this issue after updating to v1.44.2?

@asteinlein
Author

> @asteinlein Are you still having this issue after updating to v1.44.2?

I'm still running v1.44.1. Shouldn't it automatically update? (apt update doesn't show any updates either.)

@tkatsoulas
Contributor

Check whether auto-updates are set up:

ls -la /etc/cron.daily/
total 72
drwxr-xr-x. 1 root root    92 Jan 31 19:37 .
drwxr-xr-x. 1 root root  5998 Feb  9 13:18 ..
. . . . .maybe more entries. . .  
lrwxrwxrwx  1 root root    39 Jan 31 19:37 netdata-updater -> /usr/libexec/netdata/netdata-updater.sh

apt-get update only refreshes the package index; it does not upgrade any software.
If you want to use the package manager, you also need to run apt-get upgrade netdata. Otherwise, run the updater script manually: sh /usr/libexec/netdata/netdata-updater.sh
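
To confirm whether an update actually landed, the installed version can be compared with the one being asked about (v1.44.2). This is a sketch with a hypothetical function name, `version_lt`, and it assumes GNU `sort` with `-V` (version sort) is available.

```shell
# Hypothetical helper: succeeds (exit 0) when version $1 is strictly
# older than version $2, using GNU sort's version ordering.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}
```

For example, `version_lt v1.44.1 v1.44.2 && echo "update needed"`. The installed version is shown in the build info as "Netdata Version"; with apt, `apt-get install --only-upgrade netdata` upgrades just that package.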

@asteinlein
Copy link
Author

asteinlein commented Feb 11, 2024

I've now upgraded and re-enabled the cgroups plugin (the agent ran smoothly without it; the hangs only happened with it enabled). Will report back.
