[Bug]: Heltec V3 reboot loop when using ATAK-Plugin #3725

Closed
azokthedefiler1 opened this issue Apr 26, 2024 · 22 comments · Fixed by #3922

@azokthedefiler1

Category

BLE, Other

Hardware

Heltec V3

Firmware Version

2.3.6.7a3570a

Description

Heltec V3 device reboots every 30-60 seconds while using the ATAK-Plugin. I did some research and it seems to be caused by Bluetooth (NimBLE) overflowing its task stack: Stack canary watchpoint triggered (nimble_host)

Android Meshtastic Version: 2.3.3
Heltec V3 Firmware Version: 2.3.6.7a3570a
ATAK-Plugin Version: 1.0.12

As far as I know, the issue has been around since at least firmware version 2.2.4 and ATAK plugin version 1.0.10. I currently have 4 Heltec V3 devices, and if at least one device is connected to and transmitting from ATAK, they all reboot, whether or not they are connected to ATAK themselves.

I'm happy to help with additional info, just let me know what is needed and how to get it.

Relevant log output

Guru Meditation Error: Core  0 panic'ed (Unhandled debug exception).
Debug exception reason: Stack canary watchpoint triggered (nimble_host)
Core  0 register dump:
PC      : 0x4037a10d  PS      : 0x00060436  A0      : 0x3fcc0eb0  A1      : 0x3fcc0df0
A2      : 0x3fcc1321  A3      : 0x3c151abd  A4      : 0x0000000c  A5      : 0x00000000
A6      : 0x00000000  A7      : 0x3fca9afc  A8      : 0x00000004  A9      : 0x3c151ac1
A10     : 0x0000003f  A11     : 0x3fcec04c  A12     : 0x3fcc1678  A13     : 0x000009b0
A14     : 0x3c15136c  A15     : 0x3fca2248  SAR     : 0x00000004  EXCCAUSE: 0x00000001
EXCVADDR: 0x00000000  LBEG    : 0x400556d5  LEND    : 0x400556e5  LCOUNT  : 0xfffffffe


Backtrace: 0x4037a10a:0x3fcc0df0 0x3fcc0ead:0x3fcc0ed0 |<-CORRUPTED




ELF file SHA256: 8b9bb492df3fd258

E (20286) esp_core_dump_flash: Core dump flash config is corrupted! CRC=0x7bd5c66f instead of 0x0
Rebooting...
ESP-ROM:esp32s3-20210327
Build:Mar 27 2021
rst:0xc (RTC_SW_CPU_RST),boot:0x29 (SPI_FAST_FLASH_BOOT)
Saved PC:0x4037806c
SPIWP:0xee
mode:DIO, clock div:1
load:0x3fce3808,len:0x44c
load:0x403c9700,len:0xbe4
load:0x403cc700,len:0x2a38
entry 0x403c98d4
E (355) esp_core_dump_flash: No core dump partition found!
E (355) esp_core_dump_flash: No core dump partition found!
INFO  | ??:??:?? 0

//\ E S H T /\ S T / C

INFO  | ??:??:?? 0 Booted, wake cause 0 (boot count 1), reset_reason=reset
DEBUG | ??:??:?? 0 Filesystem files (16384/1048576 Bytes):
DEBUG | ??:??:?? 0  /prefs/channels.proto (93 Bytes)
DEBUG | ??:??:?? 0  /prefs/config.proto (111 Bytes)
DEBUG | ??:??:?? 0  /prefs/db.proto (456 Bytes)
DEBUG | ??:??:?? 0  /prefs/module.proto (93 Bytes)
[   464][I][esp32-hal-i2c.c:75] i2cInit(): Initialising I2C Master: sda=41 scl=42 freq=100000
[   465][I][esp32-hal-i2c.c:75] i2cInit(): Initialising I2C Master: sda=17 scl=18 freq=100000
DEBUG | ??:??:?? 0 Using analog input 1 for battery level
INFO  | ??:??:?? 0 ADCmod: ADC Characterization based on Two Point values and fitting curve coefficients stored in eFuse
INFO  | ??:??:?? 0 Scanning for i2c devices...
[   497][W][Wire.cpp:301] begin(): Bus already started in Master Mode.
DEBUG | ??:??:?? 0 Scanning for i2c devices on port 2
[   527][W][Wire.cpp:301] begin(): Bus already started in Master Mode.
DEBUG | ??:??:?? 0 Scanning for i2c devices on port 1
DEBUG | ??:??:?? 0 I2C device found at address 0x3c
INFO  | ??:??:?? 0 ssd1306 display found
INFO  | ??:??:?? 0 ssd1306 display found
DEBUG | ??:??:?? 0 0x3 subtype probed in 2 tries
INFO  | ??:??:?? 0 1 I2C devices found
DEBUG | ??:??:?? 0 acc_info = 0
INFO  | ??:??:?? 0 Meshtastic hwvendor=43, swver=2.3.6.7a3570a
DEBUG | ??:??:?? 0 Setting random seed 4081983941
DEBUG | ??:??:?? 0 Total heap: 293960
DEBUG | ??:??:?? 0 Free heap: 257460
DEBUG | ??:??:?? 0 Total PSRAM: 0
DEBUG | ??:??:?? 0 Free PSRAM: 0
DEBUG | ??:??:?? 0 NVS: UsedEntries 89, FreeEntries 541, AllEntries 630, NameSpaces 3
DEBUG | ??:??:?? 0 Setup Preferences in Flash Storage
DEBUG | ??:??:?? 0 Number of Device Reboots: 8
ESP_ERROR_CHECK_WITHOUT_ABORT failed: esp_err_t 0x105 (ESP_ERR_NOT_FOUND) at 0x40380cc3
file: "src/platform/esp32/BleOta.cpp" line 16
func: static const esp_partition_t* BleOta::findEspOtaAppPartition()
expression: esp_ota_get_partition_description(part, &app_desc)
ESP_ERROR_CHECK_WITHOUT_ABORT failed: esp_err_t 0x102 (ESP_ERR_INVALID_ARG) at 0x40380cc3
file: "src/platform/esp32/BleOta.cpp" line 30
func: static String BleOta::getOtaAppVersion()
expression: esp_ota_get_partition_description(part, &app_desc)
DEBUG | ??:??:?? 0 No OTA firmware available
INFO  | ??:??:?? 0 Initializing NodeDB
INFO  | ??:??:?? 0 Loading /prefs/db.proto
INFO  | ??:??:?? 0 Loaded /prefs/db.proto successfully
INFO  | ??:??:?? 0 Loaded saved devicestate version 22, with nodecount: 3
INFO  | ??:??:?? 0 Loading /prefs/config.proto
[   846][E][vfs_api.cpp:105] open(): /littlefs/oem/oem.proto does not exist, no permits for creation
[  1077][D][esp32-hal-cpu.c:244] setCpuFrequencyMhz(): PLL: 480 / 6 = 80 Mhz, APB: 80000000 Hz
@azokthedefiler1 azokthedefiler1 added the bug Something isn't working label Apr 26, 2024
@thebentern thebentern self-assigned this May 5, 2024
@thebentern
Contributor

Can you try with the latest version of the ATAK plugin? There were bug fixes, and I want to rule out any client side data issues.

@azokthedefiler1
Author

It's still crashing/rebooting, but it seems to happen less often. I've updated to:

Android App: 2.3.7
Firmware: 2.3.7.30fbcab Beta
ATAK Plugin: 1.0.16

I've attached the debug logs from one reboot to the next for two Heltec V3s. I got this info using "pio" from the command line, but if there's a better way, just let me know.

debug-mt-0420.txt
debug-mt-ca70.txt

@antichamber

antichamber commented May 7, 2024

I have the same issue.
Firmware: 2.3.7.30fbcab Beta
ATAK Plugin: 1.0.16

@thebentern
Contributor

Please test out @niccellular's latest release of the ATAK plugin (https://github.com/meshtastic/ATAK-Plugin/releases) and this firmware:
#3922

Hopefully less crashy 😄

@azokthedefiler1
Author

azokthedefiler1 commented May 17, 2024

I'm testing with 2.3.9.f06c56a firmware and 1.0.21 ATAK-Plugin, and it's still crashing about every 60 seconds.

How do I build your PR on Linux? I made it this far using https://meshtastic.org/docs/development/firmware/build/ and some Googling:

git clone -b master https://github.com/meshtastic/firmware.git
cd firmware
git fetch origin pull/3922/head:pull_3922
git checkout pull_3922
git submodule update --init

I have PlatformIO installed, but trying to run pio commands seems to just hang.

@thebentern
Contributor

@azokthedefiler1
Author

Thanks! That was way easier than trying to build myself.

Just updated from 2.3.9 to 2.3.10.da52ebd and it's still rebooting. I'll try a complete device wipe, and then collect logs.

@thebentern
Contributor

Thanks for testing! Look forward to seeing your serial logs if the issue persists.

@azokthedefiler1
Author

I did a fresh install rather than an upgrade, and then set minimal options: BT Fixed Pin, Role = TAK, and channel Short/Fast. I've attached the debug log from pio from one crash to the next. This time, though, I noticed the crash doesn't seem to be NimBLE related, but related to MeshPacket size?

assert failed: bool perhapsDecode(meshtastic_MeshPacket*) Router.cpp:314 (rawSize <= sizeof(bytes))

Hopefully this helps!

mt-0420_2024-05-17.txt

@azokthedefiler1
Author

Here's the trimmed log from the second device:

mt-ca70_2024-05-17.txt

@thebentern
Contributor

@azokthedefiler1 does this occur on Long/Fast as well?

@azokthedefiler1
Author

Yes. I've gone through all the prefab channels and my own custom one too. Short/Fast just seems to crash the fastest, with about 30-60 seconds between crashes. Long/Fast can stretch that out to anywhere from 1-5 minutes.

@niccellular

Can you also try the toggle in the Meshtastic ATAK plugin, "Only send PLI and Chat messages over Meshtastic"?

@azokthedefiler1
Author

azokthedefiler1 commented May 17, 2024

I've been testing with WiFi disabled, no server connection, so it's forced to go over Meshtastic. I'll flip that switch anyway and update this post in a few minutes.

UPDATE:
Watching the logs, one device is on line 2150 and has not rebooted in about 30 minutes! The other just rebooted twice in the last minute, but had reached 650+ lines before it rebooted. Both are sitting idle: no movement, no GeoChat.

@azokthedefiler1
Author

Update 2: Enabling "Only send PLI and Chat messages over Meshtastic" has made a drastic difference. The Heltecs are still rebooting, but it now looks like about 60 minutes between reboots.

@thebentern
Contributor

I believe there's still some underlying issue with certain payloads not playing nice in the firmware, but I have added a safeguard in my latest commit to the PR to prevent them from rebooting the device at least.
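
The difference between an assert and a safeguard of this kind can be modeled in a few lines. This is only a Python sketch of the idea, not the code from the PR: the function names and the `BUF_SIZE` value are hypothetical, and the real check lives in C++ in Router.cpp.

```python
from typing import Optional

# Hypothetical buffer size for illustration only; the real limit is whatever
# sizeof(bytes) evaluates to in Router.cpp.
BUF_SIZE = 256


def decode_asserting(raw: bytes, buf_size: int = BUF_SIZE) -> bytes:
    """Models the crashing path: an oversized payload aborts the program,
    which on the device means a reboot."""
    if len(raw) > buf_size:
        raise AssertionError("rawSize <= sizeof(bytes)")
    return bytes(raw)


def decode_guarded(raw: bytes, buf_size: int = BUF_SIZE) -> Optional[bytes]:
    """Models a safeguard: an oversized payload is dropped (None is returned)
    and the caller keeps running."""
    if len(raw) > buf_size:
        return None
    return bytes(raw)
```

With the guarded version, a bad payload is still bad (it is discarded rather than decoded), but it no longer takes the device down with it.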

@niccellular

When my Heltec V3 arrives, I'll look more into the payload issue.

@azokthedefiler1
Author

azokthedefiler1 commented May 18, 2024

I left the Heltecs running for 8+ hours and collected these other errors in the attached files. To sum up so far:

With "Only send PLI and Chat messages over Meshtastic" disabled, it reboots about every 60 seconds, and the crash error is:

assert failed: bool perhapsDecode(meshtastic_MeshPacket*) Router.cpp:314 (rawSize <= sizeof(bytes))

With "Only send PLI and Chat messages over Meshtastic" enabled, it reboots about every 60 minutes, and the crash error is:

Guru Meditation Error: Core 0 panic'ed (Unhandled debug exception).
Debug exception reason: Stack canary watchpoint triggered (nimble_host)

There was one crash slightly different:

Guru Meditation Error: Core 0 panic'ed (Unhandled debug exception).
Debug exception reason: Stack canary watchpoint triggered (btController)

mt-0420_guru_2024-05-18.txt
mt-ca70-guru_2024-05-18.txt

It looks like if the "Only send..." setting is disabled, the device never runs long enough to trigger the BT crash (nimble_host / btController). I am not a programmer, so that's just an educated guess.
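
For anyone tallying these signatures across long serial logs, a minimal Python sketch follows. It assumes the log lines look like the excerpts quoted in this comment; the function name `classify_crashes` and the label strings are made up for illustration.

```python
import re
from collections import Counter

# Patterns based on the crash lines quoted above.
CANARY_RE = re.compile(r"Stack canary watchpoint triggered \((\w+)\)")
ASSERT_RE = re.compile(r"assert failed: .*?(\S+\.cpp):(\d+)")


def classify_crashes(log_text: str) -> Counter:
    """Count crash signatures in a serial log, keyed by a short label."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CANARY_RE.search(line)
        if match:
            # Group 1 is the FreeRTOS task name, e.g. nimble_host.
            counts[f"stack canary ({match.group(1)})"] += 1
            continue
        match = ASSERT_RE.search(line)
        if match:
            # Groups are the source file and line number of the assert.
            counts[f"assert {match.group(1)}:{match.group(2)}"] += 1
    return counts
```

Usage would be along the lines of `classify_crashes(open("mt-0420_guru_2024-05-18.txt", errors="replace").read())`, which returns a `Counter` mapping each signature to how many times it appeared.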

@thebentern
Contributor

Try this one now. It should prevent the reboot/crash, but those payloads remain bad.
https://github.com/meshtastic/firmware/actions/runs/9140550399/artifacts/1515882133

@antichamber

antichamber commented May 18, 2024

When I try to flash firmware-heltec-v2_1-2.3.10.2744525.zip, I get this error:

C:\Users\redst\Downloads\firmware-heltec-v2_1-2.3.10.2744525.zip (1)>device-update.bat -f firmware-heltec-v2_1-2.3.10.2744525-update.bin
Trying to flash update firmware-heltec-v2_1-2.3.10.2744525-update.bin
esptool.py v4.7.0
Found 2 serial ports
Serial port COM12
Connecting...
Detecting chip type... ESP32-S3
Chip is ESP32-S3 (QFN56) (revision v0.2)
Features: WiFi, BLE, Embedded Flash 8MB (GD)
Crystal is 40MHz
MAC: 64:e8:33:64:72:90
Uploading stub...
Running stub...
Stub running...
Configuring flash size...
Unexpected chip id in image. Expected 9 but value was 0. Is this image for a different chip model?
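
The "Expected 9 but value was 0" message can be checked before flashing. The sketch below assumes the .bin begins with a standard ESP-IDF application image header, in which the first byte is the magic value 0xE9 and a 16-bit little-endian chip ID sits at byte offset 12; in ESP-IDF's chip ID list, 0 is the original ESP32 and 9 is the ESP32-S3. The function names are hypothetical.

```python
import struct

ESP_IMAGE_MAGIC = 0xE9

# Subset of ESP-IDF chip IDs (esp_chip_id_t).
CHIP_IDS = {
    0x0000: "ESP32",
    0x0002: "ESP32-S2",
    0x0005: "ESP32-C3",
    0x0009: "ESP32-S3",
}


def image_chip_id(header: bytes) -> int:
    """Return the chip ID stored in an ESP application image header."""
    if len(header) < 14:
        raise ValueError("need at least the first 14 bytes of the image")
    if header[0] != ESP_IMAGE_MAGIC:
        raise ValueError("not an ESP application image (bad magic byte)")
    # chip_id is a little-endian uint16 at offset 12.
    (chip_id,) = struct.unpack_from("<H", header, 12)
    return chip_id


def describe_image(header: bytes) -> str:
    """Map the chip ID in the header to a readable chip name."""
    chip_id = image_chip_id(header)
    return CHIP_IDS.get(chip_id, f"unknown (0x{chip_id:04x})")
```

Reading the first 24 bytes of the update .bin and passing them to `describe_image` would show which chip family the image was built for, which in this case is consistent with a non-S3 image being flashed to an ESP32-S3 board.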

@thebentern
Contributor

thebentern commented May 18, 2024

Aaaaah! My mistake. I grabbed the V2 instead!
Here is the correct URL for the V3: https://github.com/meshtastic/firmware/actions/runs/9140550399/artifacts/1515878461


@azokthedefiler1
Author

It's running much better now, but there are still nimble_host crashes. The logs below cover about 1 hour, with 3 reboots on the first device and one reboot on the second. This is with the new 2.3.10.2744525 firmware linked above and with "Only send PLI..." still disabled.

mt-0420-guru_2024-05-19.txt
mt-ca70-guru_2024-05-19.txt
