
do not panic when recover from disk failure #354

Closed

Conversation

@kk47 commented May 20, 2024

We use dragonboat to run an on-disk state machine. When we tested removing all dragonboat data, including the raft log and the NodeHost data (which is equivalent to a physical disk failure), and then restarted the dragonboat process, it panicked in handleHeartbeatMessage. This does not seem reasonable, because the disk failure then leaves the replica failed forever.

Version: v4.0.0-20231222133740-1d6e2d76cd57
Action:
1. stop the process
2. rm -fr /path/to/dragonboat-data/*
3. start the process
Log:
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.872609 D | dragonboat: [00002:00001] on disk SM is beng initialized
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882507 I | rsm: [00003:00001] opened disk SM, index 0
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882526 I | rsm: [00003:00001] no snapshot available during launch
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882541 D | dragonboat: [00003:00001] completed recoverRequested
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882556 I | rsm: [00002:00001] opened disk SM, index 0
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882560 I | rsm: [00002:00001] no snapshot available during launch
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882566 I | rsm: [00004:00001] opened disk SM, index 0
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882570 I | rsm: [00004:00001] no snapshot available during launch
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882577 I | dragonboat: [00003:00001] initialized using <00003:00001:0>
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882597 I | dragonboat: [00003:00001] initial index set to 0
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882605 I | dragonboat: [00004:00001] initialized using <00004:00001:0>
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882633 I | dragonboat: [00004:00001] initial index set to 0
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882645 I | dragonboat: [00002:00001] initialized using <00002:00001:0>
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882677 I | dragonboat: [00002:00001] initial index set to 0
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882692 D | dragonboat: [00004:00001] completed recoverRequested
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882722 D | dragonboat: [00002:00001] completed recoverRequested
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882830 I | rsm: [00001:00001] opened disk SM, index 0
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882846 I | rsm: [00001:00001] no snapshot available during launch
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882862 D | dragonboat: [00001:00001] completed recoverRequested
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882882 I | dragonboat: [00001:00001] initialized using <00001:00001:0>
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882900 I | dragonboat: [00001:00001] initial index set to 0
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.895241 W | raft: [f:1,l:3,t:1,c:3,a:0] [00004:00001] t3 received Heartbeat with higher term (66) from n00002
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.895281 W | raft: [f:1,l:3,t:1,c:3,a:0] [00004:00001] t3 become follower after receiving higher term from n00002
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.895318 I | raft: [f:1,l:3,t:1,c:3,a:0] [00004:00001] t66 became follower
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.895325 C | raft: invalid commitTo index 46355, lastIndex() 3
5月 20 17:21:08 kk1 xx[193242]: panic: invalid commitTo index 46355, lastIndex() 3
5月 20 17:21:08 kk1 xx[193242]: goroutine 318 [running]:
5月 20 17:21:08 kk1 xx[193242]: github.com/lni/goutils/logutil/capnslog.(*PackageLogger).Panicf(0x20?, {0x36444c5?, 0xc00012a110?}, {0xc0002b22a0?, 0xc000e5c410?, 0xc001159a18?})
5月 20 17:21:08 kk1 xx[193242]: /root/go/pkg/mod/github.com/lni/goutils@v1.4.0/logutil/capnslog/pkg_logger.go:88 +0xbb
5月 20 17:21:08 kk1 xx[193242]: github.com/lni/dragonboat/v4/logger.(*capnsLog).Panicf(0xc000e5c3f0?, {0x36444c5?, 0x41e225?}, {0xc0002b22a0?, 0x2ef8700?, 0xc18ae361355d7800?})
5月 20 17:21:08 kk1 xx[193242]: /root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20231222133740-1d6e2d76cd57/logger/capnslogger.go:74 +0x26
5月 20 17:21:08 kk1 xx[193242]: github.com/lni/dragonboat/v4/logger.(*dragonboatLogger).Panicf(0xb513?, {0x36444c5, 0x29}, {0xc0002b22a0, 0x2, 0x2})
5月 20 17:21:08 kk1 xx[193242]: /root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20231222133740-1d6e2d76cd57/logger/logger.go:135 +0x57
5月 20 17:21:08 kk1 xx[193242]: github.com/lni/dragonboat/v4/internal/raft.(*entryLog).commitTo(0xc0002b4310, 0xb513)
5月 20 17:21:08 kk1 xx[193242]: /root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20231222133740-1d6e2d76cd57/internal/raft/logentry.go:341 +0x102
5月 20 17:21:08 kk1 xx[193242]: github.com/lni/dragonboat/v4/internal/raft.(*raft).handleHeartbeatMessage(, {0x11, 0x1, 0x2, 0x4, 0x42, 0x0, 0x0, 0xb513, 0x0, ...})
5月 20 17:21:08 kk1 xx[193242]: /root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20231222133740-1d6e2d76cd57/internal/raft/raft.go:1398 +0x48
5月 20 17:21:08 kk1 xx[193242]: github.com/lni/dragonboat/v4/internal/raft.(*raft).handleFollowerHeartbeat(, {0x11, 0x1, 0x2, 0x4, 0x42, 0x0, 0x0, 0xb513, 0x0, ...})
5月 20 17:21:08 kk1 xx[193242]: /root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20231222133740-1d6e2d76cd57/internal/raft/raft.go:2134 +0x85
5月 20 17:21:08 kk1 xx[193242]: github.com/lni/dragonboat/v4/internal/raft.defaultHandle(, {0x11, 0x1, 0x2, 0x4, 0x42, 0x0, 0x0, 0xb513, 0x0, ...})
5月 20 17:21:08 kk1 xx[193242]: /root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20231222133740-1d6e2d76cd57/internal/raft/raft.go:2332 +0x7a
5月 20 17:21:08 kk1 xx[193242]: github.com/lni/dragonboat/v4/internal/raft.(*raft).Handle(, {0x11, 0x1, 0x2, 0x4, 0x42, 0x0, 0x0, 0xb513, 0x0, ...})
5月 20 17:21:08 kk1 xx[193242]: /root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20231222133740-1d6e2d76cd57/internal/raft/raft.go:1601 +0x102
5月 20 17:21:08 kk1 xx[193242]: github.com/lni/dragonboat/v4/internal/raft.(*Peer).Handle(_, {0x11, 0x1, 0x2, 0x4, 0x42, 0x0, 0x0, 0xb513, 0x0, ...})

@kevburnsjr (Contributor) commented May 20, 2024

Dropping heartbeats like this could be super dangerous, and panicking seems like the correct response here.

Looks to me like you're not using NodeHostConfig.DefaultNodeRegistryEnabled with GossipConfig.

Using these options should cause the node to generate a new NodeHost ID when it starts up with a blank disk, even if the IP address doesn't change.

Have you tried this configuration?
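
For reference, a minimal sketch of what such a configuration might look like; the directories, addresses, and ports below are placeholders, not values taken from this thread:

```go
package main

import (
	"github.com/lni/dragonboat/v4"
	"github.com/lni/dragonboat/v4/config"
)

func newNodeHost() (*dragonboat.NodeHost, error) {
	nhc := config.NodeHostConfig{
		NodeHostDir:    "/path/to/dragonboat-data", // placeholder
		RTTMillisecond: 200,
		RaftAddress:    "10.0.0.1:63000", // placeholder raft address
		// With the default node registry enabled, replicas are addressed by
		// NodeHost ID rather than by a fixed raft address.
		DefaultNodeRegistryEnabled: true,
		Gossip: config.GossipConfig{
			BindAddress:      "10.0.0.1:7100", // placeholder gossip bind address
			AdvertiseAddress: "10.0.0.1:7100", // placeholder
			Seed:             []string{"10.0.0.2:7100", "10.0.0.3:7100"}, // placeholder seed nodes
		},
	}
	return dragonboat.NewNodeHost(nhc)
}
```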

@kk47 (Author) commented May 21, 2024

> Dropping heartbeats like this could be super dangerous, and panicking seems like the correct response here.
>
> Looks to me like you're not using NodeHostConfig.DefaultNodeRegistryEnabled with GossipConfig.
>
> Using these options should cause the node to generate a new NodeHost ID when it starts up with a blank disk, even if the IP address doesn't change.
>
> Have you tried this configuration?

Thanks! I have tried NodeHostConfig.DefaultNodeRegistryEnabled with gossip communication, and the disk failure test passed. One remaining question: we still use the same NodeHostId values; we do not change them after cleaning the raft log directory and restarting the process.

```go
initMembers := map[uint64]string{
	1: "fddd4733-8608-4393-928d-e3e543494e4a",
	2: "bc9eea7b-aa62-477e-9878-c61c9bdcafd8",
	3: "7f2b8b83-074b-462a-8da1-5d6e69bacf5e",
}

nodehost.StartOnDiskReplica(initMembers, false, fsm, cfg)
```
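
One way to check whether the ID actually changes after the data directory is wiped is to log it once the NodeHost has started; a small sketch, assuming `nh` is the *dragonboat.NodeHost returned by NewNodeHost and `log` is the standard library logger:

```go
// The NodeHost ID is persisted under NodeHostDir, so removing that directory
// and restarting the process should print a different value here.
log.Printf("NodeHost ID: %s", nh.ID())
```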

@kk47 closed this May 23, 2024
@lni (Owner) commented May 30, 2024

@kk47

Please note that Raft is a non-Byzantine distributed system, which means it trusts all other peers. Once enough followers have acknowledged a proposal, it is accepted and committed, because the leader is assured that as long as those followers remain available, the committed data will remain available as well. That dependability is the fundamental requirement.

What you experimented with above is very different from that: you removed the replica's data but let the follower continue to exist. Nodes become confused because earlier promises were not kept.

Upon detecting a disk failure, your application should panic first and wait for the broken disk to be replaced. On recovery, membership changes should be used first to remove the now-dead follower from the shard, and then a new replica with a new replica ID should be added to the shard to replace the old dead one.
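
A rough sketch of that recovery flow; the shard and replica IDs below are illustrative, `nh` is a healthy NodeHost in the shard, and `newTarget` is the NodeHost ID (or raft address) of the host that will run the replacement replica:

```go
package main

import (
	"context"
	"time"

	"github.com/lni/dragonboat/v4"
)

// replaceDeadReplica removes the replica that lost its disk and adds a
// replacement under a brand new replica ID.
func replaceDeadReplica(nh *dragonboat.NodeHost, newTarget string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	const (
		shardID       uint64 = 1 // shard that lost a replica (illustrative)
		deadReplicaID uint64 = 2 // replica whose disk failed (illustrative)
		newReplicaID  uint64 = 5 // a replica ID never used before in this shard
	)

	if err := nh.SyncRequestDeleteReplica(ctx, shardID, deadReplicaID, 0); err != nil {
		return err
	}
	// The host identified by newTarget then starts the new replica with
	// join=true and an empty initial member list.
	return nh.SyncRequestAddReplica(ctx, shardID, newReplicaID, newTarget, 0)
}
```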
