
do not panic when recover from disk failure #354

Closed

Conversation

@kk47 commented May 20, 2024

We use dragonboat to run an on-disk state machine. When we tested removing all dragonboat data, including the raft log and the NodeHost data (which is equivalent to a physical disk failure), and then restarted the dragonboat process, it panicked in handleHeartbeatMessage. This does not seem reasonable, because the disk failure then leaves the replica failed forever.

Version: v4.0.0-20231222133740-1d6e2d76cd57
Action:
1. stop the process
2. rm -fr /path/to/dragonboat-data/*
3. start the process
Log:
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.872609 D | dragonboat: [00002:00001] on disk SM is beng initialized
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882507 I | rsm: [00003:00001] opened disk SM, index 0
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882526 I | rsm: [00003:00001] no snapshot available during launch
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882541 D | dragonboat: [00003:00001] completed recoverRequested
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882556 I | rsm: [00002:00001] opened disk SM, index 0
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882560 I | rsm: [00002:00001] no snapshot available during launch
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882566 I | rsm: [00004:00001] opened disk SM, index 0
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882570 I | rsm: [00004:00001] no snapshot available during launch
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882577 I | dragonboat: [00003:00001] initialized using <00003:00001:0>
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882597 I | dragonboat: [00003:00001] initial index set to 0
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882605 I | dragonboat: [00004:00001] initialized using <00004:00001:0>
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882633 I | dragonboat: [00004:00001] initial index set to 0
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882645 I | dragonboat: [00002:00001] initialized using <00002:00001:0>
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882677 I | dragonboat: [00002:00001] initial index set to 0
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882692 D | dragonboat: [00004:00001] completed recoverRequested
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882722 D | dragonboat: [00002:00001] completed recoverRequested
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882830 I | rsm: [00001:00001] opened disk SM, index 0
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882846 I | rsm: [00001:00001] no snapshot available during launch
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882862 D | dragonboat: [00001:00001] completed recoverRequested
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882882 I | dragonboat: [00001:00001] initialized using <00001:00001:0>
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.882900 I | dragonboat: [00001:00001] initial index set to 0
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.895241 W | raft: [f:1,l:3,t:1,c:3,a:0] [00004:00001] t3 received Heartbeat with higher term (66) from n00002
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.895281 W | raft: [f:1,l:3,t:1,c:3,a:0] [00004:00001] t3 become follower after receiving higher term from n00002
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.895318 I | raft: [f:1,l:3,t:1,c:3,a:0] [00004:00001] t66 became follower
5月 20 17:21:08 kk1 xx[193242]: 2024-05-20 17:21:08.895325 C | raft: invalid commitTo index 46355, lastIndex() 3
5月 20 17:21:08 kk1 xx[193242]: panic: invalid commitTo index 46355, lastIndex() 3
5月 20 17:21:08 kk1 xx[193242]: goroutine 318 [running]:
5月 20 17:21:08 kk1 xx[193242]: github.com/lni/goutils/logutil/capnslog.(*PackageLogger).Panicf(0x20?, {0x36444c5?, 0xc00012a110?}, {0xc0002b22a0?, 0xc000e5c410?, 0xc001159a18?})
5月 20 17:21:08 kk1 xx[193242]: /root/go/pkg/mod/github.com/lni/goutils@v1.4.0/logutil/capnslog/pkg_logger.go:88 +0xbb
5月 20 17:21:08 kk1 xx[193242]: github.com/lni/dragonboat/v4/logger.(*capnsLog).Panicf(0xc000e5c3f0?, {0x36444c5?, 0x41e225?}, {0xc0002b22a0?, 0x2ef8700?, 0xc18ae361355d7800?})
5月 20 17:21:08 kk1 xx[193242]: /root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20231222133740-1d6e2d76cd57/logger/capnslogger.go:74 +0x26
5月 20 17:21:08 kk1 xx[193242]: github.com/lni/dragonboat/v4/logger.(*dragonboatLogger).Panicf(0xb513?, {0x36444c5, 0x29}, {0xc0002b22a0, 0x2, 0x2})
5月 20 17:21:08 kk1 xx[193242]: /root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20231222133740-1d6e2d76cd57/logger/logger.go:135 +0x57
5月 20 17:21:08 kk1 xx[193242]: github.com/lni/dragonboat/v4/internal/raft.(*entryLog).commitTo(0xc0002b4310, 0xb513)
5月 20 17:21:08 kk1 xx[193242]: /root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20231222133740-1d6e2d76cd57/internal/raft/logentry.go:341 +0x102
5月 20 17:21:08 kk1 xx[193242]: github.com/lni/dragonboat/v4/internal/raft.(*raft).handleHeartbeatMessage(, {0x11, 0x1, 0x2, 0x4, 0x42, 0x0, 0x0, 0xb513, 0x0, ...})
5月 20 17:21:08 kk1 xx[193242]: /root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20231222133740-1d6e2d76cd57/internal/raft/raft.go:1398 +0x48
5月 20 17:21:08 kk1 xx[193242]: github.com/lni/dragonboat/v4/internal/raft.(*raft).handleFollowerHeartbeat(, {0x11, 0x1, 0x2, 0x4, 0x42, 0x0, 0x0, 0xb513, 0x0, ...})
5月 20 17:21:08 kk1 xx[193242]: /root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20231222133740-1d6e2d76cd57/internal/raft/raft.go:2134 +0x85
5月 20 17:21:08 kk1 xx[193242]: github.com/lni/dragonboat/v4/internal/raft.defaultHandle(, {0x11, 0x1, 0x2, 0x4, 0x42, 0x0, 0x0, 0xb513, 0x0, ...})
5月 20 17:21:08 kk1 xx[193242]: /root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20231222133740-1d6e2d76cd57/internal/raft/raft.go:2332 +0x7a
5月 20 17:21:08 kk1 xx[193242]: github.com/lni/dragonboat/v4/internal/raft.(*raft).Handle(, {0x11, 0x1, 0x2, 0x4, 0x42, 0x0, 0x0, 0xb513, 0x0, ...})
5月 20 17:21:08 kk1 xx[193242]: /root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20231222133740-1d6e2d76cd57/internal/raft/raft.go:1601 +0x102
5月 20 17:21:08 kk1 xx[193242]: github.com/lni/dragonboat/v4/internal/raft.(*Peer).Handle(_, {0x11, 0x1, 0x2, 0x4, 0x42, 0x0, 0x0, 0xb513, 0x0, ...})

@kevburnsjr (Contributor) commented May 20, 2024

Dropping heartbeats like this could be super dangerous, and panicking seems like the correct response here.

Looks to me like you're not using NodeHostConfig.DefaultNodeRegistryEnabled with GossipConfig.

Using these options should cause the node to generate a new NodeHost ID when it starts up with a blank disk, even if the IP address doesn't change.

Have you tried this configuration?
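
For reference, a minimal sketch of what such a configuration might look like; the directories, addresses, and ports below are placeholders, not values taken from this thread:

```go
package main

import (
	"github.com/lni/dragonboat/v4"
	"github.com/lni/dragonboat/v4/config"
)

func newNodeHost() (*dragonboat.NodeHost, error) {
	nhc := config.NodeHostConfig{
		NodeHostDir:    "/path/to/dragonboat-data", // placeholder
		RTTMillisecond: 200,
		RaftAddress:    "10.0.0.1:63000", // placeholder raft address
		// With the default node registry enabled, replicas are addressed by
		// NodeHost ID rather than by a fixed raft address.
		DefaultNodeRegistryEnabled: true,
		Gossip: config.GossipConfig{
			BindAddress:      "10.0.0.1:7100", // placeholder gossip bind address
			AdvertiseAddress: "10.0.0.1:7100", // placeholder
			Seed:             []string{"10.0.0.2:7100", "10.0.0.3:7100"}, // placeholder seed nodes
		},
	}
	return dragonboat.NewNodeHost(nhc)
}
```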

@kk47 (Author) commented May 21, 2024

> Dropping heartbeats like this could be super dangerous, and panicking seems like the correct response here.
>
> Looks to me like you're not using NodeHostConfig.DefaultNodeRegistryEnabled with GossipConfig.
>
> Using these options should cause the node to generate a new NodeHost ID when it starts up with a blank disk, even if the IP address doesn't change.
>
> Have you tried this configuration?

Thanks! I have tried NodeHostConfig.DefaultNodeRegistryEnabled with gossip communication, and the disk failure test passed. One remaining question: we still use the same NodeHostId values; we do not change them after cleaning the raft log directory and restarting the process.

```go
initMembers := map[uint64]string{
	1: "fddd4733-8608-4393-928d-e3e543494e4a",
	2: "bc9eea7b-aa62-477e-9878-c61c9bdcafd8",
	3: "7f2b8b83-074b-462a-8da1-5d6e69bacf5e",
}

nodehost.StartOnDiskReplica(initMembers, false, fsm, cfg)
```
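
One way to check whether the ID actually changes after the data directory is wiped is to log it once the NodeHost has started; a small sketch, assuming `nh` is the *dragonboat.NodeHost returned by NewNodeHost and `log` is the standard library logger:

```go
// The NodeHost ID is persisted under NodeHostDir, so removing that directory
// and restarting the process should print a different value here.
log.Printf("NodeHost ID: %s", nh.ID())
```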

@kk47 closed this May 23, 2024
@lni (Owner) commented May 30, 2024

@kk47

Please note that Raft is a non-Byzantine distributed system, which means it trusts all other peers. Once enough followers have acknowledged a proposal, it is accepted and committed, because the leader is assured that as long as those followers remain available, the committed data will remain available as well. That dependability is the fundamental requirement.

What you experimented with above is very different from that: you removed the replica's data but let the follower continue to exist. Nodes become confused because earlier promises were not kept.

Upon detecting a disk failure, your application should panic first and wait for the broken disk to be replaced. On recovery, membership changes should be used first to remove the now-dead follower from the shard, and then a new replica with a new replica ID should be added to the shard to replace the old dead one.
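
A rough sketch of that recovery flow; the shard and replica IDs below are illustrative, `nh` is a healthy NodeHost in the shard, and `newTarget` is the NodeHost ID (or raft address) of the host that will run the replacement replica:

```go
package main

import (
	"context"
	"time"

	"github.com/lni/dragonboat/v4"
)

// replaceDeadReplica removes the replica that lost its disk and adds a
// replacement under a brand new replica ID.
func replaceDeadReplica(nh *dragonboat.NodeHost, newTarget string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	const (
		shardID       uint64 = 1 // shard that lost a replica (illustrative)
		deadReplicaID uint64 = 2 // replica whose disk failed (illustrative)
		newReplicaID  uint64 = 5 // a replica ID never used before in this shard
	)

	if err := nh.SyncRequestDeleteReplica(ctx, shardID, deadReplicaID, 0); err != nil {
		return err
	}
	// The host identified by newTarget then starts the new replica with
	// join=true and an empty initial member list.
	return nh.SyncRequestAddReplica(ctx, shardID, newReplicaID, newTarget, 0)
}
```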
