etcdutl snapshot restore cannot allocate memory #16052

Closed
4 tasks done
fatihusta opened this issue Jun 12, 2023 · 9 comments · Fixed by #17277 or #17978
Comments

@fatihusta
Contributor

Bug report criteria

What happened?

etcdutl crashes when I restore the db from a snapshot inside a container.

2023-06-09T12:07:38Z	panic	backend/backend.go:186	failed to open database	{"path": "/tmp/etcd/member/snap/db", "error": "cannot allocate memory"}

What did you expect to happen?

The snapshot restore should complete successfully instead of panicking.

How can we reproduce it (as minimally and precisely as possible)?

Restore a db from a snapshot inside a container that does not have enough memory (see the sketch below).
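
A sketch of one way to trigger it (the container image, tag, and memory limit here are assumptions based on the podman setup described later in this thread; whether it actually reproduces depends on how much memory the host or VM can map):

podman run --rm -it --memory=200m \
  -v "$PWD/snap.db:/snap.db:Z" \
  gcr.io/etcd-development/etcd:v3.5.7 \
  etcdutl snapshot restore /snap.db --data-dir=/tmp/etcd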

Anything else we need to know?

No response

Etcd version (please run commands below)

# Copied from the GitHub release
$ etcdutl version
etcdutl version: 3.5.7
API version: 3.5

# I built it from the main branch
$ ./etcdutl version
etcdutl version: 3.6.0-alpha.0
API version: 3.6

Etcd configuration (command line flags or environment variables)

paste your configuration here

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

ETCDCTL_API=3 ./etcdutl snapshot restore --skip-hash-check=true --data-dir=/tmp/etcd --name=etc02 --initial-cluster=etc01=http://10.88.0.19:2380,etc03=http://10.88.0.21:2380,etc02=http://10.88.0.20:2380 --initial-advertise-peer-urls=http://10.88.0.20:2380 /root/snap.db

2023-06-09T12:07:38Z	info	snapshot/v3_snapshot.go:257	restoring snapshot	{"path": "/root/snap.db", "wal-dir": "/tmp/etcd/member/wal", "data-dir": "/tmp/etcd", "snap-dir": "/tmp/etcd/member/snap"}
2023-06-09T12:07:38Z	panic	backend/backend.go:186	failed to open database	{"path": "/tmp/etcd/member/snap/db", "error": "cannot allocate memory"}
go.etcd.io/etcd/server/v3/storage/backend.newBackend
	go.etcd.io/etcd/server/v3@v3.6.0-alpha.0/storage/backend/backend.go:186
go.etcd.io/etcd/server/v3/storage/backend.NewDefaultBackend
	go.etcd.io/etcd/server/v3@v3.6.0-alpha.0/storage/backend/backend.go:170
go.etcd.io/etcd/etcdutl/v3/snapshot.(*v3Manager).saveDB
	go.etcd.io/etcd/etcdutl/v3/snapshot/v3_snapshot.go:303
go.etcd.io/etcd/etcdutl/v3/snapshot.(*v3Manager).Restore
	go.etcd.io/etcd/etcdutl/v3/snapshot/v3_snapshot.go:265
go.etcd.io/etcd/etcdutl/v3/etcdutl.SnapshotRestoreCommandFunc
	go.etcd.io/etcd/etcdutl/v3/etcdutl/snapshot_command.go:132
go.etcd.io/etcd/etcdutl/v3/etcdutl.snapshotRestoreCommandFunc
	go.etcd.io/etcd/etcdutl/v3/etcdutl/snapshot_command.go:102
github.com/spf13/cobra.(*Command).execute
	github.com/spf13/cobra@v1.7.0/command.go:944
github.com/spf13/cobra.(*Command).ExecuteC
	github.com/spf13/cobra@v1.7.0/command.go:1068
github.com/spf13/cobra.(*Command).Execute
	github.com/spf13/cobra@v1.7.0/command.go:992
main.Start
	go.etcd.io/etcd/etcdutl/v3/ctl.go:56
main.main
	go.etcd.io/etcd/etcdutl/v3/main.go:23
runtime.main
	runtime/proc.go:250
panic: failed to open database

goroutine 1 [running]:
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x2?, 0x2?, {0x0?, 0x0?, 0xc0006246c0?})
	go.uber.org/zap@v1.24.0/zapcore/entry.go:198 +0x65
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc00064cb60, {0xc0001c6600, 0x2, 0x2})
	go.uber.org/zap@v1.24.0/zapcore/entry.go:264 +0x3ec
go.uber.org/zap.(*Logger).Panic(0xc00048e318?, {0xbc27c8?, 0xa3dc06?}, {0xc0001c6600, 0x2, 0x2})
	go.uber.org/zap@v1.24.0/logger.go:258 +0x59
go.etcd.io/etcd/server/v3/storage/backend.newBackend({{0xc00048e318, 0x18}, 0x5f5e100, 0x2710, {0x0, 0x0}, 0x280000000, 0xc000652f50, 0x0, 0x0, ...})
	go.etcd.io/etcd/server/v3@v3.6.0-alpha.0/storage/backend/backend.go:186 +0x2ff
go.etcd.io/etcd/server/v3/storage/backend.NewDefaultBackend(0xc0005df110?, {0xc00048e318?, 0x1?})
	go.etcd.io/etcd/server/v3@v3.6.0-alpha.0/storage/backend/backend.go:170 +0xc5
go.etcd.io/etcd/etcdutl/v3/snapshot.(*v3Manager).saveDB(0xc0005dfbd8)
	go.etcd.io/etcd/etcdutl/v3/snapshot/v3_snapshot.go:303 +0xbe
go.etcd.io/etcd/etcdutl/v3/snapshot.(*v3Manager).Restore(0xc0005dfbd8, {{0x40008008d6, 0xd}, {0x40008007f2, 0x15}, {0x40008007e1, 0x9}, {0xc00048e0a8, 0x14}, {0xc000203ee0, ...}, ...})
	go.etcd.io/etcd/etcdutl/v3/snapshot/v3_snapshot.go:265 +0x7dc
go.etcd.io/etcd/etcdutl/v3/etcdutl.SnapshotRestoreCommandFunc({0x400080081a, 0x86}, {0xbba16b, 0xc}, {0x40008007e1?, 0x0?}, {0x0, 0x0}, {0x40008008bf, 0x16}, ...)
	go.etcd.io/etcd/etcdutl/v3/etcdutl/snapshot_command.go:132 +0x365
go.etcd.io/etcd/etcdutl/v3/etcdutl.snapshotRestoreCommandFunc(0xc00030ec00?, {0xc00064e4e0?, 0x6?, 0x6?})
	go.etcd.io/etcd/etcdutl/v3/etcdutl/snapshot_command.go:102 +0xed
github.com/spf13/cobra.(*Command).execute(0xc00030ec00, {0xc00064e480, 0x6, 0x6})
	github.com/spf13/cobra@v1.7.0/command.go:944 +0x847
github.com/spf13/cobra.(*Command).ExecuteC(0x1168fc0)
	github.com/spf13/cobra@v1.7.0/command.go:1068 +0x3bd
github.com/spf13/cobra.(*Command).Execute(...)
	github.com/spf13/cobra@v1.7.0/command.go:992
main.Start(...)
	go.etcd.io/etcd/etcdutl/v3/ctl.go:56
main.main()
	go.etcd.io/etcd/etcdutl/v3/main.go:23 +0x5e
@jmhbnz
Member

jmhbnz commented Jun 12, 2023

Hi @fatihusta - Can you please confirm whether you are running the etcdutl command from inside a container? If so, what are the memory limits for the container?

@fatihusta
Contributor Author

Hi @jmhbnz
Yes, I'm using the etcdutl command in a container.
Actually, I did not set a memory limit on the container; I think it hits the docker machine VM's memory limit. (I'm running 8 containers at the same time; 5 of them are etcd.) I'm just testing etcd backup/restore cases, not a production environment.

I suspected boltdb's initial memory-map size, so I prepared a patch. I will send a PR soon. I hope it's a good solution. Roughly, the idea looks like the sketch below.
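
A standalone sketch of the idea (this is not the PR itself; the path is illustrative, and the real change plumbs the value through etcd's backend config). The stack trace above shows 0x280000000 bytes (10 GiB) being passed to newBackend as the initial mmap size, which is what fails in a memory-constrained container:

package main

import (
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// etcd opens its backend via bbolt; with a 10 GiB InitialMmapSize the
	// mmap can fail with "cannot allocate memory" in a constrained container.
	opts := *bolt.DefaultOptions
	opts.InitialMmapSize = 100 * 1024 * 1024 // 100 MiB, as with --initial-memory-map-size=104857600

	// Illustrative path only; etcdutl derives the real one from --data-dir.
	db, err := bolt.Open("/tmp/etcd/member/snap/db", 0o600, &opts)
	if err != nil {
		log.Fatalf("failed to open database: %v", err)
	}
	defer db.Close()
	log.Println("opened db with a reduced initial mmap size")
}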

@ahrtr
Member

ahrtr commented Jul 19, 2023

@fatihusta What's the DB size? Have you verified that setting a small value (what value?) for --initial-memory-map-size with your PR #16055 can fix this issue?

@fatihusta
Contributor Author

@ahrtr I tested it and it worked.

I'm using podman on macOS (M1).

podman machine list
NAME                     VM TYPE     CREATED      LAST UP            CPUS        MEMORY      DISK SIZE
podman-machine-default*  qemu        7 weeks ago  Currently running  5           10.74GB     107.4GB

podman stats
ID            NAME         CPU %       MEM USAGE / LIMIT  MEM %       NET IO             BLOCK IO           PIDS        CPU TIME      AVG CPU %
30e4fe582399  etcd02  3.35%       531.6MB / 10.4GB   5.11%       3.623MB / 16.71MB  64.05MB / 65.54kB  105         1m5.450819s   10.47%
57fa36254568  etcd01  4.10%       528.5MB / 10.4GB   5.08%       5.215MB / 18.4MB   65.13MB / 65.54kB  107         1m11.439557s  11.43%
8b503ce908ff  lb02    2.13%       2.961GB / 10.4GB   28.48%      1.654MB / 1.15MB   21.94MB / 69.63kB  139         1m1.894951s   9.89%
95d93590a573  lb01    0.67%       461.8MB / 10.4GB   4.44%       1.178MB / 7.965MB  19.13MB / 73.73kB  102         52.58702s     8.40%
be072ed4ba2f  etcd03  3.40%       532.7MB / 10.4GB   5.12%       3.842MB / 16.41MB  64.48MB / 214.5kB  110         1m11.269335s  11.41%
f7da5af9414e  cfg01   3.22%       3.168GB / 10.4GB   30.47%      3.451MB / 15.53MB  58.1MB / 73.73kB   156         1m20.455725s  12.86%


Here is the full output with and without --initial-memory-map-size.

bash test_original.sh

du -sh snap.db
24K	snap.db

rm -rf /tmp/etcd

ETCDCTL_API=3 etcdutl snapshot restore snap.db \
  --skip-hash-check=true \
  --data-dir=/tmp/etcd \
  --name=etc02 \
  --initial-cluster=etc01=http://10.88.0.19:2380,etc03=http://10.88.0.21:2380,etc02=http://10.88.0.20:2380 \
  --initial-advertise-peer-urls=http://10.88.0.20:2380

# Stack message
2023-07-19T12:49:24Z	info	snapshot/v3_snapshot.go:248	restoring snapshot	{"path": "snap.db", "wal-dir": "/tmp/etcd/member/wal", "data-dir": "/tmp/etcd", "snap-dir": "/tmp/etcd/member/snap", "stack": "go.etcd.io/etcd/etcdutl/v3/snapshot.(*v3Manager).Restore\n\tgo.etcd.io/etcd/etcdutl/v3/snapshot/v3_snapshot.go:254\ngo.etcd.io/etcd/etcdutl/v3/etcdutl.SnapshotRestoreCommandFunc\n\tgo.etcd.io/etcd/etcdutl/v3/etcdutl/snapshot_command.go:147\ngo.etcd.io/etcd/etcdutl/v3/etcdutl.snapshotRestoreCommandFunc\n\tgo.etcd.io/etcd/etcdutl/v3/etcdutl/snapshot_command.go:117\ngithub.com/spf13/cobra.(*Command).execute\n\tgithub.com/spf13/cobra@v1.1.3/command.go:856\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tgithub.com/spf13/cobra@v1.1.3/command.go:960\ngithub.com/spf13/cobra.(*Command).Execute\n\tgithub.com/spf13/cobra@v1.1.3/command.go:897\nmain.Start\n\tgo.etcd.io/etcd/etcdutl/v3/ctl.go:50\nmain.main\n\tgo.etcd.io/etcd/etcdutl/v3/main.go:23\nruntime.main\n\truntime/proc.go:255"}
panic: failed to open database

goroutine 1 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0000ec6c0, {0xc00044c100, 0x2, 0x2})
	go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x499
go.uber.org/zap.(*Logger).Panic(0xc0000343f0, {0xb8395b, 0x11a4fa0}, {0xc00044c100, 0x2, 0x2})
	go.uber.org/zap@v1.17.0/logger.go:227 +0x59
go.etcd.io/etcd/server/v3/mvcc/backend.newBackend({{0xc0000343f0, 0x18}, 0x5f5e100, 0x2710, {0x0, 0x0}, 0x280000000, 0xc000110cd0, 0x0, 0x0, ...})
	go.etcd.io/etcd/server/v3@v3.5.7/mvcc/backend/backend.go:189 +0x43f
go.etcd.io/etcd/server/v3/mvcc/backend.NewDefaultBackend(...)
	go.etcd.io/etcd/server/v3@v3.5.7/mvcc/backend/backend.go:169
go.etcd.io/etcd/etcdutl/v3/snapshot.(*v3Manager).saveDB(0xc0005dfc50)
	go.etcd.io/etcd/etcdutl/v3/snapshot/v3_snapshot.go:295 +0x198
go.etcd.io/etcd/etcdutl/v3/snapshot.(*v3Manager).Restore(0xc0005dfc50, {{0x4000800813, 0x7}, {0x400080084e, 0x5}, {0x400080083d, 0x9}, {0xc000034180, 0x14}, {0xc000079810, ...}, ...})
	go.etcd.io/etcd/etcdutl/v3/snapshot/v3_snapshot.go:257 +0x8f9
go.etcd.io/etcd/etcdutl/v3/etcdutl.SnapshotRestoreCommandFunc({0x4000800866, 0x56}, {0xb7b8cb, 0xc}, {0x400080083d, 0x5355d1}, {0x0, 0x0}, {0x40008008db, 0x16}, ...)
	go.etcd.io/etcd/etcdutl/v3/etcdutl/snapshot_command.go:147 +0x365
go.etcd.io/etcd/etcdutl/v3/etcdutl.snapshotRestoreCommandFunc(0xc000445400, {0xc00007f320, 0x6, 0x6})
	go.etcd.io/etcd/etcdutl/v3/etcdutl/snapshot_command.go:117 +0xed
github.com/spf13/cobra.(*Command).execute(0xc000445400, {0xc00007f2c0, 0x6, 0x6})
	github.com/spf13/cobra@v1.1.3/command.go:856 +0x5f8
github.com/spf13/cobra.(*Command).ExecuteC(0x1165a20)
	github.com/spf13/cobra@v1.1.3/command.go:960 +0x3ad
github.com/spf13/cobra.(*Command).Execute(...)
	github.com/spf13/cobra@v1.1.3/command.go:897
main.Start(...)
	go.etcd.io/etcd/etcdutl/v3/ctl.go:50
main.main()
	go.etcd.io/etcd/etcdutl/v3/main.go:23 +0x5e



bash test_patched.sh

du -sh snap.db
24K	snap.db

rm -rf /tmp/etcd

ETCDCTL_API=3 ./etcdutl snapshot restore snap.db \
  --skip-hash-check=true \
  --data-dir=/tmp/etcd \
  --name=etc02 \
  --initial-advertise-peer-urls=http://10.88.0.20:2380 \
  --initial-cluster=etc01=http://10.88.0.19:2380,etc03=http://10.88.0.21:2380,etc02=http://10.88.0.20:2380 \
  --initial-memory-map-size=104857600 # 100 MiB

2023-07-19T12:49:30Z	info	snapshot/v3_snapshot.go:263	restoring snapshot	{"path": "snap.db", "wal-dir": "/tmp/etcd/member/wal", "data-dir": "/tmp/etcd", "snap-dir": "/tmp/etcd/member/snap", "initial-memory-map-size": 104857600}
2023-07-19T12:49:30Z	info	schema/membership.go:122	Trimming membership information from the backend...
2023-07-19T12:49:30Z	info	membership/cluster.go:403	added member	{"cluster-id": "3f205b6177ed4a5c", "local-member-id": "0", "added-peer-id": "13e536940eb0139f", "added-peer-peer-urls": ["http://10.88.0.21:2380"], "added-peer-is-learner": false}
2023-07-19T12:49:30Z	info	membership/cluster.go:403	added member	{"cluster-id": "3f205b6177ed4a5c", "local-member-id": "0", "added-peer-id": "59263d6b1540b571", "added-peer-peer-urls": ["http://10.88.0.19:2380"], "added-peer-is-learner": false}
2023-07-19T12:49:30Z	info	membership/cluster.go:403	added member	{"cluster-id": "3f205b6177ed4a5c", "local-member-id": "0", "added-peer-id": "c9533c31e73abc10", "added-peer-peer-urls": ["http://10.88.0.20:2380"], "added-peer-is-learner": false}
2023-07-19T12:49:30Z	info	snapshot/v3_snapshot.go:284	restored snapshot	{"path": "snap.db", "wal-dir": "/tmp/etcd/member/wal", "data-dir": "/tmp/etcd", "snap-dir": "/tmp/etcd/member/snap", "initial-memory-map-size": 104857600}

After you asked, I also validated on Linux: I tested on Ubuntu Linux with podman, setting a 200 MB memory limit per etcd container.

Here is the output.

du -sh snap.db
20K	snap.db

rm -rf /tmp/etcd

ETCDCTL_API=3 etcdutl snapshot restore snap.db  \
	--skip-hash-check=true \
	--data-dir=/tmp/etcd \
	--name=etc02 \
	--initial-cluster=etc01=http://10.88.0.43:2380,etc03=http://10.88.0.42:2380,etc02=http://10.88.0.41:2380 \
	--initial-advertise-peer-urls=http://10.88.0.41:2380

# Stack message
2023-07-19T15:45:16+03:00	info	snapshot/v3_snapshot.go:248	restoring snapshot	{"path": "snap.db", "wal-dir": "/tmp/etcd/member/wal", "data-dir": "/tmp/etcd", "snap-dir": "/tmp/etcd/member/snap", "stack": "go.etcd.io/etcd/etcdutl/v3/snapshot.(*v3Manager).Restore\n\tgo.etcd.io/etcd/etcdutl/v3/snapshot/v3_snapshot.go:254\ngo.etcd.io/etcd/etcdutl/v3/etcdutl.SnapshotRestoreCommandFunc\n\tgo.etcd.io/etcd/etcdutl/v3/etcdutl/snapshot_command.go:147\ngo.etcd.io/etcd/etcdutl/v3/etcdutl.snapshotRestoreCommandFunc\n\tgo.etcd.io/etcd/etcdutl/v3/etcdutl/snapshot_command.go:117\ngithub.com/spf13/cobra.(*Command).execute\n\tgithub.com/spf13/cobra@v1.1.3/command.go:856\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tgithub.com/spf13/cobra@v1.1.3/command.go:960\ngithub.com/spf13/cobra.(*Command).Execute\n\tgithub.com/spf13/cobra@v1.1.3/command.go:897\nmain.Start\n\tgo.etcd.io/etcd/etcdutl/v3/ctl.go:50\nmain.main\n\tgo.etcd.io/etcd/etcdutl/v3/main.go:23\nruntime.main\n\truntime/proc.go:255"}


2023-07-19T15:45:16+03:00	info	membership/store.go:141	Trimming membership information from the backend...
2023-07-19T15:45:16+03:00	info	membership/cluster.go:421	added member	{"cluster-id": "b8cdf9557d6554c8", "local-member-id": "0", "added-peer-id": "5f49874613b800fe", "added-peer-peer-urls": ["http://10.88.0.42:2380"]}
2023-07-19T15:45:16+03:00	info	membership/cluster.go:421	added member	{"cluster-id": "b8cdf9557d6554c8", "local-member-id": "0", "added-peer-id": "814cd60c86f73ec0", "added-peer-peer-urls": ["http://10.88.0.43:2380"]}
2023-07-19T15:45:16+03:00	info	membership/cluster.go:421	added member	{"cluster-id": "b8cdf9557d6554c8", "local-member-id": "0", "added-peer-id": "aebd9e0da60f2a61", "added-peer-peer-urls": ["http://10.88.0.41:2380"]}
2023-07-19T15:45:16+03:00	info	snapshot/v3_snapshot.go:269	restored snapshot	{"path": "snap.db", "wal-dir": "/tmp/etcd/member/wal", "data-dir": "/tmp/etcd", "snap-dir": "/tmp/etcd/member/snap"}


bash test_patched.sh

du -sh snap.db
20K	snap.db

rm -rf /tmp/etcd

ETCDCTL_API=3 ./etcdutl snapshot restore snap.db \
	--skip-hash-check=true \
	--data-dir=/tmp/etcd \
	--name=etc02 \
	--initial-cluster=etc01=http://10.88.0.43:2380,etc03=http://10.88.0.42:2380,etc02=http://10.88.0.41:2380 \
	--initial-advertise-peer-urls=http://10.88.0.41:2380 \
	--initial-memory-map-size=104857600 # 100 MiB

2023-07-19T15:53:04+03:00	info	snapshot/v3_snapshot.go:263	restoring snapshot	{"path": "snap.db", "wal-dir": "/tmp/etcd/member/wal", "data-dir": "/tmp/etcd", "snap-dir": "/tmp/etcd/member/snap", "initial-memory-map-size": 104857600}
2023-07-19T15:53:04+03:00	info	schema/membership.go:122	Trimming membership information from the backend...
2023-07-19T15:53:04+03:00	info	membership/cluster.go:403	added member	{"cluster-id": "b8cdf9557d6554c8", "local-member-id": "0", "added-peer-id": "5f49874613b800fe", "added-peer-peer-urls": ["http://10.88.0.42:2380"], "added-peer-is-learner": false}
2023-07-19T15:53:04+03:00	info	membership/cluster.go:403	added member	{"cluster-id": "b8cdf9557d6554c8", "local-member-id": "0", "added-peer-id": "814cd60c86f73ec0", "added-peer-peer-urls": ["http://10.88.0.43:2380"], "added-peer-is-learner": false}
2023-07-19T15:53:04+03:00	info	membership/cluster.go:403	added member	{"cluster-id": "b8cdf9557d6554c8", "local-member-id": "0", "added-peer-id": "aebd9e0da60f2a61", "added-peer-peer-urls": ["http://10.88.0.41:2380"], "added-peer-is-learner": false}
2023-07-19T15:53:04+03:00	info	snapshot/v3_snapshot.go:284	restored snapshot	{"path": "snap.db", "wal-dir": "/tmp/etcd/member/wal", "data-dir": "/tmp/etcd", "snap-dir": "/tmp/etcd/member/snap", "initial-memory-map-size": 104857600}

@ahrtr
Member

ahrtr commented Feb 14, 2024

Please backport the fix to 3.5 and add a changelog. thx
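
For reference, the usual flow is roughly as follows (branch and commit names are placeholders):

git fetch upstream
git checkout -b backport-16052-to-release-3.5 upstream/release-3.5
git cherry-pick <commit-sha-from-main>
# resolve any conflicts, open a PR against release-3.5,
# and add an entry to CHANGELOG/CHANGELOG-3.5.md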

@ahrtr
Member

ahrtr commented Feb 14, 2024

/assign ivanvc

@k8s-ci-robot

@ahrtr: GitHub didn't allow me to assign the following users: ivanvc.

Note that only etcd-io members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign ivanvc

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jmhbnz
Member

jmhbnz commented May 9, 2024

Discussed during the sig-etcd triage meeting. @ivanvc, the assignment above failed; did you still want to work on this?

@ivanvc
Member

ivanvc commented May 9, 2024

@jmhbnz, I didn't get this notification as the mention didn't have the at (@) 😅. I can backport this, no problem.
/assign
