Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Blackhole implementation for e2e tests #17938

Draft
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

henrybear327
Copy link
Contributor

@henrybear327 henrybear327 commented May 5, 2024

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

Based on Fu Wei's idea discussed in the issue, we employ the blocking on L7 but without using external tools.

Problem

A peer will
(a) receive traffic from its peers
(b) initiate connections to its peers (via stream and pipeline).

Thus, the current mechanism of only blocking peer traffic via the peer's existing proxy is insufficient, since only scenario (a) is handled, and scenario (b) is not blocked at all.

Solution - main idea

We introduce an "HTTP proxy" for each peer, which will be proxying all the connections initiated from a peer to its peers.

The modified architecture will look something like this:

A -- A's forward proxy ----- B's (currently existing) reverse proxy - B
     ^ newly introduced   ^ in the original codebase

By adding this HTTP proxy, we can block all in and out traffic that is initiated from a peer to others, without having to resort to external tools, such as iptables. It's verified that the blocking of traffic is complete, compared to previous solutions [2][3].

Implementation

The main subtasks are

The result is that for every peer, we will have the arch like this

A -- A's HTTP proxy (connections initiated from A will be forwarded from this proxy)
 |   ^ covers case (b)
 |
 --- A's (currently existing) proxy (advertised to other peers where the connection should come in from)
     ^ covers case (a)

Test

  • make gofail-enable && make build && make gofail-disable && \ go test -timeout 60s -run ^TestBlackholeByMockingPartitionLeader$ go.etcd.io/etcd/tests/v3/e2e -v -count=1
  • make gofail-enable && make build && make gofail-disable && \ go test -timeout 60s -run ^TestBlackholeByMockingPartitionFollower$ go.etcd.io/etcd/tests/v3/e2e -v -count=1

Known issue

  • I run into context deadline exceeded sometimes
    etcd_mix_versions_test.go:175: 
                Error Trace:    /Users/henrybear327/go/src/etcd/tests/e2e/etcd_mix_versions_test.go:175
                                                        /Users/henrybear327/go/src/etcd/tests/e2e/blackhole_test.go:75
                                                        /Users/henrybear327/go/src/etcd/tests/e2e/blackhole_test.go:31
                Error:          Received unexpected error:
                                [/Users/henrybear327/go/src/etcd/bin/etcdctl --endpoints=http://localhost:20006 put key-0 value-0] match not found.  Set EXPECT_DEBUG for more info Errs: [unexpected exit code [1] after running [/Users/henrybear327/go/src/etcd/bin/etcdctl --endpoints=http://localhost:20006 put key-0 value-0]], last lines:
                                {"level":"warn","ts":"2024-05-05T23:02:36.809726+0800","logger":"etcd-client","caller":"v3@v3.6.0-alpha.0/retry_interceptor.go:65","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0x140001ee960/localhost:20006","method":"/etcdserverpb.KV/Put","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
                                Error: context deadline exceeded
                                 (expected "OK", got []). Try EXPECT_DEBUG=TRUE
                Test:           TestBlackholeByMockingPartitionLeader
                Messages:       failed to put "key-0", error: [/Users/henrybear327/go/src/etcd/bin/etcdctl --endpoints=http://localhost:20006 put key-0 value-0] match not found.  Set EXPECT_DEBUG for more info Errs: [unexpected exit code [1] after running [/Users/henrybear327/go/src/etcd/bin/etcdctl --endpoints=http://localhost:20006 put key-0 value-0]], last lines:
                                {"level":"warn","ts":"2024-05-05T23:02:36.809726+0800","logger":"etcd-client","caller":"v3@v3.6.0-alpha.0/retry_interceptor.go:65","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0x140001ee960/localhost:20006","method":"/etcdserverpb.KV/Put","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
                                Error: context deadline exceeded
                                 (expected "OK", got []). Try EXPECT_DEBUG=TRUE

[References]
[1] issue #17737
[2] PR (V1) https://github.com/henrybear327/etcd/tree/fix/e2e_blackhole
[3] PR (V2) #17891

@k8s-ci-robot
Copy link

Hi @henrybear327. Thanks for your PR.

I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@henrybear327 henrybear327 force-pushed the fix/e2e_blackhole_proxy_review_forward_proxy_review branch from 9887b13 to 3e15f96 Compare May 5, 2024 15:12
@henrybear327
Copy link
Contributor Author

/cc @siyuanfoundation @fuweid

@henrybear327 henrybear327 force-pushed the fix/e2e_blackhole_proxy_review_forward_proxy_review branch 6 times, most recently from e94e96d to 48d1936 Compare May 5, 2024 15:44
@ivanvc
Copy link
Member

ivanvc commented May 5, 2024

/retitle Fix Blackhole implementation for e2e tests

@k8s-ci-robot k8s-ci-robot changed the title Fix Blockhole implementation for e2e tests Fix Blackhole implementation for e2e tests May 5, 2024
pkg/proxy/server.go Outdated Show resolved Hide resolved
pkg/proxy/server.go Outdated Show resolved Hide resolved
@henrybear327 henrybear327 force-pushed the fix/e2e_blackhole_proxy_review_forward_proxy_review branch from 48d1936 to 6158507 Compare May 6, 2024 05:16
@serathius
Copy link
Member

serathius commented May 6, 2024

Not a networking expert, would like to get some help.

cc @aojea
Can you help review this?

@henrybear327 henrybear327 force-pushed the fix/e2e_blackhole_proxy_review_forward_proxy_review branch 3 times, most recently from 763b0d3 to 465b00c Compare May 6, 2024 16:09
henrybear327 added a commit to henrybear327/etcd that referenced this pull request May 6, 2024
Thanks to Fu Wei for the input regarding etcd-io#17938.

[Problem]

A peer will
(a) receive traffic from its peers
(b) initiate connections to its peers (via stream and pipeline).

Thus, the current mechanism of only blocking peer traffic via the peer's existing proxy is insufficient, since only scenario (a) is handled, and scenario (b) is not blocked at all.

[Main idea]

We introduce 1 shared "HTTP proxy" for all peers. All peers will be proxying all the connections through it.

The modified architecture will look something like this:
```
A -- shared HTTP proxy ----- B
     ^ newly introduced
```

By adding this HTTP proxy, we can block all in and out traffic that is initiated from a peer to others, without having to resort to external tools, such as iptables. It's verified that the blocking of traffic is complete, compared to previous solutions [2][3].

[Implementation]

The main subtasks are
- set up an environment variable `FORWARD_PROXY`, because go will not parse HTTP_PROXY and HTTPS_PROXY that is using localhost or 127.0.0.1, regardless if the port is present or not
- implement the shared HTTP proxy by extending the existing proxy server code
- remove existing proxy setup (the per-peer proxy)
- implement enable/disable of the HTTP proxy in the e2e test

[Testing]

- `make gofail-enable && make build && make gofail-disable && \
go test -timeout 60s -run ^TestBlackholeByMockingPartitionLeader$ go.etcd.io/etcd/tests/v3/e2e -v -count=1`
- `make gofail-enable && make build && make gofail-disable && \
go test -timeout 60s -run ^TestBlackholeByMockingPartitionFollower$ go.etcd.io/etcd/tests/v3/e2e -v -count=1`

[References]
[1] Tracking issue etcd-io#17737
[2] PR (V1) https://github.com/henrybear327/etcd/tree/fix/e2e_blackhole
[3] PR (V2) etcd-io#17891
[4] PR (V3) etcd-io#17938

Signed-off-by: Chun-Hung Tseng <henrybear327@gmail.com>
henrybear327 added a commit to henrybear327/etcd that referenced this pull request May 6, 2024
Thanks to Fu Wei for the input regarding etcd-io#17938.

[Problem]

A peer will
(a) receive traffic from its peers
(b) initiate connections to its peers (via stream and pipeline).

Thus, the current mechanism of only blocking peer traffic via the peer's existing proxy is insufficient, since only scenario (a) is handled, and scenario (b) is not blocked at all.

[Main idea]

We introduce 1 shared HTTP proxy for all peers. All peers will be proxying all the connections through it.

The modified architecture will look something like this:
```
A -- shared HTTP proxy ----- B
     ^ newly introduced
```

By adding this HTTP proxy, we can block all in and out traffic that is initiated from a peer to others, without having to resort to external tools, such as iptables. It's verified that the blocking of traffic is complete, compared to previous solutions [2][3].

[Implementation]

The main subtasks are
- set up an environment variable `FORWARD_PROXY`, because go will not parse HTTP_PROXY and HTTPS_PROXY that is using localhost or 127.0.0.1, regardless if the port is present or not
- implement the shared HTTP proxy by extending the existing proxy server code (we need to be able to identify the source sender)
- remove existing proxy setup (the per-peer proxy)
- implement enable/disable of the HTTP proxy in the e2e test

[Testing]

- `make gofail-enable && make build && make gofail-disable && \
go test -timeout 60s -run ^TestBlackholeByMockingPartitionLeader$ go.etcd.io/etcd/tests/v3/e2e -v -count=1`
- `make gofail-enable && make build && make gofail-disable && \
go test -timeout 60s -run ^TestBlackholeByMockingPartitionFollower$ go.etcd.io/etcd/tests/v3/e2e -v -count=1`

[References]
[1] Tracking issue etcd-io#17737
[2] PR (V1) https://github.com/henrybear327/etcd/tree/fix/e2e_blackhole
[3] PR (V2) etcd-io#17891
[4] PR (V3) etcd-io#17938

Signed-off-by: Chun-Hung Tseng <henrybear327@gmail.com>
pkg/proxy/server.go Outdated Show resolved Hide resolved
@serathius
Copy link
Member

Talked with @aojea, proposed L7 forward proxy is an improvement over the previous faulty approach, however it might not be necessarily what we want. We want to simulate a realistic network disruptions, and there is no way to do that using a L7 proxy. This is currently impossible with current setup where etcd processes communicate via localhost, to modify packets we need them to go through kernel network stack, and that's not possible if they share the same network stack.

Unfortunately setting dedicated network stack this is pretty complicated, @aojea suggest using docker which sets it up for each container, however I'm worried about how much using docker will complicate etcd e2e testing. I would prefer to set up network ourselves, however this makes it less portable. Main benefit of using docker is portability across platforms.

As we won't be able to build such solution anytime soon, I'm ok with still using L7 http proxy in the meantime.

// then a nil URL and nil error will be returned.
// Thus, we need to workaround this by manually setting an
// ENV named FORWARD_PROXY and parse the URL (which is a localhost in our case)
if httpProxy, exists := os.LookupEnv("FORWARD_PROXY"); exists {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't know if we want to support a new environment variable FORWARD_PROXY for proxying.

I know that http.ProxyFromEnvironment(req) will ignore HTTP_PROXY when it URL targets localhost, but what about just changing the LookupEnv to check for HTTP_PROXY and return early?

cc @ahrtr

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I did it this way is that I don't want to interfere with the default Golang behavior, since this code change will be reflected in the normally built binaries and I am not sure if this will break some other parts of the system or existing user assumption/usage. (my main concern is that HTTP_PROXY or HTTPS_PROXY parsing behavior will change if we use LookupEnv to intercept the HTTP_PROXY or HTTPS_PROXY and if it's set with localhost or 127.0.0.1)

I am too new to judge so I need @ahrtr 's help to confirm this!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest that we do not change this very basic common function. Instead, we can extend its functionality on top of it in test only using the Decorator pattern.

Specifically, you can add a similar function NewTransport in test framework, it calls this function, but it also lookup the environment variable "HTTP_PROXY" or "HTTPS_PROXY", and set the proxy no matter it's localhost or not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other option is to enhance the test framework to enable contributors to use a non localhost IP, so that you can use environment variable "HTTP_PROXY" or "HTTPS_PROXY" to set a proxy directly, and no need to change this common function NewTransport, similar to what #17661 does.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will look into those 2 approaches :) Thank you @ahrtr

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarify on my previous comment #17938 (comment),

We can enable users to inject a customized NewTransport, similar to

lookupSRV = net.LookupSRV // net.DefaultResolver.LookupSRV when ctxs don't conflict
resolveTCPAddr = net.ResolveTCPAddr

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ahrtr for the code snippets!

After looking into the suggestions, I think it's hard to obtain a non-localhost IP and use it as a proxy during e2e testing, so I went on to look into injecting a customized NewTransport.

My understanding is that the e2e test is not like a unit test where we can change the function pointers during the execution of the unit test dynamically (like #17938 (comment)) because they are in the same module. Instead, the client code is compiled before the e2e test is run. Therefore, some sort of communication channel is required for the e2e test to indicate its need to change the code path during testing (not as easy as swapping lookupSRV), so I use the ENV E2E_TEST_FORWARD_PROXY_IP as a way to pass in the proxy address.

My current attempt is c93e3fc. The idea is kind of borrowed from #17938 (comment), where the HTTP proxy parsing function in use will be determined during the first Transport initialization. I currently can't come up with a cleaner way without having to use a custom ENV though... :( I am not using HTTP_PROXY or HTTPS_PROXY as we would like to not affect the default behavior of http.ProxyFromEnvironment (not parse localhost addresses), and only use a custom NewTransport during e2e testing only when E2E_TEST_FORWARD_PROXY_IP is set.

Not sure if this makes sense to you! Happy to know if there is a better way of implementing this. :)

@henrybear327 henrybear327 force-pushed the fix/e2e_blackhole_proxy_review_forward_proxy_review branch from f243d4d to 9df3252 Compare May 12, 2024 09:55
@henrybear327
Copy link
Contributor Author

henrybear327 commented May 12, 2024

Talked with @aojea, proposed L7 forward proxy is an improvement over the previous faulty approach, however it might not be necessarily what we want. We want to simulate a realistic network disruptions, and there is no way to do that using a L7 proxy. This is currently impossible with current setup where etcd processes communicate via localhost, to modify packets we need them to go through kernel network stack, and that's not possible if they share the same network stack.
Unfortunately setting dedicated network stack this is pretty complicated, @aojea suggest using docker which sets it up for each container, however I'm worried about how much using docker will complicate etcd e2e testing. I would prefer to set up network ourselves, however this makes it less portable. Main benefit of using docker is portability across platforms.
As we won't be able to build such solution anytime soon, I'm ok with still using L7 http proxy in the meantime.

Hey @serathius, thanks for the detailed write-up! I agree that what we want to do eventually, like dropping HTTP packets without having to resort to HTTPS, reordering of the packets, etc., isn't doable on L7. I will finish this PR properly, and then research the potential setup using docker and submit a PoC afterward!

@fuweid and I have discussed my other PR changes #17950 (comment), and I am currently working on implementing it. By doing it this way, we will not need to have both forward and reverse proxies, just 1 forward proxy per peer, and we should be good to go!

WDYT? :)

This PR is forward + reverse proxy, and #17985 (a PoC) is keeping only the forward proxy.

This current PR is easier to blackhole traffic, since for the peer we would like to blackhole, we just need to drop all the traffic to and from its forward and reverse proxies, whereas if we only have the reverse proxy, we need to configure proxies from all peers to perform traffic filtering. But if we only keep the forward proxy, we can reduce the hop count on our e2e testing network and make our set up slimmer and cleaner.

@henrybear327 henrybear327 force-pushed the fix/e2e_blackhole_proxy_review_forward_proxy_review branch from 9df3252 to e8fd61d Compare May 15, 2024 22:15
@henrybear327 henrybear327 marked this pull request as draft May 15, 2024 23:11
@henrybear327 henrybear327 force-pushed the fix/e2e_blackhole_proxy_review_forward_proxy_review branch 4 times, most recently from 0410867 to 954c141 Compare May 16, 2024 20:59
@henrybear327
Copy link
Contributor Author

henrybear327 commented May 16, 2024

@fuweid Based on the discussion here, I implemented the L7 proxy via ServeHTTP function, and now only keeping 1 forward proxy per peer.

I did not have a unified implementation for this PR because the current proxy implementation has "pause/delay accepting connections" features built-in, and these can only be done at L4 instead of L7.

I checked the codebase and "pause accepting connections" isn't used, and "delay accepting connections" is only used during the proxy server's unit test, so in a sense, we can remove the L4 server altogether if we remove those 2 features. But if we ever need those 2 features again, we will need to reimplement the L4 proxy...

@henrybear327 henrybear327 force-pushed the fix/e2e_blackhole_proxy_review_forward_proxy_review branch from 954c141 to ca534e6 Compare May 16, 2024 21:25
siyuanfoundation and others added 7 commits May 18, 2024 21:42
Signed-off-by: Siyuan Zhang <sizhang@google.com>
Shorten the wait time for the open connection to expire to 5s

Signed-off-by: Chun-Hung Tseng <henrybear327@gmail.com>
Based on Fu Wei's idea discussed in the [issue](etcd-io#17737),
we employ the blocking on L7 but without using external tools.

[Background]

A peer will
(a) receive traffic from its peers
(b) initiate connections to its peers (via stream and pipeline).

Thus, the current mechanism of only blocking peer traffic via the peer's
existing proxy is insufficient, since only scenario (a) is handled, and
scenario (b) is not blocked at all.

[Proposed solution]

We introduce an forward proxy for each peer, which will be proxying all
the connections initiated from a peer to its peers.

The modified architecture will look something like this:
```
A -- A's forward proxy ----- B's reverse proxy - B
     ^ newly introduced      ^ in the original codebase (renamed)
```

By adding this forward proxy, we can block all in and out traffic that
is initiated from a peer to others, without having to resort to external
tools, such as iptables.

[Testing]
- `make gofail-enable && make build && make gofail-disable && \
go test -timeout 60s -run ^TestBlackholeByMockingPartitionLeader$ go.etcd.io/etcd/tests/v3/e2e -v -count=1`
- `make gofail-enable && make build && make gofail-disable && \
go test -timeout 60s -run ^TestBlackholeByMockingPartitionFollower$ go.etcd.io/etcd/tests/v3/e2e -v -count=1`

[Issues]
- I run into `context deadline exceeded` sometimes
```
    etcd_mix_versions_test.go:175:
                Error Trace:    /Users/henrybear327/go/src/etcd/tests/e2e/etcd_mix_versions_test.go:175
                                                        /Users/henrybear327/go/src/etcd/tests/e2e/blackhole_test.go:75
                                                        /Users/henrybear327/go/src/etcd/tests/e2e/blackhole_test.go:31
                Error:          Received unexpected error:
                                [/Users/henrybear327/go/src/etcd/bin/etcdctl --endpoints=http://localhost:20006 put key-0 value-0] match not found.  Set EXPECT_DEBUG for more info Errs: [unexpected exit code [1] after running [/Users/henrybear327/go/src/etcd/bin/etcdctl --endpoints=http://localhost:20006 put key-0 value-0]], last lines:
                                {"level":"warn","ts":"2024-05-05T23:02:36.809726+0800","logger":"etcd-client","caller":"v3@v3.6.0-alpha.0/retry_interceptor.go:65","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0x140001ee960/localhost:20006","method":"/etcdserverpb.KV/Put","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
                                Error: context deadline exceeded
                                 (expected "OK", got []). Try EXPECT_DEBUG=TRUE
                Test:           TestBlackholeByMockingPartitionLeader
                Messages:       failed to put "key-0", error: [/Users/henrybear327/go/src/etcd/bin/etcdctl --endpoints=http://localhost:20006 put key-0 value-0] match not found.  Set EXPECT_DEBUG for more info Errs: [unexpected exit code [1] after running [/Users/henrybear327/go/src/etcd/bin/etcdctl --endpoints=http://localhost:20006 put key-0 value-0]], last lines:
                                {"level":"warn","ts":"2024-05-05T23:02:36.809726+0800","logger":"etcd-client","caller":"v3@v3.6.0-alpha.0/retry_interceptor.go:65","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0x140001ee960/localhost:20006","method":"/etcdserverpb.KV/Put","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
                                Error: context deadline exceeded
                                 (expected "OK", got []). Try EXPECT_DEBUG=TRUE
```

[References]
[1] issue etcd-io#17737
[2] PR (V1) https://github.com/henrybear327/etcd/tree/fix/e2e_blackhole
[3] PR (V2) etcd-io#17891

Signed-off-by: Chun-Hung Tseng <henrybear327@gmail.com>

It's verified that the blocking of traffic is complete, compared to
previous solutions [2][3].

[Implementation]

The main subtasks are
- set up an environment variable `E2E_TEST_FORWARD_PROXY_IP`
- implement forward proxy by extending the existing proxy server code
- implement enable/disable of the forward proxy in the e2e test

The result is that for every peer, we will have the arch like this
```
A -- A's forward proxy
     (connections initiated from A will be forwarded from this proxy)
 |   ^ covers case (b)
 |
 --- A's (currently existing) reverse proxy
     (advertised to other peers where the connection should come in from)
     ^ covers case (a)
```
Signed-off-by: Chun-Hung Tseng <henrybear327@gmail.com>
Signed-off-by: Chun-Hung Tseng <henrybear327@gmail.com>
Co-authored-by: Iván Valdés Castillo <iv@nvald.es>
There is no need of reverse proxy after the introduction of forward
proxy, as the forward proxy holds the information of the destination,
we can filter connection traffic from there, and this can reduce 1 hop
when the proxy is turned on.

Signed-off-by: Chun-Hung Tseng <henrybear327@gmail.com>
Due to forward proxy's need to parse the CONNECT header, which is a
L7 layer feature, thus we are splitting the proxy into 2 types, for
better maintainability.

Reference:
- etcd-io#17985 (comment)

Signed-off-by: Chun-Hung Tseng <henrybear327@gmail.com>
@henrybear327 henrybear327 force-pushed the fix/e2e_blackhole_proxy_review_forward_proxy_review branch 3 times, most recently from 9abb206 to c9c17ec Compare May 19, 2024 09:42
…witch to L7 proxy design

Signed-off-by: Chun-Hung Tseng <henrybear327@gmail.com>
@henrybear327 henrybear327 force-pushed the fix/e2e_blackhole_proxy_review_forward_proxy_review branch from c9c17ec to ecb34d5 Compare May 19, 2024 10:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

None yet

7 participants