source: build_stream_source_reader failure may block cluster #16813

Open
xxchan opened this issue May 19, 2024 · 5 comments
Labels
type/bug Something isn't working

@xxchan
Member

xxchan commented May 19, 2024

Describe the bug

If build_stream_source_reader fails, e.g. because the connection to the external system is lost, the SourceExecutor stops working and barriers can no longer be collected.

It seems that some sources, like Kafka, do not fail when creating the reader, while others, like Pub/Sub, do.

Error message/log

No response

To Reproduce

  1. risedev d <profile-with-pubsub>
  2. create source with Pub/Sub emulator
  3. risedev k && risedev d <profile-without-pubsub>

Now the cluster stops working:

dev=> create table t(x int);
ERROR:  Failed to run the query

Caused by these errors (recent errors listed first):
  1: gRPC request to meta service failed: Internal error
  2: The cluster is bootstrapping

2024-05-19T14:56:19.017581+08:00 ERROR risingwave_stream::task::stream_manager: actor exit with error actor_id=56 error=Executor error: Connector error: tonic error : transport error: error trying to connect: tcp connect error: Connection refused (os error 61)

Backtrace:
   0: std::backtrace_rs::backtrace::libunwind::trace
             at /rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/std/src/../../backtrace/src/backtrace/libunwind.rs:105:5
   1: std::backtrace_rs::backtrace::trace_unsynchronized
             at /rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2: std::backtrace::Backtrace::create
             at /rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/std/src/backtrace.rs:331:13
   3: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from
             at /Users/xxchan/.cargo/registry/src/index.crates.io-6f17d22bba15001f/anyhow-1.0.81/src/error.rs:565:25
   4: <T as core::convert::Into<U>>::into
             at /rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/core/src/convert/mod.rs:759:9
   5: anyhow::kind::Trait::new
             at /Users/xxchan/.cargo/registry/src/index.crates.io-6f17d22bba15001f/anyhow-1.0.81/src/kind.rs:95:9
   6: <risingwave_connector::source::google_pubsub::source::reader::PubsubSplitReader as risingwave_connector::source::base::SplitReader>::new::{{closure}}::{{closure}}
             at ./src/connector/src/source/google_pubsub/source/reader.rs:131:60
   7: core::result::Result<T,E>::map_err
             at /rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/core/src/result.rs:829:27
   8: <risingwave_connector::source::google_pubsub::source::reader::PubsubSplitReader as risingwave_connector::source::base::SplitReader>::new::{{closure}}
             at ./src/connector/src/source/google_pubsub/source/reader.rs:131:22
   9: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/core/src/future/future.rs:123:9
  10: risingwave_connector::source::base::create_split_reader::{{closure}}
             at ./src/connector/src/source/base.rs:111:75
  11: <F as futures_core::future::TryFuture>::try_poll
             at /Users/xxchan/.cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-core-0.3.30/src/future.rs:82:9
  12: <futures_util::future::try_future::into_future::IntoFuture<Fut> as core::future::future::Future>::poll
             at /Users/xxchan/.cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-util-0.3.30/src/future/try_future/into_future.rs:34:9
  13: <F as futures_core::future::TryFuture>::try_poll
             at /Users/xxchan/.cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-core-0.3.30/src/future.rs:82:9
  14: <futures_util::future::try_maybe_done::TryMaybeDone<Fut> as core::future::future::Future>::poll
             at /Users/xxchan/.cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-util-0.3.30/src/future/try_maybe_done.rs:79:57
  15: <F as futures_core::future::TryFuture>::try_poll
             at /Users/xxchan/.cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-core-0.3.30/src/future.rs:82:9
  16: <futures_util::future::try_join_all::TryJoinAll<F> as core::future::future::Future>::poll
             at /Users/xxchan/.cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-util-0.3.30/src/future/try_join_all.rs:163:32
  17: risingwave_connector::source::reader::reader::SourceReader::to_stream::{{closure}}
             at ./src/connector/src/source/reader/reader.rs:168:18
  18: risingwave_stream::executor::source::source_executor::SourceExecutor<S>::build_stream_source_reader::{{closure}}
             at ./src/stream/src/executor/source/source_executor.rs:136:14
  19: <await_tree::future::Instrumented<F,_> as core::future::future::Future>::poll
             at /Users/xxchan/.cargo/registry/src/index.crates.io-6f17d22bba15001f/await-tree-0.2.1/src/future.rs:119:15
  20: risingwave_stream::executor::source::source_executor::SourceExecutor<S>::execute_with_stream_source::{{closure}}
             at ./src/stream/src/executor/source/source_executor.rs:464:14

Expected behavior

The cluster continues working when there's an issue within a source.
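To make the expectation concrete, here is a rough sketch of a loop that keeps passing barriers through while the reader build is retried once per epoch. Everything in it (`ReaderState`, `Output`, `run_epochs`, the `build` and `read` closures) is a hypothetical stand-in for illustration, not the actual SourceExecutor structure, and whether this is the right behavior is still open for discussion.

```rust
/// What the hypothetical loop emits downstream.
#[derive(Debug, Clone, PartialEq)]
pub enum Output {
    Barrier(u64),
    Data(String),
}

/// The reader is either not built yet or ready to be polled.
pub enum ReaderState<R> {
    Pending,
    Ready(R),
}

/// Processes one barrier per epoch. While the reader cannot be built, the
/// failure is recorded and the barrier is still forwarded, so barrier
/// collection is never blocked by the broken source.
pub fn run_epochs<R>(
    epochs: &[u64],
    mut build: impl FnMut() -> Result<R, String>,
    mut read: impl FnMut(&mut R) -> String,
) -> (Vec<Output>, Vec<String>) {
    let mut state: ReaderState<R> = ReaderState::Pending;
    let mut outputs = Vec::new();
    let mut errors = Vec::new();

    for &epoch in epochs {
        // Retry the build at most once per epoch instead of failing the actor.
        if matches!(state, ReaderState::Pending) {
            match build() {
                Ok(reader) => state = ReaderState::Ready(reader),
                Err(e) => errors.push(e),
            }
        }
        // Only emit data once the reader exists.
        if let ReaderState::Ready(reader) = &mut state {
            outputs.push(Output::Data(read(reader)));
        }
        // The barrier is forwarded regardless of the reader state.
        outputs.push(Output::Barrier(epoch));
    }
    (outputs, errors)
}
```

With a `build` closure that fails twice and then succeeds, the first two epochs yield only barriers and the third yields one data chunk followed by its barrier. A real implementation would also need backoff and a way to surface the recorded errors to the user.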

How did you deploy RisingWave?

No response

The version of RisingWave

No response

Additional context

No response

xxchan added the type/bug label on May 19, 2024
github-actions bot added this to the release-1.10 milestone on May 19, 2024
@BugenZhao
Member

> It seems some sources like Kafka will not fail when creating reader

Could you provide more details on how this is possible? Is it because the connection is actually established when first being polled or something else?

@BugenZhao
Member

> 2: The cluster is bootstrapping

Perhaps we should also consider enriching this error message with the failure details whenever a retry happens during bootstrap recovery.
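A minimal sketch of what such an enriched message could look like. `BootstrapStatus`, its fields, and `record_failure` are hypothetical names made up for this illustration and do not correspond to existing meta service types; the message wording is also only an assumption.

```rust
use std::fmt;

/// Hypothetical record of the bootstrap recovery progress.
#[derive(Debug, Default)]
pub struct BootstrapStatus {
    /// Number of recovery attempts that have failed so far.
    pub failed_attempts: u32,
    /// Cause of the most recent failed attempt, if any.
    pub last_failure: Option<String>,
}

impl BootstrapStatus {
    /// Remembers why the latest recovery attempt failed.
    pub fn record_failure(&mut self, cause: impl Into<String>) {
        self.failed_attempts += 1;
        self.last_failure = Some(cause.into());
    }
}

impl fmt::Display for BootstrapStatus {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "The cluster is bootstrapping")?;
        // Append the failure details only when a retry actually happened.
        if let Some(cause) = &self.last_failure {
            write!(
                f,
                " (recovery attempt {} failed: {})",
                self.failed_attempts, cause
            )?;
        }
        Ok(())
    }
}
```

Before any failure the message stays as it is today; after `record_failure` is called, the user sees the cause (for example the `Connection refused` from the actor) directly in the query error instead of having to dig through the compute node log.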

@BugenZhao
Member

> Expected behavior
>
> The cluster continues working when there's an issue within a source.

I'm not sure if this is the behavior we should guarantee. At least the cluster is not permanently blocked (like #16693) as the user can run DROP to resolve the problem now.

@xxchan
Member Author

xxchan commented May 20, 2024

> At least the cluster is not permanently blocked (like #16693) as the user can run DROP to resolve the problem now.

Indeed. I thought it was blocked because I could not CREATE new jobs, but DROP seems to work. That is a little surprising; how is it possible?

——

Oh, it's the drop-on-recovery feature.

@xxchan
Member Author

xxchan commented May 20, 2024

> > It seems some sources like Kafka will not fail when creating reader
>
> Could you provide more details on how this is possible? Is it because the connection is actually established when first being polled or something else?

That's quite likely. In any case, the behavior is largely determined by the SDK.
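To illustrate the difference being discussed, here is a simplified sketch of an eagerly connecting client versus a lazily connecting one. `EagerReader`, `LazyReader`, `dial`, and `ConnectError` are hypothetical stand-ins and do not model the real Kafka or Pub/Sub SDKs; the network is simulated with a boolean flag.

```rust
use std::fmt;

/// Stand-in for a transport-level failure such as "Connection refused".
#[derive(Debug, Clone, PartialEq)]
pub struct ConnectError(pub String);

impl fmt::Display for ConnectError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "connect error: {}", self.0)
    }
}

impl std::error::Error for ConnectError {}

/// Hypothetical dial function: succeeds only if the endpoint is reachable.
fn dial(endpoint_up: bool) -> Result<(), ConnectError> {
    if endpoint_up {
        Ok(())
    } else {
        Err(ConnectError("Connection refused".to_string()))
    }
}

/// Eager client: connects in the constructor, so creating the reader can fail.
pub struct EagerReader;

impl EagerReader {
    pub fn new(endpoint_up: bool) -> Result<Self, ConnectError> {
        dial(endpoint_up)?;
        Ok(EagerReader)
    }

    pub fn poll(&mut self) -> Result<&'static str, ConnectError> {
        Ok("message")
    }
}

/// Lazy client: construction never touches the network; the first poll does.
pub struct LazyReader {
    endpoint_up: bool,
    connected: bool,
}

impl LazyReader {
    pub fn new(endpoint_up: bool) -> Self {
        LazyReader { endpoint_up, connected: false }
    }

    pub fn poll(&mut self) -> Result<&'static str, ConnectError> {
        if !self.connected {
            dial(self.endpoint_up)?;
            self.connected = true;
        }
        Ok("message")
    }
}
```

With an unreachable endpoint, `EagerReader::new(false)` already returns an error, which corresponds to a failure while building the reader. `LazyReader::new(false)` always succeeds, and the same error only shows up on the first `poll`.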
