-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
upstream: implementation of outlier detection extensions #34154
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
… monitor. Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to |
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
/retest |
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@wbpcode I see that you have been assigned for API approval, but changes in proto files are not really API changes, but rather changes to event log which is defined in terms of protobufs. Hope it helps! |
will the old outlier detection extension be eventually deprecated? (given that the new one lands) |
@@ -67,12 +68,12 @@ void DetectorHostMonitorImpl::updateCurrentSuccessRateBucket() { | |||
|
|||
void DetectorHostMonitorImpl::putHttpResponseCode(uint64_t response_code) { | |||
external_origin_sr_monitor_.incTotalReqCounter(); | |||
std::shared_ptr<DetectorImpl> detector = detector_.lock(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alternative approach we could take here is to disallow users to configure same outlier detection algorithm via traditional (old) way and via extension, or to fallback to new mechanism if both are configured. That would simplify the flow and make it easier to reason about ejection events. Does it even make sense to configure 2 same algorithms with diff params for same cluster, wonder what could be the use case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that use case could be to use "old" method for 5xx errors and "new" method for 4xx range. "old" has been battle tested and users trust it, but it is not expandable for ranges outside of 5xx. Once there is enough usage of extensions, we can start limiting config options. WDYT?
|
||
void ConsecutiveErrorsMonitor::onReset() { counter_ = 0; } | ||
|
||
class ConsecutiveErrorsMonitorFactory |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of Envoy extensions follow a specific code layout, this is not enforced but helps to easier navigate the code, group tests etc. The factories that parse the extension proto config and create extension instances are usually placed into config.h+cc
files. Extension related code goes into its dedicated header+impl files. For example:
extension config
extension
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont have string opinion here since some extensions deviate from this layout and this file does not have that much code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Thanks. I will leave it for now as it is.
namespace Outlier { | ||
|
||
bool ConsecutiveErrorsMonitor::onError() { | ||
if (counter_ < max_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This if block does not avoid race conditions, e.g. you have counter_ == 29 and max_ == 30:
- T1 reads the counter_ value in line 12
- T2 increments counter_ value to 30
- T1 executes line 13 and increases counter_ to 31
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could utilize atomic CAS primitive: https://en.cppreference.com/w/cpp/atomic/atomic_compare_exchange
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah! That is good point. Will change the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Converted to using CAS primitives. Thanks for catching it!
Deprecating old mechanism would as well reduce maintenance load on cognitive load for outlier detection api (where same thing can be achieved via different config mechanisms). Think we should do a proper deprecation cycle going forward. |
OK. Let me try to add warnings that "old" consecutive errors will be deprecated. Thanks! ==================================================================================== |
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
/retest |
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
/retest |
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
// no-op. Just keep executing compare_exchange_strong until threads synchronize. | ||
do { | ||
; | ||
} while (!counter_.compare_exchange_strong(expected_count, expected_count + 1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think the default memory order memory_order_seq_cst
here requires too much synchronization across worker threads, you could use relaxed order for cas failure and make writes to atomic visible to other threads on success (e.g. memory order release)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about just counter_++
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason is that while counter_
can be "atomically" increased, what we need is entire logic of incrementing and checking a value to be deterministic. This approach seems to be sort of pattern for this lock-less approach. See #34154 (comment)
@cpakulski thank you for all the work done! i added one more comment and believe this is now ready for further review. @wbpcode do you have capacity to review this one? (otherwise will tag senior maintainers) |
/assign-from @envoyproxy/senior-maintainers |
@envoyproxy/senior-maintainers assignee is @wbpcode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the great contribution. But it's so huge, so, only some initial comments to added.
# Outlier detection monitors | ||
/*/extensions/extensions/outlier_detection_monitors @cpakulski @nezdolik |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better to list the specific extension here. Like:
/*/extensions/extensions/outlier_detection_monitors/consecutive_errors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I can do that.
# Outlier Detection Monitors | ||
# | ||
# | ||
"envoy.outlier_detection_monitors.common": "//source/extensions/outlier_detection_monitors/common:outlier_detection_monitors_lib", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
envoy.outlier_detection_monitors.common
is not an extension and needn't to be listed here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. And I originally did not list it here, but unfortunately things are so interconnected in bazel/docs that without that line build fails. I do not remember what exactly failed, but I remember I spent few days trying to plug things so everything works and this line was just necessary.
envoy.outlier_detection_monitors.common: | ||
categories: | ||
- envoy.outlier_detection_monitors | ||
security_posture: robust_to_untrusted_downstream | ||
status: alpha | ||
undocumented: true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
envoy/upstream/outlier_detection.h
Outdated
// Types of outlier detection extension results which can be reported. | ||
enum class ExtResultType; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please just add the definition of this ResultType
in the header file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Moved full definition of ResultType
to the main header file.
/* | ||
* Class carries result of a transaction with upstream entity | ||
* or generated internally by Envoy. | ||
* Different categories of results will be derived from that base class. | ||
* Those categories of results are fed only into Outlier Detection extensions. | ||
*/ | ||
class ExtResult { | ||
public: | ||
ExtResult() = delete; | ||
ExtResult(ExtResultType type) : type_(type) {} | ||
virtual ExtResultType type() const { return type_; }; | ||
virtual ~ExtResult() = default; | ||
|
||
private: | ||
const ExtResultType type_; | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also doesn't get why this is necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is the base class for different types of results. So, Http results will be reported via a class derived from ExtResult. This is needed because when matching to user-defined buckets happens, the type is checked first. So, Http codes are matched only versus a bucket which has the same type. It does not make sense to match Http code against a bucket which has say database errors:
envoy/source/extensions/outlier_detection_monitors/common/monitor_base_impl.cc
Lines 34 to 38 in 43361bb
// if the bucket is not interested in this type of result/error | |
// just ignore it. | |
if (!bucket->matchType(result)) { | |
continue; | |
} |
The only common property across all different results is type.
class Monitor { | ||
public: | ||
Monitor(const std::string& name, uint32_t enforce) : name_(name), enforce_(enforce) {} | ||
Monitor() = delete; | ||
virtual ~Monitor() {} | ||
void reportResult(const ExtResult&); | ||
|
||
void | ||
setCallback(std::function<void(uint32_t, std::string, absl::optional<std::string>)> callback) { | ||
callback_ = callback; | ||
} | ||
|
||
void reset() { onReset(); } | ||
std::string name() const { return name_; } | ||
|
||
void processBucketsConfig( | ||
const envoy::extensions::outlier_detection_monitors::common::v3::ErrorBuckets& config); | ||
void addErrorBucket(ErrorsBucketPtr&& bucket) { buckets_.push_back(std::move(bucket)); } | ||
|
||
protected: | ||
virtual bool onError() PURE; | ||
virtual void onSuccess() PURE; | ||
virtual void onReset() PURE; | ||
// Default extra info is empty string. Descendant classes may overwrite it. | ||
virtual std::string getFailedExtraInfo() { return ""; } | ||
|
||
std::string name_; | ||
uint32_t enforce_{100}; | ||
std::vector<ErrorsBucketPtr> buckets_; | ||
std::function<void(uint32_t, std::string, absl::optional<std::string>)> callback_; | ||
}; | ||
|
||
using MonitorPtr = std::unique_ptr<Monitor>; | ||
|
||
class MonitorsSet { | ||
public: | ||
void addMonitor(MonitorPtr&& monitor) { monitors_.push_back(std::move(monitor)); } | ||
const std::vector<MonitorPtr>& monitors() { return monitors_; } | ||
|
||
private: | ||
std::vector<MonitorPtr> monitors_; | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please define the abstract interface of extended Monitor
in the outlier_detection.h
and use absl::InlinedVector<MonitorPtr, 3>
as MonitorSet
. By this way, the code base in common/upstream
needn't to depend on the code in extensions/
.
You can create a MonitorBase
here as common base class of different implementations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Good idea. Created base abstract class ExtMonitor
in outlier_detection.h
and derived intermediate class called ExtMonitorBase
in source/extensions...
. Other monitors can be derived from ExtMonitorBase
.
class MonitorFactoryContext { | ||
public: | ||
MonitorFactoryContext(ProtobufMessage::ValidationVisitor& validation_visitor) | ||
: validation_visitor_(validation_visitor) {} | ||
ProtobufMessage::ValidationVisitor& messageValidationVisitor() { return validation_visitor_; } | ||
|
||
private: | ||
ProtobufMessage::ValidationVisitor& validation_visitor_; | ||
}; | ||
|
||
class MonitorFactory : public Envoy::Config::TypedFactory { | ||
public: | ||
~MonitorFactory() override = default; | ||
|
||
virtual MonitorPtr createMonitor(const std::string& name, const Protobuf::Message& config, | ||
MonitorFactoryContext& context) PURE; | ||
|
||
std::string category() const override { return "envoy.outlier_detection_monitors"; } | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto. Move these classes to the outlier_detection.h
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. I am confused. I thought that outlier_detection.h
is mostly to define abstract classes. Are you sure that outlier_detection.h
is better location? I was under impression that source/extensions/...
should hide as many details as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. It would be great to hide more detail in the /source/extensions/...
. But this factory will be used in the /source/common/upstream
and typically is part of the core code.
More important point is that we should avoid the dependency from common -> extensions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. I had to move few classes to outlier_detection.h
and this allowed me to not to include source/extensions/...
.
|
||
// Store extensions' config. It will be used to create extensions monitors | ||
// when host are added to the cluster. | ||
extensions_config_ = config.monitors(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may need to get the monitor factory and load the monitor config here. Then, only do the creation of Monitor instances when we try to create the host detection monitor to ensure there is no any exception (like no factory, invalid config, etc.) will be throwed at that time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example:
// Factory should load the configuration and return a lambda.
using MonitorFn = std::function<MonitorPtr()>;
class MonitorFactory : public Envoy::Config::TypedFactory {
public:
~MonitorFactory() override = default;
virtual MonitorFn createMonitor(const std::string& name, const Protobuf::Message& config,
MonitorFactoryContext& context) PURE;
std::string category() const override { return "envoy.outlier_detection_monitors"; }
};
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@wbpcode Thanks for your initial comments. I agree with you that this is large PR. The additional problem is that extensions should work along with "previous" implementation. I believe that code can be cleaner, but keeping backwards compatibility sometimes requires building not-very clean approach. Once we start deprecating "old" way we can start cleaning the code. I addressed most of your comments. One or two comments require more thoughts and I will work on them tomorrow. |
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
/retest |
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
/retest |
Is this ready for a new review? cc @cpakulski |
Commit Message:
upstream: implementation of outlier detection extensions
Additional Description:
The idea and need for the extensions are described in RFC document: https://docs.google.com/document/d/1ZCZSoirVB39eOLdD0VPlsEUING8c23Sq5bzozrv6f4k/edit?usp=drive_link
In a nutshell, the design decouples types of result reported to outlier detector from an algorithm which marks a host as unhealthy. For example, an outlier detector may be configured to count 3 consecutive errors and type of those errors can be defined by a user. For example:
So, the algorithm does not really care about the exact type of reported result. It is only interested whether reported result should be considered an error or not.
The idea of using user-defined errors can be expanded to database errors. See issue #24215 (I have working prototype for errors reported by Redis).
This design puts extensions on top of already existing outlier detector. It means that the solution is 100% backwards compatible. Previous configs are accepted. But a user may configure outlier detection extension to
The implementation of extensions is built on top of already existing outlier detection structures. This minimizes code changes and re-uses event logger, timers, etc.
The implementation is built on top of already approved API for extensions: #31205
Risk Level: Low (previous configuration still works. Extensions do not have to be configured)
Testing: Added unit tests for new code and tests checking co-existence of "old" outlier and "new" extensions
Docs Changes: Yes. Added.
Release Notes: Yes.
Platform Specific Features: No
Fixes #18789