[Outdated] Scaling workspace resources #2194

Open · wants to merge 29 commits into base: branch-24.08

Conversation

@achirkin (Contributor) commented Feb 22, 2024

Brief

Add another workspace memory resource that does not have an explicit memory limit. That is, after the change we have the following:

  1. rmm::mr::get_current_device_resource() is the default for all allocations, as before. It is used for allocations with unlimited lifetime, e.g. memory returned to the user.
  2. raft::get_workspace_resource() is for temporary allocations and is still limited to a fixed size, as before. However, it becomes smaller and should be used only for allocations that do not scale with the problem size. It defaults to a thin layer on top of the current_device_resource.
  3. raft::get_large_workspace_resource() (new) is for temporary allocations that scale with the problem size. Unlike the workspace_resource, its size is not fixed. By default, it points to the current_device_resource, but the user can set it to something backed by host memory (e.g. managed memory) to avoid OOM exceptions when there's not enough device memory left. (A configuration sketch follows this list.)
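For illustration only, here is a minimal sketch of how a downstream application might wire these three resources up. The RMM calls are the standard RMM API; the commented-out raft::resource setter names are assumptions inferred from this PR's description, not a confirmed interface:

```cpp
#include <rmm/mr/device/managed_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

#include <raft/core/device_resources.hpp>

int main()
{
  raft::device_resources handle;

  // 1. Default resource: a pool over device memory, used for allocations
  //    with unlimited lifetime (e.g. results returned to the user).
  rmm::mr::pool_memory_resource<rmm::mr::device_memory_resource> pool{
    rmm::mr::get_current_device_resource(), 1024ull * 1024 * 1024};
  rmm::mr::set_current_device_resource(&pool);

  // 2. Limited workspace: small, fixed-size temporary allocations; by default
  //    a thin layer over the resource above (assumed setter, hence commented):
  // raft::resource::set_workspace_resource(handle, ...);

  // 3. Large workspace: scalable temporaries backed by managed memory, so
  //    they can spill to host memory instead of failing with OOM
  //    (assumed setter, hence commented):
  rmm::mr::managed_memory_resource managed;
  // raft::resource::set_large_workspace_resource(handle, &managed);

  return 0;
}
```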

Problem

We have a list of issues/preferences/requirements, some of which contradict others:

  1. We rely on RMM to handle all allocations, and we often use rmm::mr::pool_memory_resource for performance reasons (to avoid lots of cudaMalloc calls in loops).
  2. Historically, we've used managed memory allocators as a workaround to avoid OOM errors or to improve speed (by increasing batch sizes).
  3. However, the design goal is to avoid setting allocators on our own and to give full control to the user (hence the workaround in (2) was removed).
  4. We introduced the workspace resource earlier to allow querying the available memory reliably and to maximize the batch sizes accordingly (see also issue #1310). Without this, some of our batched algorithms either fail with OOM or severely underperform due to small batch sizes.
  5. However, we cannot just put all of RAFT's temporary allocations into the limited workspace_resource, because some of them scale with the problem size and would inevitably fail with OOM at some point.
  6. Setting the workspace resource to managed memory is not advisable either, for performance reasons: we have lots of small allocations in performance-critical sections, so we need a pool; but a pool in managed memory inevitably outgrows the device memory and makes the whole program slow.

Solution

I propose to split the workspace memory into two:

  1. small, fixed-size workspace for small, frequent allocations
  2. large workspace for allocations that scale with the problem size

Notes:

  • We still leave full control over the allocator types to the user.
  • Neither of the workspace resources should be used for allocations with unlimited lifetime / memory returned to the user. As a result, if the user sets managed memory as the large workspace resource, the memory is guaranteed to be released after the function call.
  • We have the option to use slow managed memory without a pool for large allocations, while still using a fast pool for small allocations.
  • We have more flexible control over which allocations are "large" and which are "small", so hopefully using managed memory is not so bad for performance. (A usage sketch follows these notes.)
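As mentioned above, a usage sketch: how an algorithm internal might draw a problem-size-dependent temporary from the large workspace while keeping small scratch buffers in the limited workspace. The getter names follow the PR description; the raft::resource namespace, the exact signatures, and the buffer sizes are assumptions:

```cpp
#include <cstddef>
#include <cstdint>

#include <raft/core/device_resources.hpp>
#include <raft/core/resource/device_memory_resource.hpp>

#include <rmm/device_uvector.hpp>

void build_something(raft::device_resources const& handle, std::size_t n_rows, std::size_t degree)
{
  // Scales with the problem size: take it from the large workspace, so it may
  // be backed by managed memory (if the user configured it that way) and is
  // released as soon as this function returns.
  rmm::device_uvector<uint32_t> graph_scratch(
    n_rows * degree, handle.get_stream(), raft::resource::get_large_workspace_resource(handle));

  // Small, frequent scratch allocations keep using the limited workspace.
  rmm::device_uvector<uint32_t> small_scratch(
    1024, handle.get_stream(), raft::resource::get_workspace_resource(handle));

  // ... fill and use the buffers ...
}
```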

Add another workspace memory resource that does not have an explicit memory limit.
It should be used for large allocations; a user can set it to a host-memory-backed resource, such as managed memory, for better scaling and to avoid many OOMs.
@achirkin achirkin added the 'enhancement (New feature or request)', 'non-breaking (Non-breaking change)', and '2 - In Progress (Currently a work in progress)' labels Feb 22, 2024
@achirkin achirkin self-assigned this Feb 22, 2024
@github-actions github-actions bot added the cpp label Feb 22, 2024
@achirkin achirkin added the 'feature request (New feature or request)' label and removed the 'enhancement (New feature or request)' label Feb 22, 2024
@tfeher (Contributor) commented Feb 23, 2024

Thanks Artem for proposing this solution. On one hand, it is nice to have a secondary workspace allocator to handle large allocations. I still need to think about this.

An alternative solution would be to keep a single workspace allocator that provides the large allocator as a fall-back when allocating from the fast (but smaller) pool fails.
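For concreteness, here is a minimal sketch (not part of this PR) of what such a fall-back could look like as a custom RMM resource, assuming a recent RMM where only do_allocate/do_deallocate must be overridden; the class name and bookkeeping are purely illustrative:

```cpp
#include <cstddef>
#include <mutex>
#include <new>  // std::bad_alloc (rmm::bad_alloc derives from it)
#include <unordered_set>

#include <rmm/cuda_stream_view.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>

// Try the fast (limited) upstream first; on allocation failure, fall back to
// the slow (e.g. managed-memory) upstream. Pointers served by the fallback
// are tracked so they are returned to the right upstream on deallocation.
class fallback_resource final : public rmm::mr::device_memory_resource {
 public:
  fallback_resource(rmm::mr::device_memory_resource* primary,
                    rmm::mr::device_memory_resource* fallback)
    : primary_{primary}, fallback_{fallback}
  {
  }

 private:
  void* do_allocate(std::size_t bytes, rmm::cuda_stream_view stream) override
  {
    try {
      return primary_->allocate(bytes, stream);
    } catch (std::bad_alloc const&) {
      void* p = fallback_->allocate(bytes, stream);
      std::lock_guard<std::mutex> lock{mtx_};
      fallback_ptrs_.insert(p);
      return p;
    }
  }

  void do_deallocate(void* ptr, std::size_t bytes, rmm::cuda_stream_view stream) override
  {
    bool from_fallback = false;
    {
      std::lock_guard<std::mutex> lock{mtx_};
      from_fallback = fallback_ptrs_.erase(ptr) > 0;
    }
    (from_fallback ? fallback_ : primary_)->deallocate(ptr, bytes, stream);
  }

  rmm::mr::device_memory_resource* primary_;
  rmm::mr::device_memory_resource* fallback_;
  std::mutex mtx_;
  std::unordered_set<void*> fallback_ptrs_;
};
```

A drawback of this approach, discussed in the replies below, is that once the fast pool is exhausted by one large allocation, every subsequent small allocation lands in the slow fallback.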

```diff
@@ -144,7 +177,7 @@ class workspace_resource_factory : public resource_factory {
   // Note, the workspace does not claim all this memory from the start, so it's still usable by
   // the main resource as well.
   // This limit is merely an order for algorithm internals to plan the batching accordingly.
-  return total_size / 2;
+  return total_size / 4;
```
Contributor (review comment):

The OOM errors we have seen with CAGRA were related to the workspace pool grabbing all this space. What about limiting it to a much smaller workspace size? (E.g., faiss has a 1.5 GiB limit.)

Contributor Author (review reply):

That is an option, but so far I think it's not necessary. I also think it can hurt performance a little by reducing the batch size in places like ivf_pq::search or ivf_pq::extend.

With the current proposal, the ann-bench executable (as a user of raft) sets these resources:

  • default - pool on top of device memory
  • limited workspace - shares the same pool as the default resource
  • large workspace - managed memory (without pooling)

Hence the dataset/user allocations do not compete with the workspace for the same memory (they both use the same pool). At the same time, large temporary allocations (such as the cagra graph on device) use managed memory and free it as soon as the algorithm finishes.
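To make the batch-size concern above concrete, here is a hedged sketch of how a batched algorithm might plan its batch size from the limited workspace. resource::get_workspace_free_bytes appears in the diff further down (its header is assumed to be the device_memory_resource.hpp file touched by this PR); the wrapper function and its parameters are illustrative:

```cpp
#include <algorithm>
#include <cstddef>

#include <raft/core/device_resources.hpp>
#include <raft/core/resource/device_memory_resource.hpp>

// A smaller workspace limit directly translates into smaller batches, which
// is why shrinking it can hurt e.g. ivf_pq::search / ivf_pq::extend.
std::size_t plan_batch_size(raft::device_resources const& handle, std::size_t n_rows, std::size_t dim)
{
  std::size_t free_ws       = raft::resource::get_workspace_free_bytes(handle);
  std::size_t bytes_per_row = dim * sizeof(float);
  return std::max<std::size_t>(1, std::min(n_rows, free_ws / bytes_per_row));
}
```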

@achirkin (Contributor, Author) commented Feb 23, 2024

Thanks for joining the conversation, Tamas. I've updated the description with my rationale since you reviewed the PR.
I think the alternative solution you propose is viable in the short term to avoid OOMs, but it can be taxing on performance: imagine a large allocation at the beginning of the algorithm (e.g. the training set in kmeans) takes all of the device/workspace memory; then many subsequent small allocations are forced to use the tiny remaining free fraction of the memory via the managed memory interface, with a large oversubscription rate. This could lead to terrible performance while the device memory is "wasted" on an allocation that may not be accessed often.

@achirkin achirkin added the '3 - Ready for Review' label and removed the '2 - In Progress (Currently a work in progress)' label Feb 23, 2024
@achirkin (Contributor, Author) commented Feb 26, 2024

Benchmarking update: there's limited evidence that the update improves the performance of cagra::build: I've got a ~6% speedup with default parameters on DEEP-100M (which is anyway very slow and takes a lot of memory due to the large default graph degrees and low ivf-pq compression).

@tfeher (Contributor) commented Feb 26, 2024

Thanks Artem for the update! It is a nice idea to have an extra memory resource that we can use for potentially host-memory-backed large temporary allocations. This can be useful for systems with an improved H2D interconnect, such as Grace Hopper.

@achirkin achirkin requested a review from cjnolet March 4, 2024 07:55
@achirkin achirkin changed the base branch from branch-24.04 to branch-24.06 April 4, 2024 12:14
@harrism (Member) commented Apr 23, 2024

If you know you want all allocations within certain size ranges to be allocated with specific resources, you should have a look at binning_memory_resource. Basically you could have one or more fixed_size_resources for really small or common stuff, and a pool_mr (or cuda_async_mr) for larger stuff.

The difference is that the choice of MR would be automatic and based on size, not on usage. The advantage is that implementers of RAFT functions wouldn't have to think about which MR to use, so it is potentially less bug-prone.

But if you need to explicitly choose MRs based on non-size-based logic, then you might want to store separate resources instead. If the logic you need is common across all functions, then you might want to encode that logic into a custom memory resource class that decides which of a group of upstream resources to allocate from. Encoding the logic in the MR class again reduces potential bugs by not having to duplicate the logic everywhere.
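For reference, a minimal sketch of that size-based routing with RMM's binning_memory_resource; the exponent-based constructor (which creates power-of-two fixed-size bins) and the concrete sizes are assumptions to check against the RMM docs:

```cpp
#include <rmm/mr/device/binning_memory_resource.hpp>
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

int main()
{
  rmm::mr::cuda_memory_resource cuda;

  // A pool for the "larger stuff": any request not matched by a bin below
  // falls through to this resource.
  using pool_mr = rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource>;
  pool_mr pool{&cuda, 1024ull * 1024 * 1024};

  // Bins for the "really small or common stuff": power-of-two sizes from
  // 2^8 = 256 B up to 2^16 = 64 KiB, each served by an internally created
  // fixed_size_memory_resource. The choice of resource is purely size-based,
  // so call sites never pick an MR explicitly.
  rmm::mr::binning_memory_resource<pool_mr> binning{&pool, 8, 16};

  rmm::mr::set_current_device_resource(&binning);
  return 0;
}
```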

constexpr uint64_t kExpectedWsSize = 1024 * 1024 * 1024;
uint64_t max_ws_size = std::min(resource::get_workspace_free_bytes(handle), kExpectedWsSize);
uint64_t expected_ws_size = 1024 * 1024 * 1024ull;
if (mr == nullptr) {
Member (review comment):

It'll be nice when we can remove the need to pass around memory resources as arguments and we can just pull everything from the handle.

Contributor Author (review reply):

Yes, but I'd suggest we do this while porting to cuVS, so as not to make things even more complicated for users (if there are any).

cpp/include/raft/core/resource/device_memory_resource.hpp (outdated review thread, resolved)
@achirkin achirkin marked this pull request as ready for review April 26, 2024 14:03
@achirkin achirkin requested a review from a team as a code owner April 26, 2024 14:03
@achirkin achirkin requested a review from cjnolet April 26, 2024 14:03
@achirkin achirkin changed the title [Discussion] Scaling workspace resources Scaling workspace resources Apr 29, 2024
@achirkin achirkin added the 'breaking (Breaking change)' label and removed the 'non-breaking (Non-breaking change)' label Apr 29, 2024
@achirkin achirkin changed the title Scaling workspace resources [Outdated] Scaling workspace resources May 16, 2024
@achirkin achirkin added the '5 - DO NOT MERGE (Hold off on merging; see PR for details)' label and removed the '3 - Ready for Review' label May 16, 2024
@achirkin (Contributor, Author) commented May 16, 2024

Opened #2322, dropping the changes to neighbor methods, which have been moved to cuVS. Keeping this PR open so that we can copy those neighbor changes when cuVS is ready for them.

@cjnolet (Member) commented May 17, 2024

@achirkin is this ready to be closed now that you've started a new PR for this?

@achirkin (Contributor, Author) commented

@cjnolet If you don't mind, I'd like to keep it open until we open a cuVS PR with the corresponding neighbor changes.

@achirkin achirkin changed the base branch from branch-24.06 to branch-24.08 June 6, 2024 14:56
rapids-bot bot pushed a commit to rapidsai/cuvs that referenced this pull request Jun 12, 2024
Use raft's large workspace resource for large temporary allocations during ANN index build.
This is the port of rapidsai/raft#2194, which didn't make it into raft before the algorithms were ported to cuVS.

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)

URL: #181