
Add the deterministic version of AIR Top-K (the radix-based topk in …) #2057

Open · wants to merge 17 commits into base: branch-24.02

Conversation

ChristinaZ

Hi,

  1. This PR is about the radix-based top-k algorithm in RAFT, which we call AIR Top-K in our recently published paper (third_party/raft/cpp/include/raft/matrix/detail/select_radix.cuh).

  2. We have received several pieces of feedback about the determinism of AIR Top-K.
    AIR Top-K returns the smallest or largest K elements.
    One thing to notice is that there might be more than one "Kth smallest/largest element" in a given dataset. For example, assume K=100 and the value of the Kth element is 58: three elements have the value 58, while 99 elements have values larger than 58. Since we ensure that the output contains exactly K elements, we cannot output all of the tied elements; in this example, only one of the three elements can be stored in the results.

Previously, the element equal to the Kth value was chosen randomly. In the deterministic version, we always ensure that the tied elements with the smaller indices are the ones output.

  3. In this PR, we only added code in the kernel, plus a template parameter `stable_last_filter` with a default value of `false`.
    This means existing callers do not need to change anything.

  4. We think it is better to open this PR first and discuss the current implementation. Then we can discuss adding one more API to expose this feature to users.
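The tie-breaking rule can be illustrated with a small host-side reference (a sketch only, not the actual CUDA kernel; `stable_topk_indices` is a hypothetical helper): among elements equal to the Kth value, the ones with the smallest original indices win.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Host-side reference for the deterministic selection rule: return the
// indices of the K smallest elements; ties on value are broken by taking
// the element with the smaller original index first.
std::vector<std::size_t> stable_topk_indices(const std::vector<float>& values,
                                             std::size_t k)
{
  std::vector<std::size_t> idx(values.size());
  for (std::size_t i = 0; i < idx.size(); ++i) idx[i] = i;
  // stable_sort keeps the original (index) order among equal values,
  // which is exactly the "smaller index wins" rule.
  std::stable_sort(idx.begin(), idx.end(),
                   [&](std::size_t a, std::size_t b) { return values[a] < values[b]; });
  idx.resize(std::min(k, idx.size()));
  return idx;
}
```

With values {5, 3, 5, 1, 5} and k = 3, the result is {3, 1, 0}: only one of the three elements equal to 5 fits in the output, and the one with the smallest index (0) is chosen.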


copy-pr-bot bot commented Dec 12, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the cpp label Dec 12, 2023
@ChristinaZ ChristinaZ marked this pull request as ready for review December 12, 2023 09:29
@ChristinaZ ChristinaZ requested a review from a team as a code owner December 12, 2023 09:29
@ChristinaZ ChristinaZ marked this pull request as draft December 12, 2023 09:40
@cjnolet cjnolet added documentation Improvements or additions to documentation non-breaking Non-breaking change labels Dec 13, 2023
@ChristinaZ ChristinaZ marked this pull request as ready for review January 8, 2024 10:59
@cjnolet cjnolet added doc Documentation and removed documentation Improvements or additions to documentation labels Jan 10, 2024
@cjnolet
Member

cjnolet commented Jan 10, 2024

/ok to test

ChristinaZ and others added 17 commits January 16, 2024 17:38
In the current state, ann-benchmarks running in the `--throughput` mode (multi-threaded) share ANN wrappers among CPU threads. This is not thread-safe and may result in incorrectly measured time (e.g. sharing cuda events among CPU threads) or various exceptions and segfaults (e.g. doing state-changing cublas calls from multiple CPU threads).

This PR makes the search benchmarks copy ANN wrappers in each thread. The copies of the wrappers then selectively:
  - share thread-safe resources (e.g. rmm memory pool) and large objects that are not expected to change during search (e.g. index data);
  - duplicate the resources that are not thread-safe or carry the thread-specific state (e.g. cublas handles, CUDA events and streams).
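The copy behavior described above can be sketched as follows (hypothetical names, not the real raft-ann-bench wrappers; `make_handle()` is a stand-in for cublas/CUDA handle creation so the sketch stays host-only). Shared, thread-safe resources are held via `shared_ptr` so copies alias them, while per-thread state is recreated in the copy constructor:

```cpp
#include <memory>
#include <vector>

// Stand-in for creating a thread-specific resource (e.g. a cublas handle
// or a CUDA event). Each call yields a distinct handle id.
static int next_handle_id = 0;
int make_handle() { return next_handle_id++; }

struct AnnWrapper {
  // Large, read-only data (e.g. the index) is shared between copies.
  std::shared_ptr<const std::vector<float>> index_data;
  // Thread-specific state must NOT be shared between copies.
  int handle;

  explicit AnnWrapper(std::vector<float> data)
    : index_data(std::make_shared<const std::vector<float>>(std::move(data))),
      handle(make_handle()) {}

  // Copy: alias the shared index, but create a fresh handle so each
  // CPU thread gets its own thread-unsafe resources.
  AnnWrapper(const AnnWrapper& other)
    : index_data(other.index_data), handle(make_handle()) {}
};
```

Each benchmark thread copies the wrapper; all copies see the same index data but never share a handle.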

Alongside, the PR adds a few small changes, including:
 - enables ann-bench NVTX annotations for the non-common-executable mode (shows benchmark labels and iterations in nsys timeline);
 - fixes compile errors for the common-executable mode.

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)
  - William Hicks (https://github.com/wphicks)

Approvers:
  - William Hicks (https://github.com/wphicks)
  - Mark Harris (https://github.com/harrism)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#2021
…apidsai#2062)

With the changes introduced by rapidsai#2021, the copied FAISS benchmark wrapper contains a cuda event that is used for synchronizing between streams during search. The event has the same lifetime as the wrapper, but the event handle itself was copied between wrappers, leading to illegal memory accesses and crashes.
This PR fixes the bug by creating a new cuda event for each wrapper copy, so that the wrappers do not share their synchronization events.

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#2062
…ai#2061)

The `device_ndarray.empty()` function can be used to allocate device memory without initialization. Previously, the memory was allocated (uninitialized) on the host and then copied to the device.

This PR fixes the behavior of `empty()` by allowing the `device_ndarray` to be initialized from an `array_interface` instead of a `numpy.ndarray` instance, which conditionally allows skipping the initialization of the `DeviceBuffer`.

CC @tfeher, @cjnolet

Authors:
  - Malte Förster (https://github.com/mfoerste4)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#2061
This PR updates RAFT to CCCL 2.2.0. Do not merge until all of RAPIDS is ready to update.

Depends on rapidsai#2048.

Replaces rapidsai#1464.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Robert Maynard (https://github.com/robertmaynard)

URL: rapidsai#2049
Add our bfknn code to the raft-ann-bench project

Authors:
  - Ben Frederickson (https://github.com/benfred)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#2063
…rapidsai#2043)

This PR:
1. Adds more clarity to filenames by using `,` as separator instead of `_`
2. Adds 80% and 99% recall bars to build plots
3. Does not plot a recall level in build plot if no data is present
4. Adds a `x-start` argument which allows controlling the minimum recall level used on the x-axis of the search plot
5. Fixes sometimes occurring multi-line issue in search plots
6. Build time plots now show average build times for the index corresponding to a search query in each recall range

Authors:
  - Divye Gala (https://github.com/divyegala)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#2043
A code change to fix a typo that I found while going through the codebase.

Authors:
  - Vivek Narang (https://github.com/narangvivek10)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#2070
Allow docs to be built correctly with doxygen 1.10

Authors:
  - William Hicks (https://github.com/wphicks)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#2079
…ai#1999)

### What is mdbuffer?

This PR introduces a maybe-owning multi-dimensional abstraction called `mdbuffer` to help simplify code that _may_ require an `mdarray` but only if the data are not already in a desired form or location.

As a concrete example, consider a function `foo_device` which operates on memory accessible from the device. If we wish to pass it data originating on the host, a separate code path must be created in which a `device_mdarray` is created and the data are explicitly copied from host to device. This leads to a proliferation of branches as `foo_device` interacts with other functions with similar requirements.

As an initial simplification, `mdbuffer` allows us to write a single template that accepts an `mdspan` pointing to memory on either host _or_ device and routes it through the same code:
```c++
template <typename mdspan_type>
void foo_device(raft::resources const& res, mdspan_type data) {
  auto buf = raft::mdbuffer{res, raft::mdbuffer{data}, raft::memory_type::device};
  // Data in buf is now guaranteed to be accessible from device.
  // If it was already accessible from device, no copy was performed. If it
  // was not, a copy was performed.

  some_kernel<<<...>>>(buf.view<raft::memory_type::device>());

  // It is sometimes useful to know whether or not a copy was performed to
  // e.g. determine whether the transformed data should be copied back to its original
  // location. This can be checked via the `is_owning()` method.
  if (buf.is_owning()) {
    raft::copy(res, data, buf.view<raft::memory_type::device>());
  }
}

foo_device(res, some_host_mdspan);  // Still works; memory is allocated and copy is performed
foo_device(res, some_device_mdspan);  // Still works and no allocation or copy is required
foo_device(res, some_managed_mdspan);  // Still works and no allocation or copy is required
```

While this is a useful simplification, it still leads to a proliferation of template instantiations. If this is undesirable, `mdbuffer` permits a further consolidation through implicit conversion of an mdspan to an mdbuffer:

```c++
void foo_device(raft::resources const& res, raft::mdbuffer<float, raft::matrix_extent<int>>&& data)
{ auto buf = raft::mdbuffer{res, data, raft::memory_type::device};
  some_kernel<<<...>>>(buf.view<raft::memory_type::device>());
  if (buf.is_owning()) {
    raft::copy(res, data, buf.view<raft::memory_type::device>());
  }
}

// All of the following work exactly as before but no longer require separate template instantiations
foo_device(res, some_host_mdspan);
foo_device(res, some_device_mdspan);
foo_device(res, some_managed_mdspan);
```

`mdbuffer` also offers a simple way to perform runtime dispatching based on the memory type passed to it using standard C++ patterns. While mdbuffer's `.view()` method takes an optional template parameter indicating the mdspan type to retrieve as a view, that parameter can be omitted to retrieve a `std::variant` of all mdspan types which may provide a view on the `mdbuffer`'s data (depending on its memory type). We can then use `std::visit` to perform runtime dispatching based on where the data are stored:

```c++
void foo(raft::resources const& res, raft::mdbuffer<float, raft::matrix_extent<int>>&& data) {
  std::visit([](auto view) {
    if constexpr (typename decltype(view)::accessor_type::is_device_accessible) {
      // Do something with these data on device
    } else {
      // Do something with these data on host
    }
  }, data.view());
}
```

In addition to moving data among various memory types (host, device, managed, and pinned currently), `mdbuffer` can be used to coerce data to a desired in-memory layout or to a compatible data type (e.g. floats to doubles). As with changes in the memory type, a copy will be performed if and only if it is necessary.

```c++
template <typename mdspan_type>
void foo_device(raft::resources const& res, mdspan_type data) {
  auto buf = raft::mdbuffer<float, raft::matrix_extent<int>, raft::row_major>{res,
raft::mdbuffer{data}, raft::memory_type::device};
  // Data in buf is now guaranteed to be accessible from device, and
  // represented by floats in row-major order.

  some_kernel<<<...>>>(buf.view<raft::memory_type::device>());

  // The same check can be used to determine whether or not a copy was
  // required, regardless of the cause. I.e. if the data were already on
  // device but in column-major order, the is_owning() method would still
  // return true because new storage needed to be allocated.
  if (buf.is_owning()) {
    raft::copy(res, data, buf.view<raft::memory_type::device>());
  }
}
```

### What mdbuffer is **not**
`mdbuffer` is **not** a replacement for either `mdspan` or `mdarray`. `mdspan` remains the standard object for passing data views throughout the RAFT codebase, and `mdarray` remains the standard object for allocating new multi-dimensional data. This is reflected in the fact that `mdbuffer` can _only_ be constructed from an existing `mdspan` or `mdarray` or another `mdbuffer`. `mdbuffer` is intended to be used solely to simplify code where data _may_ need to be copied to a different location.

### Follow-ups

-  I have omitted the mdbuffer-based replacement for and generalization of `temporary_device_buffer` since this PR is already enormous. I have this partially written however, and I'll post a link to its current state to help motivate the changes here.
- For all necessary copies, `mdbuffer` uses `raft::copy`. For _some_ transformations that require a change in data type or layout, `raft::copy` is not fully optimized. See rapidsai#1842 for more information. Optimizing this will be an important change to ensure that `mdbuffer` can be used with absolutely minimal overhead in all cases. These non-optimized cases represent a small fraction of the real-world use cases we can expect for `mdbuffer`, however, so there should be little concern about beginning to use it as is.
- `std::visit`'s performance for a small number of variants is sometimes non-optimal. As a followup, it would be good to benchmark `mdbuffer`'s current performance and compare to internal use of a `visit` implementation that uses a `switch` on the available memory types.

Resolve rapidsai#1602

Authors:
  - William Hicks (https://github.com/wphicks)
  - Tarang Jain (https://github.com/tarang-jain)

Approvers:
  - Divye Gala (https://github.com/divyegala)
  - Corey J. Nolet (https://github.com/cjnolet)
  - Artem M. Chirkin (https://github.com/achirkin)
  - Tamas Bela Feher (https://github.com/tfeher)
  - Ben Frederickson (https://github.com/benfred)

URL: rapidsai#1999
The `print_results` function here is currently hardcoded to print only 2 results, irrespective of the number of queries. This change removes the hardcoded limit and prints results for the actual number of queries.

Authors:
  - Vivek Narang (https://github.com/narangvivek10)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#2080
This PR addresses rapidsai#2058 by changing the thread parallelism method.

In the first half of the `refine` process, the distance calculation is performed on all candidate vectors, i.e., the number of queries * the original top-k vectors. Since the distance calculations for each vector can be performed independently, this part is thread-parallelized assuming that maximum parallelism is the number of queries * original top-k. This means that even if the number of queries is 1, this part can be executed in thread parallel.

On the other hand, the second half of the `refine` process, the so-called top-k calculation, can be performed independently for each query, but it is difficult to thread-parallelize the calculation within a single query. Therefore, this part is parallelized assuming that the maximum parallelism is the number of queries, as in the current implementation.
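A host-side sketch of the two phases (hypothetical names; the real code is CUDA, parallelized as described). Phase one produces one independent distance per (query, candidate) pair, so it admits n_queries * top_k parallelism even for a single query; phase two selects the k best per query, with parallelism limited to the number of queries:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Phase 1: one independent squared distance per (query, candidate) pair.
// In the real kernel this doubly-nested loop is flattened and
// thread-parallelized over n_queries * top_k work items.
std::vector<float> squared_distances(const std::vector<std::vector<float>>& queries,
                                     const std::vector<std::vector<float>>& candidates)
{
  std::vector<float> out(queries.size() * candidates.size());
  for (std::size_t q = 0; q < queries.size(); ++q)
    for (std::size_t c = 0; c < candidates.size(); ++c) {
      float d = 0.f;
      for (std::size_t j = 0; j < queries[q].size(); ++j) {
        float diff = queries[q][j] - candidates[c][j];
        d += diff * diff;
      }
      out[q * candidates.size() + c] = d;
    }
  return out;
}

// Phase 2: top-k over one query's distances. Independent across queries,
// but hard to parallelize within a query, so parallelism = n_queries.
std::vector<std::size_t> topk_for_query(const std::vector<float>& dists, std::size_t k)
{
  std::vector<std::size_t> idx(dists.size());
  for (std::size_t i = 0; i < idx.size(); ++i) idx[i] = i;
  k = std::min(k, idx.size());
  std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                    [&](std::size_t a, std::size_t b) { return dists[a] < dists[b]; });
  idx.resize(k);
  return idx;
}
```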

Authors:
  - Akira Naruse (https://github.com/anaruse)
  - Corey J. Nolet (https://github.com/cjnolet)
  - William Hicks (https://github.com/wphicks)

Approvers:
  - Artem M. Chirkin (https://github.com/achirkin)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#2059
Fix for rapidsai#2072: CAGRA search launches one unit of work per query in single-CTA mode. The maximum number along that launch dimension is 65535, so the `max_queries` auto selection should be bounded by this number.
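The fix amounts to clamping the automatically selected batch size (a sketch with a hypothetical helper; which grid dimension carries the per-query launch is an assumption here, but 65535 is the CUDA limit on `gridDim.y`/`gridDim.z`):

```cpp
#include <algorithm>
#include <cstdint>

// CUDA caps gridDim.y and gridDim.z at 65535, so an auto-selected
// max_queries that maps one launch unit per query along such a
// dimension must be bounded by this limit.
constexpr std::uint32_t kMaxGridDim = 65535;

std::uint32_t clamp_max_queries(std::uint32_t requested)
{
  return std::min(requested, kMaxGridDim);
}
```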

Authors:
  - Micka (https://github.com/lowener)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#2081
Add an enum that controls which select-k algorithm is used. This takes the enum that was in raft_internal and exposes it in the public API, letting users pick directly which select-k algorithm they want to use.
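A sketch of the pattern (hypothetical names and heuristic; the actual enum and entry point live in RAFT's public matrix API): the algorithm choice moves from an internal detail into a public enum parameter with an automatic default.

```cpp
#include <string>

// Hypothetical mirror of the pattern: a public enum lets callers pick
// the select-k implementation instead of relying on internal heuristics.
enum class SelectAlgo { kAuto, kRadix8bits, kRadix11bits, kWarpSort };

// Dispatch stub: returns which implementation would run. In RAFT the
// analogous switch calls into the concrete selection kernels.
std::string dispatch_select_k(SelectAlgo algo, int k)
{
  switch (algo) {
    case SelectAlgo::kRadix8bits:  return "radix8";
    case SelectAlgo::kRadix11bits: return "radix11";
    case SelectAlgo::kWarpSort:    return "warpsort";
    case SelectAlgo::kAuto:
    default:
      // Placeholder heuristic: small k tends to favor warp-sort-style
      // selection; the real auto heuristic is more involved.
      return k <= 256 ? "warpsort" : "radix11";
  }
}
```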

Authors:
  - Ben Frederickson (https://github.com/benfred)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#2046
@ChristinaZ ChristinaZ force-pushed the determinictic_version_of_AIR_TopK branch from 0a9a38a to 7501e0a Compare January 17, 2024 01:59
@ChristinaZ ChristinaZ requested review from a team as code owners January 17, 2024 01:59

@ChristinaZ
Author

/ok to test

@cjnolet
Member

cjnolet commented May 21, 2024

@ChristinaZ it would be great to have this feature in RAFT. I'm going to push the release to 24.08 since we're approaching code freeze for 24.06. 24.08 is in August, do you think we might be able to get this merged by then?

@ChristinaZ
Author

> @ChristinaZ it would be great to have this feature in RAFT. I'm going to push the release to 24.08 since we're approaching code freeze for 24.06. 24.08 is in August, do you think we might be able to get this merged by then?

Got it. I will work on the integration as soon as possible.

Labels
ci CMake cpp doc Documentation non-breaking Non-breaking change python
Projects
Status: In Progress