CAGRA new vector addition #2157
Recall is low in the DataT=I8/U8 tests due to #2287. If the dataset vectors are not normalized, all added vectors tend to be connected to dataset nodes with a large L2 norm.
@enp1s0 now that CAGRA has been moved over to cuVS, this PR will also have to be migrated over to cuVS. No rush, of course, just letting you know.
@enp1s0 just a heads up: now that we've migrated CAGRA over to cuVS, we'll need to migrate these changes over as well. It should be a fairly straightforward merge because the CAGRA impl in cuVS is a direct migration. We are no longer updating the vector search implementations in RAFT and they will be removed soon.
This PR introduces the new vector addition feature to CAGRA.
Rel: #1775
CAGRA-Q is not supported
Usage
Algorithm
Graph degree: d
The algorithm consists of two stages: rank-based reordering and reverse edge addition.
1-1. Obtain the d' (=2d) nearest neighbor vectors (V) of a given new vector using the CAGRA search.
1-2. Count the number of detourable edges using the result of step 1-1 and the neighbor list of the input index. Then prune 3d/2 edges in the same way as the CAGRA graph optimization. This leaves d/2 neighbors for the new node.
2-1. Count the number of incoming edges for all nodes.
2-2. Add d/2 reverse edges from the nodes selected in stage 1: in the neighbor list of each such node, replace one existing neighbor with the new node. To avoid losing the connection to the replaced node, add the replaced node to the neighbor list of the new node, which keeps a detour path to it through the new node. The replaced node is chosen from the last d/2 entries of the neighbor list as the one with the largest number of incoming edges, skipping nodes that are already in the new node's neighbor list.
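The two stages can be sketched in plain Python as follows. This is only an illustration of the steps described above, not the code in this PR: the names `select_forward_neighbors` and `add_node` are hypothetical, the graph is a list of fixed-length neighbor lists, and `candidates` stands in for the 2d search results of step 1-1, sorted by ascending distance. The detour count is simplified as an assumption: an edge to candidate a is counted as detourable once for every closer candidate that already has a in its neighbor list.

```python
def select_forward_neighbors(candidates, graph, keep):
    """Step 1-2 (simplified): keep the `keep` candidates with the fewest detours.

    candidates: node ids sorted by ascending distance to the new vector.
    graph:      list of neighbor lists of the existing index.
    """
    counts = []
    for rank, a in enumerate(candidates):
        # Assumption: new -> a is detourable via every closer candidate b
        # whose existing neighbor list already contains a.
        detours = sum(1 for b in candidates[:rank] if a in graph[b])
        counts.append((detours, rank, a))
    counts.sort()  # fewest detours first, ties broken by distance rank
    return [a for _, _, a in counts[:keep]]


def add_node(graph, candidates, degree):
    """Append one new node to `graph` in place and return its id."""
    new_id = len(graph)
    half = degree // 2

    # Stage 1: choose d/2 forward neighbors out of the 2d candidates.
    new_neighbors = select_forward_neighbors(candidates, graph, half)

    # Step 2-1: count incoming edges for all existing nodes.
    incoming = [0] * len(graph)
    for neighbors in graph:
        for v in neighbors:
            incoming[v] += 1

    # Step 2-2: add a reverse edge from each forward neighbor u.
    for u in list(new_neighbors):
        start = len(graph[u]) - half
        # Only the last d/2 slots are eligible, and the replaced node must
        # not already be in the new node's neighbor list.
        slots = [s for s in range(start, len(graph[u]))
                 if graph[u][s] not in new_neighbors]
        if not slots:
            continue  # nothing replaceable; skip the reverse edge for u
        slot = max(slots, key=lambda s: incoming[graph[u][s]])
        replaced = graph[u][slot]
        graph[u][slot] = new_id          # reverse edge u -> new node
        new_neighbors.append(replaced)   # detour new node -> replaced node

    graph.append(new_neighbors)
    return new_id
```

In this sketch the new node ends up with d neighbors (d/2 from stage 1 and up to d/2 replaced nodes from stage 2), and every existing neighbor list keeps its original length.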
Performance
In this experiment, we first split the dataset into two parts: the initial part and the additional part. Then, we extend the CAGRA index built from the initial part to include the additional part.
Recall drops further below the baseline as the number of added vectors increases.
Therefore, rebuilding the CAGRA index is recommended when adding a large number of vectors.
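For reference, the split and the recall measurement in such an experiment can be sketched as below. This is a generic illustration, not the benchmark code used for this PR: `split_dataset` and `recall_at_k` are hypothetical helper names, and recall is assumed to be the usual fraction of true top-k neighbors that appear in the returned top-k.

```python
def split_dataset(vectors, num_initial):
    """Split a dataset into the initial part and the additional part."""
    return vectors[:num_initial], vectors[num_initial:]


def recall_at_k(found, truth, k):
    """Average fraction of ground-truth top-k ids present in the results.

    found, truth: one list of neighbor ids per query.
    """
    hits = sum(len(set(f[:k]) & set(t[:k])) for f, t in zip(found, truth))
    return hits / (k * len(truth))
```

Computing `recall_at_k` once for the extended index and once for an index built from the full dataset gives the kind of comparison described above.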
TODO