-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow to mark vectors as "userProvided" #4633
Conversation
48776ca
to
8a123a4
Compare
- use the parsed_vectors module - only parse `_vectors` once per document, instead of once per embedder per document
8a123a4
to
1c7f252
Compare
1c7f252
to
f25d590
Compare
Looks like the vectors in the snapshots are different from those computed in the CI... Investigating, but we might need a different way to check here. |
f25d590
to
96e0de9
Compare
96e0de9
to
9969f7a
Compare
I made sure the behavior when not passing |
milli/src/update/index_documents/extract/extract_vector_points.rs
Outdated
Show resolved
Hide resolved
milli/src/update/index_documents/extract/extract_vector_points.rs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much @dureuill. LGTM 🍸
bors merge
4633: Allow to mark vectors as "userProvided" r=Kerollmops a=dureuill # Pull Request ## Related issue Fixes #4606 ## What does this PR do? [See usage in PRD](https://meilisearch.notion.site/v1-9-AI-search-changes-e90d6803eca8417aa70a1ac5d0225697#deb96fb0595947bda7d4a371100326eb) - Extends the shape of the special `_vectors` field in documents. - previously, the `_vectors` field had to be an object, with each field the name of a configured embedder, and each value either `null`, an embedding (array of numbers), or an array of embeddings. - In this PR, the value of an embedder in the `_vectors` field can additionally be an object. The object has two fields: 1. `embeddings`: `null`, an embedding (array of numbers), or an array of embeddings. 2. `userProvided`: a boolean indicating if the vector was provided by the user. - The previous form `embedder_or_array_of_embedders` is semantically equivalent to: ```json { "embeddings": embedder_or_array_of_embedders, "userProvided": true } ``` - During the indexing step, the subfields and values of the `_vectors` field that have `userProvided` set to **false** are added in the vector DB, but not in the documents DB: that means that future modifications of the documents will trigger a regeneration of that particular vector using the document template. - This allows **importing** embeddings as a one-shot process, while still retaining the ability to regenerate embeddings on document change. - The dump process now uses this ability: it enriches the `_vectors` fields of documents with the embeddings that were autogenerated, marking them as not `userProvided`. This allows importing the vectors from a dump without regenerating them. ### Tests This PR adds the following tests - Long-needed hybrid search tests of a simple hf embedder - Dump test that imports vectors. Due to the difficulty of actually importing a dump in tests, we just read the dump and check it contains the expected content. - Tests in the index-scheduler: this tests that documents containing the same kind of instructions as in the dump indexes as expected Co-authored-by: Louis Dureuil <louis@meilisearch.com>
bors cancel |
Canceled. |
6ef21c7
to
8a941c0
Compare
bors merge |
4633: Allow to mark vectors as "userProvided" r=Kerollmops a=dureuill # Pull Request ## Related issue Fixes #4606 ## What does this PR do? [See usage in PRD](https://meilisearch.notion.site/v1-9-AI-search-changes-e90d6803eca8417aa70a1ac5d0225697#deb96fb0595947bda7d4a371100326eb) - Extends the shape of the special `_vectors` field in documents. - previously, the `_vectors` field had to be an object, with each field the name of a configured embedder, and each value either `null`, an embedding (array of numbers), or an array of embeddings. - In this PR, the value of an embedder in the `_vectors` field can additionally be an object. The object has two fields: 1. `embeddings`: `null`, an embedding (array of numbers), or an array of embeddings. 2. `userProvided`: a boolean indicating if the vector was provided by the user. - The previous form `embedder_or_array_of_embedders` is semantically equivalent to: ```json { "embeddings": embedder_or_array_of_embedders, "userProvided": true } ``` - During the indexing step, the subfields and values of the `_vectors` field that have `userProvided` set to **false** are added in the vector DB, but not in the documents DB: that means that future modifications of the documents will trigger a regeneration of that particular vector using the document template. - This allows **importing** embeddings as a one-shot process, while still retaining the ability to regenerate embeddings on document change. - The dump process now uses this ability: it enriches the `_vectors` fields of documents with the embeddings that were autogenerated, marking them as not `userProvided`. This allows importing the vectors from a dump without regenerating them. ### Tests This PR adds the following tests - Long-needed hybrid search tests of a simple hf embedder - Dump test that imports vectors. Due to the difficulty of actually importing a dump in tests, we just read the dump and check it contains the expected content. - Tests in the index-scheduler: this tests that documents containing the same kind of instructions as in the dump indexes as expected Co-authored-by: Louis Dureuil <louis@meilisearch.com>
This PR was included in a batch that successfully built, but then failed to merge into main. It will not be retried. Additional information: {"message":"At least 1 approving review is required by reviewers with write access.","documentation_url":"https://docs.github.com/articles/about-protected-branches"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bors merge
4633: Allow to mark vectors as "userProvided" r=Kerollmops a=dureuill # Pull Request ## Related issue Fixes #4606 ## What does this PR do? [See usage in PRD](https://meilisearch.notion.site/v1-9-AI-search-changes-e90d6803eca8417aa70a1ac5d0225697#deb96fb0595947bda7d4a371100326eb) - Extends the shape of the special `_vectors` field in documents. - previously, the `_vectors` field had to be an object, with each field the name of a configured embedder, and each value either `null`, an embedding (array of numbers), or an array of embeddings. - In this PR, the value of an embedder in the `_vectors` field can additionally be an object. The object has two fields: 1. `embeddings`: `null`, an embedding (array of numbers), or an array of embeddings. 2. `userProvided`: a boolean indicating if the vector was provided by the user. - The previous form `embedder_or_array_of_embedders` is semantically equivalent to: ```json { "embeddings": embedder_or_array_of_embedders, "userProvided": true } ``` - During the indexing step, the subfields and values of the `_vectors` field that have `userProvided` set to **false** are added in the vector DB, but not in the documents DB: that means that future modifications of the documents will trigger a regeneration of that particular vector using the document template. - This allows **importing** embeddings as a one-shot process, while still retaining the ability to regenerate embeddings on document change. - The dump process now uses this ability: it enriches the `_vectors` fields of documents with the embeddings that were autogenerated, marking them as not `userProvided`. This allows importing the vectors from a dump without regenerating them. ### Tests This PR adds the following tests - Long-needed hybrid search tests of a simple hf embedder - Dump test that imports vectors. Due to the difficulty of actually importing a dump in tests, we just read the dump and check it contains the expected content. - Tests in the index-scheduler: this tests that documents containing the same kind of instructions as in the dump indexes as expected Co-authored-by: Louis Dureuil <louis@meilisearch.com>
Build failed:
|
bors merge |
4633: Allow to mark vectors as "userProvided" r=Kerollmops a=dureuill # Pull Request ## Related issue Fixes #4606 ## What does this PR do? [See usage in PRD](https://meilisearch.notion.site/v1-9-AI-search-changes-e90d6803eca8417aa70a1ac5d0225697#deb96fb0595947bda7d4a371100326eb) - Extends the shape of the special `_vectors` field in documents. - previously, the `_vectors` field had to be an object, with each field the name of a configured embedder, and each value either `null`, an embedding (array of numbers), or an array of embeddings. - In this PR, the value of an embedder in the `_vectors` field can additionally be an object. The object has two fields: 1. `embeddings`: `null`, an embedding (array of numbers), or an array of embeddings. 2. `userProvided`: a boolean indicating if the vector was provided by the user. - The previous form `embedder_or_array_of_embedders` is semantically equivalent to: ```json { "embeddings": embedder_or_array_of_embedders, "userProvided": true } ``` - During the indexing step, the subfields and values of the `_vectors` field that have `userProvided` set to **false** are added in the vector DB, but not in the documents DB: that means that future modifications of the documents will trigger a regeneration of that particular vector using the document template. - This allows **importing** embeddings as a one-shot process, while still retaining the ability to regenerate embeddings on document change. - The dump process now uses this ability: it enriches the `_vectors` fields of documents with the embeddings that were autogenerated, marking them as not `userProvided`. This allows importing the vectors from a dump without regenerating them. ### Tests This PR adds the following tests - Long-needed hybrid search tests of a simple hf embedder - Dump test that imports vectors. Due to the difficulty of actually importing a dump in tests, we just read the dump and check it contains the expected content. - Tests in the index-scheduler: this tests that documents containing the same kind of instructions as in the dump indexes as expected Co-authored-by: Louis Dureuil <louis@meilisearch.com>
Build failed: |
bors merge |
4633: Allow to mark vectors as "userProvided" r=dureuill a=dureuill # Pull Request ## Related issue Fixes #4606 ## What does this PR do? [See usage in PRD](https://meilisearch.notion.site/v1-9-AI-search-changes-e90d6803eca8417aa70a1ac5d0225697#deb96fb0595947bda7d4a371100326eb) - Extends the shape of the special `_vectors` field in documents. - previously, the `_vectors` field had to be an object, with each field the name of a configured embedder, and each value either `null`, an embedding (array of numbers), or an array of embeddings. - In this PR, the value of an embedder in the `_vectors` field can additionally be an object. The object has two fields: 1. `embeddings`: `null`, an embedding (array of numbers), or an array of embeddings. 2. `userProvided`: a boolean indicating if the vector was provided by the user. - The previous form `embedder_or_array_of_embedders` is semantically equivalent to: ```json { "embeddings": embedder_or_array_of_embedders, "userProvided": true } ``` - During the indexing step, the subfields and values of the `_vectors` field that have `userProvided` set to **false** are added in the vector DB, but not in the documents DB: that means that future modifications of the documents will trigger a regeneration of that particular vector using the document template. - This allows **importing** embeddings as a one-shot process, while still retaining the ability to regenerate embeddings on document change. - The dump process now uses this ability: it enriches the `_vectors` fields of documents with the embeddings that were autogenerated, marking them as not `userProvided`. This allows importing the vectors from a dump without regenerating them. ### Tests This PR adds the following tests - Long-needed hybrid search tests of a simple hf embedder - Dump test that imports vectors. Due to the difficulty of actually importing a dump in tests, we just read the dump and check it contains the expected content. - Tests in the index-scheduler: this tests that documents containing the same kind of instructions as in the dump indexes as expected Co-authored-by: Louis Dureuil <louis@meilisearch.com>
Build failed:
|
bors merge |
Pull Request
Related issue
Fixes #4606
What does this PR do?
See usage in PRD
_vectors
field in documents._vectors
field had to be an object, with each field the name of a configured embedder, and each value eithernull
, an embedding (array of numbers), or an array of embeddings._vectors
field can additionally be an object. The object has two fields:embeddings
:null
, an embedding (array of numbers), or an array of embeddings.userProvided
: a boolean indicating if the vector was provided by the user.embedder_or_array_of_embedders
is semantically equivalent to:_vectors
field that haveuserProvided
set to false are added in the vector DB, but not in the documents DB: that means that future modifications of the documents will trigger a regeneration of that particular vector using the document template._vectors
fields of documents with the embeddings that were autogenerated, marking them as notuserProvided
. This allows importing the vectors from a dump without regenerating them.Tests
This PR adds the following tests