Store List of Fields in Segment #2279

PSeitz · 2023-12-13T07:55:46Z

Fields may be encoded in the columnar storage or in the inverted index
for JSON fields.
Add a new segment file that contains the list of fields (schema +
encoded).

adamreichold · 2023-12-13T10:04:17Z

src/core/segment_reader.rs

+        if let Some(list_fields_file) = self.list_fields_file.as_ref() {
+            let file = list_fields_file.read_bytes()?;
+            let fields_metadata =
+                read_split_fields(file)?.collect::<io::Result<Vec<FieldMetadata>>>();
+            fields_metadata.map_err(|e| e.into())
+        } else {
+            // Schema fallback
+            Ok(self
+                .schema()
+                .fields()
+                .map(|(_field, entry)| FieldMetadata {
+                    field_name: entry.name().to_string(),
+                    typ: entry.field_type().value_type(),
+                    indexed: entry.is_indexed(),
+                    stored: entry.is_stored(),
+                    fast: entry.is_fast(),
+                })
+                .collect())
        }


It is admittedly code golfing, but I think it makes it a bit less noisy:

Suggested change

-        if let Some(list_fields_file) = self.list_fields_file.as_ref() {
-            let file = list_fields_file.read_bytes()?;
-            let fields_metadata =
-                read_split_fields(file)?.collect::<io::Result<Vec<FieldMetadata>>>();
-            fields_metadata.map_err(|e| e.into())
-        } else {
-            // Schema fallback
-            Ok(self
-                .schema()
-                .fields()
-                .map(|(_field, entry)| FieldMetadata {
-                    field_name: entry.name().to_string(),
-                    typ: entry.field_type().value_type(),
-                    indexed: entry.is_indexed(),
-                    stored: entry.is_stored(),
-                    fast: entry.is_fast(),
-                })
-                .collect())
-        }
+        let fields_metadata = if let Some(list_fields_file) = self.list_fields_file.as_ref() {
+            let file = list_fields_file.read_bytes()?;
+            read_split_fields(file)?.collect::<io::Result<Vec<FieldMetadata>>>()?
+        } else {
+            // Schema fallback
+            self.schema()
+                .fields()
+                .map(|(_field, entry)| FieldMetadata {
+                    field_name: entry.name().to_string(),
+                    typ: entry.field_type().value_type(),
+                    indexed: entry.is_indexed(),
+                    stored: entry.is_stored(),
+                    fast: entry.is_fast(),
+                })
+                .collect()
+        };
+        Ok(fields_metadata)

adamreichold · 2023-12-13T10:06:48Z

src/indexer/path_to_unordered_id.rs

@@ -24,34 +26,44 @@ impl From<u32> for OrderedPathId {

 #[derive(Default)]
 pub(crate) struct PathToUnorderedId {
-    map: FnvHashMap<String, u32>,
+    /// TinySet contains the type codes of the columns in the path.


I think defining a struct with named fields would improve readability here, likely making the comment unnecessary or at least attaching it to the named field directly.
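
A minimal sketch of that suggestion, assuming the map value is currently an `(id, TinySet)` tuple. `PathEntry`, `TypeCodes`, and `get_or_insert` are hypothetical names, and `TypeCodes` is only a stand-in for tantivy's `TinySet` so that the example compiles on its own:

```rust
use std::collections::HashMap;

/// Stand-in for tantivy's `TinySet` (assumption): a 64-bit set of small integers.
#[derive(Clone, Copy, Default, Debug, PartialEq, Eq)]
struct TypeCodes(u64);

impl TypeCodes {
    fn insert(&mut self, code: u8) {
        debug_assert!(code < 64);
        self.0 |= 1u64 << code;
    }

    fn contains(self, code: u8) -> bool {
        code < 64 && (self.0 >> code) & 1 == 1
    }
}

/// Hypothetical named replacement for the `(u32, TinySet)` tuple.
#[derive(Clone, Copy, Debug)]
struct PathEntry {
    /// Id assigned in insertion order.
    unordered_id: u32,
    /// Type codes of all columns seen under this path.
    typ_codes: TypeCodes,
}

#[derive(Default)]
struct PathToUnorderedId {
    map: HashMap<String, PathEntry>,
}

impl PathToUnorderedId {
    /// Returns the id for `path`, registering the path if needed,
    /// and records `typ_code` in the entry's set of type codes.
    fn get_or_insert(&mut self, path: &str, typ_code: u8) -> u32 {
        let next_id = self.map.len() as u32;
        let entry = self.map.entry(path.to_string()).or_insert(PathEntry {
            unordered_id: next_id,
            typ_codes: TypeCodes::default(),
        });
        entry.typ_codes.insert(typ_code);
        entry.unordered_id
    }
}
```

With named fields, the doc comment sits on `typ_codes` itself, and call sites read `entry.typ_codes` rather than `.1`.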

adamreichold · 2023-12-13T10:13:46Z

src/indexer/path_to_unordered_id.rs

+        let mut sorted_ids: Vec<(&str, (u32, TinySet))> = self
+            .map
+            .iter()
+            .map(|(k, (id, typ_code))| (k.as_str(), (*id, *typ_code)))


I think using the plural typ_codes would be good here and below, or maybe all_codes or typ_code_bitvec as above, to visually indicate that this is not a single code as in e.g. insert_new_path.

adamreichold · 2023-12-13T10:14:28Z

src/indexer/segment_serializer.rs

@@ -81,6 +84,11 @@ impl SegmentSerializer {
        &mut self.postings_serializer
    }

+    /// Accessor to the ``.


Missing name inside the backticks?

adamreichold · 2023-12-13T10:18:52Z

src/postings/postings_writer.rs

    serializer: &mut InvertedIndexSerializer,
 ) -> crate::Result<()> {
    // Replace unordered ids by ordered ids to be able to sort
-    let unordered_id_to_ordered_id: Vec<OrderedPathId> =
-        ctx.path_to_unordered_id.unordered_id_to_ordered_id();
+    let ordered_id_to_path = ctx.path_to_unordered_id.ordered_id_to_path();


Why move the binding up here if it is not used in the loop? I would suggest leaving it below to make that clear to the reader.

PSeitz · 2023-12-13

The Vec is reused in another method.

adamreichold · 2023-12-13

Isn't it unordered_id_to_ordered_id that is reused in serialize_segment_fields? I was thinking about the binding for ordered_id_to_path, which moved up here from line 79 without any change to how it is used.

adamreichold · 2023-12-13T10:27:59Z

src/field_list/mod.rs

+                    // In this case we need to map the potential fast field to the field name
+                    // accepted by the query parser.
+                    let create_canonical =
+                        !field_entry.is_expand_dots_enabled() && json_path.contains('.');


It seems dangerous to pass the name (which is derived from field_entry) as an argument but access field_entry as a capture. Maybe build_path should be a free function taking field_entry and json_path (and map_to_canonical) as arguments to make it obvious what it accesses and how.

adamreichold · 2023-12-13T10:33:22Z

src/field_list/mod.rs

+        write_field(field_metadata, &mut payload);
+    }
+    let compression_level = 3;
+    let payload_compressed = zstd::stream::encode_all(&mut &payload[..], compression_level)


Since you are compressing from full buffers to full buffers, Zstd's bulk API might yield better results (as @trinity-1686a found elsewhere IIRC).

trinity-1686a · 2023-12-13 (edited)

See #1946 (comment) and #1946 (comment). This might be more of an issue when decompressing than compressing, but either way the bulk API is faster, so if it's easy to use, we should use it.

adamreichold · 2023-12-13T10:34:52Z

src/field_list/mod.rs

+    mut reader: R,
+) -> io::Result<impl Iterator<Item = io::Result<FieldMetadata>>> {
+    let format_version = read_exact_array::<_, 1>(&mut reader)?[0];
+    assert_eq!(format_version, 1);


Shouldn't this be an error instead of a panic?
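
A minimal sketch of the non-panicking variant, assuming the version is a single leading byte as in the diff above. `read_format_version` and `FORMAT_VERSION` are hypothetical names, and `ErrorKind::InvalidData` is just one reasonable choice of error kind:

```rust
use std::io::{self, Read};

/// Hypothetical constant; the diff above hard-codes `1`.
const FORMAT_VERSION: u8 = 1;

/// Reads the one-byte format version and rejects unknown versions
/// with an `io::Error` instead of panicking.
fn read_format_version<R: Read>(reader: &mut R) -> io::Result<u8> {
    let mut buf = [0u8; 1];
    reader.read_exact(&mut buf)?;
    let format_version = buf[0];
    if format_version != FORMAT_VERSION {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!(
                "unsupported field list format version {format_version}, \
                 expected {FORMAT_VERSION}"
            ),
        ));
    }
    Ok(format_version)
}
```

A segment written by a newer version then surfaces as an ordinary error that the caller can report or recover from, rather than aborting the process.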

adamreichold · 2023-12-13T10:37:42Z

src/field_list/mod.rs

+}
+
+/// Reads the Split fields from a stream of bytes
+fn read_split_fields_from_zstd<R: Read>(


To be honest, I would suggest merging this into read_split_fields. The name was surprising to me as it does nothing specific to Zstd and I think following the logic of constructing the iterator is easier if it is not split over multiple functions.

(If you do go for Zstd's bulk API, this would probably become simpler as well due to working on a full in-memory representation (if that is not considered too memory hungry) which can simply be chunked instead of calling read_exact repeatedly.)

adamreichold · 2023-12-13T10:40:20Z

src/field_list/mod.rs

+) -> io::Result<impl Iterator<Item = io::Result<FieldMetadata>>> {
+    let mut num_fields = u32::from_le_bytes(read_exact_array::<_, 4>(&mut reader)?);
+
+    Ok(std::iter::from_fn(move || {


I am not sure whether this breaks anything w.r.t. error handling, but wouldn't something like

Ok((0..num_fields).map(move |_| read_field(&mut reader)))

work as well?
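
A self-contained sketch of the range-based variant. `read_field` here is a hypothetical stand-in that decodes one little-endian `u16` per record, not the PR's actual field decoder, and `read_fields` is likewise an illustrative name:

```rust
use std::io::{self, Read};

/// Hypothetical stand-in for the PR's `read_field`:
/// each record is one little-endian u16.
fn read_field<R: Read>(reader: &mut R) -> io::Result<u16> {
    let mut buf = [0u8; 2];
    reader.read_exact(&mut buf)?;
    Ok(u16::from_le_bytes(buf))
}

/// Reads a little-endian u32 record count, then lazily yields that many records.
fn read_fields<R: Read>(mut reader: R) -> io::Result<impl Iterator<Item = io::Result<u16>>> {
    let mut len = [0u8; 4];
    reader.read_exact(&mut len)?;
    let num_fields = u32::from_le_bytes(len);
    // The closure owns `reader` and is called once per remaining record.
    Ok((0..num_fields).map(move |_| read_field(&mut reader)))
}
```

One difference to keep in mind: this iterator keeps calling `read_field` for the remaining count even after an `Err`, so it relies on the consumer stopping at the first error. Collecting into `io::Result<Vec<_>>`, as `segment_reader.rs` does, short-circuits there.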

Store List of Fields in Segment

bb57e63

Fields may be encoded in the columnar storage or in the inverted index for JSON fields. Add a new segment file that contains the list of fields (schema + encoded).

adamreichold reviewed Dec 13, 2023

PSeitz marked this pull request as draft December 14, 2023 07:11
