Convert lexer to iterate over bytes instead of chars #3291

Open
overlookmotel opened this issue May 15, 2024 · 0 comments
Labels
A-parser Area - Parser C-performance Category - Solution not expected to change functional behavior, only performance E-Help Wanted Experience level - For the experienced collaborators

Comments

@overlookmotel (Collaborator)

Why the lexer is slower than it could be

The `Chars` iterator is really slow. The lexer should iterate byte-by-byte rather than char-by-char.

In almost all cases we're only matching against ASCII characters anyway (e.g. `self.next_eq('.')`), so calculating the Unicode `char` is completely pointless: we only care whether the byte is an ASCII `.`. Surprisingly, the compiler seems unable to see this and doesn't optimize out the `char` decoding itself; it generates a lot of pointless, slow code.
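To illustrate why the `char` decoding is unnecessary for ASCII matching, here is a minimal standalone sketch (not oxc's code; function names are invented for illustration). In UTF-8, bytes `0x00`–`0x7F` only ever appear as complete one-byte characters, so comparing the raw byte against an ASCII byte gives the same answer as decoding a full `char` first:

```rust
// Char-based check: decodes a full Unicode `char` just to compare it with '.'.
fn starts_with_dot_chars(s: &str) -> bool {
    s.chars().next() == Some('.')
}

// Byte-based check: reads one byte; no UTF-8 decoding needed, because an
// ASCII byte can never be part of a multi-byte UTF-8 sequence.
fn starts_with_dot_bytes(s: &str) -> bool {
    s.as_bytes().first() == Some(&b'.')
}
```

Both functions agree on every input, including strings that start with multi-byte characters, but the byte version compiles to a single load-and-compare.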

How to make it faster

`Source` was introduced with the intention of easing the transition to byte-by-byte iteration through the source code.

#2352 got some heavy perf improvements from using it (along with other tricks), but I have not been able to find the time to complete the work.

The following APIs should be converted to take/return u8 and use Source::peek_byte instead of Source::peek_char:

  • peek
  • peek2
  • next_eq
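As a sketch of the direction, here is a minimal byte-based `peek_byte`/`next_eq` on a `Source`-like type. This is a hypothetical illustration, not oxc's actual implementation; the field and method names are assumptions based on the descriptions above:

```rust
// Minimal stand-in for the lexer's `Source` type (illustration only).
struct Source<'a> {
    bytes: &'a [u8],
    pos: usize,
}

impl<'a> Source<'a> {
    // Return the next byte without advancing and without decoding a `char`.
    fn peek_byte(&self) -> Option<u8> {
        self.bytes.get(self.pos).copied()
    }

    // Advance only when the next byte equals the expected ASCII byte.
    // An ASCII byte is always a whole UTF-8 character, so `pos` remains
    // on a character boundary, preserving the invariant safely.
    fn next_eq(&mut self, byte: u8) -> bool {
        debug_assert!(byte.is_ascii());
        if self.peek_byte() == Some(byte) {
            self.pos += 1;
            true
        } else {
            false
        }
    }
}
```

The key point is that `next_eq` only ever advances past a byte it has just confirmed is ASCII, so no `unsafe` is needed to uphold the boundary invariant.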

Usages of the following APIs should be refactored away, unless they're directly preceded by a peek:

  • next_char
  • consume_char

Suggested implementation

Source::next_byte is an unsafe function, as it can break the invariant that `Source` must always be positioned on a UTF-8 character boundary (i.e. not in the middle of a multi-byte Unicode character). It's preferable not to use it, to avoid littering the lexer with unsafe code.
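To see the boundary invariant concretely, the standard library's `str::is_char_boundary` reports which byte offsets are valid positions. A two-byte character like `é` has no boundary at its second byte, which is exactly the position a blind one-byte advance could land on (hedged illustration, not oxc code):

```rust
// A position is only a valid resting point for the lexer if it sits on a
// UTF-8 character boundary; the start and end of the string always qualify.
fn is_safe_position(s: &str, pos: usize) -> bool {
    s.is_char_boundary(pos)
}
```

For `"é."`, positions 0 and 2 are boundaries, but position 1 falls inside the two-byte encoding of `é`, so advancing one byte from position 0 would leave `Source` in an invalid state.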

However, `Source::next_char` is written to be much more transparent to the compiler than `Chars::next` is, so the compiler is able to optimize safe code like:

```rust
let dot_was_consumed = match lexer.source.peek_byte() {
  b'.' => {
    lexer.consume_char().unwrap();
    true
  }
  _ => false,
};
```

down to the equivalent of:

```rust
let dot_was_consumed = match lexer.source.peek_byte() {
  b'.' => {
    unsafe {
      // This is a single assembly instruction
      lexer.source.set_position(lexer.source.position().add(1));
    }
    true
  }
  _ => false,
};
```

(at least that's what I remember from a few months ago when I checked it with Godbolt).

(originally mentioned in #3250 (comment))

@overlookmotel overlookmotel added the C-performance Category - Solution not expected to change functional behavior, only performance label May 15, 2024
@overlookmotel overlookmotel assigned Boshen and unassigned Boshen May 15, 2024
@overlookmotel overlookmotel added the A-parser Area - Parser label May 15, 2024
@Boshen Boshen added the E-Help Wanted Experience level - For the experienced collaborators label May 17, 2024