Convert lexer to iterate over bytes instead of chars #3291

Open
overlookmotel opened this issue May 15, 2024 · 0 comments
Labels
A-parser Area - Parser C-performance Category - Solution not expected to change functional behavior, only performance E-Help Wanted Experience level - For the experienced collaborators

Comments

@overlookmotel (Collaborator)

Why the lexer is slower than it could be

The `Chars` iterator is really slow. The lexer should iterate byte-by-byte rather than char-by-char.

In almost all cases we're only matching against ASCII characters anyway (e.g. `self.next_eq('.')`), so calculating the Unicode `char` is completely pointless: we only care whether the byte is an ASCII `.`. Surprisingly, the compiler seems unable to see this and doesn't optimize out the `char` decoding itself; it generates a lot of pointless, slow code.
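To illustrate why the `char` decoding is unnecessary for ASCII matching, here is a minimal standalone sketch (not oxc's code; function names are invented for illustration). In UTF-8, bytes `0x00`–`0x7F` only ever appear as complete one-byte characters, so comparing the raw byte against an ASCII byte gives the same answer as decoding a full `char` first:

```rust
// Char-based check: decodes a full Unicode `char` just to compare it with '.'.
fn starts_with_dot_chars(s: &str) -> bool {
    s.chars().next() == Some('.')
}

// Byte-based check: reads one byte; no UTF-8 decoding needed, because an
// ASCII byte can never be part of a multi-byte UTF-8 sequence.
fn starts_with_dot_bytes(s: &str) -> bool {
    s.as_bytes().first() == Some(&b'.')
}
```

Both functions agree on every input, including strings that start with multi-byte characters, but the byte version compiles to a single load-and-compare.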

How to make it faster

`Source` was introduced with the intention of easing the transition to byte-by-byte iteration through the source code.

#2352 got some heavy perf improvements from using it (along with other tricks), but I have not been able to find the time to complete the work.

The following APIs should be converted to take/return u8 and use Source::peek_byte instead of Source::peek_char:

  • peek
  • peek2
  • next_eq
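As a sketch of the direction, here is a minimal byte-based `peek_byte`/`next_eq` on a `Source`-like type. This is a hypothetical illustration, not oxc's actual implementation; the field and method names are assumptions based on the descriptions above:

```rust
// Minimal stand-in for the lexer's `Source` type (illustration only).
struct Source<'a> {
    bytes: &'a [u8],
    pos: usize,
}

impl<'a> Source<'a> {
    // Return the next byte without advancing and without decoding a `char`.
    fn peek_byte(&self) -> Option<u8> {
        self.bytes.get(self.pos).copied()
    }

    // Advance only when the next byte equals the expected ASCII byte.
    // An ASCII byte is always a whole UTF-8 character, so `pos` remains
    // on a character boundary, preserving the invariant safely.
    fn next_eq(&mut self, byte: u8) -> bool {
        debug_assert!(byte.is_ascii());
        if self.peek_byte() == Some(byte) {
            self.pos += 1;
            true
        } else {
            false
        }
    }
}
```

The key point is that `next_eq` only ever advances past a byte it has just confirmed is ASCII, so no `unsafe` is needed to uphold the boundary invariant.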

Usages of the following APIs should be refactored away, unless they're directly preceded by a peek:

  • next_char
  • consume_char

Suggested implementation

Source::next_byte is an unsafe function, as it can break the invariant that `Source` must always be positioned on a UTF-8 character boundary (i.e. not in the middle of a multi-byte Unicode character). It's preferable not to use it, to avoid littering the lexer with unsafe code.
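To see the boundary invariant concretely, the standard library's `str::is_char_boundary` reports which byte offsets are valid positions. A two-byte character like `é` has no boundary at its second byte, which is exactly the position a blind one-byte advance could land on (hedged illustration, not oxc code):

```rust
// A position is only a valid resting point for the lexer if it sits on a
// UTF-8 character boundary; the start and end of the string always qualify.
fn is_safe_position(s: &str, pos: usize) -> bool {
    s.is_char_boundary(pos)
}
```

For `"é."`, positions 0 and 2 are boundaries, but position 1 falls inside the two-byte encoding of `é`, so advancing one byte from position 0 would leave `Source` in an invalid state.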

However, `Source::next_char` is written to be much more transparent to the compiler than `Chars::next` is, so the compiler is able to optimize safe code like:

```rust
let dot_was_consumed = match lexer.source.peek_byte() {
  b'.' => {
    lexer.consume_char().unwrap();
    true
  }
  _ => false,
};
```

down to the equivalent of:

```rust
let dot_was_consumed = match lexer.source.peek_byte() {
  b'.' => {
    unsafe {
      // This is a single assembly instruction
      lexer.source.set_position(lexer.source.position().add(1));
    }
    true
  }
  _ => false,
};
```

(at least that's what I remember from a few months ago when I checked it with Godbolt).

(originally mentioned in #3250 (comment))

@overlookmotel overlookmotel added the C-performance Category - Solution not expected to change functional behavior, only performance label May 15, 2024
@overlookmotel overlookmotel assigned Boshen and unassigned Boshen May 15, 2024
@overlookmotel overlookmotel added the A-parser Area - Parser label May 15, 2024
@Boshen Boshen added the E-Help Wanted Experience level - For the experienced collaborators label May 17, 2024