Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose incremental utf8 decoding APIs #2408

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

rnjtranjan
Copy link
Collaborator

No description provided.

@rnjtranjan rnjtranjan added this to the 0.9.1 milestone Jun 2, 2023
@@ -75,18 +75,25 @@
--
module Streamly.Unicode.Stream
(
DecodeState
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before exposing:

  1. Check the naming
  2. Check good documentation
  3. Check naming consistency
  4. Optional: Benchmark and Test

@adithyaov adithyaov self-requested a review June 21, 2023 11:18
@rnjtranjan rnjtranjan force-pushed the expose_incremental_utf8_decoding_APIs branch from 6a20721 to 5fa6d57 Compare June 21, 2023 11:35
@adithyaov adithyaov force-pushed the expose_incremental_utf8_decoding_APIs branch from 528ba6e to 4e308f1 Compare July 24, 2023 20:09
@adithyaov adithyaov linked an issue Jul 24, 2023 that may be closed by this pull request
@adithyaov adithyaov force-pushed the expose_incremental_utf8_decoding_APIs branch from 4e308f1 to 6c754f4 Compare July 24, 2023 20:11
-- For multi-byte characters, the decoding state indicates the number of bytes
-- remaining to complete the character. It is usually initialized to a non-zero
-- value corresponding to the number of bytes in the multi-byte character, e.g
-- DecodeState will be 1 for 2-bytes char.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This documentation does not make much sense to the user of the library. Need to write which APIs use this, where does this come from, what values need to be supplied etc. The information is to be used to correctly understand how to use the APIs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha.

-- Calculate the code point value: Depending on the type of the leading byte,
-- extract the significant bits from each byte of the sequence and combine them
-- to form the complete code point value. The specific bit manipulations will
-- differ based on the number of bytes used.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not define a codepoint here. Need to say what it means in the context of the APIs and to be able to use the APIs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, gotcha.

@adithyaov adithyaov force-pushed the expose_incremental_utf8_decoding_APIs branch from d8470ed to 3692740 Compare July 25, 2023 06:21
-- ** Resumable UTF-8 Decoding
, DecodeError(..)
, DecodeState
, CodePoint
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can return a more intelligent decode error:

data DecodeUTF8Error = DecodeUTF8Incomplete Word32 |  DecodeUTF8NonStarter Word8 | DecodeUTF8Invalid

This should be enough to build a resumable decoder. When resuming we should supply the Word32 from DecodeUTF8Incomplete.

@harendra-kumar harendra-kumar modified the milestones: 0.9.1, 0.11.0 Nov 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Expose incremental utf8 decoding APIs
3 participants