Simplify utf8.c #102424

am11 · 2024-05-19T20:23:34Z

diet version 🤸

am11 · 2024-05-21T06:14:04Z

cc @AaronRobinsonMSFT, re: #100451 (comment) - this mainly reduces the assembly code with cl.exe -O2 (before 1091 lines, now 369 for minipal_convert_utf8_to_utf16). Could you please run your benchmark to see how it compares against the MultiByteToWideChar/WideCharToMultiChar win32 APIs? Hopefully it will perform better now (at least better than before 😅).

am11 · 2024-05-21T08:08:42Z

src/native/minipal/utf8.c

    }

-    // May have a problem if we have to flush
-    if (ch != 0)
+    if (sourceIndex < sourceLength && (destinationIndex > destinationLength || (destinationIndex == destinationLength && sourceIndex + 1 < sourceLength)))


@lambdageek, just an FYI, I had to change this condition from

if (sourceIndex < sourceLength && (destinationIndex >= destinationLength)

for mono because this case was failing:

runtime/src/libraries/System.Runtime/tests/System.Runtime.Tests/System/AttributeTests.cs

Line 169 in 2bfd1b4

Assert.NotNull(stringArg);

IMHO, the reflection stack of mono should use the same expectation from giconv about the expected lengths as coreclr's reflection stack makes from coreclr PAL's unicode.cpp, so this standalone component doesn't need to handle more edge cases than necessary.

/cc @fanyang-mono

The test that @am11 mentioned was testing the custom attribute value string "\uDFFF". For mono, it went through minipal_convert_utf8_to_utf16 to decode this string "\xed\xbf\xbf". It seems to me that CoreCLR went through a different mechanism to get the value of the custom attribute value.

Additionally, what Mono is doing in g_utf8_to_utf16_impl of giconv.c isn't a lot different than what is done in MultiByteToWideChar of unicode.cpp. If I understand correctly, the only difference is that in MultiByteToWideChar, it makes source length and destination length one byte longer than it is. Doing this in mono won't stop this check ((sourceIndex < sourceLength) && (destinationIndex >= destinationLength)) either.

@fanyang-mono, coreclr ultimately uses the same implementation (couple of wrappers later), but in reflection stack (metasig.h etc.), there it compensates for those off-by-one, zero-length kind of scenarios to match the expectation.

@am11 Do you mind pointing me to the CoreCLR source code where they handles off-by-one, zero-length kind of scenarios?

@fanyang-mono, good question. 😅 It is a bit tricky to discern the exact place without actually attaching a conditional debugger (or some tricky way), but I was debugging the similar tests for coreclr last year #85558 (comment) which points to

runtime/src/coreclr/md/compiler/regmeta_import.cpp

Line 705 in d8c59a4

RegMeta::GetTypeDefProps(

One thing I do remember; when all PAL tests were passing and 99% of managed tests were passingm those edge cases covered in reflection tests were still failing, which is why I had to add a similar special condition in unicode.cpp -> utf8.c conversion to balance varied expectations of mono and coreclr + reflection simultaneously.

It is probably fine having this a little bit longer check, because to make mono align with CoreCLR on this is non-trivial work.

AaronRobinsonMSFT

Thanks. I will take a look.

src/native/minipal/utf8.c

Co-authored-by: Aaron Robinson <arobins@microsoft.com>

jkotas · 2024-05-22T02:48:02Z

Could you please run your benchmark to see how it compares against the MultiByteToWideChar/WideCharToMultiChar win32 APIs?

int main()
{
    char* p = (char*)malloc(200);
    memset(p, 'a', 200);

    for (;;)
    {
        int start = GetTickCount();
        for (int i = 0; i < 10000000; i++)
#if 0
            minipal_get_length_utf8_to_utf16(p, 200, 0);
#else
            MultiByteToWideChar(CP_UTF8, 0, p, 200, NULL, 0);
#endif
        int end = GetTickCount();
        printf("%d\n", end - start);
    }
}

MultiByteToWideChar: 220ms per iteration
minipal_get_length_utf8_to_utf16: 1370ms per iteration

am11 · 2024-05-22T19:44:00Z

Ran some google benchmarks: https://gist.github.com/am11/8d949f6c112aaab1575e85f2f350c59e

Long length: 140
Short length: 20

main:

-------------------------------------------------------------------------------------------------------------
Benchmark                                                   Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------
BM_minipal_get_length_utf16_to_utf8/ASCII_Short          46.8 ns         46.3 ns     15200142 items_per_second=388.583M/s
BM_minipal_get_length_utf16_to_utf8/NonASCII_Short       49.2 ns         48.4 ns     14081759 items_per_second=288.974M/s
BM_minipal_get_length_utf16_to_utf8/ASCII_Long           48.5 ns         47.9 ns     14615732 items_per_second=2.92203G/s
BM_minipal_get_length_utf16_to_utf8/NonASCII_Long         167 ns          164 ns      4267226 items_per_second=854.207M/s
BM_minipal_convert_utf16_to_utf8/ASCII_Short             66.7 ns         65.7 ns     10535346 items_per_second=273.86M/s
BM_minipal_convert_utf16_to_utf8/NonASCII_Short          98.3 ns         97.0 ns      7215230 items_per_second=144.395M/s
BM_minipal_convert_utf16_to_utf8/ASCII_Long               135 ns          133 ns      5226338 items_per_second=1.05422G/s
BM_minipal_convert_utf16_to_utf8/NonASCII_Long            565 ns          556 ns      1261443 items_per_second=251.616M/s
BM_minipal_get_length_utf8_to_utf16/ASCII_Short          46.5 ns         45.7 ns     15311624 items_per_second=393.657M/s
BM_minipal_get_length_utf8_to_utf16/NonASCII_Short       56.6 ns         55.8 ns     12379302 items_per_second=573.563M/s
BM_minipal_get_length_utf8_to_utf16/ASCII_Long           47.7 ns         47.2 ns     15069416 items_per_second=2.96665G/s
BM_minipal_get_length_utf8_to_utf16/NonASCII_Long         118 ns          116 ns      5989100 items_per_second=1.20323G/s
BM_minipal_convert_utf8_to_utf16/ASCII_Short             61.5 ns         60.9 ns     11373792 items_per_second=295.365M/s
BM_minipal_convert_utf8_to_utf16/NonASCII_Short           126 ns          124 ns      5599462 items_per_second=257.461M/s
BM_minipal_convert_utf8_to_utf16/ASCII_Long               109 ns          108 ns      6452981 items_per_second=1.30072G/s
BM_minipal_convert_utf8_to_utf16/NonASCII_Long            260 ns          258 ns      2722687 items_per_second=542.822M/s

PR:

-------------------------------------------------------------------------------------------------------------
Benchmark                                                   Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------
BM_minipal_get_length_utf16_to_utf8/ASCII_Short          46.3 ns         46.2 ns     15294729 items_per_second=389.687M/s
BM_minipal_get_length_utf16_to_utf8/NonASCII_Short       46.7 ns         46.7 ns     15014757 items_per_second=300.068M/s
BM_minipal_get_length_utf16_to_utf8/ASCII_Long           89.1 ns         88.2 ns      7965951 items_per_second=1.58799G/s
BM_minipal_get_length_utf16_to_utf8/NonASCII_Long         167 ns          167 ns      4209817 items_per_second=840.11M/s
BM_minipal_convert_utf16_to_utf8/ASCII_Short             57.9 ns         57.8 ns     12086261 items_per_second=311.416M/s
BM_minipal_convert_utf16_to_utf8/NonASCII_Short          80.0 ns         79.9 ns      8745736 items_per_second=175.196M/s
BM_minipal_convert_utf16_to_utf8/ASCII_Long               155 ns          154 ns      4546517 items_per_second=907.604M/s
BM_minipal_convert_utf16_to_utf8/NonASCII_Long            489 ns          488 ns      1427791 items_per_second=286.914M/s
BM_minipal_get_length_utf8_to_utf16/ASCII_Short          47.9 ns         47.1 ns     14990417 items_per_second=382.482M/s
BM_minipal_get_length_utf8_to_utf16/NonASCII_Short       65.7 ns         65.6 ns     10645254 items_per_second=487.782M/s
BM_minipal_get_length_utf8_to_utf16/ASCII_Long            131 ns          130 ns      5443066 items_per_second=1.07377G/s
BM_minipal_get_length_utf8_to_utf16/NonASCII_Long         180 ns          179 ns      3938780 items_per_second=781.547M/s
BM_minipal_convert_utf8_to_utf16/ASCII_Short             73.2 ns         72.9 ns      9594561 items_per_second=246.863M/s
BM_minipal_convert_utf8_to_utf16/NonASCII_Short           109 ns          108 ns      6488872 items_per_second=296.073M/s
BM_minipal_convert_utf8_to_utf16/ASCII_Long               282 ns          281 ns      2457149 items_per_second=497.751M/s
BM_minipal_convert_utf8_to_utf16/NonASCII_Long            298 ns          296 ns      2353052 items_per_second=472.879M/s

BM_minipal_get_length_utf16_to_utf8/ASCII_Long has gotten the worse performance regression. I tried fast tracking ASCII with:

    // Use vectorized operations for ASCII characters

    typedef uint16_t v4ui __attribute__ ((vector_size (8)));

    for (; sourceIndex < sourceLength && destinationIndex < destinationLength; sourceIndex += 4, destinationIndex += 4)
    {
        v4ui data = *(v4ui*)&source[sourceIndex];
        v4ui mask4 = (v4ui){0x007F, 0x007F, 0x007F, 0x007F};
        v4ui comparison = data <= mask4;
        size_t remainingLength = sourceLength - sourceIndex;

        if (remainingLength > 4 && __builtin_expect(comparison[0] & comparison[1] & comparison[2] & comparison[3], 1))
        {
            destination[destinationIndex] = (char)data[0];
            destination[destinationIndex + 1] = (char)data[1];
            destination[destinationIndex + 2] = (char)data[2];
            destination[destinationIndex + 3] = (char)data[3];
            continue;
        }
        else if (remainingLength <= 4)
        {
            // Handle the remaining 4 or less elements if ASCII
            if (remainingLength == 4) {
                if (__builtin_expect(comparison[0] & comparison[1] & comparison[2] & comparison[3], 1))
                {
                    destination[destinationIndex++] = (char)data[0];
                    destination[destinationIndex++] = (char)data[1];
                    destination[destinationIndex++] = (char)data[2];
                    destination[destinationIndex++] = (char)data[3];
                    return destinationIndex;
                }
                break;
            }

            v4ui mask3 = (v4ui){0x007F, 0x007F, 0x007F};
            v4ui mask2 = (v4ui){0x007F, 0x007F};

            comparison = data <= mask3;
            if (remainingLength == 3) {
                if( __builtin_expect(comparison[0] & comparison[1] & comparison[2], 1)) {
                    destination[destinationIndex++] = (char)data[0];
                    destination[destinationIndex++] = (char)data[1];
                    destination[destinationIndex++] = (char)data[2];
                    return destinationIndex;
                }
                break;
            }

            comparison = data <= mask2;
            if (remainingLength == 2) {
                if(__builtin_expect(comparison[0] & comparison[1], 1)) {
                    destination[destinationIndex++] = (char)data[0];
                    destination[destinationIndex++] = (char)data[1];
                    return destinationIndex;
                }
                break;
            }

            if (source[sourceIndex] < 0x80){
                destination[destinationIndex++] = (char)source[sourceIndex];
                return destinationIndex;
            }
        }

        // Break out of the outer loop here because we know it's non-ASCII
        break;
    }

    // Handle non-ASCII and mixed cases
    while (sourceIndex < sourceLength && destinationIndex < destinationLength)
    {
...

but that did not move the needle (+/- 2ns diff).

Haven't given up on fine-tuning yet, but overall I think the new implementation is simple and more readable. So it's a balancing question of readability vs. performance (I'm pursuing both 🙂).

dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label May 19, 2024

dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label May 19, 2024

am11 added area-PAL-coreclr and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels May 19, 2024

This was referenced May 20, 2024

Cannot find 'arm64-v8a' device dotnet/dnceng#2284

Open

Dead lettering tests #101524

Open

Simplify utf8.c

020c8ed

am11 force-pushed the feature/minipal/utf8-v2 branch from 9fc22ee to 020c8ed Compare May 21, 2024 05:46

am11 marked this pull request as ready for review May 21, 2024 05:47

This was referenced May 21, 2024

Abort on mono in SwiftErrorHandling #102478

Closed

SupportedLinuxPlatforms_IsSupportedIsTrue failed in CI #102479

Closed

am11 requested review from jkotas and AaronRobinsonMSFT May 21, 2024 08:04

am11 commented May 21, 2024

View reviewed changes

AaronRobinsonMSFT reviewed May 21, 2024

View reviewed changes

src/native/minipal/utf8.c Show resolved Hide resolved

src/native/minipal/utf8.c Outdated Show resolved Hide resolved

src/native/minipal/utf8.c Outdated Show resolved Hide resolved

am11 commented May 21, 2024

View reviewed changes

src/native/minipal/utf8.c Show resolved Hide resolved

Apply suggestions from code review

e1f7b59

Co-authored-by: Aaron Robinson <arobins@microsoft.com>

This was referenced May 21, 2024

Test failure: Unknown chain building error in System.Net.Security.CertificateValidationRemoteServer.ConnectWithRevocation_WithCallback #101835

Open

NativeAOT legs timing out in CI #102239

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Simplify utf8.c #102424

Simplify utf8.c #102424

am11 commented May 19, 2024

am11 commented May 21, 2024 •

edited

am11 May 21, 2024

AaronRobinsonMSFT May 21, 2024

fanyang-mono May 31, 2024 •

edited

am11 May 31, 2024

fanyang-mono May 31, 2024

am11 May 31, 2024

fanyang-mono Jun 3, 2024

AaronRobinsonMSFT left a comment

jkotas commented May 22, 2024

am11 commented May 22, 2024 •

edited

Simplify utf8.c #102424

Are you sure you want to change the base?

Simplify utf8.c #102424

Conversation

am11 commented May 19, 2024

am11 commented May 21, 2024 • edited

am11 May 21, 2024

Choose a reason for hiding this comment

AaronRobinsonMSFT May 21, 2024

Choose a reason for hiding this comment

fanyang-mono May 31, 2024 • edited

Choose a reason for hiding this comment

am11 May 31, 2024

Choose a reason for hiding this comment

fanyang-mono May 31, 2024

Choose a reason for hiding this comment

am11 May 31, 2024

Choose a reason for hiding this comment

fanyang-mono Jun 3, 2024

Choose a reason for hiding this comment

AaronRobinsonMSFT left a comment

Choose a reason for hiding this comment

jkotas commented May 22, 2024

am11 commented May 22, 2024 • edited

am11 commented May 21, 2024 •

edited

fanyang-mono May 31, 2024 •

edited

am11 commented May 22, 2024 •

edited