Typed dict throws KeyError when keys contain any UTF-8 character ends with `\xb8\x80` #9542

M0gician · 2024-04-24T03:42:58Z

Reporting a bug

A Numba typed dict whose key type is UnicodeCharSeq() of any length doesn't correctly handle UTF-8 characters that end with \xb8\x80. It seems that any of these characters is cast to an empty string when __getitem__ is called, resulting in a KeyError.

Minimum Reproduction Demo

import numba

a = numba.typed.typeddict.Dict.empty(numba.types.UnicodeCharSeq(1), numba.int64)
a['一'] = 10    # \xe4\xb8\x80
print(a)

This demo also reproduces the error with other UTF-8 characters such as 㸀 (\xe3\xb8\x80), 渀 (\xe6\xb8\x80), 縀 (\xe7\xb8\x80), and 帀 (\xe5\xb8\x80).
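
As a quick sanity check on the characters listed above, here is a small sketch (standard library only, not part of the original report). It decodes the UTF-8 byte sequences quoted in this comment and prints each code point next to its UTF-32 little-endian bytes. Showing UTF-32-LE is a choice made for illustration, because that is the layout that appears in the NumPy byte dump later in this thread. All five code points have a zero low byte.

```python
# UTF-8 byte sequences exactly as quoted in the report.
utf8_forms = [
    b"\xe4\xb8\x80",
    b"\xe3\xb8\x80",
    b"\xe6\xb8\x80",
    b"\xe7\xb8\x80",
    b"\xe5\xb8\x80",
]

low_bytes = []
for raw in utf8_forms:
    ch = raw.decode("utf-8")
    code_point = ord(ch)
    # Keep the lowest byte of the code point for inspection.
    low_bytes.append(code_point & 0xFF)
    # UTF-32-LE is used here only to illustrate a fixed-width layout.
    print(ch, f"U+{code_point:04X}", ch.encode("utf-32-le"))

print(low_bytes)
```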

Error Message

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...\.vscode\extensions\ms-python.python-2024.4.1\python_files\pythonrc.py", line 22, in my_displayhook
    self.original_displayhook(value)
  File "...\mamba\lib\site-packages\numba\typed\typeddict.py", line 217, in __repr__
    body = str(self)
  File "...\mamba\lib\site-packages\numba\typed\typeddict.py", line 212, in __str__
    for k, v in self.items():
  File "...\mamba\lib\_collections_abc.py", line 911, in __iter__
    yield (key, self._mapping[key])
  File "...\mamba\lib\site-packages\numba\typed\typeddict.py", line 180, in __getitem__
    return _getitem(self, key)
  File "...\mamba\lib\site-packages\numba\typed\dictobject.py", line 783, in impl
    raise KeyError()
KeyError

numba 0.59.1

[x] I have tried using the latest released version of Numba (most recent is visible in the release notes: https://numba.readthedocs.io/en/stable/release-notes-overview.html).
[x] I have included a self contained code sample to reproduce the problem, i.e. it's possible to run as 'python bug.py'.

esc · 2024-04-26T09:04:28Z

@M0gician thank you for the report, I can reproduce this. Maybe using #9530 will yield some more insight into why this is going wrong.

M0gician · 2024-04-26T09:20:58Z

Oh, I've done something similar on my end. The key here will be converted into an empty string.

KeyError: 'Key "" of type <object type:typeref[[unichr x 1]]> not found in dict'
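
To illustrate why storing the key succeeds while printing the dict fails, here is a toy mapping in plain Python. The class `LossyKeyDict` is hypothetical and has nothing to do with Numba's implementation. It only mimics a container whose iteration hands back an empty string in place of each stored key, which is what the error message above suggests. `items()` iterates the keys and then looks each one up, so the lookup of `""` raises.

```python
from collections.abc import MutableMapping


class LossyKeyDict(MutableMapping):
    """Hypothetical toy mapping: iteration returns a mangled copy of each key."""

    def __init__(self):
        self._data = {}

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]

    def __len__(self):
        return len(self._data)

    def __iter__(self):
        # Simulate the faulty conversion: every key comes back empty.
        for _ in self._data:
            yield ""


d = LossyKeyDict()
d["一"] = 10  # storing the key works

try:
    list(d.items())  # iterates keys, then looks each one up
    outcome = "ok"
except KeyError as exc:
    outcome = f"KeyError({exc.args[0]!r})"

print(outcome)
```

Direct lookup with the original key (`d["一"]`) still works in this toy, which mirrors the observation that only the round trip through iteration loses the key.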

BTW, here is the byte string for the typed dict before it gets unpickled.

b'\x80\x04\x95"\x00\x00\x00\x00\x00\x00\x00\x8c\x15numba.typed.typeddict\x94\x8c\x04Dict\x94\x93\x94.'

Another thing: if you change the typed dict's typing to [numba.types.unicode_type, numba.int64], everything works and there is no KeyError anymore.

esc · 2024-04-26T13:12:56Z

Yes, as a user of typed.Dict you are in control of the key type. I looked into outlines-dev/outlines#833 and probably the issue can be fixed there?

M0gician · 2024-04-26T13:17:49Z

I haven't heard from the outlines team yet, but I did provide a solution to alleviate the issue for now.

But I do have a question: what if I want to pass the keys by a numpy array? What numpy dtype is compatible with numba's unicode_type?

esc · 2024-04-26T13:21:11Z

OK, because I suspect that the type UnicodeCharSeq is either the wrong type or buggy. When I do a numba.typeof I get this:

In [1]: s = '一'

In [2]: import numba

In [3]: numba.typeof(s)
Out[3]: unicode_type

So, to narrow this down, it is possible that this isn't a typed.Dict issue.

esc · 2024-04-26T13:25:10Z

I think I understand now what is being attempted. You want the keys of the dictionary to also be values in a numpy.ndarray? I am not sure yet how to do that, perhaps you'd need to cast the value?

M0gician · 2024-04-26T13:35:18Z

It turns out that a numpy dtype "U" of any length is recognized as "unicode_type" in numba. I'll make a PR on the outlines side to address this issue.

>>> import numpy as np
>>> a = np.array(["一", "二", "三"])
>>> a
array(['一', '二', '三'], dtype='<U1')
>>> aa = numba.typed.typeddict.Dict.empty(numba.types.unicode_type, numba.int64)
>>> for i, c in enumerate(a):
...     aa[c] = i
...
>>> print(aa)
{一: 0, 二: 1, 三: 2}

esc · 2024-04-26T13:48:24Z

@M0gician I actually tried the following, and got stuck:

In [5]: @numba.njit
   ...: def function():
   ...:     s = '一'
   ...:     a = np.array([s])
   ...:     return a
   ...:

In [6]: function()
---------------------------------------------------------------------------
TypingError                               Traceback (most recent call last)
Cell In[6], line 1
----> 1 function()

File ~/git/numba/numba/core/dispatcher.py:423, in _DispatcherBase._compile_for_args(self, *args, **kws)
    419         msg = (f"{str(e).rstrip()} \n\nThis error may have been caused "
    420                f"by the following argument(s):\n{args_str}\n")
    421         e.patch_message(msg)
--> 423     error_rewrite(e, 'typing')
    424 except errors.UnsupportedError as e:
    425     # Something unsupported is present in the user code, add help info
    426     error_rewrite(e, 'unsupported_error')

File ~/git/numba/numba/core/dispatcher.py:364, in _DispatcherBase._compile_for_args.<locals>.error_rewrite(e, issue_type)
    362     raise e
    363 else:
--> 364     raise e.with_traceback(None)

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<built-in function array>) found for signature:

 >>> array(list(unicode_type)<iv=['一']>)

There are 2 candidate implementations:
  - Of which 1 did not match due to:
  Overload in function 'impl_np_array': File: numba/np/arrayobj.py: Line 5432.
    With argument(s): '(list(unicode_type)<iv=None>)':
   Rejected as the implementation raised a specific error:
     TypingError: Failed in nopython mode pipeline (step: nopython frontend)
   No implementation of function Function(<intrinsic np_array>) found for signature:

    >>> np_array(list(unicode_type)<iv=None>, none)

   There are 2 candidate implementations:
         - Of which 2 did not match due to:
         Intrinsic in function 'np_array': File: numba/np/arrayobj.py: Line 5406.
           With argument(s): '(list(unicode_type)<iv=None>, none)':
          Rejected as the implementation raised a specific error:
            NumbaNotImplementedError: unicode_type cannot be represented as a NumPy dtype
     raised from /Users/vhaenel/git/numba/numba/np/numpy_support.py:159

   During: resolving callee type: Function(<intrinsic np_array>)
   During: typing of call at /Users/vhaenel/git/numba/numba/np/arrayobj.py (5443)


   File "numba/np/arrayobj.py", line 5443:
       def impl(object, dtype=None):
           return np_array(object, dtype)
           ^

  raised from /Users/vhaenel/git/numba/numba/core/typeinfer.py:1091
  - Of which 1 did not match due to:
  Overload in function 'impl_np_array': File: numba/np/arrayobj.py: Line 5432.
    With argument(s): '(list(unicode_type)<iv=['一']>)':
   Rejected as the implementation raised a specific error:
     TypingError: Failed in nopython mode pipeline (step: nopython frontend)
   No implementation of function Function(<intrinsic np_array>) found for signature:

    >>> np_array(list(unicode_type)<iv=['一']>, none)

   There are 2 candidate implementations:
         - Of which 2 did not match due to:
         Intrinsic in function 'np_array': File: numba/np/arrayobj.py: Line 5406.
           With argument(s): '(list(unicode_type)<iv=None>, none)':
          Rejected as the implementation raised a specific error:
            NumbaNotImplementedError: unicode_type cannot be represented as a NumPy dtype
     raised from /Users/vhaenel/git/numba/numba/np/numpy_support.py:159

   During: resolving callee type: Function(<intrinsic np_array>)
   During: typing of call at /Users/vhaenel/git/numba/numba/np/arrayobj.py (5443)


   File "numba/np/arrayobj.py", line 5443:
       def impl(object, dtype=None):
           return np_array(object, dtype)
           ^

  raised from /Users/vhaenel/git/numba/numba/core/typeinfer.py:1091

During: resolving callee type: Function(<built-in function array>)
During: typing of call at <ipython-input-5-726a1c6ac68f> (4)


File "<ipython-input-5-726a1c6ac68f>", line 4:
def function():
    <source elided>
    s = '一'
    a = np.array([s])

M0gician · 2024-04-26T14:12:18Z

Somehow the following code works, but the original issue of a specific Unicode character being converted into an empty string occurs again:

import numba
import numpy as np

@numba.njit
def function():
    s = np.empty(3, dtype="<U1")
    s[0] = "一"
    s[1] = "二"
    s[2] = "三"
    print(s)
    a = numba.typed.List(s)
    return a

print(function())

['一' '二' '三']
[, 二, 三]

esc · 2024-04-30T14:12:49Z

@M0gician this was discussed in the developer meeting today and something fishy is going on here. This may be a Numba bug after all.

M0gician · 2024-04-30T14:24:05Z

I do agree. There's some inconsistency between handling python objects and numpy objects on the numba side.

When string literals or Python objects like List[str] are passed to a numba typed list or dict, they are treated as unicode_type. However, when they are wrapped in a numpy array, they are mostly treated as unichar, which causes many casting problems.

I did find a way to solve the problem on the outlines side by tweaking the code to pass either Python objects or numpy ones.

Numpy has only one unicode type while numba has two, and it is totally unclear to me when and why the numpy "U" type is cast to one numba unicode type instead of the other. I think it would be better to make them consistent on the numba side.

esc · 2024-04-30T15:00:28Z

Good that you have a workaround for now and yes this sounds like an interesting brain teaser.

sklam · 2024-04-30T19:42:39Z

Found the problem.

First the '一' character is:

>>> np.array(['一']).tobytes()
b'\x00N\x00\x00'
>>> list(map(hex, np.array(['一']).tobytes()))
['0x0', '0x4e', '0x0', '0x0']

The boxer for unicodecharseq has an invalid skip on null bytes, causing the copy to end prematurely:

numba/numba/core/boxing.py

Lines 237 to 238 in e467ae6

        # If the char is a non-null-byte, store the next index as count
        with c.builder.if_then(cgutils.is_not_null(c.builder, ch)):
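
The effect of that check can be sketched in plain Python. This is a simulation, not Numba code. It assumes 4-byte little-endian code units (as in the `<U1` byte dump above) and assumes the load reads only the first byte of each code unit, which is consistent with `b'\x00N\x00\x00'` being treated as null. The helper names are made up for illustration.

```python
UNIT = 4  # assumed size of one code unit in bytes (UTF-32)


def scan_length_first_byte(buf, width):
    """Simulated length scan: inspect only the first byte of each code unit."""
    count = 0
    for idx in range(width):
        ch = buf[idx * UNIT]  # a single byte, not the whole code unit
        if ch != 0:
            count = idx + 1
    return count


def simulated_box(text, width):
    """Pad text to a fixed-width buffer, then rebuild a str from the scanned length."""
    buf = text.encode("utf-32-le").ljust(width * UNIT, b"\x00")
    n = scan_length_first_byte(buf, width)
    return buf[: n * UNIT].decode("utf-32-le")


for s in ["一", "二", "三"]:
    print(repr(s), "->", repr(simulated_box(s, 1)))
```

Under these assumptions, '一' (U+4E00) comes back empty while '二' and '三' survive, matching the `[, 二, 三]` output reported earlier in the thread.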

The minimal patch is:

diff --git a/numba/core/boxing.py b/numba/core/boxing.py
index 39d2a6047..be6b8eb2b 100644
--- a/numba/core/boxing.py
+++ b/numba/core/boxing.py
@@ -234,9 +234,9 @@ def box_unicodecharseq(typ, val, c):
     with cgutils.loop_nest(c.builder, [fullsize], fullsize.type) as [idx]:
         # Get char at idx
         ch = c.builder.load(c.builder.gep(strptr, [c.builder.mul(idx, step)]))
-        # If the char is a non-null-byte, store the next index as count
-        with c.builder.if_then(cgutils.is_not_null(c.builder, ch)):
-            c.builder.store(c.builder.add(idx, one), count)
+        # # If the char is a non-null-byte, store the next index as count
+        # with c.builder.if_then(cgutils.is_not_null(c.builder, ch)):
+        c.builder.store(c.builder.add(idx, one), count)
     strlen = c.builder.load(count)
     return c.pyapi.string_from_kind_and_data(kind, strptr, strlen)
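
Reading the diff literally, the store becomes unconditional, so the count always ends at the full width. The sketch below simulates that in plain Python, again assuming 4-byte little-endian code units. It also shows one possible refinement, comparing the whole code unit against zero, which is purely a hypothetical alternative and not something proposed in this thread. All function names are invented for illustration.

```python
UNIT = 4  # assumed size of one code unit in bytes (UTF-32)
NULL_UNIT = b"\x00" * UNIT


def scan_patched(buf, width):
    """Mirror of the minimal patch: the count is stored on every iteration."""
    count = 0
    for idx in range(width):
        count = idx + 1
    return count


def scan_full_unit(buf, width):
    """Hypothetical alternative: a unit is null only if all of its bytes are zero."""
    count = 0
    for idx in range(width):
        unit = buf[idx * UNIT:(idx + 1) * UNIT]
        if unit != NULL_UNIT:
            count = idx + 1
    return count


def decode(buf, n_units):
    return buf[: n_units * UNIT].decode("utf-32-le")


buf_one = "一".encode("utf-32-le")                            # width 1
buf_padded = "a".encode("utf-32-le").ljust(3 * UNIT, b"\x00")  # width 3

print(repr(decode(buf_one, scan_patched(buf_one, 1))))
print(repr(decode(buf_padded, scan_patched(buf_padded, 3))))
print(repr(decode(buf_padded, scan_full_unit(buf_padded, 3))))
```

In this simulation the unconditional store recovers '一', but it also keeps trailing padding for strings shorter than the declared width, whereas the full-unit comparison trims the padding and still keeps '一'.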

However, there's another problem: the boxer is returning a str instead of the unicode-charseq dtype:

numba/numba/core/boxing.py

Line 241 in e467ae6

    return c.pyapi.string_from_kind_and_data(kind, strptr, strlen)

Alternative reproducer:

import numba
import numpy as np


@numba.njit
def foo(s):
    a = np.zeros(1, dtype="<U1")
    a[0] = s
    print(a)
    x = a[0]
    print(x)
    return (x,)

got = foo('一')
expect = foo.py_func('一')
print(repr(expect))
print(repr(got))

print(expect[0].tobytes())
print(got[0].tobytes())

m0g1cian mentioned this issue Apr 24, 2024

KeyError in BetterFSM::FSMInfo when input FSM alphabet contains UTF-8 characters that ends with \xb8\x80 outlines-dev/outlines#833

Closed

esc added the bug - incorrect behavior Bugs: incorrect behavior label Apr 26, 2024

M0gician mentioned this issue May 19, 2024

Fix null byte \x00 issue by switching to numba.types.unicode_type outlines-dev/outlines#904

Closed

lapp0 mentioned this issue Jun 1, 2024

Fix null byte \x00 issue in byte level fsm resulting in KeyError in BetterFSM::FSMInfo outlines-dev/outlines#930

Merged
