Typed dict throws KeyError when keys contain any UTF-8 character ends with \xb8\x80
#9542
Comments
Oh, I've done something similar on my end. The key here will be converted into an empty string.
BTW, here is the byte string for the typed dict before it gets unpickled.
Another thing: if you change the typed dict's typing to [numba.types.unicode_type, numba.int64], everything works and the KeyError no longer occurs.
Yes, as a user of
Haven't heard from the outlines team yet, but I did provide a solution to alleviate the issue for now. I do have a question, though: what if I want to pass the keys as a numpy array? Which numpy dtype is compatible with numba's unicode_type?
OK, because I suspect that the type
So, to narrow this down, it is possible that this isn't a
I think I understand now what is being attempted. You want the keys of the dictionary to also be values in a
It turns out that numpy dtype "U" of any length is recognized as "unicode_type" in numba. I'll make a PR on the outlines side to address this issue.
>>> import numba
>>> import numpy as np
>>> a = np.array(["一", "二", "三"])
>>> a
array(['一', '二', '三'], dtype='<U1')
>>> aa = numba.typed.typeddict.Dict.empty(numba.types.unicode_type, numba.int64)
>>> for i, c in enumerate(a):
...     aa[c] = i
...
>>> print(aa)
{一: 0, 二: 1, 三: 2}
@M0gician I actually tried the following, and got stuck:
Somehow this code works, but the original issue of the specific unicode character being converted into an empty string occurs again:
import numba
import numpy as np
@numba.njit
def function():
    s = np.empty(3, dtype="<U1")
    s[0] = "一"
    s[1] = "二"
    s[2] = "三"
    print(s)
    a = numba.typed.List(s)
    return a
print(function())
Output:
['一' '二' '三']
[, 二, 三]
@M0gician this was discussed in the developer meeting today and something fishy is going on here. This may be a Numba bug after all.
I agree. There's some inconsistency between how numba handles python objects and numpy objects. When string literals or python objects like List[str] are passed to a numba typed list or dict, they are treated as unicode_type. However, when they are wrapped in a numpy array, they are mostly treated as unichar, which causes many casting problems. I did find a way to solve the problem on the outlines side by choosing whether to pass python objects or numpy ones. Numpy has only one unicode type while numba has two, and it is unclear to me when and why the numpy "U" type is cast to one numba unicode type instead of the other. I think it would be better to make them consistent on the numba side.
Good that you have a workaround for now, and yes, this sounds like an interesting brain teaser.
Found the problem. First, the '一' character is:
>>> np.array(['一']).tobytes()
b'\x00N\x00\x00'
>>> list(map(hex, np.array(['一']).tobytes()))
['0x0', '0x4e', '0x0', '0x0']
The boxer for unicodecharseq has an invalid skip on a null byte, causing the copy to end prematurely (lines 237 to 238 of numba/core/boxing.py in e467ae6).
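The effect of that null-byte check can be modeled in plain Python. This is only a sketch of the behaviour described above, not Numba's code: `boxed_length` and `box_model` are hypothetical names, the buffer is assumed to be UTF-32 little-endian as in the `tobytes()` output, and the `whole_unit` switch is an illustrative alternative check rather than the actual fix.

```python
def boxed_length(buf: bytes, char_size: int = 4, whole_unit: bool = False) -> int:
    """Length computed by a loop that remembers the last 'non-null' character."""
    count = 0
    for idx in range(len(buf) // char_size):
        unit = buf[idx * char_size:(idx + 1) * char_size]
        # Assumed buggy behaviour: only the first byte of each unit is tested.
        probe = unit if whole_unit else unit[:1]
        if any(probe):
            count = idx + 1
    return count


def box_model(text: str, width: int, whole_unit: bool = False) -> str:
    """Pad text to a fixed-width UTF-32-LE buffer and decode the counted prefix."""
    buf = text.encode("utf-32-le").ljust(width * 4, b"\x00")
    n = boxed_length(buf, whole_unit=whole_unit)
    return buf[: n * 4].decode("utf-32-le")


if __name__ == "__main__":
    for ch in "一二三":
        # First column: first-byte check; second column: whole-unit check.
        print(repr(box_model(ch, 1)), repr(box_model(ch, 1, whole_unit=True)))
```

Under this model '一' (bytes 00 4e 00 00) comes back as an empty string with the first-byte check, while '二' and '三' survive, which matches the `[, 二, 三]` output reported earlier in the thread.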
The minimal patch is:
diff --git a/numba/core/boxing.py b/numba/core/boxing.py
index 39d2a6047..be6b8eb2b 100644
--- a/numba/core/boxing.py
+++ b/numba/core/boxing.py
@@ -234,9 +234,9 @@ def box_unicodecharseq(typ, val, c):
     with cgutils.loop_nest(c.builder, [fullsize], fullsize.type) as [idx]:
         # Get char at idx
         ch = c.builder.load(c.builder.gep(strptr, [c.builder.mul(idx, step)]))
-        # If the char is a non-null-byte, store the next index as count
-        with c.builder.if_then(cgutils.is_not_null(c.builder, ch)):
-            c.builder.store(c.builder.add(idx, one), count)
+        # # If the char is a non-null-byte, store the next index as count
+        # with c.builder.if_then(cgutils.is_not_null(c.builder, ch)):
+        c.builder.store(c.builder.add(idx, one), count)
     strlen = c.builder.load(count)
     return c.pyapi.string_from_kind_and_data(kind, strptr, strlen)
However, there's another problem---the boxer is returning a Line 241 in e467ae6
Alternative reproducer:
import numba
import numpy as np
@numba.njit
def foo(s):
    a = np.zeros(1, dtype="<U1")
    a[0] = s
    print(a)
    x = a[0]
    print(x)
    return (x,)
got = foo('一')
expect = foo.py_func('一')
print(repr(expect))
print(repr(got))
print(expect[0].tobytes())
print(got[0].tobytes())
Reporting a bug
Numba typed dict with a key type of UnicodeCharSeq() of any length doesn't handle UTF-8 characters that end with \xb8\x80 correctly. It seems that any of these characters is cast to an empty string when __getitem__ is called, resulting in a KeyError.
Minimum Reproduction Demo
㸀 (\xe3\xb8\x80) 渀 (\xe6\xb8\x80) 縀 (\xe7\xb8\x80) 帀 (\xe5\xb8\x80)
Error Message
numba 0.59.1
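The characters listed in the report share one property that ties them to the null-byte finding in the comments: a three-byte UTF-8 sequence ending in \xb8\x80 encodes a code point whose low 12 bits are 0xE00, so its lowest byte is 0x00. The sketch below checks this with the standard library only; `describe` is a hypothetical helper, and the claim that every code point with a null low byte is affected is an inference from the thread, not something the report states.

```python
def describe(ch: str) -> dict:
    """Show the UTF-8 bytes, code point, and UTF-32-LE bytes of one character."""
    cp = ord(ch)
    return {
        "utf8": ch.encode("utf-8"),
        "code_point": f"U+{cp:04X}",
        "utf32_le": ch.encode("utf-32-le"),
        # True when the first byte of the little-endian UCS-4 unit is zero.
        "low_byte_is_null": (cp & 0xFF) == 0,
    }


if __name__ == "__main__":
    for ch in "㸀渀縀帀一二":
        print(ch, describe(ch))
```

The four reported characters and '一' all have a null low byte, while '二' (U+4E8C) does not, which is consistent with '二' surviving in the examples above.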