The C language is a low-level language, close to the hardware. It has a built-in character string type (wchar_t*), but only few libraries support this type. It is usually used as the first "layer" between the kernel (system calls, e.g. opening a file) and applications, higher-level libraries and other programming languages. This first layer uses the same type as the kernel: except on Windows, all kernels use byte strings.
There are higher-level libraries, like glib or Qt, offering a Unicode API, even if the underlying kernel uses byte strings. Such libraries use a codec to encode data to the kernel and to decode data from the kernel. The codec is usually the current locale encoding.
Because there is no Unicode standard library, most third-party libraries chose the simple solution: use byte strings. For example, the OpenSSL library, an open source cryptography toolkit, expects filenames as byte strings. On Windows, you have to encode Unicode filenames to the current ANSI code page, which is a small subset of the Unicode charset.
TODO:

- "Because there is no Unicode standard library": add historical/compatibility reasons
- toupper() and isprint() are locale dependent
- char* points to char, not char*
- Create a section for the NUL byte/character
- towupper() and iswprint() are locale dependent
- is wchar_t signed on Windows and Mac OS X? can wchar_t be signed?
POSIX.1-2001 has no function to compare character strings ignoring case. POSIX.1-2008, a more recent standard, adds wcscasecmp(): the GNU libc has it as an extension (if _GNU_SOURCE is defined). Windows has the _wcsnicmp() function.
Windows uses (UTF-16) wchar_t* strings for its Unicode API.
Formats of string arguments for the printf functions:

- "%s": literal byte string (char*)
- "%ls": literal character string (wchar_t*)
printf("%ls") is strict: it stops immediately if a character string argument cannot be encoded to the locale encoding. For example, the following code prints the truncated string "Latin capital letter L with stroke: [" if Ł (U+0141) cannot be encoded to the locale encoding:
printf("Latin capital letter L with stroke: [%ls]\n", L"\u0141");
wprintf("%s") and wprintf("%.<length>s") are strict: they stop immediately if a byte string argument cannot be decoded from the locale encoding. For example, the following code prints the truncated string "Latin capital letter L with stroke: [" if 0xC5 0x81 (U+0141 encoded to UTF-8) cannot be decoded from the locale encoding:
wprintf(L"Latin capital letter L with stroke: [%s]\n", "\xC5\x81");
wprintf(L"Latin capital letter L with stroke: [%.10s]\n", "\xC5\x81");
wprintf("%ls") replaces unencodable character string arguments with ? (U+003F). For example, the following code prints "Latin capital letter L with stroke: [?]" if Ł (U+0141) cannot be encoded to the locale encoding:
wprintf(L"Latin capital letter L with stroke: [%ls]\n", L"\u0141");
So to avoid truncated strings, try to only use wprintf() with character string arguments.

TODO: how are non-ASCII characters handled in the format string?
Note: there is also the "%S" format, which is a deprecated alias of the "%ls" format; don't use it.
TODO: the locale encoding should be initialized.
- std::wstring: character string using the wchar_t type, the Unicode version of std::string (byte string)
- std::wcin, std::wcout and std::wcerr: standard input, output and error output; the Unicode versions of std::cin, std::cout and std::cerr
- std::wostringstream: character stream buffer; the Unicode version of std::ostringstream
To initialize the locales, equivalent to setlocale(LC_ALL, ""), use:
#include <locale>
std::locale::global(std::locale(""));
If you also use C and C++ functions (e.g. printf and std::cout) to access the standard streams, you may have issues with non-ASCII characters. To avoid these issues, you can disable the automatic synchronization between the C (std*) and C++ (std::c*) streams using:
#include <iostream>
std::ios_base::sync_with_stdio(false);
Note: use typedef basic_ostringstream<wchar_t> wostringstream; if wostringstream is not available.
Python has supported Unicode since version 2.0, released in October 2000. Byte and Unicode strings store their length, so it is possible to embed nul bytes/characters.
Python can be compiled in two modes: narrow (UTF-16) and wide (UCS-4). The sys.maxunicode constant is 0xFFFF in a narrow build and 0x10FFFF in a wide build. Python is compiled in narrow mode on Windows, because wchar_t is also 16 bits on Windows, so Python Unicode strings can be used as wchar_t* strings without any (expensive) conversion.
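A quick way to check which mode an interpreter was built in is to inspect sys.maxunicode (note that since Python 3.3 and PEP 393, the narrow/wide distinction is gone and every build reports the wide value):

```python
import sys

# 0xFFFF in a narrow (UTF-16) build, 0x10FFFF in a wide (UCS-4) build.
# Since Python 3.3 (PEP 393), every build reports 0x10FFFF.
if sys.maxunicode == 0x10FFFF:
    print("wide build")
else:
    print("narrow build")
```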
str is the byte string type and unicode is the character string type. Literal byte strings are written b'abc' (syntax compatible with Python 3) or 'abc' (legacy syntax); \xHH can be used to write a byte by its hexadecimal value (e.g. b'\x80' for 128). Literal Unicode strings are written with the u prefix: u'abc'. Code points can be written in hexadecimal: \xHH (U+0000—U+00FF), \uHHHH (U+0000—U+FFFF) or \UHHHHHHHH (U+0000—U+10FFFF), e.g. u'euro sign:\u20AC'.
In Python 2, str + unicode gives unicode: the byte string is decoded from the default encoding (ASCII). This coercion was a bad design idea, and it has been the source of a lot of confusion. At the same time, it was not possible to switch completely to Unicode in 2000: computers were slower and there were fewer Python core developers. It took 8 years to switch completely to Unicode: Python 3 was released in December 2008.
A narrow build of Python 2 has partial support for non-BMP characters. The unichr() function raises an error for code points bigger than U+FFFF, whereas literal strings support non-BMP characters (e.g. u'\U0010FFFF'). Non-BMP characters are encoded as surrogate pairs. The disadvantage is that len(u'\U00010000') is 2, and u'\U0010FFFF'[0] is u'\uDBFF' (a lone surrogate character).
Note
DO NOT CHANGE THE DEFAULT ENCODING! Calling sys.setdefaultencoding() is a very bad idea because it impacts all libraries which suppose that the default encoding is ASCII.
bytes is the byte string type and str is the character string type. Literal byte strings are written with the b prefix: b'abc'. \xHH can be used to write a byte by its hexadecimal value, e.g. b'\x80' for 128. Literal Unicode strings are written 'abc'. Code points can be written directly in hexadecimal: \xHH (U+0000—U+00FF), \uHHHH (U+0000—U+FFFF) or \UHHHHHHHH (U+0000—U+10FFFF), e.g. 'euro sign:\u20AC'. Each item of a byte string is an integer in the range 0—255: b'abc'[0] gives 97, whereas 'abc'[0] gives 'a'.
Python 3 has full support for non-BMP characters, in narrow and wide builds. But as in Python 2, chr(0x10FFFF) creates a string of 2 characters (a UTF-16 surrogate pair) in a narrow build. chr() and ord() support non-BMP characters in both modes.
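A short check (run on Python 3.3 or later, where every build behaves like a wide build):

```python
# A non-BMP character: U+10FFFF, the highest Unicode code point.
ch = chr(0x10FFFF)
assert ord(ch) == 0x10FFFF
# On Python 3.3+, this is a single character; on an old narrow build,
# len(ch) would be 2 (a UTF-16 surrogate pair).
print(len(ch))
```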
Python 3 uses the U+DC80—U+DCFF character range to store undecodable bytes with the surrogateescape error handler, described in PEP 383 (Non-decodable Bytes in System Character Interfaces). It is used for filenames and environment variables on UNIX and BSD systems. Example: b'abc\xff'.decode('ASCII', 'surrogateescape') gives 'abc\uDCFF'.
str + unicode gives unicode in Python 2 (the byte string is decoded from the default encoding, ASCII) and raises a TypeError in Python 3. In Python 3, comparing bytes and str gives False, emits a BytesWarning warning or raises a BytesWarning exception, depending on the bytes warning flag (-b or -bb option passed to the Python program). In Python 2, the byte string is decoded from the default encoding (ASCII) to Unicode before being compared.
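In Python 3 the comparison silently returns False even for pure-ASCII content, which is a common source of bugs when porting Python 2 code:

```python
# bytes and str never compare equal in Python 3, even for ASCII data.
# With the -b option, this comparison also emits a BytesWarning.
assert (b'abc' == 'abc') is False
assert (b'abc' == b'abc') is True
```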
The UTF-8 decoder of Python 2 accepts surrogate characters, even though they are invalid, to keep backward compatibility with Python 2.0. In Python 3, the UTF-8 decoder is strict: it rejects surrogate characters.
It is possible to make Python 2 behave more like Python 3 with from __future__ import unicode_literals.
The codecs and encodings modules provide text encodings. They support a lot of encodings. Some examples: ASCII, ISO-8859-1, UTF-8, UTF-16-LE, ShiftJIS, Big5, cp037, cp950, EUC_JP, etc.
UTF-8, UTF-16-LE, UTF-16-BE, UTF-32-LE and UTF-32-BE don't use a BOM, whereas UTF-8-SIG, UTF-16 and UTF-32 do. mbcs is only available on Windows: it is the ANSI code page.
Python also provides many error handlers used to specify how to handle undecodable byte sequences and unencodable characters:

- strict (default): raise a UnicodeDecodeError or a UnicodeEncodeError
- replace: replace undecodable bytes with � (U+FFFD) and unencodable characters with ? (U+003F)
- ignore: ignore undecodable bytes and unencodable characters
- backslashreplace (encode only): replace unencodable characters with \xHH
Python 3 has three more error handlers:

- surrogateescape: replace undecodable bytes (non-ASCII: 0x80—0xFF) with surrogate characters (in U+DC80—U+DCFF) on decoding; replace characters in the range U+DC80—U+DCFF with bytes in 0x80—0xFF on encoding. Read PEP 383 (Non-decodable Bytes in System Character Interfaces) for the details.
- surrogatepass, specific to the UTF-8 codec: allow encoding/decoding surrogate characters in UTF-8. It is required because the UTF-8 decoder of Python 3 rejects surrogate characters by default.
- backslashreplace (for decode): replace undecodable bytes with \xHH
Decoding examples in Python 3:

- b'abc\xff'.decode('ASCII') uses the strict error handler and raises a UnicodeDecodeError
- b'abc\xff'.decode('ASCII', 'ignore') gives 'abc'
- b'abc\xff'.decode('ASCII', 'replace') gives 'abc\uFFFD'
- b'abc\xff'.decode('ASCII', 'surrogateescape') gives 'abc\uDCFF'
Encoding examples in Python 3:

- '\u20ac'.encode('UTF-8') gives b'\xe2\x82\xac'
- 'abc\xff'.encode('ASCII') uses the strict error handler and raises a UnicodeEncodeError
- 'abc\xff'.encode('ASCII', 'backslashreplace') gives b'abc\\xff'
Byte string (str in Python 2, bytes in Python 3) methods:

- .decode(encoding, errors='strict'): decode from the specified encoding with an (optional) error handler
Character string (unicode in Python 2, str in Python 3) methods:

- .encode(encoding, errors='strict'): encode to the specified encoding with an (optional) error handler
- .isprintable(): False if the character category is other (Cc, Cf, Cn, Co, Cs) or separator (Zl, Zp, Zs), True otherwise. There is an exception: even though U+0020 is a separator, ' '.isprintable() gives True.
- .upper(): convert to uppercase
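A few of these methods in action:

```python
# encode() with the default strict handler, plus isprintable() and upper().
assert 'abc'.encode('ascii') == b'abc'
assert ' '.isprintable() is True    # U+0020 is the printable exception
assert '\n'.isprintable() is False  # Cc (control) category
assert 'abc'.upper() == 'ABC'
```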
Python decodes byte filenames and encodes Unicode filenames using the filesystem encoding, sys.getfilesystemencoding():

- mbcs (the ANSI code page) on Windows
- UTF-8 on Mac OS X
- the locale encoding otherwise
Python uses the strict error handler in Python 2, and surrogateescape (PEP 383) in Python 3. In Python 2, if os.listdir(u'.') cannot decode a filename, it keeps the bytes filename unchanged. Thanks to surrogateescape, decoding a filename never fails in Python 3. But encoding a filename can fail in Python 2 and 3, depending on the filesystem encoding. For example, on Linux with the C locale, the Unicode filename "h\xe9.py" cannot be encoded because the filesystem encoding is ASCII.
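A sketch of how surrogateescape keeps filenames round-trippable in Python 3 (b'h\xff.py' is a hypothetical filename that is not valid UTF-8):

```python
# A filename containing the byte 0xFF, which is invalid in UTF-8.
raw = b'h\xff.py'
# Decoding with surrogateescape never fails...
name = raw.decode('utf-8', 'surrogateescape')
assert name == 'h\udcff.py'
# ...and encoding with the same handler restores the original bytes.
assert name.encode('utf-8', 'surrogateescape') == raw
```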
In Python 2, use os.getcwdu() to get the current directory as Unicode.
Encodings used on Windows:

- locale.getpreferredencoding(): ANSI code page
- the 'mbcs' codec: ANSI code page
- sys.stdout.encoding, sys.stderr.encoding: encoding of the Windows console
- sys.argv, os.environ, subprocess.Popen(args): native Unicode support (no encoding)
codecs module:

- BOM_UTF8, BOM_UTF16_BE, BOM_UTF32_LE, ...: byte order mark (BOM) constants
- lookup(name): get a Python codec. lookup(name).name gets the Python normalized name of a codec, e.g. codecs.lookup('ANSI_X3.4-1968').name gives 'ascii'.
- open(filename, mode='rb', encoding=None, errors='strict', ...): legacy API to open a binary or text file. To open a file in Unicode mode, use io.open() instead.
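For example, lookup() can be used to normalize codec aliases to their canonical names:

```python
import codecs

# Aliases are normalized to a canonical codec name.
assert codecs.lookup('ANSI_X3.4-1968').name == 'ascii'
assert codecs.lookup('UTF8').name == 'utf-8'
# BOM constants are plain byte strings.
assert codecs.BOM_UTF8 == b'\xef\xbb\xbf'
```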
io module:

- open(name, mode='r', buffering=-1, encoding=None, errors=None, ...): open a binary or text file in read and/or write mode. For a text file, encoding and errors can be used to specify the encoding and the error handler. By default, it opens text files with the locale encoding in strict mode.
- TextIOWrapper(): wrapper to read and/or write text files; it encodes to and decodes from the specified encoding (and error handler) and normalizes newlines (\r\n and \r are replaced by \n). It requires a buffered file. Don't use it directly to open a text file: use open() instead.
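A minimal round trip with io.open(), passing the encoding explicitly instead of relying on the locale encoding (the file name is arbitrary):

```python
import io
import os
import tempfile

# Write then read a text file with an explicit encoding and error handler.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'example.txt')
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write('euro sign: \u20ac\n')
    with io.open(path, 'r', encoding='utf-8', errors='strict') as f:
        content = f.read()
assert content == 'euro sign: \u20ac\n'
```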
locale module (locales):

- LC_ALL, LC_CTYPE, ...: locale categories
- getlocale(category): get the value of a locale category as the tuple (language code, encoding name)
- getpreferredencoding(): get the locale encoding
- setlocale(category, value): set the value of a locale category
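For example, to initialize the locale from the environment (the equivalent of setlocale(LC_ALL, "") in C) and read the locale encoding:

```python
import locale

# Initialize the locale from the environment variables (LANG, LC_ALL, ...).
locale.setlocale(locale.LC_ALL, '')
encoding = locale.getpreferredencoding()
print(encoding)  # e.g. 'UTF-8'; the result depends on the environment
```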
sys module:

- getdefaultencoding(): get the default encoding, e.g. used by 'abc'.encode(). In Python 3, the default encoding is fixed to 'utf-8'; in Python 2, it is 'ascii' by default.
- getfilesystemencoding(): get the filesystem encoding used to decode and encode filenames
- maxunicode: biggest Unicode code point storable in a single Python Unicode character, 0xFFFF in a narrow build or 0x10FFFF in a wide build
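Quick checks of these values on Python 3 (where the default encoding is fixed and, since 3.3, every build is wide):

```python
import sys

assert sys.getdefaultencoding() == 'utf-8'   # fixed in Python 3
assert sys.maxunicode == 0x10FFFF            # always true since Python 3.3
print(sys.getfilesystemencoding())           # e.g. 'utf-8'; platform dependent
```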
unicodedata module:

- category(char): get the category of a character
- name(char): get the name of a character
- normalize(form, string): normalize a string to the NFC, NFD, NFKC or NFKD form
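For example:

```python
import unicodedata

assert unicodedata.category('A') == 'Lu'          # uppercase letter
assert unicodedata.name('\u20ac') == 'EURO SIGN'
# NFC composes 'e' + combining acute accent into the single character é.
assert unicodedata.normalize('NFC', 'e\u0301') == '\xe9'
```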
TODO: cleanup Python 2/3 here (open)
In PHP 5, a literal string (e.g. "abc") is a byte string. PHP has no character string type, only a "string" type, which is a byte string.
PHP has "multibyte" functions to manipulate byte strings using their encoding. These functions have an optional encoding argument. If the encoding is not specified, PHP uses the default encoding (called "internal encoding"). Some multibyte functions:
- mb_internal_encoding(): get or set the internal encoding
- mb_substitute_character(): change how to handle unencodable characters:

  - "none": ignore unencodable characters
  - "long": escape as a hexadecimal value, e.g. "U+E9" or "JIS+7E7E"
  - "entity": escape as HTML entities, e.g. "&#xE9;"

- mb_convert_encoding(): decode from an encoding and encode to another encoding
- mb_ereg(): search a pattern using a regular expression
- mb_strlen(): get the length in characters
- mb_detect_encoding(): guess the encoding of a byte string
Perl compatible regular expressions (PCRE) have a u flag ("PCRE8") to process byte strings as UTF-8 encoded strings.

TODO: u flag: instead of which encoding?
PHP also includes a binding for the iconv library.

- iconv(): decode a byte string from an encoding and encode it to another encoding; you can append the //IGNORE or //TRANSLIT suffix to the target encoding to choose the error handler
- iconv_mime_decode(): decode a MIME header field
TODO: document the utf8_encode() and utf8_decode() functions?
PHP 6 was a project to improve the Unicode support of PHP. This project died at the beginning of 2010. Read "The Death of PHP 6/The Future of PHP 6" (May 25, 2010, by Larry Ullman) and "Future of PHP6" (March 2010, by Johannes Schlüter) for more information.
TODO: PHP 6 creation date?
Perl
Write a character using its code point written in hexadecimal:
chr(0x1F4A9)
"\x{2639}"
"\N{U+A0}"
Using use charnames qw( :full );, you can refer to a Unicode character in a string by name with the "\N{name}" syntax. Example:
say "\N{long s} \N{ae} \N{Omega} \N{omega} \N{UPWARDS ARROW}"
The following pragma declares that filehandles opened within this lexical scope, but not elsewhere, are in UTF-8, until and unless you say otherwise. The :std adds in STDIN, STDOUT, and STDERR. This critical step implicitly decodes incoming data and encodes outgoing data as UTF-8:
use open qw( :encoding(UTF-8) :std );
If the PERL_UNICODE environment variable is set to AS, the following data will use UTF-8:

- @ARGV
- STDIN, STDOUT, STDERR
If you have a DATA handle, you must explicitly set its encoding. If you want this to be UTF-8, then say:
binmode(DATA, ":encoding(UTF-8)");
Misc:
use feature qw< unicode_strings >;
use Unicode::Normalize qw< NFD NFC >;
use Encode qw< encode decode >;
@ARGV = map { decode("UTF-8", $_) } @ARGV;
open(OUTPUT, "> :raw :encoding(UTF-16LE) :crlf", $filename);
Misc:
- Encode
- Unicode::Normalize
- Unicode::Collate
- Unicode::Collate::Locale
- Unicode::UCD
- DBM_Filter::utf8
History:

- Perl 5.6 (2000): initial Unicode support, supports character strings
- Perl 5.8 (2002): regex supports Unicode
- use the "use utf8;" pragma to specify that your Perl script is encoded in UTF-8
Read the perluniintro, perlunicode and perlunifaq manuals.
See Tom Christiansen’s Materials for OSCON 2011 for more information.
char is a character type able to store only Unicode BMP characters (U+0000—U+FFFF), whereas Character is a wrapper of char with static helper functions. Character methods:

- .getType(ch): get the category of a character
- .isWhitespace(ch): test if a character is a whitespace according to Java
- .toUpperCase(ch): convert to uppercase
- .codePointAt(CharSequence, int): return the code point at the given index of the CharSequence
TODO: explain isWhitespace()
String is a character string implemented using a char array and UTF-16. String methods:

- String(bytes, encoding): decode a byte string from the specified encoding. The decoder is strict: it throws a CharsetDecoder exception if a byte sequence cannot be decoded.
- .getBytes(encoding): encode to the specified encoding; it throws a CharsetEncoder exception if a character cannot be encoded.
- .length(): get the length in UTF-16 units
As with Python compiled in narrow mode, non-BMP characters are stored as UTF-16 surrogate pairs, and the length of a string is the number of UTF-16 units, not the number of Unicode characters.
Java, like the Tcl language, uses a variant of UTF-8 which encodes the nul character (U+0000) as the overlong byte sequence 0xC0 0x80, instead of 0x00. This makes it possible to use C functions like strlen() on byte strings with embedded nul characters.
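The nul-free trick can be sketched in a few lines of Python (a simplified model: real Java "modified UTF-8" also encodes non-BMP characters as surrogate pairs, which this sketch ignores):

```python
def to_modified_utf8(text):
    """Encode text so that U+0000 becomes the overlong sequence
    0xC0 0x80, leaving no 0x00 byte in the result (as Java and Tcl do)."""
    return text.encode('utf-8').replace(b'\x00', b'\xc0\x80')

data = to_modified_utf8('a\x00b')
assert data == b'a\xc0\x80b'
assert b'\x00' not in data  # safe for C functions like strlen()
```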
The Go and D languages use UTF-8 as the internal encoding to store Unicode strings.