I was reading What is the use of wchar_t in general programming? and found something confusing in the accepted answer:
> It's more common to use char with a variable-width encoding, e.g. UTF-8 or GB 18030.
And I found this in my textbook:
Isn't a character encoded with UTF-8 at most 4 bytes? And char on most platforms is 1 byte. Am I misunderstanding something?
After searching and reading, I now know that:
- Code points and code units are different things. A code point is unique to a character, while its code units depend on the encoding.
- u8'a' (a char, not a string here) is only allowed for the basic character set (ASCII and its control characters), and its value is the code unit value of 'a'; for ASCII characters, code units have the same value as code points. (This is what @codekaizer's answer says.)
- std::string::size() returns the number of code units, not code points.
- The character ə is encoded with 2 code units in UTF-8 (per https://unicode-table.com/en/0259/). But when I tried string s = "ə"; cout << s.size();, the output was 1. How come?
- So editors all deal with code units, right? And if I change my file's encoding from UTF-8 to UTF-32, would the size of ə become 4?
> Isn't a character encoded with UTF-8 at most 4 bytes?
As per [lex.ccon]/3, emphasis mine:
> A character literal that begins with u8, such as u8'w', is a character literal of type char, known as a UTF-8 character literal. The value of a UTF-8 character literal is equal to its ISO 10646 code point value, **provided that the code point value is representable with a single UTF-8 code unit** (that is, provided it is in the C0 Controls and Basic Latin Unicode block). If the value is not representable with a single UTF-8 code unit, the program is ill-formed. A UTF-8 character literal containing multiple c-chars is ill-formed.
A single UTF-8 code unit is 1 byte, which is why the literal's type char is sufficient.