My platform is a Mac with C++11 (or above). I'm a C++ beginner working on a personal project that processes Chinese and English text.
UTF-8 is the preferred encoding for this project.
I read some posts on Stack Overflow, and many of them suggest using std::string when dealing with UTF-8 and avoiding wchar_t, as there's no char8_t right now for UTF-8.
However, none of them talk about how to properly deal with functions like std::regex, as these functions usually return unexpected results when facing UTF-8.
Should I go ahead with std::string or switch to std::wstring? If I should stay with std::string, what's the best practice for handling the above problems?
Thanks in advance.
Both std::string and std::wstring use a UTF encoding. On macOS specifically, std::string is UTF-8 (8-bit code units) and std::wstring is UTF-32 (32-bit code units); note that the size of wchar_t is platform-dependent.
For both, size() tracks the number of code units rather than the number of logical characters. (A logical character is made up of one or more code points.)
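To make the difference between code units and code points concrete, here is a minimal sketch. The helper name count_code_points is my own, not a standard function, and it assumes the input is already valid UTF-8. It counts code points by skipping continuation bytes (those of the form 10xxxxxx). It does not count logical characters; combining sequences still need a Unicode library.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a std::string that is assumed
// to hold valid UTF-8. Every byte that is NOT a continuation byte
// (bit pattern 10xxxxxx) starts a new code point.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// "你好" (U+4F60 U+597D) spelled as explicit UTF-8 bytes, so the result does
// not depend on how the source file itself is encoded.
std::string sample_chinese() {
    return "\xE4\xBD\xA0\xE5\xA5\xBD";
}
```

With this sketch, sample_chinese().size() reports 6 code units, while count_code_points(sample_chinese()) reports 2 code points. For plain ASCII text the two numbers are equal.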
Although I'm not familiar with the Unicode representation of Chinese, it's very possible that when you use UTF-32, the number of code units is often very close to the number of logical characters. Obviously, however, this comes at the cost of using up to 4x more memory.
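The memory trade-off can be estimated with a rough sketch like the one below. The function names are hypothetical, and the sketch uses std::u32string (always 32-bit code units) instead of std::wstring so that it does not depend on the platform's wchar_t size. It assumes the input contains only valid code points.

```cpp
#include <cstddef>
#include <string>

// Number of UTF-8 code units (bytes) needed to encode one code point.
std::size_t utf8_units(char32_t cp) {
    if (cp < 0x80) return 1;     // ASCII
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;  // rest of the Basic Multilingual Plane
    return 4;                    // supplementary planes
}

// Bytes the text would occupy when stored as UTF-8.
std::size_t utf8_bytes(const std::u32string& text) {
    std::size_t total = 0;
    for (char32_t cp : text) {
        total += utf8_units(cp);
    }
    return total;
}

// Bytes the same text occupies when stored as UTF-32.
std::size_t utf32_bytes(const std::u32string& text) {
    return text.size() * sizeof(char32_t);
}
```

For U"hello" this gives 5 bytes in UTF-8 against 20 in UTF-32, which is the 4x worst case. For U"\u4F60\u597D" it gives 6 bytes against 8, so for text made mostly of such characters the overhead is much smaller.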
The most accurate solution would be to use a Unicode library, such as ICU, to calculate the Unicode properties that you are after.
Finally, UTF strings in human languages that don't use combining characters usually do pretty well with regex. I'm not sure about Chinese, but English is one of them.