How to properly use std::string on utf-8 in C++?


My platform is a Mac with C++11 (or above). I'm a C++ beginner working on a personal project that processes Chinese and English text. UTF-8 is the preferred encoding for this project.

I read some posts on Stack Overflow, and many of them suggest using std::string when dealing with UTF-8 and avoiding wchar_t, as there's no char8_t right now for UTF-8.

However, none of them talk about how to properly deal with functions like str[i], std::string::size(), std::string::find_first_of() or std::regex, as these functions usually return unexpected results when facing UTF-8.

Should I go ahead with std::string or switch to std::wstring? If I should stay with std::string, what's the best practice for one to handle the above problems?

Thanks in advance.


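Neither std::string nor std::wstring mandates an encoding; each is just a sequence of char or wchar_t code units. On macOS specifically, the convention is that std::string holds UTF-8 (8-bit code units) and std::wstring holds UTF-32 (32-bit code units); note that the size of wchar_t is platform-dependent (on Windows, for example, it is 16 bits).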

For both, size() returns the number of code units, not the number of logical characters, and str[i] indexes a code unit. (A logical character consists of one or more code points, and in UTF-8 each code point takes one to four code units.)
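To make the difference concrete, here is a minimal sketch that counts code points in a std::string, assuming the string already holds valid UTF-8. The name count_code_points is made up for this illustration, and the function does no validation. Note that it counts code points, not logical characters, so combining sequences would still be over-counted.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a string that is
// ASSUMED to contain valid UTF-8. No validation is performed.
// UTF-8 continuation bytes have the bit pattern 10xxxxxx; every other
// byte begins a new code point, so we count the non-continuation bytes.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For the bytes "\xE4\xBD\xA0\xE5\xA5\xBD, hi" (the text 你好, hi), size() returns 10 while count_code_points returns 6, because each of the two Chinese characters occupies three bytes.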

Although I'm not familiar with the Unicode representation of Chinese, in UTF-32 every code point is exactly one code unit, so the number of code units is often very close to the number of logical characters. However, this comes at the cost of more memory: up to 4x for ASCII-heavy text, and roughly 1.3x for common Chinese characters, which take 3 bytes each in UTF-8.
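If indexing by code point is needed only in a few places, one option is to keep UTF-8 in std::string and decode to std::u32string on demand. The following is a rough sketch under the assumption that the input is valid UTF-8; decode_utf8 is a hypothetical name, and the function does not reject overlong or otherwise malformed sequences the way a real library would.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes a string ASSUMED to be valid UTF-8 into
// UTF-32 code points. Malformed input is not rejected; a stray
// continuation byte or invalid lead byte becomes U+FFFD.
std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 1;
        char32_t cp = 0xFFFD;  // replacement character by default
        if (lead < 0x80) {
            cp = lead;                         // 0xxxxxxx: ASCII
        } else if ((lead & 0xE0) == 0xC0) {
            len = 2; cp = lead & 0x1F;         // 110xxxxx
        } else if ((lead & 0xF0) == 0xE0) {
            len = 3; cp = lead & 0x0F;         // 1110xxxx
        } else if ((lead & 0xF8) == 0xF0) {
            len = 4; cp = lead & 0x07;         // 11110xxx
        }
        // Append the low 6 bits of each continuation byte, never reading
        // past the end of the input.
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

With this, decode_utf8("\xE4\xBD\xA0\xE5\xA5\xBD") yields two elements, U+4F60 and U+597D, so operator[] and size() on the result work per code point.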

The most accurate solution would be to use a Unicode library, such as ICU, to calculate the Unicode properties that you are after.

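Finally, UTF strings in human languages that don't use combining characters usually do pretty well with regex. I'm not sure about Chinese, but English is one of them. Keep in mind that std::regex over std::string works on individual bytes, so a wildcard or bracket expression can match only part of a multi-byte UTF-8 character.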
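The same byte-level behavior explains the surprises with std::string::find_first_of() mentioned in the question. As a small sketch (the wrapper names find_sequence and find_any_byte are invented for this illustration), compare searching for a whole UTF-8 sequence with searching for any one of its bytes:

```cpp
#include <cstddef>
#include <string>

// Hypothetical wrapper: byte offset of the first whole occurrence of
// `needle` in `text`. Because UTF-8 is self-synchronizing, a match of a
// valid sequence inside valid UTF-8 text starts on a character boundary.
std::size_t find_sequence(const std::string& text, const std::string& needle) {
    return text.find(needle);
}

// Hypothetical wrapper: find_first_of treats `needle` as a SET of bytes,
// so it can report an offset in the middle of a multi-byte character.
std::size_t find_any_byte(const std::string& text, const std::string& needle) {
    return text.find_first_of(needle);
}
```

For text "\xE4\xBD\xA0\xE5\xA5\xBD" (你好) and needle "\xE5\xA5\xBD" (好), find_sequence returns 3, the byte offset where 好 begins, whereas find_any_byte returns 1, because the second byte of 你 happens to equal one of the bytes of 好. Both results are byte offsets, not character indices.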
