Convert Unicode surrogate pair to literal string


I am trying to read a high Unicode character from one string into another. For brevity, I will simplify my code as shown below:

    public static void UnicodeTest()
    {
        var highUnicodeChar = "𝐀"; // Not the standard A

        var result1 = highUnicodeChar; // this works
        var result2 = highUnicodeChar[0].ToString(); // returns \ud835
    }

When I assign highUnicodeChar to result1 directly, it retains its literal value of 𝐀. When I try to access it by index, it returns \ud835. As I understand it, this is part of a surrogate pair of UTF-16 characters used to represent a UTF-32 character. I am pretty sure this problem has to do with trying to implicitly convert a char to a string.

In the end, I want result2 to yield the same value as result1. How can I do this?

 


In Unicode, you have code points. Each one fits in 21 bits (the range is U+0000 through U+10FFFF).

In Unicode encodings, you have code units. These are the natural unit of the encoding: 8-bit for UTF-8, 16-bit for UTF-16, and so on. One or more code units encode a single code point.

In UTF-16, two code units that form a single code point are called a surrogate pair.

This gets a little tricky in .NET, as a .NET Char represents a single UTF-16 code unit, and a .NET String is a collection of code units.

So your code point 𝐀 (U+1D400) is encoded as a surrogate pair, meaning your string has two code units in it:

    var highUnicodeChar = "𝐀";
    char a = highUnicodeChar[0]; // code unit 0xD835 (high surrogate)
    char b = highUnicodeChar[1]; // code unit 0xDC00 (low surrogate)

This means that when you index into the string like that, you are only getting half of the surrogate pair.
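To see the relationship between the two code units and the single code point, here is a minimal sketch using char.ConvertToUtf32 and char.ConvertFromUtf32. The class and method names (SurrogateDemo, CodePointAt, FromCodePoint) are made up for illustration and are not part of the question's code.

```csharp
using System;

public static class SurrogateDemo
{
    // Combines the code unit(s) starting at idx into one code point.
    // Throws ArgumentException if idx lands on an unpaired surrogate.
    public static int CodePointAt(string s, int idx) =>
        char.ConvertToUtf32(s, idx);

    // Produces the UTF-16 string (one or two code units) for a code point.
    public static string FromCodePoint(int codePoint) =>
        char.ConvertFromUtf32(codePoint);
}
```

Calling SurrogateDemo.CodePointAt("𝐀", 0) should give 0x1D400, and SurrogateDemo.FromCodePoint(0x1D400) should give back a two-char string holding both halves of the pair.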

You can use IsSurrogatePair to test for a surrogate pair. For instance:

    string GetFullCodePointAtIndex(string s, int idx) =>
        s.Substring(idx, char.IsSurrogatePair(s, idx) ? 2 : 1);
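The same test can be used to walk a whole string one code point at a time. The following is only a sketch; CodePointWalker and EnumerateCodePoints are hypothetical names, and it assumes you want each code point returned as its own string.

```csharp
using System;
using System.Collections.Generic;

public static class CodePointWalker
{
    // Yields each code point of s as a string, treating a surrogate
    // pair as one unit instead of two separate chars.
    public static IEnumerable<string> EnumerateCodePoints(string s)
    {
        for (int i = 0; i < s.Length; )
        {
            // IsSurrogatePair returns false at the last index, so this
            // never reads past the end of the string.
            int len = char.IsSurrogatePair(s, i) ? 2 : 1;
            yield return s.Substring(i, len);
            i += len;
        }
    }
}
```

Note that the index advances by the length of what was just read, so a loop like this never lands in the middle of a pair.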

It is important to note that the rabbit hole of variable-length encoding in Unicode doesn't end at the code point. A grapheme cluster is the thing most people would ultimately call a "character". A grapheme cluster is made from one or more code points: a base character, and zero or more combining characters. An example of a combining character is an umlaut (the combining diaeresis, U+0308).

To test for a combining character, you can use GetUnicodeCategory to check for an enclosing mark, non-spacing mark, or spacing mark (UnicodeCategory.EnclosingMark, UnicodeCategory.NonSpacingMark, and UnicodeCategory.SpacingCombiningMark).
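As a rough sketch of that check, the helper below tests a single char against those three categories, and a second helper counts text elements with StringInfo, which is the closest built-in notion to a grapheme cluster. CombiningDemo, IsCombiningMark, and CountTextElements are names invented here; how closely StringInfo follows the Unicode grapheme cluster rules depends on the .NET version, so treat the count as an approximation.

```csharp
using System.Globalization;

public static class CombiningDemo
{
    // True if c is an enclosing, non-spacing, or spacing combining mark.
    // Works on a single UTF-16 code unit, so combining marks outside
    // the Basic Multilingual Plane are not covered by this overload.
    public static bool IsCombiningMark(char c)
    {
        UnicodeCategory cat = CharUnicodeInfo.GetUnicodeCategory(c);
        return cat == UnicodeCategory.NonSpacingMark
            || cat == UnicodeCategory.SpacingCombiningMark
            || cat == UnicodeCategory.EnclosingMark;
    }

    // Counts text elements (base character plus its combining marks)
    // rather than chars.
    public static int CountTextElements(string s) =>
        new StringInfo(s).LengthInTextElements;
}
```

For example, "u\u0308" (u followed by a combining diaeresis) has a Length of 2 but should count as a single text element.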
