How can I substitute in strings in Perl 6 by codepoint rather than by grapheme?


I need to remove diacritical marks from a string using Perl 6. I tried doing this:

my $hum = 'חוּם';
$hum.subst(/<-[\c[HEBREW LETTER ALEF] .. \c[HEBREW LETTER TAV]]>/, '', :g);

I am trying to remove all the characters that are not in the range between HEBREW LETTER ALEF (א) and HEBREW LETTER TAV (ת). I expected this code to return "חום", but it returns "חם".

I guess that what happens is that by default Perl 6 works by graphemes, considers וּ to be one grapheme, and removes all of it. It's often sensible to work by graphemes, but in my case I need it to work by codepoints.

I tried to find an adverb that would make it work by codepoint, but couldn't find it. Perhaps there is also a way in Perl 6 to use Unicode properties to exclude diacritics, or to include only letters, but I couldn't find that either.



My regex-fu is weak, so I'd go with a less magical solution.

First, you can remove all marks via samemark:

'חוּם'.samemark('a')

Second, you can decompose the graphemes via .NFD and operate on individual codepoints - e.g. only keeping values with the property Grapheme_Base - and then recompose the string:

Uni.new('חוּם'.NFD.grep(*.uniprop('Grapheme_Base'))).Str
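
The same decomposition idea can be applied to the codepoint range from the question. The following is only a sketch: the variable names are made up for illustration, and the hex values assume that ALEF is U+05D0 and TAV is U+05EA.

```raku
# Hypothetical variable names, used only for illustration.
my $hum = 'חוּם';

# .NFD yields the decomposed codepoints as integers, so the
# filter runs per codepoint rather than per grapheme.
# Keep only HEBREW LETTER ALEF (U+05D0) through HEBREW LETTER TAV (U+05EA).
my @codes = $hum.NFD.grep({ 0x05D0 <= $_ <= 0x05EA });

# Recompose the surviving codepoints into a string.
my $letters = Uni.new(@codes).Str;

say $letters;
```

Because the dagesh is a separate codepoint after decomposition, it should be dropped on its own while the vav it was attached to is kept.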

In case of mixed strings, stripping marks from Hebrew characters only could look like this:

$str.subst(:g, /<:Script<Hebrew>>+/, *.Str.samemark('a')); 

