Perl6 IO::Socket::Async truncates data


I'm rewriting my P5 socket server in P6 using IO::Socket::Async, but the received data is truncated by one character at the end, and that character is then received on the next connection. Someone from the Perl6 Facebook group (Jonathan Worthington) pointed out that this might be because strings and bytes are handled very differently in P6. Quoted:

In Perl 6, strings and bytes are handled very differently. Of note, strings work at grapheme level. When receiving Unicode data, it's not only possible that a multi-byte sequence will be split over packets, but also a multi-codepoint sequence. For example, one packet might have the letter "a" at the end, and the next one would be a combining acute accent. Therefore, it can't safely pass on the "a" until it's seen how the next packet starts.

My P6 is running on MoarVM

    use Data::Dump;
    use experimental :pack;

    my $socket = IO::Socket::Async.listen('', 7000);

    react {
        whenever $socket -> $conn {
            my $line = '';
            whenever $conn {

                say "Received --> "~$_;
                $conn.print: &translate($_) if $_.chars ge 100;
                $conn.close;

            }
        }
        CATCH {
            default {
                say .^name, ': ', .Str;
                say "handled in $?LINE";
            }
        }
    }

    sub translate($raw) {

        my $rawdata = $raw;
        $raw ~~ s/^\s+|\s+$//; # remove heading/trailing whitespace

        my $minus_checksum = substr($raw, 0, *-2);
        my $our_checksum = generateChecksum($minus_checksum);
        my $data_checksum = ($raw, *-2);

        # say $our_checksum;
        return $our_checksum;

    }

    sub generateChecksum($minus_checksum) {

        # turn string into Blob
        my Blob $blob = $minus_checksum.encode('utf-8');
        # unpack Blob into ascii list
        my @array = $blob.unpack("C*");
        # perform bitwise operation for each ascii in the list
        my $dec +^= $_ for $blob.unpack("C*");
        # only take 2 digits
        $dec = sprintf("%02d", $dec) if $dec ~~ /^\d$/;
        $dec = '0'.$dec if $dec ~~ /^[a..fA..F]$/;
        $dec = uc $dec;
        # convert it to hex
        my $hex = sprintf '%02x', $dec;
        return uc $hex;

    }


    Received --> $$0116AA861013034151986|10001000181123062657411200000000000010235444112500000000.600000000345.4335N10058.8249E00015
    Received --> 0
    Received --> $$0116AA861013037849727|1080100018112114435541120000000000000FBA00D5122500000000.600000000623.9080N10007.8627E00075
    Received --> D
    Received --> $$0108AA863835028447675|18804000181121183810421100002A300000100900000000.700000000314.8717N10125.6499E00022
    Received --> 7
    Received --> $$0108AA863835028447675|18804000181121183810421100002A300000100900000000.700000000314.8717N10125.6499E00022
    Received --> 7
    Received --> $$0108AA863835028447675|18804000181121183810421100002A300000100900000000.700000000314.8717N10125.6499E00022
    Received --> 7
    Received --> $$0108AA863835028447675|18804000181121183810421100002A300000100900000000.700000000314.8717N10125.6499E00022
    Received --> 7


First of all, TCP connections are streams, so there are no promises that the "messages" that are sent will be received as equivalent "messages" on the receiving end. Things that are sent can be split up or merged as part of normal TCP behavior, even before Perl 6 behavior is considered. Anything that wants a "messages" abstraction needs to build it on top of the TCP stream (for example, by sending data as lines, or by sending a size in bytes, followed by the data).
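One way to build such a "messages" abstraction is to buffer incoming bytes and pass on only complete, delimiter-terminated messages. The sketch below assumes newline-terminated messages, which is an assumption made for illustration and not something the protocol in the question is known to use. LineFramer and feed are hypothetical names, and Latin-1 is chosen only so that every byte decodes to a character.

```perl6
# Hypothetical helper: reassembles newline-terminated messages from
# arbitrarily split chunks of a byte stream.
class LineFramer {
    # Bytes received so far that do not yet form a complete message
    has $!pending = Buf.new;

    # Add one chunk from the stream; return the messages it completed
    method feed(Blob $chunk) {
        $!pending ~= $chunk;
        my @messages;
        loop {
            # Index of the first newline byte (0x0A), undefined if none
            my $nl = $!pending.list.first(* == 0x0A, :k);
            last without $nl;
            # Everything before the newline is one message
            @messages.push: $!pending.subbuf(0, $nl).decode('latin-1');
            # Keep whatever follows the newline for the next round
            $!pending = $!pending.subbuf($nl + 1);
        }
        @messages
    }
}

my $framer = LineFramer.new;
# A chunk without a terminator completes nothing yet
say $framer.feed("first mess".encode('latin-1')).elems;
# The next chunk completes two messages and leaves "thi" buffered
say $framer.feed("age\nsecond\nthi".encode('latin-1')).join(' | ');
```

In a server, feed would be called with each chunk from a binary supply of the connection, so a message split across packets is only processed once its terminator has arrived.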

In Perl 6, the data arriving over the socket is exposed as a Supply. A whenever $conn { } is short for whenever $conn.Supply { } (the whenever will coerce whatever it is given into a Supply). The default Supply is a character one, decoded as UTF-8 into a stream of Perl 6 Str. As noted in the answer you already received, strings in Perl 6 work at grapheme level, so it will keep back a character in case the next thing that arrives over the network is a combining character. This is the "truncation" that you are experiencing. (There are some things which can never be combined. For example, \n can never have a combining character placed on it. This means that line-oriented protocols won't encounter this kind of behavior, and can be implemented as simply whenever $conn.Supply.lines { }.)
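The grapheme behavior can be observed without a socket. The following is a small sketch, not the socket decoder itself: it shows that two codepoints which could arrive in separate packets count as one character once joined, and that .lines on a Supply reassembles lines split across emitted chunks. A Supplier stands in for the connection here, which is an assumption for illustration, and the variable names are hypothetical.

```perl6
# Two codepoints that could arrive in separate packets form one grapheme
my $first-packet  = "a";
my $second-packet = "\x[301]";                 # U+0301 COMBINING ACUTE ACCENT
say $first-packet.chars;                       # 1
say ($first-packet ~ $second-packet).chars;    # 1

# \r\n is likewise a single grapheme
say "\r\n".chars;                              # 1

# A Supplier stands in for the connection: chunks are emitted with a line
# split across them, and .lines only passes on complete lines.
my @lines;
my $supplier = Supplier.new;
$supplier.Supply.lines.tap(-> $line { @lines.push($line) });
$supplier.emit("first li");
$supplier.emit("ne\nsecond line\n");
$supplier.done;
say @lines.join(' | ');
```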

There are a couple of options available:

  • Do whenever $conn.Supply(:bin) { }, which will deliver binary Blob objects, which will correspond to what the OS passed to the VM. That can then be .decode'd as wanted. This is probably your best bet.
  • Specify an encoding that does not support combining characters, for example whenever $conn.Supply(:enc('latin-1')) { }. (However, note that since \r\n is a single grapheme, a message ending in \r would have that \r held back in case the next packet starts with \n.)

In both cases, it's still possible for messages to be split up during transmission, but these will (entirely and mostly, respectively) avoid the keep-one-back requirement that grapheme normalization entails.
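As a rough sketch of the first option, the server below reads the connection as a binary supply and decodes each chunk itself. The names decode-chunk and serve are hypothetical, the host and port are placeholders, and decoding as Latin-1 assumes the protocol is single-byte text, which the sample output suggests but the question does not state. It does not reassemble messages that TCP has split.

```perl6
# Decode raw bytes as Latin-1: every byte maps to exactly one codepoint,
# so a chunk is decoded in full and its last character is not held back.
sub decode-chunk(Blob $chunk --> Str) {
    $chunk.decode('latin-1')
}

# Hypothetical server loop: print every chunk as it arrives.
sub serve(Str $host, Int $port) {
    react {
        whenever IO::Socket::Async.listen($host, $port) -> $conn {
            # :bin delivers Blob objects instead of decoded Str
            whenever $conn.Supply(:bin) -> $chunk {
                say "Received --> " ~ decode-chunk($chunk);
            }
        }
    }
}

# serve('127.0.0.1', 7000);   # uncomment to run; placeholder host and port
```

The second option would only change the inner whenever to use $conn.Supply(:enc('latin-1')), at which point the chunks arrive as Str and no explicit decode is needed.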

