Remove all spaces between Chinese words with regex


I would like to remove all spaces among Chinese text only.

My text: "請 把 這 裡 的 10 多 個 字 合 併. Can you help me?"

Ideal output: "請把這裡的 10 多個字合併. Can you help me?"

var str = '請 把 這 裡 的 10 多 個 字 合 併. Can you help me?';
str = str.replace("/\ /", "");

I have studied a similar question for Python but it seems not to work in my situation so I brought my question here for some help.

 


Getting to the Chinese char matching pattern

Using the Unicode Tools, the \p{Han} Unicode property class, which matches any Chinese char, can be translated into

[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9\U00020000-\U0002A6D6\U0002A700-\U0002B734\U0002B740-\U0002B81D\U0002B820-\U0002CEA1\U0002CEB0-\U0002EBE0\U0002F800-\U0002FA1D]

In ES6, a single Chinese char can be matched with the u flag:

/[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9\u{20000}-\u{2A6D6}\u{2A700}-\u{2B734}\u{2B740}-\u{2B81D}\u{2B820}-\u{2CEA1}\u{2CEB0}-\u{2EBE0}\u{2F800}-\u{2FA1D}]/u

Transpiling it to ES5 with the ES2015 Unicode regular expression transpiler gives the following pattern, which matches any Chinese char in a JS RegExp without the u flag:

(?:[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9]|[\uD840-\uD868\uD86A-\uD86C\uD86F-\uD872\uD874-\uD879][\uDC00-\uDFFF]|\uD869[\uDC00-\uDED6\uDF00-\uDFFF]|\uD86D[\uDC00-\uDF34\uDF40-\uDFFF]|\uD86E[\uDC00-\uDC1D\uDC20-\uDFFF]|\uD873[\uDC00-\uDEA1\uDEB0-\uDFFF]|\uD87A[\uDC00-\uDFE0]|\uD87E[\uDC00-\uDE1D])
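The surrogate-pair alternatives in the ES5 pattern exist because JavaScript strings are sequences of UTF-16 code units, so a Han character outside the Basic Multilingual Plane takes up two units. The following is a minimal sketch of that effect; the sample character U+20000 and the variable names are illustrative choices, not part of the original answer, and the two regexes are deliberately cut-down fragments of the full pattern.

```javascript
// U+20000 (CJK Unified Ideographs Extension B) written as its UTF-16 surrogate pair.
var astral = "\uD840\uDC00";

// A BMP-only class without the u flag sees two separate code units.
var bmpOnly = /^[\u4E00-\u9FEF]$/;

// High surrogate followed by low surrogate, as in the transpiled pattern.
var pair = /^[\uD840-\uD868][\uDC00-\uDFFF]$/;

console.log(astral.length);        // 2
console.log(bmpOnly.test(astral)); // false
console.log(pair.test(astral));    // true
console.log(bmpOnly.test("請"));   // true
```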

So, you may use

s.replace(/([\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9]|[\uD840-\uD868\uD86A-\uD86C\uD86F-\uD872\uD874-\uD879][\uDC00-\uDFFF]|\uD869[\uDC00-\uDED6\uDF00-\uDFFF]|\uD86D[\uDC00-\uDF34\uDF40-\uDFFF]|\uD86E[\uDC00-\uDC1D\uDC20-\uDFFF]|\uD873[\uDC00-\uDEA1\uDEB0-\uDFFF]|\uD87A[\uDC00-\uDFE0]|\uD87E[\uDC00-\uDE1D])\s+(?=(?:[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9]|[\uD840-\uD868\uD86A-\uD86C\uD86F-\uD872\uD874-\uD879][\uDC00-\uDFFF]|\uD869[\uDC00-\uDED6\uDF00-\uDFFF]|\uD86D[\uDC00-\uDF34\uDF40-\uDFFF]|\uD86E[\uDC00-\uDC1D\uDC20-\uDFFF]|\uD873[\uDC00-\uDEA1\uDEB0-\uDFFF]|\uD87A[\uDC00-\uDFE0]|\uD87E[\uDC00-\uDE1D]))/g, '$1')

See the regex demo.

If your JS environment is ECMAScript 2018 compliant, you may use the shorter

s.replace(/(\p{Script=Hani})\s+(?=\p{Script=Hani})/gu, '$1')

Pattern details

  • (CHINESE_CHAR_PATTERN) - Capturing group 1 ($1 in the replacement pattern): any Chinese char
  • \s+ - one or more whitespace characters (any Unicode whitespace)
  • (?=CHINESE_CHAR_PATTERN) - a positive lookahead: there must be a Chinese char immediately to the right of the current location, but it is not consumed.
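The lookahead in the last item matters because a global replace resumes searching after the end of the previous match. A minimal sketch of the difference, assuming an ECMAScript 2018 engine; the sample string and the variable names are illustrative only:

```javascript
var sample = "請 把 這 裡";

// Lookahead: the following Han char is checked but not consumed,
// so it can serve as group 1 of the next match.
var withLookahead = sample.replace(/(\p{Script=Hani})\s+(?=\p{Script=Hani})/gu, "$1");

// Consuming variant: the second char becomes part of the match, so the
// search resumes after it and every other gap is left untouched.
var consuming = sample.replace(/(\p{Script=Hani})\s+(\p{Script=Hani})/gu, "$1$2");

console.log(withLookahead); // 請把這裡
console.log(consuming);     // 請把 這裡
```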

JS demo:

var s = "請 把 這 裡 的 10 多 個 字 合 併. Can you help me?";
var HanChr = "[\\u2E80-\\u2E99\\u2E9B-\\u2EF3\\u2F00-\\u2FD5\\u3005\\u3007\\u3021-\\u3029\\u3038-\\u303B\\u3400-\\u4DB5\\u4E00-\\u9FEF\\uF900-\\uFA6D\\uFA70-\\uFAD9]|[\\uD840-\\uD868\\uD86A-\\uD86C\\uD86F-\\uD872\\uD874-\\uD879][\\uDC00-\\uDFFF]|\\uD869[\\uDC00-\\uDED6\\uDF00-\\uDFFF]|\\uD86D[\\uDC00-\\uDF34\\uDF40-\\uDFFF]|\\uD86E[\\uDC00-\\uDC1D\\uDC20-\\uDFFF]|\\uD873[\\uDC00-\\uDEA1\\uDEB0-\\uDFFF]|\\uD87A[\\uDC00-\\uDFE0]|\\uD87E[\\uDC00-\\uDE1D]";
console.log(s.replace(new RegExp('(' + HanChr + ')\\s+(?=(?:' + HanChr + '))', 'g'), '$1'));

A test for the regex compliant with the ECMAScript 2018 standard:

var s = "請 把 這 裡 的 10 多 個 字 合 併. Can you help me?";
console.log(s.replace(/(\p{Script=Hani})\s+(?=\p{Script=Hani})/gu, '$1'));
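If the same script has to run in both older and ES2018-capable engines, one possible approach is to choose the pattern at runtime. This is only a sketch and not part of the original answer: the helper names `es5HanSpaceRegex`, `pickHanSpaceRegex` and `removeHanSpaces` are hypothetical, and it relies on `new RegExp` throwing a SyntaxError when the u flag or `\p{...}` is not supported.

```javascript
// ES5-compatible source for one Han character (same ranges as the transpiled pattern).
var HAN_ES5 =
  "[\\u2E80-\\u2E99\\u2E9B-\\u2EF3\\u2F00-\\u2FD5\\u3005\\u3007\\u3021-\\u3029" +
  "\\u3038-\\u303B\\u3400-\\u4DB5\\u4E00-\\u9FEF\\uF900-\\uFA6D\\uFA70-\\uFAD9]" +
  "|[\\uD840-\\uD868\\uD86A-\\uD86C\\uD86F-\\uD872\\uD874-\\uD879][\\uDC00-\\uDFFF]" +
  "|\\uD869[\\uDC00-\\uDED6\\uDF00-\\uDFFF]" +
  "|\\uD86D[\\uDC00-\\uDF34\\uDF40-\\uDFFF]" +
  "|\\uD86E[\\uDC00-\\uDC1D\\uDC20-\\uDFFF]" +
  "|\\uD873[\\uDC00-\\uDEA1\\uDEB0-\\uDFFF]" +
  "|\\uD87A[\\uDC00-\\uDFE0]" +
  "|\\uD87E[\\uDC00-\\uDE1D]";

// Hypothetical helper: the ES5 fallback regex.
function es5HanSpaceRegex() {
  return new RegExp("(" + HAN_ES5 + ")\\s+(?=(?:" + HAN_ES5 + "))", "g");
}

// Hypothetical helper: prefer the ES2018 property escape when the engine accepts it.
function pickHanSpaceRegex() {
  try {
    // Throws a SyntaxError in engines without the u flag or \p{...} support.
    return new RegExp("(\\p{Script=Hani})\\s+(?=\\p{Script=Hani})", "gu");
  } catch (e) {
    return es5HanSpaceRegex();
  }
}

// Hypothetical helper: removes whitespace between adjacent Han characters.
function removeHanSpaces(text) {
  return text.replace(pickHanSpaceRegex(), "$1");
}

console.log(removeHanSpaces("請 把 這 裡 的 10 多 個 字 合 併. Can you help me?"));
// 請把這裡的 10 多個字合併. Can you help me?
```

Building the patterns from strings rather than regex literals keeps the file parseable in engines that would reject the `\p{...}` literal at load time.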
