Regexp to match every occurrence after n occurrences


Given some CSV data with unescaped commas in the final field, like this:

1, 2, 3, 4, 5
a, b, c, d, foo bar
a, b, c, d, Lorem Ipsum, dolores umbridge, something latin
a, b, c, d, upcoming unescaped commas!, one, two, three, oh no!

I want a regexp to match all the commas after the 4th comma on each line, so I can replace them with an escaped comma: \,

This is my terrible attempt so far, which seems to return only the last occurrence after the first n occurrences:

^([^,]*,){4}([^,]*(,)[^,]*)*
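
One likely reason for that symptom: in JavaScript, a capture group inside a quantifier keeps only the text from its final iteration, so the pattern can match the whole line while the groups report only the last pieces. A minimal sketch using one of the sample lines above (the variable names are only for illustration):

```javascript
// The attempt from the question, applied to one sample line.
const attempt = /^([^,]*,){4}([^,]*(,)[^,]*)*/;
const line = "a, b, c, d, Lorem Ipsum, dolores umbridge, something latin";
const match = attempt.exec(line);

// The overall match covers the entire line...
const wholeMatch = match[0];

// ...but every repeated group only remembers its last iteration.
const lastPrefixField = match[1]; // " d,"
const lastTailChunk = match[2];   // ", something latin"
const lastComma = match[3];       // "," (only the final comma, not each one)

console.log({ wholeMatch, lastPrefixField, lastTailChunk, lastComma });
```

So the earlier commas are consumed by the match but never exposed individually, which is why a single anchored pattern like this cannot drive a per-comma replacement.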

For some context:

Some formats that claim to be partially compatible with CSV, such as ASS, assume that it's OK to have unescaped commas in the last field, because the number of fields is already known from the header line.

You can see this in the ASS specification:

The format line specifies how SSA will interpret all following Event lines. The field names must be spelled correctly, and are as follows:

Marked, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text

The last field will always be the Text field, so that it can contain commas.

and here:

The information fields in each line are separated by a commas. This makes it illegal to use commas in character names and style names (SSA prevents you putting commas in these). It also makes it quite easy to load chunks of an SSA script into a spreadsheet as a CSV file, and chop out columns of information you need for another subtitling program.
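
To make that assumption concrete, here is a rough sketch of header-driven splitting: the header fixes the number of fields, and whatever is left after the last expected separator is treated as one final field, commas included. The helper name splitByHeader and the header "c1, c2, c3, c4, text" are made up for illustration, and trimming whitespace is an assumption that suits the sample data rather than anything the specification requires.

```javascript
// Hypothetical helper: split a data line into as many fields as the header declares.
function splitByHeader(headerLine, dataLine) {
  // The header tells us how many fields to expect.
  const fieldCount = headerLine.split(",").length;
  const parts = dataLine.split(",");
  // The first fieldCount - 1 pieces are ordinary fields...
  const fields = parts.slice(0, fieldCount - 1);
  // ...and everything left over, commas included, is the final field.
  fields.push(parts.slice(fieldCount - 1).join(","));
  // Trimming is an assumption that fits the sample data.
  return fields.map(field => field.trim());
}

const header = "c1, c2, c3, c4, text"; // made-up header with five fields
const fields = splitByHeader(
  header,
  "a, b, c, d, Lorem Ipsum, dolores umbridge, something latin"
);
console.log(fields);
```

A generic CSV parser has no such header-based rule, which is why the extra commas need escaping before the data is handed to one.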

To be able to parse files like this, assuming you've already separated the data into "Chunks", I need to also escape all the commas in the last field to work with certain csv-parsers.


Two solutions:

  • If you're doing this in an environment that supports the new lookbehind, and you have an array of strings
  • If you aren't, or you have one big string

If you can use lookbehind and have an array of strings

If you're doing this in an environment like Node.js that supports lookbehind (which was added to JavaScript in the ES2018 specification), you can do it like this:

const newData = data.map(line => line.replace(/(?<=(?:.*,){4,}.*),/g, "\\,"));

(I can only get this to work if you have an array of lines (which is what I thought you had). See the non-lookbehind version below if you have one big string.)

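That's a positive lookbehind requiring at least four occurrences of .*, (any text, then a comma) followed by .* immediately before the match position. In other words, it matches every comma that has at least four commas before it on the same line. (The example below writes [^,]*, instead of .*, inside the group; on a single line both select the same commas.)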

Example (if you have an array of lines):

const data = [
  "1, 2, 3, 4, 5",
  "a, b, c, d, foo bar",
  "a, b, c, d, Lorem Ipsum, dolores umbridge, something latin",
  "a, b, c, d, upcoming unescaped commas!, one, two, three, oh no!",
];
const newData = data.map(line => line.replace(/(?<=(?:[^,]*,){4,}.*),/g, "\\,"));
console.log(newData);
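
The 4 is hard-coded in that pattern. If the number of leading fields varies, one possible sketch builds the same lookbehind from a parameter. The name escapeCommasAfter and its arguments are hypothetical, and the default escape sequence assumes you want a backslash in front of each extra comma.

```javascript
// Hypothetical helper: escape every comma that has at least n commas before it.
function escapeCommasAfter(line, n, escaped = "\\,") {
  // Same shape as /(?<=(?:[^,]*,){4,}.*),/g, with n filled in for the 4.
  const pattern = new RegExp(`(?<=(?:[^,]*,){${n},}.*),`, "g");
  // A replacer function keeps any "$" in `escaped` from being read as a replacement pattern.
  return line.replace(pattern, () => escaped);
}

const sample = "a, b, c, d, upcoming unescaped commas!, one, two, three, oh no!";
const escapedSample = escapeCommasAfter(sample, 4);
console.log(escapedSample);
// a, b, c, d, upcoming unescaped commas!\, one\, two\, three\, oh no!
```

Like the version above, this expects a single line per call, since [^,]* would otherwise run across line breaks.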

If you can't use lookbehind or have one big string

If you can't use lookbehind, you could capture the text before the relevant commas and use replace on the text after, with the function callback version of replace:

const newData = data.map(line =>
    line.replace(/^((?:[^,]*,){4})(.*)$/, (m, c0, c1) => c0 + c1.replace(/,/g, "\\,"))
);

Example (if data is an array):

const data = [
  "1, 2, 3, 4, 5",
  "a, b, c, d, foo bar",
  "a, b, c, d, Lorem Ipsum, dolores umbridge, something latin",
  "a, b, c, d, upcoming unescaped commas!, one, two, three, oh no!",
];
const newData = data.map(line =>
    line.replace(/^((?:[^,]*,){4})(.*)$/, (m, c0, c1) => c0 + c1.replace(/,/g, "\\,"))
);
console.log(newData);

Or if data is one big string:

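// [^,\n] keeps each field on its own line, so a line with fewer than four commas can't borrow commas from the next one
const newData = data.replace(/^((?:[^,\n]*,){4})(.*)$/gm, (m, c0, c1) => c0 + c1.replace(/,/g, "\\,"));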

Example (if data is one big string):

const data = `1, 2, 3, 4, 5
a, b, c, d, foo bar
a, b, c, d, Lorem Ipsum, dolores umbridge, something latin
a, b, c, d, upcoming unescaped commas!, one, two, three, oh no!`;
// [^,\n] keeps each field on its own line, so short lines can't borrow commas from the next one
const newData = data.replace(/^((?:[^,\n]*,){4})(.*)$/gm, (m, c0, c1) => c0 + c1.replace(/,/g, "\\,"));
console.log(newData);
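
If the same code has to run in engines both with and without lookbehind, one option is to pick between the two approaches at runtime. This is only a sketch: the function names are hypothetical, and it works on a single line at a time. The lookbehind pattern is built with the RegExp constructor instead of a regex literal, because an unsupported literal would be a syntax error for the whole script before any check could run.

```javascript
// Hypothetical feature test: does this engine accept a lookbehind assertion?
function supportsLookbehind() {
  try {
    new RegExp("(?<=a)b");
    return true;
  } catch (err) {
    return false;
  }
}

// Callback approach: keep the first four fields, escape commas in the rest.
function escapeWithCallback(line) {
  return line.replace(
    /^((?:[^,]*,){4})(.*)$/,
    (m, c0, c1) => c0 + c1.replace(/,/g, "\\,")
  );
}

// Lookbehind approach, constructed from a string so older engines can still parse this file.
function escapeWithLookbehind(line) {
  const pattern = new RegExp("(?<=(?:[^,]*,){4,}.*),", "g");
  return line.replace(pattern, "\\,");
}

// Choose once, then use the same function for every line.
const escapeLine = supportsLookbehind() ? escapeWithLookbehind : escapeWithCallback;

console.log(escapeLine("a, b, c, d, foo, bar"));
// a, b, c, d, foo\, bar
```

Both functions should give the same result for any single line, so the choice only affects which engines can run the code.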
