Advanced Python Regex: how to evaluate and extract nested lists and numbers from a multiline string?


I was trying to separate the elements from a multiline string:

lines = '''c0 c1 c2 c3 c4 c5
0   10 100.5 [1.5, 2]     [[10, 10.4], [c, 10, eee]]  [[a , bg], [5.5, ddd, edd]] 100.5
1   20 200.5 [2.5, 2]     [[20, 20.4], [d, 20, eee]]  [[a , bg], [7.5, udd, edd]] 200.5'''

My aim is to get a list lst such that:

# first value is index
lst[0] = ['c0', 'c1', 'c2', 'c3', 'c4','c5']
lst[1] = [0, 10, 100.5, [1.5, 2], [[10, 10.4], ['c', 10, 'eee']], [['a' , 'bg'], [5.5, 'ddd', 'edd']], 100.5 ]
lst[2] = [1, 20, 200.5, [2.5, 2], [[20, 20.4], ['d', 20, 'eee']], [['a' , 'bg'], [7.5, 'udd', 'edd']], 200.5 ]

My attempt so far is this:

import re

lines = '''c0 c1 c2 c3 c4 c5
0   10 100.5 [1.5, 2]     [[10, 10.4], [c, 10, eee]]  [[a , bg], [5.5, ddd, edd]] 100.5
1   20 200.5 [2.5, 2]     [[20, 20.4], [d, 20, eee]]  [[a , bg], [7.5, udd, edd]] 200.5'''

# get n elements for n lines and remove empty lines
lines = lines.split('\n')
lines = list(filter(None,lines))

lst = []
lst.append(lines[0].split())

for i in range(1,len(lines)):
  change = re.sub('([a-zA-Z]+)', r"'\1'", lines[i])
  lst.append(change)

for i in lst[1]:
  print(i)

How to fix the regex?

Update
Test datasets

data = """
    orig  shifted  not_equal  cumsum  lst
0     10      NaN       True       1  [[10, 10.4], [c, 10, eee]]
1     10     10.0      False       1  [[10, 10.4], [c, 10, eee]]
2     23     10.0       True       2  [[10, 10.4], [c, 10, eee]]
"""
# Gives: ValueError: malformed node or string:

data = """
    Name Result Value
0   Name1   5   2
1   Name1   5   3
2   Name2   11  1
"""
# gives same error

data = """
product  value
0       A     25
1       B     45
2       C     15
3       C     14
4       C     13
5       B     22
"""
# gives same error

data = '''
    c0 c1
0   10 100.5
1   20 200.5
'''
# works perfect

 


As noted in the comments, this task is not a good fit for regex. Python's re module has no recursion, so a pattern cannot match brackets that are nested to an arbitrary depth. What you need is a parser.

One of the ways to create a parser is PEG, which lets you set up a list of tokens and their relations to each other in a declarative language. This parser definition is then turned into an actual parser that can handle the described input. When parsing succeeds, you will get back a tree structure with all the items properly nested.
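As a rough illustration of what such a parser does internally, here is a minimal hand-written recursive-descent sketch in plain Python. It is not a PEG parser and not the grammar used in this answer. The names TOKEN, to_value, parse_items, parse_line and parse_text are made up for this example, and it assumes that commas are mere separators (it does not check their placement) and that whitespace never occurs inside a value.

```python
import re

# Tokens: an opening bracket, a closing bracket, a comma,
# or a run of characters that is none of those (and not whitespace).
TOKEN = re.compile(r"\[|\]|,|[^\s\[\],]+")


def to_value(text):
    """Convert a bare token to int or float if possible, else keep the string."""
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            pass
    return text


def parse_items(tokens, pos, in_group):
    """Parse items starting at tokens[pos]; return (items, next position)."""
    items = []
    while pos < len(tokens):
        tok = tokens[pos]
        if tok == "[":
            # The recursive call is what lets groups nest to any depth.
            group, pos = parse_items(tokens, pos + 1, True)
            items.append(group)
        elif tok == "]":
            if not in_group:
                raise ValueError("unexpected ']'")
            return items, pos + 1
        elif tok == ",":
            pos += 1  # commas only separate items, so they are skipped
        else:
            items.append(to_value(tok))
            pos += 1
    if in_group:
        raise ValueError("missing ']'")
    return items, pos


def parse_line(line):
    """Parse one line of text into a (possibly nested) list of values."""
    items, _ = parse_items(TOKEN.findall(line), 0, False)
    return items


def parse_text(text):
    """Parse every non-empty line of a multiline string."""
    return [parse_line(line) for line in text.splitlines() if line.strip()]


lines = '''c0 c1 c2 c3 c4 c5
0   10 100.5 [1.5, 2]     [[10, 10.4], [c, 10, eee]]  [[a , bg], [5.5, ddd, edd]] 100.5
1   20 200.5 [2.5, 2]     [[20, 20.4], [d, 20, eee]]  [[a , bg], [7.5, udd, edd]] 200.5'''

for row in parse_text(lines):
    print(row)
```

Because to_value() relies on float(), a token such as NaN comes back as a float nan, while True and False stay plain strings in this sketch.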

For demonstration purposes, I've used the JavaScript implementation peg.js, which has an online demo page where you can live-test parsers against some input. This parser definition:

{
    // [value, [[delimiter, value], ...]] => [value, value, ...]
    const list = values => [values[0]].concat(values[1].map(i => i[1]));
}
document
    = line*
line "line"
    = value:(item (whitespace item)*) whitespace? eol { return list(value) }
item "item"
    = number / string / group
group "group"
    = "[" value:(item (comma item)*) whitespace? "]" { return list(value) }
comma "comma"
    = whitespace? "," whitespace?
number "number"
    = value:$[0-9.]+ { return +value }
string "string"
    = $([^ 0-9\[\]\r\n,] [^ \[\]\r\n,]*)
whitespace "whitespace"
    = $" "+
eol "eol"
    = [\r]? [\n] / eof
eof "eof"
    = !.

can understand this kind of input:

c0 c1 c2 c3 c4 c5
0   10 100.5 [1.5, 2]     [[10, 10.4], [c, 10, eee]]  [[a , bg], [5.5, ddd, edd]]
1   20 200.5 [2.5, 2]     [[20, 20.4], [d, 20, eee]]  [[a , bg], [7.5, udd, edd1]]

and produces this object tree (JSON notation):

[
    ["c0", "c1", "c2", "c3", "c4", "c5"],
    [0, 10, 100.5, [1.5, 2], [[10, 10.4], ["c", 10, "eee"]], [["a", "bg"], [5.5, "ddd", "edd"]]],
    [1, 20, 200.5, [2.5, 2], [[20, 20.4], ["d", 20, "eee"]], [["a", "bg"], [7.5, "udd", "edd1"]]]
]

i.e.

  • an array of lines,
  • each of which is an array of values,
  • each of which can be either a number, or a string, or another array of values

This tree structure can then be handled by your program.
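To give an idea of what handling the tree can look like on the Python side, here is a small sketch that walks the nested lists recursively. The flatten() helper is a made-up name for this example, and the tree literal is simply the JSON output shown above typed in as a Python list.

```python
def flatten(value):
    """Yield every leaf (number or string) of a nested list, depth first."""
    if isinstance(value, list):
        for child in value:
            yield from flatten(child)
    else:
        yield value


tree = [
    ["c0", "c1", "c2", "c3", "c4", "c5"],
    [0, 10, 100.5, [1.5, 2], [[10, 10.4], ["c", 10, "eee"]], [["a", "bg"], [5.5, "ddd", "edd"]]],
    [1, 20, 200.5, [2.5, 2], [[20, 20.4], ["d", 20, "eee"]], [["a", "bg"], [7.5, "udd", "edd1"]]],
]

# The first line holds the column names, the remaining lines hold the data.
header, *rows = tree
for row in rows:
    # Collect only the numeric leaves of each row, however deeply nested.
    numbers = [leaf for leaf in flatten(row) if isinstance(leaf, (int, float))]
    print(row[0], numbers)
```

The same recursive pattern works for any other per-value processing, since every element is either a leaf or another list.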

The parser definition above would work, for example, with node.js to turn your input into JSON. The following minimal JS program accepts data from STDIN and writes the parsed result to STDOUT:

// reference the parser.js file, e.g. downloaded from https://pegjs.org/online
const parser = require('./parser');

var chunks = [];

// handle STDIN events to slurp up all the input into one big string
process.stdin.on('data', buffer => chunks.push(buffer.toString()));
process.stdin.on('end', function () {
    var text = chunks.join('');
    var data = parser.parse(text);
    var json = JSON.stringify(data, null, 4);
    process.stdout.write(json);
});

// start reading from STDIN
process.stdin.resume();

Save it as text2json.js or something like that and redirect (or pipe) some text into it:

# input redirection (this works on Windows, too)
node text2json.js < input.txt > output.json

# common alternative, but I'd recommend input redirection over this
cat input.txt | node text2json.js > output.json
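Since the question is about Python, the resulting JSON can be read back with the standard json module. A minimal sketch, assuming the output.json file name from the commands above and a document that contains at least a header line; load_table and load_table_file are hypothetical helper names:

```python
import json


def load_table(json_text):
    """Split the parsed JSON document into its header line and data lines."""
    header, *rows = json.loads(json_text)
    return header, rows


def load_table_file(path="output.json"):
    """Same as load_table(), but reads the JSON text from a file."""
    with open(path, encoding="utf8") as f:
        return load_table(f.read())


# Example with an inline JSON string instead of a file:
header, rows = load_table('[["c0", "c1"], [0, 10, [1.5, "c"]]]')
print(header)  # ['c0', 'c1']
print(rows)    # [[0, 10, [1.5, 'c']]]
```

Nested JSON arrays arrive as nested Python lists, so no further conversion is needed.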

There are PEG parser generators for Python as well, for example https://github.com/erikrose/parsimonious. The parser creation language differs between implementations, so the above can only be used for peg.js, but the principle is exactly the same.


EDIT I've dug into Parsimonious and recreated the above solution in Python code. The approach is the same, the parser grammar is the same, with a few tiny syntactical changes.

from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor

grammar = Grammar(
    r"""
    document   = line*
    line       = whitespace? item (whitespace item)* whitespace? eol
    item       = group / number / boolean / string
    group      = "[" item (comma item)* whitespace? "]"
    comma      = whitespace? "," whitespace?
    number     = "NaN" / ~"[0-9.]+"
    boolean    = "True" / "False"
    string     = ~"[^ 0-9\[\]\r\n,][^ \[\]\r\n,]*"
    whitespace = ~" +"
    eol        = ~"\r?\n" / eof
    eof        = ~"$"
    """)

class DataExtractor(NodeVisitor):
    @staticmethod
    def concat_items(first_item, remaining_items):
        """ helper to concat the values of delimited items (lines or groups) """
        return first_item + list(map(lambda i: i[1][0], remaining_items))

    def generic_visit(self, node, processed_children):
        """ in general we just want to see the processed children of any node """
        return processed_children

    def visit_line(self, node, processed_children):
        """ line nodes return an array of their processed_children """
        _, first_item, remaining_items, _, _ = processed_children
        return self.concat_items(first_item, remaining_items)

    def visit_group(self, node, processed_children):
        """ group nodes return an array of their processed_children """
        _, first_item, remaining_items, _, _ = processed_children
        return self.concat_items(first_item, remaining_items)

    def visit_number(self, node, processed_children):
        """ number nodes return floats (nan is a special value of floats) """
        return float(node.text)

    def visit_boolean(self, node, processed_children):
        """ boolean nodes return True or False """
        return node.text == "True"

    def visit_string(self, node, processed_children):
        """ string nodes just return their own text """
        return node.text

The DataExtractor is responsible for traversing the tree and pulling out data from the nodes, returning lists of strings, numbers, booleans, or NaN.

The concat_items() function performs the same task as the list() function in the JavaScript code above, and the other functions also have their equivalents in the peg.js approach. The difference is that peg.js integrates them directly into the parser definition, whereas Parsimonious expects them in a separate class. That makes it a bit wordier in comparison, but not too bad.
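To make the indexing in concat_items() easier to follow, here is the helper on its own, fed with hand-built values. The shapes are an assumption derived from generic_visit() returning the processed children as a list: an item arrives wrapped in a one-element list, and each remaining entry is a [delimiter, item] pair. None stands in for whatever the delimiter node produced, since that value is ignored anyway.

```python
def concat_items(first_item, remaining_items):
    """Same body as the static method in DataExtractor."""
    return first_item + list(map(lambda i: i[1][0], remaining_items))


# Hand-built stand-ins for what the visitor passes in (assumed shapes):
first_item = [0.0]                # one item, wrapped in a list
remaining_items = [
    [None, [10.0]],               # [delimiter, item]
    [None, ["c"]],
    [None, [[1.5, 2.0]]],         # the item itself is a nested group
]

print(concat_items(first_item, remaining_items))
# [0.0, 10.0, 'c', [1.5, 2.0]]
```

So i[1] picks the item out of the pair and the trailing [0] unwraps it from its one-element list.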

Usage, assuming an input file called "data.txt", also mirrors the JS code:

de = DataExtractor()

with open("data.txt", encoding="utf8") as f:
    text = f.read()

tree = grammar.parse(text)
data = de.visit(tree)
print(data)

Input:

orig shifted not_equal cumsum lst
0 10 NaN True 1 [[10, 10.4], [c, 10, eee]]
1 10 10.0 False 1 [[10, 10.4], [c, 10, eee]]
2 23 10.0 True 2 [[10, 10.4], [c, 10, eee]]

Output:

[
    ['orig', 'shifted', 'not_equal', 'cumsum', 'lst'],
    [0.0, 10.0, nan, True, 1.0, [[10.0, 10.4], ['c', 10.0, 'eee']]],
    [1.0, 10.0, 10.0, False, 1.0, [[10.0, 10.4], ['c', 10.0, 'eee']]],
    [2.0, 23.0, 10.0, True, 2.0, [[10.0, 10.4], ['c', 10.0, 'eee']]]
]

In the long run, I would expect this approach to be more maintainable and flexible than regex hackery. For example, adding explicit support for NaN and for booleans was easy. The peg.js solution above does not have this; there they are parsed as strings.
