Splitting a string based on a pattern in Python


I have long strings such as

"123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products"

and

"321 - Apparel & Accessories, 4321 - Apparel & Accessories > Handbags, Wallets & Cases, 187 - Apparel & Accessories > Shoes"

I want to split them on the pattern "a number, a space, a dash, a space, then some text up to the next number, space, dash, space, or the end of the string". Note that the text itself may contain commas, ampersands, '>' and other special characters, so splitting on those characters will not work. I believe Python can split on regular expressions, but I am having trouble writing the right one.

I have only a very basic knowledge of regular expressions. I can write a regex for numbers, as well as for alphanumeric strings, but I don't know how to express "take everything until the next number starts".


Update: Expected output:

first case:

["123 - Footwear", "5678 - Apparel, Accessories & Luxury Goods", "9876 - Leisure Products"]

second case:

["321 - Apparel & Accessories", "4321 - Apparel & Accessories > Handbags, Wallets & Cases", "187 - Apparel & Accessories > Shoes"]

 


You may match substrings that start with one or more digits followed by 1+ whitespaces, a -, and 1+ whitespaces, and that run up to the next occurrence of the same pattern or the end of the string.

re.findall(r"\d+\s+-\s+.*?(?=\s*(?:,\s*)?\d+\s+-\s|\Z)", s, re.S)

Note: If the leading number always has more than one digit, say at least two, you may replace \d+ with \d{2,}, and so on. Adjust as you see fit.
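To illustrate what the note changes, here is a small sketch comparing the two patterns. The input string is made up for this example (it is not from the question), and the names RX_ANY and RX_TWO_PLUS are just labels chosen here; the sketch assumes \d{2,} is substituted in both places where \d+ appears.

```python
import re

# Original pattern: any run of digits may start an item.
RX_ANY = r"\d+\s+-\s+.*?(?=\s*(?:,\s*)?\d+\s+-\s|\Z)"
# Variant from the note: a leading number must have at least two digits.
RX_TWO_PLUS = r"\d{2,}\s+-\s+.*?(?=\s*(?:,\s*)?\d{2,}\s+-\s|\Z)"

# Hypothetical input: the first item's name itself contains "5 - ".
sample = "123 - Size 5 - Kids, 5678 - Apparel"

any_digits = re.findall(RX_ANY, sample, re.S)
two_plus = re.findall(RX_TWO_PLUS, sample, re.S)

print(any_digits)  # ['123 - Size', '5 - Kids', '5678 - Apparel']
print(two_plus)    # ['123 - Size 5 - Kids', '5678 - Apparel']
```

With \d+, the single digit inside the name is treated as the start of a new item; requiring two or more digits keeps that name in one piece, provided real item numbers never have just one digit.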

Pattern details:

  • \d+ - 1+ digits
  • \s+-\s+ - a - enclosed with 1+ whitespaces
  • .*? - any 0+ chars, as few as possible (with re.S, . also matches line breaks), up to the location in the string that is followed by...
  • (?=\s*(?:,\s*)?\d+\s+-\s|\Z) - a positive lookahead that requires one of:
    • \s*(?:,\s*)?\d+\s+-\s - 0+ whitespaces, an optional substring of a comma and 0+ whitespaces after it, 1+ digits, 1+ whitespaces, - and a whitespace
    • | - or
    • \Z - end of string
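The breakdown above can also be kept next to the pattern itself. As a sketch, the same regex can be written with re.VERBOSE, which ignores whitespace inside the pattern and allows # comments; the name RX and the variable items are just labels chosen here.

```python
import re

RX = re.compile(r"""
    \d+                 # 1+ digits
    \s+-\s+             # a dash enclosed with 1+ whitespaces
    .*?                 # any 0+ chars, as few as possible
    (?=                 # positive lookahead: what follows must be...
        \s*(?:,\s*)?\d+\s+-\s   # the start of the next item
        |               # or
        \Z              # end of string
    )
""", re.VERBOSE | re.DOTALL)

text = "123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products"
items = RX.findall(text)
print(items)
# ['123 - Footwear', '5678 - Apparel, Accessories & Luxury Goods', '9876 - Leisure Products']
```

This works without changes because the pattern matches whitespace only through \s, never through a literal space, so verbose mode does not alter its meaning. re.DOTALL is the long name for re.S.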

Python demo:

import re

# One item: digits, " - ", then as little text as possible up to the next item or the end of string
rx = r"\d+\s+-\s+.*?(?=\s*(?:,\s*)?\d+\s+-\s|\Z)"
texts = [
    "123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products",
    "321 - Apparel & Accessories, 4321 - Apparel & Accessories > Handbags, Wallets & Cases, 187 - Apparel & Accessories > Shoes",
]
for s in texts:
    print("--- {} ---".format(s))
    print(re.findall(rx, s, re.S))

Output:

--- 123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products ---
['123 - Footwear', '5678 - Apparel, Accessories & Luxury Goods', '9876 - Leisure Products']
--- 321 - Apparel & Accessories, 4321 - Apparel & Accessories > Handbags, Wallets & Cases, 187 - Apparel & Accessories > Shoes ---
['321 - Apparel & Accessories', '4321 - Apparel & Accessories > Handbags, Wallets & Cases', '187 - Apparel & Accessories > Shoes']
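Since the question asks about splitting, a different approach with re.split may also be worth a look. This is only a sketch, not the method used above: it assumes items are always separated by a comma (the findall pattern treats the comma as optional), and split_items is a hypothetical helper name.

```python
import re

def split_items(text):
    """Split on commas that are followed by digits, whitespace, a dash and whitespace."""
    # The lookahead (?=...) is zero-width, so the number stays with the next item
    # and commas inside a name (not followed by "digits - ") are left alone.
    return re.split(r"\s*,\s*(?=\d+\s+-\s)", text)

s = "123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products"
parts = split_items(s)
print(parts)
# ['123 - Footwear', '5678 - Apparel, Accessories & Luxury Goods', '9876 - Leisure Products']
```

Unlike the findall version, this keeps any text that precedes the first number as part of the first element, so it is best suited to strings that are known to start with an item.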
