I needed to use a nested group in a regex string and didn’t understand how it worked at first, so I wrote this post.
How grouping works
Let’s recap how grouping works. If we want to extract keys or values from a formatted string with a regex, we need to use groups.
Assume that we have the following two formats. What we need is the dot-chained string, the key, and the value.
- “xxx.yyy.zzz key = value;”
- “xxx.yyy key = value;”
To extract the data, we can implement regex in the following way.
import re

def run_regex(text):
    # Group 1: dot-chained string, group 2: key, group 3: integer value
    regex_str = r"([0-9a-zA-Z_\.]+) ([0-9a-zA-Z_]+) * = *(\d+);"
    result = re.search(regex_str, text)
    if result:
        print(result)
        print(result.group(0))
        print(result.group(1))
        print(result.group(2))
        print(result.group(3))
Let’s check the two results.
run_regex("my.data.type data_1 = 1;")
# <re.Match object; span=(0, 24), match='my.data.type data_1 = 1;'>
# my.data.type data_1 = 1;
# my.data.type
# data_1
# 1
run_regex("product.count count = 26;")
# <re.Match object; span=(0, 25), match='product.count count = 26;'>
# product.count count = 26;
# product.count
# count
# 26
Index 0 contains the whole text that matches the regex string. Index 1 corresponds to the first pair of parentheses, and indexes 2 and 3 correspond to the second and third pairs.
By using this grouping, we can easily extract the desired value from a text.
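As a small sketch of how the captured values could be used in practice, the three groups can be unpacked into variables with groups(), which returns groups 1 onward as a tuple (group 0, the whole match, is not included). The function name parse_assignment and the returned dictionary layout are assumptions made for illustration.

```python
import re

# Same pattern as above: dot-chained string, key, and integer value.
ASSIGNMENT_RE = re.compile(r"([0-9a-zA-Z_\.]+) ([0-9a-zA-Z_]+) * = *(\d+);")


def parse_assignment(text):
    """Return the captured parts as a dict, or None if the text does not match."""
    result = ASSIGNMENT_RE.search(text)
    if result is None:
        return None
    # groups() returns groups 1 to 3 as a tuple; group 0 is excluded.
    type_name, key, value = result.groups()
    return {"type": type_name, "key": key, "value": int(value)}


print(parse_assignment("my.data.type data_1 = 1;"))
# {'type': 'my.data.type', 'key': 'data_1', 'value': 1}
print(parse_assignment("product.count count = 26;"))
# {'type': 'product.count', 'key': 'count', 'value': 26}
```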
Nested grouping
We recapped the basics. Then, how does it work if the grouping is nested? Let’s assume that we have the following two formats.
format1_text = "product_name: super-pc, price: 100"
format2_text = "name<super-pc>, price<100>"
Both of them contain a product name and a price, but the formats differ. Normally we would need two regex strings, but we want to process both formats with a single one.
The first regex string is as follows.
product_name: (.+), price: (\d+)
Likewise, the second regex string is as follows.
name<(.+)>, price<(\d+)>
If we want to use both regex strings in a single regex string, it will be the following.
regex_str = r"(product_name: (.+), price: (\d+))|(name<(.+)>, price<(\d+)>)"
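We wrap each of the two regex strings in parentheses and join them with a vertical bar: (A)|(B) matches A or B.
The vertical bar has the lowest precedence in a regex, so the outer parentheses are not strictly required here, but they make the boundaries of each alternative explicit and give each format its own group.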
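To see the (A)|(B) behaviour in isolation, here is a minimal sketch with two one-word alternatives. The pattern and the sample strings are made up for illustration.

```python
import re

# Two alternatives, each wrapped in its own capturing group.
pattern = re.compile(r"(cat)|(dog)")

for text in ["black cat", "hot dog"]:
    result = pattern.search(text)
    # Only the alternative that matched has a value; the other group is None.
    print(text, "->", result.group(1), result.group(2))
# black cat -> cat None
# hot dog -> None dog
```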
Let’s try to process it.
def run_nested_regex(text):
    regex_str = r"(product_name: (.+), price: (\d+))|(name<(.+)>, price<(\d+)>)"
    result = re.search(regex_str, text)
    if result is not None:
        for index, group in enumerate(result.groups()):
            print(f"{index}: {group}")

format1_text = "product_name: super-pc, price: 100"
format2_text = "name<super-pc>, price<100>"
run_nested_regex(format1_text)
# 0: product_name: super-pc, price: 100
# 1: super-pc
# 2: 100
# 3: None
# 4: None
# 5: None
run_nested_regex(format2_text)
# 0: None
# 1: None
# 2: None
# 3: name<super-pc>, price<100>
# 4: super-pc
# 5: 100
According to this result, we would take indexes 1 and 2 for the first format and 4 and 5 for the second format.
Really? Look at the next example.
def run_nested_regex2(text):
    regex_str = r"(product_name: (.+), price: (\d+))|(name<(.+)>, price<(\d+)>)"
    result = re.search(regex_str, text)
    if result is not None:
        print(result.lastindex)
        print(f"0: {result.group(0)}")
        print(f"1: {result.group(1)}")
        print(f"2: {result.group(2)}")
        print(f"3: {result.group(3)}")
        print(f"4: {result.group(4)}")
        print(f"5: {result.group(5)}")
        print(f"6: {result.group(6)}")

print("--- run_nested_regex2")
run_nested_regex2(format1_text)
# 1
# 0: product_name: super-pc, price: 100
# 1: product_name: super-pc, price: 100
# 2: super-pc
# 3: 100
# 4: None
# 5: None
# 6: None
run_nested_regex2(format2_text)
# 4
# 0: name<super-pc>, price<100>
# 1: None
# 2: None
# 3: None
# 4: name<super-pc>, price<100>
# 5: super-pc
# 6: 100
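What!? The indexes are different from the previous example. The reason is that groups() returns only the capturing groups starting from group 1, so index 0 in the previous enumerate loop was actually group(1). With group(), index 0 is always the whole text that matches the regex string.
Groups are numbered by the position of their opening parenthesis, counting from the left. Indexes 1 to 3 are the result for the first format. Index 1 is the outer group, that is, the whole text that matches the first format, so the product name and price are stored in indexes 2 and 3.
Indexes 4 to 6 are the result for the second format. Likewise, index 4 is the whole text that matches the second format, and the product name and price are stored in indexes 5 and 6 respectively. The groups of the format that did not match are None, and lastindex (1 or 4 here) is the index of the outer group that matched.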
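Putting the indexes above together, one possible way to read the product name and price from either format is sketched below. The helper name extract_product and the choice to return a (name, price) tuple are assumptions for illustration, not the only way to do it.

```python
import re

PRODUCT_RE = re.compile(
    r"(product_name: (.+), price: (\d+))|(name<(.+)>, price<(\d+)>)"
)


def extract_product(text):
    """Return (name, price) for either format, or None if nothing matches."""
    result = PRODUCT_RE.search(text)
    if result is None:
        return None
    if result.group(1) is not None:
        # First format matched: its inner groups are 2 and 3.
        return result.group(2), int(result.group(3))
    # Otherwise the second format matched: its inner groups are 5 and 6.
    return result.group(5), int(result.group(6))


print(extract_product("product_name: super-pc, price: 100"))
# ('super-pc', 100)
print(extract_product("name<super-pc>, price<100>"))
# ('super-pc', 100)
```

Checking group(1) against None picks the branch explicitly; comparing result.lastindex with 1 or 4 would be an alternative for this particular pattern.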