Powershell RegEx - capturing “too much” (not honoring non-Greedy indicators?)
The code below is returning:
partner=<Partner>
more stuff <Name>Test</Name>
other things </Partner> <Partner>
more stuff <Name>CompanyX</Name>
other things </Partner>
but I want it to return:
partner=<Partner>
more stuff <Name>CompanyX</Name>
other things </Partner>
Sample Code:
$partyName = "CompanyX"
#$bindings = [IO.File]::ReadAllText($inputFileName)
$bindings = "starting stuff <Partner>`r`n more stuff <Name>Test</Name>`n other things </Partner> <Partner>`r`n more stuff <Name>CompanyX</Name>`n other things </Partner> ending stuff"
$found = $bindings -match "(?s)(<Partner>.*?<Name>$partyName</Name>.*?</Partner>)"
if ($found)
{
    Write-Host "matched"
    $partner = $matches[1]
    Write-Host "partner=$partner "
}
In short: Don't parse XML yourself with regex... Use an xml parser.
– TheIncorrigible1
Aug 10 at 21:56
Deleting my answer because it was far too fragile. I'm relatively certain someone that's very familiar with balanced constructs can give you a reasonable regex... But I suspect even then a manual parsing solution is going to be easier for most people to read.
– zzxyz
Aug 10 at 22:19
The basic issue, to summarize, is that as soon as the regex engine sees its first <Partner>, it starts working to make THAT match. With THAT match, it is honoring the lazy indicator as much as possible. It's basically working left to right, in other words.
– zzxyz
Aug 10 at 22:42
2 Answers
As TheIncorrigible1 says: Use an xml parser instead of Regex.
However, if your reason for using a regex is simply to see whether and how it can be done with a regular expression, you can use:
$found = $bindings -match "(?sx)(<Partner>(?:((?!</Partner>).)+<Name>$([Regex]::Escape($partyName))</Name>)(?:((?!</Partner>).))*</Partner>)"
The non-greedy duplication symbols (.*?) are being honored, but they're not enough in this case:
<Partner>.*?<Name>$partyName</Name> matches between <Partner> and the next instance of the <Name> element, but that doesn't guarantee that there won't be another <Partner> tag in between.
In other words: Your regex will invariably match between the first <Partner> tag and the <Name> element of interest.
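A minimal sketch of that behavior, assuming a shortened single-line stand-in for the question's input (the sample string and the variable names $lazyMatch and $partnerTagCount are illustrative only):

```powershell
# Shortened, hypothetical stand-in for the question's input.
$bindings = "a <Partner> x <Name>Test</Name> y </Partner> <Partner> p <Name>CompanyX</Name> q </Partner> z"

# Same shape as the question's pattern: lazy quantifiers only.
if ($bindings -match '(?s)<Partner>.*?<Name>CompanyX</Name>.*?</Partner>') {
    $lazyMatch = $Matches[0]
    # The match starts at the FIRST <Partner>, so it spans both elements.
    $lazyMatch
    # Count the opening tags inside the match: 2 rather than the desired 1.
    $partnerTagCount = [regex]::Matches($lazyMatch, '<Partner>').Count
    $partnerTagCount
}
```

The lazy .*? only keeps the match as short as possible from the starting position the engine has already committed to; it never moves that starting position to the right.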
To prevent that, you need a negative look-ahead assertion ((?!...)) that rules out intervening <Partner> tags:
# Sample input, defined as a here-string.
$bindings = @'
starting stuff <Partner>
more stuff <Name>Test</Name>
other things </Partner> <Partner>
stuff of interest before <Name>CompanyX</Name>
stuff of interest after </Partner> even more </Partner> ending stuff
'@
# Escape the name to ensure it is treated as a literal inside the regex.
# Note: Not strictly necessary for sample value 'CompanyX'
$partyName = [regex]::Escape('CompanyX')
# Use a negative look-ahead assertion - (?!...) - to rule out intervening
# <Partner> tags before the <Name> element of interest.
if ($bindings -match "(?s)<Partner>((?!<Partner>).)*<Name>$partyName</Name>.*?</Partner>") {
    # Output the match.
    $matches[0]
} else {
    Write-Warning 'No match.'
}
The above yields:
<Partner>
stuff of interest before <Name>CompanyX</Name>
stuff of interest after </Partner>
(?!<Partner>). matches a single character (.) only if the string <Partner> does not begin at that character's position; the look-ahead inspects what follows the current position, not what precedes it.
This subexpression must itself be matched against each character (if any) between the opening <Partner> and the <Name> element of interest, hence it is wrapped in (...)*.
I presume this makes for an inefficient matching algorithm, but it does work.
As mentioned, using proper XML parsing with an XPath query is worth considering as an alternative.
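As a rough sketch of the XML-parsing alternative: the question's sample is not well-formed XML on its own, so the <Root> wrapper and the <Other> child elements below are assumptions made purely for illustration.

```powershell
# Hypothetical well-formed input; <Root> and <Other> are illustrative assumptions.
$xmlText = @'
<Root>
  <Partner><Name>Test</Name><Other>a</Other></Partner>
  <Partner><Name>CompanyX</Name><Other>b</Other></Partner>
</Root>
'@

$partyName = 'CompanyX'

# Cast the text to System.Xml.XmlDocument.
$doc = [xml] $xmlText

# XPath: the first <Partner> that has a <Name> child with the given text.
# Note: embedding the name this way breaks if it contains a single quote.
$partner = $doc.SelectSingleNode("//Partner[Name='$partyName']")

if ($partner) {
    $partner.OuterXml   # <Partner><Name>CompanyX</Name><Other>b</Other></Partner>
} else {
    Write-Warning 'No match.'
}
```

Because the parser understands element boundaries, there is no risk of a match spilling over from one <Partner> element into the next.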
You could make this matching more efficient by using (?:...)* as the wrapper, which tells the regex engine not to capture (the latest) match of the subexpression. ((...) are capture groups, meaning that what the subexpression matches is reported as part of what automatic variable $Matches returns, which is not needed here, so ?: suppresses that).
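The difference can be seen in what $Matches contains afterwards. This is a small sketch using a shortened, hypothetical input string; the variable names are illustrative only.

```powershell
# Shortened, hypothetical stand-in for the sample input.
$bindings = "a <Partner> x <Name>Test</Name> y </Partner> <Partner> p <Name>CompanyX</Name> q </Partner> z"
$partyName = [regex]::Escape('CompanyX')

$capturing    = "(?s)<Partner>((?!<Partner>).)*<Name>$partyName</Name>.*?</Partner>"
$nonCapturing = "(?s)<Partner>(?:(?!<Partner>).)*<Name>$partyName</Name>.*?</Partner>"

# With (...)*, $Matches holds entry 0 (whole match) and entry 1 (last captured character).
$null = $bindings -match $capturing
$countWithGroup = $Matches.Count      # 2

# With (?:...)*, only entry 0 remains.
$null = $bindings -match $nonCapturing
$countWithoutGroup = $Matches.Count   # 1
$wholeMatch = $Matches[0]             # <Partner> p <Name>CompanyX</Name> q </Partner>
```

Both patterns produce the same overall match; only the bookkeeping for group 1 differs.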
Possible duplicate of RegEx match open tags except XHTML self-contained tags
– TheIncorrigible1
Aug 10 at 21:55