In a recent project, I’ve begun to rely heavily on regular expressions to capture data from a looong string. This post demonstrates how I used RegexOptions.ExplicitCapture together with named groups to make the captured data really easy to read back later from the Match object.
The data is in the following format:
[Data I do not care about...]<textarea1><![CDATA[blahhhhhhh]]>
</textarea1> <textarea2 xmlns=""><![CDATA[bada bing bada boom]]>
</textarea2>[More data I do not care about]
I need to capture all the data inside those CDATA areas, and also the number of the textarea. Even though the content of the CDATA may be HTML and therefore not a regular language, I am still using a regular expression here as the content outside of the textareas is regular. If you need to extract data from HTML, I’d recommend using HTML Agility Pack.
Regex pattern = new Regex("<textarea(?<textareaid>\\d+)( xmlns=\"\"|)><!\\[CDATA\\[(?<content>.*?)\\]\\]>\\s*</textarea\\d+>", RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.ExplicitCapture);
Here I’m storing my regular expression in a variable called pattern. The expression looks more complex than it really is because it lives in a regular C# string literal: every backslash and double quote that belongs to the regex has to be escaped. The interesting parts are the RegexOptions passed as the second argument to the Regex constructor.
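If the doubled backslashes bother you, a verbatim string literal is one way around them. The snippet below is only a sketch: it uses just the opening-tag part of the pattern, the variable names are mine, and it assumes C# 9 or later so it can be written as top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

// Opening-tag part of the pattern as a regular string literal:
// every backslash and double quote has to be escaped for C#.
string escaped = "<textarea(?<textareaid>\\d+)( xmlns=\"\"|)>";

// The same text as a verbatim string literal: backslashes are written
// once, and each literal double quote is doubled.
string verbatim = @"<textarea(?<textareaid>\d+)( xmlns=""""|)>";

// Both literals produce exactly the same pattern text.
Console.WriteLine(escaped == verbatim);  // True

Regex fragment = new Regex(verbatim, RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);
Match match = fragment.Match("<textarea2 xmlns=\"\">");
Console.WriteLine(match.Groups["textareaid"].Value);  // 2
```

Which form reads better is a matter of taste: the verbatim literal keeps the regex escapes readable, but the quadrupled quotes for xmlns="" are not exactly pretty either.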
IgnoreCase is there because the case of the tags may differ between instances (for example, <TextArea1> instead of <textarea1>).
I’m using Singleline because I don’t want line breaks to get in the way, as the CDATA content may span several lines. With this option the dot (.) matches every character, including \n, so the string is treated as one long unbroken line. This is the “dotall” behaviour found in other regex flavours. It has no effect on ^ and $ (that is what Multiline is for), and I’m not using those anchors here anyway.
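To see the difference Singleline makes, here is a small sketch using only the CDATA part of the pattern. The two-line input is made up for the demonstration, and the snippet assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input: CDATA content that spans two lines.
string input = "<![CDATA[line one\nline two]]>";

// Just the CDATA part of the pattern.
string cdata = "<!\\[CDATA\\[(?<content>.*?)\\]\\]>";

// Default behaviour: the dot stops at \n, so the closing ]]> is never reached.
bool withoutSingleline = Regex.IsMatch(input, cdata);

// With Singleline, the dot also matches \n and the whole section is captured.
Match match = Regex.Match(input, cdata, RegexOptions.Singleline);

Console.WriteLine(withoutSingleline);  // False
Console.WriteLine(match.Success);      // True
```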
The final option I’m using is ExplicitCapture. With it, only the groups I explicitly name are captured; plain parentheses such as ( xmlns=\"\"|) still group part of the pattern but are not captured. That lets me pick out just the groups I’m actually interested in by giving them a name. In the above example, you can see I capture the textarea id number by writing (?<textareaid>\\d+). This matches one or more digits and calls the group “textareaid”. Awesome.
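A quick way to check what ExplicitCapture changes is to list the group names with and without it. This is a sketch on the opening-tag part of the pattern only; the variable names are mine, and it assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

// Opening-tag part of the pattern: one named group, one plain group.
string text = "<textarea(?<textareaid>\\d+)( xmlns=\"\"|)>";

Regex implicitGroups = new Regex(text);
Regex explicitGroups = new Regex(text, RegexOptions.ExplicitCapture);

// Group 0 is always the whole match. Without ExplicitCapture the plain
// parentheses become numbered group 1; with it, they are not captured.
string[] implicitNames = implicitGroups.GetGroupNames();
string[] explicitNames = explicitGroups.GetGroupNames();

Console.WriteLine(string.Join(", ", implicitNames));  // 0, 1, textareaid
Console.WriteLine(string.Join(", ", explicitNames));  // 0, textareaid
```

Besides keeping match.Groups tidy, this means I never have to count parentheses to work out which number a group ended up with.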
To get at those named groups later on, I use the following code:
MatchCollection matches = pattern.Matches(content);
string convertedContent = String.Empty;
foreach (Match match in matches)
{
    convertedContent += string.Format("<div id=\"textarea{0}\">{1}</div>", match.Groups["textareaid"].Value, match.Groups["content"].Value);
}
The string convertedContent will now contain one div per textarea, each holding the content that was inside the corresponding CDATA section. Using match.Groups["whatever"] is a really easy way to get at the match values.
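Putting it all together, here is a self-contained sketch you can run against the sample data from the top of the post. A few things in it are assumptions rather than what the project code looks like: the surrounding “[ignored]” text is made up, the \s* before the closing tag is there to tolerate the line break after ]]> in the sample, and I’ve swapped the string concatenation for a StringBuilder, which tends to behave better when there are many matches. It assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Sample input in the shape described in the post; the "[ignored]" parts are made up.
string content =
    "[ignored]<textarea1><![CDATA[blahhhhhhh]]>\n" +
    "</textarea1> <textarea2 xmlns=\"\"><![CDATA[bada bing bada boom]]>\n" +
    "</textarea2>[ignored]";

// Assumption: \s* allows whitespace (here a line break) between ]]> and the closing tag.
Regex pattern = new Regex(
    "<textarea(?<textareaid>\\d+)( xmlns=\"\"|)><!\\[CDATA\\[(?<content>.*?)\\]\\]>\\s*</textarea\\d+>",
    RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.ExplicitCapture);

// Build the output in a StringBuilder instead of concatenating strings in the loop.
StringBuilder converted = new StringBuilder();
foreach (Match match in pattern.Matches(content))
{
    converted.AppendFormat(
        "<div id=\"textarea{0}\">{1}</div>",
        match.Groups["textareaid"].Value,
        match.Groups["content"].Value);
}

string convertedContent = converted.ToString();
Console.WriteLine(convertedContent);
// <div id="textarea1">blahhhhhhh</div><div id="textarea2">bada bing bada boom</div>
```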