XML parsers won't understand the ASP.NET directives and code blocks: <%@, <%=, etc.
You're probably best off using regular expressions for this, likely in three stages:
- Match every tag element in the entire page.
- For each tag, match the tag prefix and control type.
- For each tag that matches in stage 2, match any attributes.
So, starting from the top, we can use the following regex:
(?<tag><[^%/](?:.*?)>)
This matches any tag that does not start with <% or </, and does so lazily (a greedy expression would run on to the last > on the line and swallow several tags at once). The following would all be matched:
<asp:Content ID="ph_PageContent" ContentPlaceHolderID="ph_MainContent" runat="server">
<asp:Image runat="server" />
<img src="/test.png" />
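As a rough sketch of this first stage in isolation, the snippet below runs the pattern over a markup string and collects the raw tag text. The TagScanner class and FindTags method are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class TagScanner
{
    // Stage 1 pattern: any tag that does not start with <% or </.
    private static readonly Regex TagExpr = new Regex(
        "(?<tag><[^%/](?:.*?)>)",
        RegexOptions.CultureInvariant | RegexOptions.IgnoreCase);

    // Returns the raw text of every opening or self-closing tag in the markup.
    public static List<string> FindTags(string markup)
    {
        var tags = new List<string>();
        foreach (Match match in TagExpr.Matches(markup))
        {
            tags.Add(match.Groups["tag"].Value);
        }
        return tags;
    }
}
```

Because . does not match a newline by default, a tag that wraps across several lines would be skipped; adding RegexOptions.Singleline is one way to pick those up.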
For each of those captured tags, we then want to extract the tag prefix and the control type:
<(?<tag>[a-z][a-z0-9]*):(?<type>[a-z][a-z0-9]*)
Named capture groups make it easy to pull out the prefix and type. The character classes are lowercase only, so the expression needs to run with the IgnoreCase option. It only matches server tags, so standard HTML tags are dropped at this point. Taking the first tag from the example above:
<asp:Content ID="ph_PageContent" ContentPlaceHolderID="ph_MainContent" runat="server">
this yields:
{ tag = "asp", type = "Content" }
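A minimal sketch of the second stage on its own might look like the following. ServerTagParser and TryParse are hypothetical names, and the pattern uses [a-z0-9] so that a digit 0 in a prefix or type name is accepted.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class ServerTagParser
{
    // Stage 2 pattern: <prefix:Type at the start of a server tag.
    private static readonly Regex ServerTagExpr = new Regex(
        "<(?<tag>[a-z][a-z0-9]*):(?<type>[a-z][a-z0-9]*)",
        RegexOptions.CultureInvariant | RegexOptions.IgnoreCase);

    // Returns true and fills prefix/type when the text looks like <prefix:Type ...>.
    // Plain HTML tags such as <img ...> return false with empty strings.
    public static bool TryParse(string tagText, out string prefix, out string type)
    {
        Match match = ServerTagExpr.Match(tagText);
        prefix = match.Success ? match.Groups["tag"].Value : string.Empty;
        type = match.Success ? match.Groups["type"].Value : string.Empty;
        return match.Success;
    }
}
```

Returning a bool with out parameters mirrors the usual TryParse convention, which keeps the "not a server tag" case cheap to skip in a loop.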
With that same tag, we can then match any attributes:
(?<name>\S+)=["']?(?<value>(?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?
Which yields:
{ name = "ID", value = "ph_PageContent" },
{ name = "ContentPlaceHolderID", value = "ph_MainContent" },
{ name = "runat", value = "server" }
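The attribute stage can be sketched the same way. Here the name/value pairs are collected into a dictionary; AttributeParser and Parse are hypothetical names, and the case-insensitive key comparison is an assumption made for convenience rather than something the pattern requires.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class AttributeParser
{
    // Stage 3 pattern: name="value" pairs inside a single tag.
    private static readonly Regex AttributeExpr = new Regex(
        "(?<name>\\S+)=[\"']?(?<value>(?:.(?![\"']?\\s+(?:\\S+)=|[>\"']))+.)[\"']?",
        RegexOptions.CultureInvariant | RegexOptions.IgnoreCase);

    // Returns the attributes of one tag; a repeated name keeps its last value.
    public static Dictionary<string, string> Parse(string tagText)
    {
        var attributes = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (Match match in AttributeExpr.Matches(tagText))
        {
            attributes[match.Groups["name"].Value] = match.Groups["value"].Value;
        }
        return attributes;
    }
}
```

Note that the name group is \S+ followed by =, so a value that itself contains an equals sign (a query string, for instance) is likely to confuse it.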
Putting that all together, we can write a quick function that builds an XmlDocument for us:
using System;
using System.Text.RegularExpressions;
using System.Xml;

public XmlDocument CreateDocumentFromMarkup(string content)
{
    if (string.IsNullOrEmpty(content))
        throw new ArgumentException("'content' must have a value.", "content");

    RegexOptions options = RegexOptions.CultureInvariant | RegexOptions.Compiled | RegexOptions.IgnoreCase;
    // Stage 1: every tag that does not start with <% or </.
    Regex tagExpr = new Regex("(?<tag><[^%/](?:.*?)>)", options);
    // Stage 2: server tags of the form <prefix:Type.
    Regex serverTagExpr = new Regex("<(?<tag>[a-z][a-z0-9]*):(?<type>[a-z][a-z0-9]*)", options);
    // Stage 3: name="value" attribute pairs.
    Regex attributeExpr = new Regex("(?<name>\\S+)=[\"']?(?<value>(?:.(?![\"']?\\s+(?:\\S+)=|[>\"']))+.)[\"']?", options);

    XmlDocument document = new XmlDocument();
    XmlElement root = document.CreateElement("controls");
    // Attach the root element; without this the returned document is empty.
    document.AppendChild(root);

    // Creates <name>value</name>. The parameter is named 'doc' so it does not
    // clash with the local 'document', which older C# compilers reject.
    Func<XmlDocument, string, string, XmlElement> creator = (doc, name, value) => {
        XmlElement element = doc.CreateElement(name);
        element.InnerText = value;
        return element;
    };

    foreach (Match tagMatch in tagExpr.Matches(content)) {
        Match serverTagMatch = serverTagExpr.Match(tagMatch.Value);
        if (serverTagMatch.Success) {
            XmlElement controlElement = document.CreateElement("control");
            controlElement.AppendChild(
                creator(document, "tag", serverTagMatch.Groups["tag"].Value));
            controlElement.AppendChild(
                creator(document, "type", serverTagMatch.Groups["type"].Value));

            // Each attribute becomes a child element named after the attribute.
            XmlElement attributeElement = document.CreateElement("attributes");
            foreach (Match attributeMatch in attributeExpr.Matches(tagMatch.Value)) {
                attributeElement.AppendChild(
                    creator(document, attributeMatch.Groups["name"].Value, attributeMatch.Groups["value"].Value));
            }

            controlElement.AppendChild(attributeElement);
            root.AppendChild(controlElement);
        }
    }

    return document;
}
The resultant document could look like this:
<controls>
  <control>
    <tag>asp</tag>
    <type>Content</type>
    <attributes>
      <ID>ph_PageContent</ID>
      <ContentPlaceHolderID>ph_MainContent</ContentPlaceHolderID>
      <runat>server</runat>
    </attributes>
  </control>
</controls>
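To give an idea of how a document of that shape could be consumed, here is a small sketch that lists each control as prefix:Type using XPath. ControlReader and ListControls are hypothetical names, and the code assumes the controls/control/tag/type layout shown above.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper, for illustration only.
public static class ControlReader
{
    // Returns "prefix:Type" for every <control> element under <controls>.
    public static List<string> ListControls(XmlDocument document)
    {
        var names = new List<string>();
        XmlNodeList controls = document.SelectNodes("/controls/control");
        if (controls == null)
            return names;

        foreach (XmlNode control in controls)
        {
            // A missing <tag> or <type> child is treated as an empty string.
            string prefix = control.SelectSingleNode("tag")?.InnerText ?? string.Empty;
            string type = control.SelectSingleNode("type")?.InnerText ?? string.Empty;
            names.Add(prefix + ":" + type);
        }
        return names;
    }
}
```

The same XPath approach extends to the attributes element, for example selecting "attributes/*" under a control to walk its name/value pairs.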
Hope that helps!