I am trying to work out the overhead of the ASP.NET auto-naming of server controls. I have a page which contains 7,000 lines of HTML rendered from hundreds of nested ASP.NET controls, many of which have id / name attributes that are hundreds of characters in length.

What I would ideally like is something that extracts every HTML attribute value that begins with "ctl00" into a list. The regex Find function in Notepad++ would be perfect, if only I knew what the regex should be.

As an example, if the HTML is:
<input name="ctl00$Header$Search$Keywords" type="text" maxlength="50" class="search" />

I would like the output to be something like:
name="ctl00$Header$Search$Keywords"

A more advanced search might include the element name as well (e.g. control type):
input|name="ctl00$Header$Search$Keywords"

To cope with both id and name attributes I will simply rerun the search looking for id instead of name (i.e. I don't need something that searches for both at the same time).

The final output will be an Excel report that lists the number of server controls on the page and the length of each control's name, possibly sorted by control type.

+1  A: 

Quick and dirty:

Search for

\w+\s*=\s*"ctl00[^"]*"

This will match any text that looks like an attribute, e.g. name="ctl00test" or attr = "ctl00longer text". It will not check whether the match really occurs within an HTML tag; that's a little more difficult to do and perhaps unnecessary. It will also not handle escaped quotes within the attribute's value. As usual with regexes, the complexity required depends on what exactly you want to match and what your input looks like...
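If Notepad++ turns out to be awkward for collecting the matches, the same pattern can be run from a script. This is only a rough sketch in Python; the sample markup is the line from the question, and a real run would read the saved page source instead:

```python
import re

# The pattern from above: attribute name, optional whitespace around "=",
# then a double-quoted value that starts with ctl00.
ATTR_PATTERN = re.compile(r'\w+\s*=\s*"ctl00[^"]*"')

# Sample markup copied from the question.
html = '<input name="ctl00$Header$Search$Keywords" type="text" maxlength="50" class="search" />'

# findall returns every non-overlapping match as a string.
matches = ATTR_PATTERN.findall(html)

# Print the length of each matched attribute next to the attribute itself.
for match in matches:
    print(len(match), match)
```

The printed length covers the whole match (attribute name, equals sign and quotes), not just the value.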

Tim Pietzcker
A: 

"7000"? "Hundreds"? Dear god.

Since you're just looking at source in a text editor, try this... /(id|name)="ct[^"]*"/
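For what it's worth, here is a rough Python sketch of the same pattern with a second capture group added so the attribute name and its value come back separately. The id value in the sample is made up for illustration; only the name attribute comes from the question:

```python
import re
from collections import Counter

# Same pattern as above, plus a capture group around the value.
ID_OR_NAME = re.compile(r'(id|name)="(ct[^"]*)"')

# The name attribute is from the question; the id attribute is a hypothetical example.
html = (
    '<input name="ctl00$Header$Search$Keywords" id="ctl00_Header_Search_Keywords" '
    'type="text" maxlength="50" class="search" />'
)

# With two groups, findall returns a list of (attribute, value) tuples.
pairs = ID_OR_NAME.findall(html)

# Count how many id and how many name attributes were found.
counts = Counter(attr for attr, _ in pairs)

for attr, value in pairs:
    print(attr, len(value), value)
print(dict(counts))
```

This handles id and name in one pass, so the search does not need to be rerun.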

annakata
My thoughts precisely. :-S
Hugo Rodger-Brown
Ditto. aaaaaaaaaaaaaand space for the validation.
Robert C. Barth
downvote?! *sigh*
annakata
I'll help ya out... upvote.
Robert C. Barth
you sir are my hero :)
annakata
A: 

I suggest xpath, as in this question

Anonymous
XPath? On an HTML page? Since he stated it has 7000 lines and hundreds of controls, what do you think the odds are that the page is XHTML compliant? About zero?
Robert C. Barth
You can use XPath on HTML too; you can configure the parser not to validate the document strictly.
Anonymous
A: 

Answering my own question, the easiest way to do this is to use BeautifulSoup, the 'dirty HTML' Python parser whose tagline is:

"You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like. Neither does this parser."

It works, and it's available from here - http://crummy.com/software/BeautifulSoup
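As a rough illustration of the parser approach, here is a sketch that uses Python's standard-library html.parser, swapped in for BeautifulSoup purely so the example has no dependencies. The class name ControlCollector and the tab-separated output are my own choices for illustration, not part of either library:

```python
from html.parser import HTMLParser


class ControlCollector(HTMLParser):
    """Collects (tag, attribute, value) rows for id/name values starting with a prefix."""

    def __init__(self, prefix="ctl00"):
        super().__init__()
        self.prefix = prefix
        self.rows = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        for attr, value in attrs:
            if attr in ("id", "name") and value and value.startswith(self.prefix):
                self.rows.append((tag, attr, value))


# Sample markup copied from the question.
html = '<input name="ctl00$Header$Search$Keywords" type="text" maxlength="50" class="search" />'

collector = ControlCollector()
collector.feed(html)

# Sort by element type, then longest value first.
report = sorted(collector.rows, key=lambda row: (row[0], -len(row[2])))

# Tab-separated rows paste straight into Excel.
for tag, attr, value in report:
    print(f"{tag}\t{attr}\t{len(value)}\t{value}")
```

Because the parser reports the element name along with each attribute, the "sorted by control type" part of the report comes for free.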

Hugo Rodger-Brown