Hi,

Given the following HTML:

<div id="f52_lblQuestionWording" title="" style="width:auto;height:auto; display: inline;  overflow: hidden;" >Home telephone</div>

I want to automatically get the ID of the containing div element using the "Home telephone" string. Does anyone know how I can do this with a regular expression?

The string used to find the ID isn't always the same, and the HTML is dynamically generated, so it may differ slightly from time to time. I'm automating UI testing on a company project using Selenium.

Thanks.

A: 

I'm not sure what you mean by using the "Home telephone" string, but here are a couple of ways to do this:

/id=(.*?)\s+.*(?=Home telephone)/

where the (?=...) construct is a positive lookahead, if your programming language supports it.
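
A minimal JavaScript sketch of that lookahead pattern, run against the markup from the question. As written above, the group would also capture the quotes around the id value, so this version moves the quotes outside the group. The variable names are only for illustration, and the markup is assumed to sit on a single line.

```javascript
// Sample markup from the question (assumed to be on one line).
const sampleHtml = '<div id="f52_lblQuestionWording" title="" style="width:auto;height:auto; display: inline;  overflow: hidden;" >Home telephone</div>';

// Quotes sit outside the capture group so only the id value is captured.
const lookaheadPattern = /id="(.*?)"\s+.*(?=Home telephone)/;

const lookaheadMatch = sampleHtml.match(lookaheadPattern);
const divId = lookaheadMatch ? lookaheadMatch[1] : null;
console.log(divId); // f52_lblQuestionWording
```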

Another way is to simply grep for Home telephone and then grab the id value using awk or sed.

ennuikiller
A: 

XPath is the easiest way to retrieve values from XML and HTML documents (provided that they are well-formed).

The expression you want is this:

//div[text() = 'Home telephone']/@id

Which reads, "Find all divs whose text value is equal to 'Home telephone', and return the id attribute for everything that matches."

Depending on your language, there are typically several built-in or free third-party XPath interpreters available.
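
In a browser, for example, the DOM's document.evaluate can run that expression. The following is only a sketch: the helper name findIdByText is hypothetical, and it assumes the search text contains no single quote, since the text is placed inside a quoted XPath literal.

```javascript
// Hypothetical helper: returns the id of the first div whose text equals `text`,
// or an empty string when no div matches.
// Assumes `text` contains no single quote (it is wrapped in '...').
function findIdByText(doc, text) {
  const expr = "//div[text() = '" + text + "']/@id";
  // STRING_TYPE asks for the result converted to a single string.
  const result = doc.evaluate(expr, doc, null, XPathResult.STRING_TYPE, null);
  return result.stringValue;
}

// In a page: findIdByText(document, "Home telephone")
```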

It's a bad idea to parse HTML using regular expressions because HTML isn't a regular language. Regular expressions can't deal with even the simplest of HTML edge cases because regular expressions can't properly deal with nesting. HTML is an inherently nested structure.
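
As a small illustration of how quickly a pattern goes wrong, here is a sketch using made-up markup (not the asker's real page): a lookahead-style regex run against two sibling divs on one line. This isn't nesting as such, but it is the same underlying problem: the pattern has no notion of where one element ends, so it reports the first div's id even though the text belongs to the second.

```javascript
// Made-up markup: two sibling divs on one line; only the second holds the text.
const twoDivs =
  '<div id="work" title="">Work telephone</div>' +
  '<div id="home" title="">Home telephone</div>';

const naivePattern = /id="(.*?)"\s+.*(?=Home telephone)/;

const naiveMatch = twoDivs.match(naivePattern);
const reportedId = naiveMatch ? naiveMatch[1] : null;
console.log(reportedId); // work (the wrong element)
```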

Welbog
Thanks for the response. I am using JavaScript to write an extension for use within Selenium, and this seems to be the best way to do what I am looking for.
A: 

In C#, you'd set up a regex that looks like this:

using System.Text.RegularExpressions;

// \s stands in for the space because IgnorePatternWhitespace drops literal spaces
string elementText = "Home\\stelephone"; // you can change this as needed
Regex regex = new Regex(
    "id=\"(.*?)\"\\s+.*(?=" + elementText + ")",
    RegexOptions.IgnoreCase
    | RegexOptions.CultureInvariant
    | RegexOptions.IgnorePatternWhitespace
    | RegexOptions.Compiled
);

// Capture all Matches in the InputText
MatchCollection ms = regex.Matches(InputText);

InputText would be the contents of your HTML file, read into a string.
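
For comparison, a rough JavaScript counterpart might look like the sketch below. The helper names escapeRegExp and findIdsByText are hypothetical. Unlike the C# above, the search text is treated as literal text and escaped, rather than written as a regex fragment.

```javascript
// Hypothetical helper: escape regex metacharacters so the text is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: collect every captured id value from the input string.
function findIdsByText(inputText, elementText) {
  const regex = new RegExp(
    'id="(.*?)"\\s+.*(?=' + escapeRegExp(elementText) + ")",
    "gi" // g: find all matches, i: ignore case
  );
  return Array.from(inputText.matchAll(regex), (m) => m[1]);
}
```

Because . does not match a newline in JavaScript by default, each match stays on one line of the input, so this only works when the id and the text share a line.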

ddc0660