Hello,

As I am not very familiar with regex, is it possible (whether it's hard to do or not) to extract certain text in between symbols? For example:

<meta name="description" content="THIS IS THE TEXT I WANT TO EXTRACT" />

Thank you :)

+2  A: 

Sure, you can identify the start and the end of your desired substring with string methods such as IndexOf, then extract it with Substring. In your example, you want to locate (with IndexOf) the content=" and then the first following ", right? Once you have those indices into the string, Substring will work fine. (Not posting C# code because I'm not entirely sure of what exactly it IS that you want, beyond IndexOf and Substring...!-)

If so, then:

int first = str.IndexOf("content=\"");
int last = str.IndexOf("\"", first + 9);
return str.Substring(first + 9, last - first - 9);

should more or less do what you want (apologies in advance if there's an off-by-one or so in those hardcoded 9s -- they're meant to stand for the length of the first substring you're looking for; adjust them a little bit up or down until you get exactly the result you want!-), but this is the general concept. Note that IndexOf returns -1 when the text isn't found, so check for that before slicing. Locate the start with single-argument IndexOf, locate the end with two-argument IndexOf, slice off the desired piece with Substring...!
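
A minimal sketch of the same IndexOf/Substring idea, wrapped in a helper so the offset comes from the marker's length instead of a hardcoded number. The names TextExtractor and ExtractBetween are made up for illustration, and returning null on a miss is just one assumption about what you'd want:

```csharp
using System;

public static class TextExtractor
{
    // Hypothetical helper: returns the text between startMarker and the next
    // endMarker that follows it, or null if either marker is missing.
    public static string ExtractBetween(string source, string startMarker, string endMarker)
    {
        int first = source.IndexOf(startMarker, StringComparison.Ordinal);
        if (first == -1) return null; // start marker not found

        // Skip past the marker itself so it is not part of the result.
        int start = first + startMarker.Length;
        int last = source.IndexOf(endMarker, start, StringComparison.Ordinal);
        if (last == -1) return null; // end marker not found

        return source.Substring(start, last - start);
    }
}
```

For the meta tag in the question, the call would be TextExtractor.ExtractBetween(html, "content=\"", "\"").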

Alex Martelli
That's right, what I'm after is the text in between both quotes inside the content attribute, like this: content="i need this text"
baeltazor
Thanks for the code Alex, but it's nowhere near close; it always extracts the first 15 or so chars from the beginning of the file. Weird?
baeltazor
What do you see when you add output statements to show the value of first and last?
Alex Martelli
A: 

Sure, you can do it without Regex. Say you want to get the text between < and >...

string GetTextBetween(string content)
{
  int start = content.IndexOf("<");
  if (start == -1) return null; // Opening "<" not found.
  int end = content.IndexOf(">", start + 1); // Only look after the "<".
  if (end == -1) return null;  // Closing ">" not found.
  return content.Substring(start + 1, end - start - 1); // Exclude both delimiters.
}
RichAmberale
+1  A: 

If the input is: text1/text2/text3

the regex below captures the third field (text3) in group 2:

^([^/]*/){2}([^/]*)$


If you always need the last field, use the one below instead (it is captured in group 1):

^.*/([^/]*)$
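
For reference, a rough sketch of running patterns like these from C# with System.Text.RegularExpressions. The patterns here are written without a trailing slash so that they match the sample input text1/text2/text3; the class and method names are made up for illustration:

```csharp
using System.Text.RegularExpressions;

public static class SlashFields
{
    // Hypothetical helper: returns the third slash-separated field,
    // or null if the input does not have exactly three fields.
    public static string ThirdField(string input)
    {
        // Group 1 repeats over the first two "field/" chunks; group 2 is the third field.
        Match m = Regex.Match(input, "^([^/]*/){2}([^/]*)$");
        return m.Success ? m.Groups[2].Value : null;
    }

    // Hypothetical helper: returns whatever follows the last slash,
    // or null if the input contains no slash at all.
    public static string LastField(string input)
    {
        // The greedy ".*/" consumes everything up to the final slash.
        Match m = Regex.Match(input, "^.*/([^/]*)$");
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

The group index passed to m.Groups is what selects the captured piece; index 0 would be the whole match.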
solairaja
I think OP is looking for a non-regex solution.
Goose Bumper
+4  A: 

Since you give an xml example, just use an xml parser:

string s = (string) XElement.Parse(xml).Attribute("content");

xml is not a simple text format, and Regex isn't really a very good fit; using an appropriate tool will protect you from a range of evils... for example, the following is identical as far as xml is concerned:

<meta
    name="description"
    content=
        'THIS IS THE TEXT I WANT TO EXTRACT'
/>

It also means that when the requirement changes, you have a simple tweak to make to the code, rather than trying to unpick a regex and put it back together again (which can be tricky if you are accessing a non-trivial node). Equally, xpath might be an option; so in your data the xpath:

/meta/@content

is all you need.
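
A small sketch of evaluating that xpath from C#, assuming the markup is well-formed xml with meta as the root element; MetaReader and ReadContent are hypothetical names used only for this example:

```csharp
using System.Xml;

public static class MetaReader
{
    // Hypothetical helper: returns the content attribute of a root <meta>
    // element, or null when the attribute is absent.
    public static string ReadContent(string xml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml); // throws XmlException if the input is not well-formed
        XmlNode attr = doc.SelectSingleNode("/meta/@content");
        return attr == null ? null : attr.Value;
    }
}
```

Because the parser handles quoting and whitespace, the single-line and the multi-line forms of the element give the same result.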

If you haven't got .NET 3.5:

XmlDocument doc = new XmlDocument();
doc.LoadXml(xml);
string s = doc.DocumentElement.GetAttribute("content");
Marc Gravell
This is really nice. Thanks for that one! =)
Carl Bergquist