Hi, I'm trying to develop a C# program to scrape the URLs of Flash movies from a website. This is the markup I'm trying to parse:

flashvars="file=http://cache01-videos02.myspacecdn.com/24/vid_878ccd5444874681845df39eb3f00628.flv"/>

The closest I got using a regex was this expression:

file=http://[^/]+/(.*)flv

However, the match includes the file= portion. How do I filter out the file= part?

A: 

Change the regex to the following and use the Groups property:

// Requires: using System.Text.RegularExpressions;
public void ScrapeURLs(String input) {
  // The parentheses capture the URL without the leading "file=";
  // \. matches a literal dot before "flv"
  Regex regex = new Regex(@"file=(http://[^/]+/.*\.flv)");

  foreach (Match m in regex.Matches(input)) {
     // Groups[0] is the complete match; Groups[1] is the first
     // captured group, which here is the URL
     String url = m.Groups[1].Value;

     // Do something with the URL...
  }
}

In .NET regular expression syntax, parentheses () create capturing groups, and each parenthesized expression in the pattern is accessible through the Groups property. The entire match is always treated as a group and has index 0 in the Groups collection; the capturing groups themselves are numbered from 1, left to right.

Edit

One thing to note with this pattern: if the input contains multiple Flash URLs on the same line, the greedy .* quantifier will produce a single odd match that spans all the text from the start of the first URL to the end of the last URL.
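
One way to avoid the greedy match is sketched below. It is not part of the pattern above: it assumes each URL sits inside a double-quoted attribute value and that ".flv" appears only at the end of the URL. The class name FlashUrlScraper and the method ExtractUrls are made up for illustration.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, named here only for illustration.
public static class FlashUrlScraper
{
    // [^/"]+ keeps the host inside one attribute value,
    // [^"]*? is lazy and cannot cross a closing quote,
    // \.flv matches a literal ".flv".
    // Assumption: ".flv" occurs only at the end of each URL.
    private static readonly Regex FlashUrl =
        new Regex(@"file=(http://[^/""]+/[^""]*?\.flv)");

    public static List<string> ExtractUrls(string input)
    {
        var urls = new List<string>();
        foreach (Match m in FlashUrl.Matches(input))
        {
            // Groups[0] is the whole match; Groups[1] is the URL without "file=".
            urls.Add(m.Groups[1].Value);
        }
        return urls;
    }
}
```

Calling FlashUrlScraper.ExtractUrls on a line containing two flashvars attributes should then return the two URLs separately rather than one match running from the first to the last.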

RobV
A: 

I think you need this:

var url = @"flashvars=""file=http://cache01-videos02.myspacecdn.com/24/vid_878ccd5444874681845df39eb3f00628.flv""";
var match = Regex.Match(url, @"file=(?<flashurl>http://[^/]+/(.*)flv)");
var scrapedurl = match.Groups["flashurl"].Value;

The (?<flashurl>...) part captures the text matched between the parentheses and gives that group the name "flashurl".

Dabblernl
Yes, that is the code. Thank you so much, dude!