+1  A: 

Matching arbitrary URLs with regular expressions is extremely difficult, if not impossible, unless you have extra constraints on what the URLs in your documents contain. In that case you can trade some flexibility in your regex for practicality.

Wahnfrieden
All I want is the last bit of the URL (i.e. the filename). The filename can only be a jpg etc., and there will be an img src="http:// at the beginning.
IainMH
It's a little easier when the URL is contained within an href of an HTML file. Then you can know where the URL starts and stops.
Steve Wortham
Then you should use an HTML scraping library to get at the image tags' src attributes before using a regex on the URIs. Try to avoid using regexes to parse the HTML itself.
Wahnfrieden
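
As a rough illustration of the second step in the comment above (a regex applied to a URI that has already been pulled out of the markup), here is a minimal sketch. It assumes the src value is already in hand, however it was extracted; the class name, the method name and the jpg/gif/png extension list are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class UriFileName
{
    // Hypothetical helper: returns the trailing "name.ext" of a URI, or null if
    // the URI does not end in one of the assumed image extensions.
    public static string FromUri(string uri)
    {
        // [^/]+ cannot cross a slash, and $ pins the match to the end of the URI.
        Match m = Regex.Match(uri, @"[^/]+\.(?:jpg|gif|png)$");
        return m.Success ? m.Value : null;
    }

    static void Main()
    {
        // Prints: picture.jpg
        Console.WriteLine(FromUri("http://www.example.com/any/number/of/directories/picture.jpg"));
    }
}
```

Because the input is a single URI rather than a whole page, the pattern does not need to know anything about tags or quotes.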
+3  A: 

Since I already have this handy, this should grab the URL itself:

(?<=src=")[^"]+(?=")

I verified this in Regex Hero. The regular expression uses a positive lookbehind and a positive lookahead to grab the URL inside of src="", without including the src=" prefix or the closing quote in the match itself.
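
For anyone who wants to try the pattern outside Regex Hero, a small sketch of how it could be exercised from C# follows. The class and method names are invented for the example, and it only looks at the first src attribute in the input.

```csharp
using System;
using System.Text.RegularExpressions;

static class LookaroundDemo
{
    // Hypothetical helper: returns the value of the first src="..." attribute, or null.
    public static string ExtractSrc(string html)
    {
        // (?<=src=") and (?=") only assert their surroundings, so the match
        // value is just the text between the two quotes.
        Match m = Regex.Match(html, "(?<=src=\")[^\"]+(?=\")");
        return m.Success ? m.Value : null;
    }

    static void Main()
    {
        string html = "<img src=\"http://www.example.com/any/number/of/directories/picture.jpg\" />";
        // Prints: http://www.example.com/any/number/of/directories/picture.jpg
        Console.WriteLine(ExtractSrc(html));
    }
}
```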

I'll see if I can come up with something more specific to your task...

OK, this should work:

(?<=src=")[^"]+(/[^/]+(\.jpg|\.gif))(?=")

And then you can use a replacement value of:

/LocalDirectory/images$1

Or here's the complete C# code:

string strRegex = "(?<=src=\")[^\"]+(/[^/]+(\\.jpg|\\.gif))(?=\")";
RegexOptions myRegexOptions = RegexOptions.None;
Regex myRegex = new Regex(strRegex, myRegexOptions);
string strTargetString = "<img src=\"http://www.example.com/any/number/of/directories/picture.jpg\" />" + "\r\n" + "<img src=\"http://www.example.com/any/number/of/directories/picture.gif\" />";
string strReplace = "/LocalDirectory/images$1";

return myRegex.Replace(strTargetString, strReplace);
Steve Wortham
Note that ' is valid for wrapping attribute values (instead of ") in HTML 4.01, so this won't work for all pages.
Wahnfrieden
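
One way to cover the single-quote case raised in the comment above is to capture whichever quote opens the attribute and require the same one to close it. This is only a sketch: the class name is invented, the /LocalDirectory/images path and jpg/gif extensions are carried over from the answer, and it still assumes the URL itself contains no quote characters.

```csharp
using System;
using System.Text.RegularExpressions;

static class QuoteAwareRewriter
{
    // (["']) captures the opening quote; \1 demands the same character at the end.
    // Group 2 is the file name (last path segment ending in .jpg or .gif).
    const string Pattern = "src=([\"'])[^\"']*/([^/\"']+\\.(?:jpg|gif))\\1";

    public static string Rewrite(string html)
    {
        // $1 re-emits the original quote style on both sides of the new path.
        return Regex.Replace(html, Pattern, "src=$1/LocalDirectory/images/$2$1");
    }

    static void Main()
    {
        // Prints: <img src='/LocalDirectory/images/picture.jpg' />
        Console.WriteLine(Rewrite("<img src='http://www.example.com/a/b/picture.jpg' />"));
        // Prints: <img src="/LocalDirectory/images/picture.gif" />
        Console.WriteLine(Rewrite("<img src=\"http://www.example.com/a/b/picture.gif\" />"));
    }
}
```

Unquoted attribute values are still not handled, which is one more reason an HTML parser is the safer route for arbitrary pages.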
+1  A: 
string strTargetString = "img tags to check";
// [^\"]* instead of .* keeps each match inside a single src="..." value
string strRegex = "src=\"([^\"]*)/([^\"/]*)\\.(jpg|png|gif)\"";
RegexOptions myRegexOptions = RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace;
Regex myRegex = new Regex(strRegex, myRegexOptions);

string strReplace = "src=\"/LocalDirectory/images/$2.$3\"";

return myRegex.Replace(strTargetString, strReplace);

Misread the question. This will now replace the first part of the path for jpg, png and gif files and keep the filename. Anything else is ignored.

Xetius
Thanks Xetius - I'm just sorry I can't mark your answer as accepted too!
IainMH
No worries. His looks neater than mine
Xetius
+1  A: 

Hope this helps:

var replace = "/localserver/some/directory/";
var strs = new List<string>
{
    "<img src=\"http://www.example.com/any/number/of/directories/picture.jpg\"",
    "<img src=\"http://www.example.com/any/number/of/directories/picture.gif\"" 
};

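// The lookbehind anchors the match right after <img src=" and [^\"]*/ consumes the URL up to its last slash
Regex r = new Regex("(?<=<img src=\")[^\"]*/");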

foreach (var s in strs)
{
    Console.WriteLine("Replaced: {0}",r.Replace(s,replace));
}

outputs:

Replaced: <img src="/localserver/some/directory/picture.jpg"
Replaced: <img src="/localserver/some/directory/picture.gif"
TheVillageIdiot
+2  A: 

Try this out...

        var test1 = "<img src=\"http://www.something.com/any/number/of/pic.jpg\">";
        var test2 = "<img src=\"http://www.something.com/any/number/of/pic.doc\">";
        var test3 = "<a href=\"http://www.something.com/any/number/of/pic.jpg\">";
        var test4 = "<a href=\"http://www.something.com/any/number/of/pic.doc\">";

        var reg = "<(?:a|img)\\s+(?:src|href)=\"(?<replace>http://www.+?/).+?(?:\\.jpg|\\.jpeg|\\.gif|\\.png)\".*?>";

        var file = test1 + "\r\n" + test2 + "\r\n" + test3 + "\r\n" + test4;

        var results = Regex.Matches(file, reg);

        for (int i = results.Count - 1; i >= 0; i--)
        {
            var match = results[i];
            var group = match.Groups["replace"];
            file = file.Remove(group.Index, group.Length);
            file = file.Insert(group.Index, "/LocalDirectory/");
        }

        Console.WriteLine(file);

        Console.ReadKey();

So the regex string I am using here is:

<(?:a|img)\s+(?:src|href)=\"(?<replace>http://www.+?/).+?(?:\.jpg|\.jpeg|\.gif|\.png)\".*?>

This will match only anchor links and img tags and only jpg, jpeg, gif, and png files

Part by part here is how this works:

< - matches the opening angle bracket of the tag

(?:a|img) - specifies that only an anchor or img tag should be looked at

\s+ - requires one or more whitespace characters

(?:src|href) - matches only a src or href attribute

=\" - immediately followed by an equals sign and a quotation mark

(?<replace>http://www.+?/) - here we grab what we need to replace, in a group named "replace": it must start with "http://www" and it captures everything up to the next slash (/)

.+? - lazily matches everything else until the file extension is found

(?:\.jpg|\.jpeg|\.gif|\.png) - must have one of these extensions

\".*?> - the pattern ends in a closing angle bracket and allows for whatever other attributes etc. come in between

Then all I am doing is going through each match - grabbing the group named "replace" and removing/inserting from the file at that group's index

Make sure you do this in reverse order so your replaces are not throwing off your group indexes

I "think" that should do it. Please let me know if I made any oversights.
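
If juggling indexes in reverse order feels fragile, a possible alternative is to let Regex.Replace walk the matches and rebuild each one with a MatchEvaluator. The sketch below reuses the same pattern and the "replace" group from above; the class name, method name and localPrefix parameter are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class PrefixRewriter
{
    const string Pattern =
        "<(?:a|img)\\s+(?:src|href)=\"(?<replace>http://www.+?/).+?(?:\\.jpg|\\.jpeg|\\.gif|\\.png)\".*?>";

    // Hypothetical helper: swaps the "replace" group of every match for localPrefix.
    public static string Rewrite(string html, string localPrefix)
    {
        return Regex.Replace(html, Pattern, m =>
        {
            Group g = m.Groups["replace"];
            int offset = g.Index - m.Index; // position of the group inside this match
            return m.Value.Remove(offset, g.Length).Insert(offset, localPrefix);
        });
    }

    static void Main()
    {
        string file = "<img src=\"http://www.something.com/any/number/of/pic.jpg\">\r\n"
                    + "<a href=\"http://www.something.com/any/number/of/pic.doc\">";
        // The img line becomes <img src="/LocalDirectory/any/number/of/pic.jpg">;
        // the .doc link is left untouched.
        Console.WriteLine(Rewrite(file, "/LocalDirectory/"));
    }
}
```

Each match is rewritten in isolation, so offsets into the original string never go stale and the order of processing stops mattering.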

DataDink