I searched for this first and found how to use something like String.Split() to extract a string based on a starting condition. However, I wasn't able to find how to extract it based on an ending condition as well. For example, I have a file with links to images: http://i594.photobucket.com/albums/tt27/34/444.jpghttp://i594.photobucket.com/albums/as/asfd/ghjk6.jpg You will notice that every image link starts with http:// and ends with .jpg. However, each .jpg is immediately followed by the next http:// with no space in between, which makes this a little more difficult.

So basically I'm trying to find a way (Regex?) to extract every substring that starts with http:// and ends with .jpg.

A: 

Regex would work really well for this. Here's an example in C# (and Java) for Regex

Joel
+1  A: 

In your specific case, you could always split it by ".jpg". You will probably end up with one empty element at the end of the array, and you will have to append the .jpg to the end of each entry if you need that, since Split drops the delimiter. Apart from that I think it would work.

Tested the following code and it worked fine:

public void SplitTest()
{
    string test = "http://i594.photobucket.com/albums/tt27/34/444.jpghttp://i594.photobucket.com/albums/as/asfd/ghjk6.jpg";
    string[] items = test.Split(new string[] { ".jpg" }, StringSplitOptions.RemoveEmptyEntries);
}

It even gets rid of the empty entry...

Wagner Silveira
This works fine for the given example. However, it won't properly enforce the starting "http" requirement. For example add "foobar.jpg" somewhere in the input and "foobar" is in the `items` result. That's easily solvable by adding a `.Where(s => s.StartsWith("http"))` after the `Split`.
Ahmad Mageed
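
To make the suggestion in the comment above concrete, here is a minimal sketch that combines the split with the StartsWith filter and re-appends the extension that Split removes. The class and method names (SplitExample, ExtractJpgLinks) and the "foobar.jpg" fragment in the sample input are made up for illustration; they are not from the original answer.

```csharp
using System;
using System.Linq;

public static class SplitExample
{
    // Splits on ".jpg", keeps only the pieces that start with "http://",
    // and puts the ".jpg" extension back on each kept piece.
    public static string[] ExtractJpgLinks(string input)
    {
        return input
            .Split(new[] { ".jpg" }, StringSplitOptions.RemoveEmptyEntries)
            .Where(s => s.StartsWith("http://", StringComparison.Ordinal))
            .Select(s => s + ".jpg")
            .ToArray();
    }

    public static void Main()
    {
        // "foobar.jpg" is a hypothetical stray entry, as in the comment above.
        string test = "http://i594.photobucket.com/albums/tt27/34/444.jpg"
                    + "foobar.jpg"
                    + "http://i594.photobucket.com/albums/as/asfd/ghjk6.jpg";

        foreach (string link in ExtractJpgLinks(test))
        {
            Console.WriteLine(link);
        }
    }
}
```

This is still brittle: any trailing text that starts with http:// but does not actually end in .jpg would also get ".jpg" appended, so it only suits input that is known to be a plain list of image links.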
+2  A: 
Regex RegexObj = new Regex("http://.+?\\.jpg");
Match MatchResults = RegexObj.Match(subject);
while (MatchResults.Success) {
    // Do something with it
    MatchResults = MatchResults.NextMatch();
}
Martin Smith
You definitely need the + as Dan has, otherwise you'll only match a filename of zero or one character.
Ben Voigt
Thanks for the response. It didn't work though. I think it's because you forgot the + sign as found in Dan's response (as well as two \).
DMan
@Ben - Good catch. @DMan - The doubled backslash is needed to escape the \ in regular C# strings. You can avoid the need to do it by writing the pattern as a verbatim string literal, prefixed with the @ symbol.
Martin Smith
Sorry, I assumed that / needed to be escaped too, so I thought you escaped only the .jpg part and not the front.
DMan
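
As a sketch of the verbatim-string point from the comments above: with an @-prefixed literal the backslash is passed to the regex engine unchanged, so the pattern reads the same as it would in a regex tester. The class and method names here (LazyRegexExample, ExtractJpgLinks) are illustrative only.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LazyRegexExample
{
    // Verbatim literal: equivalent to the regular literal "http://.+?\\.jpg".
    private static readonly Regex JpgLink = new Regex(@"http://.+?\.jpg");

    // Collects every lazy match, walking the input with NextMatch().
    public static List<string> ExtractJpgLinks(string subject)
    {
        var links = new List<string>();
        Match match = JpgLink.Match(subject);
        while (match.Success)
        {
            links.Add(match.Value);
            match = match.NextMatch();
        }
        return links;
    }

    public static void Main()
    {
        string subject = "http://i594.photobucket.com/albums/tt27/34/444.jpg"
                       + "http://i594.photobucket.com/albums/as/asfd/ghjk6.jpg";

        foreach (string link in ExtractJpgLinks(subject))
        {
            Console.WriteLine(link);
        }
    }
}
```

Because .+? is lazy, each match stops at the first .jpg it reaches, which is what separates the two links even though nothing sits between them.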
+4  A: 

Regex is the easiest way to do this. If you're not familiar with regular expressions, you might check out Regex Buddy. It's a relatively cheap little tool that I found extremely useful when I was learning. For your particular case, a possible expression is:

(http://.+?\.jpg)

It probably requires some more refinement, as there are boundary cases that could trip this up, but it would work if the file is a simple list.


You can also do free quick testing of expressions here.


Per your latest comment, if you have links to other non-images as well, then you need to make sure it doesn't start at the http:// for one link and read all the way to the .jpg for the next image. Since URLs are not allowed to have whitespace, you can do it like this:

(http://[^\s]+\.jpg)

This basically says, "match a string starting with http:// and ending with .jpg where there is at least one character between the two and none of those characters are whitespace".
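
A small sketch comparing the two expressions may help here. It assumes a file where links are separated by whitespace; the example.com URLs and the names NoWhitespaceRegexExample and Extract are invented for the illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class NoWhitespaceRegexExample
{
    // Returns the text of every match of the given pattern in the subject.
    public static string[] Extract(string pattern, string subject)
    {
        return Regex.Matches(subject, pattern)
            .Cast<Match>()
            .Select(m => m.Value)
            .ToArray();
    }

    public static void Main()
    {
        // Hypothetical input: one non-image link followed by two image links.
        string mixed = "http://example.com/page.html http://example.com/a.jpg\n"
                     + "http://example.com/b.jpg";

        // The lazy pattern starts at the .html link and runs on to the first .jpg.
        foreach (string s in Extract(@"http://.+?\.jpg", mixed))
        {
            Console.WriteLine("lazy:          " + s);
        }

        // Excluding whitespace keeps each match inside a single link.
        foreach (string s in Extract(@"http://[^\s]+\.jpg", mixed))
        {
            Console.WriteLine("no whitespace: " + s);
        }
    }
}
```

One trade-off to be aware of: [^\s]+ is greedy, so on input where links run together with no whitespace at all (as in the original question's example) it would match several links as one string. Which expression fits depends on how the file is actually laid out.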

Dan Bryant
+1 for Regex Buddy - That's where I generated the C# in my post from!
Martin Smith
I agree that in general the x{something goes here}y kind of issue is quite easy to solve with regex, and that is probably the preferred approach. But in his specific case, a simple split on .jpg is still the simplest way to solve it.
Wagner Silveira
@Wagner, it depends on the actual format of the file. If it's really a simple list with no delimiters or other text, then I agree, splitting on .jpg is simpler, though also more brittle. I would prefer Regex even for the simpler case for its flexibility if the requirements change.
Dan Bryant
I agree. Sorry I forgot to mention, but there are a lot of non-images in there too which may break it. Anyway, I marked this as the answer, and I'm looking into your recommendation of Regex Buddy!
DMan
As per your latest edit, you are a life saver! I just ran into the exact problem of it reading from http:// to the next .jpg! I was JUST going to parse it twice (first created a regex that started with "img source", then use your original regex on that afterwards) but now I don't have to :D RegexBuddy is useful. And you too.
DMan
+1  A: 

The following LINQ will separate by http: and make sure to only get values that end with jpg.

var images = from i in imageList.Split(new[] { "http:" },
                                       StringSplitOptions.RemoveEmptyEntries)
             where i.EndsWith(".jpg")
             select "http:" + i;
juharr
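
For completeness, here is a runnable sketch of the LINQ approach above. It adds one thing the original query does not have: a Trim() on each piece, on the assumption that links in the file may be followed by a space or newline, which would otherwise make EndsWith(".jpg") fail. The names LinqSplitExample and ExtractJpgLinks and the example.com URLs are made up for the illustration.

```csharp
using System;
using System.Linq;

public static class LinqSplitExample
{
    // Splits on "http:", trims surrounding whitespace (an added assumption),
    // keeps the pieces ending in ".jpg", and restores the "http:" prefix.
    public static string[] ExtractJpgLinks(string imageList)
    {
        var images = from i in imageList.Split(new[] { "http:" },
                                               StringSplitOptions.RemoveEmptyEntries)
                     let trimmed = i.Trim()
                     where trimmed.EndsWith(".jpg", StringComparison.Ordinal)
                     select "http:" + trimmed;

        return images.ToArray();
    }

    public static void Main()
    {
        // Hypothetical input mixing image and non-image links.
        string imageList = "http://example.com/a.jpghttp://example.com/page.html\n"
                         + "http://example.com/b.jpg\n";

        foreach (string link in ExtractJpgLinks(imageList))
        {
            Console.WriteLine(link);
        }
    }
}
```

Note that any text appearing before the first "http:" would also be treated as a piece and get the prefix added if it happens to end in .jpg, so this works best when the input contains nothing but links.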