I have a load of gibberish data with this somewhere in the middle:

"video_id": "hGosI8rBVe8"

And from this, I want to extract hGosI8rBVe8. Note that what I want to extract can be of any length, and can include upper/lowercase letters and numbers. This is what I've tried so far:

"video_id": "(.*)"

and:

"video_id": "([a-zA-Z0-9]*)"

But they carry on matching way past the " at the end of what I want returned. I'm pretty sure this is because of the * (greedy)... but I see no other way to do it because what I want returned will be of variable length.

Any help is appreciated, cheers.

+3  A: 

Make it non-greedy (lazy) by appending a ? to the quantifier:

"video_id": "([a-zA-Z0-9]+?)"

I also changed * to +, since the former means zero or more and the latter means one or more, which is more appropriate in this case.
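
As a minimal sketch of the lazy pattern in use, here it is in Python's `re` module. The choice of Python and the sample `data` string are assumptions for illustration; the question does not say which language or input is involved.

```python
import re

# Hypothetical sample input: the target pair buried in unrelated text.
data = 'xx{"foo": "bar", "video_id": "hGosI8rBVe8", "n": 3}yy'

# +? is the lazy form of "one or more"; the closing quote in the
# pattern marks where the captured group has to end.
match = re.search(r'"video_id": "([a-zA-Z0-9]+?)"', data)

video_id = match.group(1) if match else None
print(video_id)  # hGosI8rBVe8
```

The `if match else None` guard is there because `re.search` returns `None` when the key is missing from the input.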

Jason McCreary
Ah, so `?` makes it a lazy match... good to know, thank you! :)
Wen
Won't `+?` only match at most one character?
dreamlax
I just tried it and it works perfectly, but I can see why you'd think that, and honestly I don't know :P
Wen
@dreamlax: The `+` metacharacter is "one or more" repetition, and `+?` means "one or more repetition and don't be greedy". The `?` metacharacter by itself is "zero or one" repetition, so `ab?c`, for instance, matches `abc` or `ac` but not `abbc`.
eldarerathis
I can see how greedy or not is relevant for the "(.*)" approach, but the "([a-zA-Z0-9]*)" can never reach past the ending double-quote, so surely it would have worked as it was. @Wen: as Randal suggested, why not post your exact input, regexp code and result?
Tony
How does that help in the specific case here? It wouldn't be greedy anyway, since it would stop at the `"` by virtue of the fact that it's not in the character class `[a-zA-Z0-9]`. I can see how it might help with `.*"`. EDIT: Yeah - basically what Tony said :-)
paxdiablo
@Tony, agreed. But the user said it worked. So without seeing the input and code, that's good enough for me.
Jason McCreary
+3  A: 

The "video_id": "([a-zA-Z0-9]*)" pattern shouldn't match beyond the closing " simply because that character is not included in the [a-zA-Z0-9] character class. I'm not sure why you think it's doing that.
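
A quick way to check this claim is a small sketch like the one below, written in Python as an assumption since the question names no language. The `data` string is made up, with more quoted text after the id so that any overrun would be visible.

```python
import re

# Hypothetical input: further quoted values follow the id.
data = '"video_id": "hGosI8rBVe8", "title": "Some title 123"'

# Greedy *, but limited to letters and digits, so it cannot cross a quote.
match = re.search(r'"video_id": "([a-zA-Z0-9]*)"', data)

captured = match.group(1)
print(captured)  # hGosI8rBVe8
```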

However, .* will match as many characters as are available, so applying the "(.*)" regex to My name is "Pax" and yours is "George" will get you:

Pax" and yours is "George
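
The same sentence can be run through both forms to see the difference. This is only a sketch in Python (an assumed language; other engines with lazy quantifiers should behave the same way for this input):

```python
import re

text = 'My name is "Pax" and yours is "George"'

# Greedy: .* runs to the end, then backtracks to the LAST quote.
greedy = re.search(r'"(.*)"', text).group(1)

# Lazy: .*? stops at the FIRST quote that lets the match succeed.
lazy = re.search(r'"(.*?)"', text).group(1)

print(greedy)  # Pax" and yours is "George
print(lazy)    # Pax
```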

If you have a regex engine that doesn't support non-greediness, you can use:

"video_id": "([^"]*)"

which will basically match " followed by the maximum number of non-" characters, followed by the " again.
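
For illustration, here is the negated character class applied in Python. Python does support lazy quantifiers, so it is used here purely as a convenient way to show the pattern's behaviour, and the `data` string is a made-up example.

```python
import re

# Hypothetical input with other quoted values around the target.
data = 'junk "video_id": "hGosI8rBVe8", "title": "Some title" junk'

# [^"]* is greedy, but it can never consume a quote, so the capture
# always ends at the first closing quote after the value starts.
match = re.search(r'"video_id": "([^"]*)"', data)
video_id = match.group(1) if match else None
print(video_id)  # hGosI8rBVe8

# The same idea applied to the earlier sentence picks out each quoted word.
names = re.findall(r'"([^"]*)"', 'My name is "Pax" and yours is "George"')
print(names)  # ['Pax', 'George']
```

Unlike [a-zA-Z0-9], this version also accepts values containing characters such as - or _, which may or may not be wanted.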

paxdiablo