I have an unfinished binary file that has some info that I can recover using regex. The contents are:

G $12.Angry.Men.1957.720p.HDTV.x264-HDLH Lhttp://site.com/forum/f89/12-angry-men-1957-720p-hdtv-x264-hdl-538403/ L I Š M ,ABBA.The.Movie.1977.720p.BluRay.DTS.x264-iONN Phttp://site.com/forum/f89/abba-movie-1977-720p-bluray-dts-x264-ion-428687/&

How can I parse it so I can at least get links that are:

http://site.com/forum/f89/abba-movie-1977-720p-bluray-dts-x264-ion-428687/

where 428687 is the id number.

So I would have a full link and an id.

The names that come before are the names of the links:

ABBA.The.Movie.1977.720p.BluRay.DTS.x264-iON

Though I am not sure if these can be parsed. I noticed they all have a character before and after the LINKS and the NAMES. So maybe this can narrow down the problem?

Btw I am willing to give a 500 bounty for the correct answer.

+2  A: 

Something like the following regular expression?

MatchCollection matches = Regex.Matches(yourString, @"http://\S+?-(\d+)/");
foreach (Match m in matches)
{
    string id = m.Groups[1].Value;   // the first capturing group holds the id
    string url = m.Value;            // the whole match is the link
}

which grabs links (starting with http://), then everything that is not a space (spaces are guaranteed not to occur inside HTTP URIs), and assumes each link ends with digits and a trailing slash (this correctly drops the & in your example and any other trailing text).

EDIT: the whole match is the link, the ID is in the first capturing group; I updated the code to show how to get the info.

Update: if dash+digits+slash can occur more than once in the URL, then greediness must be used, but then consecutive links (with no space-containing text between them) will be matched together. If dash+digits+slash occurs only once per URL, then laziness is preferred. This is the solution currently in the code above.
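To make the greedy/lazy difference concrete, here is a small, untested sketch (the links are shortened, made-up versions of the ones in the question):

using System;
using System.Text.RegularExpressions;

// two links separated only by non-space garbage
string sample = "http://site.com/f89/abba-movie-428687/&http://site.com/f89/12-angry-men-538403/";

// the lazy \S+? stops at the first "-digits/", so both links come out separately
foreach (Match m in Regex.Matches(sample, @"http://\S+?-(\d+)/"))
    Console.WriteLine("{0} -> id {1}", m.Value, m.Groups[1].Value);

// the greedy variant @"http://\S+-(\d+)/" would instead produce a single
// match running from the first http:// up to the last "-digits/"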

Alternative approach

From the updates and the extra information, I understand that a lot about the text is still unclear. Another approach might be easier: split the content right after each link and go through the resulting chunks. This avoids a complex look-ahead/look-behind regex and makes sure that consecutive links (i.e., without text in-between) are treated correctly:

// zero-width split: break right after each "...-digits/" link ending
string[] linksWithText = Regex.Split(yourString, @"(?<=http:\S+-\d+/)");
foreach (string link in linksWithText)
{
    Match m = Regex.Match(link, @"(.*)(http:\S+-(\d+)/)$");
    if (m.Success)
    {
        string text = m.Groups[1].Value;   // the name (plus any leading garbage)
        string url = m.Groups[2].Value;
        string id = m.Groups[3].Value;
    }
}

Update: alternative approach updated. The text (name) comes first, then the url. Note the positive look-behind expression, which splits at a zero-width spot right after each url, so each chunk contains whatever precedes the url plus the url itself.
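Untested, but the usage could look roughly like this (the sample string below is a shortened stand-in for the recovered content; in your real data the extra binary bytes around the names would still need trimming, the Trim call here is only illustrative):

using System;
using System.Text.RegularExpressions;

string yourString =
    "G $12.Angry.Men.1957.720p.HDTV.x264-HDL" +
    "http://site.com/forum/f89/12-angry-men-1957-720p-hdtv-x264-hdl-538403/" +
    " ,ABBA.The.Movie.1977.720p.BluRay.DTS.x264-iON" +
    "http://site.com/forum/f89/abba-movie-1977-720p-bluray-dts-x264-ion-428687/&";

foreach (string chunk in Regex.Split(yourString, @"(?<=http:\S+-\d+/)"))
{
    Match m = Regex.Match(chunk, @"(.*)(http:\S+-(\d+)/)$");
    if (m.Success)
        Console.WriteLine("name: {0} | url: {1} | id: {2}",
            m.Groups[1].Value.Trim(' ', ','), m.Groups[2].Value, m.Groups[3].Value);
}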

Abel
Thanks, will try now.
Joan Venge
Thanks, btw can you also please help me group the LINK and the ID? Yours returns the full LINKS correctly.
Joan Venge
just updated the code and changed the regex slightly to make that possible. See my edits (your update on guaranteeing dash+digits+slash makes it easier)
Abel
I get an ArgumentOutOfRangeException for this: http://site.com/forum/f89/%5Bdirect%5Dmadagascar-escape-2-africa-720p-bluray-x264-septic-514993/ System.Text.RegularExpressions.Match
Joan Venge
It's the Captures line that gives the error, not m.
Joan Venge
I tried m.Groups[1] instead of id var, and it worked.
Joan Venge
So yours covers the LINKS and the IDs. Do you know if we can also parse the NAMES? If it makes it easier, I can remove the LINKS from the original content, so only NAMES would be there. But it's always NAMES followed by the LINKS in the content, if that helps.
Joan Venge
I used Captures, which starts at 0, groups starts at 1 (hence the exception). My bad. I showed an alternative approach which should give you more control.
Abel
About your "names": it is `text` in my alternative approach. Use that if you want the names, it is easier and more readable (didn't test any of this, hope the code is correct enough for you to go on)
Abel
Ah, the text comes before, sorry. Hold on, I'll fix.
Abel
Thanks, your second example is great. Except it throws an ArgumentOutOfRangeException where m is {}. I also noticed the http split made the elements LINK + next NAME, instead of NAME + LINK. I will try to get this working, but any pointers would be very helpful. Thanks again.
Joan Venge
You are faster than me :)
Joan Venge
+1 for all the detailed explanation and tweaking :)
Jass
If it's in the middle it's ok; if it's at the end, ignoring it is fine. So you are right. I will try this more thoroughly at home. But the NAME included some binary characters too. AFAIK they can only have numbers, letters, and characters like ( ) and [ ]. Is it possible to match the NAMES without the binary garbage that comes before and after? After that I can crop the first and last letter, which are ASCII but aren't in the actual name, by simple string parsing. Like the NAME ABBA has , and N at the start and end. They aren't in the actual name.
Joan Venge
If your data is a string: no worries, everything in a string can be matched by a .NET regex. Binary garbage is undefined in itself and can be indistinguishable from ASCII (your code doesn't seem ASCII but more UTF-8/16, actually). If it is a byte array, the story becomes different. But then you cannot use a regex anymore. If you can define "binary garbage", I can help you with the implementation. Otherwise, in the absence of a definition, it will be impossible I'm afraid.
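For what it's worth: if you first read the bytes back into a string with a lenient single-byte encoding, the regex approach above still applies. A rough, untested sketch; the file name and the Latin-1 choice are only my assumptions (Latin-1 maps every byte value to a character, so nothing throws or gets dropped):

using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

byte[] raw = File.ReadAllBytes("Streamer.bin");                      // file name assumed
string content = Encoding.GetEncoding("ISO-8859-1").GetString(raw);  // lenient byte-to-char decoding

foreach (Match m in Regex.Matches(content, @"http://\S+?-(\d+)/"))
    Console.WriteLine("id {0} -> {1}", m.Groups[1].Value, m.Value);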
Abel
Hi Abel, thanks for your help. Sorry I just got home. So I used BinaryFormatter to serialize to a Stream which is a FileStream. So I guess it's a byte array, right? I uploaded the file here: http://www.storage.to/get/UNhp9BAR/Streamer.bin Let me know if it helps. Thanks a lot again.
Joan Venge
The stream you sent is a stream created with .NET serialization (you mention `BinaryFormatter`, probably you streamed objects to disk, right?) To stream them back, all you need to do is use the same type of objects and deserialize. If you need roundtrip serialization and you can control the `BinaryFormatter`, replace it with an `XmlSerializer`, this makes readable and parsable data. Or you simply use (exactly!) the same classes to get the data back. Deserializing this binary stream without that information is daunting, to say the least, I hope you can control its production.
Abel
I tried to deserialize your data, but apparently it got corrupted on the way. The start of the file is not correct. Reading it back with .NET 2.0 or 3.5 both fail (the SOH pos 9-12 should resolve to Int32 `0x1`, which is the "binary formatter major version").
Abel
I won't be here for a few days. If you need (quick) help, link that file to a new question and ask how to use the BinaryFormatter to deserialize. As a hint: in the MSDN help is a clear example. If you are unsure what class type to use, deserialize to `object` and use introspection (hover with your mouse during debugging) to find out what types it contains.
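Roughly like this (untested, and it assumes the stream is not truncated; the file name is made up):

using System;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

using (FileStream fs = File.OpenRead("Streamer.bin"))    // file name assumed
{
    BinaryFormatter formatter = new BinaryFormatter();
    object data = formatter.Deserialize(fs);              // deserialize without knowing the type
    Console.WriteLine(data.GetType().FullName);           // or hover over `data` while debugging
}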
Abel
Thanks Abel. Yeah, that was the problem I was having. The file is serialized, but the operation couldn't be completed, so it only wrote about 98% of the contents. I know how to serialize/deserialize, but since this error happened because of no disk space, I thought I could get the links back using regex. Thanks to you, I recovered most of it.
Joan Venge
+1  A: 

Assuming all urls end with a hyphen, followed by some digits, followed by a forward slash, this could work.

`http://[^ ]*-(?<id>\d+)/`

What do you think?

UPDATE: Try this:-

http://(?:(?!http://)[^ ])*-(?<id>\d+)/

Updated the regex with (?!http://) so a match cannot run across two urls that are concatenated with some non-space data between them.

You can get the captured group by name. The whole match is the url, and the named group holds the id.
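In C# that would be roughly as below (untested; `content` is assumed to be whatever string holds the recovered text):

using System;
using System.Text.RegularExpressions;

foreach (Match m in Regex.Matches(content, @"http://(?:(?!http://)[^ ])*-(?<id>\d+)/"))
{
    string url = m.Value;                 // the whole match is the url
    string id = m.Groups["id"].Value;     // the named group holds the id
    Console.WriteLine("{0} -> {1}", id, url);
}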

Jass
your match would incorrectly match only the first half of http_//site.com/forum-24/something-abba-4737373/, but if the link never contains dash+digits elsewhere, it'll work just fine (in other words: we actually need more info on the links to be certain we can give the correct regex).
Abel
Thanks, will try now.
Joan Venge
Yes, info-wise the links can only start with http:// and end with /, that's for certain.
Joan Venge
Abel, the * is greedy: it will match until the first space character and then backtrack to the last hyphen, followed by the series of numbers, followed by the slash. It should match the whole url.
Jass
A greedy * tries to match as much as possible.
Jass
Thanks Jass, I tried the 2nd but it threw an exception, using this: Regex.Matches(content, @"?<link>(http://[^ ]*-?<id>(\d)+/)");
Joan Venge
@Jass: I know what greedy means and you are correct; however, consecutive URLs would not be found separately (they would be combined) due to the same greediness. My point was more that we know too little about the data to guarantee good results.
Abel
Of course, if two urls are joined without any space between them they would be matched together, but I assumed that wouldn't be the case. Hmm, maybe I should have used a (?!http://), that should do...
Jass