I have an unfinished binary file that has some info that I can recover using regex. The contents are:

G $12.Angry.Men.1957.720p.HDTV.x264-HDLH Lhttp://site.com/forum/f89/12-angry-men-1957-720p-hdtv-x264-hdl-538403/ L I Š M ,ABBA.The.Movie.1977.720p.BluRay.DTS.x264-iONN Phttp://site.com/forum/f89/abba-movie-1977-720p-bluray-dts-x264-ion-428687/&

How can I parse it so I can at least get links that are:

http://site.com/forum/f89/abba-movie-1977-720p-bluray-dts-x264-ion-428687/

where 428687 is the id number.

So I would have a full link and an id.

The names that come before are the names of the links:

ABBA.The.Movie.1977.720p.BluRay.DTS.x264-iON

Though I am not sure if these can be parsed. I noticed they all have a character before and after the LINKS and the NAMES. So maybe this can narrow down the problem?

Btw I am willing to give a 500 bounty for the correct answer.

+2  A: 

Something like the following regular expression?

MatchCollection matches = Regex.Matches(yourString, @"http://\S+?-(\d+)/");
foreach (Match m in matches)
{
    string id = m.Groups[1].Value;   // the first capturing group holds the id
    string url = m.Value;            // the whole match is the link
}

which grabs links (starting with http://), then everything that is not a space (spaces are guaranteed not to occur inside HTTP URIs), and assumes each link ends with digits and a trailing slash (this correctly drops the & in your example and any other trailing text).

EDIT: the whole match is the link, the ID is in the first capturing group; I updated the code to show how to get the info.

Update: if dash+digits+slash can occur more than once in the URL, then greediness must be used, but then consecutive links (with no space-containing text between them) will be matched together. If dash+digits+slash occurs only once per URL, then laziness is preferred. This is the solution currently in the code above.
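To make the greedy/lazy difference concrete, here is a small, untested sketch (the links are shortened, made-up versions of the ones in the question):

using System;
using System.Text.RegularExpressions;

// two links separated only by non-space garbage
string sample = "http://site.com/f89/abba-movie-428687/&http://site.com/f89/12-angry-men-538403/";

// the lazy \S+? stops at the first "-digits/", so both links come out separately
foreach (Match m in Regex.Matches(sample, @"http://\S+?-(\d+)/"))
    Console.WriteLine("{0} -> id {1}", m.Value, m.Groups[1].Value);

// the greedy variant @"http://\S+-(\d+)/" would instead produce a single
// match running from the first http:// up to the last "-digits/"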

Alternative approach

From the updates and the extra information, I understand that a lot about the text is still unclear. Another approach might be easier: split the content right after each link and go through the resulting chunks. This avoids a complex look-ahead/look-behind regex and makes sure that consecutive links (i.e., without text in-between) are treated correctly:

// zero-width split: break right after each "...-digits/" link ending
string[] linksWithText = Regex.Split(yourString, @"(?<=http:\S+-\d+/)");
foreach (string link in linksWithText)
{
    Match m = Regex.Match(link, @"(.*)(http:\S+-(\d+)/)$");
    if (m.Success)
    {
        string text = m.Groups[1].Value;   // the name (plus any leading garbage)
        string url = m.Groups[2].Value;
        string id = m.Groups[3].Value;
    }
}

Update: alternative approach updated. The text (name) comes first, then the url. Note the positive look-behind expression, which splits at a zero-width spot right after each url, so each chunk contains whatever precedes the url plus the url itself.
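Untested, but the usage could look roughly like this (the sample string below is a shortened stand-in for the recovered content; in your real data the extra binary bytes around the names would still need trimming, the Trim call here is only illustrative):

using System;
using System.Text.RegularExpressions;

string yourString =
    "G $12.Angry.Men.1957.720p.HDTV.x264-HDL" +
    "http://site.com/forum/f89/12-angry-men-1957-720p-hdtv-x264-hdl-538403/" +
    " ,ABBA.The.Movie.1977.720p.BluRay.DTS.x264-iON" +
    "http://site.com/forum/f89/abba-movie-1977-720p-bluray-dts-x264-ion-428687/&";

foreach (string chunk in Regex.Split(yourString, @"(?<=http:\S+-\d+/)"))
{
    Match m = Regex.Match(chunk, @"(.*)(http:\S+-(\d+)/)$");
    if (m.Success)
        Console.WriteLine("name: {0} | url: {1} | id: {2}",
            m.Groups[1].Value.Trim(' ', ','), m.Groups[2].Value, m.Groups[3].Value);
}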

Abel
Thanks, will try now.
Joan Venge
Thanks, btw can you also please help me group the LINK and the ID? Yours returns the full LINKS correctly.
Joan Venge
just updated the code and changed the regex slightly to make that possible. See my edits (your update on guaranteeing dash+digits+slash makes it easier)
Abel
I get an ArgumentOutOfRangeException for this: http://site.com/forum/f89/%5Bdirect%5Dmadagascar-escape-2-africa-720p-bluray-x264-septic-514993/ System.Text.RegularExpressions.Match
Joan Venge
It's the Captures line that gives the error, not m.
Joan Venge
I tried m.Groups[1] instead of id var, and it worked.
Joan Venge
So yours covers the LINKS and the IDs. Do you know if we can also parse the NAMES? If it makes it easier, I can remove the LINKS from the original content, so only NAMES would be there. But it's always NAMES followed by the LINKS in the content, if that helps.
Joan Venge
I used Captures, which starts at 0, groups starts at 1 (hence the exception). My bad. I showed an alternative approach which should give you more control.
Abel
About your "names": it is `text` in my alternative approach. Use that if you want the names, it is easier and more readable (didn't test any of this, hope the code is correct enough for you to go on)
Abel
Ah, the text comes before, sorry. Hold on, I'll fix.
Abel
Thanks, your second example is great. Except it throws an ArgumentOutOfRangeException where m is {}. I also noticed the http split made the elements LINK + next NAME, instead of NAME + LINK. I will try to get this working, but any pointers would be very helpful. Thanks again.
Joan Venge
You are faster than me :)
Joan Venge
+1 for all the detailed explanation and tweaking :)
Jass
If it's in the middle it's ok; if it's at the end, ignoring it is fine. So you are right. I will try this more thoroughly at home. But the NAME included some binary characters too. AFAIK they can only have numbers, letters, and characters like ( ) and [ ]. Is it possible to match the NAMES without the binary garbage that comes before and after? After that I can crop the first and last letter, which are ASCII but aren't in the actual name, by simple string parsing. Like the NAME ABBA has , and N at the start and end. They aren't in the actual name.
Joan Venge
If your data is a string: no worries, everything in a string can be matched by a .NET regex. Binary garbage is undefined in itself and can be indistinguishable from ASCII (your code doesn't seem ASCII but more UTF-8/16, actually). If it is a byte array, the story becomes different. But then you cannot use a regex anymore. If you can define "binary garbage", I can help you with the implementation. Otherwise, in the absence of a definition, it will be impossible I'm afraid.
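For what it's worth: if you first read the bytes back into a string with a lenient single-byte encoding, the regex approach above still applies. A rough, untested sketch; the file name and the Latin-1 choice are only my assumptions (Latin-1 maps every byte value to a character, so nothing throws or gets dropped):

using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

byte[] raw = File.ReadAllBytes("Streamer.bin");                      // file name assumed
string content = Encoding.GetEncoding("ISO-8859-1").GetString(raw);  // lenient byte-to-char decoding

foreach (Match m in Regex.Matches(content, @"http://\S+?-(\d+)/"))
    Console.WriteLine("id {0} -> {1}", m.Groups[1].Value, m.Value);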
Abel
Hi Abel, thanks for your help. Sorry I just got home. So I used BinaryFormatter to serialize to a Stream which is a FileStream. So I guess it's a byte array, right? I uploaded the file here: http://www.storage.to/get/UNhp9BAR/Streamer.bin Let me know if it helps. Thanks a lot again.
Joan Venge
The stream you sent is a stream created with .NET serialization (you mention `BinaryFormatter`, probably you streamed objects to disk, right?) To stream them back, all you need to do is use the same type of objects and deserialize. If you need roundtrip serialization and you can control the `BinaryFormatter`, replace it with an `XmlSerializer`, this makes readable and parsable data. Or you simply use (exactly!) the same classes to get the data back. Deserializing this binary stream without that information is daunting, to say the least, I hope you can control its production.
Abel
I tried to deserialize your data, but apparently it got corrupted on the way. The start of the file is not correct. Reading it back with .NET 2.0 or 3.5 both fail (the SOH pos 9-12 should resolve to Int32 `0x1`, which is the "binary formatter major version").
Abel
I won't be here for a few days. If you need (quick) help, link that file to a new question and ask how to use the BinaryFormatter to deserialize. As a hint: in the MSDN help is a clear example. If you are unsure what class type to use, deserialize to `object` and use introspection (hover with your mouse during debugging) to find out what types it contains.
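Roughly like this (untested, and it assumes the stream is not truncated; the file name is made up):

using System;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

using (FileStream fs = File.OpenRead("Streamer.bin"))    // file name assumed
{
    BinaryFormatter formatter = new BinaryFormatter();
    object data = formatter.Deserialize(fs);              // deserialize without knowing the type
    Console.WriteLine(data.GetType().FullName);           // or hover over `data` while debugging
}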
Abel
Thanks Abel. Yeah, that was the problem I was having. The file is serialized, but the operation couldn't be completed, so it only wrote about 98% of the contents. I know how to serialize/deserialize, but since this error happened because of no disk space, I thought I could get the links back using regex. Thanks to you, I recovered most of it.
Joan Venge
+1  A: 

Assuming all urls end with a hyphen, followed by some digits, followed by a forward slash, this could work.

`http://[^ ]*-(?<id>\d+)/`

What do you think?

UPDATE: Try this:-

http://(?:(?!http://)[^ ])*-(?<id>\d+)/

Updated the regex with (?!http://) so a match cannot run across two urls that are concatenated with some non-space data between them.

You can get the captured group by name. The whole match is the url, and the named group holds the id.
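In C# that would be roughly as below (untested; `content` is assumed to be whatever string holds the recovered text):

using System;
using System.Text.RegularExpressions;

foreach (Match m in Regex.Matches(content, @"http://(?:(?!http://)[^ ])*-(?<id>\d+)/"))
{
    string url = m.Value;                 // the whole match is the url
    string id = m.Groups["id"].Value;     // the named group holds the id
    Console.WriteLine("{0} -> {1}", id, url);
}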

Jass
your match would incorrectly match only the first half of http_//site.com/forum-24/something-abba-4737373/, but if the link never contains dash+digits elsewhere, it'll work just fine (in other words: we actually need more info on the links to be certain we can give the correct regex).
Abel
Thanks, will try now.
Joan Venge
Yes, info-wise the links can only start with http:// and end with /, that's for certain.
Joan Venge
Abel, the * is greedy: it will match until the first space character and then backtrack to the last hyphen, followed by the series of numbers, followed by the slash. It should match the whole url.
Jass
A greedy * tries to match as much as possible.
Jass
Thanks Jass, I tried the 2nd but it threw an exception, using this: Regex.Matches(content, @"?<link>(http://[^ ]*-?<id>(\d)+/)");
Joan Venge
@Jass: I know what greedy means and you are correct; however, consecutive URLs would not be found separately (they would be combined) due to the same greediness. My point was more that we know too little about the data to guarantee good results.
Abel
Of course, if two urls are joined without any space between them they would be matched together, but I assumed that wouldn't be the case. Hmm, maybe I should have used a (?!http://), that should do...
Jass