I have a string input buffer that contains HTML. That HTML contains a lot of text, including some stuff I want to parse. What I'm actually looking for are lines like this: "<strong>Filename</strong>: yadayada.thisandthat.doc</p>"

(Although the position and amount of whitespace and colons is variable)

What's the best way to get all the filenames into a List<string>?

+1  A: 

Well a regular expression to accomplish this will be very hard to write and will end up being unreliable anyway.

Probably your best bet is to have a whitelist of extensions you want to look for (.doc, .pdf etc.), and trawl through the HTML looking for instances of these extensions. When you find one, track back to the previous whitespace character; everything from there to the end of the extension is your filename.
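A minimal sketch of that whitelist scan, in case it helps. The class name ExtensionScanner, the method FindFileNames and the extension list are made up for illustration, and treating '>' as a boundary in addition to whitespace is an extra assumption so that a filename directly after a tag is not swallowed together with the tag.

```csharp
using System;
using System.Collections.Generic;

public static class ExtensionScanner
{
    // Hypothetical whitelist; replace with the extensions you actually expect.
    private static readonly string[] Extensions = { ".doc", ".pdf", ".txt" };

    public static List<string> FindFileNames(string html)
    {
        var fileNames = new List<string>();
        foreach (string ext in Extensions)
        {
            int pos = html.IndexOf(ext, StringComparison.OrdinalIgnoreCase);
            while (pos >= 0)
            {
                int end = pos + ext.Length;
                // Accept the hit only if the extension ends the token,
                // so ".doc" does not match inside ".docx".
                bool atTokenEnd = end == html.Length || !char.IsLetterOrDigit(html[end]);
                if (atTokenEnd)
                {
                    // Track back to the previous whitespace character
                    // (or '>', an assumption to stop at a closing tag).
                    int start = pos;
                    while (start > 0 && !char.IsWhiteSpace(html[start - 1]) && html[start - 1] != '>')
                    {
                        start--;
                    }
                    if (start < pos)
                    {
                        fileNames.Add(html.Substring(start, end - start));
                    }
                }
                pos = html.IndexOf(ext, end, StringComparison.OrdinalIgnoreCase);
            }
        }
        // Results are grouped by extension, not in document order.
        return fileNames;
    }
}
```

Note that this cuts off any filename containing whitespace, and it cannot find files whose extension is not on the list.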

Hope this helps.

Paul Suart
Forgot to mention that I have no clue what the filenames are going to be - with or without extension, etc...
Pygmy
Filenames can even have whitespace in them, can't they ?
Pygmy
They *can*, but whether or not they *should* is another matter.
Paul Suart
If you don't know what the extensions are going to be, you'll have no way of differentiating a filename from normal text such as "He looked for filenames.This isn't a filename".
Paul Suart
+1  A: 

You have a couple of options. You can use a regular expression, something like Filename: (.*?)</p>, but a robust one will need to be much more complex, and you would need to look at more of the input to write it properly. This could work depending on the structure of your text, for example if a certain tag always follows a filename.

If it is valid HTML, you can also use an HTML parser such as HTML Agility Pack to walk the HTML and pull the text out of certain tags, then use a regex to separate out the path.
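A rough sketch of the parser route, assuming the HtmlAgilityPack NuGet package is referenced. The class name FileNameExtractor and the method Extract are hypothetical, and the code assumes the filename is the text node immediately following a strong element whose text is exactly "Filename" (the XPath comparison is case-sensitive). Instead of a regex, it simply trims whitespace and colons from that text.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class FileNameExtractor
{
    public static List<string> Extract(string html)
    {
        var fileNames = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection labels =
            doc.DocumentNode.SelectNodes("//strong[normalize-space(.)='Filename']");
        if (labels == null)
        {
            return fileNames;
        }

        foreach (HtmlNode label in labels)
        {
            // Assumption: the filename is the text node right after </strong>.
            HtmlNode next = label.NextSibling;
            if (next == null || next.NodeType != HtmlNodeType.Text)
            {
                continue;
            }

            // Decode entities such as &amp;, then strip whitespace and colons.
            string text = HtmlEntity.DeEntitize(next.InnerText)
                                    .Trim(' ', '\t', '\r', '\n', ':');
            if (text.Length > 0)
            {
                fileNames.Add(text);
            }
        }
        return fileNames;
    }
}
```

Because the whole text node is taken, filenames with spaces or without an extension survive intact.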

Glenn Condron
+1 for Html Agility Pack. It is very powerful.
Mikos
A: 

I'm not sure a regular expression is the best way to do this; traversing the HTML tree is probably more sensible. That said, the following regex should do it:

<\s*strong\s*>\s*Filename\s*<\s*/strong\s*>[\s:]*([^<]+)<\s*/p\s*>

As you can see, I've been extremely tolerant of whitespace, as well as of the content of the filename. Also, multiple (or no) colons are permitted.

The C# to build a List (off the top of my head):

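// Needs: using System.Collections.Generic; using System.Text.RegularExpressions;
List<string> fileNames = new List<string>();
Regex regexObj = new Regex(@"<\s*strong\s*>\s*Filename\s*<\s*/strong\s*>[\s:]*([^<]+)<\s*/p\s*>", RegexOptions.IgnoreCase);
Match matchResults = regexObj.Match(subjectString);
while (matchResults.Success) {
    // Groups[0] is the whole match, tags included; Groups[1] is the captured filename.
    fileNames.Add(matchResults.Groups[1].Value.Trim());
    matchResults = matchResults.NextMatch();
}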
Kazar
Thank you very much ! I'll give it a go as soon as I get home !
Pygmy