ansaurus

Question

Answer 1

+1 A:

A regular expression for this will be very hard to write and will end up being unreliable anyway.

Probably your best bet is to keep a whitelist of the extensions you want to look for (.doc, .pdf, etc.) and trawl through the HTML looking for instances of them. When you find one, track back to the previous whitespace character; everything from there to the end of the extension is your filename.
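A minimal sketch of that scan, assuming a two-entry whitelist and treating a closing `>` as a boundary in addition to whitespace. The class name, method name, and whitelist contents are made up for illustration:

```csharp
using System;
using System.Collections.Generic;

public static class ExtensionScanner
{
    // Hypothetical whitelist; extend it with whatever extensions you expect.
    private static readonly string[] Extensions = { ".doc", ".pdf" };

    public static List<string> FindFileNames(string text)
    {
        var found = new List<string>();
        foreach (string ext in Extensions)
        {
            int pos = 0;
            while ((pos = text.IndexOf(ext, pos, StringComparison.OrdinalIgnoreCase)) >= 0)
            {
                int end = pos + ext.Length;

                // Skip hits where the extension runs into more letters or digits,
                // e.g. ".docx" or ".document" when looking for ".doc".
                bool endsCleanly = end == text.Length || !char.IsLetterOrDigit(text[end]);
                if (endsCleanly)
                {
                    // Track back to the preceding whitespace or tag boundary.
                    int start = pos;
                    while (start > 0 && !char.IsWhiteSpace(text[start - 1]) && text[start - 1] != '>')
                    {
                        start--;
                    }

                    // Ignore a bare extension with no name in front of it.
                    if (start < pos)
                    {
                        found.Add(text.Substring(start, end - start));
                    }
                }
                pos = end;
            }
        }
        return found;
    }
}
```

Results come back grouped by extension rather than in document order, and a filename containing spaces will be cut off at its last space, which is the weakness of stopping at whitespace.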

Hope this helps.

Paul Suart 2009-12-10 09:27:35

Forgot to mention that I have no clue what the filenames are going to be - with or without extension, etc...

Pygmy 2009-12-10 09:28:49

Filenames can even have whitespace in them, can't they?

Pygmy 2009-12-10 09:29:24

They *can*, but whether or not they *should* is another matter.

Paul Suart 2009-12-10 09:30:32

If you don't know what the extensions are going to be, you'll have no way of differentiating a filename from normal text such as "He looked for filenames.This isn't a filename".

Paul Suart 2009-12-10 09:31:47

Answer 2

+1 A:

You have a couple of options. You can use a regular expression, something like Filename: (.*?)</p>, but a real one will need to be much more complex, and you would need to look at more of the text file to write a proper one. This could work depending on the structure of your text, for example if a certain tag always follows a filename.
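As a rough sketch of that simple pattern in C#, assuming the label and the name sit as plain text in the same paragraph with no tags in between (the class and method names are made up for illustration):

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LazyFilenameRegex
{
    public static List<string> Extract(string html)
    {
        var names = new List<string>();

        // (.*?) is lazy, so each match stops at the first </p> after the label
        // instead of running on to the last </p> in the document.
        // Singleline lets '.' cross line breaks inside a paragraph.
        MatchCollection matches = Regex.Matches(
            html,
            @"Filename:\s*(.*?)\s*</p>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        foreach (Match m in matches)
        {
            // Groups[1] is the captured name, without the label or the tag.
            names.Add(m.Groups[1].Value);
        }
        return names;
    }
}
```

If the label is wrapped in its own tag, or the name is not followed directly by a closing paragraph tag, the pattern has to be adapted to that structure.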

If it is valid HTML you can also use an HTML parser such as HTML Agility Pack to go through the HTML and pull out the text from certain tags, then use a regex to separate out the path.
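Something along these lines might work with HTML Agility Pack. It is only a sketch: it assumes the HtmlAgilityPack NuGet package is referenced and that each filename sits in a paragraph shaped like <p><strong>Filename</strong>: name</p>, which is a guess at the markup. The class and method names are invented for the example.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // external NuGet package, not part of the framework

public static class ParagraphFilenames
{
    // Strips the "Filename" label, any colons and surrounding whitespace.
    private static readonly Regex LabelledName = new Regex(
        @"^\s*Filename[\s:]*(.+?)\s*$",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static List<string> Extract(string html)
    {
        var names = new List<string>();

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Assumed structure: paragraphs that contain a <strong> label.
        // SelectNodes returns null (not an empty list) when nothing matches.
        HtmlNodeCollection paragraphs = doc.DocumentNode.SelectNodes("//p[strong]");
        if (paragraphs == null)
        {
            return names;
        }

        foreach (HtmlNode p in paragraphs)
        {
            // InnerText drops the tags; DeEntitize turns &amp; and friends back into characters.
            string text = HtmlEntity.DeEntitize(p.InnerText);
            Match m = LabelledName.Match(text);
            if (m.Success)
            {
                names.Add(m.Groups[1].Value);
            }
        }
        return names;
    }
}
```

The parser copes with odd whitespace and attribute noise inside the tags, so the regex only has to deal with the plain text of one paragraph at a time.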

Glenn Condron 2009-12-10 09:31:18

+1 for Html Agility Pack. It is very powerful.

Mikos 2010-08-08 18:04:42

Answer 3

A:

I'm not sure a regular expression is the best way to do this; traversing the HTML tree is probably more sensible. That said, the following regex should do it:

<\s*strong\s*>\s*Filename\s*<\s*/strong\s*>[\s:]*([^<]+)<\s*/p\s*>

As you can see, I've been extremely tolerant of whitespace, as well as tolerant of the content of the filename. Also, multiple (or no) colons are permitted after the label.

The C# to build a List (off the top of my head):

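// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
List<string> fileNames = new List<string>();
Regex regexObj = new Regex(@"<\s*strong\s*>\s*Filename\s*<\s*/strong\s*>[\s:]*([^<]+)<\s*/p\s*>", RegexOptions.IgnoreCase);
Match matchResults = regexObj.Match(subjectString);
while (matchResults.Success) {
    // Groups[0] is the whole match including the tags; Groups[1] is the captured filename.
    // Trim() drops any whitespace left between the name and the closing </p>.
    fileNames.Add(matchResults.Groups[1].Value.Trim());
    matchResults = matchResults.NextMatch();
}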

Kazar 2009-12-10 10:07:50

Thank you very much! I'll give it a go as soon as I get home!

Pygmy 2009-12-10 10:29:41


c# : parsing text from html
