ansaurus

Question

A way to use RegEx to find a set of file paths in a string

Answer 1

A:

Here's something I came up with:

using System;
using System.Text.RegularExpressions;

public class Test
{

    public static void Main()
    {
        string s = @"Hello John these are the files you have to send us today: 
            C:\projects\orders20101130.docx also we would like you to send 
            C:\some\file.txt, C:\someother.file and d:\some file\with spaces.ext  

            Thank you";

        Extract(s);

    }

    private static readonly Regex rx = new Regex
        (@"[a-z]:\\(?:[^\\:]+\\)*((?:[^:\\]+)\.\w+)", RegexOptions.IgnoreCase);

    static void Extract(string text)
    {
        MatchCollection matches = rx.Matches(text);

        foreach (Match match in matches)
        {
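            // Group 1 holds just the file name; print it next to the full path.
            Console.WriteLine("'{0}', file: '{1}'", match.Value, match.Groups[1].Value);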
        }
    }

}

Produces: (see on ideone)

'C:\projects\orders20101130.docx', file: 'orders20101130.docx'
'C:\some\file.txt', file: 'file.txt'
'C:\someother.file', file: 'someother.file'
'd:\some file\with spaces.ext', file: 'with spaces.ext'

The regex is not extremely robust (it makes a few assumptions, e.g. that extensions contain no spaces), but it works for your examples.
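To make those assumptions easier to see, here is the same pattern spread over several lines with comments. This is only a sketch: it relies on RegexOptions.IgnorePatternWhitespace (which the code above does not use), and the names PathRegexSketch and FindPaths are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; the names here are illustrative, not part of the answer above.
public static class PathRegexSketch
{
    // Same pattern as above, laid out with IgnorePatternWhitespace so it can carry comments.
    private static readonly Regex Rx = new Regex(@"
        [a-z]:\\            # drive letter, colon, backslash
        (?:[^\\:]+\\)*      # zero or more directories, each ending in a backslash
        (                   # group 1: the file name
          (?:[^:\\]+)       #   name part: anything except colon and backslash
          \.\w+             #   a dot followed by an extension made of word characters
        )",
        RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);

    // Returns (full path, file name) pairs for every match in the text.
    public static List<KeyValuePair<string, string>> FindPaths(string text)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in Rx.Matches(text))
        {
            result.Add(new KeyValuePair<string, string>(m.Value, m.Groups[1].Value));
        }
        return result;
    }
}
```

The name part is greedy, so it first runs up to the next colon or backslash and then backtracks to the last dot that is followed by word characters. That is why text following a path (", also we would like...") is not swallowed, and also why an extension containing spaces would be cut short.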

Here is a version of the program if you use <file> tags. Change the regex and Extract to:

private static readonly Regex rx = new Regex
    (@"<file>(.+?)</file>", RegexOptions.IgnoreCase);

static void Extract(string text)
{
    MatchCollection matches = rx.Matches(text);

    foreach (Match match in matches)
    {
        Console.WriteLine("'{0}'", match.Groups[1]);
    }
}

Also available on ideone.

Aillyn 2010-09-25 10:43:34

Your code really works here. I also tested it after adding extra whitespace, as in "file 20101130.csv". Thank you, Aillyn!

Junior Mayhé 2010-09-25 10:58:03

@Aillyn: Does not deal with Jim Brissom's comment (see comments on op). It also does not take into account that paths can be deeper than just one directory and that the file extensions can contain spaces.

Obalix 2010-09-25 11:01:00

@Junior I've added a version of the regex that uses `<file>` tags.

Aillyn 2010-09-25 11:01:06

@Obalix True, that is why I said it does make a few assumptions (paths deeper than one directory work fine though, and it wouldn't be hard to add whitespaces to the extensions - not that I've seen files like that). But I agree that using tags would be a better idea

Aillyn 2010-09-25 11:01:41

@Junior Mayhé: The code does work, but only under certain circumstances. If you can guarantee that the files will always be in the format `c:\directory\filename.ext`, it is OK. It does not work for `c:\directory\directory\filename.ext`, nor for `c:\directory\file name with space.ext with space`, nor for `c:\directory\filename.ext1.ext2`.

Obalix 2010-09-25 11:04:48

@Obalix [Oh Really?](http://ideone.com/awTjX)

Aillyn 2010-09-25 11:08:08

@Obalix, Hi there. I tested Aillyn's code with both cases: C:\Development\Projects2010\Accounting\file 20101130.csv and C:\Development\Projects 2010\Accounting\file 20101130.csv. Notice there is a white space in Projects 2010, it is a subfolder.

Junior Mayhé 2010-09-25 11:13:59

@Aillyn indeed, it is cleaner when we use a <file> tag!

Junior Mayhé 2010-09-25 11:14:48

@Junior I've updated my answer with a more robust regex. And now it's also capable of capturing the file name. It still doesn't support extensions with spaces because I have never seen files like that.

Aillyn 2010-09-25 11:20:38

Me too, @Aillyn, but the code is OK for searching for filenames in a string variable. I am your fan now :-) I didn't pay attention to @Obalix's point about extracting C:\Directory\Sub Directory\That another directory\Those.Namespace.FileName.txt. But your expression now works beautifully.

Junior Mayhé 2010-09-25 11:27:23

Don't parse (X)HTML using RegEx! http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

abatishchev 2010-09-27 06:29:40

@abatis Read the question carefully. If the OP follows the convention of using a tag only for the files, the result is a regular language, which *can* be parsed by a regular expression.

Aillyn 2010-09-27 15:49:23

Answer 2

+1 A:

If you put some constraints on your filename requirements, you can use code similar to this:

string s = @"Hello John

these are the files you have to send us today: C:\Development\Projects 2010\Accounting\file20101130.csv, C:\Development\Projects 2010\Accounting\orders20101130.docx

also we would like you to send C:\Development\Projects 2010\Accounting\customersupdated.xls

thank you";

Regex regexObj = new Regex(@"\b[a-z]:\\(?:[^<>:""/\\|?*\n\r\0-\37]+\\)*[^<>:""/\\|?*\n\r\0-\37]+\.[a-z0-9\.]{1,5}", RegexOptions.IgnorePatternWhitespace|RegexOptions.IgnoreCase);
MatchCollection fileNameMatchCollection = regexObj.Matches(s);
foreach (Match fileNameMatch in fileNameMatchCollection)
{
    MessageBox.Show(fileNameMatch.Value);
}

In this case, I limited extensions to a length of 1-5 characters. You can obviously use another value or restrict the characters allowed in filename extensions further. The list of valid characters is taken from the MSDN article Naming Files, Paths, and Namespaces.
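One way to restrict things further without making the pattern harder to read is to keep the regex as it is and filter the matches against a list of allowed extensions afterwards. A minimal sketch, assuming such a list exists: ExtensionFilterSketch, FindFiles and the three extensions are invented for illustration, and the control-character range is written in hex (`\x00-\x1F`) instead of octal.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; names and the extension list are assumptions, not part of the answer.
public static class ExtensionFilterSketch
{
    // The pattern from the answer above, with the control-character range in hex.
    private static readonly Regex PathRx = new Regex(
        @"\b[a-z]:\\(?:[^<>:""/\\|?*\x00-\x1F]+\\)*[^<>:""/\\|?*\x00-\x1F]+\.[a-z0-9\.]{1,5}",
        RegexOptions.IgnoreCase);

    // Assumed whitelist; compared case-insensitively.
    private static readonly HashSet<string> Allowed =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { ".csv", ".docx", ".xls" };

    // Returns only those matched paths whose last extension is in the whitelist.
    public static List<string> FindFiles(string text)
    {
        var files = new List<string>();
        foreach (Match m in PathRx.Matches(text))
        {
            // The pattern guarantees at least one dot, so LastIndexOf never returns -1 here.
            string ext = m.Value.Substring(m.Value.LastIndexOf('.'));
            if (Allowed.Contains(ext))
            {
                files.Add(m.Value);
            }
        }
        return files;
    }
}
```

The extension is taken with plain string operations rather than Path.GetExtension, so the result does not depend on which directory separator the current platform uses.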

Jim Brissom 2010-09-25 10:59:12

Good answer too Jim! Thank you!

Junior Mayhé 2010-09-25 11:09:41

Answer 3

A:

If you use a <file> tag and the final text can be represented as a well-formed XML document (at least as inner XML, i.e. text without root tags), you can probably do:

var doc = new XmlDocument();
doc.LoadXml(String.Concat("<root>", input, "</root>"));

var files = doc.SelectNodes("//file");

or

var doc = new XmlDocument();

doc.AppendChild(doc.CreateElement("root"));
doc.DocumentElement.InnerXml = input;

var nodes = doc.SelectNodes("//file");

Both methods work and are highly object-oriented, especially the second one.

They should also perform better.

See also - Don't parse (X)HTML using RegEx
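For completeness, here is the first variant wrapped into something that can be run as is. It is a sketch under the same assumption as above (the text must be well-formed once wrapped in a root element); FileTagXmlSketch and ExtractFiles are made-up names.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper; the names are illustrative only.
public static class FileTagXmlSketch
{
    // Wraps the text in a root element and returns the inner text of every <file> element.
    // LoadXml throws XmlException if the wrapped text is not well-formed XML
    // (for example, if it contains a bare '&' or '<').
    public static List<string> ExtractFiles(string input)
    {
        var doc = new XmlDocument();
        doc.LoadXml(string.Concat("<root>", input, "</root>"));

        var files = new List<string>();
        foreach (XmlNode node in doc.SelectNodes("//file"))
        {
            files.Add(node.InnerText);
        }
        return files;
    }
}
```

Usage would look like `FileTagXmlSketch.ExtractFiles(@"Send <file>C:\some\file.txt</file> today")`. Free-form e-mail text often contains characters such as `&`, so catching XmlException around the call is probably worth it.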

abatishchev 2010-09-25 18:22:45

-1 Waste of resources.

Aillyn 2010-09-27 15:51:19

@Aillyn: No, it is NOT. Parsing well-formed XML with RegEx is much, much worse.

abatishchev 2010-09-27 17:51:52

@abatis It happens that the OP is using a subset of XML (if you call it that) that *is* regular, thus, it *can* be parsed with RegEx. There is absolutely no need for an XML parser.

Aillyn 2010-09-27 22:06:39
