Good morning guys

Is there a good way to use a regular expression in C# to find all file names and their paths within a string variable?

For example, if you have this string:

string s = @"Hello John

these are the files you have to send us today: <file>C:\Development\Projects 2010\Accounting\file20101130.csv</file>, <file>C:\Development\Projects 2010\Accounting\orders20101130.docx</file>

also we would like you to send <file>C:\Development\Projects 2010\Accounting\customersupdated.xls</file>

thank you";

The result would be:

C:\Development\Projects 2010\Accounting\file20101130.csv
C:\Development\Projects 2010\Accounting\orders20101130.docx
C:\Development\Projects 2010\Accounting\customersupdated.xls

EDIT: Following @Jim's suggestion, I edited the string to add tags, which makes it easier to extract the needed file names from the string!

A: 

Here's something I came up with:

using System;
using System.Text.RegularExpressions;

public class Test
{

    public static void Main()
    {
        string s = @"Hello John these are the files you have to send us today: 
            C:\projects\orders20101130.docx also we would like you to send 
            C:\some\file.txt, C:\someother.file and d:\some file\with spaces.ext  

            Thank you";

        Extract(s);

    }

    private static readonly Regex rx = new Regex
        (@"[a-z]:\\(?:[^\\:]+\\)*((?:[^:\\]+)\.\w+)", RegexOptions.IgnoreCase);

    static void Extract(string text)
    {
        MatchCollection matches = rx.Matches(text);

        foreach (Match match in matches)
        {
            // Group 1 holds the file name without its directory.
            Console.WriteLine("'{0}', file: '{1}'", match.Value, match.Groups[1].Value);
        }
    }

}

Produces: (see on ideone)

'C:\projects\orders20101130.docx', file: 'orders20101130.docx'
'C:\some\file.txt', file: 'file.txt'
'C:\someother.file', file: 'someother.file'
'd:\some file\with spaces.ext', file: 'with spaces.ext'

The regex is not extremely robust (it does make a few assumptions) but it worked for your examples as well.


Here is a version of the program if you use <file> tags. Change the regex and Extract to:

private static readonly Regex rx = new Regex
    (@"<file>(.+?)</file>", RegexOptions.IgnoreCase);

static void Extract(string text)
{
    MatchCollection matches = rx.Matches(text);

    foreach (Match match in matches)
    {
        Console.WriteLine("'{0}'", match.Groups[1]);
    }
}

Also available on ideone.

Aillyn
Your code really works here. I also tested it after adding extra whitespace, as in "file 20101130.csv". Thank you Aillyn!
Junior Mayhé
@Aillyn: Does not deal with Jim Brissom's comment (see comments on op). It also does not take into account that paths can be deeper than just one directory and that the file extensions can contain spaces.
Obalix
@Junior I've added a version of the regex that uses `<file>` tags.
Aillyn
@Obalix True, that is why I said it does make a few assumptions (paths deeper than one directory work fine though, and it wouldn't be hard to add whitespaces to the extensions - not that I've seen files like that). But I agree that using tags would be a better idea
Aillyn
@Junior Mayhé: The code does work, but only under certain circumstances. If you can guarantee that the files will always be in the format `c:\directory\filename.ext`, it is OK. It does not work for `c:\directory\directory\filename.ext`, nor for `c:\directory\file name with space.ext with space`, nor for `c:\directory\filename.ext1.ext2`.
Obalix
@Obalix [Oh Really?](http://ideone.com/awTjX)
Aillyn
@Obalix, Hi there. I tested Aillyn's code with both cases: C:\Development\Projects2010\Accounting\file 20101130.csv and C:\Development\Projects 2010\Accounting\file 20101130.csv. Notice there is a white space in Projects 2010, it is a subfolder.
Junior Mayhé
@Aillyn indeed it is cleaner when we use a <file> tag!
Junior Mayhé
@Junior I've updated my answer with a more robust regex. And now it's also capable of capturing the file name. It still doesn't support extensions with spaces because I have never seen files like that.
Aillyn
Me too @Aillyn, but the code is fine for finding file names in a string variable. I am your fan now :-) I hadn't paid attention to @Obalix's point about extracting C:\Directory\Sub Directory\That another directory\Those.Namespace.FileName.txt, but your expression now works beautifully.
Junior Mayhé
Don't parse (X)HTML using RegEx! http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
abatishchev
@abatis Read the question carefully. If the OP follows the convention of using a tag only for the files, the result is a regular language, which *can* be parsed by a regular expression.
Aillyn
A: 

If you put some constraints on your filename requirements, you can use code similar to this:

string s = @"Hello John

these are the files you have to send us today: C:\Development\Projects 2010\Accounting\file20101130.csv, C:\Development\Projects 2010\Accounting\orders20101130.docx

also we would like you to send C:\Development\Projects 2010\Accounting\customersupdated.xls

thank you";

Regex regexObj = new Regex(@"\b[a-z]:\\(?:[^<>:""/\\|?*\n\r\0-\37]+\\)*[^<>:""/\\|?*\n\r\0-\37]+\.[a-z0-9\.]{1,5}", RegexOptions.IgnorePatternWhitespace|RegexOptions.IgnoreCase);
MatchCollection fileNameMatchCollection = regexObj.Matches(s);
foreach (Match fileNameMatch in fileNameMatchCollection)
{
    MessageBox.Show(fileNameMatch.Value);
}

In this case, I limited extensions to a length of 1-5 characters. You can use another value or restrict the characters allowed in file name extensions further. The list of excluded (reserved) characters is taken from the MSDN article Naming Files, Paths, and Namespaces.
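As a rough sketch of restricting the extensions further, the variant below replaces the generic extension pattern with an explicit alternation. The extension list (csv, docx, xls) and the names KnownExtensionPaths and Find are assumptions made for illustration and are not part of the answer above; the control-character range is dropped for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class KnownExtensionPaths
{
    // Hypothetical whitelist of extensions; adjust to the file types you expect.
    private static readonly Regex Rx = new Regex(
        @"\b[a-z]:\\(?:[^<>:""/\\|?*\r\n]+\\)*[^<>:""/\\|?*\r\n]+\.(?:csv|docx|xls)\b",
        RegexOptions.IgnoreCase);

    // Returns every path in the text that ends in a whitelisted extension.
    public static List<string> Find(string text)
    {
        var result = new List<string>();
        foreach (Match m in Rx.Matches(text))
        {
            result.Add(m.Value);
        }
        return result;
    }

    public static void Main()
    {
        string s = @"send C:\Development\Projects 2010\Accounting\file20101130.csv, C:\Development\Projects 2010\Accounting\orders20101130.docx and C:\Temp\notes.txt";
        foreach (string path in Find(s))
        {
            Console.WriteLine(path);
        }
    }
}
```

Because spaces are still allowed inside a path segment, a non-whitelisted file followed by a whitelisted one on the same line (for example `C:\Temp\notes.txt and file.csv`) would be picked up as one long path, so this only helps when the surrounding text is reasonably well behaved.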

Jim Brissom
Good answer too Jim! Thank you!
Junior Mayhé
A: 

If you use <file> tags and the final text can be treated as a well-formed XML fragment (that is, inner XML without a root element), you can probably do:

// requires: using System.Xml;
var doc = new XmlDocument();
doc.LoadXml(String.Concat("<root>", input, "</root>"));

var files = doc.SelectNodes("//file");

or

var doc = new XmlDocument();

doc.AppendChild(doc.CreateElement("root"));
doc.DocumentElement.InnerXml = input;

var nodes = doc.SelectNodes("//file");

Both methods work and are object-oriented, especially the second one.

They should also perform better.
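For completeness, here is a minimal sketch of reading the paths back out of the selected nodes. The class and method names (FileTagReader, ReadFiles) and the sample input are assumptions for illustration. Note that LoadXml throws an XmlException if the text is not well-formed, for example when it contains a stray `&` or `<`.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class FileTagReader
{
    // Returns the text content of every <file> element in the fragment.
    public static List<string> ReadFiles(string input)
    {
        var doc = new XmlDocument();
        // Wrap the fragment in a root element so it becomes a complete document.
        doc.LoadXml("<root>" + input + "</root>");

        var files = new List<string>();
        foreach (XmlNode node in doc.SelectNodes("//file"))
        {
            files.Add(node.InnerText);
        }
        return files;
    }

    public static void Main()
    {
        string input = @"please send <file>C:\Projects 2010\a.csv</file> and <file>C:\Projects 2010\b.xls</file>";
        foreach (string file in ReadFiles(input))
        {
            Console.WriteLine(file);
        }
    }
}
```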

See also - Don't parse (X)HTML using RegEx

abatishchev
-1 Waste of resources.
Aillyn
@Aillyn: No, it is NOT. Parsing well-formed XML with RegEx is much, much worse.
abatishchev
@abatis It happens that the OP is using a subset of XML (if you call it that) that *is* regular, thus, it *can* be parsed with RegEx. There is absolutely no need for an XML parser.
Aillyn