I have a Python regular expression that matches a set of filenames. How can I change it so that I can use it in Mercurial's .hgignore file to ignore files that do not match the expression?

Full story: I have a big source tree with *.ml files scattered everywhere. I want to put them into a new repository. There are other, less important files which are too heavy to be included in the repository. I'm trying to find the corresponding expression for the .hgignore file.

1st observation: Python doesn't have a regular-language complement operator (AFAIK it can complement only a set of characters). (BTW, why?)

2nd observation: The following regex in Python:

re.compile(r"^.*(?<!\.ml)$")

works as expected:

abcabc - match  
abc.ml - no match  
x/abcabc - match  
x/abc.ml - no match
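
For reference, the check above can be reproduced with a short script. This is only a sketch of the plain-Python test; the helper name `matches` is made up for illustration and has nothing to do with Mercurial.

```python
import re

# The expression from the question, written as a raw string.
pattern = re.compile(r"^.*(?<!\.ml)$")


def matches(name):
    """Return True if the whole name matches, i.e. it does not end in .ml."""
    return pattern.match(name) is not None


for name in ["abcabc", "abc.ml", "x/abcabc", "x/abc.ml"]:
    print(name, "-", "match" if matches(name) else "no match")
```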

However, when I put exactly the same expression in the .hgignore file, I get this:

$ hg st --all  
?  abc.ml  
I .hgignore  
I abcabc  
I x/xabc  
I x/xabc.ml

According to the .hgignore manpage, Mercurial uses just normal Python regular expressions. How is it that I get different results, then? How is it possible that Mercurial found a match for x/xabc.ml?

Does anybody know a less ugly way around the lack of a regular-language complement operator?

A: 

Through some testing, I found two solutions that appear to work. The first is rooted to a subdirectory, and apparently this is significant. The second is brittle, because it only allows one suffix to be used. I'm running these tests on Windows XP (customized to work a bit more unixy) with Mercurial 1.2.1.

(Comments added with # message by me.)

$ hg --version
Mercurial Distributed SCM (version 1.2.1)

$ cat .hgignore
syntax: regexp
^x/.+(?<!\.ml)$       # rooted to x/ subdir
#^.+[^.][^m][^l]$

$ hg status --all
? .hgignore           # not affected by x/ regex
? abc.ml              # not affected by x/ regex
? abcabc              # not affected by x/ regex
? x\saveme.ml         # versioned, is *.ml
I x\abcabc            # ignored, is not *.ml
I x\ignoreme.txt      # ignored, is not *.ml

And the second:

$ cat .hgignore
syntax: regexp
#^x/.+(?<!\.ml)$
^.+[^.][^m][^l]$      # brittle, can only use one suffix

$ hg status --all
? abc.ml              # versioned, is *.ml
? x\saveme.ml         # versioned, is *.ml
I .hgignore           # ignored, is not *.ml
I abcabc              # ignored, is not *.ml
I x\abcabc            # ignored, is not *.ml
I x\ignoreme.txt      # ignored, is not *.ml

The second one has fully expected behavior as I understand the OP. The first only has expected behavior in the subdirectory, but is more flexible.
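
A quick way to see which names the second pattern matches is to try it in plain Python. This is a minimal sketch that tests the regex alone, outside Mercurial; the helper name `ignored` and the extra sample names are illustrative assumptions.

```python
import re

# Second pattern from the .hgignore above: the last three characters
# must differ, position by position, from ".", "m" and "l".
brittle = re.compile(r"^.+[^.][^m][^l]$")


def ignored(name):
    """Return True if the brittle pattern matches the given name."""
    return brittle.match(name) is not None


for name in ["abc.ml", "x/saveme.ml", "abcabc", "x/ignoreme.txt",
             "abc", "abcd", "shame"]:
    print(name, "-", "ignored" if ignored(name) else "kept")
```

Because the pattern needs at least four characters and checks each of the last three positions separately, short names and names such as "shame" (whose second-to-last character is "m") are not matched.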

Roger Pate
Unfortunately, if I try your pattern in .hgignore, I get the following showing up as "?": .hgignore (shouldn't be there), abc.ml (should be), abcabc (shouldn't be), x/abc.ml (should be), y/abc.ml (should be), y/abcabc (shouldn't be). With the OP's regex, it correctly ignores everything but incorrectly also ignores x/abc.ml and y/abc.ml. So I'm not sure this is a workaround.
Vinay Sajip
Which pattern? The one rooted to the subdir (the first one) naturally won't affect files not in that subdir. Except for that one caveat, it works as it should, and the second one works as it should too in all cases. I'll update my post now.
Roger Pate
I meant the first one, as it was the one not commented out. However, your brittle regex above unfortunately also fails for filenames which are <= 3 chars long: for example, "a", "ab" and "abc" are not ignored, while "abcd" is caught.
Vinay Sajip
Yet another reason why it is brittle. :) Thanks for finding that additional bug.
Roger Pate
The first solution requires specifying all directories in .hgignore. This boils down to using a handmade tool to crawl the source tree (and then we can select each file explicitly, so .hgignore is not needed). The second regex causes a file named e.g. 'shame' to be included (no match). This is too inaccurate.
A: 

The problem appears specifically to be that matches in subdirectories are different to the root. Note the following:

$ hg --version
Mercurial Distributed SCM (version 1.1.2)

It's an older version, but it behaves in the same way. My project has the following files:

$ find . -name 'abc*' -print
./x/abcabc
./x/abc.ml
./abcabc
./abc.ml

Here's my .hgignore:

$ cat .hgignore
^.*(?<!\.ml)$

Now, when I run stat:

$ hg stat
? abc.ml

So, hg has failed to pick up x/abc.ml. But is this really a problem with the regular expression? Perhaps not:

$ python
Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41) 
[GCC 4.3.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import mercurial.ignore
>>> import os
>>> root = os.getcwd()
>>> ignorefunc = mercurial.ignore.ignore(root, ['.hgignore'], lambda msg: None)
>>> 
>>> ignorefunc("abc.ml") # No match - this is correct
>>> ignorefunc("abcabc") # Match - this is correct, we want to ignore this
<_sre.SRE_Match object at 0xb7c765d0>
>>> ignorefunc("abcabc").span() 
(0, 6)
>>> ignorefunc("x/abcabc").span() # Match - this is correct, we want to ignore this
(0, 8)
>>> ignorefunc("x/abc.ml") # No match - this is correct!
>>>

Notice that ignorefunc treated abcabc and x/abcabc the same (matched - i.e. ignore) whereas abc.ml and x/abc.ml are also treated the same (no match - i.e. don't ignore).

So, perhaps the logic error is elsewhere in Mercurial, or perhaps I'm looking at the wrong bit of Mercurial (though I'd be surprised if that were the case). Unless I've missed something, maybe a bug (rather than an RFE which Martin Geisler pointed to) needs to be filed against Mercurial.

Vinay Sajip
If you're responding to my first answer, when I changed from ^re/.+ to ^.+ (as the question has), I was able to reproduce the problem. (So I deleted that answer.) Apparently this has something to do with the directory specified, which led me to the current suggestion. I haven't looked at hg's source, but, based on this external testing, it does appear the problem is in there somewhere.
Roger Pate
Er - yes, I was commenting about your first (now deleted) answer. Anyway, my testing appears to point away from the regular expression being the source of the problem.
Vinay Sajip
A: 

The regexes are applied to each leading directory prefix of the path in turn as well as to the full relative path, not only to the entire relative path at once. So if I have a/b/c/d in my repo, each regex will be applied to a, a/b, a/b/c as well as a/b/c/d. If any of these strings matches, the file will be ignored. (You can tell that this is the behaviour by trying ^bar$ with bar/foo: you'll see that bar/foo is ignored.)

^.*(?<!\.ml)$ ignores x/xabc.ml because the pattern matches x (i.e. the subdirectory).

This means that there is no regex that will help you, because your patterns are bound to match the first subdirectory component.
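
The behaviour described above can be imitated in a few lines of Python. This is a rough simulation under stated assumptions, not Mercurial's actual code: the helpers `path_prefixes` and `is_ignored` are hypothetical names, and using `search` on each prefix is an assumption about how the pattern is applied (it makes no difference for patterns anchored with ^).

```python
import re


def path_prefixes(path):
    """For 'a/b/c/d' return ['a', 'a/b', 'a/b/c', 'a/b/c/d']."""
    parts = path.split("/")
    return ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]


def is_ignored(pattern, path):
    """Treat the path as ignored if the pattern matches any prefix of it."""
    regex = re.compile(pattern)
    return any(regex.search(prefix) for prefix in path_prefixes(path))


# The pattern from the question matches the bare directory name "x",
# so everything below x/ counts as ignored in this simulation.
print(is_ignored(r"^.*(?<!\.ml)$", "abc.ml"))
print(is_ignored(r"^.*(?<!\.ml)$", "x/xabc.ml"))
print(is_ignored(r"^bar$", "bar/foo"))
```

In this simulation a top-level abc.ml is kept, while x/xabc.ml is ignored only because the prefix x matches.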

hwiechers
But in my answer above, "x/abcabc" and "x/abc.ml" are treated differently: the first is matched (correctly), whereas the second is not (also correctly).
Vinay Sajip
You're applying the ignore func to the entire relative path, not to the subpaths. If you're looking at the source, take a look at _dirignore in mercurial/dirstate.py. I'm pretty sure that's where this behaviour comes from.
hwiechers
Ah, I was also thinking it was matching against the entire relative path instead. I also thought that man hgignore said it did, but on rereading, it specifies something else.
Roger Pate
Now I see. At the top level the directories "x", "y" etc. are ignored because they match the regex, so we never recurse down into them. What we really want is logic that says "never ignore a directory, only ignore files" even when they match the ignore regexes. Except of course, you don't want that behaviour in general, only in this sort of case.
Vinay Sajip
My tests and _dirignore in dirstate.py indicate that for a/b/c/d in the repo, the following strings are tried: a, a/b, a/b/c, a/b/c/d (but not 'd' itself!). Try ^zz$ against 'zz/a' and 'a/zz'. It's sad that the manpage doesn't explain this.
hwiechers: could you possibly refine your answer accordingly? Thanks.
@dawidtoton..pl: You're right about a/b/c/d. I updated my answer.
hwiechers