I need to uncompress all the files in a directory, and for this I need to find the first file in each set. I'm currently doing this using a bunch of if statements and loops. Can I do this using a regex?

Here's a list of files that I need to match:

yes.rar
yes.part1.rar
yes.part01.rar
yes.part001.rar
yes.r01
yes.r001

These should NOT be matched:

no.part2.rar
no.part02.rar
no.part002.rar
no.part011.rar
no.r002
no.r02

I found a similar regex on this thread, but it seems that Python doesn't support variable-length look-behind assertions. A single-line regex would be complicated, but I'll document it well, so that's not a problem. It's just one of those problems you beat your head over.

Thanks in advance guys.

:)

+1  A: 

There's no need to use look-behind assertions for this. Since you start matching from the beginning of the string, you can do everything with look-aheads that you could do with look-behinds. This should work:

^((?!\.part(?!0*1\.rar$)\d+\.rar$).)*\.(?:rar|r?0*1)$
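A quick way to check the pattern is to run it over the file names from the question. This is only a sketch: the helper name `is_first_volume` and the two lists are made up for illustration, and the pattern is copied unchanged from above.

```python
import re

# Pattern copied verbatim from the answer above.
FIRST_VOLUME = re.compile(r"^((?!\.part(?!0*1\.rar$)\d+\.rar$).)*\.(?:rar|r?0*1)$")

# File names taken from the question.
should_match = [
    "yes.rar", "yes.part1.rar", "yes.part01.rar",
    "yes.part001.rar", "yes.r01", "yes.r001",
]
should_not_match = [
    "no.part2.rar", "no.part02.rar", "no.part002.rar",
    "no.part011.rar", "no.r002", "no.r02",
]


def is_first_volume(name):
    """Return True if the name looks like the first file of a set."""
    return FIRST_VOLUME.match(name) is not None


for name in should_match + should_not_match:
    print(name, is_first_volume(name))
```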

To capture the first part of the filename as you requested, you could do this:

^((?:(?!\.part\d+\.rar$).)*)\.(?:(?:part0*1\.)?rar|r?0*1)$
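In Python, the base name ends up in group 1 of the match. A minimal sketch, assuming a hypothetical helper called `first_volume_basename` that returns `None` for anything that is not a first volume:

```python
import re

# Second pattern from the answer above; group 1 holds the base name.
BASENAME = re.compile(
    r"^((?:(?!\.part\d+\.rar$).)*)\.(?:(?:part0*1\.)?rar|r?0*1)$"
)


def first_volume_basename(name):
    """Return the base name of a first volume, or None if it does not match."""
    match = BASENAME.match(name)
    return match.group(1) if match else None


for name in ["yes.part01.rar", "testfile.rar", "new.file.part01.rar",
             "yes.r001", "no.part2.rar"]:
    print(name, "->", first_volume_basename(name))
```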
reko_t
Hi Reko. I have two issues, so I might need to trouble you again.
1. I couldn't match these two:
yes.r01
yes.r001
2. Would it be possible to capture the first part of the filename into a capturing group? Like this:
yes.part01.rar >> yes
testfile.rar >> testfile
new.file.part01.rar >> new.file
Seems that my regex skills are either pathetic or terribly rusty. Many thanks. Mridang.
Mridang Agarwalla
I edited the regexp so that it'll match the cases you specified. The second regexp will also capture the basename of the filename.
reko_t
Hi again Reko, I tried the first regex and it worked as expected, matching even the r001 and r01 types. The second regex that you wrote seems to capture the file name in cases where the file name is .r01, .r001 or something.rar, but it doesn't seem to match the *part* cases. Some more help, please? Thank you for the help.
Mridang Agarwalla
@mridang: See my answer. Don't use this approach because it is wrong.
Welbog
@mridang: Sorry, I had a little error there; the second regexp is fixed now. @Welbog: It's not wrong per se. It answers the original question just fine, although I agree with you that inspecting the actual header of the file is the right way to approach this problem.
reko_t
+4  A: 

Don't rely on the names of the files to determine which one is first. You're going to end up finding an edge case where you get the wrong file.

RAR's headers will tell you which file is the first one in the volume set, assuming they were created with a somewhat recent version of RAR.

HEAD_FLAGS Bit flags:
2 bytes

0x0100 - First volume (set only by RAR 3.0 and later)

So open up each file and examine the RAR headers, looking specifically for the flag that indicates which file is the first volume. This won't fail as long as the archive isn't corrupt and was created by RAR 3.0 or later.


Update: I've just confirmed this by taking a look at some spanning archives in a hex editor. The file headers are constructed exactly as the header description above indicates. It's just a matter of opening the files and reading the header for that flag. The file with that flag is the first volume.
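For what it's worth, here is a rough Python sketch of that check. It is not part of the original answer and rests on a few assumptions about the older (pre-5.0) RAR layout as I understand it: a 7-byte `Rar!` marker, followed by the archive header with a 2-byte CRC, a 1-byte type of 0x73, and the 2-byte little-endian HEAD_FLAGS. The 0x0001 "volume" flag and the function names are also assumptions; only 0x0100 comes from the excerpt above. Files in the newer RAR 5 format, or anything else it doesn't recognise, simply yield `None`.

```python
import struct

RAR4_MARKER = b"Rar!\x1a\x07\x00"  # assumed 7-byte signature of the pre-5.0 format
MAIN_HEAD_TYPE = 0x73              # assumed HEAD_TYPE of the archive header
FLAG_VOLUME = 0x0001               # assumed: archive is part of a multi-volume set
FLAG_FIRST_VOLUME = 0x0100         # first volume (set only by RAR 3.0 and later)


def is_first_volume(data):
    """Inspect the first 12 bytes of a file.

    Returns True or False for a recognised header, or None if the
    bytes do not look like a pre-5.0 RAR archive.
    """
    if len(data) < 12 or not data.startswith(RAR4_MARKER):
        return None
    # HEAD_CRC (2 bytes), HEAD_TYPE (1 byte), HEAD_FLAGS (2 bytes), little-endian.
    _crc, head_type, head_flags = struct.unpack_from("<HBH", data, 7)
    if head_type != MAIN_HEAD_TYPE:
        return None
    if not head_flags & FLAG_VOLUME:
        # Not a multi-volume archive, so treat the file as its own first volume.
        return True
    return bool(head_flags & FLAG_FIRST_VOLUME)


def file_is_first_volume(path):
    """Read just enough of the file at `path` to check the flags."""
    with open(path, "rb") as handle:
        return is_first_volume(handle.read(12))
```

Calling `file_is_first_volume` on every candidate file in the directory and keeping the ones that return `True` would give the starting points for extraction. Archives written by versions older than RAR 3.0 never set the flag, so a filename check may still be needed as a fallback.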

Welbog
Hi Welbog. I understand what you mean. I've been working with Python for a while, but I've never had to work with something similar to this. Could I also use a similar approach for ZIP files? Do you know a good tutorial for reading file headers? Is there a library of some sort? Your solution sounds very robust. I wish my Python skills were equally robust. :( Thank you.
Mridang Agarwalla
Do you know how to read files in Python? Because that's all you have to do. I don't know any Python but I can't imagine file IO is difficult in it.
Welbog
Makes sense now. File IO is pretty easy and I've worked with it. I'll do some digging and look for something similar. I saw some examples of reading JPEG headers. I'll have a look at those to understand how it works. Thanks a ton!
Mridang Agarwalla