Hi All!!
This is my first question, so I hope I didn't mess too much with the title and the formatting.

I have a bunch of files a client of mine sent me, named in this form:

Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext

What I need is a regex to output just:

212 The Actual Title Of the Chapter

I'm not gonna use it with any script language in particular; it's a batch renaming of files through an app supporting regex (which already "preserves" the extension).

So far, all I was able to do was this:

/.*x(\d+)\.(.*?)\.[A-Z]{3}.*/ -->REPLACE: $1 $2

(Capture everything before a number preceded by an "x", group numbers after the "x", group everything following until a 3-letter uppercase word is met, then capture everything that follows)
which gives me back:

212 The.Actual.Title.Of.the.Chapter

Having seen the result I thought that something like:

/.*x(\d+)\.([^.]*?)\.[A-Z]{3}.*/ -->REPLACE: $1 $2

(Changed second group to "Capture everything which is not a dot...") would have worked as expected. Instead, the whole regex fails to match completely.

What am I missing?

TIA

cià
ale

+2  A: 

.*x(\d+)\. matches Name.Of.Chapter.021x212.

\.[A-Z]{3}.* matches .DOC.NAME-Some.stuff.Here.ext

But ([^.]*?) does not match The.Actual.Title.Of.the.Chapter because this regex does not allow for any periods at all.
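To see the difference concretely, here is a minimal sketch that runs both patterns. It uses Python's `re` module as a stand-in for the renaming app's engine, which is an assumption; the app may differ in details, but this behaviour is shared by Perl-style flavours in general.

```python
import re

name = ("Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter"
        ".DOC.NAME-Some.stuff.Here.ext")

# Original pattern: the lazy (.*?) is allowed to cross dots, so it keeps
# growing until it is followed by a dot and three capitals (".DOC").
first = re.search(r".*x(\d+)\.(.*?)\.[A-Z]{3}.*", name)

# Modified pattern: ([^.]*?) has to stop before the first dot, so the most
# it can hold is "The". What follows is ".Act", not a dot plus three
# capitals, and there is nowhere else to backtrack to, so the match fails.
second = re.search(r".*x(\d+)\.([^.]*?)\.[A-Z]{3}.*", name)

print(first.groups())  # ('212', 'The.Actual.Title.Of.the.Chapter')
print(second)          # None
```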

unutbu
But if I only run "([^.]*?)\." on "The.Actual.Title.Of.the.Chapter" (or the whole original string), it matches. How come it doesn't in the full regex above? And, btw... any hint on what I should do to get the result I need? Tnx.
ALFABreezE
You have a regex which should allow you to rename the file to `212 The.Actual.Title.Of.the.Chapter`. Why not simply do a second pass with your renamer app which removes periods?
unutbu
As I told **ghostdog**, at this point it's more a matter of *learning* regex than having the job done. I could use the shell, AppleScript or even the renamer app alone without any regex (and a few steps more), but what I really want to know now is if and how can it be done with regex in a single step. Sure enough, if it turns out it's not possible, the two-pass solution would be my best choice ;) **TNX!**
ALFABreezE
+1  A: 

Since you are on a Mac, you could use the shell:

$ s="Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext"

$ echo ${s#*x}
212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext

$ t=${s#*x}

$ echo ${t%.[A-Z][A-Z][A-Z].*}
212.The.Actual.Title.Of.the.Chapter

Or, if you prefer sed, e.g.:

echo "$filename" | sed 's|.[^x]*x||;s/\.[A-Z][A-Z][A-Z].*//'

For processing multiple files

for file in *.ext
do
  newfile=${file#*x}
  newfile=${newfile%.[A-Z][A-Z][A-Z].*}
  # or 
  # newfile=$(echo "$file" | sed 's|.[^x]*x||;s/\.[A-Z][A-Z][A-Z].*//')
  mv "$file" "$newfile"
done 
ghostdog74
Yes, you're totally right, **ghostdog**. But I've been stuck with finding a regex solution to this for such a long time that now it's more about learning something about regex than having the job done! ;) **TNX!**
ALFABreezE
@ALFABreezE, some regex tools (like Perl) let you put executable code in the replacement string. You'd still be doing exactly what @ghostdog74 did here, just in one tightly-packed line of code. Is that what you're looking for?
Alan Moore
@Alan Moore - I don't know what @ghostdog74's script does *(pleeease don't tell him! He's been SO nice... :)*. I don't know shell script. I'm using **Name Mangler** (http://www.manytricks.com/namemangler/) which is based on "AGRegEx" (http://sourceforge.net/projects/agkit/) which claims to be *"a Perl-compatible regular expression framework, based on PCRE Library"*. But I don't know if I can put executable code in the "Replace" field of the Name Mangler's GUI. Ok.. Now I feel guilty and think I should give @ghostdog script a try, at least. But first I have to find out what "sed" is...
ALFABreezE
@ALFA: PCRE emulates the syntax and much of the functionality of Perl's regex matching, but not executable code in replacement strings. That doesn't necessarily mean *AGKit* doesn't support it, but that capability would have to be provided by AGKit (or maybe AGRegex). That's how it works in PHP, which also uses PCRE under the hood.
Alan Moore
@Alan Moore - Mmmm... Ok, that's WAY too much for me; I think I'm just gonna trust you. :D Btw, I just noticed that "Name Mangler" has its own scripting language which can handle (probably) multiple pass of regexes and save them as a single action / droplet (haven't tried, yet). But we're going a little OT. All I wanted to know was if it was possible doing it all with a *single* regex string. And it's not. **Thank You All** for your kindness and patience! :)
ALFABreezE
+1  A: 

To your question "How can I remove the dots in the process of matching?" the answer is "You can't." The only way to do that is by processing the result of the match in a second step, as others have said. But I think there's a more basic question that needs to be addressed, which is "What does it mean for a regex to match a given input?"
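For illustration, the two-step approach might look like the sketch below. It is written in Python rather than in the renamer app, and `clean_name` is a hypothetical helper name invented for this example; the pattern is the one from the question.

```python
import re

def clean_name(name):
    """Hypothetical helper: return '<number> <title with spaces>' or None."""
    # Step 1: match, capturing the number and the still-dotted title.
    m = re.search(r".*x(\d+)\.(.*?)\.[A-Z]{3}.*", name)
    if m is None:
        return None
    number, title = m.groups()
    # Step 2: post-process the captured text; the regex match itself
    # cannot drop the dots from what it captured.
    return f"{number} {title.replace('.', ' ')}"

result = clean_name(
    "Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter"
    ".DOC.NAME-Some.stuff.Here.ext"
)
print(result)  # 212 The Actual Title Of the Chapter
```

The point of the sketch is only that the dot removal happens after the match, on the captured text.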

A regex is usually said to match a string when it describes any substring of that string. If you want to be sure the regex describes the whole string, you need to add the start (^) and end ($) anchors:

/^.*x(\d+)\.(.*?)\.[A-Z]{3}.*$/  

But in your case, you don't need to describe the whole string; if you get rid of the .* at either end, it will serve you just as well:

/x(\d+)\.(.*?)\.[A-Z]{3}/  

I recommend you not get in the habit of "padding" regexes with .* at beginning and end. The leading .* in particular can change the behavior of the regex in unexpected ways. For example, if there were two places in the input string where x(\d+)\. could match, your "real" match would have started at the second one. Also, if it's not anchored with ^ or \A, a leading .* can make the whole regex much less efficient.
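A small sketch of that "second place" effect, using Python's `re` module and a made-up input string (the string is hypothetical, chosen only because it contains two candidates):

```python
import re

# Hypothetical input with two places where x(\d+)\. could match.
s = "1x11.first.2x22.second"

# The greedy leading .* first swallows the whole string and then backs off
# just far enough, so the capture comes from the LAST candidate.
padded = re.search(r".*x(\d+)\.", s)

# Without the padding, the engine scans left to right and stops at the
# FIRST candidate.
bare = re.search(r"x(\d+)\.", s)

print(padded.group(1))  # 22
print(bare.group(1))    # 11
```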

I said "usually" above because some tools do automatically "anchor" the match at the beginning (Python's match()) or at both ends (Java's matches()), but that's pretty rare. Most of the shells and command-line tools available on *nix systems define a regex match in the traditional way, but it's a good idea to say what tool(s) you're using, just in case.
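Python happens to offer all three behaviours side by side, which makes the distinction easy to try out. A minimal sketch with the unpadded regex and the file name from the question:

```python
import re

name = ("Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter"
        ".DOC.NAME-Some.stuff.Here.ext")
bare = re.compile(r"x(\d+)\.(.*?)\.[A-Z]{3}")
padded = re.compile(r".*x(\d+)\.(.*?)\.[A-Z]{3}.*")

# search(): the traditional meaning, a match anywhere in the string.
found = bare.search(name)
print(found.group(0))        # x212.The.Actual.Title.Of.the.Chapter.DOC

# match(): implicitly anchored at the start; the name begins with "N",
# not "x", so the unpadded regex fails here.
print(bare.match(name))      # None

# fullmatch(): implicitly anchored at both ends.
print(bare.fullmatch(name))  # None
whole = padded.fullmatch(name)
print(whole.group(0) == name)  # True
```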

Finally, a word or two about vocabulary. The parentheses in (\d+) cause the matched characters to be captured, not grouped. Many regex flavors also support non-capturing parentheses in the form (?:\d+), which are used for grouping only. Any text that is included in the overall match, whether it's captured or not, is said to have been consumed (not captured). The way you used the words "capture" and "group" in your question is guaranteed to cause maximum confusion in anyone who assumes you know what you're talking about. :D
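The capture/consume distinction can be seen in a short Python sketch (the sample string is just a fragment of the file name from the question):

```python
import re

s = "021x212.The"

# Capturing parentheses: the digits are consumed AND saved as group 1.
capturing = re.search(r"x(\d+)\.", s)

# Non-capturing parentheses: the same text is consumed, nothing is saved.
grouping_only = re.search(r"x(?:\d+)\.", s)

print(capturing.group(0), capturing.groups())          # x212. ('212',)
print(grouping_only.group(0), grouping_only.groups())  # x212. ()
```

Both regexes consume "x212."; only the first one captures anything.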

If you haven't read it yet, check out this excellent tutorial.

Alan Moore
Uhu... Sorry if I gave the wrong impression: *I have NO IDEA of what I'm talking about, folks!!* :D As for your answer... It's pretty much exhaustive and, at least, should put an end to my struggles *(thou' I'm pretty sure I'll go on tryin' for a couple of days more :)*. And, yep, I've already taken a glimpse at that "tutorial" (should rather be called a "Bible", actually;) I'll dig into it some more, I promise. As for the `.*` 'padding' stuff, instead, I need to... ehm... "consume(?)" it, 'cause if I don't the regex matches, but I can't replace the *whole text* with the result. **TNX!**
ALFABreezE
If you want the regex to consume the whole input, it should have anchors at both ends in addition to the `.*` padding. But I thought you wanted to extract part of it and do some processing on that part. Did @ghostdog's solution not yield the correct result, even though it wasn't in the form of a single regex replacement?
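To illustrate why the padding matters for a replace operation, here is a sketch using Python's `re.sub` as a stand-in for the renamer's search-and-replace (an assumption; note that Python writes the backreferences as `\1 \2` where the app uses `$1 $2`):

```python
import re

name = ("Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter"
        ".DOC.NAME-Some.stuff.Here.ext")

# Unpadded: only the matched part is replaced; the text before and after
# the match is left in place.
partial = re.sub(r"x(\d+)\.(.*?)\.[A-Z]{3}", r"\1 \2", name)

# Padded and anchored: the regex consumes the whole input, so the
# replacement stands in for everything.
whole = re.sub(r"^.*x(\d+)\.(.*?)\.[A-Z]{3}.*$", r"\1 \2", name)

print(partial)
print(whole)  # 212 The.Actual.Title.Of.the.Chapter
```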
Alan Moore