This is an example that searches for PDF files in the current directory.

import os, os.path
import re

def print_pdf (arg, dir, files):
 for file in files:
  path = os.path.join(dir, file)
  path = os.path.normcase(path)
  if re.search(r".*\.pdf", path):
   print path

os.path.walk('.', print_pdf, 0)

Could anyone explain what r".*\.pdf" means here?

Why ".*\"?

Thanks!

+7  A: 

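It means any character zero or more times, followed by a literal dot and the letters pdf. Because the asterisk is greedy, the match extends to the last '.pdf' in the subject string. Nothing anchors the pattern to the end of the string, though, so with re.search a name such as report.pdf.bak matches too; add $ to the pattern if the '.pdf' has to be at the very end.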
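
A quick way to see how the pattern behaves is to try it on a few sample paths. The sketch below assumes Python 3 and uses made-up file names (they are not from the question); it compares the pattern from the question with an end-anchored variant and with a plain str.endswith check:

```python
import re

# Hypothetical sample paths, chosen only for illustration.
paths = ["./docs/report.pdf", "./docs/report.pdf.bak", "./docs/notes.txt"]

unanchored = re.compile(r".*\.pdf")  # pattern from the question
anchored = re.compile(r"\.pdf$")     # requires ".pdf" at the end of the string

# re.search succeeds if the pattern matches anywhere in the string.
loose_hits = [p for p in paths if unanchored.search(p)]
strict_hits = [p for p in paths if anchored.search(p)]
suffix_hits = [p for p in paths if p.endswith(".pdf")]

print(loose_hits)
print(strict_hits)
print(suffix_hits)
```

With re.search the unanchored pattern also accepts report.pdf.bak, while the anchored pattern and endswith accept only report.pdf.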

There is the glob module to do this the right way:

>>> glob.glob(os.path.join(dirname, '*.pdf'))
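
For anyone who wants to try the glob call in isolation, here is a minimal sketch, assuming Python 3. The helper name find_pdfs and the file names are hypothetical, and the temporary directory exists only to make the example self-contained:

```python
import glob
import os
import tempfile

def find_pdfs(dirname):
    """Return the sorted names of entries in dirname that match *.pdf."""
    matches = glob.glob(os.path.join(dirname, "*.pdf"))
    return sorted(os.path.basename(m) for m in matches)

# Build a throwaway directory with a few hypothetical files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.pdf", "b.txt", "c.pdf"):
        open(os.path.join(tmp, name), "w").close()
    found = find_pdfs(tmp)

print(found)
```

Note that this pattern only looks at dirname itself; unlike the walk in the question, it does not descend into subdirectories.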
SilentGhost
Or at least get rid of the regex with `path.endswith('.pdf')`
prestomation
+2  A: 

The . means match any character except "\n". The * means "repeat the preceding element 0 or more times". The \. matches an actual ".".
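
Each of these three pieces can be checked on its own. This is only a sketch, assuming Python 3.4 or later (for re.fullmatch); the variable names are made up for the example:

```python
import re

# "." matches any single character except a newline (unless re.DOTALL is set).
dot_matches_letter = re.fullmatch(r".", "a") is not None
dot_matches_newline = re.fullmatch(r".", "\n") is not None

# "*" repeats the preceding element zero or more times.
star_matches_empty = re.fullmatch(r"a*", "") is not None
star_matches_many = re.fullmatch(r"a*", "aaaa") is not None

# "\." matches only a literal dot.
escaped_matches_dot = re.fullmatch(r"\.", ".") is not None
escaped_matches_letter = re.fullmatch(r"\.", "a") is not None

print(dot_matches_letter, dot_matches_newline)
print(star_matches_empty, star_matches_many)
print(escaped_matches_dot, escaped_matches_letter)
```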

BTW, this is all in the docs.

Ignacio Vazquez-Abrams
+1  A: 

This searches for a string containing zero or more chars followed by ".pdf". The .* is a common idiom in regexps and it means match any char 0 or more times. The \. is there because in regexps the . has a special meaning, and the \ escapes it.

Aaron
+3  A: 

Why ".*\"?

Wrong question, you missed a crucial character of the expression. ;-)

In fact, .* will match any character (. in regex), as many times as possible (* in regex; it applies to the preceding element, so . in this case).

\., on the other hand, will match exactly one dot (.). \ escapes the following character (.) so that it no longer has its special meaning (i.e. in this case “match any character”) but is instead treated as-is.
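
The difference the backslash makes shows up on a name that contains "pdf" but no dot. A small sketch, assuming Python 3, with a made-up file name:

```python
import re

name = "my_pdf"  # hypothetical name with no dot in it

# Unescaped: the second "." matches any character (here "_"), so this matches.
without_escape = re.search(r".*.pdf", name) is not None

# Escaped: a literal "." is required before "pdf", so this name is rejected.
with_escape = re.search(r".*\.pdf", name) is not None

print(without_escape, with_escape)
```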

Konrad Rudolph
Thank you for the explanation! Also thanks to SilentGhost, but I can only choose one answer. :)
nonamelive
A: 

The period (.)
will match any character except newlines

The following asterisk (*)
means any number of repetitions (including none) of the preceding period

The backslash (\)
escapes the period in .pdf, so the pattern looks for a real period: only a literal .pdf matches, not "any character" followed by pdf.

So in the end it looks for any piece of text followed by a literal .pdf (typically the file extension).

S.Hoekstra
A: 

use os.walk() instead. And there's no need to use regex.

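import os

path = '.'  # directory to search, as in the question
for root, dirs, files in os.walk(path):
    for name in files:
        # Case-insensitive check of the file extension.
        if name.lower().endswith(".pdf"):
            print("found pdf: " + os.path.join(root, name))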
ghostdog74