I need to read through a log file, extracting all paths, and return a sorted list of the paths containing no duplicates. What's the best way to do it? Using a set?

I thought about something like this:

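import re

def geturls(filename):
  f = open(filename)
  s = set() # creates an empty set?

  for line in f:
    # see if the line matches some regex
    # (assumed pattern: the path between GET and HTTP)
    match = re.search(r'"GET (\S+) HTTP', line)

    if match:
      s.add(match.group(1))

  f.close()

  return sorted(s)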

EDIT

The items put in the set are path strings, which the function should return as a list sorted in alphabetical order.

EDIT 2 Here is some sample data:

10.254.254.28 - - [06/Aug/2007:00:12:20 -0700] "GET /keyser/22300/ HTTP/1.0" 302 528 "-" "Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4"
10.254.254.58 - - [06/Aug/2007:00:10:05 -0700] "GET /edu/languages/google-python-class/images/puzzle/a-baaa.jpg HTTP/1.0" 200 2309 "-" "googlebot-mscrawl-moma (enterprise; bar-XYZ; [email protected],[email protected],[email protected],[email protected])"
10.254.254.28 - - [06/Aug/2007:00:11:08 -0700] "GET /favicon.ico HTTP/1.0" 302 3404 "-" "googlebot-mscrawl-moma (enterprise; bar-XYZ;

The interesting parts are the URLs between GET and HTTP. Maybe I should have mentioned that this is part of an exercise, not real-world data.

+2  A: 

Only if the order doesn't matter (since sets are unordered), and if the types are hashable (which strings are).
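
A small illustration of both conditions, as a sketch only: the paths below are made up for the example, and the `hashable` flag is just a name chosen here to record what happens.

```python
# Made-up paths for illustration; one of them is a duplicate.
paths = ["/favicon.ico", "/keyser/22300/", "/favicon.ico"]

unique = set(paths)       # duplicates collapse; iteration order is not guaranteed
ordered = sorted(unique)  # sorted() returns a new list in alphabetical order

# Unhashable items such as lists cannot be set members.
try:
    {["/favicon.ico"]}
    hashable = True
except TypeError:
    hashable = False
```

Since the set itself has no order, the sorting has to happen once at the end, on the way out of the function.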

Ignacio Vazquez-Abrams
Sorry, just edited the original question.
Helper Method
The answer still holds.
Ignacio Vazquez-Abrams
A: 

You can use a dictionary to store your paths.

from collections import defaultdict
h = defaultdict(str)
uniq = []
for line in open("file"):
    if "pattern" in line:
        # code to extract path here.
        extractedpath = ......
        h[extractedpath.strip()] = ""  # using dictionary to store unique values
        if extractedpath not in uniq:
            uniq.append(extractedpath)  # using a list to store unique values
ghostdog74
What's the point of a dictionary here? A dict where the keys are irrelevant (set to some dummy Value and never used) is essentially a poor man's semantically-wrong, relatively inefficient set.
delnan
The dictionary is just there to store unique path values while iterating over the file. I could have used a list as well.
ghostdog74
@ghostdog74: As @delnan was saying, it would be more efficient to store the paths in a `set` instead of a `dict`, since you never use the value that a dictionary requires for each key (i.e. the `""` dummy value). Regardless of which data structure you use, each dictionary key or set element can appear in the container only once, so there's no need to maintain a separate `uniq` list. Just extract all the dictionary keys or set elements into a list after processing all the lines in the file.
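
To make that concrete, here is a rough sketch comparing the two containers. The `extracted` list stands in for paths already pulled out of the log lines (an assumption for the example; the extraction step itself is not shown).

```python
# Hypothetical, already-extracted paths; one is repeated with stray whitespace.
extracted = ["/keyser/22300/", "/favicon.ico ", "/keyser/22300/"]

as_dict = {}
as_set = set()
for path in extracted:
    as_dict[path.strip()] = ""  # dummy value that is never read
    as_set.add(path.strip())    # same uniqueness, no dummy value needed

# Either container yields the same unique, sorted list at the end,
# so a separate list of seen paths is unnecessary.
from_dict = sorted(as_dict)
from_set = sorted(as_set)
```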
martineau
A: 

Just make sure you use full path names everywhere. On Windows, the same name can show up in different cases, because file names there are case-insensitive. Also, in Python you can use / instead of \ (and if you do use backslashes, be careful to escape them).

If you are actually dealing with URLs, most of the time domain.com, domain.com/, www.domain.com and http://www.domain.com mean the same thing, and you should decide how to normalize them.
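
One possible way to do that, as a sketch only: `normalize` is a hypothetical helper, and lowercasing is only appropriate if the server really treats paths case-insensitively.

```python
def normalize(path):
    """Hypothetical rules: trim whitespace, lowercase, drop a trailing slash."""
    path = path.strip().lower()
    if len(path) > 1 and path.endswith("/"):
        path = path[:-1]  # keep the bare root "/" as it is
    return path

# Made-up paths that differ only by case, whitespace or a trailing slash.
raw = ["/Images/Logo.png", "/images/logo.png ", "/docs/", "/docs", "/"]
unique = sorted({normalize(p) for p in raw})
```

Normalizing before adding to the set is what makes near-duplicates collapse into one entry.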

Tony Veijalainen
What does any of this have to do with the question?
Ignacio Vazquez-Abrams
A set requires an exact match: one letter in a different case, an extra space, a trailing / and so on, and you get duplicate entries.
Tony Veijalainen
+4  A: 
def sorted_paths(filename):
    with open(filename) as f:
        # matches() is whatever function applies your regex to a line
        gen = (matches(line) for line in f)
        s = set(match.group(1) for match in gen if match)
    return sorted(s)
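
The `matches` function is not defined in the snippet above. A minimal runnable sketch, assuming the path is whatever sits between GET and HTTP as in the question's sample data (the `PATH_RE` name and the pattern itself are assumptions, not part of the original answer):

```python
import re

# Assumed pattern: capture the path between GET and HTTP in the request string.
PATH_RE = re.compile(r'"GET (\S+) HTTP')

def matches(line):
    """Return a match object for a request line, or None if no path is found."""
    return PATH_RE.search(line)

def sorted_paths(filename):
    with open(filename) as f:
        gen = (matches(line) for line in f)
        s = set(match.group(1) for match in gen if match)
    return sorted(s)
```

Because both steps are generator expressions, the file is read lazily and only the unique paths are held in memory.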
SilentGhost
If the matching group is the whole line, I think the two generators could be merged into one.
mg
Wow, now that's an elegant solution :-).
Helper Method
+2  A: 

This is a good way of doing it, both in terms of performance and in terms of conciseness.

fas
Yes, using a set to remove duplicates is becoming a common Python idiom, as long as the items are hashable. Before sets were introduced into the language, dictionaries with dummy values were often used instead.
martineau