I need to read through a log file, extract all the paths, and return a sorted list of those paths with no duplicates. What's the best way to do this? Using a set?
I thought about something like this:
    import re

    def geturls(filename):
        s = set()  # set() creates an empty set; {} would create an empty dict
        with open(filename) as f:  # with-block closes the file automatically
            for line in f:
                # match the path between "GET " and " HTTP" (see sample data below)
                match = re.search(r'"GET (\S+) HTTP', line)
                if match:
                    s.add(match.group(1))
        return sorted(s)
EDIT
The items put in the set are path strings, which should be returned by the function as a list sorted into alphabetical order.
EDIT 2 Here is some sample data:
    10.254.254.28 - - [06/Aug/2007:00:12:20 -0700] "GET /keyser/22300/ HTTP/1.0" 302 528 "-" "Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4"
    10.254.254.58 - - [06/Aug/2007:00:10:05 -0700] "GET /edu/languages/google-python-class/images/puzzle/a-baaa.jpg HTTP/1.0" 200 2309 "-" "googlebot-mscrawl-moma (enterprise; bar-XYZ; [email protected],[email protected],[email protected],[email protected])"
    10.254.254.28 - - [06/Aug/2007:00:11:08 -0700] "GET /favicon.ico HTTP/1.0" 302 3404 "-" "googlebot-mscrawl-moma (enterprise; bar-XYZ;
The interesting part is the URLs between GET and HTTP. Maybe I should have mentioned that this is part of an exercise, not real-world data.
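For reference, here is a minimal, self-contained sketch of the approach: extract whatever sits between `"GET ` and ` HTTP` with a regex, drop duplicates with a set, and return the sorted result. The regex and the `finditer` loop are assumptions based on the sample data above (one Apache-style access-log record per line; `finditer` also copes if several records end up on one line).

```python
import re

# Assumed pattern: the request path appears between '"GET ' and ' HTTP'
# in each Apache-style access-log record.
LOG_PATTERN = re.compile(r'"GET (\S+) HTTP')

def geturls(filename):
    """Return a sorted list of the unique request paths in the log file."""
    paths = set()  # a set silently discards duplicate paths
    with open(filename) as f:
        for line in f:
            # finditer handles lines that contain more than one log record
            for match in LOG_PATTERN.finditer(line):
                paths.add(match.group(1))
    return sorted(paths)
```

On the sample data above, this would return the three paths `/edu/.../a-baaa.jpg`, `/favicon.ico`, and `/keyser/22300/` in alphabetical order, each appearing once even if the log repeats them.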