views:

124

answers:

3

The following key:value pairs are 'page' and 'page contents'.

{
  'section-a.html':{'contents':'section-b.html section-c.html section-d.html'},
  'section-b.html':{'contents':'section-d.html section-e.html'},
  'section-c.html':{'contents':'product-a.html product-b.html product-c.html product-d.html'},
  'section-d.html':{'contents':'product-a.html product-c.html'},
  'section-e.html':{'contents':'product-b.html product-d.html'},
  'product-a.html':{'contents':''},
  'product-b.html':{'contents':''},
  'product-c.html':{'contents':''},
  'product-d.html':{'contents':''}
}

For any given 'item' how could I find the path(s) to said item? With my very limited knowledge of data structures in most cases, I'm assuming this would be a hierarchy tree. Please correct me if I'm wrong!

UPDATE: My apologies, I should have been more clear about the data and my expected outcome.

Assuming 'page-a' is an index, each 'page' is literally a page appearing on a website, where as each 'item' is something like a product page that would appear on Amazon, Newegg, etc.

Thus, my expected output for 'item-d' would be a path (or paths) to that item. For example (delimiter is arbitrary, for illustration here): item-d has the following paths:

page-a > page-b > page-e > item-d
page-a > page-c > item-d

UPDATE2: Updated my original dict to provide more accurate and real data. '.html' added for clarification.

+2  A: 

Here's a simple approach -- it's O(N squared), so, not all that highly scalable, but will serve you well for a reasonable book size (if you have, say, millions of pages, you need to be thinking about a very different and less simple approach;-).

First, make a more usable dict, mapping page to set of contents: e.g., if the original dict is d, make another dict mud as:

mud = dict((p, set(d[p]['contents'].split())) for p in d)

Then, make the dict mapping each page to its parent pages:

parent = dict((p, [k for k in mud if p in mud[k]]) for p in mud)

Here, I'm using lists of parent pages (sets would be fine too), but that's OK for pages with 0 or 1 parents as in your example, too -- you'll just be using an empty list to mean "no parent", else a list with the parent as the one and only item. This should be an acyclic directed graph (if you're in doubt, you can check, of course, but I'm skipping that check).

Now, given a page, finding the paths up its parent(s) to a parentless-parent ("root page") simply require "walking" the parent dict. E.g., in the 0/1 parent case:

path = [page]
while parent[path[-1]]:
  path.append(parent[path[-1]][0])

If you can clarify your specs better (ranges of number of pages per book, number of parents per page, and so on), this code can no doubt be refined, but as a start I hope it can help.

Edit: as the OP clarified that cases with > 1 parent (and so, multiple paths) are indeed of interest, let me show how do deal with that:

partial_paths = [ [page] ]
while partial_paths:
  path = partial_paths.pop()
  if parent[path[-1]]:
    # add as many partial paths as open from here
    for p in parent[path[-1]]:
      partial_paths.append(path + [p])
  else:
    # we've reached a root (parentless node)
    print(path)

Of course, instead of printing, you can yield each path when it reaches a root (making the function whose body this is into a generator), or otherwise treat it in whatever way you require.

Edit again: a commenter is worried about cycles in the graph. If that worry's warranted, it's not hard to keep track of nodes already seen in a path and detect and warn about any cycles. Fastest is to keep a set alongside each list representing a partial path (we need the list for ordering, but checking for membership is O(1) in sets vs O(N) in lists):

partial_paths = [ ([page], set([page])) ]
while partial_paths:
  path, pset = partial_paths.pop()
  if parent[path[-1]]:
    # add as many partial paths as open from here
    for p in parent[path[-1]]:
      if p in pset:
        print('Cycle: %s (%s)' % (path, p))
        continue
      partial_paths.append((path + [p], pset.union([p])))
  else:
    # we've reached a root (parentless node)
    print('Path: %s' % (path,))

It's probably worthwhile, for clarity, packing the list and set representing a partial path into a small utility class Path with suitable methods.

Alex Martelli
Thank you for your answer. I updated my original question with more clarification. Your code still holds true, except it doesn't seem to take into account multiple paths.
gibson
@gibson, tx for clarifying -- let me expand the answer to show how to deal with nodes with > 1 parent.
Alex Martelli
Alex, why do you think the graph will be acyclic? Or did you mean "should" in the sense that "this should be acyclic, or there will be problems"?
Robert Rossney
@Robert, the OP refers to the structure as "a tree" -- and that's definitionally acyclic. I think it's a more general DAG, but it's not hard to detect cycles if you fear them -- let me edit the answer to show that.
Alex Martelli
Apologies for responding so late to this. The code in blocks 4 and 5 work perfectly. My interest lies more in the 5th block. Robert and Alex: you both mention 'cycles'. Can you please clarify what a 'cycle' is in terms of my expected tree? Is this simply some kind of loop in the tree? My understanding of most trees in general is very limited, so an explanation would be most appreciated. @Alex, I also noticed that block 4 tends to use up a very large amount of system resources if let run. Is this related to 'cycles' in my tree? Thanks again for the explanation and the edits in your answer.
gibson
Also, I am accepting this answer because it has fulfilled my original question.
gibson
@gibson, yes, a cycle is the same thing as a loop -- a graph having cycles (aka loops) is not in fact a tree, because the definition of tree includes its being acyclical, but the last version makes sure not to loop forever even if the input graph is imperfect (and has cycles). As for performance, I mentioned at the very start that these simple approaches are quadratic, but given your example size the resource consumption should not be huge anyway -- but, if your input is actually very large, you may need a different (and much more complicated) approach.
Alex Martelli
A: 

EDIT With the question explained a bit better I think the following might be what you need, or could at least provide something of a starting point.

data = {
  'section-a.html':{'contents':'section-b.html section-c.html section-d.html'},
  'section-b.html':{'contents':'section-d.html section-e.html'},
  'section-c.html':{'contents':\
                    'product-a.html product-b.html product-c.html product-d.html'},
  'section-d.html':{'contents':'product-a.html product-c.html'},
  'section-e.html':{'contents':'product-b.html product-d.html'},
  'product-a.html':{'contents':''},
  'product-b.html':{'contents':''},
  'product-c.html':{'contents':''},
  'product-d.html':{'contents':''}
}

def findSingleItemInData(item):
    return map( lambda x: (item, x), \
                [key for key in data if data[key]['contents'].find(item) <> -1])

def trace(text):
    searchResult = findSingleItemInData(text)
    if not searchResult:
     return text

    retval = [] 
    for item in searchResult:
     retval.append([text, trace(item[-1])]) 

    return retval

print trace('product-d.html')

OLD

I don't really know what you expect to see, but maybe something like this will work.

data = {
   'page-a':{'contents':'page-b page-c'},
   'page-b':{'contents':'page-d page-e'},
   'page-c':{'contents':'item-a item-b item-c item-d'},
   'page-d':{'contents':'item-a item-c'},
   'page-e':{'contents':'item-b item-d'}
}

itemToFind = 'item-c'

for key in data:
  for index, item in enumerate(data[key]['contents'].split()):
    if item == itemToFind:
      print key, 'at position', index

It would be easier, and I think more correct, if you'd use a slightly different data structure:

 data = {
   'page-a':{'contents':['page-b', 'page-c']},
   'page-b':{'contents':['page-d', 'page-e']},
   'page-c':{'contents':['item-a', 'item-b item-c item-d']},
   'page-d':{'contents':['item-a', 'item-c']},
   'page-e':{'contents':['item-b', 'item-d']}
 }

Then you wouldn't need to split.

Given that last case, it can even be expressed a bit shorter:

for key in data:
    print [ (key, index, value) for index,value in \
             enumerate(data[key]['contents']) if value == 'item-c' ]

And even shorter, with the empty lists removed:

print filter(None, [[ (key, index, value) for index,value in \ 
       enumerate(data[key]['contents']) if value == 'item-c' ] for key in data])

That should be a single line, but I used \ as line break indicator so it can be read without scrollbars.

extraneon
Thanks very much for the effort. Your third and fourth blocks of code seem to be on the right track, but they don't give any useful output. Block #3 outputs all empty lists and #4 outputs nothing.
gibson
Did you add print before it to see what's produced?
extraneon
I did. The error I found was in my own code. I had forgotten to use lists in 'contents' instead of space-separated. Dumb mistake. However, this output doesn't appear to provide paths to any given page. As stated in my (now updated, for clarification) question, I'm looking for 'paths' to each page (page being on a website).
gibson
A: 

Here's an illustration for your question. It is easier to reason about graphs when you have a picture.

First, abbreviate the data:

#!/usr/bin/perl -pe
s/section-([a-e])\.html/uc$1/eg; s/product-([a-e])\.html/$1/g

Result:

# graph as adj list
DATA = {
  'A':{'contents':'B C D'},
  'B':{'contents':'D E'},
  'C':{'contents':'a b c d'},
  'D':{'contents':'a c'},
  'E':{'contents':'b d'},
  'a':{'contents':''},
  'b':{'contents':''},
  'c':{'contents':''},
  'd':{'contents':''}
}

Convert to graphviz's format:

with open('data.dot', 'w') as f:
    print >> f, 'digraph {'
    for node, v in data.iteritems():
        for child in v['contents'].split():
            print >> f, '%s -> %s;' % (node, child),
        if v['contents']: # don't print empty lines
            print >> f
    print >> f, '}'

Result:

digraph {
A -> C; A -> B; A -> D;
C -> a; C -> b; C -> c; C -> d;
B -> E; B -> D;
E -> b; E -> d;
D -> a; D -> c;
}

Plot the graph:

$ dot -Tpng -O data.dot

data.dot

J.F. Sebastian