views:

148

answers:

7

I have a data structure like this:

items = [
    ['Schools', '', '', '32'],
    ['Schools', 'Primary schools', '', '16'],
    ['Schools', 'Secondary schools', '', '16'],
    ['Schools', 'Secondary schools', 'Special ed', '8'],
    ['Schools', 'Secondary schools', 'Non-special ed', '8'],
]

It's a list of spending items. Some are aggregates, e.g. items[0] is aggregate spending on all schools, and items[2] is aggregate spending on secondary schools. Those that are not aggregates are items[1], items[3] and items[4].

How can I elegantly reduce the list so it only shows non-aggregate items? In pseudocode:

for each item in items
   check if item[1] is blank, if it is
       check if item[0] matches another item’s[0]
       if it does and if that item’s[1] isn’t blank
           delete item
   check if item[2] is blank, if it is
       check if item[1] matches another item’s[1]
       if it does and if if that item’s[2] isn’t blank
           delete item 

Here's my (lame!) attempt so far:

for i in range(len(items)):
    i -= 1
    if items[i]:
        if items[i][1] == "":
            for other_item in items:
                if items[i][0]==other_item[0] and other_item[1]!="":
                    items_to_remove.append(i)
                    continue
        elif items[i][2]=="":
            for other_item in items:
                if items[i][1] == other_item[1] and other_item[2] != "":
                    items_to_remove.append(i)
                    continue
new_items = [ key for key,_ in groupby(items_to_remove)]
new_items.sort(reverse=True)  
for number in new_items:
    temp_item = items[number]
    items.remove(temp_item)

This is just so ugly. What can I do better?

NB: I could use dictionaries instead of lists, if that would make life easier :)

A: 

You could use a list comprehension, as follows:

items = [a for a in items if a[1] != '' and a[2] != '']

Or, if an empty string in any position denotes an aggregate item:

items = [a for a in items if '' not in a]

Of course, you don't necessarily need to assign the reduced list to items - you can use it however you like.

Neither of your list comprehensions will produce the desired output. See the originally posted pseudocode for the requirements.
Andrew
-1 If only this worked...
katrielalex
Sorry - misunderstood the requirements. That'll teach me to try and bang out an answer in a spare couple of minutes.
+1  A: 
list_keys = [ "".join(x[:-1]) for x in items ]
for i in range(len(list_keys)-1):
  if not list_keys[i+1].startswith(list_keys[i]):
     print items[i]
print items[-1]

Here I find the "key" of each item, which is all entries in an item, concatenated, except the last value.

An aggregate item's key is always a prefix of succeeding items' keys, so we can use this test to detect aggregate items and dismiss them.

This alg. prints (on your input):

['Schools', 'Primary schools', '', '16']
['Schools', 'Secondary schools', 'Special ed', '4']
['Schools', 'Secondary schools', 'Non-special ed', '4'],

Note:
This assumes all items are ordered neatly in a tree structure (as your original data). If it's not, it'll be (slightly) more complicated as you'll have to sort the keys before the loop (and keep track of which key belongs to which item).

adamk
A: 

How about making your items objects?

class School (object):
    __init__(self, is_aggregate=false):
        self.is_aggregate = is_aggregate
Ubersoldat
A: 

I'm using a helper function that

  • makes a set of the entries with more fine-grained data
  • returns a function suitable for filter that removes aggregate items

Run your example data through this function, and it will give the desired output :)

def remove_aggregates(items):

    def mk_pred(index_i, blank_i, items):
        posts = set(x[index_i] for x in items if x[blank_i] != '')    
        def pred(item):
            return not (item[blank_i] == '' and item[index_i] in posts)
        return pred    

    items = filter(mk_pred(0,1,items), items)
    items = filter(mk_pred(1,2,items), items)
    return items
gnud
+2  A: 

Firstly, I suggest your data structure should look more like this:

items = [
    ['Schools', None, None, 32],
    ['Schools', 'Primary schools', None, 16],
    ['Schools', 'Secondary schools', None, 8],
    ['Schools', 'Secondary schools', 'Special ed', 4],
    ['Schools', 'Secondary schools', 'Non-special ed', 4],
]

We can sort them into a dictionary like this:

result = {}

for item in items:
    if not item[0] in result or not isinstance(result[item[0]], dict): result[item[0]] = {}
    if not item[1] in result[item[0]] or not isinstance(result[item[0]][item[1]], dict): result[item[0]][item[1]] = {}
    if not item[2] in result[item[0]][item[1]] or not isinstance(result[item[0]][item[1]][item[2]], dict): result[item[0]][item[1]][item[2]] = {}

    if not item[0]:
        result = item[3]
    elif not item[1]:
        result[item[0]] = item[3]
    elif not item[2]:
        result[item[0]][item[1]] = item[3]
    else:
        result[item[0]][item[1]][item[2]] = item[3]

And you should end up with a dictionary like:

result = {
    'Schools': {
        'Secondary schools': {
            'Non-special ed': '4',
            'Special ed': '4'
        },
        'Primary schools': '16'
    }
}

My routine could probably be optimized and made recursive.

Also, the numbers total to 24 -- is this an error on your part?

rfw
Yes, that was an error :((( stupid of me because it ruins the explanation. well spotted, I've fixed.
AP257
A: 

You've asked how to do so elegantly, and further, how to do so better. Your nota bene suggests that the structure you're working with is still malleable. If you want to be able to do so more elegantly, I would suggest changing the way the data is stored. Some options are:

Include an additional field in each list, which indicates whether it is an aggregate value or not:

items = [
    ['Schools', '', '', '32', True],
    ['Schools', 'Primary schools', '', '16', False],
    ['Schools', 'Secondary schools', '', '8', True],
    ['Schools', 'Secondary schools', 'Special ed', '4', False],
    ['Schools', 'Secondary schools', 'Non-special ed', '4', False],
]

Split your data into two lists:

items = [
    [
        ['Schools', '', '', '32'],
        ['Schools', 'Secondary schools', '', '8'],
    ],
    [
        ['Schools', 'Primary schools', '', '16'],
        ['Schools', 'Secondary schools', 'Special ed', '4'],
        ['Schools', 'Secondary schools', 'Non-special ed', '4'],
    ],
]

Make the aggregate values contain a list of their children (although this still wouldn't be much fun to reduce):

items = [
    ['Schools', '', '', '32', [
        ['Schools', 'Primary schools', '', '16', []],
        ['Schools', 'Secondary schools', '', '8', [
            ['Schools', 'Secondary schools', 'Special ed', '4'],
            ['Schools', 'Secondary schools', 'Non-special ed', '4'],
        ],
    ],
]

I would say that the current structure of your data does not allow you to do anything elegant with it. You require logic along the lines of "if this index is blank but it isn't for another entry that has the same value at another of my indexes", and this has to be done twice per list entry, because that logic can occur at two separate pairs of index locations. Fix the way you store your information, and you will be able to write an elegant method of reducing that data.

For example, if you went with the first option I listed (using booleans to indicate whether the entry is aggregate), you could reduce the list with:

reduced = [item for item in items where item[4] == False]
Andrew
You are right that the data structure is bad, but I don't think your suggestion is that good either. A better way to do this is with nested `dict`s.
katrielalex
A: 

This attemp tries not to be dependend of sorting of the input:

items = [
    ['Schools', '', '', '32'],
    ['Schools', 'Primary schools', '', '16'],
    ['Schools', 'Secondary schools', '', '16'],
    ['Schools', 'Secondary schools', 'Special ed', '8'],
    ['Schools', 'Secondary schools', 'Non-special ed', '8'],
]

def path(item,upto=None):
    return ".".join([p for p in item[:-1] if p][:upto])    

from collections import defaultdict
children_counter = defaultdict(int)
for i in items:
    children_counter[path(i,-1)] += 1

for i in items:
   if children_counter[path(i)] == 0:
        print i
toka