views:

184

answers:

7

I have a list of documents and I want to display them grouped by the first letter of their name on a web-page, over three columns.

In short, something like this:

A | C | E
A | D | F
B | D | F
B | D | F
  | D |

An important difference from say, the Windows Explorer view style is that I want letters to stay with each other. No breaking mid-group. To accommodate this, I don't care if one column is a couple of entries too tall.

I've started by sorting the array of documents by name and splitting them into a nested array. So I know (or can easily find out):

  • How many unique letters there are
  • How many letters there are in each group
  • How many entries there are in total
  • The mean average of how many values there should be in each column (ideally but not neccessarily)

I don't care what your answers come in. I'm looking for the algorithm rather than the implementation so you can code in anything you like (except perhaps Fortran). an explanation in HTML might be a toughie too.

I invite somebody to go wild on the tags because I couldn't think of anything relevant and no, this isn't homework, so please don't mark it as such.

+3  A: 

Well, you can expect always to have some extra rows in each column. I mean, if you have 2 A's, 2 B's and 33 C's, then the third column would be pretty tall compared to others.

It's not the Knapsack problem because they have to be in order.

What you can do is:

  • Count the number of items.
  • See where the first third would fall.
  • If it's exactly a letter change spot, then you're lucky :)
  • If not, then minimize the distance between the third part split spot and the previous / next letter change spot - i.e. if there's a letter change 2 entries before and 10 entries after, then go for the previous one.
  • Finally, take the rest, divide by two and follow the same logic to split as near as you can from the mean value.
Seb
It is more like the packing problem which is a deriviation of the knapsack problem, they all fall under combinatrics (or whatever you call that).
leppie
+5  A: 

Perhaps it helps if you look at the problem like this:

For your example, you have a string like this:

AA BB C DDDD E FFF

The space positions are the places where you could start a new column. Everywhere else you mustn't to keep same letters in the same column. So you actually can mark the space position like this:

AA1BB2C3DDDD4E5FFF

Now you have 5 positions where you can either break the column or not, as it's a binary decision, use a string of 0's and 1's for this and brute force every possible combination:

12345

00000 -> no break at all, column count = 1, max. lines = 13
...
01010 -> your example, column count = 3, max. lines = 5
...
11111 -> breaks everywhere, column count = 6, max. lines = 4

This is a brute force attempt, but you can easily see that the 1's count affects the column count (column count = number of 1's + 1) and you want to minimize max. lines, this should be possible to somehow calculate without having to test each combination.

EDIT2: Didn't recognize you want 3 columns, this makes it easier as you know you will have only 3 1's, but it's still brute force.

EDIT: Another approach I'd favorize:

Write the letter counts like this:

A B C D E F
2 2 1 4 1 3

You can now join letters that are next to each other. Always chose the two letters with the lowest count sum:

2 2 1 4 1 3 - lowest = "2 1"
2  3  4 1 3 - lowest = "1 3"
2  3  4  4  - lowest = "2 3"
  5   4  4  - stop now, as we have 3 columns now

Result: AABBC, DDDD, EFFF

This perhaps won't lead to the optimal solution, but it's a nice and easy way to solve your problem, I think.

schnaader
Your second solution looks beautiful on paper but I'm not sure how I'd do that programmatically. Looks like there'd be an ungodly amount of repetition.
Oli
You could easily do it programmatically. Use a list or an array that contains the letter counts at first. Then calculate sum(i)=count(i)+count(i+1) for each item and join where sum(i) is lowest. Do this until you have only 3 columns left.
schnaader
One problem with your (not terribly unattractive) greedy solution is that you can't handle tiebreaks. Consider the case where you only want two breaks and the sequence is 2 1 2 2. The optimal division is 3/4, but if you group 1 to the right you can't achieve it.
zweiterlinde
Yes, as I said, it won't lead to the optimal solution for all cases, and 3/4 is much better than 5/2. In those cases you could try joining both of the tiebreak combinations instead of just one and take the best result at the end.
schnaader
I've gone with the second approach. You were right - It's pretty simple to code out once you think about it for a minute. Not terribly efficient but that's not an issue where it's being used.
Oli
Note that you're implicitly relaxing the "[allow] one column [to be a] couple entries too tall" constraint with this approach.
tvanfosson
+3  A: 

There is no general solution to this problem given your constraints unless the input is also bounded. Consider, for example, a collection with a single document starting with letters A, B, C, E, and F and 15 (or a million) documents starting with D. In order to group all of D in one column, the column length has to be at least 15. If you use more than two columns then at best column 1 will have a length of 3, column 2 will have a length of 15 (or a million), and column 3 a length of 2. This violates your "within a couple of entries" constraint.

You need to decide if the constraint on having columns not break on letters is important enough to warrant the potential large disparities between column sizes or if the inputs are constrained such that the problem may be solveable with the given constraints. Personally I would rethink the interface as solving an optimization problem just to keep the letters together seems like overkill.

tvanfosson
A: 

I think you should start from defining a kind of "measure" which will tell you which layout is the best one, e.g. take the sum of the (average_size - actual_size(column))^2 for all columns. Then, because you always have 3 columns (is that right?) it should be reasonably fast to take all possible divisions and find the one maximising your measure.

Grzenio
A: 

First, make a pass over the documents to build an array of letter->count tuples.

The first entry is (first letter in array) -> document 0

Then find the entries which should appear in the second and third columns by walking through the array, adding the counts, but stopping just before you would pass the threshold for the 2nd and 3rd column (which is a 1/3 and 2/3 of the total count).

Denis Hennessy
A: 

This problem lends itself to a recursive solution---possibly classic dynamic programming, although I haven't worked it out exactly.

You have a fixed number of potential split points, and a certain number of splits to make. You should be able to have something like

(splits, max_ht, min_ht) = split_list(list, requested_splits, 
                                      curr_max, curr_min)

The function should iterate over the potential split points and recursively call itself (with one less requested split) on the remainder of list. E.g.,

def split_list(list, requested_splits, curr_max, curr_min):
    best_splits = []
    best_split_len = curr_max-curr_min
    best_max = curr_max
    best_min = curr_min

    if requested_splits == 0:
        return (best_splits, curr_max, curr_min)
    for candidate in candidate_split_points:
        this_max = max(curr_max, len(<list up to split point>)
        this_min = min(curr_min, len(<list up to split point>)
        (splits, new_max, new_min) = split_list(<list after split point>,
                                                requested_splits-1,
                                                this_max, this_min)
        if (new_max-new_min) < best_split_len:
            best_split_len = new_max - new_min
            best_max = new_max
            best_min = new_min
            best_splits = [candidate] + splits
    return (best_splits, best_max, best_min)
zweiterlinde
A: 

Here's something you could try. Since you know your ideal column number (n), shove all your elements in the first column to start with.

Repeat the next steps as often as you see fit... its an iterative algorithm, so the results get better fast over the first few iterations, and then your returns start diminishing.

Run through the columns sequentially.

Let the number of items in the current column be numCurrent.

If numCurrent < n, skip this column.

Track the elements that start with the first letter (groupFirst), and the last letter(groupLast) of the current column.

Calculate the no of items of the previous column (if there is one) as numPrev. If abs(n-numCurrent) > abs(n-numPrev+groupFirst), move groupFirst to the previous column.

Recalculate numCurrent.

Like before, if there is a next column, shift groupLast into it if abs(n-numCurrent) > abs(n-numNext+groupLast).

Rinse and repeat. The more rinses, the neater it should look. There will be a point at which no more changes will be possible, and also points at which it can just keep going. You decide how many iterations.

Sudhir Jonathan