ansaurus

Question

Slow implementation of famous compression algorithm (LZ78)

Answer 1

+2 A:

In my opinion an ArrayList isn't the best datastructure for keeping a dictionary with only contains and adds.

EDIT

Try using an HashSet, which stores its elements in a hash table, is the best-performing implementation of the Set interface; however it makes no guarantees concerning the order of iteration

dfa 2009-07-15 10:21:21

what is the best in your opinion?

Robert 2009-07-15 10:22:26

Answer 2

+1 A:

An ArrayList will have O(N) search complexity. Use a data structure such as a hash table or dictionary.

Mitch Wheat 2009-07-15 10:21:36

This implies that string comparison is a primitive operation which is not true. Since strings get larger and larger as their indices of ArrayList increase comparison will also depend on those lengths.

sharptooth 2009-07-15 10:24:43

@sharptooth: the poster's update to the question seems to validate my answer.

Mitch Wheat 2009-07-15 11:30:53

Answer 3

+2 A:

I think I can do better (sorry a bit long):

import java.io.File;
import java.io.FileInputStream;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class LZ78 {
    /**
     * Uses the LZ78 compression algorithm to approximate the Kolmogorov
     * complexity of a String
     * 
     * @param s
     * @return the approximate Kolmogorov complexity
     */
    public static double kComplexity(String s) {
        Set<String> dictionary = new HashSet<String>();
        StringBuilder w = new StringBuilder();
        double comp = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (dictionary.contains(w.toString() + c)) {
                w.append(c);
            } else {
                comp++;
                dictionary.add(w.toString() + c);
                w = new StringBuilder();
            }
        }
        if (w.length() != 0) {
            comp++;
        }
        return comp;
    }

    private static boolean equal(Object o1, Object o2) {
        return o1 == o2 || (o1 != null && o1.equals(o2));
    }

    public static final class FList<T> {
        public final FList<T> head;
        public final T tail;
        private final int hashCodeValue;

        public FList(FList<T> head, T tail) {
            this.head = head;
            this.tail = tail;
            int v = head != null ? head.hashCodeValue : 0;
            hashCodeValue = v * 31 + (tail != null ? tail.hashCode() : 0);
        }

        @Override
        public boolean equals(Object obj) {
            if (obj instanceof FList<?>) {
                FList<?> that = (FList<?>) obj;
                return equal(this.head, that.head)
                        && equal(this.tail, that.tail);
            }
            return false;
        }

        @Override
        public int hashCode() {
            return hashCodeValue;
        }

        @Override
        public String toString() {
            return head + ", " + tail;
        }
    }

    public static final class FListChar {
        public final FListChar head;
        public final char tail;
        private final int hashCodeValue;

        public FListChar(FListChar head, char tail) {
            this.head = head;
            this.tail = tail;
            int v = head != null ? head.hashCodeValue : 0;
            hashCodeValue = v * 31 + tail;
        }

        @Override
        public boolean equals(Object obj) {
            if (obj instanceof FListChar) {
                FListChar that = (FListChar) obj;
                return equal(this.head, that.head) && this.tail == that.tail;
            }
            return false;
        }

        @Override
        public int hashCode() {
            return hashCodeValue;
        }

        @Override
        public String toString() {
            return head + ", " + tail;
        }
    }

    public static double kComplexity2(String s) {
        Map<FList<Character>, FList<Character>> dictionary = 
            new HashMap<FList<Character>, FList<Character>>();
        FList<Character> w = null;
        double comp = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            FList<Character> w1 = new FList<Character>(w, c);
            FList<Character> ex = dictionary.get(w1);
            if (ex != null) {
                w = ex;
            } else {
                comp++;
                dictionary.put(w1, w1);
                w = null;
            }
        }
        if (w != null) {
            comp++;
        }
        return comp;
    }

    public static double kComplexity3(String s) {
        Map<FListChar, FListChar> dictionary = 
            new HashMap<FListChar, FListChar>(1024);
        FListChar w = null;
        double comp = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            FListChar w1 = new FListChar(w, c);
            FListChar ex = dictionary.get(w1);
            if (ex != null) {
                w = ex;
            } else {
                comp++;
                dictionary.put(w1, w1);
                w = null;
            }
        }
        if (w != null) {
            comp++;
        }
        return comp;
    }

    public static void main(String[] args) throws Exception {
        File f = new File("methods.txt");
        byte[] data = new byte[(int) f.length()];
        FileInputStream fin = new FileInputStream(f);
        int len = fin.read(data);
        fin.close();
        final String test = new String(data, 0, len);

        final int n = 100;
        ExecutorService exec = Executors.newFixedThreadPool(1);
        exec.submit(new Runnable() {
            @Override
            public void run() {
                long t = System.nanoTime();
                double value = 0;
                for (int i = 0; i < n; i++) {
                    value += kComplexity(test);
                }
                System.out.printf("kComplexity: %.3f; Time: %d ms%n",
                        value / n, (System.nanoTime() - t) / 1000000);
            }
        });
        exec.submit(new Runnable() {
            @Override
            public void run() {
                long t = System.nanoTime();
                double value = 0;
                for (int i = 0; i < n; i++) {
                    value += kComplexity2(test);
                }
                System.out.printf("kComplexity2: %.3f; Time: %d ms%n", value
                        / n, (System.nanoTime() - t) / 1000000);
            }
        });
        exec.submit(new Runnable() {
            @Override
            public void run() {
                long t = System.nanoTime();
                double value = 0;
                for (int i = 0; i < n; i++) {
                    value += kComplexity3(test);
                }
                System.out.printf("kComplexity3: %.3f; Time: %d ms%n", value
                        / n, (System.nanoTime() - t) / 1000000);
            }
        });
        exec.shutdown();
    }
}

Results:

kComplexity: 41546,000; Time: 17028 ms
kComplexity2: 41546,000; Time: 6555 ms
kComplexity3: 41546,000; Time: 5971 ms

Edit Peer pressure: How does it work?

Franky, have no idea, it just seemed to be a good way to speed up things. I have to figure it out too, so here we go.

It was an observation that the original code made lot of string appends, however, replacing it with a LinkedList<String> wouldn't help as there is a constant pressure for looking up collections in the hash-table - every time, the hashCode() and equals() are used it needs to traverse the entire list.

But how can I make sure the code doesn't perform this unnecessary? The answer: Immutability - if your class is immutable, then at least the hashCode is constant, therefore, can be pre-computed. The equality check can be shortcutted too - but in worst case scenario it will still traverse the entire collection.

That's nice, but then how do you 'modify' an immutable class. No you don't, you create a new one every time a different content is needed. However, when you look close at the contents of the dictionary, you'll recognize it contains redundant information: []a, [abc]d, [abc]e, [abcd]f. So why not just store the head as a pointer to a previously seen pattern and have a tail for the current character?

Exactly. Using immutability and back-references you can spare space and time, and even the garbage collector will love you too. One speciality of my solution is that in F# and Haskell, the list uses a head:[tail] signature - the head is the element type and the tail is a pointer to the next collection. In this question, the opposite was required as the lists grow at the tail side.

From here on its just some further optimization - for example, use a concrete char as the tail type to avoid constant char autoboxing of the generic version.

One drawback of my solution is that it utilizes recursion when calculating the aspects of the list. For relatively small list it's fine, but longer list could require you to increase the Thread stack size on which your computation runs. In theory, with Java 7's tail call optimization, my code can be rewritten in such a way that it allows the JIT to perform the optimization (or is it already so? Hard to tell).

kd304 2009-07-15 10:55:18

this is significantly faster

Robert 2009-07-15 11:26:12

I would like to express my gratitude to the F#/Haskell community for this kind of linked list concept.

kd304 2009-07-15 11:39:33

As I concurred with a comment below - it would be nice to see a theoretical explanation, as although I'm using java I'm not an expert and have not seen some of this before.

Robert 2009-07-15 12:21:44

Then start a new question about it and have a cross-reference between your new question to this question.

kd304 2009-07-15 12:32:52

Or you could just edit your answer to add a short description of how it works. I think you would get more upvotes to your answer that way as well since I sincerely hope no one upvotes things they don't understand and it seems like there's a few of us who are curious about your solution.

kigurai 2009-07-15 14:16:12

Answer 4

A:

Since you are always checking for prefix+c I think a good data structure could be a tree where each child has its parent as prefix:

           root
        /       |
       a        b
      /  |     /  | 
     an  ap   ba bo
         |         
         ape

Another perhaps more simple approach is to use a sorted list and then use binary search to find the string you are looking at. I still think the tree approach will be faster though.

kigurai 2009-07-15 11:30:07

Wow, check my FList<T> and FListChar. Does it fit your description?

kd304 2009-07-15 11:37:27

Maybe, but I did not look at it close enough. As it was only (a lot of) code and no theoretical description I did not spend enough time understanding it. I am not a Java programmer so it would have taken me quite a while to look up all the stuff you used that I had not seen before.

kigurai 2009-07-15 11:59:49

yes, would be nice to see a theoretical explanation too.

Robert 2009-07-15 12:19:30

Answer 5

A:

Another micro optimization you could try is to swap out the collections objects with these fastutil implementations http://fastutil.dsi.unimi.it/ - its basically free speedup.

Chii 2009-07-15 13:13:11

Benchmark it please. You can use my code sample for that if you need.

kd304 2009-07-15 14:16:15

ansaurus

tags:

views:

answers:

Slow implementation of famous compression algorithm (LZ78)

related questions