views:

115

answers:

2

I currently have some python code I'd like to port to C++ as it's currently slower than I'd like it to be. Problem is that I'm using a dictionary in it where the key is a tuple consisting of an object and a string (e.g. (obj, "word")). How on earth do I write something similar in C++? Maybe my algorithm is horrendous and there is some way I can make it faster without resorting to C++?

The whole algorithm below for clarity's sake. The dictionary "post_score" is the issue.

def get_best_match_best(search_text, posts):
    """
    Find the best matches between a search query "search_text" and any of the
    strings in "posts".

    @param search_text: Query to find an appropriate match with in posts.
    @type search_text: string
    @param posts: List of candidates to match with target text.
    @type posts: [cl_post.Post]
    @return: Best matches of the candidates found in posts. The posts are ordered
    according to their rank. First post in list has best match and so on.
    @returntype: [cl_post.Post]
    """
    from math import log

    search_words = separate_words(search_text)
    total_number_of_hits = {}
    post_score = {}
    post_size = {}
    for search_word in search_words:
        total_number_of_hits[search_word] = 0.0
        for post in posts:
            post_score[(post, search_word)] = 0.0
            post_words = separate_words(post.text)
            post_size[post] = len(post_words)
            for post_word in post_words:
                possible_match = abs(len(post_word) - len(search_word)) <= 2
                if possible_match:
                    score = calculate_score(search_word, post_word)
                    post_score[(post, search_word)] += score
                    if score >= 1.0:
                        total_number_of_hits[search_word] += 1.0

    log_of_number_of_posts = log(len(posts))
    matches = []
    for post in posts:
       rank = 0.0
       for search_word in search_words:
           rank += post_score[(post, search_word)] * \
                  (log_of_number_of_posts - log(1.0 + total_number_of_hits[search_word]))
       matches.append((rank / post_size[post], post))
    matches.sort(reverse=True)
    return [post[1] for post in matches]
+3  A: 

map<pair<..., string>, ...> if you're hellbent on using C++ for this.

Ignacio Vazquez-Abrams
If MdaG was truly hellbent, I think he'd have used caps lock.
Ken
Thanks, pair was what I was missing. :-)And I'm not hellbent, if I can use Cython or a better algorithm that's even better. :-)
MdaG
+2  A: 

for once, you're calling separate_words(post.text) for every search_word in search_words. You should call separate_words only once for each post in posts.

That is, rather than:

for search_word in search_words:
    for post in posts:
        # do heavy work

you should instead have:

for post in posts:
    # do the heavy works
    for search_word in search_words:
        ...

If, as I suspected, that separate_words do a lot of string manipulations, don't forget that string manipulations is relatively expensive in python since string is immutable.

Another improvement you can do, is that you don't have to compare every word in search_words with every word in post_words. If you keep the search_words and post_words array sorted by word length, then you can use a sliding window technique. Basically, since search_word will only match a post_word if the difference in their length is less than 2, then you need only to check among the window of two lengths differences, thereby cutting down the number of words to check, e.g.:

search_words = sorted(search_words, key=len)
g_post_words = collections.defaultdict(list) # this can probably use list of list
for post_word in post_words:
    g_post_words[len(post_word)].append(post_word)

for search_word in search_words:
    l = len(search_word)
    # candidates = itertools.chain.from_iterable(g_post_words.get(m, []) for m in range(l - 2, l + 3))
    candidates = itertools.chain(g_post_words.get(l - 2, []), 
                                 g_post_words.get(l - 1, []), 
                                 g_post_words.get(l    , []),
                                 g_post_words.get(l + 1, []),
                                 g_post_words.get(l + 2, [])
                                )
    for post_word in candidates:
        score = calculate_score(search_word, post_word)
        # ... and the rest ...

(this code probably won't work as is, it's just to illustrate the idea)

Lie Ryan
This is valuable input and you're correct. I'm not familiar with itertools, but this is a good time to read up on it. Thank you. :)
MdaG