As part of an exercise to better understand F#, which I am currently learning, I wrote a function to split a given string into n-grams.

1) I would like feedback about my function: can this be written more simply or more efficiently?

2) My overall goal is to write a function that returns string similarity (on a 0.0 .. 1.0 scale) based on n-gram similarity. Does this approach work well for comparing short strings, or can this method reliably be used to compare large strings (like articles, for example)?

3) I am aware that n-gram comparisons ignore the context of the two strings. What method would you suggest to accomplish my goal?

//s:string - target string to split into n-grams
//n:int - n-gram size to split string into
let ngram_split (s:string, n:int) =
    let ngram_count = s.Length - (s.Length % n)
    let ngram_list = List.init ngram_count (fun i ->
        if i + n >= s.Length then
            // pad the trailing n-grams with '#' so they all have length n
            s.Substring(i, s.Length - i) + String.init ((i + n) - s.Length) (fun _ -> "#")
        else
            s.Substring(i, n))
    let ngram_array_unique = ngram_list
                             |> Seq.ofList
                             |> Seq.distinct
                             |> Array.ofSeq
    // produce tuples of (ngram string, number of occurrences in the original string)
    Seq.init ngram_array_unique.Length (fun i ->
        (ngram_array_unique.[i],
         ngram_list |> List.filter (fun item -> item = ngram_array_unique.[i])
                    |> List.length))
+2  A: 

I don't know much about evaluating similarity of strings, so I can't give you much feedback regarding points 2 and 3. However, here are a few suggestions that may help to make your implementation simpler.

Many of the operations that you need are already available as F# library functions for working with sequences (lists, arrays, etc.). Strings are also sequences (of characters), so you can write the following:

open System

let ngramSplit n (s:string) = 
  let ngrams = Seq.windowed n s
  let grouped = Seq.groupBy id ngrams
  Seq.map (fun (ngram, occurrences) -> 
    String(ngram), Seq.length occurrences) grouped

The Seq.windowed function implements a sliding window, which is exactly what you need to extract the n-grams of your string. The Seq.groupBy function collects the elements of a sequence (n-grams) into a sequence of groups that contain values with the same key. We use id to calculate the key, which means that the n-gram is itself the key (and so we get groups, where each group contains the same n-grams). Then we just convert n-gram to string and count the number of elements in the group.
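For example, evaluating the function in F# Interactive (the groups come out in order of first occurrence):

ngramSplit 2 "banana"
// seq [("ba", 1); ("an", 2); ("na", 2)]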

Alternatively, you can write the entire function as a single processing pipeline like this:

let ngramSplit n (s:string) = 
  s |> Seq.windowed n
    |> Seq.groupBy id 
    |> Seq.map (fun (ngram, occurrences) -> 
         String(ngram), Seq.length occurrences)
Tomas Petricek
Wow, this is a really big simplification. Great suggestions, I think.
Michael
I think in your second code block you need to get rid of `ngrams` since it isn't declared anywhere, but is now being piped in.
Joel Mueller
@Joel: Yes, you're right - fixed it. Thanks!
Tomas Petricek
+1  A: 

Your code looks OK to me. Since n-gram extraction and similarity comparison are used very often, you should consider some efficiency issues here.

The MapReduce pattern is very suitable for your frequency counting problem:

  1. take a string and emit (word, 1) pairs
  2. group the pairs by word and add all the counts together

    let wordCntReducer (wseq: seq<int*int>) =
        wseq
        |> Seq.groupBy (fun (id, cnt) -> id)
        |> Seq.map (fun (id, idseq) ->
            (id, idseq |> Seq.sumBy (fun (id, cnt) -> cnt)))
    (* test: wordCntReducer [1,1; 2,1; 1,1; 2,1; 2,2] *)
    
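The answer only shows the "reduce" step; a minimal sketch of the corresponding "map" step might look like this (`internId`, which maps each n-gram string to an integer id, is a hypothetical helper, not part of the answer):

    // emit an (id, 1) pair for every n-gram occurrence
    let wordCntMapper (internId: string -> int) (ngrams: seq<string>) =
        ngrams |> Seq.map (fun g -> (internId g, 1))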

You also need to maintain a <word,int> map during your n-gram building for a set of strings, as it is much more efficient to handle integers rather than strings during later processing (a sketch of such a map follows).
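As a minimal sketch of that <word,int> map, assuming a mutable Dictionary is acceptable (the names here are mine, not from the answer):

    open System.Collections.Generic

    // assigns a stable integer id to each distinct n-gram string
    let makeInterner () =
        let table = Dictionary<string, int>()
        fun (ngram: string) ->
            match table.TryGetValue ngram with
            | true, id -> id
            | false, _ ->
                let id = table.Count
                table.[ngram] <- id
                id

    (* usage: let internId = makeInterner ()
              "abc" |> internId *)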

(2) To compare the distance between two short strings, a common practice is to use edit distance computed with simple dynamic programming. To compute the similarity between articles, a state-of-the-art method is to use a TF-IDF feature representation. Actually, the code above is for term-frequency counting, extracted from my data mining library.
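For reference, a minimal dynamic-programming edit distance (Levenshtein) in F# could look like this (my sketch, not code from the answer):

    let editDistance (a: string) (b: string) =
        let m, n = a.Length, b.Length
        // d.[i,j] = edit distance between a.[0..i-1] and b.[0..j-1]
        let d = Array2D.init (m + 1) (n + 1) (fun i j ->
            if i = 0 then j elif j = 0 then i else 0)
        for i in 1 .. m do
            for j in 1 .. n do
                let cost = if a.[i-1] = b.[j-1] then 0 else 1
                d.[i, j] <- min (min (d.[i-1, j] + 1) (d.[i, j-1] + 1)) (d.[i-1, j-1] + cost)
        d.[m, n]

    (* editDistance "kitten" "sitting" = 3 *)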

(3) There are more complex NLP methods, e.g. tree kernels based on the parse tree, to incorporate the context information.

Yin Zhu
If I understand you correctly, you mean that before the frequency count I should give an "id" to each unique n-gram string and then do the frequency count using the "id"s instead of the n-gram strings themselves?
Michael
@Michael, yes. Integers are more efficient to work with.
Yin Zhu
I think TF-IDF might be exactly what I need - it's definitely a step in the right direction.
Michael
+1  A: 

I think you have some good answers for question (1).

Question (2):

You probably want cosine similarity to compare two arbitrary collections of n-grams (the larger, the better). This gives you a range of 0.0 - 1.0 without any scaling needed. The Wikipedia page gives an equation, and the F# translation is pretty straightforward:

open System

let cos a b = 
  let dot = Seq.sum (Seq.map2 ( * ) a b)
  let magnitude v = Math.Sqrt (Seq.sum (Seq.map2 ( * ) v v))
  dot / (magnitude a * magnitude b)

For input, you need to run something like Tomas' answer to get two maps, then remove keys that only exist in one:

let values map = map |> Map.toSeq |> Seq.map (snd >> float)
let desparse map1 map2 = Map.filter (fun k _ -> Map.containsKey k map2) map1
let distance textA textB =
  let a = ngramSplit 3 textA |> Map.ofSeq
  let b = ngramSplit 3 textB |> Map.ofSeq
  let aValues = desparse a b |> values
  let bValues = desparse b a |> values
  cos aValues bValues
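For example, identical texts should come out at exactly 1.0 (assuming ngramSplit from Tomas's answer is in scope):

distance "the cat sat" "the cat sat"  // = 1.0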

With character-based n-grams, I don't know how good your results will be. It depends on what kind of features of the text you are interested in. I do natural language processing, so usually my first step is part-of-speech tagging, and then I compare over n-grams of the parts of speech. I use T'n'T for this, but it has bizarro licensing issues. Some of my colleagues use ACOPOST instead, a Free alternative (as in beer AND freedom). I don't know how good its accuracy is, but POS tagging is a well-understood problem these days, at least for English and related languages.

Question (3):

The best way to compare two strings that are nearly identical is Levenshtein distance. I don't know if that is your case here, although you can relax the assumptions in a number of ways, e.g. for comparing DNA strings.

The standard book on this subject is Sankoff and Kruskal's "Time Warps, String Edits, and Macromolecules". It's pretty old (1983), but it gives good examples of how to adapt the basic algorithm to a number of applications.

Nathan Sanders
To compute the similarity, a sparse vector is needed. Also, your cos formula seems incorrect.
Yin Zhu
Oops. Sorry about that, I shouldn't have written it from scratch. I believe the edited version is correct.
Nathan Sanders
@Nathan, thanks for pointing me in the right direction. Also, I had some problems (because of specific grammar issues) with Levenshtein distance when I tried to compare Hebrew strings - that's one of my motivations to look for another algorithm to do string comparisons.
Michael
If you are comparing lists of short strings, you might look at the [work of Thomas Zastrow](http://www.thomas-zastrow.de/publications.php). He's approaching similarity from a dialect/pronunciation point of view, but with a variety of statistical approaches. I don't see any a priori reason Levenshtein wouldn't work with Hebrew, although you might have to have special handling for vowel diacritics (I'm not sure how they're represented in Unicode).
Nathan Sanders
A: 

Question 3:

My reference book is "Computing Patterns in Strings" by Bill Smyth.

ssp