Hello!
I want to implement some applications with n-grams (preferably in PHP).
Which type of n-gram is better suited for most purposes: word-level or character-level n-grams? And how could you implement an n-gram tokenizer in PHP?
First, I would like to make sure I understand what n-grams actually are. Is the following correct? This is how I understand them:
Sentence: "I live in NY."
word level bigrams (n = 2): "# I", "I live", "live in", "in NY", "NY #"
character level bigrams (n = 2): "#I", "I#", "#l", "li", "iv", "ve", "e#", "#i", "in", "n#", "#N", "NY", "Y#"
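To make the tokenizer question concrete: this is roughly how I imagine it in PHP (just a sketch; the "#" padding and the whitespace/punctuation split rule are my own assumptions, and substr/strlen would need the mb_* variants for non-ASCII text):

```php
<?php
// Sketch of a character-level and a word-level n-gram tokenizer.
// The '#' boundary marker and the split rule are my own assumptions.

function charNgrams(string $text, int $n = 2): array {
    $ngrams = [];
    // Split on whitespace and punctuation (assumed tokenization rule).
    $words = preg_split('/[\s.,!?]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        $padded = '#' . $word . '#';  // mark word boundaries
        for ($i = 0, $len = strlen($padded); $i <= $len - $n; $i++) {
            $ngrams[] = substr($padded, $i, $n);
        }
    }
    return $ngrams;
}

function wordNgrams(string $text, int $n = 2): array {
    $words = preg_split('/[\s.,!?]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $words = array_merge(['#'], $words, ['#']);  // sentence boundary markers
    $ngrams = [];
    for ($i = 0; $i <= count($words) - $n; $i++) {
        $ngrams[] = implode(' ', array_slice($words, $i, $n));
    }
    return $ngrams;
}

print_r(wordNgrams('I live in NY.')); // "# I", "I live", "live in", "in NY", "NY #"
print_r(charNgrams('I live in NY.')); // "#I", "I#", "#l", "li", ...
```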
Once you have this array of n-grams, you drop the duplicates and store a counter for each unique n-gram giving its frequency:
word level bigrams: [1, 1, 1, 1, 1]
character level bigrams: [2, 1, 1, ...]
Is this correct?
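In code, I picture the counting step like this (a minimal sketch operating on the kind of array the tokenizer above produces):

```php
<?php
// Counting step: collapse the n-gram list into a frequency map.
// $ngrams stands for the output of a tokenizer like the sketch above.
$ngrams = ['#I', 'I#', '#l', 'li', 'iv', 've', 'e#', '#i', 'in', 'n#', '#N', 'NY', 'Y#'];
$frequencies = array_count_values($ngrams); // unique n-gram => count
arsort($frequencies);                       // most frequent n-grams first
print_r($frequencies);                      // e.g. ['#I' => 1, 'I#' => 1, ...]
```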
Furthermore, I would like to learn more about what you can do with n-grams:
- How can I identify the language of a text using n-grams?
- Is it possible to do machine translation using n-grams even if you don't have a bilingual corpus?
- How can I build a spam filter (spam vs. ham)? Should I combine n-grams with a Bayesian filter?
- How can I do topic spotting? For example: is a text about basketball or about dogs? My approach (using the Wikipedia articles on "dogs" and "basketball"): build the n-gram vectors for both documents, normalize them, and calculate the Manhattan/Euclidean distance; the smaller the distance, the higher the similarity (see the sketch after this list).
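Expressed as code, I imagine the last approach roughly like this (a sketch that reuses the hypothetical charNgrams() tokenizer from above; real reference documents would be whole Wikipedia articles, not these one-liners, and the lowercasing is my own assumption):

```php
<?php
// Topic spotting sketch: compare documents via the Manhattan distance
// between their normalized character-bigram frequency vectors.
// Assumes charNgrams() from the tokenizer sketch above.

function ngramProfile(string $text, int $n = 2): array {
    // Lowercasing is an assumption, not a requirement of the method.
    $counts = array_count_values(charNgrams(strtolower($text), $n));
    $total = array_sum($counts);
    foreach ($counts as $gram => $count) {
        $counts[$gram] = $count / $total;  // relative frequencies, summing to 1
    }
    return $counts;
}

function manhattanDistance(array $a, array $b): float {
    $distance = 0.0;
    // Iterate over the union of n-grams seen in either profile;
    // missing n-grams count as frequency 0 (requires PHP 7+ for '??').
    foreach (array_unique(array_merge(array_keys($a), array_keys($b))) as $gram) {
        $distance += abs(($a[$gram] ?? 0) - ($b[$gram] ?? 0));
    }
    return $distance;
}

// Toy stand-ins for the two Wikipedia reference articles.
$dogs       = ngramProfile('The dog is a domesticated descendant of the wolf.');
$basketball = ngramProfile('Basketball is a team sport played on a court.');
$query      = ngramProfile('My dog loves long walks in the park.');

// The reference with the smaller distance is the better topic match.
echo 'dogs: ',       manhattanDistance($query, $dogs), "\n";
echo 'basketball: ', manhattanDistance($query, $basketball), "\n";
```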
What do you think about my application approaches, especially the last one?
I hope you can help me. Thanks in advance!