ansaurus

Question

Best algorithm to index sentences.

Answer 1

A:

Tree Search Algorithms (like BST, ect)

Dhana 2009-04-08 06:16:05

I wouldn't call it binary...

Paulius Maruška 2009-04-08 06:18:29

Yah, not really. Not really at all.

Darren Clark 2009-04-08 06:20:26

Answer 2

A:

Using an associative array will allow you to quickly parse sentences in Perl. It is much faster than you would anticipate and it can be effectively dumped out in a tree like structure for subsequent usage by a higher level language.

ojblass 2009-04-08 06:21:19

Answer 3

+1 A:

You can try and dig into Markov chains, formed from words of sentences. Also you'll require both-way chain (i.e. to find next and previous words), i.e. store probable words that appear just after the given or just before it.

Of course, Markov chain is a stochastic process to generate content, however similar approach may be used to store information you need.

dragonfly 2009-04-08 06:21:55

Why was this downvoted? This is how commercial applications work when doing word prediction and parsing.

Christoffer 2009-04-08 07:43:31

Because its probabilistic indexing when the asker wanted deterministic indexing. Also Markov chains are only good at predicting simple constrained speech and not much else.

Unknown 2009-04-11 08:05:32

Answer 4

+1 A:

That looks like it could be stored in a very simple database with the following tables:

Words:
    Id     integer primary-key
    Word   varchar(20)
Following:
    WordId1 integer foreign-key Words(Id) indexed
    WordId2 integer foreign-key Words(Id) indexed

Then, whenever you parse a sentence, just insert the ones that aren't already there, as follows:

The beautiful sky.
    Words (1,'the')
    Words (2, 'beautiful')
    Words (3,, 'sky')
    Following (1, 2)
    Following (2, 3)
Beautiful sky dream.
    Words (4, 'dream')
    Following (3, 4)
Beautiful dream.
    Following (2, 4)

Then you can query to your hearts content on what words follow or precede other words.

paxdiablo 2009-04-08 06:25:17

Answer 5

+4 A:

Short Answer

Create a struct with two vectors of previous/forward links. Then store the word structs in a hash table with the key as the word itself.

Long Answer

This is a linguistic parsing problem that is not easily solved unless you don't mind gibberish.

I went to the park basketball court.
Would you park the car.

Your linking algorithm will create sentences like:

I went to the park the car.
Would you park basketball court.

I'm not quite sure of the SEO applications of this, but I would not welcome another gibberish spam site taking up a search result.

Unknown 2009-04-08 06:51:58

Answer 6

A:

This oughta get you close, in C#:

class Program
{
    public class Node
    {
        private string _term;
        private Dictionary<string, KeyValuePair<Node, Node>> _related = new Dictionary<string, KeyValuePair<Node, Node>>();

        public Node(string term)
        {
            _term = term;
        }

        public void Add(string phrase, Node previous, string [] phraseRemainder, Dictionary<string,Node> existing)
        {
            Node next= null;
            if (phraseRemainder.Length > 0)
            {
                if (!existing.TryGetValue(phraseRemainder[0], out next))
                {
                    existing[phraseRemainder[0]] = next = new Node(phraseRemainder[0]);
                }
                next.Add(phrase, this, phraseRemainder.Skip(1).ToArray(), existing);
            }
            _related.Add(phrase, new KeyValuePair<Node, Node>(previous, next));

        }
    }


    static void Main(string[] args)
    {
        string [] sentences = 
            new string [] { 
                "The beautiful sky",
                "Beautiful sky dream",
                "beautiful dream"
            };

        Dictionary<string, Node> parsedSentences = new Dictionary<string,Node>();

        foreach(string sentence in sentences)
        {
            string [] words = sentence.ToLowerInvariant().Split(' ');
            Node startNode;
            if (!parsedSentences.TryGetValue(words[0],out startNode))
            {
                parsedSentences[words[0]] = startNode = new Node(words[0]);
            }
            if (words.Length > 1)
                startNode.Add(sentence,null,words.Skip(1).ToArray(),parsedSentences);
        }
    }
}

I took the liberty of assuming you wanted to preserve the actual initial phrase. At the end of this, you'll have a list of words in the phrases, and in each one, a list of phrases that use that word, with references to the next and previous words in each phrase.

Darren Clark 2009-04-08 07:12:33

Answer 7

+1 A:

I imagine you would want some sort of Inverted index structure. You would have a Hashmap with the words as keys pointing to lists of pairs of the form (sentence_id, position). You would then store your sentences as arrays or linked lists. Your example would look like this:

sentence[0] = ['the','beautiful', 'sky'];
sentence[1] = ['beautiful','sky', 'dream'];
sentence[2] = ['beautiful', 'dream'];

inverted_index = 
{
 'the': {(0,0)},
 'beautiful': {(0,1), (1,0), (2,0)},
 'sky' : {(0,2),(1,1)},
 'dream':{(1,2), (2,1)}
};

Using this structure lookups on words can be done in constant time. Having identified the word you want, finding the previous and subsequent word in a given sentence can also be done in constant time.

Hope this helps.

Il-Bhima 2009-04-08 07:18:37

ansaurus

tags:

views:

answers:

Best algorithm to index sentences.

related questions