Tree Search Algorithms (like BST, ect)
Using an associative array will allow you to quickly parse sentences in Perl. It is much faster than you would anticipate and it can be effectively dumped out in a tree like structure for subsequent usage by a higher level language.
You can try and dig into Markov chains, formed from words of sentences. Also you'll require both-way chain (i.e. to find next and previous words), i.e. store probable words that appear just after the given or just before it.
Of course, Markov chain is a stochastic process to generate content, however similar approach may be used to store information you need.
That looks like it could be stored in a very simple database with the following tables:
Words:
Id integer primary-key
Word varchar(20)
Following:
WordId1 integer foreign-key Words(Id) indexed
WordId2 integer foreign-key Words(Id) indexed
Then, whenever you parse a sentence, just insert the ones that aren't already there, as follows:
The beautiful sky.
Words (1,'the')
Words (2, 'beautiful')
Words (3,, 'sky')
Following (1, 2)
Following (2, 3)
Beautiful sky dream.
Words (4, 'dream')
Following (3, 4)
Beautiful dream.
Following (2, 4)
Then you can query to your hearts content on what words follow or precede other words.
Short Answer
Create a struct with two vectors of previous/forward links. Then store the word structs in a hash table with the key as the word itself.
Long Answer
This is a linguistic parsing problem that is not easily solved unless you don't mind gibberish.
- I went to the park basketball court.
- Would you park the car.
Your linking algorithm will create sentences like:
- I went to the park the car.
- Would you park basketball court.
I'm not quite sure of the SEO applications of this, but I would not welcome another gibberish spam site taking up a search result.
This oughta get you close, in C#:
class Program
{
public class Node
{
private string _term;
private Dictionary<string, KeyValuePair<Node, Node>> _related = new Dictionary<string, KeyValuePair<Node, Node>>();
public Node(string term)
{
_term = term;
}
public void Add(string phrase, Node previous, string [] phraseRemainder, Dictionary<string,Node> existing)
{
Node next= null;
if (phraseRemainder.Length > 0)
{
if (!existing.TryGetValue(phraseRemainder[0], out next))
{
existing[phraseRemainder[0]] = next = new Node(phraseRemainder[0]);
}
next.Add(phrase, this, phraseRemainder.Skip(1).ToArray(), existing);
}
_related.Add(phrase, new KeyValuePair<Node, Node>(previous, next));
}
}
static void Main(string[] args)
{
string [] sentences =
new string [] {
"The beautiful sky",
"Beautiful sky dream",
"beautiful dream"
};
Dictionary<string, Node> parsedSentences = new Dictionary<string,Node>();
foreach(string sentence in sentences)
{
string [] words = sentence.ToLowerInvariant().Split(' ');
Node startNode;
if (!parsedSentences.TryGetValue(words[0],out startNode))
{
parsedSentences[words[0]] = startNode = new Node(words[0]);
}
if (words.Length > 1)
startNode.Add(sentence,null,words.Skip(1).ToArray(),parsedSentences);
}
}
}
I took the liberty of assuming you wanted to preserve the actual initial phrase. At the end of this, you'll have a list of words in the phrases, and in each one, a list of phrases that use that word, with references to the next and previous words in each phrase.
I imagine you would want some sort of Inverted index structure. You would have a Hashmap with the words as keys pointing to lists of pairs of the form (sentence_id, position)
. You would then store your sentences as arrays or linked lists. Your example would look like this:
sentence[0] = ['the','beautiful', 'sky'];
sentence[1] = ['beautiful','sky', 'dream'];
sentence[2] = ['beautiful', 'dream'];
inverted_index =
{
'the': {(0,0)},
'beautiful': {(0,1), (1,0), (2,0)},
'sky' : {(0,2),(1,1)},
'dream':{(1,2), (2,1)}
};
Using this structure lookups on words can be done in constant time. Having identified the word you want, finding the previous and subsequent word in a given sentence can also be done in constant time.
Hope this helps.