views:

366

answers:

5

Does anyone know how to or have some code on counting the number of unique phrases in a document? (Single word, two word phrases, three word phrases).

Thanks

Example of what I'm looking for: What I mean is I have a text document, and i need to see what the most popular word phrases are. Example text

I took the car to the car wash.

I : 1
took : 1
the : 2
car: 2
to : 1
wash : 1
I took : 1
took the : 1
the car : 2
car to : 1
to the : 1
car wash : 1
I took the : 1
took the car : 1
the car to : 1
car to the : 1
to the car : 1
the car wash : 1
I took the car to : 1
took the car to the : 1
the car to the car : 1
car to the car wash : 1

I need the phrase, and the count that it shows up.

Any help would be appreciated. The closet thing I found to this was a PHP script from http://tools.seobook.com/general/keyword-density/source.php

I used to have some code for this, but I cannot find it.

A: 

From Delphi Basics website.

var
  position : Integer;

begin
  // Look for the word 'Cat' in a sentence
  // Note : that this search is case sensitive, so that
  //        the first 'cat' is not matched
  position := AnsiPos('Cat', 'The cat sat on the Cat mat');
  if position = 0
  then ShowMessage('''Cat'' not found in the sentence')
  else ShowMessage('''Cat'' was found at character '+IntToStr(position));
end;

Maybe it will help

Grant
This doesn't look like what he was looking for, but I formatted the code to make it readable at least. Welcome to StackOverflow, Grant.
Mason Wheeler
Yeah, not exactly what I was looking for, but I did let me know I was not 100% clear, so I edited what I was looking for.
Brad
A: 

The number of possible combinations rises really quickly. Assume 30000 words in mainstream use in a language, then the number of 3 phrase combinations is in the magnitude of 30000^3

Anyway, the level zero implementation would be to build a (hash) list of words, filter the list if necessary for very common words (the,of etc) to reduce the number of phrases. Other things you might want to do is reduce plurals to singles, remove trailing 's, casing etc.

Then walk the text word for word (tokenizer style), skipping the common words, and simply keep an ordered list of phrases you encounter with a count, and hope your memory doesn't run out, since Delphi has no 64-bit version :)

Didn't Knuth have a whole book about combinations?

Marco van de Voort
+2  A: 

Here is some initial code that solves your problem.

function CountWordSequences(const s:string; Counts:TStrings = nil):TStrings;
var
  words, seqs : TStrings;
  nw,i,j:integer;
  t :string;
begin
  if Counts=nil then Counts:=TStringList.Create;
  words:=TStringList.Create;        // build a list of all words
  words.DelimitedText:=s;
  seqs:=TStringList.Create;
  for nw:=1 to words.Count do       // build a list of all word sequences
   begin
    for i:=0 to words.Count-nw do
     begin
      t:='';
      for j:=0 to nw-1 do
       begin
        t:=t+words[i+j];
        if j<>nw-1 then t:=t+' ';
       end;
      seqs.Add(t);
     end;
   end;
  words.Destroy;
  for i:=0 to seqs.Count-1 do         // count repeated sequences
   begin
    j:=Counts.IndexOf(seqs.Strings[i]);
    if j=-1 then
      Counts.AddObject(seqs.Strings[i],TObject(1))
    else
      Counts.Objects[j] := TObject(Succ(Integer(Counts.Objects[j])));
   end;
  seqs.Destroy;
  result:=Counts;
end;

You will need to elaborate this code for real world production, for example, by recognizing more word delimiters (not only blanks), and by implementing some sort of case insensitivity.

To test it, put a Button, an EntryField and a Memo in a Form, and add the following code.

procedure TForm1.Button1Click(Sender: TObject);
var i:integer; l:TStrings;
 begin
  l:=CountWordSequences(edit1.Text,TStringList.Create);
  for i:=1 to l.count do
    memo1.Lines.Add('"'+l.Strings[i-1]+'": '+inttostr(Integer(l.Objects[i-1])));
 end;

I first try with I took the car to the car wash

gives

"I": 1
"took": 1
"the": 2
"car": 2
"to": 1
"wash.": 1
"I took": 1
"took the": 1
"the car": 2
"car to": 1
"to the": 1
"car wash.": 1
"I took the": 1
"took the car": 1
"the car to": 1
"car to the": 1
"to the car": 1
"the car wash.": 1
"I took the car": 1
"took the car to": 1
"the car to the": 1
"car to the car": 1
"to the car wash.": 1
"I took the car to": 1
"took the car to the": 1
"the car to the car": 1
"car to the car wash.": 1
"I took the car to the": 1
"took the car to the car": 1
"the car to the car wash.": 1
"I took the car to the car": 1
"took the car to the car wash.": 1
"I took the car to the car wash.": 1
PA
THANKS! I'm going to test it out tonight.
Brad
PA, this works great for short groups of words. Any ideas on how to make it work better for 300-1000 word groups?
Brad
Paraphrasing Marco's answer "The number of possible combinations rises really quickly". Using StringList to hold (and search) that many combinations would not work. I would move to some sort of database, probably an in-memory database such as SQLite or Firebird Embedded.
PA
A: 

This is how I would go about solving the problem. Assuming that each pass thru the data file would create a new data file for the next step. The control character mentioned can be any character which does not naturally appear in the data. When you write a control character, don't write duplicates.

  1. Run through your document and count each word separately.
  2. Run through your document again and replace any word used only once with a control character, addding to a new list the pairs that occur (words A B C become item A B and item B C). Control characters act as hard delimiters. Any word that is alone between control characters should also be converted since it can't be converted to a pair.
  3. Run through your document again and replace any pair used only once with a control character, adding to a new list any triplets that occur. Convert pairs between control characters to control characters.

Repeat adding another word level to each list until you get an empty list or you have the maximum phrase you want to support.

This method implies the fact that your most common phrases can never contain a smaller phrase used less often.

skamradt
A: 

Hi, I've got a text file and I'd be grateful if someone could set up that delphi script which PA submitted. I'm willing to pay you for your time too. Please contact me: dddobos at gmail dot com. Thanks. (Sorry for not posting a programming response... just saw this from a google search and really need to get this problem solved!)

dan