ansaurus

Question

What is a simple way to generate keywords from a text?

Answer 1

+2 A:

The simplest way to do what you want is this...

>>> text = "this is some of the sample text"
>>> words = sorted(word for word in set(text.split(" ")) if len(word) > 3)
>>> words
['sample', 'some', 'text', 'this']

I don't know of any standard module that does this, but it wouldn't be hard to replace the length limit (which drops words of three letters or fewer) with a lookup in a set of common English words.
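
As a rough sketch of that replacement: the COMMON_WORDS set and the keywords helper below are made up for illustration and are not part of any standard module. A real list of common words would be far longer.

```python
# Hypothetical, deliberately tiny set of common English words;
# a real list would contain hundreds of entries.
COMMON_WORDS = frozenset({
    "a", "an", "and", "is", "of", "some", "the", "this", "to",
})

def keywords(text):
    """Return the distinct words of text that are not common words, sorted."""
    words = set(text.lower().split())
    # Set difference drops every common word in one step.
    return sorted(words - COMMON_WORDS)

print(keywords("this is some of the sample text"))
```

Unlike the length test, this keeps short but meaningful words and drops long but common ones, provided the set is chosen well.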

Andrew Wilkinson 2009-01-21 15:54:43

Good answer, but I'd clarify that you'll want to be using a lookup "set" of English words rather than a list so that your lookup is constant time and not O(n).

Eli Courtwright 2009-01-21 16:11:18

Good point. I've edited the text to reflect that. Thanks :-)

Andrew Wilkinson 2009-01-22 09:16:49

Answer 2

+12 A:

The name for these "high-frequency English words" is stop words, and there are many lists available. I'm not aware of any Python or Perl libraries, but you could store your stop word list in a binary tree or hash (or use Python's frozenset); then, as you read each word from the input text, check whether it is in your stop list and filter it out if so.

Note that after you remove the stop words, you'll need to do some stemming to normalize the resulting text (remove plurals, -ings, -eds), then remove all the duplicate "keywords".
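
A minimal sketch of those three steps (filter stop words, stem, remove duplicates) might look like the following. The STOP_WORDS list is a short made-up sample, and naive_stem is a crude suffix stripper written for this example, not a real stemming algorithm such as Porter's.

```python
# Hypothetical, deliberately short stop-word list; real lists are much longer.
STOP_WORDS = frozenset({"a", "an", "and", "are", "in", "is", "of", "the", "to", "was"})

def naive_stem(word):
    """Crudely strip a few common suffixes (illustration only)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words survive intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def extract_keywords(text):
    """Filter stop words, stem what remains, and drop duplicates (order kept)."""
    seen = set()
    result = []
    for raw in text.lower().split():
        word = raw.strip('.,;:!?"()')
        if not word or word in STOP_WORDS:
            continue
        stem = naive_stem(word)
        if stem not in seen:
            seen.add(stem)
            result.append(stem)
    return result

print(extract_keywords("The dogs walked and the dog is walking in parks."))
```

Suffix stripping like this will mangle some words (it has no notion of irregular forms), so treat it as a placeholder for a proper stemmer.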

florin 2009-01-21 16:14:29

Answer 3

+4 A:

In Perl there's Lingua::EN::Keywords.

Leon Timmermans 2009-01-21 16:40:40

Answer 4

+8 A:

You could try using the Perl module Lingua::EN::Tagger for a quick and easy solution.

A more complicated module, Lingua::EN::Semtags::Engine, uses Lingua::EN::Tagger with a WordNet database to get more structured output. Both are pretty easy to use; just check out the documentation on CPAN, or use perldoc after you install the module.

andymurd 2009-01-21 16:44:49

Answer 5

+3 A:

To find the most frequently-used words in a text, do something like this:

#!/usr/bin/perl -w

use strict;
use warnings 'all';

# Read the text:
open my $ifh, '<', 'text.txt'
  or die "Cannot open file: $!";
local $/;
my $text = <$ifh>;

# Find all the words, and count how many times they appear:
my %words = ( );
map { $words{$_}++ }
  grep { length > 1 && $_ =~ m/^[\@a-z-']+$/i }
    map { s/[",\.]//g; $_ }
      split /\s/, $text;

print "Words, sorted by frequency:\n";
my (@data_line);
format FMT = 
@<<<<<<<<<<<<<<<<<<<<<<...     @########
@data_line
.
local $~ = 'FMT';

# Sort them by frequency:
map { @data_line = ($_, $words{$_}); write(); }
  sort { $words{$b} <=> $words{$a} }
    grep { $words{$_} > 2 }
      keys(%words);

Example output looks like this:

john@ubuntu-pc1:~/Desktop$ perl frequency.pl 
Words, sorted by frequency:
for                                   32
Jan                                   27
am                                    26
of                                    21
your                                  21
to                                    18
in                                    17
the                                   17
Get                                   13
you                                   13
OTRS                                  11
today                                 11
PSM                                   10
Card                                  10
me                                     9
on                                     9
and                                    9
Offline                                9
with                                   9
Invited                                9
Black                                  8
get                                    8
Web                                    7
Starred                                7
All                                    7
View                                   7
Obama                                  7

JDrago 2009-01-21 17:47:07

That's a somewhat complicated way to do the same thing as this one-liner: perl -ne '$h{$1}++ while m/\b(\w{3,})\b/g;END{printf"%-20s %5d\n",$_,$h{$_}for sort{$h{$b}<=>$h{$a}}grep{$h{$_}>2}keys%h}'

Hynek -Pichi- Vychodil 2009-01-22 14:28:24

Answer 6

+1 A:

One-liner solution (words longer than two characters that occur more than twice):

perl -ne'$h{$1}++while m/\b(\w{3,})\b/g;END{printf"%-20s %5d\n",$_,$h{$_}for sort{$h{$b}<=>$h{$a}}grep{$h{$_}>2}keys%h}'

EDIT: If you want words with the same frequency to be sorted alphabetically, you can use this enhanced version:

perl -ne'$h{$1}++while m/\b(\w{3,})\b/g;END{printf"%-20s %5d\n",$_,$h{$_}for sort{$h{$b}<=>$h{$a}or$a cmp$b}grep{$h{$_}>2}keys%h}'
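
For readers who prefer Python, a rough equivalent of the enhanced one-liner could look like this. The function name frequent_words, its parameters, and the sample string are invented for the example. Note also that Python's \w may not match exactly the same characters as Perl's.

```python
import re
from collections import Counter

def frequent_words(text, min_length=3, min_count=3):
    """Return (word, count) pairs, most frequent first, ties alphabetical."""
    # \b\w{N,}\b matches whole words of at least min_length characters.
    pattern = r"\b\w{%d,}\b" % min_length
    counts = Counter(re.findall(pattern, text))
    kept = [(word, n) for word, n in counts.items() if n >= min_count]
    # Sort by descending count, then by the word itself.
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))

sample = "tea tea tea cup cup cup cup of of of pot pot"
for word, count in frequent_words(sample):
    print("%-20s %5d" % (word, count))
```

As in the Perl version, matching is case-sensitive, so "Get" and "get" are counted separately.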

Hynek -Pichi- Vychodil 2009-01-22 14:36:23

I like this one :)

JDrago 2009-01-22 15:46:05

I have added an enhanced one for you ;-)

Hynek -Pichi- Vychodil 2009-01-22 17:44:21

Answer 7

A:

I think the most accurate way that still maintains a semblance of simplicity would be to count the word frequencies in your source, then weight them according to their frequencies in common English (or whatever other language) usage.

Words that appear less frequently in common use, like "coffeehouse," are more likely to be keywords than words that appear more often, like "dog." Still, if your source mentions "dog" 500 times and "coffeehouse" twice, it's more likely that "dog" is a keyword even though it's a common word.

Deciding on the weighting scheme would be the difficult part.
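
As one possible scheme among many (similar in spirit to TF-IDF), you could multiply each word's count in the source by the logarithm of how rare it is in general usage. In this sketch, the background frequencies are invented numbers for illustration only; real ones would have to come from a corpus, and the default rate for unknown words is likewise an assumption.

```python
import math
from collections import Counter

# Hypothetical background frequencies (occurrences per million words of
# general English). These numbers are made up for illustration only.
BACKGROUND_PER_MILLION = {"the": 50000.0, "dog": 100.0, "coffeehouse": 1.0}
DEFAULT_PER_MILLION = 1.0  # assumed rate for words missing from the table

def keyword_scores(text):
    """Score each word by its count in text times the log of its rarity."""
    counts = Counter(text.lower().split())
    scores = {}
    for word, count in counts.items():
        rate = BACKGROUND_PER_MILLION.get(word, DEFAULT_PER_MILLION)
        # Words that are rarer in general usage get a larger weight.
        weight = math.log(1_000_000 / rate)
        scores[word] = count * weight
    return scores

def ranked_keywords(text):
    """Return the words of text ordered from highest to lowest score."""
    scores = keyword_scores(text)
    return sorted(scores, key=scores.get, reverse=True)

print(ranked_keywords("dog " * 500 + "coffeehouse coffeehouse"))
print(ranked_keywords("dog dog coffeehouse coffeehouse"))
```

With these made-up numbers, "dog" wins when it appears 500 times against two mentions of "coffeehouse", while "coffeehouse" wins when both appear equally often. That matches the intuition above, but the balance depends entirely on the weighting chosen.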

Steve Losh 2009-01-22 15:54:08
