I have a text file, and each line is of the form:

TAB WORD TAB PoS TAB FREQ#

Word PoS Freq
the Det 61847
of Prep 29391
and Conj 26817
a Det 21626
in Prep 18214
to Inf 16284
it Pron 10875
is Verb 9982
to Prep 9343
was Verb 9236
I Pron 8875
for Prep 8412
that Conj 7308
you Pron 6954

Would one of you regex wizards kindly assist me in isolating the WORDS from the file? I'll do a find and replace in TextPad, hopefully, and that will be that. Multiple find-and-replace passes are fine. One thing: notice that searching for "verb" would also turn up the WORD "verb," not just the part of speech, so be careful. In the end I want 1 word per line.

Thanks so much!

+1  A: 

You could just use awk to print only the first column, as in

awk '{print $1}' /path/to/filename

Skip the first line by using

awk 'NR!=1 {print $1}' /path/to/filename
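
For readers without awk at hand, the same idea can be sketched in Python. This is only an illustration: the function name `first_fields` and the sample text are made up here, and `str.split()` with no arguments is used because, like awk's default field splitting, it ignores leading whitespace and splits on runs of whitespace.

```python
def first_fields(text, skip_header=True):
    """Return the first whitespace-separated field of each non-empty line."""
    lines = text.splitlines()
    if skip_header:
        lines = lines[1:]  # same effect as awk's NR!=1 condition
    words = []
    for line in lines:
        fields = line.split()  # splits on runs of whitespace, ignoring leading tabs
        if fields:             # skip blank lines (awk would print an empty line)
            words.append(fields[0])
    return words


# Hypothetical sample in the TAB WORD TAB PoS TAB FREQ layout from the question.
sample = "\tWord\tPoS\tFreq\n\tthe\tDet\t61847\n\tof\tPrep\t29391\n"
print("\n".join(first_fields(sample)))
```

Because the split is on any whitespace, the leading tab on each line does not shift the field numbering.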
Peter
+1  A: 

There's not really any need to use a regular expression for this. For example, you can use cut. Since each line begins with a tab, the word is the second tab-delimited field:

cut -f2 <inputfile
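
The field numbering can be checked with a small Python sketch. cut splits strictly on every tab, so a line that begins with a tab has an empty first field and the word sits in the second. The helper name `tab_field` and the sample line are hypothetical, and the 1-based numbering is chosen only to mirror cut's `-f` option; this roughly imitates the splitting, not every detail of cut.

```python
def tab_field(line, n):
    """Return the n-th (1-based) tab-delimited field, or '' if it is missing."""
    fields = line.rstrip("\n").split("\t")  # every tab is a separator
    return fields[n - 1] if 0 < n <= len(fields) else ""


# Hypothetical line in the TAB WORD TAB PoS TAB FREQ layout from the question.
line = "\tthe\tDet\t61847\n"
print(repr(tab_field(line, 1)))  # the empty field before the leading tab
print(repr(tab_field(line, 2)))  # the word
```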
Greg Hewgill
+1  A: 

Something like \s*([a-zA-Z]+)\s*([a-zA-Z]+) would return the word and PoS as groups. You can then use them in the replace statement as $1 and $2 to output as you want.

If you only want the WORD part you can just use $1 in the replace; add .* to the end of the pattern so that the rest of the line (the frequency) is matched and dropped as well.
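
As a rough illustration of the capture-group approach, here is a Python sketch. It makes several assumptions that are not part of the answer: it uses Python's `re` module rather than TextPad's regex dialect, writes the letter class as `[a-zA-Z]`, uses `[ \t]` instead of `\s` so a match cannot run across line breaks, anchors the pattern, and ends it with `.*` so the whole line is consumed. Python spells the backreference `\1` where the answer writes `$1`. The name `keep_words` and the sample text are made up.

```python
import re

# ^[ \t]*            leading whitespace (the initial tab)
# ([a-zA-Z]+)        group 1: the word
# [ \t]+([a-zA-Z]+)  group 2: the part of speech
# .*$                the rest of the line (the frequency)
LINE = re.compile(r"^[ \t]*([a-zA-Z]+)[ \t]+([a-zA-Z]+).*$", re.MULTILINE)


def keep_words(text):
    """Replace every matching line with just its first captured group."""
    return LINE.sub(r"\1", text)


# Hypothetical sample; the second line has "verb" as the WORD, not the PoS.
sample = "\tthe\tDet\t61847\n\tverb\tNoun\t12\n"
print(keep_words(sample))
```

Since the pattern keys on column position rather than on the text "verb", a word that happens to look like a part of speech is handled the same as any other. Lines whose word contains characters outside a-z and A-Z (an apostrophe or hyphen, say) would not match and would be left unchanged.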

Chris R
+1  A: 

I think Microsoft Excel can help you with that better...

Just paste the whole text into Excel and it will be laid out as a table; then select the cells of the word column and copy them into Notepad.

I bet this is the easiest path.

If Excel puts each line into a single column instead, extract the word in a separate column with (maxchar being the number of characters to keep):

=TRIM(LEFT(C1,maxchar))

jerjer
Good idea... you often forget the easiest tools!
cksubs