I have a data set in which each line looks like this:

movie (year) genre

For example:

some words (1934) action

My goal is to grab the "movie" field from each line, then check a second file that also lists a bunch of movies and delete the lines in that second file that do not contain any of those movie names. I have been trying to do this with awk, but so far I have only been able to match the year field. Is there a way to create a variable for the movie field? I think the easiest approach would be to match the year field and build a variable from everything that comes before it on each line, but I have not been able to figure out how. Is there an easier way to do this than what I am suggesting?

+3  A: 

Assuming your dataset is in a file:

$ cat dataset
Terminator (19XX) action
The Ghostrider (2009) supernatural

$ awk -F"[()]" '{print $1}' dataset
Terminator
The Ghostrider

$ awk -F"[()]" '{print $1}' dataset > movie_names

$ grep -f movie_names secondfile
$ grep -f secondfile movie_names

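Of course, you can do it with just awk as well. The `next` after `print` stops a line from being printed more than once when it matches several titles:

awk -F"[()]" 'FNR==NR { m[++d]=$1; next } { for(i=1;i<=d;i++) if ($0 ~ m[i]) { print; next } }' dataset secondfile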
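The awk one-liner above uses each title as a regular expression, so a title containing characters such as `.` or `*` could match more than intended, and `$1` keeps the space that precedes the opening parenthesis. Below is a minimal sketch of a literal-match variant using awk's `index()` function. The sample file contents and the temporary directory are assumptions made up for illustration, not part of the original data.

```shell
# Hypothetical sample data, written to a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'Terminator (19XX) action' 'The Ghostrider (2009) supernatural' > "$tmpdir/dataset"
printf '%s\n' 'Terminator rocks' 'nothing here' 'The Ghostrider and Terminator' > "$tmpdir/secondfile"

result=$(awk -F'[()]' '
  FNR == NR {                 # first file: collect the titles
    t = $1
    sub(/ +$/, "", t)         # drop the space left before "("
    if (t != "") titles[++n] = t
    next
  }
  {                           # second file: literal substring test
    for (i = 1; i <= n; i++)
      if (index($0, titles[i])) { print; next }   # print once, then move on
  }
' "$tmpdir/dataset" "$tmpdir/secondfile")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Because `index()` does a plain substring search, no escaping of the titles is needed. Redirect the output to a new file if you want to replace the second file with the filtered version.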
ghostdog74
That is great! I did not know that -F accepts regular expressions. You can combine this into one command line: awk -F"[()]" '{print $1}' dataset | fgrep -f - secondfile. This way, you don't need the temporary file movie_names.
raja kolluru
Thanks for the answer, this does exactly what I needed. @raja I will have to try that one-liner, it looks like it would work nicely.
Isawpalmetto
A: 

You can ask sed to remove the year field and everything that comes after it.

$ cat file | sed 's/([0-9]\+).*//'

This returns only the name of the movie on each line (followed by the space that preceded the parenthesis, which the regex leaves in place). You can then pipe the output into a while read loop.

If needed, you can refine the regex so that it matches only 4 digits (this one matches any number of digits between the parentheses).
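As a rough sketch of those two suggestions together, the following uses a 4-digit interval (`\{4\}`) in the sed pattern and feeds the titles to a `while read` loop that looks each one up in the second file with `grep -F`. The file names and sample lines are assumptions for illustration only.

```shell
# Hypothetical sample data, written to a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'some words (1934) action' 'The Ghostrider (2009) supernatural' > "$tmpdir/dataset"
printf '%s\n' 'I watched some words last night' 'an unrelated film' 'The Ghostrider was fun' > "$tmpdir/secondfile"

# Remove the optional spaces, the 4-digit "(year)" and everything after it,
# then search the second file for each remaining title as a fixed string.
kept=$(sed 's/ *([0-9]\{4\}).*//' "$tmpdir/dataset" |
  while IFS= read -r title; do
    grep -F -- "$title" "$tmpdir/secondfile"
  done)

printf '%s\n' "$kept"
rm -rf "$tmpdir"
```

Note that this prints matches in the order of the titles rather than the order of the second file, and a line containing two titles would appear twice, so it is better suited to inspection than to rewriting the second file in place.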

Jean