I need this grep call:

grep "field3=highland" data_file

to return both results with "field3=highland" as well as "field3=chicago highland". How can I redesign the grep call to account for both scenarios?

A: 

If you mean to match the third field of the line against your string (rather than matching a literal "field3=highland"), grep is not the right tool for you. In that case, consider awk:

awk '$3=="highland" { print $0 }' <input file>

for an exact match or

awk '$3~".*highland.*" { print $0 }' <input file>

to match with a regular expression.

Note that awk splits fields on runs of whitespace by default, but you can use "-F <field separator>" to change that on the command line, so that

awk -F : '$1~".*oo.*" {print $0}' /etc/passwd

grabs the root line from the password file.
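As a minimal sketch of the two awk forms above, assuming a hypothetical whitespace-separated sample whose third field holds the value of interest (the sample lines are invented for illustration):

```shell
# Hypothetical sample data: three whitespace-separated fields per line.
sample='a1 b1 highland
a2 b2 highlands
a3 b3 chicago'

# Exact match on the third field.
exact=$(printf '%s\n' "$sample" | awk '$3 == "highland" { print $0 }')

# Regular-expression match on the third field (matches any field that
# contains "highland" as a substring).
regex=$(printf '%s\n' "$sample" | awk '$3 ~ /highland/ { print $0 }')

printf 'exact:\n%s\n' "$exact"
printf 'regex:\n%s\n' "$regex"
```

The exact form keeps only the first line, while the regex form also keeps the "highlands" line.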

dmckee
Oh, hmm, good point. Is it the third actual field or does the line really say `field3=`?
DigitalRoss
Er...I don't know. I just read it as wanting to match a particular field and started fiddling. Certainly `awk` is overkill if the "field=" is a literal part of the desired match.
dmckee
OK, looks like I solved the wrong problem. ::sigh::
dmckee
But your answer might be useful for someone who has a similar problem and has stumbled onto SO.
Barry Brown
+1  A: 
$ grep 'f=h\|f=c h' << eof
> f=c h
> f=h
> not
> going f= to
> match
> eof
f=c h
f=h
$

Or, if the idea is that c can be anything, perhaps something like:

$ grep 'f=.*h'
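
The `\|` alternation is a GNU extension to basic regular expressions. A more portable sketch of the same match, using the same invented input as the here-document above, would rely on `grep -E` or on several `-e` patterns:

```shell
# Hypothetical input, mirroring the here-document above.
input='f=c h
f=h
not
going f= to
match'

# Extended regular expressions: | is alternation, no backslash needed.
with_ere=$(printf '%s\n' "$input" | grep -E 'f=h|f=c h')

# Multiple -e patterns: a line matching any one of them is printed.
with_e=$(printf '%s\n' "$input" | grep -e 'f=h' -e 'f=c h')

printf '%s\n' "$with_ere"
```

Both variants should select the same two lines as the GNU-only form.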
DigitalRoss
But c is variable in my case.
this is nice, but it works only for GNU grep
Davide
+2  A: 

you can use the * wildcard

grep "field3=.*highland" data_file
akf
This doesn't seem to be working.
Remember that grep uses regexs not globs. You probably want "field3=.*highland" in this kind of application.
dmckee
THAT's GREAT!!! "field3=.*highland" works just right. THANKS
And it should probably be `=[^=]*highland` to avoid picking up `field3=opportunity field4=knocks field5=scottish highlands`.
Jonathan Leffler
added the `.` (as it was intended, but untyped ;) ). I won't take credit for the good point on the greedy `*`.
akf
+1  A: 

If you want to get all lines with 'field3=' followed by any characters followed by 'highland', you need:

grep 'field3=.*highland' data_file

The '.' means any single character and the '*' means zero or more occurrences of the preceding item. So '.*' effectively matches any string, including the empty one.
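
To see how far '.*' can reach, here is a small sketch on invented sample lines. It compares the pattern above with a variant that uses `[^=]*`, which cannot cross into a later `name=value` pair; the sample data and variable names are assumptions for illustration only.

```shell
# Hypothetical sample: the last line has "highland" only in field5.
sample='field3=highland
field3=chicago highland
field3=opportunity field4=knocks field5=scottish highlands'

# .* may run across later fields, so all three lines match.
greedy=$(printf '%s\n' "$sample" | grep -c 'field3=.*highland')

# [^=]* stops at the next "=", so the last line no longer matches.
bounded=$(printf '%s\n' "$sample" | grep -c 'field3=[^=]*highland')

echo "greedy=$greedy bounded=$bounded"   # prints: greedy=3 bounded=2
```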

paxdiablo
+1  A: 

goe,

My advice would be to spend considerably more effort on composing your question.

You mention "grep tool (Linux)" and "SQL LIKE operator" ... in the subject ... then include a frankly unintelligible question which seems to be about matching two different variations of a sample line of input.

You're getting answers which are only guesses at what your actual question might be.

I think the question is something like:

"I have data which contains some lines like: field3=highland and field3=other stuff highland and I want to match all those lines (filtering out everything else)."

The simplest regular expression which might work would be:

grep "field3=.*highland"

... but this would match things like "field3=highlands" and "field3=thighland" and "myfield3=...", etc. Also it would fail to match "field3 =..." (with the space between the field designator and the equal sign).

Is the "field3" supposed to be at the beginning of the line? Is the highland supposed to be anchored at the end of the line? Should "highland" only match if it's not a substring in a longer "word" (i.e. if the character before the "h" and after the "d" is non-alphabetic)?
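
For illustration only, here is one possible stricter pattern. It assumes that "field3" must start the line, that spaces before the equal sign are allowed, and that "highland" must not be part of a longer word. Those are guesses about the intent, and the sample lines are invented.

```shell
# Hypothetical sample covering the cases discussed above.
sample='field3=highland
field3=chicago highland
field3=highlands
field3=thighland
myfield3=highland
field3 = highland'

# ^field3 *=            field name at line start, optional spaces before "="
# (.*[^[:alnum:]])?     optional leading text ending in a non-alphanumeric
# highland              the word itself
# ([^[:alnum:]]|$)      followed by a non-alphanumeric or the end of line
strict=$(printf '%s\n' "$sample" |
  grep -E '^field3 *=(.*[^[:alnum:]])?highland([^[:alnum:]]|$)')

printf '%s\n' "$strict"
```

With these assumptions only the first, second, and last sample lines are kept.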

There are a great many questions about your expected inputs and desired results ... which will have considerable effect on the sorts of regular expressions that will match or not.

The reference to SQL LIKE expressions and its % tokens is mostly useless. For the most part a % token in an SQL LIKE expression is equivalent to the ".*" regular expression. If you have a snippet of SQL that works (over the same range of inputs) and you're trying to find a functionally equivalent regular expression ... then you should take the time to paste in the working SQL expression.

Also there's nothing particularly specific to grep (Linux or otherwise) in this question. It would be better tagged as a question about regular expressions.

In general there are three or four common abstractions for matching text against patterns: regular expressions (with many variants), "glob" and "wildmat" patterns (shell and MS-DOS like), and SQL LIKE expressions.

Of these, regular expressions are the most commonly used by programmers ... and they are, by far, the most complicated. They range from the oldest simplest variations (as included in the historical UNIX ed line editors from which grep was originally excerpted), to the more powerful "extended" versions (typified by egrep or grep -E) and up to the insanely elaborate "Perl compatible regular expressions" (now widely used by other programming languages as the PCRE libraries).

Glob patterns are far simpler. They support "shell wild cards" ... originally just ? and * (any single character, or any number of any characters, respectively). Later enhancements which are supported by modern shells and other tools include support for character classes (such as [0-9] for any digit and [a-zA-Z] for any letter, and so on). Some of these also support negated character classes.

Because glob patterns use special characters (? and *) which are similar to regular expression syntax, albeit for different purposes ... and because they use almost identical syntax for describing character classes and their complements, glob patterns are often mistaken for regular expressions. When I teach classes in systems administration I usually have to make this point so that students "unlearn" the sloppiness of terminology that's so common.
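
A small sketch of the difference, using the same text `h*d` once as a glob (via the shell's case statement) and once as a regular expression (via grep -x, which requires the whole line to match). The helper names glob_match and regex_match are made up for this example.

```shell
# Glob matching: * stands for any string of characters.
glob_match() {
  case $1 in
    h*d) echo yes ;;
    *)   echo no ;;
  esac
}

# Regex matching against the whole line: * repeats the preceding item.
regex_match() {
  if printf '%s\n' "$1" | grep -qx "$2"; then echo yes; else echo no; fi
}

glob_highland=$(glob_match highland)          # glob  h*d  vs "highland"
regex_highland=$(regex_match highland 'h*d')  # regex h*d  vs "highland"
regex_hhd=$(regex_match hhd 'h*d')            # regex h*d  vs "hhd"
regex_dot=$(regex_match highland 'h.*d')      # regex h.*d vs "highland"

echo "$glob_highland $regex_highland $regex_hhd $regex_dot"   # yes no yes yes
```

As a glob, `h*d` accepts "highland"; as a regular expression it only accepts a run of h characters followed by d, so the regex spelling of the same idea is `h.*d`.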

The old MS-DOS "wildmat" or "wildcard matching" can be thought of as a variant of the original glob patterns. It only supports the ? and * meta-characters ... with mostly the same semantics as UNIX shell globbing. However, I counsel against thinking of them this way. The underlying semantics of how an MS-DOS command line handles arguments containing these patterns is sufficiently different that thinking of them as "globs" is a trap. (A command like: COPY *.TXT *.BAK is perfectly sensible under MS-DOS while a UNIX command like: cp *.txt *.bak is wrong for almost any reasonable situation).
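
On UNIX the shell expands both patterns before cp ever runs, so the MS-DOS idiom has to be spelled out as a loop. A minimal sketch, run in a scratch directory with invented file names:

```shell
# Work in a scratch directory so the sketch touches no real files.
workdir=$(mktemp -d)
touch "$workdir/a.txt" "$workdir/b.txt"

(
  cd "$workdir" || exit 1
  for f in *.txt; do
    [ -e "$f" ] || continue        # skip the literal pattern if nothing matched
    cp -- "$f" "${f%.txt}.bak"     # strip the .txt suffix, append .bak
  done
)

ls "$workdir"
```

Each .txt file gets a .bak copy next to it, which is roughly what COPY *.TXT *.BAK does in one command.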

Obviously, as I've described above, the SQL LIKE expression is quite similar to a UNIX glob. There are only two "special" or "meta" characters in most basic SQL LIKE implementations % (analogous to *) and _ (analogous to ?).

Notice the weasel words here, though. I won't claim that % is the same as a glob * nor that _ is the same as a glob's ? character. There may be some corner cases (regarding how these might match at the beginnings or endings of strings, or adjacent to whitespace etc). There may be differences among different implementations of SQL and there may even be some cruftier versions of the UNIX/Linux fnmatch (globbing) libraries that would make a difference if you tried to rely on such claims.
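
With those caveats in mind, a rough sketch of turning a basic LIKE pattern into a regular expression might look like the following. The function name like_to_regex is hypothetical, LIKE escape clauses are not handled, and the corner cases mentioned above are ignored, so treat it as an approximation rather than an exact equivalence.

```shell
# Hypothetical helper: translate a basic SQL LIKE pattern into an anchored
# basic regular expression (LIKE compares against the whole value).
like_to_regex() {
  printf '%s\n' "$1" | sed \
    -e 's/[]\.*^$[]/\\&/g' \
    -e 's/%/.*/g' \
    -e 's/_/./g' \
    -e 's/^/^/' \
    -e 's/$/$/'
}

pattern=$(like_to_regex 'field3=%highland')
echo "$pattern"

# Count the invented sample lines that the translated pattern accepts.
matches=$(printf 'field3=highland\nfield3=chicago highland\nfield3=highlands\n' |
  grep -c "$pattern")
echo "$matches"
```

The first sed expression backslash-escapes regex metacharacters that are literal in LIKE, and the remaining ones map % to `.*`, _ to `.`, and add the anchors.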

Jim Dennis
@Jim: sorry - I edited the question to make some of your opening remarks no longer ... accurate? But I edited the question title and body (in several rather rapid iterations, though SO only records the last) precisely because of the points you make. And SQL did not seem relevant, so I removed it from the title.
Jonathan Leffler