I am grepping with a regular expression, and in the output I would like to see only the strings that match it.

I have a bunch of XML files (mostly single-line files with a huge amount of data on that one line), and I would like to get all the words that start with MAIL_.

The grep command should print only the words that matched, not the entire line, which in this case is the entire file.

How do I do this?

I am trying:

grep -Gril MAIL_* .
grep -Grio MAIL_* .
grep -Gro MAIL_* .

Not sure how to achieve this.

Rgds, AJ

A: 
grep -o or --only-matching

This outputs only the matching text instead of complete lines. If you still get too much output, the problem could be your regex: it may not be restrictive enough and may actually match the whole file.

chocolate_jesus
The words I want appear in the file like this: type="MAIL_ABC_CDE", type="MAIL_XXX_AAA_AAA", etc. There can be any number of _'s. What regexp should I use? Any idea on that?
AJ
+1  A: 

First of all, with the GNU grep that ships with Ubuntu, the -G flag (use basic regexp) is the default, so you can omit it. Even better, use extended regexps with -E.

The -r flag means recursive search through the files of a directory, which is what you need.

You are also right to use the -o flag to print only the matching part of a line. To omit the file names as well, you will need the -h flag.

The only mistake you made is the regular expression itself. You missed a character specification before *: in a regexp, * repeats the preceding item, so MAIL_* means "MAIL followed by zero or more underscores". Quote the pattern as well, so that the shell does not try to expand it as a file name glob. Your command should look like this:

grep -Ehro 'MAIL_[^[:space:]]*' .

Sample output (not recursive):

$ echo "Some garbage MAIL_OPTION comes MAIL_VALUE here" | grep -Eho 'MAIL_[^[:space:]]*'
MAIL_OPTION
MAIL_VALUE
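
To see why the original pattern falls short, here is a small sketch that reuses the sample line above (the variable names are only for illustration). It runs the unfixed pattern and the corrected one side by side.

```shell
# Same sample text as in the example above.
sample='Some garbage MAIL_OPTION comes MAIL_VALUE here'

# '*' applies only to the preceding '_', so each match stops after the underscore.
broken=$(printf '%s\n' "$sample" | grep -o 'MAIL_*')

# With a bracket expression before '*', the match extends to the next whitespace.
fixed=$(printf '%s\n' "$sample" | grep -Eo 'MAIL_[^[:space:]]*')

printf 'broken:\n%s\n' "$broken"
printf 'fixed:\n%s\n' "$fixed"
```

The first command prints only MAIL_ for each occurrence, while the second prints the full words.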
thor
Great, that works. One quick question: what do I do if I know the MAIL_* strings are present either as type="MAIL_*" or as >MAIL_*< in the files? Any help on that one?
AJ
I don't get it. Could you rephrase your question? Do you want to see the characters surrounding your MAIL_XXX strings? That is, do you want to see " and <> in the output of the grep command?
thor
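If the part after MAIL_ can contain only alphabetic characters (a-z, A-Z), then you can change the regexp to 'MAIL_[[:alpha:]]*'. Note that this stops at the next underscore; to allow underscores as well, use 'MAIL_[[:alpha:]_]*'.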
thor
+1  A: 

Try the following command (pipe input to it, or add -r and a path):

grep -Eo 'MAIL_[[:alnum:]_]*'
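
As a rough comparison, here is a sketch on a made-up line (the `sample` variable and its contents are assumptions, modelled on the attribute format mentioned in the comments). Restricting the class to `[[:alnum:]_]` ends each match at the closing quote, whereas an "anything but whitespace" class keeps going. The `sort -u` step and the `(type="|>)` prefix filter are optional extras, and the prefix filter is only a text heuristic, not real XML parsing.

```shell
# Hypothetical input: attributes packed together with no whitespace between them.
sample='type="MAIL_ABC_CDE"type="MAIL_XXX_AAA_AAA"type="MAIL_ABC_CDE"'

# Letters, digits and underscores only: each match ends at the closing quote.
# sort -u lists every distinct name once.
names=$(printf '%s\n' "$sample" | grep -Eo 'MAIL_[[:alnum:]_]*' | sort -u)

# "Anything except whitespace" runs past the quote to the end of the line.
greedy=$(printf '%s\n' "$sample" | grep -Eo 'MAIL_[^[:space:]]*')

# Heuristic: keep only names preceded by type=" or >, then strip that prefix.
in_context=$(printf '%s\n' "$sample" \
    | grep -Eo '(type="|>)MAIL_[[:alnum:]_]*' \
    | sed -E 's/^(type="|>)//' \
    | sort -u)

printf '%s\n' "$names"
```

On real files, the same patterns can be combined with the -r and -h flags and a directory argument.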
banx
A: 

From your comment on thor's answer, it seems you also want to distinguish whether the MAIL_.* text is a text node or an attribute, not just isolate it wherever it appears in the XML document. Grep cannot parse XML; you need a proper XML parser for that.

xmlstarlet is a command-line XML parser. It is packaged in Ubuntu.

Using it on this example file:

$ cat test.xml 
<some_root>
    <test a="MAIL_as_attribute">will be printed if you want matching attributes</test>
    <bar>MAIL_as_text will be printed if you want matching text nodes</bar>
    <MAIL_will_not_be_printed>abc</MAIL_will_not_be_printed>
</some_root>

For selecting text nodes you can use:

$ xmlstarlet sel -t -m '//*' -v 'text()' -n test.xml | grep -Eo 'MAIL_[^[:space:]]*'
MAIL_as_text

And for selecting attributes:

$ xmlstarlet sel -t -m '//*[@*]' -v '@*' -n test.xml | grep -Eo 'MAIL_[^[:space:]]*'
MAIL_as_attribute

Brief explanations:

  • //* is an XPath expression that selects all elements in the document, and text() outputs the value of their child text nodes, so everything except text nodes gets filtered out
  • //*[@*] is an XPath expression that selects all elements that have at least one attribute, and @* then outputs the attribute values
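
If you prefer, the MAIL_ filter for attributes can probably be pushed into the XPath expression itself instead of a second grep. The following is only a sketch under assumptions: it recreates the test.xml shown above in a temporary directory, relies on the standard XPath 1.0 starts-with() function, and does nothing if xmlstarlet is not installed.

```shell
# Recreate the example file from above in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/test.xml" <<'EOF'
<some_root>
    <test a="MAIL_as_attribute">will be printed if you want matching attributes</test>
    <bar>MAIL_as_text will be printed if you want matching text nodes</bar>
    <MAIL_will_not_be_printed>abc</MAIL_will_not_be_printed>
</some_root>
EOF

attrs=''
if command -v xmlstarlet >/dev/null 2>&1; then
    # Select attribute nodes whose value begins with MAIL_ and print each value.
    attrs=$(xmlstarlet sel -t -m '//@*[starts-with(., "MAIL_")]' -v . -n "$workdir/test.xml")
    printf '%s\n' "$attrs"
fi

# Clean up the scratch directory.
rm -rf "$workdir"
```

For text nodes, the grep step is still useful, because the text node contains more than just the MAIL_ word.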
Catalin Iacob