I have this working regex (tested on regex coach):

\n[\s]*[0-9]*[\s]*[0-9]*(\.)?[0-9]*(e\+)?[0-9]*

that is supposed to pick up the first 2 columns of this file

http://wwwhomes.uni-bielefeld.de/achim/highly.txt

I read through the man pages, which say that ^ matches at the beginning of a line, so I replaced \n with ^. But egrep isn't agreeing with me when I do this:

egrep -e ^[\s]*[0-9]*[\s]*[0-9]*(\.)?[0-9]*(e\+)?[0-9]* "wwwhomes.uni-bielefeld.de achim highly.txt"

EDIT: it has something to do with (e\+)?

EDIT 2: Okay, I'm simplifying the regex. Forget about trying to match numbers in scientific notation; here is what I am using:

egrep -e "^[[:space:]]*[0-9]*[[:space:]]*[0-9]*" "wwwhomes.uni-bielefeld.de achim highly.txt"

it returns the header lines:

   no       number      divisors    2 3 5 71113171923293137414347535961677173
------------------------------------------------------------------------------

this isn't right...

Final edit:

I needed a combination of grep and sed to get the proper data out. grep removed the header lines and sed formatted the text:

grep  -E -o -e "^[[:space:]]+[0-9]+[[:space:]]+[0-9e\+\.]+[[:space:]]+[0-9e\+\.]+" "wwwhomes.uni-bielefeld.de achim highly.txt" >grepped.txt

sed -r "s/^\s*[0-9]+\s*([0-9.e+]+)\s*([0-9.e+]+)/\1,\2/" "grepped.txt" >seded.txt 
+2  A: 

POSIX ERE does not define \s. Use [[:space:]] instead of \s, or simply a literal space.

It seems that ^ anchors to the start of the text that has not yet been matched. I don't know why. (Is this behavior specific to grep (GNU grep) 2.5.1 on Mac OS X?)

The regex matches the header lines because every element in it is optional, so it can match the empty string at the start of any line. You need to change some of those * into +.
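
To make the difference concrete, here is a minimal sketch comparing the two quantifiers. The sample file is a hypothetical stand-in for highly.txt (the column widths are an assumption read off the header line), and the variable names are just for illustration.

```shell
# Hypothetical sample in roughly the same layout as highly.txt.
sample=$(mktemp)
{
  printf '%5s%13s%14s    %s\n' no number divisors '2 3 5 7'
  printf '%s\n' '--------------------------------------------'
  printf '%5s%13s%14s\n' 1 1 1
  printf '%5s%13s%14s    %s\n' 2 2 2 1
  printf '%5s%13s%14s    %s\n' 3 4 3 2
} > "$sample"

# Every element is optional (*), so the pattern matches the empty string
# at the start of any line: headers are selected along with the data rows.
loose_count=$(grep -E -c '^[[:space:]]*[0-9]*[[:space:]]*[0-9]*' "$sample")

# With +, a line must start with blanks, digits, blanks, digits,
# so only the data rows are selected.
strict_count=$(grep -E -c '^[[:space:]]+[0-9]+[[:space:]]+[0-9]+' "$sample")

echo "loose: $loose_count, strict: $strict_count"
rm -f "$sample"
```

On this sample the loose pattern selects all five lines, while the strict one selects only the three data rows.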


Since the file is in a fixed-width format, it is far easier to use cut than to construct a regex.

cut -c 1-20 highly.txt

You could use grep -v to filter out the undesired results.
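
A rough sketch of how the two could be combined, assuming the header line starts with blanks followed by "no" and the rule line starts with a dash. The sample data and the 1-18 column range are assumptions chosen for this made-up layout; the real file may need a different range.

```shell
# Hypothetical sample mimicking the layout of highly.txt; the widths
# 5/13/14 are an assumption read off the header line.
sample=$(mktemp)
{
  printf '%5s%13s%14s    %s\n' no number divisors '2 3 5 7'
  printf '%s\n' '--------------------------------------------'
  printf '%5s%13s%14s\n' 1 1 1
  printf '%5s%13s%14s    %s\n' 2 2 2 1
  printf '%5s%13s%14s    %s\n' 3 4 3 2
} > "$sample"

# grep -v drops the header and the dashed rule (every occurrence, so
# repeated headers go too); cut keeps characters 1-18, which cover the
# first two fields in this sample.
first_two=$(grep -v -e '^ *no ' -e '^-' "$sample" | cut -c 1-18)

printf '%s\n' "$first_two"
rm -f "$sample"
```

Because grep -v works line by line, it does not matter how often the header block recurs in the file.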

KennyTM
Noted, problem still here. How do I force it to match the beginning of line only? It seems like egrep doesn't care that I added ^
mna
Noted, but this doesn't get rid of the re-occurring headers
mna
+1  A: 

Try adding a -o option to grep to make it print only the part that matched the pattern instead of the line that has the pattern:

egrep -o -e  "^[[:space:]]*[0-9]*[[:space:]]*[0-9.e+]*" file
      ^^

Alternatively, you can use sed:

sed -r 's/^\s*([0-9]+)\s*([0-9.e+]+).*/\1 \2/' file
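
If the header lines should be dropped and the output comma-separated, a variation along these lines might work. This is only a sketch: it swaps in sed -n with the p flag so that lines where the substitution fails are not printed, and uses -E with [[:space:]] instead of -r with \s for portability. The sample file is hypothetical.

```shell
# Hypothetical sample in roughly the same layout as highly.txt.
sample=$(mktemp)
{
  printf '%5s%13s%14s    %s\n' no number divisors '2 3 5 7'
  printf '%s\n' '--------------------------------------------'
  printf '%5s%13s%14s\n' 1 1 1
  printf '%5s%13s%14s    %s\n' 2 2 2 1
  printf '%5s%13s%14s    %s\n' 3 4 3 2
} > "$sample"

# -n suppresses default printing; the trailing p prints only lines where
# the substitution succeeded, so header and rule lines are skipped.
csv=$(sed -n -E 's/^[[:space:]]*([0-9]+)[[:space:]]+([0-9.e+]+).*/\1,\2/p' "$sample")

printf '%s\n' "$csv"
rm -f "$sample"
```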
codaddict
Thanks. Can you tell me what tool I could use to do something like "^[[:space:]]*([0-9]*)[[:space:]]*([0-9.e+]*)" -output "\1,\2" ? I'm new to the whole bash :S
mna
That would be `sed`. I'll update the answer with it.
codaddict
A: 

If your data is properly formatted, with delimiters that you can identify (e.g. in your case, tabs/spaces), there is no need to use a regex. Use awk.

awk '!/--/&&$1!="no"{print $1,$2}' file

I believe this one-liner is all you need, since you said you want to get the first 2 columns and skip the headers. You can use cut too, but it's not as flexible as awk.
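
As a quick check of the one-liner, here is a sketch run against a hypothetical sample (the data and variable names are made up for illustration). With the default field separator awk splits on runs of blanks and ignores leading whitespace, so $1 on the header line is exactly "no". The comma in the output is an addition, not part of the one-liner above.

```shell
# Hypothetical sample in roughly the same layout as highly.txt.
sample=$(mktemp)
{
  printf '%5s%13s%14s    %s\n' no number divisors '2 3 5 7'
  printf '%s\n' '--------------------------------------------'
  printf '%5s%13s%14s\n' 1 1 1
  printf '%5s%13s%14s    %s\n' 2 2 2 1
  printf '%5s%13s%14s    %s\n' 3 4 3 2
} > "$sample"

# Skip lines containing "--" and lines whose first field is "no";
# leading blanks are not part of $1 under default field splitting.
pairs=$(awk '!/--/ && $1 != "no" { print $1 "," $2 }' "$sample")

printf '%s\n' "$pairs"
rm -f "$sample"
```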

ghostdog74
how do I suppress the 'no-number' lines awk returns?
mna
the one liner already does that. See that `$1!="no"` ?
ghostdog74
$1!=" no" white spaces :)
mna