I have lines of data that look like this:

sp_A0A342_ATPB_COFAR_6_+_contigs_full.fasta
sp_A0A342_ATPB_COFAR_9_-_contigs_full.fasta
sp_A0A373_RK16_COFAR_10_-_contigs_full.fasta
sp_A0A373_RK16_COFAR_8_+_contigs_full.fasta
sp_A0A4W3_SPEA_GEOSL_15_-_contigs_full.fasta

How can I use sed to delete everything after the fourth underscore-separated field on each line, yielding:

sp_A0A342_ATPB_COFAR
sp_A0A342_ATPB_COFAR
sp_A0A373_RK16_COFAR
sp_A0A373_RK16_COFAR
sp_A0A4W3_SPEA_GEOSL
+10  A: 

cut is a better fit.

cut -d_ -f 1-4 old_file

This simply means use _ as delimiter, and keep fields 1-4.
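As a quick check, the cut invocation can be run on two of the sample lines from the question. This is only a minimal sketch; the function name keep_four_fields is a hypothetical wrapper for illustration, not part of cut.

```shell
# Hypothetical wrapper: keep the first four "_"-separated fields of each input line.
keep_four_fields() {
  # -d_ sets the field delimiter to "_"; -f 1-4 selects fields 1 through 4.
  cut -d_ -f 1-4
}

printf '%s\n' \
  'sp_A0A342_ATPB_COFAR_6_+_contigs_full.fasta' \
  'sp_A0A4W3_SPEA_GEOSL_15_-_contigs_full.fasta' |
  keep_four_fields
# prints:
# sp_A0A342_ATPB_COFAR
# sp_A0A4W3_SPEA_GEOSL
```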

If you insist on sed:

sed 's/\(_[^_]*\)\{4\}$//'

The left-hand side matches exactly four repetitions of a group consisting of an underscore followed by zero or more non-underscores, and the match must end at the end of the line. The whole match is replaced by nothing.
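One caveat worth checking: because the pattern is anchored at the end, it removes the last four fields rather than keeping the first four. The two coincide for the question's data, where every line has eight fields. The sketch below shows the difference; strip_last_four is a hypothetical name, and the second input line is made up for illustration.

```shell
# Hypothetical wrapper around the end-anchored sed command.
strip_last_four() {
  sed 's/\(_[^_]*\)\{4\}$//'
}

# Eight fields, as in the question: the first four remain.
echo 'sp_A0A342_ATPB_COFAR_6_+_contigs_full.fasta' | strip_last_four
# prints: sp_A0A342_ATPB_COFAR

# Made-up line with nine fields: five remain, not four.
echo 'sp_A0A342_ATPB_COFAR_6_+_contigs_full_v2.fasta' | strip_last_four
# prints: sp_A0A342_ATPB_COFAR_6
```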

Matthew Flaschen
+1  A: 
sed -e 's/_[0-9][0-9]*_[+-]_contigs_full\.fasta$//'

Still, the cut answer is probably faster and generally better.

Slartibartfast
+1  A: 

Yes, cut is way better, and yes, matching the end of each line is easier.

I finally got a match using the beginning of each line:

 sed -r 's/(([^_]*_){3}([^_]*)).*/\1/' oldFile > newFile
Peter Ajtai
+1  A: 
sed -e 's/\([^_]*\)_\([^_]*\)_\([^_]*\)_\([^_]*\)_.*/\1_\2_\3_\4/' infile > outfile

Match "any number of not '_'", saving what was matched between \( and \), followed by '_'. Do this 4 times, then match anything for the rest of the line (to be ignored). Substitute with each of the matches separated by '_'.
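A sketch of that substitution on sample input, assuming the s command is terminated with a closing /. The name keep_prefix is hypothetical, and the second input line is made up. Since the pattern needs at least four underscores, a line that already has only four fields does not match and passes through unchanged.

```shell
# Hypothetical wrapper: capture four fields, drop the rest of the line.
keep_prefix() {
  sed -e 's/\([^_]*\)_\([^_]*\)_\([^_]*\)_\([^_]*\)_.*/\1_\2_\3_\4/'
}

echo 'sp_A0A373_RK16_COFAR_10_-_contigs_full.fasta' | keep_prefix
# prints: sp_A0A373_RK16_COFAR

# Made-up line with exactly four fields: no match, so it is printed as-is.
echo 'sp_A0A373_RK16_COFAR' | keep_prefix
# prints: sp_A0A373_RK16_COFAR
```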

Scott Thomson
+1  A: 

Here's another possibility:

sed -E -e 's|^([^_]+(_[^_]+){3}).*$|\1|'

where -E, like -r in GNU sed, turns on extended regular expressions for readability.
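One behavioral difference from the `[^_]*` variants: `[^_]+` requires every one of the first four fields to be non-empty. A small sketch, assuming a sed that accepts -E (GNU and BSD sed do); keep_nonempty_four is a hypothetical name, and the second input line, with an empty second field, is made up for illustration.

```shell
# Hypothetical wrapper around the extended-regex sed command.
keep_nonempty_four() {
  sed -E -e 's|^([^_]+(_[^_]+){3}).*$|\1|'
}

echo 'sp_A0A4W3_SPEA_GEOSL_15_-_contigs_full.fasta' | keep_nonempty_four
# prints: sp_A0A4W3_SPEA_GEOSL

# Empty second field: the pattern cannot match, so the line is left untouched.
echo 'sp__SPEA_GEOSL_15' | keep_nonempty_four
# prints: sp__SPEA_GEOSL_15
```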

Just because you can do it in sed, though, doesn't mean you should. I like cut much much better for this.

Owen S.
+1  A: 

AWK likes to play in the fields:

awk 'BEGIN{FS=OFS="_"}{print $1,$2,$3,$4}' inputfile

or, more generally:

awk -v count=4 'BEGIN{FS="_"}{sep="";for(i=1;i<=count;i++){printf "%s%s",sep,$i;sep=FS};printf "\n"}' inputfile
Dennis Williamson