I am trying to do some simple formatting with sed on Linux, and I need a regex that trims a string after the 15th character and appends '...' to the end. Something like this:

before: this is a long string that needs to be shortened
after: this is a long ...

Can anyone please show me how I could write this as a regex and, if possible, explain how it works so that I might learn regex a little better?

+16  A: 

The following works for me:

echo "This is a test with more than 15 characters" | sed "s/\(.\{15\}\).\+$/\1…/"

What happens here is that we match any character (.) 15 times ({15}) and capture the matched text inside parentheses. The following part (.+$) matches all the rest, up to the end of the line. We replace the whole match with whatever we captured inside the parentheses (\1), followed by the horizontal ellipsis character (…).

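To satisfy sed's regex dialect (BRE), we have to escape some of these characters: the parentheses, the braces, and the +. Note that \+ is a GNU extension to BRE rather than part of POSIX.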
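For reference, here is a minimal sketch of the same substitution wrapped in a shell function. The function name truncate15 is made up for illustration, three ASCII dots stand in for the ellipsis character to match the question's example, and ..* is used in place of .\+ on the assumption that the sed in use may not support that GNU extension.

```shell
# Hypothetical helper: keep the first 15 characters of each input line and
# append three dots, but only when the line has a 16th character.
truncate15() {
  # '..*' (one character, then zero or more) is the portable BRE
  # spelling of GNU sed's '.\+'.
  sed 's/^\(.\{15\}\)..*$/\1.../'
}

printf '%s\n' 'This is a test with more than 15 characters' | truncate15
# -> This is a test ...
printf '%s\n' 'short line' | truncate15
# -> short line   (fewer than 16 characters, so no match and no change)
```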

Konrad Rudolph
Should probably make that .+ instead of .*, so that it doesn't match a string of exactly 15 characters.
Adam Jaskiewicz
+1  A: 

With Perl regular expressions:

$ echo 'this is a long string that needs to be shortened' \
| perl -pe 's/^(.{15}).+/$1.../'
this is a long ...

The easiest way to think about a regular expression is as a pattern that needs to be matched. In this case the pattern begins at the beginning of the line:

^

(Note that / is an arbitrary separator. Other characters could be used instead.) The ^ is the symbol that represents the start of the line in a regex. Next the regex matches any character:

^.

A . is the regex symbol for any character. But we want to match the first 15 characters:

^.{15}

There are several different modifiers that represent repetition. The most common is *, which signifies 0 or more. A + indicates 1 or more, and {15} represents exactly 15. (The {...} notation is more general: * could be written {0,} and + is the same as {1,}.) Now we need to capture the first 15 characters so that we can use them later:

^(.{15})

Everything between ( and ) is captured and placed in a special variable called $1 (or sometimes \1). The second chunk captured would be placed in $2 and so on. Finally, you need to match to the end of the line so that you can throw that part away:

^(.{15}).+

I initially used *, but as another person pointed out, that probably isn't what is wanted when the string is exactly 15 characters long:

$ echo 'this is a long ' \
| perl -pe 's/^(.{15}).*/$1.../'
this is a long ...

Using a + means the pattern will not match if there is not a 16th character to replace.

The second half of the statement is what gets printed:

$1...

The $1 variable that we captured earlier is used, and the dots are literal .s on this side of the substitution. Generally, everything except regex variables is literal on the right side of a substitution statement.

Jon Ericson
A: 

In Perl, you could write s/(.{15}).*/$1.../. I'm not sure sed can use the {15} notation, but if not, s/\(...............\).*/\1.../ (with 15 dots in the group) would work.

I can never remember whether you need to escape ( when grouping in sed. I just tried it, and you do need \( and \).
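On both doubts: as far as I know, the \{15\} interval is part of POSIX BRE, so sed does accept it, and the backslashes can be avoided altogether in extended-regex mode. A small sketch, assuming a sed that accepts the -E flag (GNU and BSD sed both do); the function names are made up for illustration.

```shell
# BRE (default mode): grouping and intervals need backslashes.
trim_bre() { sed 's/^\(.\{15\}\).\{1,\}$/\1.../'; }

# ERE (-E): the same pattern without the backslashes, close to the Perl form.
trim_ere() { sed -E 's/^(.{15}).+$/\1.../'; }

s='this is a long string that needs to be shortened'
printf '%s\n' "$s" | trim_bre   # -> this is a long ...
printf '%s\n' "$s" | trim_ere   # -> this is a long ...
```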

Adrian Pronk
+5  A: 

Explanation of Konrad Rudolph's answer, since you requested explanations (ah, as I wrote this, Konrad added his own explanation too!)

 sed "s/\(.\{15\}\).\+$/\1…/"

\(

start a group - ask the regexp engine to remember what's inside the parens, and assign the first such group to \1, the second to \2 etc. We will only need \1 here

.

Match anything...

\{15\}

... 15 times.

\)

end the group. So \1 will contain the first 15 characters

 .\+

match anything again. The \+ means "one or more times", so this matches the characters beyond the 15 we matched above, ...

 $

...until the end of the line

Now for the replace bit:

\1

Replace with the contents of \1

…

and the ellipsis character.

Done!

Paul
Typo: You write \} when you mean \) to end the group.
strager
I hadn't noticed that an explanation was requested and thought the author just needed a way around BRE's quirks.
Konrad Rudolph
Don't worry. You still get all the votes. ;-)
Jon Ericson
A: 

Do you really want to just whack off everything after the 15th character, or are you trying to impose a 15-character maximum length? What if the string is 16 characters long? All of the solutions presented so far will chop off that one excess character only to replace it with three dots. (I know Konrad and Paul used the ellipsis character, but the OP used three dots in the example; we should get a ruling on that.)

If you want to trim the strings to a maximum length of 15 including the three dots, you can do this:

s/^\(.\{12\}\).\{3\}.\+$/\1.../

It still only matches if there are more than 15 characters, but then it chops off everything after the 12th character to make room for the dots.
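A quick sketch of how this behaves around the 15-character boundary. The function name cap15 is made up, and ..* is used in place of .\+ on the assumption that the sed in use may not support that GNU extension; the pattern is otherwise the one above.

```shell
# Hypothetical helper: cap each line at 15 characters, dots included.
cap15() {
  # Keep 12 characters, require at least 4 more (3 + 1), and replace
  # everything after the 12th character with three dots.
  sed 's/^\(.\{12\}\).\{3\}..*$/\1.../'
}

printf '%s\n' '123456789012345'  | cap15   # 15 chars -> 123456789012345
printf '%s\n' '1234567890123456' | cap15   # 16 chars -> 123456789012...
printf '%s\n' 'this is a long string that needs to be shortened' | cap15
# -> this is a lo...
```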

Alan Moore