I'm processing a list of strings, and I want to clean them up so they don't look ugly to the user. An example list would be:

2736162 Magazines
23-2311 Numbers
1-38122 Faces
5-231123 Newspapers
31-31235 Armynews
33-12331 Celebrities 1
33-22113 Celebrities 2
Cars
Glasses

What I want is to trim the beginning so that the ugly sequence of digits and "-" characters is left out, and the user sees only the part that makes sense:

Magazines
Numbers
Faces
Newspapers
Armynews
Celebrities 1
Celebrities 2
Cars
Glasses

How would I trim the leading digits and dashes with a regex?

EDIT: Would it be possible to design the same regex to also strip the prefixes from:

FFKKA9101U- Aquatic Environmental Chemistry
FLVKB0381U- Clinical Drug Development
4761-F-Filosofisk kulturkritik
B22-1U-Dynamic biochemistry

to:

Aquatic Environmental Chemistry
Clinical Drug Development
Filosofisk kulturkritik
Dynamic biochemistry

The rule I have in mind is this: if the text before a - consists only of capital letters, digits, and - or + signs, it only makes sense to the machine and is not an actual word, so it should be stripped out. I don't know how to formulate this in regex, though.

+4  A: 

It looks like you can match and replace ^[\d-]*\s* with the empty string.

The […] is a character class. Something like [aeiou] matches any one of the lowercase vowels. \d is the shorthand for the digit character class, so [\d-] matches either a digit or a dash. The \s is the shorthand for the whitespace character class.

The ^ is the beginning of the line anchor. The * is "zero-or-more" repetition.

Thus the pattern matches, at the beginning of a line, a (possibly empty) sequence of digits or dashes, followed by a (possibly empty) sequence of whitespace characters.

It's not clear from the question, but if the input is a single multiline string (rather than a list you process one line at a time), then you'd want to enable multiline mode as well, so that ^ matches at the start of every line.


C# snippet

Here's an example snippet in C#:

using System;
using System.Text.RegularExpressions;

var text = @"
2736162 Magazines
23-2311 Numbers
1-38122 Faces
5-231123 Newspapers
31-31235 Armynews
33-12331 Celebrities 1
33-22113 Celebrities 2
Cars
Glasses
";

Console.WriteLine(
  Regex.Replace(
     text,
     @"^[\d-]*\s*",
     "",
     RegexOptions.Multiline
  )
);

The output is (as seen on ideone.com):

Magazines
Numbers
Faces
Newspapers
Armynews
Celebrities 1
Celebrities 2
Cars
Glasses

Depending on flavor, you may have to specify the multiline mode as a /m flag (or (?m) embedded). You may also have to double the backslash if you're representing the pattern as a string literal, e.g. in Java you can use text.replaceAll("(?m)^[\\d-]*\\s*", "").
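To make the multiline point concrete in the same flavor as the C# snippet, here is a minimal sketch. The two-line input and the `MultilineDemo` class are made up for illustration; it compares no multiline mode, `RegexOptions.Multiline`, and the embedded `(?m)` form, which .NET also accepts.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MultilineDemo
{
    // Hypothetical two-line input, not taken from the question.
    public const string Text = "1-2 A\n3-4 B";
    public const string Pattern = @"^[\d-]*\s*";

    // Without multiline mode, ^ matches only at the very start of the string.
    public static string StripFirstOnly() =>
        Regex.Replace(Text, Pattern, "");

    // With RegexOptions.Multiline, ^ also matches right after each newline.
    public static string StripWithOption() =>
        Regex.Replace(Text, Pattern, "", RegexOptions.Multiline);

    // The embedded (?m) prefix turns on the same mode from inside the pattern.
    public static string StripWithInlineFlag() =>
        Regex.Replace(Text, "(?m)" + Pattern, "");

    public static void Main()
    {
        Console.WriteLine(StripFirstOnly());      // only the first line loses its prefix
        Console.WriteLine(StripWithOption());     // both lines lose their prefix
        Console.WriteLine(StripWithInlineFlag()); // same result as the option
    }
}
```

If you already process the list one string at a time, none of this is needed: each string is its own "line", so a plain `^` is enough.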


Special note on including dash in a character class

Do be careful when including the - inside a […] character class, since it can signify a range instead of a literal - character. Something like [a-z] matches a lowercase letter. Something like [az-] matches either 'a', 'z', or '-'.
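A small sketch of the difference, again in C#. The `DashDemo` class and its helper are illustrative names only; the helper checks whether an input is exactly one character drawn from the given class.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DashDemo
{
    // True when the whole input is exactly one character from the given class.
    public static bool IsSingleCharFrom(string charClass, string input) =>
        Regex.IsMatch(input, "^" + charClass + "$");

    public static void Main()
    {
        // Between two characters, - denotes a range: [a-z] is any lowercase letter.
        Console.WriteLine(IsSingleCharFrom("[a-z]", "m"));   // True
        Console.WriteLine(IsSingleCharFrom("[a-z]", "-"));   // False

        // At the end of the class, - is a literal dash: [az-] is 'a', 'z', or '-'.
        Console.WriteLine(IsSingleCharFrom("[az-]", "-"));   // True
        Console.WriteLine(IsSingleCharFrom("[az-]", "m"));   // False

        // In [\d-] the dash is also last, so it is a literal dash next to the digits.
        Console.WriteLine(IsSingleCharFrom(@"[\d-]", "-"));  // True
        Console.WriteLine(IsSingleCharFrom(@"[\d-]", "7"));  // True
    }
}
```

Putting the dash last (or first) in the class, as the patterns here do, is a common way to keep it literal without escaping it.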

polygenelubricants
Thank you so much! Please have a look at my edit if you feel like it. I know you've already answered the question, so it's up to you whether to let me steal more of your expertise.
Jakob
@Jakob: Try `^[A-Z0-9-]*-\s*` and tell me if it works.
polygenelubricants
@polygenelubricants - try looking at http://ideone.com/SoVXM. It gets the newest strings, but not the ones that just have digits in front of the name.
Jakob
@Jakob: So now you have two different patterns, and you can just "or" them together using alternation with `|`. I also modified the first pattern to use `\s+` instead of `\s*`. This seems to work with your current test: http://ideone.com/aqjyx
polygenelubricants
@polygenelubricants - I looked it up in my encyclopedia - what you just did; it's the definition of AWESOME! thank you :D
Jakob
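The pattern behind the last ideone link is not reproduced in this thread, so the following is only a sketch of what the comments describe: the two patterns joined with `|`, with the first one using `\s+`. The combined pattern and the `PrefixStripper` class are assumptions, and the regex is applied to one string at a time, so no multiline mode is needed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PrefixStripper
{
    // Assumed combination of the two patterns from the comments:
    //   [\d-]*\s+        digits/dashes followed by at least one whitespace
    //   [A-Z0-9-]*-\s*   capitals/digits/dashes ending in a dash
    private static readonly Regex Prefix =
        new Regex(@"^(?:[\d-]*\s+|[A-Z0-9-]*-\s*)");

    // Removes a machine-only prefix from a single line, if one is present.
    public static string Strip(string line) => Prefix.Replace(line, "");

    public static void Main()
    {
        string[] samples =
        {
            "2736162 Magazines",
            "33-12331 Celebrities 1",
            "Cars",
            "FFKKA9101U- Aquatic Environmental Chemistry",
            "4761-F-Filosofisk kulturkritik",
            "B22-1U-Dynamic biochemistry",
        };

        foreach (var sample in samples)
        {
            Console.WriteLine(Strip(sample));
        }
    }
}
```

One limitation of this rule: a real title that starts with capitals and a dash, such as a hypothetical "X-ray optics", would lose its "X-" as well, so it only fits data where that cannot happen.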
A: 

If there are digits (with or without -'s) at the start of every line, you can just split the line on spaces, drop the first piece, and join the rest again.
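A minimal C# sketch of that idea, with `SplitJoinDemo` and `DropFirstPiece` as illustrative names. It assumes the prefix is always the first space-separated piece, which the question's own data does not fully guarantee.

```csharp
using System;
using System.Linq;

public static class SplitJoinDemo
{
    // Splits on single spaces, drops the first piece, and joins the rest back.
    public static string DropFirstPiece(string line) =>
        string.Join(" ", line.Split(' ').Skip(1));

    public static void Main()
    {
        Console.WriteLine(DropFirstPiece("33-12331 Celebrities 1")); // Celebrities 1
        Console.WriteLine(DropFirstPiece("Cars"));                   // prints an empty line
        Console.WriteLine(DropFirstPiece("4761-F-Filosofisk kulturkritik")); // kulturkritik
    }
}
```

The last two calls show where the approach breaks down: a line without a prefix ("Cars") is emptied, and a prefix glued to the title with a dash takes the first word with it.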

codaddict