ansaurus

Question

How to use regular expressions to match everything before a certain type of word

Answer 1

+3 A:

Replace

^.*?(?=[A-Z][a-z])

with the empty string. This works for ASCII input. For non-ASCII input (Unicode, other languages), different strategies apply.

Explanation

^      From the start of the string, match
.*?    everything (as little as possible), until
(?=    followed by
[A-Z]  one of A .. Z and
[a-z]  one of a .. z
)

The Java Unicode-enabled variant would be this:

^.*?(?=\p{Lu}\p{Ll})
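For reference, here is a minimal sketch of how the two patterns above might be applied in Java. The class and method names (StripPrefixDemo, stripAscii, stripUnicode) are made up for illustration, and the Unicode sample string is an assumption, not something from the question.

```java
import java.util.regex.Pattern;

class StripPrefixDemo {
    // ASCII-only: lazily consume from the start of the string until an
    // upper-case letter followed by a lower-case letter comes next.
    private static final Pattern ASCII = Pattern.compile("^.*?(?=[A-Z][a-z])");

    // Unicode-aware variant using the \p{Lu} and \p{Ll} categories.
    private static final Pattern UNICODE = Pattern.compile("^.*?(?=\\p{Lu}\\p{Ll})");

    static String stripAscii(String s) {
        // If the lookahead never succeeds, the input is returned unchanged.
        return ASCII.matcher(s).replaceFirst("");
    }

    static String stripUnicode(String s) {
        return UNICODE.matcher(s).replaceFirst("");
    }

    public static void main(String[] args) {
        // prints "This is a test"
        System.out.println(stripAscii("THIS IS A TEST - - +++ This is a test"));
        // Sample with umlauts, written as escapes to avoid source-encoding issues.
        System.out.println(stripUnicode("\u00C4\u00D6\u00DC --- \u00DCbung macht den Meister"));
    }
}
```

Because the prefix is replaced only once, a second proper word later in the string is left alone.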

Tomalak 2009-02-17 23:55:34

@Tomalak Thanks, this is really close to what I want. It's returning the values that I don't want. Is there a way I can switch it so that it returns the string that I need?

John Daly 2009-02-18 01:05:32

@Tomalak Never mind, this works. I really appreciate your help, as well as that of the others who helped out.

John Daly 2009-02-18 01:07:30

Answer 2

A:

then you can do something like this

'.*([A-Z][a-z].*)\s*'

.* matches anything
( [A-Z] #followed by an upper case char
  [a-z] #followed by a lower case char
  .*)   #followed by anything
  \s*   #followed by zero or more white space

Which is what you are looking for, I think.

hhafez 2009-02-17 23:56:48

Answer 3

+2 A:

Having woken up a bit, you don't need to delete anything, or even create a sub-group - just find the pattern expressed elsewhere in answers. Here's a complete example:

import java.util.regex.*;

public class Test
{
    public static void main(String args[])
    {
        Pattern pattern = Pattern.compile("[A-Z][a-z].*");

        String original = "THIS IS A TEST - - +++ This is a test";
        Matcher match = pattern.matcher(original);
        if (match.find())
        {
            System.out.println(match.group());
        }
        else
        {
            System.out.println("No match");
        }        
    }
}

EDIT: Original answer

This looks like it's doing the right thing:

import java.util.regex.*;

public class Test
{
    public static void main(String args[])
    {
        Pattern pattern = Pattern.compile("^.*?([A-Z][a-z].*)$");

        String original = "THIS IS A TEST - - +++ This is a test";
        String replaced = pattern.matcher(original).replaceAll("$1");

        System.out.println(replaced);
    }
}

Basically the trick is not to ignore everything before the proper word - it's to group everything from the proper word onwards, and replace the whole text with that group.

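The above would fail with "*** FOO *** I am fond of peanuts" because the "I" wouldn't be considered a proper word. If you want to fix that, change the [a-z] to [a-z\s], which allows whitespace instead of a lower-case letter. On its own that change also matches the last "O" of "FOO" (an upper-case letter followed by a space), so anchor the upper-case letter at a word boundary as well: \b[A-Z][a-z\s]. Even then, a one-letter word inside all-caps text (the "A" in "THIS IS A TEST") will match.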
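The single-letter-word case is easy to get wrong, so here is a small sketch that runs three candidate patterns against the peanuts example. The class and helper names are made up for illustration, and the \b variant is only one possible way to handle it, not a definitive fix.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SingleLetterWordDemo {
    // Returns the text from the first match onwards, or null if nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "*** FOO *** I am fond of peanuts";

        // Upper case then lower case only: "I" is not seen as a proper word.
        System.out.println(firstMatch("[A-Z][a-z].*", input));       // prints "null"

        // Allowing whitespace in the second slot matches the last "O" of "FOO".
        System.out.println(firstMatch("[A-Z][a-z\\s].*", input));    // prints "O *** I am fond of peanuts"

        // Requiring a word boundary before the upper-case letter skips "FOO".
        System.out.println(firstMatch("\\b[A-Z][a-z\\s].*", input)); // prints "I am fond of peanuts"
    }
}
```

The word-boundary variant still has a trade-off: on "THIS IS A TEST - - +++ This is a test" it starts at the stand-alone "A", so whether to accept one-letter words depends on what the input looks like.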

Jon Skeet 2009-02-17 23:56:58

I think from the question he is looking for everything before the proper word (not the other way around as your example shows)

hhafez 2009-02-17 23:58:14

He wants to *delete* everything before the proper word. Look at his example - he wants the result to be "This is a test" which is exactly what my code produces.

Jon Skeet 2009-02-18 00:01:51

However, it's more complicated than it needs to be, due to a different misreading. Editing...

Jon Skeet 2009-02-18 00:02:28

You're right, I misunderstood what the questioner meant. I'm fixing my example then.

hhafez 2009-02-18 00:06:48

This is working, but there is one scenario where it does not. Take the example: THIS IS A TEST - - +++ This This is a test. The second "This" causes issues: it deletes the first "This". After encountering the first proper word I need to stop processing, thus producing: This This is a test

John Daly 2009-02-18 00:12:29

Oh damn it. ;-) +1

Tomalak 2009-02-18 00:16:21

I don't know regular expressions, but you deserve +1 for the effort!

Bill K 2009-02-18 00:16:53

I am a bit confused ;) Is there a way to do this check just once instead of on every "proper" word? It's deleting info that I need.

John Daly 2009-02-18 00:22:05

@John Daly: The regex just *returns* the string you look for (in "match.group()"). You don't need to touch the original string, using the match is the same thing.

Tomalak 2009-02-18 00:25:29

@John Daly: I've just tried your example in the comment and it produced "This This is a test" with my first piece of code. If it did something else on your box, I'm confused...

Jon Skeet 2009-02-18 00:28:42

@Jon Skeet: Using your expression ^.*([A-Z][a-z].*)$ and the example THIS IS A TEST - - +++ This This is a test, it produces This is a test when I execute it. Am I missing something?

John Daly 2009-02-18 00:33:55

Oh, you mean the *second* code example. I'll try that and fix it if necessary. I suggest you use the first one though.

Jon Skeet 2009-02-18 07:47:16

Fixed second example to use a reluctant quantifier.

Jon Skeet 2009-02-18 07:50:10
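The greedy versus reluctant difference discussed in the comments above can be reproduced with a short sketch. The class and method names here are made up for illustration. The two patterns are the greedy one quoted in the comments and the reluctant one from the answer.

```java
import java.util.regex.Pattern;

class GreedyVsReluctant {
    // Replaces the whole input with capture group 1 (the proper word onwards).
    static String keepFromProperWord(String regex, String input) {
        return Pattern.compile(regex).matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        String input = "THIS IS A TEST - - +++ This This is a test";

        // Greedy .* swallows as much as it can, so the group starts at the LAST proper word.
        System.out.println(keepFromProperWord("^.*([A-Z][a-z].*)$", input));  // prints "This is a test"

        // Reluctant .*? stops as early as it can, so the group starts at the FIRST proper word.
        System.out.println(keepFromProperWord("^.*?([A-Z][a-z].*)$", input)); // prints "This This is a test"
    }
}
```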

Answer 4

A:

([A-Z][a-z].+)

would match:

This is a text

Maiku Mori 2009-02-17 23:58:15

Answer 5

A:

I know my opinion on this really isn't that popular so you guys can down-vote me into oblivion if you want, but I have to rant a little (and this contains a solution, just not in the way the poster asked for).

I really don't get why people go to regular expressions so quickly.

I've done a lot of string parsing (Used to screen-scrape vt100 menu screens) and I've never found a single case where Regular Expressions would have been much easier than just writing code. (Maybe a couple would have been a little easier, but not much).

I kind of understand they are supposed to be easier once you know them--but you see someone ask a question like this and realize they aren't easy for every programmer to just get by glancing at it. If it costs 1 programmer somewhere down the line 10 minutes of thought, it has a huge net loss over just coding it, even if you took 5 minutes to write 5 lines.

So it's going to need documentation--and if someone who is at that same level comes across it, he won't be able to modify it without knowledge outside his domain, even with documentation.

I mean if the poster had to ask about a trivial case--then there just isn't any such thing as a trivial case.

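public String getRealText(String scanMe) {
    for (int i = 0; i < scanMe.length() - 1; i++)  // stop one short so i + 1 stays in bounds
        if (Character.isUpperCase(scanMe.charAt(i)) && Character.isLowerCase(scanMe.charAt(i + 1)))
            return scanMe.substring(i);
    return null;  // no upper-case letter followed by a lower-case one
}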

I mean it's 5 lines, but it's simple, readable, and faster than most (all?) RE parsers. Once you've wrapped a regular expression in a method and commented it, the difference in size isn't measurable. The difference in time--well for the poster it would have obviously been a LOT less time--as it might be for the next guy that comes across his code.

And this string operation is one of the ones that are even easier in C with pointers--and it would be even quicker since the testing functions are macros in C.

By the way, make sure you look for a space in the second slot, not just a lower-case letter, otherwise you'll miss any lines starting with the words A or I.
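A rough sketch of that space check follows. It is a hypothetical variant of getRealText, wrapped in a made-up class so it can be run on its own. The extra word-start test is an assumption added here: without it, the last letter of an all-caps word followed by a space (the final "S" of "THIS ") would count as a hit.

```java
class ProperWordScanner {
    // Hypothetical variant of getRealText that also accepts one-letter words such as "A" or "I".
    static String getRealText(String scanMe) {
        for (int i = 0; i < scanMe.length() - 1; i++) {
            char current = scanMe.charAt(i);
            char next = scanMe.charAt(i + 1);
            // Only the first letter of a word may start the result.
            boolean atWordStart = i == 0 || !Character.isLetter(scanMe.charAt(i - 1));
            if (atWordStart && Character.isUpperCase(current)
                    && (Character.isLowerCase(next) || Character.isWhitespace(next))) {
                return scanMe.substring(i);
            }
        }
        return null; // nothing that looks like a proper word
    }

    public static void main(String[] args) {
        System.out.println(getRealText("*** FOO *** I am fond of peanuts"));      // prints "I am fond of peanuts"
        System.out.println(getRealText("THIS IS A TEST - - +++ This is a test")); // prints "A TEST - - +++ This is a test"
    }
}
```

As the second call shows, accepting one-letter words means a stand-alone "A" inside all-caps text is treated as the start, so this only helps when the junk prefix contains no such word.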

Bill K 2009-02-18 00:41:39
