+7  Q: 

Java string search

Hi. If I am looking for a particular word inside a string (for example, "are" in the string "how are you"), would a plain indexOf() work faster and better, or a regex matches()?

String testStr = "how are you";
String lookUp = "are";

//METHOD1
if (testStr.indexOf(lookUp) != -1)
{
 System.out.println("Found!");
}

//OR
//METHOD 2
if (testStr.matches(".*"+lookUp+".*"))
{
 System.out.println("Found!");
}

Which of the two methods above is a better way of looking for a string inside another string? Or is there a much better alternative?

  • Ivard
+8  A: 

If you don't care whether it's actually the entire word you're matching, then indexOf() will be a lot faster.

If, on the other hand, you need to be able to differentiate between are, harebrained, aren't etc., then you need a regex: \bare\b will only match are as an entire word (\\bare\\b in Java).

\b is a word boundary anchor. It matches the empty position between a word character (letter, digit, or underscore) and a non-word character, and also at the start or end of the string when a word character sits there.

Caveat: This also means that if your search term isn't actually a word (let's say you're looking for ###), then these word boundary anchors will only match in a string like aaa###zzz, but not in +++###+++.
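
As a minimal sketch of the whole-word behaviour and the caveat described above (the class and method names here are made up for illustration; Pattern.quote is used so that a term like ### is treated literally):

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class WholeWordSearch {
    // True if `word` occurs in `text` with a word boundary on both sides.
    static boolean containsWholeWord(String text, String word) {
        // Pattern.quote escapes regex metacharacters in the search term.
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
        // find() looks for a match anywhere; no ".*" padding is needed.
        return p.matcher(text).find();
    }

    // Plain substring check, for comparison.
    static boolean containsSubstring(String text, String word) {
        return text.indexOf(word) != -1;
    }
}
```

With these, "harebrained" contains "are" as a substring but not as a whole word, and "###" is found as a "whole word" in "aaa###zzz" but not in "+++###+++".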

Tim Pietzcker
+1  A: 

Method one should be faster because it has less overhead. If the concern is performance when searching huge files, a specialized algorithm such as Boyer-Moore pattern matching could lead to further improvements.
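
To give a rough idea of what such a specialized search looks like, here is a sketch of the Horspool simplification of Boyer-Moore (not the full algorithm, which also uses a good-suffix rule). The class and method names are made up for illustration, and whether this actually beats String.indexOf on your data is something you would have to measure.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: Boyer-Moore-Horspool substring search.
class HorspoolSearch {
    // Returns the index of the first occurrence of pattern in text, or -1.
    static int indexOf(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        if (m == 0) return 0;
        if (m > n) return -1;

        // Bad-character table: for each character in pattern[0..m-2],
        // the distance from its last occurrence to the end of the pattern.
        Map<Character, Integer> shift = new HashMap<>();
        for (int i = 0; i < m - 1; i++) {
            shift.put(pattern.charAt(i), m - 1 - i);
        }

        int pos = 0;
        while (pos <= n - m) {
            // Compare the current window right to left.
            int j = m - 1;
            while (j >= 0 && text.charAt(pos + j) == pattern.charAt(j)) {
                j--;
            }
            if (j < 0) return pos;
            // Shift by the table entry for the window's last character;
            // characters not in the table allow a full shift of m.
            pos += shift.getOrDefault(text.charAt(pos + m - 1), m);
        }
        return -1;
    }
}
```

The gain comes from the shift step: when the last character of the window does not occur in the pattern, the search skips the whole pattern length at once instead of advancing by one.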

stacker
For some reason the link isn't displayed: http://en.wikipedia.org/wiki/Boyer–Moore_string_search_algorithm
stacker
@stacker: The dash in `Boyer-Moore` was really an en-dash (`U+2013`). I don't know offhand if that's legal in a URL, but SO doesn't seem to like it.
Alan Moore
A: 

If you are looking up one string inside another, you should use the indexOf or contains method. Example: see if "foo" is present in a string.

But if you are looking for a pattern, use the matches method.
Example: see if "foo" is present at the beginning or end of the string, or see if it's present as a whole word.

Using the matches method for simple string searching is not efficient because of the regex engine overhead.
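
A small sketch of that split, assuming the "foo" example above (the class and method names are invented for illustration). It uses Matcher.find() with anchors rather than matches(), so no ".*" padding is needed:

```java
import java.util.regex.Pattern;

// Hypothetical helper contrasting a plain search with a pattern search.
class FooChecks {
    // "foo" at the very start or the very end of the input.
    private static final Pattern AT_EDGE = Pattern.compile("^foo|foo$");

    // Plain substring search: no regex involved.
    static boolean containsFoo(String s) {
        return s.contains("foo");
    }

    // Pattern search: the position of "foo" matters.
    static boolean fooAtStartOrEnd(String s) {
        return AT_EDGE.matcher(s).find();
    }
}
```

For this particular case String.startsWith and String.endsWith would also do the job without a regex; the regex earns its keep once the condition gets more involved.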

codaddict
A: 

The first method is faster, and since this isn't a complex expression, there is no reason to use regex here.

Emil
+1  A: 

If you are looking for a fixed string, not a pattern, as in the example in your question, indexOf will be better (simpler) and faster, since it does not need to use regular expressions.

Also, if the string you are searching for does contain characters that have a special meaning in regular expressions, with indexOf you don't need to worry about escaping these characters.
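
To illustrate the escaping point, here is a small sketch with a search term that contains a regex metacharacter (the class and method names are made up; Pattern.quote is the standard way to make a term literal):

```java
import java.util.regex.Pattern;

// Hypothetical helper showing why unescaped search terms are risky in a regex.
class LiteralSearch {
    // Literal search: special characters need no treatment.
    static boolean viaIndexOf(String text, String term) {
        return text.indexOf(term) != -1;
    }

    // Term pasted straight into the regex: "+" becomes a quantifier.
    static boolean viaUnquotedRegex(String text, String term) {
        return text.matches(".*" + term + ".*");
    }

    // Term escaped with Pattern.quote, so it is matched literally.
    static boolean viaQuotedRegex(String text, String term) {
        return text.matches(".*" + Pattern.quote(term) + ".*");
    }
}
```

Searching "what is 1+1?" for "1+1" succeeds with indexOf and with the quoted regex, but fails with the unquoted one, because the regex 1+1 means "one or more 1s followed by a 1". A term such as "(" would be worse still: the unquoted version throws a PatternSyntaxException.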

In general, use indexOf where possible, and matches for pattern matching where indexOf cannot do what you need.

Grodriguez
A: 

Of course indexOf() is better than matches(). A single matches() call consists of many comparisons (a==a, r==r, e==e); on top of that, the wildcards you append on both sides have to be tried against many cases:

    ?are
    ??are
    ???are
    ????are
    ...
    are
    are?
    are??
    are???

and so on, until the candidate is as long as the original string.

shenju
A: 

Your question practically answers itself; if you have to ask whether regex is the better choice, it almost certainly isn't. Also, when you're choosing between regex and non-regex solutions, performance should never be your primary criterion. Wait until you've got some working code and profile it.
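
If you do get to the measuring stage, a crude timing loop like the following sketch can give a first impression. The class and method names are invented for illustration, and hand-rolled timings on the JVM are easily skewed by JIT warm-up and dead-code elimination, so treat the numbers as a hint only and reach for a real profiler or a benchmarking harness before drawing conclusions.

```java
import java.util.function.BooleanSupplier;
import java.util.regex.Pattern;

// Hypothetical, deliberately crude timing helper.
class SearchTiming {
    // Runs `check` the given number of times; returns elapsed nanoseconds.
    static long time(BooleanSupplier check, int iterations) {
        int hits = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            if (check.getAsBoolean()) hits++;
        }
        long elapsed = System.nanoTime() - start;
        // Use the counter so the loop body cannot be discarded outright.
        if (hits > iterations) throw new IllegalStateException("unreachable");
        return elapsed;
    }

    public static void main(String[] args) {
        String text = "how are you";
        Pattern compiled = Pattern.compile("are"); // compiled once, reused
        int n = 1_000_000;

        // Warm-up runs, results ignored.
        time(() -> text.indexOf("are") != -1, n);
        time(() -> text.matches(".*are.*"), n);
        time(() -> compiled.matcher(text).find(), n);

        System.out.println("indexOf:          " + time(() -> text.indexOf("are") != -1, n) + " ns");
        System.out.println("String.matches:   " + time(() -> text.matches(".*are.*"), n) + " ns");
        System.out.println("precompiled find: " + time(() -> compiled.matcher(text).find(), n) + " ns");
    }
}
```

The precompiled variant is included because String.matches() recompiles its pattern on every call, which is part of the overhead being compared.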

Alan Moore
A: 

A better way to compare the two versions is to analyze the source code of the indexOf method and the regex matches method themselves, working out the running time of both implementations in big-O notation and comparing their best, average, and worst cases (character sequence found at the start, middle, or end of the string, respectively). The source code is here: indexOf_source, and here: regex.matches. We need to do a run-time analysis of both to see exactly what each is doing. It's a tedious task, but it's the only way to make a true comparison; everything else is only assumption. Good question, though.

A_Var