ansaurus

Question

HTML Parsing/Scraping Algorithm Help..Java

Answer 1

+2 A:

You could do this with a regular expression:

for (String line: lines) {
    if (line.matches("[A-Z]+\\b.*")) {
        ...
    }
}

This matches any line that starts with one or more capital letters [A-Z]+, followed by a word boundary \\b, followed by anything else .*. Because String.matches() has to match the entire line, the trailing .* is what allows text after the name. You could drop the \\b.* if you expect only a single name on each line and nothing after it.
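To see which lines that pattern accepts, here is a small self-contained sketch. The class name, method name, and sample lines are made up for illustration; the pattern is the one above, precompiled so it is not rebuilt on every call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class CapsLineRegex {
    // Same pattern as above, compiled once instead of on every String.matches() call.
    private static final Pattern LEADING_CAPS = Pattern.compile("[A-Z]+\\b.*");

    /** Returns the lines whose first word consists only of the ASCII letters A-Z. */
    static List<String> linesStartingWithCaps(List<String> lines) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            // matches() must consume the whole line, hence the trailing ".*"
            if (LEADING_CAPS.matcher(line).matches()) {
                result.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample input
        List<String> sample = Arrays.asList(
                "AARON some text", "Aaron some text", "BETTY", "no caps here");
        System.out.println(linesStartingWithCaps(sample));
    }
}
```

"Aaron some text" is rejected because there is no word boundary between "A" and "a", so the capital run cannot end there.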

Alternatively, you could use String.split() to break the line into words and then check whether the first word is all caps:

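for (String line: lines) {
    // trim() and \\s+ keep leading or repeated whitespace from producing an empty first word
    String[] words = line.trim().split("\\s+");

    if (words.length > 0 && !words[0].isEmpty() && words[0].equals(words[0].toUpperCase())) {
        ...
    }
}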

Here \\s matches any space, tab, or other whitespace character.
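One caveat with comparing a word against its toUpperCase() form: a word with no letters at all, such as "42" or "--", is unchanged by upper-casing and therefore passes the test. If that matters for your data, a stricter check could look at each character instead. This is only a sketch; the class and method names are hypothetical, and it deliberately rejects names containing punctuation such as "O'BRIEN".

```java
class FirstWordCheck {
    /** True if the first whitespace-delimited word is non-empty and made up only of upper-case letters. */
    static boolean firstWordIsUpperCaseLetters(String line) {
        // trim() so that leading whitespace does not yield an empty first word
        String[] words = line.trim().split("\\s+");
        if (words.length == 0 || words[0].isEmpty()) {
            return false; // blank line
        }
        String first = words[0];
        for (int i = 0; i < first.length(); i++) {
            // Digits and punctuation are not upper-case letters, so they fail here
            if (!Character.isUpperCase(first.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical sample input
        String[] samples = { "AARON text", "42 text", "  BETTY", "Aaron text", "" };
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> " + firstWordIsUpperCaseLetters(s));
        }
    }
}
```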

John Kugelman 2009-10-10 20:47:32

Answer 2

+2 A:

String line = "AARON asdfasdflökj";

int i;
String cmp;

if( (i=line.indexOf(' ')) != -1 ) {
 cmp = line.substring( 0, i );
} else {
 cmp = line;
}

if( cmp.equals( cmp.toUpperCase() ) ) {
 // Line starts with all capitals
} else {
 // ...
}

The first if checks whether there is a space in the String line and, if so, keeps only the text before it. The second if checks whether that text is unchanged by toUpperCase(), i.e. that it contains no lower-case characters.
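The same logic can be wrapped in a helper so it is easy to try on several lines. This is a sketch, not part of the original snippet: the class and method names are invented, and two details are my own additions, namely the isEmpty() guard (a line starting with a space would otherwise count as all capitals) and Locale.ROOT (so the result does not depend on the default locale).

```java
import java.util.Locale;

class LeadingWord {
    /** True if the text before the first space is non-empty and contains no lower-case characters. */
    static boolean startsWithUpperCaseWord(String line) {
        int i = line.indexOf(' ');
        // Keep everything before the first space, or the whole line if there is none
        String first = (i != -1) ? line.substring(0, i) : line;
        // Added guard: an empty first word would trivially equal its upper-case form
        return !first.isEmpty() && first.equals(first.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(startsWithUpperCaseWord("AARON asdfasdflökj"));
        System.out.println(startsWithUpperCaseWord("Aaron asdfasdflökj"));
    }
}
```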

svens 2009-10-10 20:52:16

Answer 3

+3 A:

First, instead of reinventing the wheel, and because parsing bad HTML can be a pain, I'd use an existing HTML parser such as TagSoup or Jericho. Jericho would be my preference here, as it has built-in functionality to extract all text from HTML markup.

Then, I'd use a regex (\p{Upper}+) to extract all words in uppercase. See java.util.regex.
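A minimal sketch of the regex step, assuming the text has already been extracted from the HTML by one of the parsers above (no parser calls are shown here). The class and method names are hypothetical. I also wrapped the pattern in \b word boundaries, which is my own addition: without them, the bare \p{Upper}+ would also pick up the "A" in "Aaron".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UpperCaseWords {
    // \b on both sides so only whole words match; \p{Upper} is ASCII A-Z by default
    private static final Pattern UPPER_WORD = Pattern.compile("\\b\\p{Upper}+\\b");

    /** Returns every whole word in the text that consists only of upper-case ASCII letters. */
    static List<String> extract(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = UPPER_WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // Hypothetical plain text, as it might come out of an HTML text extractor
        System.out.println(extract("AARON plays guitar and Betty sings with CARL"));
    }
}
```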

Pascal Thivent 2009-10-10 21:09:29
