I am trying to parse HTML for specific data but am having trouble with return characters, or at least I think that is the problem. I am using a simple substring approach to take the HTML apart, since I know beforehand what I am looking for.

Here is my parse method:

public static void parse(String response, String[] hashItem, String[][] startEnd) throws Exception
{
    for (int i = 0; i < hashItem.length; i++)
    {
        // Skip past the start marker, then cut at the first end marker that follows it.
        String part = response.substring(response.indexOf(startEnd[i][0]) + startEnd[i][0].length());
        String value = part.substring(0, part.indexOf(startEnd[i][1]));
        DATABASE.setHash(hashItem[i], value);
    }
}

Here is a sample of the HTML that is giving me issues:

<table cellspacing=0 cellpadding=2 class=smallfont>
<tr onclick="lu();" onmouseover="style.cursor='hand'">
<td class=bodybox nowrap>&nbsp;     21,773,177,147 $&nbsp;</td><td></td>
<td class=bodybox nowrap>&nbsp;        629,991,926 F&nbsp;</td><td></td>
<td class=bodybox nowrap>&nbsp;             24,537 P&nbsp;</td><td></td>
<td class=bodybox nowrap>&nbsp;                  0 T&nbsp;</td>
<td></td><td class=bodybox nowrap>&nbsp;RT&nbsp;</td>

There are hidden return characters, but when I try to include them in the marker strings I search for, it doesn't work well, if at all. Is there a method, or perhaps a better way, to strip hidden characters from the HTML to make it easier to parse? Any help is greatly appreciated, as always.

A: 

You can try XmlPullParser, which is available on Android. You can use a StringBuffer to collect the characters between tags.

Rahul
A: 

Try using a regex to extract the information you want: http://java.sun.com/developer/technicalArticles/releases/1.4regex/

You could even use it to remove the hidden characters. Or maybe use String.replace to remove the newline characters?
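
As a rough sketch of the regex idea, assuming the cells always open with the exact text `<td class=bodybox nowrap>` as in the sample from the question: the class name `CellExtractor` and the method `extractCells` are made up for illustration. The DOTALL flag lets the pattern match across hidden line breaks, and the whitespace is collapsed afterwards.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CellExtractor {
    // Captures the text between <td class=bodybox nowrap> and the next </td>.
    // DOTALL lets '.' match \r and \n, so line breaks inside a cell do not break the match.
    private static final Pattern CELL = Pattern.compile(
            "<td class=bodybox nowrap>(.*?)</td>", Pattern.DOTALL);

    public static List<String> extractCells(String html) {
        List<String> values = new ArrayList<>();
        Matcher m = CELL.matcher(html);
        while (m.find()) {
            String cell = m.group(1)
                    .replace("&nbsp;", " ")   // literal entity text -> plain space
                    .replaceAll("\\s+", " ")  // collapse \r, \n, tabs and runs of spaces
                    .trim();
            values.add(cell);
        }
        return values;
    }

    public static void main(String[] args) {
        String html = "<td class=bodybox nowrap>&nbsp;     21,773,177,147 $&nbsp;</td><td></td>\r\n"
                + "<td class=bodybox nowrap>&nbsp;        629,991,926 F&nbsp;</td><td></td>\r\n";
        System.out.println(extractCells(html)); // prints [21,773,177,147 $, 629,991,926 F]
    }
}
```

This only holds up while the markup stays this regular; if the attributes change order or get quoted, the pattern would need adjusting.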

BeRecursive
I tried String.replaceAll("\n") but it still gave me issues.
Alejandro Huerta
I figured out where I went wrong with the replaceAll("\n", "") and it worked well, thank you.
Alejandro Huerta
+2  A: 

If you want to make parsing very easy, try Jsoup:

This example downloads the page, parses it, and gets the text of each matching cell.

Document doc = Jsoup.connect("http://jsoup.org").get();

Elements tds = doc.select("td.bodybox");

for (Element td : tds) {
  String tdText = td.text();
}
droidgren
Jsoup is working out pretty well, thank you. The only issue I can see at the moment is that it gives me extra characters at the beginning and end, for instance: "  21,670,510,504 $Â". I believe it is because of the "&nbsp;" within the HTML. Is there any way to have Jsoup delete that?
Alejandro Huerta
You can also replace afterwards: `tdText.replace(Jsoup.parse("&nbsp;").text(), " ");`
Michael Mrozek
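
For what it's worth, the same cleanup can be sketched with plain Java. `&nbsp;` decodes to the non-breaking space character U+00A0, which `String.trim()` leaves alone because it only strips characters up to U+0020. The class name `NbspCleaner` and the method `clean` are hypothetical, and the input string stands in for what `td.text()` might return. The stray "Â" may be a separate encoding symptom (a UTF-8 non-breaking space read as Latin-1), which this sketch does not address.

```java
public class NbspCleaner {
    // U+00A0 is the character that the &nbsp; entity decodes to.
    private static final char NBSP = '\u00A0';

    // Turns every non-breaking space into a regular space, then trims both ends.
    public static String clean(String text) {
        return text.replace(NBSP, ' ').trim();
    }

    public static void main(String[] args) {
        // Assumed shape of the extracted cell text: NBSP, padding spaces, value, NBSP.
        String tdText = "\u00A0     21,670,510,504 $\u00A0";
        System.out.println("[" + clean(tdText) + "]"); // prints [21,670,510,504 $]
    }
}
```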
A: 

As far as I know, you can parse the HTML file using an XMLReader, for example. See this article: http://www.ibm.com/developerworks/xml/library/x-andbene1/

Kharizmi