I'm working with the Android SDK, which is Java minus some things.
I have a solution that extracts matches for two regex patterns from web pages. The problem I'm having is that it also finds matches inside HTML tags. I tried jTidy, but it was just too slow on Android. I'm not sure why, but my Scanner regex match solution beats it many times over.
Currently, I grab the page source into an InputStream:
is = uconn.getInputStream();
and then match and extract like this:
Scanner scanner = new Scanner(is, "UTF-8");
String match = "";
while (match != null) {
    // A horizon of 0 means no bound: search the rest of the input.
    match = scanner.findWithinHorizon(extractPattern, 0);
    if (match != null) {
        String matchit = scanner.match().group(grp);
        // ... use matchit ...
    }
}
It works very nicely and is fast.
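To make the problem reproducible, here is a minimal self-contained sketch of the same loop. The class name, the method name, and the digit pattern are made up for illustration; my real pattern is more involved.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

final class ScannerExtractDemo {
    // Hypothetical stand-in for my real pattern: any run of three or more digits.
    static final Pattern EXTRACT_PATTERN = Pattern.compile("(\\d{3,})");

    // Collects the given capture group from every match in the stream.
    static List<String> extract(InputStream in, Pattern pattern, int grp) {
        List<String> results = new ArrayList<>();
        Scanner scanner = new Scanner(in, "UTF-8");
        // A horizon of 0 means no bound: search the rest of the input.
        while (scanner.findWithinHorizon(pattern, 0) != null) {
            results.add(scanner.match().group(grp));
        }
        scanner.close();
        return results;
    }
}
```

Fed the input `<a href="/item/12345">Call 5551234</a>`, this returns both 12345 (from inside the tag) and 5551234, and the first one is exactly the kind of hit I want to avoid.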
My regex pattern is already kind of crazy: it is actually two patterns joined with an alternation, like this: (p1|p2)
Any ideas on how I can match "but not inside HTML tags", or strip the HTML tags up front? If I can exclude HTML tags from my source, that will likely speed up my interface significantly, as I have a few other things I need to do with the raw data.
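To make it concrete, here is a rough sketch of the kind of thing I'm after, not something I have validated on real pages. It adds a tag-consuming alternative in front of (p1|p2), so a whole tag is swallowed as one match and discarded. P1, P2, the class name, and the method name are hypothetical placeholders, and I'm assuming that no attribute value contains a literal `>` and that script blocks and comments can be ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

final class TagSkippingSketch {
    // Hypothetical stand-ins for my two real patterns.
    static final String P1 = "\\d{3,}";
    static final String P2 = "[A-Z]{2,}";

    // First alternative consumes an entire tag (assumes no '>' inside attributes);
    // second alternative captures p1|p2 in group 1.
    static final Pattern TAG_OR_TARGET =
            Pattern.compile("<[^>]*>|(" + P1 + "|" + P2 + ")");

    static List<String> extractOutsideTags(String html) {
        List<String> results = new ArrayList<>();
        Scanner scanner = new Scanner(html);
        while (scanner.findWithinHorizon(TAG_OR_TARGET, 0) != null) {
            // group(1) is null when the match was a tag, so those are skipped.
            String hit = scanner.match().group(1);
            if (hit != null) {
                results.add(hit);
            }
        }
        scanner.close();
        return results;
    }
}
```

With the input `<a href="/item/12345" class="XL">Call 5551234 or ABC</a>`, this yields 5551234 and ABC, and skips the 12345 and XL that sit inside the tag. I don't know whether this holds up on messy real-world markup, which is part of what I'm asking.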