I'm trying to parse some strings from a web page, but the strings keep arriving broken up across reads, and I have no way to check whether a string is complete. At the moment I receive the page in pieces through a 1024-byte buffer. What should I do to make sure I get the full string, preferably without an overly large buffer?

A: 

I'm not totally sure I understand what you're doing and what you mean by a "broken string," but I'll try and give you an answer.

By broken string, I'll assume you mean that the data stops short of some logical ending to a piece of HTML or text. Ultimately, you've got no way to know but to parse what you have, and if you aren't at some logical stopping point, keep reading. If you're using a fixed-size char[] to hold the data, you're certain to run into trouble once the data outgrows it, so grow the buffer as you go. Depending on how you read the data in, the method may change, but the process is roughly:

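(kinda C; readInNBytes and notAtLogicalEnd are placeholders you'd write yourself)

size_t allocLen = 1024, used = 0;
char *buffer = malloc(allocLen);          /* check for NULL in real code */
for (;;) {
     /* placeholder: stores up to n bytes, returns how many it got (0 at end of input) */
     size_t n = readInNBytes(buffer + used, allocLen - used - 1);
     used += n;
     buffer[used] = '\0';
     if (n == 0 || !notAtLogicalEnd(buffer))
          break;                          /* we're done */
     if (used + 1 == allocLen)            /* full: double the buffer and keep reading */
          buffer = realloc(buffer, allocLen *= 2);   /* check for NULL in real code */
}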

Now, obviously this leaves out the detail of determining whether your string is broken, but that's still open to interpretation. There are several ways you could check whether you're at a valid end of the data: look for space characters, line breaks, and so on, or check whether the HTML terminates with a </html> tag. Either way, you've gotta read the whole data set in.
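
To make that a bit more concrete, here's one way the loop could be filled in. This is only a sketch under my own assumptions: the page arrives through a stdio FILE * (with a socket you'd swap fread for recv), the "logical end" is simply the first appearance of </html>, and the names read_page and at_logical_end are made up for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK 1024  /* same piece size as in the question */

/* One possible "logical end" test (an assumption): the closing tag has
   shown up somewhere in the text read so far. Case-sensitive for brevity. */
static int at_logical_end(const char *text)
{
    return strstr(text, "</html>") != NULL;
}

/* Reads from `in` in CHUNK-sized pieces until at_logical_end() is satisfied
   or the input runs out. Returns a malloc'd, NUL-terminated string that the
   caller must free, or NULL if memory could not be allocated. */
char *read_page(FILE *in, size_t *out_len)
{
    assert(in != NULL);

    size_t cap = CHUNK + 1, used = 0;
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;
    buf[0] = '\0';

    for (;;) {
        /* Make sure one more chunk plus the terminator fits. */
        if (cap - used < CHUNK + 1) {
            char *bigger = realloc(buf, cap * 2);
            if (bigger == NULL) {
                free(buf);
                return NULL;
            }
            buf = bigger;
            cap *= 2;
        }

        size_t n = fread(buf + used, 1, CHUNK, in);
        used += n;
        buf[used] = '\0';

        /* Stop at end of input or once the data looks complete. */
        if (n == 0 || at_logical_end(buf))
            break;
    }

    if (out_len != NULL)
        *out_len = used;
    return buf;
}
```

Rescanning the whole buffer after every read is the simplest thing to write, not the fastest. For big pages you'd only need to look at the newly read bytes plus the last few bytes that came before them.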

I'd be curious to know how you're reading in the HTML data and what exactly you mean by 'broken string'; once I know, I'll revise my answer.

Tony k
A: 

I think what you are trying to say is that your string doesn't always arrive within a single fill of your buffer. If that is so, there are basically two options.

  1. Use HUGE buffers. This can't guarantee that a string never gets split across two reads, but it lowers the chance significantly.
  2. If you know the max length of the string that you are looking for, you can keep two buffers. The first holds the part that you just received and the other holds the previous part, so a string that straddles the boundary can still be found by searching across both. You need to know the max length of the string because each buffer needs to be at least that size.

The second solution is by far the better one but it does rely on the knowledge of the max length of strings.
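
Here is a rough sketch of the second option in C. It's a variation on the idea, not a literal two-buffer setup: instead of keeping the whole previous buffer, it carries over only the last few bytes (one less than the length of the string being searched for) and searches that tail together with the new chunk. The names Scanner, scanner_init and scanner_feed and the 64-byte limit are my own assumptions for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 1024      /* size of each received piece, as in the question */
#define MAX_NEEDLE 64   /* assumed upper bound on the string searched for */

/* Search state: the tail of the previous chunk followed by the current chunk. */
typedef struct {
    char window[MAX_NEEDLE + CHUNK];
    size_t tail_len;    /* bytes carried over from the previous chunk */
    size_t consumed;    /* position in the overall stream of window[0] */
} Scanner;

void scanner_init(Scanner *s)
{
    s->tail_len = 0;
    s->consumed = 0;
}

/* Feed one received chunk (len <= CHUNK). Returns the position in the
   overall stream where `needle` first starts, or -1 if it has not been
   seen yet. */
long scanner_feed(Scanner *s, const char *chunk, size_t len, const char *needle)
{
    size_t nlen = strlen(needle);
    assert(len <= CHUNK && nlen > 0 && nlen <= MAX_NEEDLE);

    /* Put the new chunk right after the carried-over tail. */
    memcpy(s->window + s->tail_len, chunk, len);
    size_t total = s->tail_len + len;

    /* Search tail + chunk, so a match split across the boundary is found. */
    for (size_t i = 0; i + nlen <= total; i++) {
        if (memcmp(s->window + i, needle, nlen) == 0)
            return (long)(s->consumed + i);
    }

    /* Keep only the last nlen - 1 bytes: anything earlier cannot be the
       start of a match that finishes in a later chunk. */
    size_t keep = total < nlen - 1 ? total : nlen - 1;
    memmove(s->window, s->window + total - keep, keep);
    s->consumed += total - keep;
    s->tail_len = keep;
    return -1;
}
```

You'd call scanner_init once, then scanner_feed after every receive until it returns something other than -1. Memory use stays at one chunk plus the max string length no matter how long the page is.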

Matt Campbell
A: 

This is only tangentially related to your question, but you may be solving the wrong problem. For years I used to scrape HTML from web pages to try to get at certain strings. Then, after hearing about the Chickenfoot extension to Firefox, I realized it would be much easier to use the w3m web browser to convert HTML to ASCII and then scrape the ASCII using a standard mechanism like LPEG or parser combinators. This idea doesn't work for every problem, but when it does, it is usually much, much easier than scraping HTML.

For example, I recently used this technique to scrape lyrics to over 200,000 songs for a homework assignment.

Norman Ramsey