Hello,

I am designing a crawler which will get certain content from a webpage (using either string manipulation or regex).

I'm able to get the contents of the webpage as a response stream (using HttpWebRequest), and then, for testing/dev purposes, I write the stream content to a multi-line textbox in my ASP.NET webpage.

Is it possible for me to loop through the content of the textbox (or save the textbox text as a string variable) and then say "if textbox1.Text contains a certain string, increment a count"? The problem with the textbox is that the string loses its formatting, so it's one long line with no line breaks. Can that be changed?

I'd like to do this rather than write the content to a file, because writing to a file means I would have to handle all sorts of external issues. Of course, if this is the only way, then so be it. If I do have to write to a file, what's the best strategy to loop through each and every line, looking for a condition? (I'm a little overwhelmed and thus confused, as there are many logical and language methods to use.) So if I want to look for the string "Hello" in the following text:

My name is xyz I am xyz years of age Hello blah blah blah Bye

When I reach "Hello", I want to increment an integer variable.

Thanks,

A: 

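I do it this way in a project. There may be a better way to do it, but this works :)

string template = txtTemplate.Text;
// Split on the full newline sequence; splitting on its individual characters
// ("\r" and "\n") would produce an empty entry between every pair of lines.
string[] lines = template.Split(new[] { Environment.NewLine }, StringSplitOptions.None);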
Søren Pedersen
A: 

That is a nice creative way.

However, I am returning a complex HTML document (for testing purposes, I am using Microsoft's homepage so I get all the HTML). Do I not have to specify where I want to break the line?

Given your method, if each line is in a collection (which is a thought I had), then I can loop through each member of the collection and look for the condition I want.

You should've added this as a comment to the above answer
Jon Limjap
+1  A: 

In my opinion you can split the content of the text in words instead of lines:

public int CountOccurences(string searchString)
{
    int i = 0; // must be initialized before it is incremented
    var words = txtBox.Text.Split(' ');

    foreach (var s in words)
        if (s.Contains(searchString))
            i++;

    return i;
}

No need to preserve linebreaks, if I understand your purpose correctly.

Also note that this will not work for multiple word searches.
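
One way around that limitation, and around repeated matches inside a single word such as "testtest", is to scan the whole string with IndexOf instead of splitting it. The following is only a minimal sketch: the class name `TextSearch` is hypothetical, matches are counted without overlap, and the comparison is assumed to be case-sensitive.

```csharp
using System;

public static class TextSearch
{
    // Hypothetical helper: counts non-overlapping occurrences of searchString in text.
    public static int CountOccurrences(string text, string searchString)
    {
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(searchString))
            return 0;

        int count = 0;
        int index = 0;

        // IndexOf returns -1 when there is no further match.
        while ((index = text.IndexOf(searchString, index, StringComparison.Ordinal)) >= 0)
        {
            count++;
            index += searchString.Length; // continue searching after this match
        }

        return count;
    }
}
```

For example, `TextSearch.CountOccurrences(txtBox.Text, "Hello")` would take the place of the word loop. If "Hello" and "hello" should both count, `StringComparison.OrdinalIgnoreCase` could be used instead.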

Jon Limjap
If I searched for "test" in the following string, it'd only give me 1 occurrence: "test, test. test: and test" - This is clearly wrong.
Simon Johnson
No way, that'd give you 4. The split would give you an array with "test,", "test.", "test:", "and", "test". Contains is a substring search, not an exact one. I do admit however that this won't work with, say "testtest testone".
Jon Limjap
A: 

If textbox contents were returned with line breaks representing where word-wrapping occurs, the result would depend on style (e.g. font size, width of the textbox, etc.) rather than on what the user actually entered. Depending on what you actually want to do, this is almost certainly NOT what you want.

If the user physically presses the 'carriage return / enter' key, the relevant character(s) will be included in the string.
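
Since those characters are part of the string, the text can be split on them. Note that HTML fetched from another server may use "\n" alone rather than "\r\n", so a split on Environment.NewLine could miss breaks. A small sketch that accepts either convention follows; the `LineSplitter` class and `SplitLines` method are hypothetical names used only for illustration.

```csharp
using System;

public static class LineSplitter
{
    // Hypothetical helper: splits text on Windows (\r\n), Unix (\n) and bare \r line endings.
    public static string[] SplitLines(string text)
    {
        // "\r\n" is listed first so that it is treated as one separator, not two.
        return text.Split(new[] { "\r\n", "\n", "\r" }, StringSplitOptions.None);
    }
}
```

Each element of the returned array can then be checked with Contains in a foreach loop.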

Bobby Jack
A: 

Why do you need to have a textbox at all? Your real goal is to increment a counter based on the text that the crawler finds. You can accomplish this just by examining the stream itself:

  Stream response = webRequest.GetResponse().GetResponseStream();
  StreamReader reader = new StreamReader(response);
  string line;

  // ReadLine() returns null once the end of the stream is reached
  while ((line = reader.ReadLine()) != null)
  {
    if (line.Contains("hello"))
    {
      // increment your counter
    }
  }

Extending this to handle a line that contains more than one instance of the string in question is left as an exercise to the reader :).

You can still write the contents to a text box if you want to examine them manually, but attempting to iterate over the lines of the text box is simply obscuring the problem.

JSBangs
A: 

The textbox was to show the contents of the html page. This is for my use so if I am running the webpage without any breakpoints, I can see if the stream is visually being returned. Also, it's a client requirement so they can see what is happening at every step. Not really worth the extra lines of code but it's trivial really, and the last of my concerns.

The code in the while loop I don't understand. Where is the instruction to go to the next line? This is my weakness with the readline method, as I seldom see the logic that forces the next line to be read.

I do need to store the line as a string variable whenever a certain string is found, as I will need to do some operations on it (get a certain part of the string), so I've always been looking at ReadLine.

Thanks!

The logic for going to the next line is in the loop condition itself, which in compilable C# reads "while ((line = reader.ReadLine()) != null)". Each call to ReadLine() advances the reader to the next line and returns the line just read as a string. When there are no more lines, it returns null, which causes the condition to become false.
JSBangs
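
To make the advancing behaviour concrete, and to keep each matching line for later processing as the asker wants, here is a rough sketch. The `LineScanner` class and `FindMatchingLines` method are hypothetical names; the method takes a TextReader so that it can be given either the StreamReader over the response or a StringReader over text already in memory.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class LineScanner
{
    // Hypothetical helper: returns every line from the reader that contains searchString.
    public static List<string> FindMatchingLines(TextReader reader, string searchString)
    {
        var matches = new List<string>();
        string line;

        // Each ReadLine() call consumes one line and moves the reader forward;
        // it returns null once the input is exhausted, which ends the loop.
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Contains(searchString))
                matches.Add(line); // keep the whole line for later operations
        }

        return matches;
    }
}
```

The count of matching lines is then just the Count property of the returned list, and each stored line is still available for Substring or regex work.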