First of all, this is a tough thing to solve. So far I haven't come up with a good example, but I hope someone here will figure this out. I hope there is a known way to solve this kind of problem, or an obscure algorithm.

Scenario:

  • In my application I make several requests to the very same webpage
  • The webpage has dynamic and random content in it (datetime, quote of the day, etc.; in theory it can be anything)
  • The response has 2 cases, let's call them "TRUE" and "FALSE". For example, sometimes the response contains a "True Text" and sometimes a "False Text".
  • My application knows 3 samples of the "TRUE" case and 3 samples of the "FALSE" case, but these also include random content such as the time.

Challenge

  • Now when my application gets a new response, how can I tell whether this response is an example of the "TRUE" or the "FALSE" case?

What I've tried

  • Process the first sample of TRUE case line by line and generate an integer array from the value of characters
  • Do the same thing for second TRUE sample
  • Do the same thing for third TRUE sample
  • Analyse the differences between these stored TRUE cases and create a new array with
  • Now I know which lines are dynamic (such as datetime), so I create a final TRUE case array which stores only the static lines.
  • When I get a new case, I create a similar array and compare it with the previously stored final TRUE case. If it matches (except for the filtered lines) it's a TRUE case; if the other lines have changed massively (there is a tolerance value) then it's FALSE.
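
The steps above can be roughly sketched in Python like this. This is only an illustration, not my actual class: the names static_lines and is_true_case and the tolerance default of 0.2 are made up for the example, and it compares whole lines directly instead of building integer arrays from the characters.

```python
def static_lines(samples):
    """Return {line_index: text} for the lines that are identical in every sample."""
    split = [s.splitlines() for s in samples]
    shortest = min(len(lines) for lines in split)
    return {
        i: split[0][i]
        for i in range(shortest)
        if all(lines[i] == split[0][i] for lines in split)
    }


def is_true_case(unknown, true_static, tolerance=0.2):
    """True if at most `tolerance` (a fraction) of the static lines differ."""
    if not true_static:
        return False  # nothing static to compare against
    lines = unknown.splitlines()
    mismatches = sum(
        1
        for i, text in true_static.items()
        if i >= len(lines) or lines[i] != text
    )
    return mismatches / len(true_static) <= tolerance
```

Because lines are matched by position, a single inserted or removed line shifts everything after it, which is one of the reasons this kind of approach breaks down.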

The limitations and weaknesses of this algorithm are pretty obvious. I've got good results in some cases, but it doesn't work as expected all the time.

My current class works like this:

Dim Analyser As New ContentAnalyzer()
Analyser.AddTrueCase(True1Html)
Analyser.AddTrueCase(True2Html)
Analyser.AddTrueCase(True3Html)

'This will return True if the UnknownHtml is similar to TRUE case, otherwise False
Analyser.IsThisTrue(UnknownHtml)

Sorry the title doesn't make much sense; I couldn't find a good way to describe it.

+1  A: 

Either this is really misstated or I'm just not getting something:

The application requests the web page, gets it, and has to ascertain whether it is another "True" or "False", right? That is to say, the response doesn't state true or false up front, which is where my first confusion is.

Secondly, why aren't you doing a similar comparison on the false cases, and seeing whether there are sufficient similarities, so that some random requested page falls into one of 3 buckets:

1) Page is more similar to true and thus is viewed as true.

2) Page is more similar to false and thus is viewed as false.

3) Page isn't more similar to either and thus the result is something like a null or exception situation as it isn't possible to discern which result makes sense.
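
A minimal sketch of those three buckets in Python, assuming a plain string similarity from the standard library (difflib.SequenceMatcher) is good enough as the measure. The function names and the margin of 0.05 are hypothetical, picked only for illustration:

```python
from difflib import SequenceMatcher


def best_similarity(unknown, samples):
    """Highest similarity ratio (0.0 to 1.0) between `unknown` and any sample."""
    return max(
        SequenceMatcher(None, unknown, s, autojunk=False).ratio()
        for s in samples
    )


def classify(unknown, true_samples, false_samples, margin=0.05):
    """Return True, False, or None when neither side is clearly closer."""
    t = best_similarity(unknown, true_samples)
    f = best_similarity(unknown, false_samples)
    if abs(t - f) < margin:
        return None  # bucket 3: cannot discern which result makes sense
    return t > f
```

Returning None for the third bucket forces the caller to deal with the undecided case explicitly rather than silently treating it as false.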

Example of where that 3rd case could happen: Suppose the page contains an integer and if positive the result is true and if negative the result is false. What if the result is 0? Does 0 count as positive since it is equal to its absolute value or does it count as a negative for some reason?

Or am I way off in what you are trying to do here?

JB King
Basically when I send "?true=1" I always know it'll return the TRUE case, and when I send "?false=1" it'll always return FALSE. But when I send "?random=1" it'll return a random case, which I need to figure out. I think that's where you got confused.
dr. evil
As you said, my algorithm is pretty primitive and misses lots of cases, although I like the idea of an extra check against the FALSE base, which makes a lot of sense. But I feel like I got this wrong from the beginning; there should be a better way to do this.
dr. evil
Ah, so there are a couple of different parts to it and it is just the case where the random is returned that you have to figure out what is going on. Now I see the problem a bit better but this should be researched somewhere I'd think.
JB King
+2  A: 

It sounds like you're doing fairly simple document classification. This is a heavily researched field, especially lately due to spam filters. Look into a library for document classification in your language of choice.

Classifier4j looks like a popular library that runs on the Java VM and has been ported to .NET.

RossFabricant
I think you're right. I'll look into the document classification subject and try to apply it to this.
dr. evil
This is definitely the way to go. Thanks a lot. I'm just a little bit worried about processing power, since I need to execute this for new signatures many times and constantly, and a little bit worried about the big class and license issues, but hey, that's another story :)
dr. evil
Thanks a lot, I love SO when someone solves a problem within 5 minutes after I spent a week on it!
dr. evil
You're welcome. As the saying goes "Weeks in the lab can save you hours in the library." Books like The Algorithm Design Manual can help you get a sense for how to apply existing algorithms to your problems.
RossFabricant
A: 

Perhaps you mean something like Bayesian Filtering? You could look at what Paul Graham has done with Spam: http://www.paulgraham.com/better.html
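
To give an idea of the general approach, here is a small multinomial naive Bayes sketch in Python. It is not Paul Graham's exact method, just the textbook version with add-one smoothing; the NaiveBayes class and the tokens helper are made up for this example, and it assumes at least one training sample per label. Only alphabetic words are used as tokens, so times and dates made of digits drop out on their own.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase alphabetic words only; digits (times, dates) are ignored."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def train(self, text, label):
        self.word_counts[label].update(tokens(text))
        self.doc_counts[label] += 1

    def log_score(self, text, label):
        counts = self.word_counts[label]
        vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        total = sum(counts.values())
        # Log prior; assumes this label has at least one training sample.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for word in tokens(text):
            # Add-one smoothing, with one extra slot reserved for unseen words.
            score += math.log((counts[word] + 1) / (total + len(vocab) + 1))
        return score

    def classify(self, text):
        return self.log_score(text, True) > self.log_score(text, False)
```

You would call train once for each of the 3 TRUE and 3 FALSE samples, then classify on each unknown response.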

Svante