ansaurus

Question

How to understand if the static part of the text has been changed? (diff algorithm related)

Answer 1

+1 A:

Either this is really misstated or I'm just not getting something:

The application requests a web page and then has to work out whether the response is a "True" page or a "False" page, right? In other words, the response doesn't simply state true or false up front, which is where my first confusion lies.

Secondly, why not run the same comparison against the false cases as well, and see whether the similarities are strong enough to sort any randomly requested page into one of 3 buckets:

1) Page is more similar to true and thus is viewed as true.

2) Page is more similar to false and thus is viewed as false.

3) Page isn't more similar to either and thus the result is something like a null or exception situation as it isn't possible to discern which result makes sense.

Example of where that 3rd case could happen: Suppose the page contains an integer and if positive the result is true and if negative the result is false. What if the result is 0? Does 0 count as positive since it is equal to its absolute value or does it count as a negative for some reason?
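A minimal sketch of the three buckets above, assuming the known TRUE and FALSE responses are available as plain strings. The function names and the `margin` threshold are illustrative assumptions, not anything from the question; the similarity measure here is Python's `difflib.SequenceMatcher`, which is only one of several reasonable choices.

```python
from difflib import SequenceMatcher
from typing import Optional


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def classify(page: str, true_sample: str, false_sample: str,
             margin: float = 0.05) -> Optional[bool]:
    """Return True, False, or None when neither baseline is clearly closer.

    `margin` is an assumed tuning knob: if the two similarity scores
    differ by less than this, the page falls into the third bucket.
    """
    to_true = similarity(page, true_sample)
    to_false = similarity(page, false_sample)
    if abs(to_true - to_false) < margin:
        return None  # bucket 3: cannot tell which case this is
    return to_true > to_false


if __name__ == "__main__":
    true_page = "<html><body>Welcome back, user. Your order was found.</body></html>"
    false_page = "<html><body>Sorry, nothing matched your query.</body></html>"
    unknown = "<html><body>Welcome back, admin. Your order was found.</body></html>"
    print(classify(unknown, true_page, false_page))
```

Returning `None` rather than forcing a guess keeps the undecidable case visible to the caller, who can then retry the request or flag the page for manual review.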

Or am I way off in what you are trying to do here?

JB King 2009-04-02 22:25:29

Basically, when I send "?true=1" I know it will always return the TRUE case, and when I send "?false=1" it will always return the FALSE case. But when I send "?random=1" it returns a random case, which I need to identify. I think that's where you got confused.

dr. evil 2009-04-02 22:35:28

As you said, my algorithm is pretty primitive and misses lots of cases, although I like the idea of an extra check against the FALSE baseline, which makes a lot of sense. But I feel like I got this wrong from the beginning; there should be a better way to do this.

dr. evil 2009-04-02 22:36:38

Ah, so there are a couple of different parts to it, and it is only when the random case is returned that you have to figure out what is going on. Now I see the problem a bit better, but I'd think this has already been researched somewhere.

JB King 2009-04-02 22:51:34

Answer 2

+2 A:

It sounds like you're doing fairly simple document classification. This is a heavily researched field, especially lately due to spam filters. Look into a library for document classification in your language of choice.

Classifier4j looks like a popular library that runs on the Java VM and has been ported to .NET.

RossFabricant 2009-04-02 22:31:22

I think you're right. I'll look into document classification and try to apply it here.

dr. evil 2009-04-02 22:37:20

This is definitely the way to go. Thanks a lot. I'm just a little worried about processing power, since I need to run this constantly for many new signatures, and a little worried about the size of the library and license issues, but hey, that's another story :)

dr. evil 2009-04-02 22:52:33

Thanks a lot, I love SO when someone solves a problem within 5 minutes after I've spent a week on it!

dr. evil 2009-04-03 21:31:47

You're welcome. As the saying goes "Weeks in the lab can save you hours in the library." Books like The Algorithm Design Manual can help you get a sense for how to apply existing algorithms to your problems.

RossFabricant 2009-04-03 21:44:21

Answer 3

A:

Perhaps you mean something like Bayesian filtering? You could look at what Paul Graham has done with spam: http://www.paulgraham.com/better.html
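To make the idea concrete, here is a small naive Bayes sketch with add-one smoothing. It is a textbook formulation, not the exact scoring scheme from the linked essay, and the class name, labels, and training strings are all made up for illustration.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes over word counts (illustrative sketch)."""

    def __init__(self) -> None:
        self.word_counts: Dict[str, Counter] = {}
        self.doc_counts: Counter = Counter()

    def train(self, label: str, text: str) -> None:
        """Record one labelled example, e.g. a known TRUE or FALSE page."""
        self.doc_counts[label] += 1
        self.word_counts.setdefault(label, Counter()).update(tokenize(text))

    def classify(self, text: str) -> str:
        """Return the label with the highest log-probability for `text`."""
        if not self.word_counts:
            raise ValueError("classifier has not been trained")
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_docs = sum(self.doc_counts.values())
        tokens = tokenize(text)
        best_label, best_score = "", -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(vocab)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    nb = NaiveBayes()
    nb.train("TRUE", "login successful welcome back")
    nb.train("FALSE", "login failed invalid password")
    print(nb.classify("welcome back"))
```

In the setting from the question, the training examples would be the pages returned for the known TRUE and FALSE requests, and each random response would be passed to `classify`.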

Svante 2009-04-02 22:49:21
