The kernel trick maps a non-linear problem into a linear problem.

My questions are:
1. What is the main difference between a linear and a non-linear problem? What is the intuition behind the difference between these two classes of problems? And how does the kernel trick help us use linear classifiers on a non-linear problem?
2. Why is the dot product so important in the two cases?

Thanks.

+2  A: 
  1. Linear equations are homogeneous, and superposition applies. You can create solutions using combinations of other known solutions; this is one reason why Fourier transforms work so well. Non-linear equations are not homogeneous, and superposition does not apply. Non-linear equations usually have to be solved numerically using iterative, incremental techniques.
  2. I'm not sure how to express the importance of the dot product, but it takes two vectors and returns a scalar. Certainly a solution to a scalar equation is less work than solving a vector or higher-order tensor equation, simply because there are fewer components to deal with.

My intuition in this matter is based more on physics, so I'm having a hard time translating to AI.

duffymo
Could you explain how the kernel trick translates one to another?
kunjaan
Hard to describe quickly, but I think the idea is the same one that makes data of a certain type look like a straight line when you plot it on log or semi-log graph paper. (Do they show kids what log and semi-log graph paper is these days?)
duffymo
+3  A: 

The main difference (for practical purposes) is: a linear problem either has a solution (and then it's easily found), or you get a definite answer that there is no solution at all. You know this much before you even see the particular problem. As long as it's linear, you'll get an answer, and quickly.

The intuition behind this is the fact that if you have two straight lines in some space, it's pretty easy to see whether they intersect or not, and if they do, it's easy to know where.

If the problem is not linear -- well, it can be anything, and you know just about nothing.

The dot product of two vectors just means the following: The sum of the products of the corresponding elements. So if your problem is

c1 * x1 + c2 * x2 + c3 * x3 = 0

(where you usually know the coefficients c, and you're looking for the variables x), the left hand side is the dot product of the vectors (c1,c2,c3) and (x1,x2,x3).

The above equation is (pretty much) the very definition of a linear problem, so there's your connection between the dot product and linear problems.
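
To make that concrete, here's a minimal Python sketch (the coefficient and variable values are made up for the illustration):

    # Dot product: the sum of the products of corresponding elements.
    c = [2.0, -1.0, 0.5]   # known coefficients c1, c2, c3 (made-up values)
    x = [1.0,  3.0, 2.0]   # candidate values for x1, x2, x3

    lhs = sum(ci * xi for ci, xi in zip(c, x))  # c1*x1 + c2*x2 + c3*x3
    print(lhs)  # 0.0 -- this particular x happens to solve the equation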

balpha
+1. Not sure why this was -1ed -- it addresses the first part of part 1 of the question, which other posts didn't address.
j_random_hacker
+16  A: 

Many classifiers, among them the linear Support Vector Machine (SVM), can only solve problems that are linearly separable, i.e. where the points belonging to class 1 can be separated from the points belonging to class 2 by a hyperplane.

In many cases, a problem that is not linearly separable can be solved by applying a transform phi() to the data points; this transform is said to transform the points to feature space. The hope is that, in feature space, the points will be linearly separable. (Note: This is not the kernel trick yet... stay tuned.)

It can be shown that, the higher the dimension of the feature space, the greater the number of problems that are linearly separable in that space. Therefore, one would ideally want the feature space to be as high-dimensional as possible.

Unfortunately, as the dimension of feature space increases, so does the amount of computation required. This is where the kernel trick comes in. Many machine learning algorithms (among them the SVM) can be formulated in such a way that the only operation they perform on the data points is a scalar product between two data points. (I will denote a scalar product between x1 and x2 by <x1, x2>.)

If we transform our points to feature space, the scalar product now looks like this:

<phi(x1), phi(x2)>

The key insight is that there exists a class of functions called kernels that can be used to optimize the computation of this scalar product. A kernel is a function K(x1, x2) that has the property that

K(x1, x2) = <phi(x1), phi(x2)>

for some function phi(). In other words: We can evaluate the scalar product in the low-dimensional data space (where x1 and x2 "live") without having to transform to the high-dimensional feature space (where phi(x1) and phi(x2) "live") -- but we still get the benefits of transforming to the high-dimensional feature space. This is called the kernel trick.
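
As a tiny illustration (a hand-rolled Python sketch, not from any library; the points are made up), the degree-2 polynomial kernel K(x1, x2) = <x1, x2>^2 on 2-D points corresponds to an explicit phi() into 3-D, and both routes give the same number:

    import math

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    def phi(x):
        # Explicit feature map for the degree-2 polynomial kernel in 2-D:
        # (a, b) -> (a^2, b^2, sqrt(2)*a*b)
        a, b = x
        return (a * a, b * b, math.sqrt(2) * a * b)

    def K(x1, x2):
        # Kernel trick: evaluates <phi(x1), phi(x2)> without computing phi().
        return dot(x1, x2) ** 2

    x1, x2 = (1.0, 2.0), (3.0, 4.0)
    print(K(x1, x2))              # 121.0
    print(dot(phi(x1), phi(x2)))  # 121.0 -- same value, via the feature space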

Many popular kernels, such as the Gaussian kernel, actually correspond to a transform phi() that transforms into an infinite-dimensional feature space. The kernel trick allows us to compute scalar products in this space without having to represent points in this space explicitly (which, obviously, is impossible on computers with finite amounts of memory).
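
For instance, a Gaussian kernel value is computed entirely in the original data space (a sketch; the bandwidth sigma = 1.0 and the points are arbitrary choices):

    import math

    def gaussian_kernel(x1, x2, sigma=1.0):
        # K(x1, x2) = exp(-||x1 - x2||^2 / (2 * sigma^2)); this equals
        # <phi(x1), phi(x2)> for an infinite-dimensional phi() that is
        # never constructed explicitly.
        sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
        return math.exp(-sq_dist / (2.0 * sigma ** 2))

    print(gaussian_kernel((1.0, 2.0), (3.0, 4.0)))  # ~0.0183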

Martin B
+1, nice explanation.
j_random_hacker
+1 - very nice, indeed.
duffymo
+20  A: 

When people say linear problem with respect to a classification problem, they usually mean linearly separable problem. Linearly separable means that the two classes can be separated by some function that is a linear combination of the input variables. For example, if you have two input variables, x1 and x2, there are some numbers theta1 and theta2 such that the function theta1*x1 + theta2*x2 is sufficient to predict the output. In two dimensions this corresponds to a straight line, in 3D it becomes a plane, and in higher-dimensional spaces it becomes a hyperplane.
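
In code, such a linear decision rule is just a sign check on that weighted sum (a minimal Python sketch; the theta values here are made up, whereas in practice they are learned from data):

    def predict(x1, x2, theta1=1.0, theta2=-2.0):
        # Linear decision rule: the sign of theta1*x1 + theta2*x2 picks the class.
        return 1 if theta1 * x1 + theta2 * x2 > 0 else 2

    print(predict(3.0, 1.0))  # class 1: 1.0*3.0 + (-2.0)*1.0 =  1.0 > 0
    print(predict(1.0, 3.0))  # class 2: 1.0*1.0 + (-2.0)*3.0 = -5.0 <= 0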

You can get some kind of intuition about these concepts by thinking about points and lines in 2D/3D. Here's a very contrived pair of examples...

[2D scatter plot of the red and blue points]

This is a plot of a linearly inseparable problem. There is no straight line that can separate the red and blue points.

[3D scatter plot of the same points with an extra coordinate]

However, if we give each point an extra coordinate (specifically 1 - sqrt(x*x + y*y)... I told you it was contrived), then the problem becomes linearly separable since the red and blue points can be separated by a 2-dimensional plane going through z=0.
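
That extra coordinate can be computed directly from each 2-D point; here's a minimal Python sketch of the lift (the two sample points are made up):

    import math

    def lift(x, y):
        # Add the contrived third coordinate z = 1 - sqrt(x^2 + y^2).
        # Points inside the unit circle get z > 0, points outside get z < 0,
        # so the plane z = 0 separates the two groups in 3-D.
        return (x, y, 1.0 - math.sqrt(x * x + y * y))

    print(lift(0.3, 0.4))  # (0.3, 0.4, 0.5)   -- near the origin, above z = 0
    print(lift(3.0, 4.0))  # (3.0, 4.0, -4.0)  -- far from the origin, below z = 0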

Hopefully, these examples demonstrate part of the idea behind the kernel trick:

Mapping a problem into a space with a larger number of dimensions makes it more likely that the problem will become linearly separable.

The second idea behind the kernel trick (and the reason why it is so tricky) is that it is usually very awkward and computationally expensive to work in a very high-dimensional space. However, if an algorithm only uses the dot products between points (which you can think of as distances), then you only have to work with a matrix of scalars. You can implicitly perform the calculations in the higher-dimensional space without ever actually having to do the mapping or handle the higher-dimensional data.
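
Here's a small Python sketch of that "matrix of scalars" (the points are made up): the algorithm only ever sees the n-by-n table of pairwise dot products, and swapping the dot product for a kernel function swaps in the implicit high-dimensional space without ever forming the mapped points.

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    points = [(1.0, 0.0), (0.0, 2.0), (1.0, 1.0)]  # made-up 2-D data

    # Gram matrix: entry [i][j] is the dot product between point i and point j.
    gram = [[dot(p, q) for q in points] for p in points]
    for row in gram:
        print(row)
    # [1.0, 0.0, 1.0]
    # [0.0, 4.0, 2.0]
    # [1.0, 2.0, 2.0]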

StompChicken
+1, those plots are very intuitive.
j_random_hacker