You can get some info, from a mathematical viewpoint, in this Wikipedia article or on MSDN.
Essentially, when you render a 3D model to the screen, you start with a simple collection of vertices scattered in 3D space. These vertices all have their own positions expressed in "object space". That is, they usually have coordinates which have no meaning in the scene being rendered, but only express the spatial relationships between the vertices of that same model.
For instance, the positions of a model's vertices might only range from -1 to 1 (or similar; it depends on how the model was created).
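Just to make that concrete, here is a minimal sketch (Python/NumPy purely for illustration; the triangle and its coordinates are made up) of what object-space data looks like before any transformation:

```python
import numpy as np

# A hypothetical model: a single triangle defined in object space.
# These coordinates only describe the triangle's own shape; they say
# nothing yet about where it sits in the scene.
object_space_vertices = np.array([
    [-1.0, -1.0, 0.0],
    [ 1.0, -1.0, 0.0],
    [ 0.0,  1.0, 0.0],
])
```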
In order to render the model in the correct position, you have to scale, rotate and translate it to its "real" position in your scene. That position is expressed in "world space" coordinates, which describe where the object actually sits in the scene relative to everything else. To do so, you simply multiply each vertex's position by the model's World matrix. This matrix must be built from the translation/rotation/scale parameters you need to apply, so that the object appears in the correct position in the scene.
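A hedged sketch of building such a World matrix (Python/NumPy, column-vector convention, with made-up scale/rotation/translation values; real engines expose helpers for this):

```python
import numpy as np

def scale(sx, sy, sz):
    return np.diag([sx, sy, sz, 1.0])

def rotation_y(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([
        [  c, 0.0,   s, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [ -s, 0.0,   c, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ])

def translation(tx, ty, tz):
    m = np.eye(4)
    m[:3, 3] = [tx, ty, tz]
    return m

# World matrix: scale first, then rotate, then translate.
# With column vectors the transform applied first is the right-most factor.
world = translation(10.0, 0.0, -5.0) @ rotation_y(np.pi / 4) @ scale(2.0, 2.0, 2.0)

# Transform one object-space vertex (homogeneous coordinates) into world space.
v_object = np.array([1.0, -1.0, 0.0, 1.0])
v_world = world @ v_object
```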
At this point (after multiplying all vertices of all your models by their respective World matrices) your vertices are expressed in world coordinates, but you still cannot render them correctly because their positions are not relative to your "view" (i.e. your camera). So, this time you multiply everything by a View matrix, which reflects the position and orientation of the viewpoint from which you are rendering the scene.
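The View matrix is commonly built with a "look-at" construction from the camera position, a target point and an up vector. Here is one possible version of it (right-handed convention; handedness and layout vary between APIs, so treat this as a sketch):

```python
import numpy as np

def look_at(eye, target, up):
    # Right-handed look-at view matrix: rotate the world so the camera
    # looks down -Z, then translate it so the camera sits at the origin.
    forward = target - eye
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, up)
    right = right / np.linalg.norm(right)
    true_up = np.cross(right, forward)

    view = np.eye(4)
    view[0, :3] = right
    view[1, :3] = true_up
    view[2, :3] = -forward
    view[:3, 3] = -view[:3, :3] @ eye
    return view

# Example camera: 2 units up, 8 units back, looking at the origin.
view = look_at(eye=np.array([0.0, 2.0, 8.0]),
               target=np.array([0.0, 0.0, 0.0]),
               up=np.array([0.0, 1.0, 0.0]))
```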
All vertices are now in the correct position relative to the camera, but in order to simulate perspective you still have to multiply everything by a Projection matrix. This last multiplication determines how vertex positions are scaled based on their distance from the camera, so that distant objects appear smaller.
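One common form of the perspective Projection matrix (OpenGL-style; the exact form differs between graphics APIs, so again this is just a sketch) is built from a vertical field of view, aspect ratio and near/far planes:

```python
import numpy as np

def perspective(fov_y, aspect, near, far):
    # Maps the view frustum into clip space; after the perspective divide,
    # visible coordinates end up in the [-1, 1] cube.
    f = 1.0 / np.tan(fov_y / 2.0)
    m = np.zeros((4, 4))
    m[0, 0] = f / aspect
    m[1, 1] = f
    m[2, 2] = (far + near) / (near - far)
    m[2, 3] = (2.0 * far * near) / (near - far)
    m[3, 2] = -1.0
    return m

projection = perspective(fov_y=np.radians(60.0), aspect=16 / 9, near=0.1, far=100.0)
```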
And now, finally, every vertex has travelled from its position in "object space" to its final position on the screen, where the geometry will be rasterized and then presented.
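Continuing the sketches above (assuming the `world`, `view` and `projection` matrices and the `v_object` vertex defined earlier, and made-up screen dimensions), the whole chain for one vertex looks like this:

```python
# Object space -> world space -> view space -> clip space (column-vector convention).
v_clip = projection @ (view @ (world @ v_object))

# Perspective divide: clip space -> normalized device coordinates.
v_ndc = v_clip[:3] / v_clip[3]

# Viewport transform: NDC x/y in [-1, 1] -> pixel coordinates.
width, height = 1280, 720
x_screen = (v_ndc[0] * 0.5 + 0.5) * width
y_screen = (1.0 - (v_ndc[1] * 0.5 + 0.5)) * height  # flip y: screen origin is top-left
```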