Assuming you're using one of the known fast algorithms, the only way to speed it up is when you are doing a lot of measurements on a lot of triangles. In that case, you can keep a lot of quantities precomputed in "edge" or "winding" structures. Instead of storing the 3 points, you store meshes comprised of edge structures. Projection then becomes very quick and barycentric tests can be coded so that they are branch-predictable.
The real key is to just keep everything in cache. Processors can do MUL and DIV in nearly 1 clock cycle so memory is usually the bottleneck.
Also, consider writing the algo in SSE3 or something similar (such as Mono's SIMD support). It's work, but you can usually do a couple triangles at a time if you think hard enough about it.
I'll try to find some papers on the topic, but you might want to Google for "Ray Mesh Intersection". That will bring up all the great work from the 80s and 90s when people worked hard on optimizing this stuff.