Are there any cases in which anything more than a linear speed increase comes from parallelising an algorithm?
From a theoretical viewpoint, the maximum you can reach is linear speedup. In practice, super-linear speedup is possible. If you can partition your problem in a way that leverages processor caches, e.g. because the whole working set does not fit into the cache of a single core but each per-core partition does, your problem can scale better than linearly.
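A minimal sketch of how such a measurement could look, summing a large array repeatedly with std::thread: the array size, pass count, and thread count below are illustrative assumptions and would need tuning to a real machine's cache sizes before drawing any conclusion.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    const std::size_t n = std::size_t(1) << 23;   // ~8M ints (~32 MiB); assumed too big for one core's cache
    const int passes = 20;                        // repeated passes make cache residency matter
    const unsigned threads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<int> data(n, 1);

    // Sequential baseline: the whole array streams through the cache on every pass.
    auto t0 = std::chrono::steady_clock::now();
    long long seq_sum = 0;
    for (int p = 0; p < passes; ++p)
        seq_sum += std::accumulate(data.begin(), data.end(), 0LL);
    auto t1 = std::chrono::steady_clock::now();

    // Parallel version: each thread repeatedly scans only its own slice, which
    // may stay resident in that core's cache after the first pass.
    std::vector<long long> partial(threads, 0);
    std::vector<std::thread> pool;
    auto t2 = std::chrono::steady_clock::now();
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&, t] {
            const std::size_t begin = n * t / threads, end = n * (t + 1) / threads;
            long long local = 0;
            for (int p = 0; p < passes; ++p)
                local += std::accumulate(data.begin() + begin, data.begin() + end, 0LL);
            partial[t] = local;
        });
    }
    for (auto& th : pool) th.join();
    long long par_sum = std::accumulate(partial.begin(), partial.end(), 0LL);
    auto t3 = std::chrono::steady_clock::now();

    std::chrono::duration<double> seq = t1 - t0, par = t3 - t2;
    std::cout << "threads=" << threads
              << " seq=" << seq.count() << "s par=" << par.count() << "s"
              << " speedup=" << seq.count() / par.count()
              << " (checksums " << seq_sum << " / " << par_sum << ")\n";
}
```

Whether the reported speedup exceeds the thread count depends entirely on the hardware; on many machines this will simply be memory-bandwidth bound and show ordinary sub-linear scaling.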
In theory, no - but in practice this can happen (depending on the underlying hardware and your specific problem). It's not trivial to compare parallel and sequential code: you have to compare the fastest sequential implementation with your parallel implementation, not just your parallel implementation running on a single processor/thread.
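To make that measurement point concrete, speedup should be reported relative to the best sequential implementation, not relative to the parallel code run on one thread:

$$S(p) = \frac{T_{\text{seq,best}}}{T_{\text{par}}(p)} \qquad\text{rather than}\qquad \frac{T_{\text{par}}(1)}{T_{\text{par}}(p)}$$

The second ratio hides the overhead the parallel implementation carries even on one thread and therefore overstates the speedup.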
Still, when someone claims a more-than-linear speed-up I would always be suspicious: either they didn't measure it correctly (see above), they measured an artifact (hardware/OS dependent) and should document it as such, or it only works for a specific combination of problem, implementation and hardware.