Here is more depth on what I meant, though it will be interesting to see whether others agree or what they have to say.
Consider how computers work today. You have hardware with integer and floating-point registers, a vast array of random access memory, and instructions that are mostly of the form 'based on reading the value of this register/memory cell, go poke this new value into this register/cell'. (Updating memory cells has all kinds of perf implications when it comes to cache lines and coherency and memory models and whatnot.) Integers are 32 or 64 bits, and nearly all programming languages surface integer types that exactly match the hardware. Nearly every runtime works with a small call stack, where stack-allocated objects are cheap, and a more expensive 'heap', where objects can be created and destroyed when non-stack-based lifetimes are needed.
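To make that baseline concrete, here is a minimal C sketch of the three costs I mean; the variable names are just for illustration:

```c
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* The 'poke' model: read a cell, write a new value back in
       place. This is the cheap operation the hardware is built for. */
    long cell = 1;
    cell = cell + 41;

    /* Stack allocation: lifetime is tied to this frame; the cost is
       roughly a pointer bump, and it is reclaimed for free on return. */
    long frame_local[4] = {0};

    /* Heap allocation: the expensive path, with allocator bookkeeping
       now and an explicit free (or a garbage collector) later. */
    long *heap_obj = malloc(sizeof *heap_obj);
    *heap_obj = frame_local[0] + cell;
    printf("%ld\n", *heap_obj);
    free(heap_obj);
    return 0;
}
```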
Now consider most modern functional programming languages. Immutability is the norm; you will rarely 'poke' memory with new values. (This means you create more new objects, which means you allocate more.) Lambdas and continuations are the norm; you less often have object lifetimes that correspond to the stack. (Indeed, some FP runtimes don't use a stack at all; in a CPS implementation the notions of a stack and a program counter are out of place.) Recursion is the looping construct, so at a minimum you need 'tail' calls that don't consume the stack. Practically everything needs to "heap" allocate, and of course you need a garbage collector. Algebraic data types provide tagged data; in theory a tag only requires an extra 2 or 3 bits, but to fit the runtime's word-aligned layout it often costs an extra word of memory or more. I am kind of meandering, but the point is: the things you do most often in an FP language correspond exactly to the things that scale worst or are most expensive on the typical computer hardware architecture and basic language runtime.
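The tag-costs-a-word point is easy to see with a toy C layout. This `value` type is hypothetical, not any particular runtime's representation, but it shows how alignment turns a 1-bit tag into a full word:

```c
#include <stdio.h>

/* A hypothetical sum type: either an immediate integer or a pair,
   laid out the straightforward way. The tag distinguishes two
   cases, so it carries 1 bit of information in principle. */
enum tag { TAG_INT, TAG_PAIR };

struct value {
    enum tag tag;              /* 1 bit of info, padded to a word */
    union {
        long i;                /* TAG_INT payload */
        struct value *pair[2]; /* TAG_PAIR payload */
    } u;
};

int main(void) {
    /* On a typical 64-bit ABI this prints 24: one word of tag
       (after alignment padding) next to 16 bytes of payload. */
    printf("sizeof(struct value) = %zu\n", sizeof(struct value));
    return 0;
}
```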
It doesn't have to be that way. One can imagine a world where the runtime eschews a stack and makes heap allocation fast (and not a bottleneck for multi-threaded apps). One can imagine a world where the native integer types are 29 or 60 bits wide, and the runtime/hardware use the leftover bits of the word for, e.g., the GC, or algebraic type tags, or whatnot. (Some FP implementations/runtimes do some of these tricks already; OCaml, for instance, gives you 63-bit ints on 64-bit hardware and spends the spare bit on exactly this kind of tagging.) Etc. etc... The point is, if you take a modern functional language as a given, and then design the runtime/hardware around it, it would look very different from the typical hardware/runtimes of today.
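Here is a sketch of that bit-stealing trick in C, essentially what OCaml's runtime does; the helper names are mine, and it assumes the usual arithmetic right shift for signed values:

```c
#include <stdint.h>
#include <stdio.h>

/* Steal the low bit of a word: an immediate integer is stored
   shifted left with its low bit set, so the GC can tell unboxed
   63-bit ints from word-aligned heap pointers (whose low bit is
   always 0) without any extra header word. */
typedef uintptr_t value;

static inline value    of_int(intptr_t n)   { return ((uintptr_t)n << 1) | 1; }
static inline intptr_t to_int(value v)      { return (intptr_t)v >> 1; }
static inline int      is_immediate(value v){ return (int)(v & 1); }

int main(void) {
    value v = of_int(-42);
    printf("is_immediate=%d payload=%ld\n",
           is_immediate(v), (long)to_int(v));
    return 0;
}
```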
(I don't think I communicated that terrifically, and I'm hand-waving over many details I don't know precisely, but hopefully you grok the gist of my thesis here.)