The EJB specification don't specify how clustering should be achieved, so this will depend on the particular implementation used. Actually, the EJB specifications are on purpose written to not make assumptions about the deployment: they don't mandate any support of clustering, but are written in a way that makes it possible (and a lot of restrictions in the EJB model stems from potential clustering issues, e.g. access to the file system). The implementer is then free to support clustering or not, and still comply with the spec.
In Glassfish, the reference to the remote EJB does the distribution itself. See my answer here for more information. Each request could potentially be dispatched to a different node. That's probably the way most implementations work. So I would say your assumption is correct.
I do hope however that they optimize the case when one EJB calls another EJB and try to dispatch the invacation on the same node whenever possible. That will depend whether the deployment is homogeneous or not (all nodes have the same beans, or not). Again, the spec are a bit vague regarding such points. But I guess that most deployment are homogeneous in practice: the same ear is deployed on all nodes.
Regarding the performance overhead of remote vs. local calls, I did some measures once (on Glassfish). See my answer here. Inter EJB calls in the same .ear through remote interface was 3x slower than local calls. That sounds big, but we are speaking of milliseconds, so the relative overhead depends on what the methods really does. I don't know the performance of other app. server.
Hope it helps.