Yes, your reasoning is basically correct. You would create a thread per core, an io_service instance per thread, and call io_service.run() in each thread.
However, the question is whether you'd really do it that way. These are the problems I see:
You can end up with very busy cores and idling cores depending on how the work is balanced across your connections. Micro-optimising for cache hits on one core can cost you the ability to have an idle core pick up work when the "optimal" core is busy.
At socket speeds (ie: slow), how much of a win will you get from CPU cache hits? If one connection requires enough CPU to keep a core busy and you only have as many connections as cores, then great. Otherwise the inability to move work around to deal with variance in workload might destroy any win you get from cache hits. And if you are doing lots of different work in each thread, the cache isn't going to be that hot anyway.
If you're just doing I/O the cache win might not be that big, regardless. Depends on your actual workload.
My recommendation would be to have one io_service instance and call io_service.run() in a thread per core. If you get inadequate performance or have classes of connections where there is a lot of CPU per connection and you can get cache wins, move those to specific io_service instances.
This is a case where you should do profiling to see how much cache misses are costing you, and where.