In a native system without a higher power (such as a VM) capable of performing garbage collection, you aren't going to do much better, performance- or complexity-wise, than reference counting.
You are correct that reference counting can be tricky: not only do the increment and decrement have to be atomic, but you need to ensure that the object can't be deleted out from under you before you are able to increment it. Thus, if you store the reference counter inside the object, you'll have to somehow avoid the race that occurs between the time you read the pointer to the object out of the cache and the time you manage to increment the reference count.
If your structure is a standard container, which is not already thread-safe, you will also have to protect the container itself from unsupported concurrent access. This protection can dovetail nicely with avoiding the reference-counting race described above: if you use a reader-writer lock to protect the structure, and atomically increment the in-object reference counter while still holding the reader lock, you'll be protected from anyone deleting the object out from under you before you get the reference count, since any such mutator must be a "writer".
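As a rough sketch of that pattern, assuming C++17 (for std::shared_mutex) and hypothetical CachedObject/Cache names, the lookup path might look like this:

```cpp
#include <atomic>
#include <shared_mutex>
#include <string>
#include <unordered_map>

struct CachedObject {
    std::atomic<int> refs{1};   // the cache itself holds one reference
    // ... payload ...
};

class Cache {
    std::shared_mutex mutex_;                             // reader-writer lock
    std::unordered_map<std::string, CachedObject*> map_;
public:
    // Look up an object and take a reference to it, or return nullptr.
    CachedObject* acquire(const std::string& key) {
        std::shared_lock<std::shared_mutex> lock(mutex_); // "reader" side
        auto it = map_.find(key);
        if (it == map_.end()) return nullptr;
        // Safe: eviction requires the writer lock, which cannot be taken
        // while we hold the reader lock, so the object cannot be deleted
        // between the find() and this increment.
        it->second->refs.fetch_add(1, std::memory_order_relaxed);
        return it->second;
    }
    void evict(const std::string& key);   // shown below
};
```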
Here, objects can be evicted from the cache while still having a positive reference count; they will be destroyed only when the last outstanding reference is dropped (by your smart pointer class). This is typically considered a feature, since it means some object can always be removed from the cache, but it also has the downside that there is no strict upper bound on the number of objects "alive" in memory, since reference counting allows objects to stay alive even after they've left the cache. Whether this is acceptable to you depends on your requirements and on details such as how long other threads may hold references to objects.
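Continuing the sketch, release and eviction might then look like this; note that evict() only drops the cache's own reference, so an evicted object survives until its last holder releases it:

```cpp
#include <mutex>   // std::unique_lock

// Drop one reference; whoever drops the last one destroys the object.
// (Your smart pointer class would call this from its destructor.)
void release(CachedObject* obj) {
    if (obj->refs.fetch_sub(1, std::memory_order_acq_rel) == 1)
        delete obj;
}

// Remove the entry from the map under the writer lock, then drop the
// cache's reference outside the lock. Other threads may still hold
// references, so this may or may not actually destroy the object.
void Cache::evict(const std::string& key) {
    CachedObject* victim = nullptr;
    {
        std::unique_lock<std::shared_mutex> lock(mutex_);  // "writer" side
        auto it = map_.find(key);
        if (it == map_.end()) return;
        victim = it->second;
        map_.erase(it);
    }
    release(victim);
}
```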
If you don't have access to atomic increment routines (they were non-standard before C++11's std::atomic), you can use a mutex to do the atomic increment/decrement, although this may increase the cost significantly in both time and per-object space.
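A minimal sketch of that fallback, using a pthreads mutex and a hypothetical RefCounted base; note the extra lock/unlock on every count change, plus a whole mutex stored per object:

```cpp
#include <pthread.h>

struct RefCounted {
    int refs;                  // plain int, protected by the mutex below
    pthread_mutex_t lock;

    RefCounted() : refs(1) { pthread_mutex_init(&lock, NULL); }
    ~RefCounted() { pthread_mutex_destroy(&lock); }

    void add_ref() {
        pthread_mutex_lock(&lock);
        ++refs;
        pthread_mutex_unlock(&lock);
    }

    // Returns true when the caller dropped the last reference and
    // should delete the object.
    bool drop_ref() {
        pthread_mutex_lock(&lock);
        bool last = (--refs == 0);
        pthread_mutex_unlock(&lock);
        return last;
    }
};
```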
If you want to get more exotic (and faster), you'll need to design a container which is itself thread-safe, and come up with a more complex reference-counting mechanism. For example, you may be able to create a hash table whose primary bucket array is never reallocated, so it can be accessed without locking. Furthermore, you can use non-portable double-wide CAS (compare-and-swap) operations on that array to read a pointer and increment a reference count adjacent to it in a single step (128 bits of stuff on a 64-bit arch), allowing you to avoid the race mentioned above.
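A sketch of the double-wide CAS idea, assuming a 64-bit x86 target where 16-byte compare-and-swap (cmpxchg16b) is available; with GCC or Clang you would typically need -mcx16, and std::atomic on a 16-byte struct may silently fall back to a lock elsewhere (check is_lock_free()):

```cpp
#include <atomic>
#include <cstdint>

struct Node;   // your cached object type

// A pointer and its reference count packed into 16 bytes, so a single
// double-wide CAS can read the pointer and bump the count atomically.
struct alignas(16) Slot {
    Node*    ptr;
    uint64_t refs;
};

// Read the pointer out of a bucket and increment the adjacent count in
// one atomic step, closing the read-then-increment race.
Node* acquire(std::atomic<Slot>& bucket) {
    Slot seen = bucket.load(std::memory_order_acquire);
    while (seen.ptr != nullptr) {
        Slot want{seen.ptr, seen.refs + 1};
        if (bucket.compare_exchange_weak(seen, want,
                                         std::memory_order_acq_rel,
                                         std::memory_order_acquire))
            return seen.ptr;
        // CAS failed: 'seen' now holds the current slot value; retry.
    }
    return nullptr;
}
```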
A completely different track would be to implement some kind of "delayed safe delete" strategy, which avoids reference counting entirely. You remove references from your cache, but do not delete objects immediately, since other threads may still hold pointers to them. Then, at some later "safe" time, you delete the object. Of course, the trick is to discover when such a safe time exists. Basic strategies involve each thread signaling when it "enters" and "leaves" a danger zone during which it may access the cache and hold references to contained objects. Once all threads that were in the danger zone when an object was removed from the cache have left the danger zone, you can free the object, certain that no more references are held.
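This is essentially what epoch-based reclamation schemes do. A much-simplified sketch, with hypothetical names, a fixed thread count, a single shared retire list, and no per-thread batching:

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>
#include <vector>

struct Node { /* your cached object */ };
constexpr int kMaxThreads = 64;

std::atomic<uint64_t> global_epoch{1};
std::atomic<uint64_t> thread_epoch[kMaxThreads];   // 0 = outside danger zone

struct Retired { Node* obj; uint64_t epoch; };
std::mutex retired_lock;
std::vector<Retired> retired_list;

// "Enter" the danger zone: announce the epoch we started reading in.
void enter(int tid) {
    thread_epoch[tid].store(global_epoch.load(std::memory_order_acquire),
                            std::memory_order_seq_cst);
}

// "Leave" the danger zone: this thread holds no more cache pointers.
void leave(int tid) {
    thread_epoch[tid].store(0, std::memory_order_release);
}

// Called after removing obj from the cache: defer the delete, tagging
// the object with the epoch at which it became unreachable.
void retire(Node* obj) {
    uint64_t e = global_epoch.fetch_add(1, std::memory_order_acq_rel);
    std::lock_guard<std::mutex> g(retired_lock);
    retired_list.push_back({obj, e});
}

// Free every object retired before all currently-active threads entered:
// no such thread can still hold a pointer to it.
void reclaim() {
    uint64_t min_active = UINT64_MAX;
    for (int i = 0; i < kMaxThreads; ++i) {
        uint64_t e = thread_epoch[i].load(std::memory_order_acquire);
        if (e != 0 && e < min_active) min_active = e;
    }
    std::lock_guard<std::mutex> g(retired_lock);
    size_t kept = 0;
    for (auto& r : retired_list) {
        if (r.epoch < min_active) delete r.obj;   // provably unreachable
        else retired_list[kept++] = r;            // still possibly referenced
    }
    retired_list.resize(kept);
}
```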
How practical this is depends on whether you have logical "enter" and "leave" points in your application (many request-oriented applications will), and whether the "enter" and "leave" costs can be amortized across many cache accesses. The upside is no reference counting! Of course, you still need a thread-safe container.
You can find references to many academic papers on the topic and some practical performance considerations by examining the papers linked here.