There are many different conceptual stores of data in a computing machine, as well as many different actual stores of memory in the various concrete implementations of such a machine.
In the beginning there was tape, and lots of it.
As an example, consider the conceptual Turing Machine, where the storage is the TAPE, the TABLE and the STATE REGISTER.
These are all different forms of storage:
The TAPE is mutable and unbounded.
The TABLE may be mutable (though often isn't) but is bounded.
The STATE REGISTER is mutable and bounded.
The HEAD POSITION is simple, but very important in that it allows free movement, bounded only by the tape itself and the logic within the TABLE.
Further, you can see that in the idealised model the TABLE and the STATE REGISTER are assumed to be 'fast' to access (it takes some constant time to access or modify any entry), whereas the cost of accessing some arbitrary element of the tape depends on the distance from the current location on the tape to wherever the data is.
This allows considerable analysis of various aspects of a computing system and, crucially, models the idea that an unbounded store of data will inevitably exhibit non-linear access costs (since the representation of how to get to the data will itself grow as the data stored grows).
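To make those four stores concrete, here is a minimal C++ sketch of the Turing machine's storage (the states, symbols and rules are invented purely for illustration): the TABLE and STATE REGISTER lookups take constant time, while reaching a distant tape cell costs one move per cell of distance.

    #include <iostream>
    #include <map>
    #include <utility>

    int main() {
        // TAPE: mutable and unbounded; a map lets a cell exist at any index.
        std::map<long long, char> tape;
        // HEAD POSITION: free to move, bounded only by the tape and the TABLE's logic.
        long long head = 0;
        // STATE REGISTER: mutable but bounded (one of a fixed set of states).
        int state = 0;
        // TABLE: (state, symbol) -> (next state, symbol to write, head move).
        // Bounded, and usually fixed once the machine is built.
        struct Action { int next_state; char write; int move; };
        std::map<std::pair<int, char>, Action> table = {
            {{0, '_'}, {0, '1', +1}},   // invented rule: write 1, move right
            {{0, '1'}, {1, '1', -1}},   // invented rule: change state, move left
        };

        // One step: TABLE and STATE REGISTER access is constant time...
        char symbol = tape.count(head) ? tape[head] : '_';
        Action a = table.at({state, symbol});
        tape[head] = a.write;
        state = a.next_state;
        head += a.move;

        // ...whereas reaching tape cell N costs |N - head| single-cell moves.
        long long target = 1000;
        std::cout << "moves needed to reach cell " << target << ": "
                  << (target > head ? target - head : head - target) << "\n";
    }

The map-backed tape is only a convenience of the sketch; the point is that nothing in the model promises constant-time access to an arbitrary tape cell.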
A Turing machine's model of memory was not that far removed from actual physical machines in the early era of mainframe computers, with most permanent storage based on (admittedly finite) tapes and the STATE REGISTER held in a small, simple pool of 'memory' in the machine with an equal cost to access any entry. Many of the first computers had no 'programs' in the sense you or I think of. Instead they could have their TABLE altered (often literally by hand), after which all further behaviour was fixed. This was the separation of DATA from INSTRUCTIONS.
As modern computers have changed, this link with one of the first formal descriptions of a computing device has become more and more distant.
Random considered not harmful
Most modern machines are based on a Von Neumann Architecture. These explicitly place instructions and data within the same 'pool' of addressable memory. Their initial TABLE is simply sufficient to start reading instructions from memory, and those instructions can decide, based on data, which instructions to use in future.
Many differing ways were designed to maintain the STATE: some kept it all in the main memory pool; others created specific regions of memory, either as arbitrarily addressable named 'registers' separate from main memory, or as 'stacks', often implemented by 'taking' a section of main memory.
These are not exclusive: many machines have the concept of both a stack and registers, some providing separate memory for the stack and others placing it within main memory. Several machines mandated (or encouraged through convention) that certain registers refer to the conceptual stack; a common name for such a register in many architectures is the Stack Pointer (SP).
Almost all had the concept of a particular storage location (typically a register) which indicated the next instruction to execute. Alteration of this value (beyond the basic 'add one to the last value') provided the primary means for the computer to be general purpose. In this way the Instruction Pointer (or IP, as this special location came to be known) represents something similar to the position of the HEAD in the Turing machine, except that it tracks the instructions rather than the data. By moving all data to a randomly addressable form, the concept of the tape for data was removed from 'memory'.
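As a sketch of that idea, here is a toy machine in C++ (with an invented three-instruction set, chosen only for illustration): code and data sit in one addressable pool, the IP is the only 'special' location, and a jump that rewrites the IP is what makes the machine general purpose.

    #include <iostream>
    #include <vector>

    int main() {
        // Invented opcodes for a toy machine; instructions and data share
        // one pool of addressable memory, in the von Neumann style.
        enum Op { DEC = 0, JNZ = 1, HALT = 2 };

        std::vector<int> mem = {
            DEC, 7,      // addresses 0-1: mem[7] -= 1
            JNZ, 7, 0,   // addresses 2-4: if mem[7] != 0, jump to address 0
            HALT,        // address  5
            0,           // address  6: unused padding
            3,           // address  7: DATA, a counter, in the same pool as the code
        };

        // The Instruction Pointer: a single special location, analogous to
        // the Turing machine's HEAD but tracking instructions, not data.
        int ip = 0;
        while (true) {
            int op = mem[ip];
            if (op == DEC)      { mem[mem[ip + 1]] -= 1; ip += 2; }
            else if (op == JNZ) { ip = (mem[mem[ip + 1]] != 0) ? mem[ip + 2] : ip + 3; }
            else /* HALT */     break;
        }
        std::cout << "counter ended at " << mem[7] << "\n";  // prints 0
    }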
That which was once the primary memory of a computer was instead relegated to the role of an IO device: storage, but no longer what most people would term Memory in the sense in which we use it now.
In early computers of this kind the registers and the RAM normally operated at the same 'speed'; though there was a cost to moving data from RAM into a register and back again, this was a simple instruction like any other.
As the speed with which the logic units could operate increased, the speed with which (larger and larger) RAM could be accessed fell in relative terms. Part of this was that, from a cost perspective, it quickly became easier to separate the logic units from the memory units.
As such, registers became 'special' storage 'close' to the logic units, typically operating at the same speed as them, while main memory became 'further away' from the logic elements and began to cost more cycles to access.
Messages in a Bottleneck
Code came to be designed to execute in a manner which reduced the number of times that data had to be copied into a register to be operated on, then copied back to main memory. Computer designers realised that, in general, a particular 'style' was in fact quite common in many programs: a smaller portion of main memory was accessed far more frequently than other parts.
The ability of a compiler or human to use registers effectively to manage this was limited. Trying to arrange the computation to make effective use of this faster storage involved complex dependency determination. Adding registers was not a simple solution either: since each was a 'special name', adding more involved recoding every application and forced a rewrite for each 'upgrade'.
It was realised that it was possible to put some of the main memory 'closer' to the logic: perhaps not as close as registers, but much closer than main memory. Since it was neither cost effective (nor, in many cases, possible) to put all such memory close, this would need to be a 'window' onto the main memory. Early designers realised that placing this window under the control of the programmer would lead to many of the same problems associated with the fixed number of registers. They therefore attempted to use this closer memory in a manner which was transparent to the programmer and the program, except in so far as it affected performance.
Cache memory was born, and thus temporal locality made for faster computers with no code changes.
Various forms were tried (and many co-exist even now), such as exclusive and inclusive caches (now something could be in two places at once). It was also observed that, if memory access tended to have temporal locality, it also tended towards spatial locality; as such, data could be speculatively placed in the cache before you had even requested it.
Some split the cache between data and instructions, others did not, but almost all computers moved to having a cache.
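Spatial locality is easy to see even on a modern machine. The following C++ sketch (sizes are illustrative, not tuned to any particular hardware) sums the same grid twice: the row-order walk touches consecutive addresses, so the speculatively fetched cache lines are fully used, while the column-order walk strides across them; on most machines the first loop is markedly faster even though both do identical work.

    #include <chrono>
    #include <iostream>
    #include <vector>

    int main() {
        const int N = 4096;
        std::vector<int> grid(N * N, 1);  // one flat block of 'main memory'

        auto time_sum = [&](bool row_major) {
            auto start = std::chrono::steady_clock::now();
            long long sum = 0;
            for (int i = 0; i < N; ++i)
                for (int j = 0; j < N; ++j)
                    // Row-major order walks consecutive addresses, so each
                    // speculatively fetched cache line is fully used; the
                    // column-major walk strides N ints per access instead.
                    sum += row_major ? grid[i * N + j] : grid[j * N + i];
            auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                          std::chrono::steady_clock::now() - start).count();
            std::cout << (row_major ? "row" : "column") << "-order: "
                      << ms << " ms (sum " << sum << ")\n";
        };

        time_sum(true);   // friendly to spatial locality
        time_sum(false);  // hostile to it: same work, far more cache misses
    }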
It's turtles all the way down
It wasn't long before memory became so much slower than the cache that it was worth putting a slower but bigger cache between main memory and the little cache. Level 1 and Level 2 caches were created, and even more besides.
Bigger better faster stronger
As processors got faster it became clear that a lot of code was in fact capable of executing at the same time, either through pipelining or through independent parallel execution. Running multiple logic units 'hidden' behind the simple view seen by the programmer and compiler required more 'hidden' registers to account for the multiple in-flight operations. Register renaming was designed to allow multiple copies of the same 'register' to be in use at once in different execution units, hiding changes in one from another.
'Where' a piece of state was stored became even more complex, though thankfully still hidden from most programmers.
At the same time this was happening, we began to experience the problems inherent in making one CPU faster, largely down to the very bottleneck the cache was trying to mitigate. As such, transistor budget began to be spent on other, entirely independent sections of logic, and it was hoped that programmers would start to write code expecting multiple units rather than one.
This led to some problems. Previously, all the 'hidden' ways in which memory latency was mitigated never made a difference to the programmer. Now, however, the 'shared main memory' model of this new SMP world order worked against you: an 'address' in memory could be in many places at once, and indeed it could have several different values in those places.
Now it became important not only where memory was but how stable it was: whether a piece of memory was in a cache or not mattered, and programming models had to change. At this point the architecture's 'Memory Model' became very important (and highly complex).
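To see why the Memory Model matters to a programmer, consider this small C++ sketch. With plain non-atomic variables, the writer's two stores could become visible to the reader in either order, or be read from a stale cached copy, and C++ would call it a data race; std::atomic with release/acquire ordering states the required stability explicitly.

    #include <atomic>
    #include <iostream>
    #include <thread>

    // Message-passing between two cores. With plain ints this would be a
    // data race (undefined behaviour in C++): each core might be looking
    // at its own cached copy of the 'same' address, holding a different value.
    std::atomic<int>  payload{0};
    std::atomic<bool> ready{false};

    void writer() {
        payload.store(42, std::memory_order_relaxed);
        // release: everything written above becomes visible to any thread
        // that later observes ready == true with an acquire load.
        ready.store(true, std::memory_order_release);
    }

    void reader() {
        while (!ready.load(std::memory_order_acquire)) { /* spin */ }
        std::cout << "payload = " << payload.load(std::memory_order_relaxed) << "\n";
    }

    int main() {
        std::thread t1(writer), t2(reader);
        t1.join();
        t2.join();  // prints "payload = 42", never 0
    }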
Divide and Conquer
Since this complex interaction of 'hidden', opaque memory mangling did not sit well with the new model of multiple separate units, various possible solutions began to present themselves.
One is obvious: if no caching is required on a value, request that it not happen, simplifying cache coherency protocols.
Another is less obvious (and its suitability is still unclear): stop having one big shared pool of memory and instead give each unit its own chunk of memory, with explicit interaction between the units and the old Main Memory.
This splitting of memory becomes even more complex as devices previously not considered in the same class as general-purpose central processing units are bent into this model.
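A minimal sketch of that second approach, in plain C++ (the Unit type and its copy_in/copy_out operations are invented to model the idea): each unit computes only on its own private chunk, and all interaction with main memory is an explicit copy, so nothing is implicitly shared and no coherency protocol is needed between the units. This mirrors the explicit transfers that GPU-style devices demand.

    #include <algorithm>
    #include <iostream>
    #include <vector>

    // Each unit owns a private chunk; all interaction with 'main memory'
    // is an explicit copy, never a transparent shared-cache access.
    struct Unit {
        std::vector<int> local;  // the unit's own memory
        void copy_in(const std::vector<int>& main_mem, size_t off, size_t n) {
            local.assign(main_mem.begin() + off, main_mem.begin() + off + n);
        }
        void compute() {         // works purely on local memory
            for (int& x : local) x *= 2;
        }
        void copy_out(std::vector<int>& main_mem, size_t off) const {
            std::copy(local.begin(), local.end(), main_mem.begin() + off);
        }
    };

    int main() {
        std::vector<int> main_mem(8, 3);  // the old shared pool
        Unit a, b;
        a.copy_in(main_mem, 0, 4);        // explicit transfer in
        b.copy_in(main_mem, 4, 4);
        a.compute();                      // no coherency protocol needed:
        b.compute();                      // nothing is implicitly shared
        a.copy_out(main_mem, 0);          // explicit transfer back out
        b.copy_out(main_mem, 4);
        for (int x : main_mem) std::cout << x << ' ';  // prints 6 eight times
        std::cout << "\n";
    }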
Don't forget the future
This is of course just the beginning of more changes; in future you may find memory being described with even more complexity, as qubits or as something else entirely unknown.