Here's how I prefer to think about it. Consider the implementation of a variable containing a 32-bit integer. When the integer is treated as a value type, the entire value fits into 32 bits of storage. That's what a value type is: the storage contains just the bits that make up the value, nothing more, nothing less.
Now consider the implementation of a variable containing an object reference. The variable contains a "reference", which could be implemented in any number of ways. It could be a handle into a garbage collector structure, or it could be an address on the managed heap, or whatever. But it's something which allows you to find an object. That's what a reference type is: the storage associated with a variable of reference type contains some bits that allow you to reference an object.
Clearly those two things are completely different.
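To make the difference concrete, here is a minimal sketch in Java, used here only as an assumption because it has the same split between an `int` stored directly and an object found through a reference. The names `StorageSketch` and `Cell` are hypothetical and exist only for this illustration.

```java
// Illustrative sketch; StorageSketch and Cell are hypothetical names.
final class StorageSketch {
    // A tiny object type, so that we have something to hold a reference to.
    static final class Cell {
        int value;
        Cell(int value) { this.value = value; }
    }

    // Copying an int copies the bits themselves, so the copy is independent.
    static int copyValueThenChange(int original) {
        int copy = original;
        copy = copy + 1;      // changes only the copy's storage
        return original;      // the original bits are untouched
    }

    // Copying a reference copies only the "way to find the object".
    static int copyReferenceThenChange(Cell original) {
        Cell alias = original;          // both variables now find the same object
        alias.value = alias.value + 1;  // changes the one shared object
        return original.value;          // the change is visible through the original
    }

    public static void main(String[] args) {
        System.out.println(copyValueThenChange(10));               // prints 10
        System.out.println(copyReferenceThenChange(new Cell(10))); // prints 11
    }
}
```

The variable of type `int` holds the value; the variable of type `Cell` holds only something that lets you find the value.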
Now suppose you have a variable of type object, and you wish to copy the contents of a variable of type int into it. How do you do it? The 32 bits that make up an integer aren't one of these "reference" things; they're just the 32 bits of the value itself. References could be 64-bit pointers into the managed heap, or 32-bit handles into a garbage collector data structure, or any other implementation you can think of, but a 32-bit integer can only be a 32-bit integer.
So what you do in that scenario is you box the integer: you make a new object that contains storage for an integer, and then you store a reference to the new object.
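A rough sketch of that step, written in Java as an illustration rather than as how any particular runtime implements it. `BoxingSketch`, `IntBox`, `box`, `unbox`, and `autobox` are hypothetical names; the only built-in machinery used is Java's own autoboxing of `int` to `java.lang.Integer`, which plays the same role.

```java
// Illustrative sketch; every name declared in this class is hypothetical.
final class BoxingSketch {
    // A hand-written box: an object whose only job is to hold storage for one int.
    static final class IntBox {
        final int value;
        IntBox(int value) { this.value = value; }
    }

    // Boxing by hand: make a new object that holds the int, and return a reference to it.
    static Object box(int i) {
        return new IntBox(i);
    }

    // Unboxing by hand: follow the reference and copy the int back out.
    static int unbox(Object o) {
        return ((IntBox) o).value;
    }

    // The same idea done by the language: assigning an int to an Object
    // variable makes the compiler box it as a java.lang.Integer.
    static Object autobox(int i) {
        Object o = i;
        return o;
    }

    public static void main(String[] args) {
        Object boxed = box(42);
        System.out.println(unbox(boxed));            // prints 42

        Object auto = autobox(42);
        System.out.println(auto instanceof Integer); // prints true
        System.out.println((Integer) auto);          // prints 42
    }
}
```

In both cases the variable of type `Object` never holds the 32 bits directly; it holds a reference to an object that does.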
Boxing is only necessary if you want to (1) have a unified type system, and (2) ensure that a 32-bit integer consumes 32 bits of memory. If you're willing to give up either of those, then you don't need boxing; we are not willing to give up either, and so boxing is what we're forced to live with.