Sound is fundamentally a pressure wave, made up of "peaks", which are regions of higher pressure, and "troughs", which are regions of lower pressure. A microphone responds to the incident pressure wave by taking advantage of some physical material property to measure that pressure over time. Making a recording is the process of sampling those measurements and transcribing them to some medium.
One reasonable representation is to call the ambient pressure zero, with higher pressures positive and lower pressures negative. Another is to take ambient pressure as half-scale, with lower pressures below half and higher pressures above it. Other representations are possible, and the relationship between the incident pressure and the measured value isn't even required to be linear.
Whether a signed or unsigned representation is used is only a matter of history and convention. 16-bit audio is usually represented as signed but 8-bit audio is usually not, for instance.
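To make the two conventions concrete, here is a minimal Python sketch that converts between an unsigned 8-bit sample (silence at half-scale, 128) and a signed 16-bit sample (silence at 0). The function names `u8_to_s16` and `s16_to_u8` are made up for this illustration and don't come from any particular library.

```python
def u8_to_s16(sample: int) -> int:
    """Map an unsigned 8-bit sample (silence at 128) to signed 16-bit (silence at 0)."""
    if not 0 <= sample <= 255:
        raise ValueError("expected an unsigned 8-bit value")
    # Remove the half-scale offset, then scale up by 256 to fill 16 bits.
    return (sample - 128) << 8


def s16_to_u8(sample: int) -> int:
    """Map a signed 16-bit sample (silence at 0) to unsigned 8-bit (silence at 128)."""
    if not -32768 <= sample <= 32767:
        raise ValueError("expected a signed 16-bit value")
    # Drop the low 8 bits (this loses precision), then add the half-scale offset.
    return (sample >> 8) + 128


if __name__ == "__main__":
    for value in (0, 128, 255):
        print(value, "->", u8_to_s16(value))
```

Both forms carry the same information about the wave; only the position of "silence" on the number line differs. Going from 16 bits down to 8 does discard precision, but that is a matter of bit depth, not of signedness.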
Historically, the telephone system has used 8-bit codes produced by a non-linear function called either A-law or µ-law. At the same bit rate, the non-linear representation supports greater dynamic range than a linear one.
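As a rough illustration of why the non-linear encoding helps, the sketch below uses the continuous µ-law companding formula, F(x) = sgn(x) · ln(1 + µ|x|) / ln(1 + µ) with µ = 255, on samples normalised to the range -1 to 1. This is the textbook curve only, not the segmented bit layout that telephone codecs actually put on the wire, and the helper names (`mu_compress`, `mu_expand`, `quantize`) and the choice of 127 steps per polarity are assumptions made for the example.

```python
import math

MU = 255.0  # companding parameter used with the mu-law curve


def mu_compress(x: float) -> float:
    """Continuous mu-law curve: maps [-1, 1] to [-1, 1], stretching small values."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mu_expand(y: float) -> float:
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def quantize(v: float, levels: int = 127) -> float:
    """Round v to the nearest of `levels` uniform steps per polarity (hypothetical helper)."""
    return round(v * levels) / levels


if __name__ == "__main__":
    quiet = 0.001  # a very quiet sample, well below one linear 8-bit step
    linear = quantize(quiet)
    companded = mu_expand(quantize(mu_compress(quiet)))
    print("linear error:   ", abs(linear - quiet))
    print("companded error:", abs(companded - quiet))
```

With the same number of steps, the linear quantiser rounds the quiet sample to zero, while the companded path keeps a close approximation of it. The price is coarser steps near full scale, which is usually an acceptable trade for speech.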