As a visualizer plays a song file, it reads the audio data in very short time slices (usually less than 20 milliseconds). It runs a Fourier transform on each slice to extract the frequency components, then updates the visual display using that frequency information.
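To make the slice-then-transform step concrete, here is a minimal sketch in plain Python. The sample rate, slice size, number of bands, and the names `fft` and `band_levels` are all assumptions for illustration, not taken from any particular visualizer. A real one would use an optimized FFT library and would probably space its bands logarithmically rather than evenly.

```python
import cmath
import math

SAMPLE_RATE = 44100   # assumed sample rate in Hz
SLICE_SIZE = 512      # assumed slice length: about 11.6 ms at 44.1 kHz, a power of two


def fft(samples):
    """Recursive radix-2 Cooley-Tukey FFT; len(samples) must be a power of two."""
    n = len(samples)
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def band_levels(samples, num_bands=8):
    """Return one average magnitude per frequency band for a single slice."""
    n = len(samples)
    # Hann window to reduce spectral leakage at the slice edges.
    windowed = [s * 0.5 * (1 - math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(samples)]
    spectrum = fft(windowed)
    # For real-valued input only the first half of the spectrum is unique.
    magnitudes = [abs(c) for c in spectrum[: n // 2]]
    per_band = len(magnitudes) // num_bands
    return [sum(magnitudes[b * per_band:(b + 1) * per_band]) / per_band
            for b in range(num_bands)]


# Hypothetical input: a pure 1 kHz tone standing in for one slice of a song.
tone = [math.sin(2 * math.pi * 1000 * i / SAMPLE_RATE) for i in range(SLICE_SIZE)]
levels = band_levels(tone)
loudest_band = levels.index(max(levels))  # the tone falls in the lowest band
```

Each value in `levels` could then drive the height of one bar in a spectrum display, recomputed for every new slice.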
How the visual display is updated in response to the frequency info is up to the programmer. Generally, the graphics methods have to be extremely fast and lightweight in order to update the visuals in time with the music (and not bog down the PC). Early visualizers often modified the Windows color palette directly to achieve some pretty cool effects, and some still do.
One characteristic of frequency-component-based visualizers is that they often don't seem to respond very well to the "beats" of the music (percussion hits, for example). More interesting and responsive visualizers can be written by combining the frequency-domain information with an awareness of "spikes" in the audio, which often correspond to percussion hits.
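One simple way to notice such "spikes" is to compare each slice's energy against the average energy of the slices just before it. The sketch below assumes that approach; the function name `detect_spikes` and the `history_len` and `sensitivity` values are made up for illustration and would need tuning against real music.

```python
from collections import deque


def detect_spikes(slices, history_len=43, sensitivity=1.5):
    """Return indices of slices whose energy jumps above the recent average.

    history_len and sensitivity are assumed tuning values, not standard ones.
    """
    history = deque(maxlen=history_len)  # energies of the most recent slices
    spikes = []
    for index, samples in enumerate(slices):
        # Mean squared amplitude serves as the slice's energy.
        energy = sum(s * s for s in samples) / len(samples)
        # Only start judging once a full window of history is available.
        if len(history) == history_len:
            average = sum(history) / history_len
            if energy > sensitivity * average:
                spikes.append(index)
        history.append(energy)
    return spikes


# Hypothetical input: quiet slices with one loud "drum hit" at index 10.
quiet = [0.1, -0.1] * 256
loud = [0.9, -0.9] * 256
slices = [quiet] * 10 + [loud] + [quiet] * 5
spikes = detect_spikes(slices, history_len=8)
```

A visualizer could trigger a flash or pulse on each index in `spikes` while still drawing the frequency bands as usual.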