I think the answer depends on the source of the longer audio stream. If the longer stream contains an exact, bit-for-bit copy of the shorter one (for example, if it was created in an audio editor with access to the original), then you have a simple string search problem, and many algorithms exist for that, such as Boyer-Moore.
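For the exact-copy case, here is a minimal sketch, assuming both streams are already available as raw, uncompressed PCM bytes in the same format. The function name `find_clip` and the `frame_size` parameter are my own, for illustration only. It relies on Python's built-in `bytes.find` for the substring search (I'm not claiming that is Boyer-Moore specifically, it just fills the same role) and rejects hits that don't start on a sample-frame boundary.

```python
def find_clip(haystack: bytes, needle: bytes, frame_size: int = 2) -> int:
    """Return the frame index where `needle` first occurs in `haystack`, or -1.

    Assumption: both arguments are raw PCM bytes in the same format, and
    `frame_size` is the number of bytes per sample frame
    (e.g. 2 for 16-bit mono, 4 for 16-bit stereo).
    """
    start = 0
    while True:
        pos = haystack.find(needle, start)
        if pos == -1:
            return -1                      # no (further) match
        if pos % frame_size == 0:
            return pos // frame_size       # match aligned to a frame boundary
        start = pos + 1                    # misaligned hit, keep searching
```

The alignment check matters because a byte-level match that starts in the middle of a sample is a coincidence, not a real copy of the audio.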
If, however, the original was decoded and re-encoded (e.g., you are testing to see if some guy used part of your band's MP3 in his YouTube video), then you've got a much more difficult problem.
I'd probably try to solve it in the frequency domain: generate a 'signature' of file 1 based on a sequence of smallish FFT windows, then do a best-fit against FFTs from file 2. I have no idea how well this would work.
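To make the idea concrete, here is a rough sketch under some loud assumptions: both files are already decoded to lists of mono samples at the same sample rate, the 'signature' is just the dominant frequency bin of each non-overlapping window, and the clip happens to start on a window boundary of the longer stream. All names (`fft`, `signature`, `best_fit`) and the window size are mine, and this is untested on real audio. A real attempt would want overlapping windows and a richer signature than a single peak.

```python
import cmath

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def signature(samples, window=64):
    """Dominant frequency bin of each non-overlapping window (assumed scheme)."""
    sig = []
    for start in range(0, len(samples) - window + 1, window):
        spectrum = fft(samples[start:start + window])
        # Look only at bins 1 .. window/2 - 1: skip DC and the mirrored half.
        mags = [abs(c) for c in spectrum[1:window // 2]]
        sig.append(1 + mags.index(max(mags)))
    return sig

def best_fit(sig_short, sig_long):
    """Slide sig_short over sig_long; return (offset_in_windows, score).

    score is the fraction of windows whose dominant bin agrees, so 1.0 is
    a perfect match. Returns (-1, 0.0) if nothing matches at all.
    """
    best = (-1, 0.0)
    if not sig_short:
        return best
    for off in range(len(sig_long) - len(sig_short) + 1):
        hits = sum(a == b for a, b in zip(sig_short, sig_long[off:]))
        score = hits / len(sig_short)
        if score > best[1]:
            best = (off, score)
    return best
```

Using the peak bin rather than raw magnitudes makes the comparison insensitive to overall volume changes, which is one of the things re-encoding tends to alter. You'd then pick a score threshold for "probably a match" by experiment.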