What you're talking about is what's known in the embedded world as a "bare-metal" application. They're very common for things like a ARM Cortex-M3 that goes in (say) a debit-card validator box or an interactive toy, and doesn't have enough memory or capability to run a full operating system. So, instead of getting an "ARM/Linux" compiler that would compile an application to run on Linux on an ARM processor, you get an "ARM bare-metal" compiler that compiles things to run on an ARM processor without an operating system. (I'm using ARM rather than x86 as an example, because x86 bare-metal applications are really quite rare these days.)
As stated in your question and the other answers, your application will need to do some things that would otherwise be taken care of by the operating system.
First, it needs to initialize the memory system, the interrupt vectors, and various other bits of board goo. Typically this is something that a bare-metal compiler will do for you, though if you have a weird board, you may need to tell it how to do that. This gets things from the point where the board turns on to the point where your main() function starts.
Then, you need to interact with things outside the CPU and RAM. An operating system includes all sorts of functions for doing this -- disk I/O, screen output, keyboard and mouse input, networking, etc., so forth, and so on. Without an operating system, you have to get that from somewhere else. You may get some of that from libraries from your hardware manufacturer; for instance, a board I was recently playing with has a 40x200-pixel LED screen, and it came with a library with the code to turn that on and set individual pixel values on it. And there are several companies selling libraries to implement a TCP/IP stack and things like that, for doing networking or whatnot.
Consider, for example, that this makes it difficult to do even a basic printf. When you have an operating system, printf just sends a message to the operating system that says "put this string on the console", and the operating system finds the current cursor position on the console, and does all the stuff to figure out what pixels to change on the screen, and what CPU instructions to use to change those pixels, in order to do that.
Oh, and did we mention that you first have to figure out how to get the program into the CPU? A typical computer has a bit of programmable ROM that it will load instructions from when it starts up. On an x86, this is the BIOS, and it usually already contains a handy program that gets the CPU started, sets up the display, looks for disks, and loads a program off the disk that it finds. On an embedded system, that's typically where your program goes -- which means you need some way to put your program there. Often, that means you have a device called a "debugger" that's physically attached to your embedded board that loads the program -- and can also do things that allow you to pause the processor and determine what its state is, so that you can step through your program just as if you were running it in a software debugger on your computer. But I digress.
Anyway, to answer your second question, this executable that you'd create is something that gets stored in that ROM on your embedded board -- or perhaps you'd just store a bit of it in ROM (which is, after all, pretty small) and store the rest on a flash drive, and the bit in ROM would include the instructions to get the rest of it off the flash drive. It would probably be stored as a file on your main computer (that is, the Linux or Windows computer where you're creating it), but that's just for storage, it wouldn't run there.
You'll notice that when you've got a lot of these libraries together, they're doing a fair bit of what an operating system does, and there's sort of this space between the pile of libraries and a real operating system. In that space goes what's called an RTOS -- "real-time operating system". The smaller ones of these are really just collections of libraries that work together to do all the operating-systemy things, and sometimes also include stuff so you can run multiple threads at once (and then you can have different threads act like different programs) -- though all of this is all compiled into the same compiled "program", and the RTOS is really nothing more than a library you've included. Larger ones start storing parts of the code in separate places, and I think some of them can even load pieces of code off of disks -- just like Windows and Linux do when running a program. It's sort of a continuum, rather than an either/or.
The FreeRTOS system is an open-source RTOS that's towards the smaller end of the RTOS space; they might be a good place to look at some of this if you're more interested. They do have some examples of x86 applications, which would give you an idea of what sort of x86 systems would run a bare-metal or RTOS-based program and how you'd compile something to run on one; link here: http://www.freertos.org/a00090.html#186.