A CUDA thread is very lightweight and can be scheduled or stalled with very little penalty. This is unlike a CPU thread, which incurs significant overhead when switched in and out of execution. As a result, CPUs are great for task parallelism while GPUs excel at data parallelism.
In the CUDA architecture, an (NVIDIA) GPU has "Streaming Multiprocessors" (SMs), each of which executes a block of threads. Each SM has a set of Stream Processors (SPs), each of which executes instructions for one thread at any given moment (cycle).
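As a minimal sketch of how this maps to code: each thread combines its block and thread coordinates to get a unique global index, and the hardware assigns whole blocks to SMs. The kernel and array names here are hypothetical, purely for illustration:

```cuda
__global__ void scale(float *data, float factor, int n)
{
    // Each thread computes its own global index from the block it
    // belongs to (blockIdx) and its position within that block
    // (threadIdx). A given block runs on a single SM.
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Guard against the last block running past the end of the array.
    if (i < n)
        data[i] *= factor;
}
```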
Actually, the minimum number of threads inside a block is one. If you have just one thread per block, your code will execute correctly. However, it is far more efficient to set up a block such that it has a multiple of 32 threads. This is due to the way the hardware schedules operations across a "warp", which is a group of 32 threads.
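A common way to set this up is to pick a block size that is a multiple of the 32-thread warp width and then round the grid size up so that every element is covered. This is a sketch with hypothetical names (`myKernel`, `d_data`) and an arbitrary problem size:

```cuda
int n = 10000;                // hypothetical problem size

// 256 = 8 warps per block; any multiple of 32 keeps warps fully populated.
int threadsPerBlock = 256;

// Round up so that n elements are covered even when n is not a
// multiple of the block size (the kernel's bounds check handles the
// surplus threads in the final block).
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;

myKernel<<<blocksPerGrid, threadsPerBlock>>>(d_data, n);
```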
You can cross-compile your program. You could run it in emulation mode, i.e. the CPU "emulating" a CUDA GPU, but to run on hardware you would need an NVIDIA GPU (any CUDA-enabled card, meaning anything recent, post-2006 or so, will do).
A high-end current-generation GPU has 240 cores (SPs). You could consider this as executing 240 threads at any given moment, but it is more useful to think of the GPU as executing thousands of threads simultaneously, since the state (context) for many threads is loaded at once.
I think it is important to recognise that there are differences between CPU threads and GPU threads. They share the same name, but a GPU thread is lightweight and typically operates on a small subset of the data. Maybe it will help to think of a (set of) CPU thread(s) doing the non-parallel work, with each CPU thread then forking into thousands of GPU threads for the data-parallel work, which later join back to the CPU thread. Clearly, if you can get the CPU thread to do useful work at the same time as the GPU, that is even better.
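That fork/join picture, including the CPU working concurrently with the GPU, can be sketched as follows. Kernel launches are asynchronous with respect to the host, so the CPU is free until it explicitly synchronizes; the kernel, argument, and helper-function names are hypothetical:

```cuda
// "Fork": launch thousands of GPU threads for the data-parallel work.
// The launch returns immediately; the GPU runs in the background.
myKernel<<<blocksPerGrid, threadsPerBlock>>>(d_data, n);

// The CPU thread does independent (non-parallel) work concurrently.
doIndependentCpuWork();

// "Join": block the CPU thread until the GPU has finished.
cudaDeviceSynchronize();
```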
Remember that, unlike a CPU, a GPU is a throughput architecture. This means that instead of relying on large caches to hide memory latency, the program should create many threads, so that while some threads are waiting for data to return from memory, others can be executing. I'd recommend watching the "Advanced C for CUDA" talk from the GPU Technology Conference for more information.