I don't think this can be done without using at least one kernel-level object (Mutex or Semaphore), because you need the help of the kernel to make the calling process block until the lock is available.
Critical sections do provide blocking, but the API is too limited. e.g. you cannot grab a CS, discover that a read lock is available but not a write lock, and wait for the other process to finish reading (because if the other process has the critical section it will block other readers which is wrong, and if it doesn't then your process will not block but spin, burning CPU cycles.)
However what you can do is use a spin lock and fall back to a mutex whenever there is contention. The critical section is itself implemented this way. I would take an existing critical section implementation and replace the PID field with separate reader & writer counts.