First, note that if you extend C to infinite memory and consider running UNIX on a Turing machine, then an NP machine is one that is allowed to use the UNIX fork instruction to produce two independent processes, each with its own duplicated copy of memory, at no time cost, and the program terminates as soon as any one of the processes halts.
That this is true is easy to prove. Given any nondeterministic automaton, fork at each step according to the number of possible outcomes. When any fork halts, kill all the other processes. This simulates a nondeterministic machine with "fork". To go the other way, simulate UNIX on your nondeterministic machine, and take a nondeterministic step at each "fork". The two concepts are equivalent.
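To make the fork construction concrete, here is a minimal sketch in Python. It does not call the real fork; it keeps an explicit list of "processes" and advances them in lockstep, so it pays the exponential cost that the idealized machine gets for free. The choice of subset-sum as the problem, and the name nondeterministic_subset_sum, are my own illustration and not part of the argument above.

```python
def nondeterministic_subset_sum(values, target):
    """Simulate fork-based nondeterminism: one branch per include/exclude choice."""
    # Each "process" is (index of the next value to decide, running sum, chosen items).
    processes = [(0, 0, [])]
    while processes:
        next_generation = []
        for index, total, chosen in processes:
            if total == target:
                # One process halts with success: all the others are abandoned.
                return chosen
            if index == len(values):
                continue  # this branch has no choices left and dies without accepting
            v = values[index]
            # "fork": the child takes the value, the parent skips it.
            next_generation.append((index + 1, total + v, chosen + [v]))
            next_generation.append((index + 1, total, chosen))
        processes = next_generation
    return None  # every branch died, so no subset reaches the target


if __name__ == "__main__":
    print(nondeterministic_subset_sum([3, 34, 4, 12, 5, 2], 9))
```

The number of lockstep generations is at most the number of values plus one, which is the running time the forking machine would see; the length of the processes list is what blows up in the simulation.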
The natural generalization of this is to use the UNIX threading facility to produce parallel threads rather than parallel processes. In this case, the threads can share memory with each other, but one has to be careful, because exponentially many threads will be using exponentially much memory, so no single thread can search all of it. With less risk of mistake, you can allow each process to send fixed-length messages to another process whose process label it already knows. This is equivalent to allowing any pair to share memory, since syncing all the memory you used until time t only takes time polynomial in t.
Observation: A probabilistic version of this machine can simulate any quantum process.
Given a finite-size exponentiated-Hamiltonian matrix U on N states, you want to compute the quantum evolution to time T, then reduce the state according to a measurement, then compute the quantum evolution again. To do this, you fork a process to simulate each path in the path integral, and each process keeps track of its U-matrix weight and of its final state.
Then you congeal the processes: each one sends a message to the nearest process with the same final state, the two amplitudes are added, and the process with the smaller process number shuts down. Each round of this halves the number of live processes. Then you congeal again, and after a number of rounds logarithmic in the number of paths (and hence polynomial in T), you know the amplitude for every state. This also allows you to rotate by any unitary you can construct before making a measurement.
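The fork-then-congeal step can be sketched as follows. This is only an illustration under my own assumptions: the paths are enumerated sequentially with itertools.product rather than forked, U is a small list-of-lists matrix with U[j][i] the amplitude to go from state i to state j, and the names path_processes and congeal are made up for this example.

```python
import itertools
import math


def path_processes(U, start, T):
    """One 'process' per path of length T: returns (final_state, amplitude) pairs."""
    n = len(U)
    processes = []
    for path in itertools.product(range(n), repeat=T):
        amplitude = 1.0 + 0.0j
        state = start
        for nxt in path:
            amplitude *= U[nxt][state]  # weight of the single step state -> nxt
            state = nxt
        processes.append((state, amplitude))
    return processes


def congeal(processes):
    """Pairwise-merge processes with the same final state; return amplitudes and rounds."""
    groups = {}
    for state, amplitude in processes:
        groups.setdefault(state, []).append(amplitude)
    rounds = 0
    while any(len(amps) > 1 for amps in groups.values()):
        # Neighbours pair up; one of each pair absorbs the other's amplitude.
        groups = {
            state: [sum(amps[i:i + 2]) for i in range(0, len(amps), 2)]
            for state, amps in groups.items()
        }
        rounds += 1
    return {state: amps[0] for state, amps in groups.items()}, rounds


if __name__ == "__main__":
    s = 1 / math.sqrt(2)
    hadamard = [[s, s], [s, -s]]
    amplitudes, rounds = congeal(path_processes(hadamard, start=0, T=4))
    print(amplitudes, rounds)
```

The summed amplitude for each final state is the corresponding entry of the T-th power of U applied to the start state, and the round count is the base-2 logarithm of the largest group of same-state paths, rounded up.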
Then you take the squared magnitude of this amplitude for each state. Each process pairs up with another process, and one of the two is picked at random in proportion to their squared amplitudes, with the survivor carrying the sum of the two weights into the next round. Again, after logarithmically many rounds, you have picked one of the processes according to its squared amplitude.
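Here is a minimal sketch of that knockout selection, again simulated sequentially. It assumes the amplitudes are given as a non-empty dict from state to complex amplitude; the function name tournament_sample and the rule that an unpaired process simply advances are my own choices. The winner of each pair inherits the combined weight, which is what makes the overall probability of a state come out proportional to its squared amplitude.

```python
import random


def tournament_sample(amplitudes, rng):
    """Pick a state with probability |amplitude|^2 / total via pairwise knockout rounds."""
    # Each surviving process carries (state, weight), where weight is the
    # summed |amplitude|^2 of every process it has knocked out so far plus its own.
    survivors = [(state, abs(a) ** 2) for state, a in amplitudes.items()]
    rounds = 0
    while len(survivors) > 1:
        next_round = []
        for i in range(0, len(survivors) - 1, 2):
            (s1, w1), (s2, w2) = survivors[i], survivors[i + 1]
            total = w1 + w2
            # Keep s1 with probability w1 / total, otherwise s2.
            winner = s1 if total == 0 or rng.random() * total < w1 else s2
            next_round.append((winner, total))
        if len(survivors) % 2 == 1:
            next_round.append(survivors[-1])  # the unpaired process advances unchanged
        survivors = next_round
        rounds += 1
    return survivors[0][0], rounds


if __name__ == "__main__":
    rng = random.Random(0)
    print(tournament_sample({0: 1.0, 1: 1.0, 2: 2 ** 0.5}, rng))
```

With weights 1, 1 and 2 as in the usage line, the three states should come out with probabilities 1/4, 1/4 and 1/2, and the selection takes two rounds.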
This means that BQP is inside SHM-P. SHM-P includes NP, so it is not reasonable to expect that it equals BQP. It shouldn't be PSPACE either, since you are still limited to polynomial-time computation on any one of the threads.