While there are some obvious exceptions to this (the "static" pattern on a television screen, the "dark frame" noise pattern of a camera), images are rarely generated by random processes. Declaring that an image is drawn from some particular distribution, or generated by some particular random process, is just a post-hoc modeling decision, and there is no "ground truth" to validate or invalidate that choice other than the performance of the machine-vision or image-enhancement method derived from it.
So you can view an $m \times n$ image as a single random matrix (I assume this is what you mean by your first alternative: the image as a whole considered as a single multidimensional random variable); or you can view it as a random field, i.e. a collection of random variables indexed by the sites $\{1, \dots, m\} \times \{1, \dots, n\}$. I've encountered the random-field view more often than the random-matrix view.
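To make the two views concrete, here is a minimal NumPy sketch (my own illustration, not part of the answer itself; the sizes and the Gaussian distribution are arbitrary choices): the same array of draws can be read as repeated realisations of one matrix-valued random variable, or site by site as a field of scalar random variables.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, n_samples = 8, 8, 1000

# Random-matrix view: each draw is one realisation of a single
# (m x n)-dimensional random variable.
samples = rng.normal(size=(n_samples, m, n))

# Random-field view: the same data, read as m*n scalar random variables
# indexed by the sites (i, j) in {1..m} x {1..n}; e.g. each site has its
# own marginal mean and variance.
site_mean = samples.mean(axis=0)  # shape (m, n): one mean per site
site_var = samples.var(axis=0)    # shape (m, n): one variance per site
```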
When using a random-field approach, you can treat the pixels as i.i.d., or you can introduce dependencies between pixel values, e.g. through a Markov random field model. These are not the only options: you could very well consider a two-layer model where a first random process assigns a region index to each pixel of the image, and the value of each pixel is then drawn from a distribution indexed by that region id!

No approach is "better" than another. The more complex the model, the more "plausible" the images it generates, but the more intractable the computations may become. When using this kind of statistical approach, it often helps to draw a few sample images from the chosen distribution/random process and look at them, to get a good grasp of the assumptions you have built into your model.
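As a concrete illustration of that last suggestion, here is a rough sketch (my own code, with arbitrary parameter choices such as `beta=0.8` and a 64×64 grid) that draws one sample image from each of the three models mentioned: an i.i.d. Gaussian field, a simple Ising-type Markov random field sampled with Gibbs sweeps, and a two-layer model that reuses the MRF sample as its region map.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
m, n = 64, 64

# 1. i.i.d. field: every pixel drawn independently from N(0.5, 0.1^2).
iid = rng.normal(0.5, 0.1, size=(m, n))

# 2. Ising-type MRF: binary pixels coupled to their 4 neighbours,
#    sampled with a few sweeps of Gibbs sampling (periodic boundary).
def sample_ising(shape, beta=0.8, sweeps=30):
    x = rng.choice([-1, 1], size=shape)
    for _ in range(sweeps):
        for i in range(shape[0]):
            for j in range(shape[1]):
                s = (x[(i - 1) % shape[0], j] + x[(i + 1) % shape[0], j]
                     + x[i, (j - 1) % shape[1]] + x[i, (j + 1) % shape[1]])
                # P(x_ij = +1 | neighbours) for the Ising model
                p = 1.0 / (1.0 + np.exp(-2.0 * beta * s))
                x[i, j] = 1 if rng.random() < p else -1
    return x

mrf = sample_ising((m, n))

# 3. Two-layer model: the MRF sample plays the role of the region map;
#    each pixel value is then drawn from a Gaussian indexed by its region.
two_layer = np.where(mrf == 1,
                     rng.normal(0.8, 0.05, size=(m, n)),
                     rng.normal(0.2, 0.05, size=(m, n)))

# Looking at the three samples side by side makes the assumptions
# built into each model visible.
fig, axes = plt.subplots(1, 3, figsize=(9, 3))
for ax, img, title in zip(axes, [iid, mrf, two_layer],
                          ["i.i.d.", "Ising MRF", "two-layer"]):
    ax.imshow(img, cmap="gray")
    ax.set_title(title)
    ax.axis("off")
plt.show()
```

Roughly speaking, the i.i.d. sample looks like pure static, the MRF sample shows spatial clumps, and the two-layer sample looks like a crude segmentation with per-region intensities; none of them look much like a natural photograph, which is exactly the kind of assumption-checking this exercise is for.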