Light is remarkably predictable and, like most forms of energy, strictly a linear phenomenon.
If we have a pixel of emissive energy, the math, in photographic terms, is ridiculously simple. To double the amount of light, or in photographic terms go up a stop, we simply double the value. Want to go down a stop and halve the intensity? Easy: halve the value.
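Because the math is purely multiplicative, stop adjustments reduce to powers of two. A minimal sketch, with a hypothetical helper name:

```python
def adjust_stops(value, stops):
    """Scale a radiometrically linear value by a number of photographic stops.

    +1 stop doubles the light, -1 stop halves it, so the general
    form is simply value * 2**stops.
    """
    return value * (2.0 ** stops)


print(adjust_stops(0.18, 1))   # up one stop: doubles the value
print(adjust_stops(0.18, -1))  # down one stop: halves the value
print(adjust_stops(0.18, 3))   # +3 stops: eight times the light
```

Note that this only holds for radiometrically linear values; applying it to transfer-curved data would not yield a true stop.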
Our visual system, on the other hand, is strictly nonlinear. We sense visual stimuli and bend the values to meet the needs of our perceptual system. In particular, we bend a very specific range such that we can detect gradations acutely, while sacrificing the darker and lighter regions our iris has dialed into view.
Given that our devices have significantly lower dynamic range than any physical scene, we cannot simply pass radiometrically linear ratios of light to our display or output referred contexts. If we did, the values would appear vastly too dark, even under darker viewing conditions.
To compensate for this, the values need to be bent away from the radiometrically linear ratios such that the relative emission roughly matches the relative levels our eyes would expect to perceive.
Most imaging formats bake this bent version, also known as a transfer curve or tone response curve, into the data itself, and the curve varies by color space. Other formats, with unique design constraints, may or may not, depending on file tags and metadata. EXR is one such example that specifically mandates a linear format for the data within.
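As a concrete example of such a bend, the sRGB transfer curve is a two part formula: a short linear segment near black, and a power function with an offset for the rest. A sketch of the encoding direction:

```python
def srgb_encode(linear):
    """Map a display-linear value in [0, 1] to an sRGB-encoded value.

    This is the "bend" baked into most sRGB-encoded file formats:
    a linear toe below a small cutoff, then a 2.4 exponent segment
    with a 0.055 offset.
    """
    if linear <= 0.0031308:
        return linear * 12.92
    return 1.055 * (linear ** (1.0 / 2.4)) - 0.055


# Mid grey in linear light lands much higher on the encoded axis,
# which is exactly the perceptual redistribution described above.
print(srgb_encode(0.18))
```

Other color spaces use different constants and exponents entirely, which is why the curve cannot be assumed from the pixel data alone.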
Finally, the issue of black and white, or grayscale, adds another layer of complexity. Such a representation is typically a luminance derived from the RGB triplet as a weighted sum. Every colorspace within an RGB color encoding scheme has uniquely colored primary lights. These primaries can be mapped to an absolute color model known as CIE XYZ, where the Y axis is relative luminance. To obtain a greyscale representation of any RGB colorspace, one weights each channel's value by the Y contribution of that channel's primary and sums the results.
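For the Rec.709 / sRGB primaries, those Y contributions are the familiar 0.2126, 0.7152, and 0.0722 weights. A sketch, with the caveat that the input must be linear RGB, not transfer-curved values, and that other RGB colorspaces have different weights:

```python
def luminance_709(r, g, b):
    """Relative luminance of a *linear* Rec.709 / sRGB triplet.

    The weights are the Y components of the Rec.709 primaries in
    CIE XYZ; a different set of primaries yields different weights.
    """
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


# Equal-energy white sums to ~1.0; a pure green primary carries far
# more luminance than a pure blue one at the same channel value.
print(luminance_709(1.0, 1.0, 1.0))
print(luminance_709(0.0, 1.0, 0.0))
print(luminance_709(0.0, 0.0, 1.0))
```

Running the same formula on nonlinear (transfer-curved) values yields luma, a related but distinct quantity.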
TL;DR: No, you cannot rely on the data values in any given format to be representations of radiometric ratios. It varies by format and, within a format, by color space. Further, even if you invert the two part formula for something like an sRGB JPEG, you will only arrive at a rough display linear value set that terminates at display referred 1.0, having compressed and discarded much of the dynamic range captured by something such as a camera. EXRs, on the other hand, will often offer scene linear representations that require no transfer curve inversion. As for sensor-like responses, a sensor captures largely linearized values, with nonlinear responses near the edges of its sensitivity range.
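Inverting the two part sRGB formula looks like the following sketch. Note that, per the caveat above, this only recovers display linear values clipped at 1.0, not the scene's original dynamic range:

```python
def srgb_decode(encoded):
    """Map an sRGB-encoded value in [0, 1] back to display-linear.

    Inverse of the two part sRGB transfer function: the linear toe
    below 0.04045, and the 2.4 power segment above it. The result
    terminates at display referred 1.0; dynamic range discarded at
    encode time is gone for good.
    """
    if encoded <= 0.04045:
        return encoded / 12.92
    return ((encoded + 0.055) / 1.055) ** 2.4


print(srgb_decode(1.0))  # encoded white maps to display-linear 1.0
print(srgb_decode(0.5))  # encoded mid value is far darker in linear light
```

An EXR holding scene linear data needs none of this; the values can be used directly for radiometric math.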