I think there are a few related, yet separate, questions that must be addressed here.
1) How do we know the Minkowski metric is a flat metric?
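There are several ways to go about this question. Perhaps the most rigorous is to calculate the Riemann curvature tensor: a tensor that vanishes in one coordinate system vanishes in every coordinate system, so the conclusion does not depend on the coordinates you happened to choose. OP, in your question as written now, you've written the metric with respect to Cartesian coordinates. There every metric component is constant, so the Christoffel symbols, and hence the Riemann tensor, vanish identically. A Minkowski metric is still Minkowski even when written in other coordinate systems, and it is still flat.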
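To make the flatness check concrete, here is a short sketch of that calculation in Cartesian coordinates. The signature (-,+,+,+), units with c = 1, and the index convention for the Riemann tensor are my assumptions; other conventions differ by signs but lead to the same conclusion.

```latex
\eta_{\mu\nu} = \mathrm{diag}(-1, 1, 1, 1), \qquad \partial_\lambda \eta_{\mu\nu} = 0

\Gamma^{\rho}{}_{\mu\nu}
  = \tfrac{1}{2}\, \eta^{\rho\sigma}
    \left( \partial_\mu \eta_{\sigma\nu} + \partial_\nu \eta_{\sigma\mu} - \partial_\sigma \eta_{\mu\nu} \right)
  = 0

R^{\rho}{}_{\sigma\mu\nu}
  = \partial_\mu \Gamma^{\rho}{}_{\nu\sigma} - \partial_\nu \Gamma^{\rho}{}_{\mu\sigma}
    + \Gamma^{\rho}{}_{\mu\lambda} \Gamma^{\lambda}{}_{\nu\sigma}
    - \Gamma^{\rho}{}_{\nu\lambda} \Gamma^{\lambda}{}_{\mu\sigma}
  = 0
```

In curvilinear coordinates the Christoffel symbols are generally nonzero, but the Riemann tensor still comes out to zero, because a tensor that vanishes in one coordinate system vanishes in all of them.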
2) Why do we expect that the universe should be modeled with a flat metric?
This relies upon both mathematical logic and physical observations. Even when we expect the universe to be curved in some way, any smooth curved manifold will look flat at a small enough length scale. Once we've decided we live on some arbitrary curved, differentiable manifold, the flat space argument is pure mathematics: math tells us this is the case, but it doesn't tell us how small the length scales must be for the approximation to hold. That has to be determined through observation. Nevertheless, a flat space model gives us a good jumping-off point from which to build up to GR, so it's usually useful.
3) Why do we pick Minkowski space, then, as the model? What physical considerations drive this choice of model?
Differential geometry will tell you that there are only three flat model geometries that meet certain broad criteria: Minkowski space, Galilean space, and Euclidean space. Each of these can be parameterized with different coordinate systems, but these are the only flat models available to us. In particular, we restrict ourselves to one time dimension, which may or may not have a different signature from the spatial dimensions. In Minkowski space, the time dimension has the opposite sign in its metric component. In Galilean space, the metric is degenerate: the time dimension is null and has zero for its metric component. In Euclidean space, it has the same sign as the spatial dimensions.
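As a compact way to compare the three cases, one can write the line element for one time and one space dimension with a single sign parameter. The symbol ε, the restriction to 1+1 dimensions, units with c = 1, and the choice to give space the positive sign are my own shorthand here, not something taken from the question.

```latex
ds^2 = \epsilon \, dt^2 + dx^2,
\qquad
\epsilon =
\begin{cases}
  -1 & \text{Minkowski (time has the opposite sign to space)} \\
  \phantom{-}0 & \text{Galilean (time has zero metric component)} \\
  +1 & \text{Euclidean (time has the same sign as space)}
\end{cases}
```

Treat this as a mnemonic rather than a full description: in the Galilean case the degenerate metric alone does not capture the whole structure of the spacetime.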
Here's where the physics comes in: experiment tells us that the speed of light is the same in all inertial frames of reference. We associate a change of reference frame with a rotation-like transformation on each of these spaces that mixes time and space. In Minkowski space, the rotation-like transformation is a Lorentz boost. In Galilean space, it is a Galilean transformation. In Euclidean space, it is just an ordinary rotation. These operations are unique to each kind of spacetime model, so there is no ambiguity.
For every observer to measure the same speed of light, light must follow a trajectory whose direction is invariant under the rotation-like transformation. That rules out Euclidean space, as a rotation leaves no direction in the plane of rotation unchanged. It also rules out Galilean space: a Galilean transformation shifts every finite speed by the relative velocity between the frames, so no finite speed is the same for all observers. The only "speed" it leaves invariant is an infinite one.
So the only option is Minkowski space, which admits two null directions that light can follow in any given plane containing the time axis. Trajectories along either direction are measured as having the same coordinate speed regardless of reference frame.
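If you want to check this argument numerically, here is a minimal sketch in 1+1 dimensions with c = 1. The function names, the use of the relative velocity v to parameterize all three transformations, and the choice to set the Euclidean rotation angle to arctan(v) are my own illustrative assumptions. Each transformation acts on a direction (t, x), and `speed_after` reports the coordinate speed x/t seen in the new frame.

```python
import math


def lorentz_boost(v):
    """Lorentz boost with relative velocity v (|v| < 1, units where c = 1)."""
    g = 1.0 / math.sqrt(1.0 - v * v)
    return ((g, -g * v), (-g * v, g))


def galilean_shear(v):
    """Galilean transformation: t' = t, x' = x - v t."""
    return ((1.0, 0.0), (-v, 1.0))


def euclidean_rotation(v):
    """Ordinary rotation of the (t, x) plane by the angle arctan(v) (an assumed parameterization)."""
    a = math.atan(v)
    return ((math.cos(a), math.sin(a)), (-math.sin(a), math.cos(a)))


def apply(m, vec):
    """Apply a 2x2 matrix m to the vector (t, x)."""
    t, x = vec
    return (m[0][0] * t + m[0][1] * x, m[1][0] * t + m[1][1] * x)


def speed_after(m, u):
    """Coordinate speed in the new frame of a trajectory that had speed u in the old frame."""
    t, x = apply(m, (1.0, u))
    return x / t


if __name__ == "__main__":
    v = 0.5  # relative velocity between the two frames
    models = (
        ("Minkowski", lorentz_boost),
        ("Galilean", galilean_shear),
        ("Euclidean", euclidean_rotation),
    )
    for name, make in models:
        m = make(v)
        # Compare how the speeds +1 and -1 look after changing frames.
        print(name, speed_after(m, 1.0), speed_after(m, -1.0))
```

With v = 0.5, only the Lorentz boost maps the speeds +1 and -1 back to themselves. The Galilean transformation shifts them to 0.5 and -1.5, and the rotation turns them into 1/3 and -3. In all three models an object at rest (u = 0) ends up moving at -v, which is what makes the comparison a fair one.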
Minkowski space is the only model of flat spacetime that allows us to associate light with specific trajectories in the spacetime in a way that corresponds to our physical observations.