This is my attempt at an intuitive answer to something very tricky.
Nothing observable can depend on the arbitrary renormalization scale $\mu$, like you said. However, our calculations can be made easier or more difficult, and our approximations can be made more or less accurate, by a good choice of $\mu$. As an analogy, while the electric field does not depend on the coordinate system we use to calculate it, it is much easier to write down the electric field for a point charge in spherical coordinates than in Cartesian coordinates.
One of the issues with perturbation theory is the presence of large logs. These are corrections that appear at higher loop orders and look like (in an on-shell, or OS, scheme)
$$
\sim \alpha_s^n \left[\log\left(\frac{p^2}{m^2}\right)\right]^n
$$
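These corrections become large (leading to a breakdown of perturbation theory) when the energy scale $\sim |p|$ is far from the fixed mass scale $m$, for example $|p| \gg m$. Once $\alpha_s \log(p^2/m^2)$ is of order one, every order in the series is comparably important, even if $\alpha_s$ itself is small.
By using an $\overline{\mathrm{MS}}$ scheme, these logs get replaced with terms like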
$$
\sim \alpha_s^n \left[\log\left(\frac{p^2}{\mu^2}\right)\right]^n
$$
where $\mu$ is this arbitrary energy scale.
Now, you can think of $\mu$ as parameterizing different "flavors" of perturbation theory. As an analogy, we can say that in general in electrostatics we want to solve the Poisson equation, but to do a specific calculation we need to choose a coordinate system, and that will make the Laplacian operator take a specific form leading to a concrete PDE. Similarly, choosing $\mu$ will lead to a certain concrete form of the perturbation series.
If we choose $\mu$ so that it is of order the energy scale we are interested in, $\mu \sim |p|$, then these "large logs" become small (they vanish exactly at $\mu^2 = p^2$). The "cost" is that the coupling constant must depend on $\mu$, and hence on the energy scale, so that the final observable quantity does not depend on $\mu$.
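To make this "cost" concrete, here is a sketch at one loop only. It assumes the convention $\mu^2 \, d\alpha_s/d\mu^2 = -b_0 \alpha_s^2$, with $b_0 = (33 - 2 n_f)/(12\pi)$ for QCD with $n_f$ quark flavors. The symbols $b_0$, $n_f$ and the reference scale $\mu_0$ are introduced here for illustration and do not appear elsewhere in this answer. Solving that equation gives
$$
\alpha_s(\mu) = \frac{\alpha_s(\mu_0)}{1 + b_0 \, \alpha_s(\mu_0) \log\left(\frac{\mu^2}{\mu_0^2}\right)}
$$
and expanding the right-hand side in powers of $\alpha_s(\mu_0)$ gives
$$
\alpha_s(\mu) = \alpha_s(\mu_0) \sum_{n=0}^{\infty} \left[-b_0 \, \alpha_s(\mu_0) \log\left(\frac{\mu^2}{\mu_0^2}\right)\right]^n
$$
Each term has the form $\alpha_s^{n+1} \log^n$, which is the same pattern as the large logs above. So, at this order, evaluating the coupling at $\mu \sim |p|$ packages that whole tower of logs into a single number, instead of leaving them as explicit corrections at each loop order.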
The way I like to think of this is that perturbation theory splits a calculation into a tree part and loop parts. Different choices of $\mu$ have the effect of shuffling terms between the tree and loop parts. Of course, the actual observable quantity is a sum of tree and loop, and is not sensitive to this shuffling. Nevertheless, if we can find a choice of $\mu$ that puts as much of the final answer as possible into the "tree" part, and reduces the "loop" part, that will help us: psychologically, because we can use our classical intuition to understand the tree part, and practically, because we retain control of the loop expansion (meaning each higher loop is less important, without large log corrections).
As a result, by using a running coupling, we "resum" these large log corrections, and the tree level answer gives a good approximation to the result over a wider range of energies than we could achieve by using a fixed coupling constant in an OS scheme, even though no observable depends on $\mu$. The genius is that we discovered we had the freedom to introduce this extra parameter that controls how fast perturbation theory converges at different energy scales, and we can choose the parameter so perturbation theory converges very quickly at the scale of interest, but without any observable depending on this parameter.
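The "resummation" can be seen numerically in a small sketch. It is a toy illustration under stated assumptions, not a precision QCD calculation: it uses only the one-loop running, $\alpha_s(\mu) = \alpha_s(\mu_0) / [1 + b_0 \alpha_s(\mu_0) \log(\mu^2/\mu_0^2)]$ with $b_0 = (33 - 2 n_f)/(12\pi)$, keeps $n_f = 5$ fixed (ignoring flavor thresholds), and takes $\alpha_s \approx 0.118$ at $\mu_0 \approx 91.19$ GeV as an illustrative starting value. The function names are hypothetical. It compares the closed-form running coupling with the same expression expanded in powers of the fixed coupling $\alpha_s(\mu_0)$ and truncated at a finite order.

```python
import math


def b0(n_f):
    """One-loop coefficient in the convention mu^2 d(alpha)/d(mu^2) = -b0 * alpha^2."""
    return (33 - 2 * n_f) / (12 * math.pi)


def alpha_resummed(alpha0, mu0, mu, n_f=5):
    """Closed-form one-loop running coupling (all alpha^n log^n terms summed)."""
    log_term = math.log(mu**2 / mu0**2)
    return alpha0 / (1 + b0(n_f) * alpha0 * log_term)


def alpha_truncated(alpha0, mu0, mu, order, n_f=5):
    """Same quantity expanded in the fixed coupling alpha0, kept up to alpha0^(order+1)."""
    log_term = math.log(mu**2 / mu0**2)
    x = -b0(n_f) * alpha0 * log_term  # this is the "alpha * log" combination
    return alpha0 * sum(x**n for n in range(order + 1))


if __name__ == "__main__":
    alpha0, mu0 = 0.118, 91.19  # illustrative inputs (assumption), scales in GeV
    for mu in (100.0, 1000.0, 10.0, 2.0):
        exact = alpha_resummed(alpha0, mu0, mu)
        for order in (0, 1, 2):
            approx = alpha_truncated(alpha0, mu0, mu, order)
            print(f"mu={mu:7.1f}  order={order}  truncated={approx:.5f}  resummed={exact:.5f}")
```

The further $\mu$ is from $\mu_0$, the larger the combination `x` becomes and the more terms of the fixed-coupling expansion are needed to match the closed form. That is the sense in which the running coupling keeps the low-order result useful over a wider range of scales.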