When we use the STFT to do some kind of non-linear process (such as phase-vocoder applications), the consequences of that process are unpredictable enough that it's usually impossible to avoid clicks if your window is rectangular and, in the output, one segment of sound is butt-spliced to the following segment. There will be a discontinuity at that splice point.
So some windowing that has a gradual fade-in and fade-out is necessary to keep the process smooth. The window applied to the input data need not be the same as the window applied to the output. When I was doing this kind of STFT work, my preference for an input window was the Gaussian window, because the Fourier Transform of a Gaussian function in the time domain is a Gaussian function in the frequency domain. A single sinusoid simply slides this Gaussian up the frequency spectrum to the sinusoid's frequency. Theoretically, there are no ripples or bumps (sidelobes) in the result from a single sinusoid. That makes it easier to identify individual frequency components without confusing them with bumps from other frequency components. Also, there is a lot of mathematical commonality between a Gaussian function and a linear-swept chirp.
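One practical consequence of the Gaussian-to-Gaussian property is that the log-magnitude spectrum of a Gaussian-windowed sinusoid is a parabola around its peak, so a three-point parabolic fit recovers the frequency between bins. The following is only a sketch in NumPy under my own assumptions: the function name `gaussian_peak_bin`, the choice of `sigma = N / 12`, and the complex test tone are all illustrative and not taken from the answer above.

```python
import numpy as np

def gaussian_peak_bin(x, sigma):
    """Estimate the (fractional) DFT bin of the strongest sinusoid in x.

    Hypothetical helper: applies a Gaussian window of standard deviation
    `sigma` (in samples), then fits a parabola to the log-magnitude at the
    peak bin and its two neighbours.
    """
    N = len(x)
    n = np.arange(N)
    w = np.exp(-0.5 * ((n - N / 2) / sigma) ** 2)   # Gaussian window
    # Tiny offset guards against log(0) in bins far from the peak.
    logmag = np.log(np.abs(np.fft.fft(w * x)) + 1e-300)
    k = int(np.argmax(logmag))
    a, b, c = logmag[k - 1], logmag[k], logmag[(k + 1) % N]
    # Vertex of the parabola through the three points (offset in bins).
    return k + 0.5 * (a - c) / (a - 2.0 * b + c)

# Assumed test signal: one complex sinusoid at a non-integer bin.
N = 1024
f0 = 100.3
x = np.exp(2j * np.pi * f0 * np.arange(N) / N)
est = gaussian_peak_bin(x, sigma=N / 12)
print(abs(est - f0))   # should be very small compared with one bin
```

A complex exponential is used here so that there is no negative-frequency image to perturb the peak; with a real sinusoid the estimate is still good as long as the tone is well away from DC and Nyquist.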
Normally, you want the effective window at the output to be what we call complementary. That is, the falling slope of the previous frame adds to the rising slope of the current frame, and the two sum to a constant. Whether that happens depends on the window and the overlap, but there are also weird special cases.
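To make "complementary" concrete, here is a small NumPy check that shifted copies of a window sum to a constant. It is a sketch under assumptions of my own: a periodic Hann window of length 512 at 50% overlap, and a helper named `overlap_sum` that is purely illustrative.

```python
import numpy as np

def overlap_sum(window, hop, n_frames=8):
    """Sum n_frames copies of `window`, each delayed by `hop` samples."""
    L = len(window)
    total = np.zeros(hop * (n_frames - 1) + L)
    for k in range(n_frames):
        total[k * hop : k * hop + L] += window
    return total

L = 512
n = np.arange(L)
hann = 0.5 - 0.5 * np.cos(2.0 * np.pi * n / L)   # periodic Hann window

total = overlap_sum(hann, hop=L // 2)
# Ignore the first and last half-frame, where only one window is present.
steady = total[L // 2 : -(L // 2)]
print(steady.min(), steady.max())   # both should be 1 to within rounding
```

The same helper can be pointed at other windows and hops to see which combinations are complementary and which leave a ripple.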
But with fast convolution, that is, using a form of the STFT and the FFT to perform a long convolution (the impulse response $h[n]$ might be several hundred thousand samples long), the process is linear and you can predict exactly what the algorithm will do about the discontinuities at the edges of each segment. Overlap-save (I think a better name is "Overlap-scrap") recognizes that a certain number of contiguous output samples are crap (because the FFT performs a circular convolution, so the impulse response straddles the wrap-around boundary between $x[N-1]$ and $x[0]$) and simply discards them and "saves" the good samples.
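A minimal overlap-save sketch in NumPy follows. The function name `overlap_save`, the default `nfft=1024`, and the way the signal is padded at both ends are assumptions made for illustration. With an impulse response of length $M$, the first $M-1$ samples of each circular-convolution result are the ones that get scrapped.

```python
import numpy as np

def overlap_save(x, h, nfft=1024):
    """Linear convolution of x with h using overlap-save (illustrative)."""
    M = len(h)
    hop = nfft - (M - 1)            # good output samples per frame
    assert hop > 0, "nfft must be at least len(h)"
    H = np.fft.rfft(h, nfft)

    n_out = len(x) + M - 1          # length of the full linear convolution
    n_frames = -(-n_out // hop)     # ceiling division
    # M-1 leading zeros give the first frame its "history";
    # trailing zeros fill out the last frame.
    xp = np.concatenate([np.zeros(M - 1), x,
                         np.zeros(n_frames * hop - len(x))])

    y = np.empty(n_frames * hop)
    for k in range(n_frames):
        frame = xp[k * hop : k * hop + nfft]
        yc = np.fft.irfft(np.fft.rfft(frame) * H, nfft)   # circular result
        y[k * hop : (k + 1) * hop] = yc[M - 1 :]          # scrap first M-1
    return y[:n_out]

# Assumed test data: random signal and random impulse response.
rng = np.random.default_rng(0)
x = rng.standard_normal(5000)
h = rng.standard_normal(200)
print(np.allclose(overlap_save(x, h), np.convolve(x, h)))
```

Consecutive input frames overlap by $M-1$ samples, which is where the "overlap" in the name comes from; nothing is added together at the output.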
Overlap-add zero-pads each input segment and relies on linearity: the falling tail of the output of the previous frame, which spills past the segment boundary, adds to the beginning of the output of the current frame to create the correct output.
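For comparison, a minimal overlap-add sketch with a rectangular window (plain block segmentation) might look like the following. The name `overlap_add`, the block size of 512, and the power-of-two FFT length are illustrative assumptions.

```python
import numpy as np

def overlap_add(x, h, block=512):
    """Linear convolution of x with h using overlap-add (illustrative)."""
    M = len(h)
    # Smallest power of two that holds one block convolved with h.
    nfft = 1 << (block + M - 2).bit_length()
    H = np.fft.rfft(h, nfft)

    y = np.zeros(len(x) + M - 1)
    for start in range(0, len(x), block):
        seg = x[start : start + block]
        # rfft(seg, nfft) zero-pads the segment up to nfft samples.
        yseg = np.fft.irfft(np.fft.rfft(seg, nfft) * H, nfft)
        n_valid = len(seg) + M - 1
        # The last M-1 samples overlap the next block's output and add to it.
        y[start : start + n_valid] += yseg[:n_valid]
    return y

# Assumed test data: random signal and random impulse response.
rng = np.random.default_rng(1)
x = rng.standard_normal(5000)
h = rng.standard_normal(200)
print(np.allclose(overlap_add(x, h), np.convolve(x, h)))
```

Here the input segments do not overlap at all; the overlap is in the output, where each segment's $M-1$ sample tail lands on top of the next segment.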
If you have MIPS to burn, you can do very high-quality Overlap-Add convolution with the STFT and a Hann window, but you still must zero-pad the input to the FFT and your frame hop length will be half what it would be with a rectangular window. But doing that and getting it right helps set you up to use STFT to do other, more sophisticated, operations such as the phase vocoder.
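The Hann-windowed version can be sketched the same way. The assumptions here are mine: a periodic Hann window of even length `L`, a hop of `L // 2` so the windows sum to one, half a frame of zeros added at each end so every input sample sits under two windows, and the hypothetical name `hann_ola_convolve`.

```python
import numpy as np

def hann_ola_convolve(x, h, L=512):
    """Linear convolution via Hann-windowed overlap-add (illustrative)."""
    assert L % 2 == 0, "L must be even for a 50% hop"
    M = len(h)
    hop = L // 2
    nfft = 1 << (L + M - 2).bit_length()     # room for L + M - 1 samples
    w = 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(L) / L)  # periodic Hann
    H = np.fft.rfft(h, nfft)

    # One extra frame so the padded signal is fully covered.
    n_frames = 1 + -(-len(x) // hop)
    xp = np.concatenate([np.zeros(hop), x,
                         np.zeros(n_frames * hop - len(x))])

    yp = np.zeros(len(xp) + M - 1)
    for k in range(n_frames):
        frame = w * xp[k * hop : k * hop + L]            # windowed input
        yk = np.fft.irfft(np.fft.rfft(frame, nfft) * H, nfft)
        yp[k * hop : k * hop + L + M - 1] += yk[: L + M - 1]
    # Remove the half frame of leading padding.
    return yp[hop : hop + len(x) + M - 1]

# Assumed test data: random signal and random impulse response.
rng = np.random.default_rng(2)
x = rng.standard_normal(5000)
h = rng.standard_normal(100)
print(np.allclose(hann_ola_convolve(x, h), np.convolve(x, h)))
```

Because the filtering in each frame is linear and the windowed frames add back up to the input, the result matches direct convolution. The structure (window, FFT, modify, inverse FFT, overlap-add) is the part that carries over to non-linear processing, where an output window is typically applied as well.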