When computing the power spectral density (PSD) from the $N$-point FFT of a time-domain signal $x(t)$ of $N$ data points (assuming $N=2^n$, $n\in\mathbb{Z}^+$), we normalize the squared magnitude of the FFT by the sampling rate $f_s$ times the number of data points $N$, like so
$$S[i]=\frac{2|X[i]|^2}{f_s N},\qquad i=0,1,\dots,\frac{N}{2}$$
where $S[i]$ and $X[i]$ are the PSD and FFT at frequency bin $i$, which corresponds to frequency $i f_s/N$ (so the bins run from $0$ to $f_s/2$).
I understand the scaling factor of 2, since we need to preserve the total power after discarding the negative-frequency half of the spectrum, but I don't understand the scaling $\frac{1}{f_s N}$.
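For concreteness, here is the normalization as I currently compute it, as a minimal NumPy/SciPy sketch (the sampling rate, length, and 100 Hz test tone are arbitrary choices of mine), cross-checked against `scipy.signal.periodogram`:

```python
import numpy as np
from scipy import signal

fs = 1000.0                      # sampling rate in Hz (arbitrary)
N = 1024                         # number of samples, N = 2^10
t = np.arange(N) / fs
x = np.sin(2 * np.pi * 100 * t)  # 100 Hz test tone

# One-sided PSD normalized by fs*N, as in the formula above
X = np.fft.rfft(x)
S = 2.0 * np.abs(X) ** 2 / (fs * N)
S[0] /= 2.0    # DC bin has no negative-frequency counterpart, so no factor 2
S[-1] /= 2.0   # likewise for the Nyquist bin when N is even

# Cross-check: scipy's periodogram with a rectangular window, density scaling
f, Pxx = signal.periodogram(x, fs=fs, window='boxcar',
                            detrend=False, scaling='density')
print(np.allclose(S, Pxx))                  # True

# Integrating the PSD over frequency recovers the mean power of the signal
print(np.sum(S) * fs / N, np.mean(x ** 2))  # both ~0.5
```

The last check is what I mean by "preserving power": with this normalization the PSD integrates back to the signal's mean power.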
Also, as I understand it, the $1/N$ scaling applies when we use rectangular pulse shaping. What happens if we instead use root-raised-cosine filtering, where each pulse spans $L$ symbols, for example?
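To poke at this numerically, I tried the same normalization on a pulse-shaped signal. This is only a sketch: I use a `scipy.signal.firwin` lowpass as a stand-in for an actual root-raised-cosine filter, and the oversampling factor, symbol count, and filter length are arbitrary choices of mine:

```python
import numpy as np
from scipy import signal

fs = 1000.0
sps = 8                                   # samples per symbol (arbitrary)
n_sym = 128
N = sps * n_sym                           # 1024 samples total

# Random BPSK symbols, upsampled by zero-stuffing
symbols = np.random.choice([-1.0, 1.0], size=n_sym)
upsampled = np.zeros(N)
upsampled[::sps] = symbols

# Stand-in pulse-shaping filter spanning several symbols
# (NOT an RRC; just a lowpass to get a non-rectangular pulse shape)
h = signal.firwin(numtaps=8 * sps + 1, cutoff=1.0 / sps)
x = signal.lfilter(h, 1.0, upsampled)

# Same normalization as before
X = np.fft.rfft(x)
S = 2.0 * np.abs(X) ** 2 / (fs * N)
S[0] /= 2.0
S[-1] /= 2.0

# Parseval-style check, independent of the pulse shape
print(np.sum(S) * fs / N, np.mean(x ** 2))  # the two agree
```

Numerically the integrated PSD still matches the mean power here, but I would like to understand the reasoning behind the $1/N$ factor and whether it ever needs to change with the pulse shape.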