If the OP is actually interested in the simple time delay between the files, then computing the cross-correlation function and finding the delay parameter $\tau$ that maximizes this function would provide a reasonable estimate of the delay.
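As a rough illustration of that delay estimate, here is a minimal NumPy sketch. The function name `estimate_delay` and the synthetic test signals are my own assumptions for the example, not something from the question; it also assumes both files have the same sample rate and only resolves the delay to the nearest sample.

```python
import numpy as np

def estimate_delay(a, b):
    """Return the lag in samples (positive if b lags a) that maximizes the cross-correlation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Full cross-correlation: c[k] = sum_n b[n + k] * a[n]
    xcorr = np.correlate(b, a, mode="full")
    # Lag axis matching the "full" output: -(len(a) - 1) ... len(b) - 1
    lags = np.arange(-(len(a) - 1), len(b))
    return int(lags[np.argmax(xcorr)])

# Hypothetical test data: white noise and a copy delayed by 37 samples
rng = np.random.default_rng(0)
sig_a = rng.standard_normal(1000)
true_delay = 37
sig_b = np.concatenate([np.zeros(true_delay), sig_a])[:len(sig_a)]

delay_samples = estimate_delay(sig_a, sig_b)
print(delay_samples)
```

Dividing the result by the sample rate gives $\tau$ in seconds. For long recordings an FFT-based correlation would be faster, but the idea is the same.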
However, delay and phase are not the same thing. For a fixed delay, the phase of each frequency component decreases linearly with frequency (it grows in the negative direction), with a slope whose magnitude is proportional to the delay. This is the basic property of "linear phase" filters, which provide a fixed delay and therefore no group-delay distortion.
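To make the delay/phase relationship concrete, the standard Fourier time-shift property can be written out as follows. The notation is my own choice for this illustration: $a(t)$ and $b(t)$ are the two waveforms, $A(f)$ and $B(f)$ their Fourier transforms, and $\phi(f)$ the phase of $B$ relative to $A$.

$$b(t) = a(t-\tau) \quad\Longleftrightarrow\quad B(f) = A(f)\,e^{-j 2\pi f \tau}$$

$$\phi(f) = -2\pi f \tau, \qquad -\frac{1}{2\pi}\frac{d\phi}{df} = \tau$$

So, for example, a delay of $\tau = 1$ ms corresponds to a phase of $-36^\circ$ at 100 Hz but $-360^\circ$ at 1 kHz: one delay, a different phase at every frequency.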
If the OP is truly interested in the phase characteristics between the waveforms, meaning the phase for every frequency component (or more specifically the frequency response, including magnitude and phase), then the Wiener-Hopf equations can be used to determine this: given a signal $A$ that passes through a distortion channel to create signal $B$, we can use the Wiener-Hopf equations to determine the channel from $A$ and $B$ as a frequency response showing how the amplitude and phase are changed for every frequency component. I detail this further in these posts:
Compensating Loudspeaker frequency response in an audio signal
and
How determine the delay in my signal practically
Depending on the order in which we use $A$ and $B$ in the Wiener-Hopf equations, we can determine either the channel itself or the compensating filter that equalizes the effects of the channel (for subsequent waveforms that pass through the same channel). This is a linear equalization approach, which is less effective for channels with deep frequency nulls (such as multipath fading with long delay spreads): compared to non-linear equalization approaches, it leads to noise enhancement at those nulls.
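A minimal sketch of the channel-estimation direction, under some assumptions of mine: the channel is modeled as a short FIR filter, the Wiener-Hopf equations are solved in their deterministic least-squares form $\mathbf{R}\mathbf{h} = \mathbf{p}$ using sample correlations, and the names `wiener_hopf_fir`, `num_taps`, and the example channel are hypothetical. It is not the exact code from the linked posts.

```python
import numpy as np

def wiener_hopf_fir(x, d, num_taps):
    """Least-squares FIR estimate h such that (x convolved with h) approximates d."""
    x = np.asarray(x, dtype=float)
    d = np.asarray(d, dtype=float)
    n = min(len(x), len(d))
    # Data matrix: column k holds x delayed by k samples (zeros before the start)
    X = np.zeros((n, num_taps))
    for k in range(num_taps):
        X[k:, k] = x[:n - k]
    R = X.T @ X        # sample autocorrelation matrix of x
    p = X.T @ d[:n]    # sample cross-correlation between x and d
    return np.linalg.solve(R, p)  # Wiener-Hopf equations: R h = p

# Hypothetical example: A is white noise, B is A passed through a made-up channel
rng = np.random.default_rng(1)
sig_a = rng.standard_normal(2000)
channel = np.array([0.0, 0.0, 1.0, 0.5, -0.2])  # 2-sample delay plus echoes
sig_b = np.convolve(sig_a, channel)[:len(sig_a)]

h_est = wiener_hopf_fir(sig_a, sig_b, num_taps=8)

# Frequency response of the estimated channel: magnitude and phase per frequency
H = np.fft.rfft(h_est, 512)
magnitude_db = 20 * np.log10(np.abs(H) + 1e-12)
phase_rad = np.unwrap(np.angle(H))
```

Swapping the roles of the two signals (estimating a filter that maps $B$ back to $A$) gives the compensating filter instead; in practice the target is usually a delayed copy of $A$ so that a causal solution exists. With real recordings the result also depends on the noise level and on how well `num_taps` covers the channel's actual length.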