The phase between the sine waves can be computed as suggested, at any sampling rate as long as it is greater than twice the sum of the frequency of the sine waves. Further the result is sensitive to the peak amplitude so the waveforms need to be normalized or the phase result accordingly scaled.
This is all clear from the cosine product rule that the product of two sinusoids will have a low frequency component that is proportional to the cosine of the phase between them:
$$\cos(\alpha)\cos(\beta)= \frac{1}{2}(\alpha + \beta)+ \frac{1}{2}(\alpha - \beta)$$
So when the two frequencies are the same and there is a phase difference we have:
$$2\cos(\omega_1t+ \theta)\cos(\omega_1t) = \cos(2\omega_1t+ \theta)+ \cos(\theta)$$
After a low pass filter we get $\cos(\theta)$. Note that the time delay of the low pass filter has no impact on the result (the answer for a static phase offset is the DC mean itself). The loss of the filter will change the magnitude and this must be calibrated out accordingly.
If the system is sampled and the product is done in discrete time, the sampling rate must be high enough to support the double frequency component above as otherwise it can alias to be within the bandwidth of the subsequent low pass filter. Sampling at an even higher frequency will only provide more samples in the determining the mean (which is the purpose of the low pass filter), so will not be detrimental to the result other than needlessly increasing filter complexity to achieve the same cutoff frequency.
Here is a simulation using the OP's parameters demonstrating in simplest form the desired result and operation:
Python code:
Fs = 64*10**6
N = 2048
f = 100000
t = np.linspace(0, (N-1)/Fs,N)
amp = 1
phase = np.pi/4
y = amp * (np.sin( f * 2 * np.pi* t) )
z = amp * (np.sin(f * 2 * np.pi * t+ phase) )
x = y * z
print(f"Actual phase : {phase:0.3f}")
print(f"Measured phase: {np.arccos(2*np.mean(x)):0.3f}")

NOTES on the results:
y and x are the inputs as sinusoids with a 0.785 radian offset in phase between them. z is the result of the product $xy$.
There should not be any DC offsets on the sinusoidal waveforms; given the result for a static phase offset is a DC term, this will only be an interfering offset (error).
The average of the product is the desired result that will be proportional to the cosine of the phase angle. Here with the waveforms each normalized to 1, the solution is $\phi = \overline{2xy}$, meaning twice the average of the product. The purpose of any low pass filter is to remove the higher (double) frequency component and pass the DC value. If the phase is expected to change with time within a certain bandwidth, then the low pass filter used must be designed to pass that bandwidth.
Note the measured result is 0.755, which is in error from the actual phase offset of 0.785. This is because the output has not gone through an integer number of cycles, so the final cycle is incompletely contributing to the mean. This however does not mean we should always attempt to average over an integer number of cycles of the double frequency output, but instead illustrates that the averaging duration in this case is not long enough for the frequency of the sine wave used. As the duration is increased (or similarly as the frequency of the sine wave is increased), the fractional error from any partial cycles is reduced. For applications where this can't be avoided, windowing techniques could be used to reduce the effects of this similar to that done for FFT processing.
Given the target application requires high efficiency, consider using a CIC filter instead of the IIR filters suggested in the OP's code which will combine performing a moving average together with a decimated (down-sampled output). The implementation to do a moving average over all 2048 samples is shown below:

The $z^{-1}$ blocks are one sample delays at the input rate on the left of the decimator, and output rate at the right of the decimator (running at 1/2048 of the input rate). The diagram to the left of the decimator just after the product represents and accumulator, and this accumulator must be designed such that it wraps on an overflow, and no more than one wrap is allowed during the period of a decimation cycle. The entire diagram shown must be implemented in fixed point arithmetic. The decimator indicates selecting every 2048th sample from the accumulator output and throwing away the rest. Every subsequent sample at the decimator output is subtracted from the previous. Alternatively this could be implemented with an "integrate and dump" where an accumulator grows over 2048 samples, dumps the result and clears. The result is equivalent to a moving average, exactly so if the output is divided by 2048 (convenient bit shift operation given that is a power of 2, but other scaling may be needed depending on input level). What is convenient with the CIC implementation shown, is that it is completely programmable in terms of the moving average number of sample duration just by changing the decimation: change 2048 above to 512 for example to then be a moving average over 512 samples (and scale accordingly).
Finally for ultimate efficiency a Cordic Rotator could be considered for efficient computation of the inverse cosine.
HOWEVER, if this is to be used in a Phase Lock Loop, then the loop filter itself will properly serve as the filter, and there is no reason for an inverse cosine computation: The loop will drive the phase difference to 90°, where $\cos(\theta)=0$, since that is the measure of loop error. The error in phase from locked condition will then be $\sin(\theta)$ and for small angles $\sin(\theta) \approx \theta$. Further any gain will be part of the loop gain itself, so the only concern is for the input sinusoids to be normalized to some constant value (as done with an AGC or automatic gain control). Everything gets simpler in terms of what else is needed to measure phase when the multiplier is used as the phase detector in a loop!
yin the code still looks to be an array of zeros – SakSath Apr 03 '23 at 12:47