First of all, you'd go on an information scavenger hunt through the internet on that cassette's data format: http://beebwiki.mdfs.net/Acorn_cassette_format tells us you're quite spot on:
Signal
The signal may be in one of three states; zero, one or no carrier.
Breaks in the carrier are detected and used to reset the cassette
loading firmware.
Zero is represented by a sinusoidal wave at 1200 Hz. The exact
frequency is nominally 16 000 000 / 13 312, or 1201.9 Hz, but tape
decks vary in speed so small differences are tolerated. One cycle (at
1200 baud) or four cycles (at 300 baud) represent one zero bit.
Binary one is represented by a sinusoidal wave at 2400 Hz. The nominal
frequency is closer to 2403.8 Hz. Two cycles (at 1200 baud) or eight
cycles (at 300 baud) represent a one bit. An odd number of 2400 Hz
cycles can and does occur.
To allow the tape deck circuitry to settle, each data stream is
preceded by 5.1 seconds of 2400 Hz leader tone. This is reduced to 1.1
seconds if the recording computer has paused in the middle of a file,
or 0.9 seconds between data blocks recorded in one go.
At the end of the stream is a 5.3 second, 2400 Hz trailer tone. This
is reduced to 0.2 seconds when pausing in the middle of a file (giving
at least 1.3 seconds' delay between data blocks.) The timings are
derived from VSYNC interrupts so they vary between recordings.
So, yes, this is Frequency Shift Keying (FSK) and can be demodulated as such.
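To get a feeling for the numbers in that quote, here's the arithmetic spelled out. The 48 kHz sampling rate is my assumption (a typical sound card setting), not something the format prescribes:

```python
# Nominal tone frequencies, taken from the quoted format description.
CLOCK_HZ = 16_000_000
DIVIDER = 13_312

f_zero = CLOCK_HZ / DIVIDER    # one cycle of this tone is one 0 bit (at 1200 baud)
f_one = 2 * f_zero             # two cycles of this tone are one 1 bit (at 1200 baud)

SAMPLE_RATE = 48_000           # assumption: sampling rate of the digitized recording

bit_duration = 1 / f_zero                     # seconds per bit at "1200 baud"
samples_per_bit = SAMPLE_RATE * bit_duration  # not an integer!

print(f"zero tone:       {f_zero:.1f} Hz")
print(f"one tone:        {f_one:.1f} Hz")
print(f"bit duration:    {bit_duration * 1e6:.0f} us")
print(f"samples per bit: {samples_per_bit:.3f}")
```

Note that a bit doesn't last a whole number of samples at that rate, so whatever does the bit timing later on can't just count a fixed integer number of samples per bit forever.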
Since the computational power of modern PCs/laptops/phones/fridges/toaster ovens is huge compared to the amount of data that fits on a cassette, we don't need to worry about how efficiently we demodulate this at all.
So, you'd go and
- design two bandpass filters: one that's centered around 1201.9 Hz, the other one at 2403.8 Hz,
- take their outputs, convert them to their absolute value, then
- weight one of them so that they have the same amplitude when excited,
- subtract them and finally
- make a 0/1 decision based on the sign of the result to get the bit.
Sounds a bit daunting, but looks roughly like this in GNU Radio 3.9:

(You can ignore everything that says "QT" in its name; it's just for interaction / visualization. Once you've figured out the right amplitude so that 0 and 1 bits come out the same, which is really only necessary to minimize the probability of bit errors, you'd just multiply with that fixed value instead of using the Qt range.)
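If you'd rather see the filter / absolute value / weight / subtract / decide recipe outside of GNU Radio, here's a minimal sketch in plain Python, standard library only. It is not the flowgraph above, just the same idea under a few assumptions of mine: 48 kHz sampling, 1200 baud, a one-bit-long rectangular window shifted to each tone as the "bandpass filter", and known bit boundaries. `synthesize` is a hypothetical helper that only exists to produce a test signal; in practice you'd use NumPy/SciPy or GNU Radio for the filtering.

```python
import cmath
import math

SAMPLE_RATE = 48_000                      # assumption: sampling rate of the recording
F_ZERO = 16_000_000 / 13_312              # ~1201.9 Hz tone, represents a 0 bit
F_ONE = 2 * F_ZERO                        # ~2403.8 Hz tone, represents a 1 bit
SAMPLES_PER_BIT = SAMPLE_RATE / F_ZERO    # ~39.9 at 1200 baud


def tone_taps(freq, length):
    """Complex band-pass FIR taps: a rectangular window shifted to `freq`."""
    return [cmath.exp(-2j * math.pi * freq * n / SAMPLE_RATE) for n in range(length)]


def envelope(samples, taps):
    """Absolute value of the filter output, one value per window start index."""
    length = len(taps)
    return [
        abs(sum(s * t for s, t in zip(samples[i:i + length], taps)))
        for i in range(len(samples) - length + 1)
    ]


def demodulate(samples, weight=1.0):
    """Per-sample 0/1 decisions: sign of (weighted 2400 Hz - 1200 Hz) envelope."""
    length = round(SAMPLES_PER_BIT)
    env_zero = envelope(samples, tone_taps(F_ZERO, length))
    env_one = envelope(samples, tone_taps(F_ONE, length))
    return [1 if weight * e1 - e0 > 0 else 0 for e0, e1 in zip(env_zero, env_one)]


def synthesize(bits):
    """Hypothetical test helper: phase-continuous FSK signal for `bits`."""
    samples, phase = [], 0.0
    for n in range(int(len(bits) * SAMPLES_PER_BIT)):
        bit = bits[int(n / SAMPLES_PER_BIT)]
        phase += 2 * math.pi * (F_ONE if bit else F_ZERO) / SAMPLE_RATE
        samples.append(math.sin(phase))
    return samples


pattern = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0]
signal = synthesize(pattern + [1, 1])     # two extra ones stand in for trailer tone
decisions = demodulate(signal)
# Decision i covers samples i .. i+length-1, so read it at the start of each bit.
recovered = [decisions[round(k * SAMPLES_PER_BIT)] for k in range(len(pattern))]
print(recovered == pattern)
```

On a clean synthetic signal `weight=1.0` is fine; on a real tape recording the two tones usually don't arrive with the same amplitude, which is exactly what that weight is for. The sketch also cheats by knowing where the bits start, so on real data you'd still need the synchronization on top.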
You'll have to add some discarding for the start and stop bits surrounding each byte, and you'll need to stay synchronized to them. I think the most elegant (and easiest) way would be to write a few quick lines of Python in an "Embedded Python Block" in GNU Radio Companion: detect the start of the start bit, pass the 8 data bits through, ignore the stop bit, and go back to waiting for the next start bit. Pretty doable!
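As a rough sketch of that framing logic, here it is as a plain Python function working on a list of already-decided bits (one entry per bit, not per sample), rather than as an actual GNU Radio block. I'm assuming the usual asynchronous serial convention here: start bit is 0, stop bit is 1, data bits arrive least significant bit first, and the leader tone decodes as a run of ones. `frame` and `deframe` are names I made up for illustration.

```python
def frame(data):
    """Hypothetical helper: wrap each byte as start(0) + 8 data bits (LSB first) + stop(1)."""
    bits = []
    for byte in data:
        bits.append(0)                                   # start bit
        bits.extend((byte >> k) & 1 for k in range(8))   # data bits, LSB first
        bits.append(1)                                   # stop bit
    return bits


def deframe(bits):
    """Wait for a start bit, collect 8 data bits, skip the stop bit, repeat."""
    data = bytearray()
    i = 0
    while i + 10 <= len(bits):
        if bits[i] != 0:
            i += 1               # still idle / leader tone: keep waiting for a start bit
            continue
        payload = bits[i + 1:i + 9]
        data.append(sum(b << k for k, b in enumerate(payload)))  # LSB first (assumed)
        i += 10                  # jump over start bit, 8 data bits and the ignored stop bit
    return bytes(data)


stream = [1] * 20 + frame(b"BBC") + [1] * 20   # leader ones, three bytes, trailer ones
print(deframe(stream))                         # b'BBC'
```

Inside an Embedded Python Block the same state machine would run over the incoming bit stream in the block's `work` function, with the position counter kept between calls.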