I want to generate a human voice using the FFT. To do this, I analyzed the FFT of a recording of my mother saying "e" and found that the values appear to follow a normal distribution. I then created an FFT by drawing values from a normal distribution, but the result sounds like hissing.

The two FFTs are on this site, https://murilos.pythonanywhere.com/: one from my mother's voice and the other randomly generated.

OverLordGoldDragon
Eleno
  • Hi Eleno, and welcome to SE.SP! I don't understand the question. Are you asking what the characteristics of human voice frequency content are? As is, your question reads "why does a recording of my mother sound like my mother?" – Jdip Jan 14 '23 at 09:07
  • Are you assuming that things with the same PDF of discrete PSD amplitudes need to sound the same? Why? – Marcus Müller Jan 14 '23 at 10:42
  • @Jdip I edited my question. – Eleno Jan 14 '23 at 14:07
  • @Marcus I don't know what PDF and PSD are. – Eleno Jan 14 '23 at 14:08
  • To clarify, are you only seeking to generate "e", or a full word / sentence? And should it sound like a particular speaker (e.g. you), or any human-sounding "e"? – OverLordGoldDragon Jan 15 '23 at 14:13
  • @OverLordGoldDragon I want to understand what makes the human voice different from other sounds. Then I want to recognize consonants and vowels spoken by anyone. – Eleno Jan 15 '23 at 22:30
  • @Eleno Updated. "recognize consonants and vowels spoken by anyone" - you'll need machine learning, in addition to signal processing for the "understanding" part. – OverLordGoldDragon Jan 16 '23 at 12:17
  • @OverLordGoldDragon I put vowels spoken by me and my mother on the site, I think there is a similarity. Maybe we don't need to use artificial intelligence. – Eleno Jan 16 '23 at 17:27
  • If it's recognizing "e" and who said it, you'll need to learn from data. If it's just recognizing "e", probably not, and FFT likely can do it, but it's not the best option, and you'll still need to work with data and engineered priors. All this is still separate from generation, though related. You should specify what exactly you're looking for, but to me it sounds like an overkill project for your skill level and the priority should be learning. – OverLordGoldDragon Jan 16 '23 at 17:52
  • @OverLordGoldDragon I want to recognize the vowels – Eleno Jan 17 '23 at 12:44

1 Answer

generate human voice using fft

Can't do. The FFT is too primitive for anything but the simplest audio generation: different voices, instruments, etc. will all have similar or identical-looking FFTs. It's not much more useful than manually drawing a raw signal.

FFT could synthesize an "e". It could synthesize a piano keystroke. Where it stops is sentences and music. Or even a single word spoken in your own voice.

Speech pieced together one independent letter at a time sounds like iron screeching. The principal challenge is "temporal coherence" - making realistic-sounding speech, accounting for tone and speaker - which introduces nonlinear interdependencies between the individual blocks that the FFT could generate.

The FFT is brittle to noise, local time shifts, and time warps. It's a necessary, but not sufficient, stepping stone - the next step is time-frequency analysis, see "Modulation Model vs Fourier Transform" here - and the step after that is feature engineering, see this post.
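
To make the time-frequency point concrete, here is a minimal NumPy sketch (my own illustration, not from the linked posts). It assumes an 8 kHz sample rate and two arbitrary test tones; the helper `frame_spectra` and the 256-sample frame length are hypothetical choices. The same two tones played in opposite order have the same whole-signal FFT magnitude, while short-time frames tell them apart.

```python
import numpy as np

fs = 8000                      # sample rate in Hz (assumed for illustration)
t = np.arange(fs // 2) / fs    # 0.5 s per tone
low = np.sin(2 * np.pi * 200 * t)
high = np.sin(2 * np.pi * 1000 * t)

a = np.concatenate([low, high])   # low tone first, then high tone
b = np.concatenate([high, low])   # same tones in the opposite order

# Whole-signal FFT magnitudes match: b is a circular shift of a,
# so the ordering lives entirely in the phase.
mag_a = np.abs(np.fft.rfft(a))
mag_b = np.abs(np.fft.rfft(b))
fft_diff = np.max(np.abs(mag_a - mag_b)) / np.max(mag_a)

def frame_spectra(x, frame_len=256):
    """Magnitude spectra of non-overlapping Hann-windowed frames (a crude STFT)."""
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    return np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))

# Strongest frequency in each frame: this is where the two signals differ.
freqs = np.fft.rfftfreq(256, d=1 / fs)
peak_a = freqs[np.argmax(frame_spectra(a), axis=1)]
peak_b = freqs[np.argmax(frame_spectra(b), axis=1)]

print("relative difference of FFT magnitudes:", fft_diff)
print("a: first/last frame peak (Hz):", peak_a[0], peak_a[-1])
print("b: first/last frame peak (Hz):", peak_b[0], peak_b[-1])
```

A plain magnitude plot of `a` and `b` would look the same even though they sound different, which is roughly the situation when comparing FFT plots of different sounds by eye.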

As for how to synthesize an "e"... dunno, I'll let others answer, but it's simple enough for FFT.
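
One possible starting point, as a rough sketch rather than a tested recipe: put energy only at multiples of a pitch frequency, shape those harmonics with a few broad peaks standing in for vocal-tract resonances, and take the inverse FFT. Everything numeric here is an assumption for illustration: the 16 kHz sample rate, the 125 Hz pitch, and especially the `formants` list, which is not measured from any recording. For comparison, it also builds a spectrum from independent normal values in every bin, which is presumably close to what the question did and gives white noise.

```python
import numpy as np

fs = 16000   # sample rate in Hz (assumed)
n = fs       # one second of audio, so rfft bin k sits at exactly k Hz
f0 = 125     # pitch in Hz (assumed); fs / f0 = 128 samples per period
# Rough illustrative (center, width) pairs in Hz; NOT measured formants
formants = [(400, 80), (2000, 120), (2600, 160)]

def envelope(f):
    """Small floor plus Gaussian bumps standing in for resonances."""
    return 0.02 + sum(np.exp(-0.5 * ((f - fc) / bw) ** 2) for fc, bw in formants)

# Harmonic spectrum: nonzero only at f0, 2*f0, ... below the Nyquist frequency
spectrum = np.zeros(n // 2 + 1, dtype=complex)
harmonics = np.arange(f0, fs // 2, f0)
spectrum[harmonics] = envelope(harmonics)   # zero phase for every harmonic
vowel = np.fft.irfft(spectrum, n=n)
vowel /= np.max(np.abs(vowel))              # normalize to [-1, 1]

# Comparison: independent normal values in every bin give white noise
rng = np.random.default_rng(0)
noise_spec = rng.normal(size=n // 2 + 1) + 1j * rng.normal(size=n // 2 + 1)
hiss = np.fft.irfft(noise_spec, n=n)
hiss /= np.max(np.abs(hiss))

period = fs // f0
print("vowel repeats each pitch period:",
      np.allclose(vowel[:period], vowel[period:2 * period]))
print("hiss repeats each pitch period:",
      np.allclose(hiss[:period], hiss[period:2 * period]))
```

The harmonic version is periodic and should sound like a buzzy, pitched tone, while the random one has no pitch at all. Whether it passes for an "e" depends on the resonance values, which would have to come from real data. To listen, `vowel` can be written to a WAV file, e.g. with `scipy.io.wavfile.write`.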

OverLordGoldDragon