Voiceless vocals: An auditory illusion

So, I was browsing the interwebs a few nights back and found a rather interesting video on YouTube, and I thought I should write a few words about my recent find. Well, honestly, it’s not the first time I’ve come across something like this, but it was the first time it was anime-related. So I grasped the opportunity.

First things first. Let me tell you what the video is about. It contains really popular anime openings played on a piano.

*Don’t start going like “Oh. So, what’s exciting about it?”. Have some patience and wait for it…*

But it wasn’t just any piano performance. It was a digital piano, with no performer. All the songs were just MIDI sequences.

*Now, I get it if you think “Well, that’s a whole lot more underwhelming than I was expecting it to be”. But again: Wait for it…*

You could hear the vocals. You could hear the lyrics. But the only thing you were actually listening to was the digital sound of an electric piano.

*Wait, what?*

There’s your interesting part. How can you possibly hear distinct vowels and consonants when the sound is produced solely by something incapable of producing speech?

You can have a look at the video I’m talking about here.


Spoiler: from now on, everything will be science related…

How do we speak? Well, it’s an elaborate yet simple mechanism that consists of a few parts. It’s something similar to a pump (think of a syringe) with some extra bits at its end. But that’s a bit oversimplified. Our vocal tract has three distinct parts:
i. the part responsible for inhaling and exhaling air, aka breathing,
ii. the part that vibrates and produces the tones, aka the source of the sound, and
iii. the part that shapes the sound we produce and helps us articulate indistinct sounds into speech, aka the resonator.

So, the organs responsible for the first part of the vocal mechanism are our lungs. Their job is to gather the inhaled air, take the oxygen, distribute it into our system and release carbon dioxide when we exhale. The muscle responsible for this motion is the diaphragm, which lowers to create more space in our chest cavity, increasing the capacity of our lungs and allowing them to gather air. Breathing by contracting the diaphragm while not actively raising the chest (ie. contracting the intercostal muscles) is the natural way of breathing and allows for better, deeper breaths while at rest.

[Figure: Illustration from Encyclopaedia Britannica showing the mechanism of breathing; two profile views of the upper body with lungs and diaphragm during inhalation and exhalation. (source)]

If you want to see this 100% natural, 100% free way of breathing in action and happen to live with a dog or a cat, you can turn your own pet into a means of observation… Watch them while they’re asleep and notice which parts expand and contract as they breathe in and out.

So if you are a teacher or lecturer, a call center operator, a TV/radio presenter or host, a podcast host, or anything else that requires hours upon hours of talking, you are (or should probably be) breathing with your diaphragm, as it’s relaxed and doesn’t stress the vocal mechanism. More or less, what I described above is a natural process that has been turned into a technique, to remind us how to properly use something we already possess. The aim is a consistent, uninterrupted airflow, which keeps the loudness of our voice stable. It’s a hot topic, especially in singing and speech-related fields, and one that gets misunderstood quite regularly, but I don’t want to go too deep into that just yet. Perhaps in a different article some time in the future. The bottom line is to know our anatomy and how each part works.

Part two of the human voice mechanism. It’s what I referred to earlier as “the part that vibrates and produces the tones, aka the source of the sound”. This is probably the most misunderstood part of our vocal tract, and the trickiest to explain and understand, as it has more than one use and direct feedback can be hard to interpret at times. So, we have to rely on other forms of feedback. The easiest, most common and least invasive one is auditory feedback: the quality of the sound we produce ‘tells’ us a lot about our vocal instrument. Then there’s the slightly more invasive diagnostic tool, endoscopy, where a small tube with a tiny camera at its end goes through the nose and lets us see the inside of the larynx, including the vocal folds. But we’ll get to our vocal folds soon enough.

Allow me to start from the top for a second. The second part of our voice is the part that contains the larynx. The larynx is the part of our neck that protrudes a bit; it’s where many men have their Adam’s apple. So, if you gently touch your throat and pretend to swallow (no liquids, we don’t want any choking hazards; just swallow your spit), the part that moves up and down is the larynx. The larynx houses our vocal folds (‘vocal folds’ being the proper term for vocal cords). They are membranes that open and close for a number of reasons.

[Figure: Illustration from Mayo Clinic showing the location of the larynx and the vocal cords within it, in an open and a closed state. (source)]

Long story short:
They are open while at rest, so there is a space between them, called the glottis.
When we swallow they close so as to prevent liquids and food from going to our lungs, to prevent choking.
They are open while we yawn.
They are closed and suddenly open when we cough, to increase pressure and release air more rapidly.
They start off closed and vibrate (aka open and close rapidly) to produce sound.

So, when we phonate, that is, when we produce sound with our vocal folds, we release small puffs of air, and depending on how many times per second our vocal folds touch (think of it like clapping), we produce different tones. The sounds we produce travel through the air. To rephrase: the source of the sound is our vocal folds and the medium is the air. So, if you gently touch your throat again and speak or hum, you’ll feel your larynx vibrate a bit. Now, this part is directly related to the first part, ie. our lungs and our diaphragm.
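To put a rough number on that clapping: the rate at which the folds open and close is the pitch we hear. Here’s a tiny Python sketch of the standard equal-temperament relation (the function name is mine; A4 = 440 Hz is just the usual tuning reference, nothing specific to the video):

```python
# Pitch is the rate of vocal-fold vibration: folds that "clap"
# 440 times per second produce the note A4. Equal temperament
# relates any other note to A4 by a factor of 2**(semitones/12).

def note_frequency(semitones_from_a4: int) -> float:
    """Frequency in Hz of the note a given number of semitones away from A4 (440 Hz)."""
    return 440.0 * 2 ** (semitones_from_a4 / 12)

print(round(note_frequency(0)))   # A4 -> 440
print(round(note_frequency(12)))  # A5, one octave up -> 880 (folds vibrate twice as fast)
print(round(note_frequency(3)))   # C5 -> 523
```

An octave up simply means the folds clap twice as often per second; every other interval falls somewhere on that exponential curve.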

Personal note:
What I am about to say is something that I find hilarious and will probably never, ever, ever be found in an academic paper or journal. Hajimemashou ka [Shall we start?] Brace yourselves. I talked about clapping-like motions and small puffs of air, right? Technically speaking… we are farting from our necks. I just threw it out there. *laughs hard*. All this science talk turned into a fart joke by someone who -according to PaperSailor- has the sense of humor of a three-year-old when it comes to poop jokes… *continues laughing*

Part three. *deep breath*. This one is a lot more straightforward, because you can stand in front of a mirror, open your mouth and observe. Our mouth is what shapes the puffs of sound produced by our vocal folds. To be more exact, it’s our articulators. What are articulators? Anything and everything from the larynx upwards: the velum (aka the soft palate), the tongue, the teeth, the lips, the nasal cavity… every single thing.

[Figure: Illustration from Encyclopaedia Britannica showing the human voice articulators. (source)]

So, depending on our individual features and on where we create blockages, the timbre of our voice changes, which is why each one of us has a different voice and why we are able to produce different sounds. Depending on what language(s) you speak, you produce different sounds, different vowels and different consonants. For instance, in English, the ‘a’ in banana and the ‘a’ in chocolate are pronounced differently. To pronounce these two sounds, you change the position of your articulators accordingly. Obviously, you cannot change the position of all your articulators, as some of them are fixed and immovable (eg. your teeth).

Vowels aside, we are left with consonants, which are grouped into two types: voiced and voiceless. Some of the voiceless ones include ‘p’, ‘k’ and ‘t’, while their voiced counterparts are ‘b’, ‘g’ and ‘d’ respectively. A tricky pair is the ‘th’ in theater, which is voiceless, and the ‘th’ in there, which is voiced: in English the two sounds are written identically, while their pronunciation differs. So, why are they called voiceless and voiced? Well, as far as our articulators are concerned, their pronunciation is exactly the same, meaning that we position our articulators in the exact same spots. The only difference between the two types is that we phonate the voiced consonants. So, when we want to say the word ‘voice’, we vibrate our vocal folds while pronouncing ‘v’, ‘o’, ‘i’ and stop vibrating them while we pronounce ‘ce’ [Note: I grouped ‘ce’ together, as the ‘e’ is silent and doesn’t correspond to a distinct sound].

[Figure: Illustration from Encyclopaedia Britannica showing the position and shape of the tongue while pronouncing different vowels. (source)]

But that’s not all. Let’s go back to the vowels. How are we able to differentiate between them? What makes them distinguishable from one another? The answer is formants. What are formants? They are certain frequencies that resonate due to the shape of the cavities of our upper vocal tract. Depending on the distances between these frequencies, we are able to perceive the difference between the vowels. It’s all about acoustics. So, we have the fundamental frequency, which is the ‘actual’ note we produce, and on top of it we have the formants. There are quite a few distinct formants, and they are visible on a spectrogram, but we tend to look at the first five. Generally speaking, the fundamental frequency plus a couple of formants are usually enough for us to distinguish a vowel.
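Here’s a minimal Python sketch of that last claim, that the first couple of formants are usually enough to tell vowels apart. The F1/F2 pairs below are rough textbook averages for an adult male voice (not measurements of mine), and the function just picks whichever reference pair is nearest:

```python
# Rough average first and second formants (F1, F2) in Hz for three
# vowels; real voices vary a lot, these are only ballpark values.
VOWEL_FORMANTS = {
    "i (as in 'see')": (270, 2290),
    "a (as in 'father')": (730, 1090),
    "u (as in 'zoo')": (300, 870),
}

def closest_vowel(f1: float, f2: float) -> str:
    """Return the reference vowel whose (F1, F2) pair is nearest to the input."""
    return min(VOWEL_FORMANTS,
               key=lambda v: (VOWEL_FORMANTS[v][0] - f1) ** 2
                           + (VOWEL_FORMANTS[v][1] - f2) ** 2)

print(closest_vowel(300, 2200))  # lands on the 'i' reference
```

Two numbers per vowel is a caricature, of course, but it’s the same idea a spectrogram reader (or your ear) uses: the relative positions of the resonances, not the absolute pitch, carry the vowel.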

[Figure: Illustration from Encyclopaedia Britannica showing the resonating frequencies (formants) produced during the pronunciation of different sounds. (source)]

There’s a really cool online resource:
the International Phonetic Alphabet.
Their website has a chart with all the sounds. It’s great as a reference, since it lets you consciously understand what you are phonating and how. Not to mention that you’ll be introduced to all sorts of new sounds. I was impressed by the clicks. I knew they existed and can tell some of them apart, but I had no idea that the alveolar lateral click is the sound equestrians use to communicate with their horses.

Tiny experiment:
Grab your phone, download a spectrogram app and open it. You should be able to see frequencies from 0 to approximately 20,000 Hz on a logarithmic scale. If the scale isn’t logarithmic, it’s better to change it in the settings, as the frequencies you’ll be producing are at the lower end of the spectrum and a linear axis would show your data far too cramped. Now, you will be producing some vowels (not entire words). Let’s go with ‘a’ from apple, ‘e’ from electron, ‘i’ from initial, ‘oo’ from zoo and ‘o’ from original. Try to produce all the sounds on the same tone; try not to change the pitch. Also, try to pronounce them as clearly as possible and sustain each sound for a bit. The result will be visible in your spectrogram app. You are looking for lines, not dots. So, you should see different lines that shift from vowel to vowel. Excluding the bottom line, which should remain constant, all the others should shift higher and lower; these are the formants. Their relative positions are what make us perceive different vowels.
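If you’d rather not sing at your phone, the same experiment can be faked in a few lines of Python with NumPy: synthesize a ‘vowel’ as a soft fundamental plus two louder partials standing in for formants (220, 700 and 1200 Hz are made-up illustrative values), then ask the FFT where the energy sits:

```python
import numpy as np

sr = 8000                      # sample rate in Hz
t = np.arange(sr) / sr         # one second of time stamps
signal = (0.3 * np.sin(2 * np.pi * 220 * t)      # "fundamental"
          + 1.0 * np.sin(2 * np.pi * 700 * t)    # stand-in for a first formant
          + 0.8 * np.sin(2 * np.pi * 1200 * t))  # stand-in for a second formant

spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / sr)

# The three strongest bins sit at our three ingredient frequencies --
# these are the "lines" you would see on the spectrogram app.
peaks = sorted(round(float(freqs[i])) for i in np.argsort(spectrum)[-3:])
print(peaks)  # [220, 700, 1200]
```

A spectrogram app does essentially this, repeatedly, on short overlapping slices of the microphone signal, and paints each slice as one column of the image.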


That was a long one, I know. The bright side is that *fingers crossed* we hopefully managed to cram what might as well be a few weeks’ curriculum in phonetics or a music-related field into a few hundred words… But now we’re getting into the exciting stuff. We can simulate the sound of a vowel by using VSTs to create frequencies in the appropriate relative positions. In other words, we can produce vocaloid-type sounds, aka sounds produced solely by tones that have never been phonated. And in comes the video I was talking about earlier.

*Ding ding ding*

Extra bit of information:
The frequencies above the fundamental are what make the sound of an instrument or a voice unique. They are frequencies that resonate and get amplified by the shape of the instrument’s body, producing the instrument’s ‘fingerprint’ sound. Basically, that’s how we can tell the sound of a cello from the sound of a piccolo. This is what we call timbre.
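In Python terms, the raw ingredients are nothing more than integer multiples of the fundamental (110 Hz, an A2, is just an example pitch I picked); how strongly each multiple sounds is what differs between instruments:

```python
# One played note is really a stack of harmonics at integer multiples
# of the fundamental; the instrument's body decides how loud each
# multiple comes out, and that weighting is the timbre.
fundamental = 110.0  # Hz, the note A2
harmonics = [fundamental * n for n in range(1, 6)]
print(harmonics)  # [110.0, 220.0, 330.0, 440.0, 550.0]
```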

But now, let’s think of something parallel to our current discussion… I know, a lot of side tracks on this one. Think of an image captured at ultra high resolution. Then imagine recreating it at a lower resolution by manually coloring the pixels one by one. Resolution aside, the end result would be extremely close to the original, meaning you’d know it’s the same picture, right? Its quality wouldn’t be the same, but still, it’d do the trick. The same ‘recreation’ principle applies to approximating the sound of the human voice when no human voice is involved in producing it.

Which means that by creating the individual tones our voice is made of, we can approximate its sound. The resolution may not be as good, as the voiceless consonants are a lot trickier to approximate. Why? Mainly because for those the vocal folds do not vibrate; rather, the air is shaped directly in our mouth.

To state the obvious:
It helps a lot to know the song beforehand. However, even if you’re not familiar with the tune, I strongly believe you’d still be able to tell some vowels apart, or at least recognize that there are vocals and sporadically identify an ‘a’ or an ‘e’.
The formants we produce aren’t restricted to the frequencies we have chosen to call ‘notes’ in Western music. Which means that the end result of a ‘talking piano’ has even lower resolution, as quite a few bits and pieces are left out. And the higher you go up the register, the greater the distance in Hz between two adjacent tones, even a mere semitone apart.
Hitting the bullseye on the tones is only half the job done. An equally important role is played by dynamics, ie. how softly or loudly each sound should be heard.
Also, there is so much clutter throughout the register that it could probably give you a migraine. But it’s totally worth it. I guess… If you find this kind of stuff exciting… (?)
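That resolution loss from snapping formants to notes is easy to quantify. A small Python sketch (the helper name is mine) rounds a formant frequency to the nearest equal-tempered piano pitch and reports how far off it lands; notice how the error grows as the frequency climbs:

```python
import math

def nearest_note_hz(freq: float) -> float:
    """Snap a frequency to the nearest equal-tempered pitch (A4 = 440 Hz)."""
    semitones = round(12 * math.log2(freq / 440.0))
    return 440.0 * 2 ** (semitones / 12)

# Rough ballpark F1 of an 'oo' vowel and F2 of an 'ee' vowel.
for formant in (300.0, 2290.0):
    snapped = nearest_note_hz(formant)
    print(f"{formant:6.0f} Hz -> nearest note {snapped:7.2f} Hz "
          f"(off by {abs(snapped - formant):.1f} Hz)")
```

Because semitones are spaced multiplicatively (each one is about 6% wider than the last in Hz), a high formant can land tens of Hz away from the nearest key, while a low one lands within a few Hz.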

See? There you have it. That was the theory behind an ‘auditory illusion’ that roams free online and is recycled every now and then on YouTube’s trending videos…

There’s no trick to it, just plain-old physics.
We live in a world of technology and science [Dr. Stone reference. Hell yeah, that guy rocks], so we might as well try to understand it.

So keep calm, stay home and watch silly (or educational) videos online.

LibrettistaC out. hehe.

