On Tuesday of this week, the Internet erupted in a dispute over an audio recording of a word. Or, maybe two words. Nobody knows because nobody can agree. I know what you’re thinking, “We’re so glad Nick decided to write about a viral Internet sensation that we’re already tired of hearing about.” I feel your sarcasm and I reject it. The confusion over the word got me thinking we may be unable to trust our ears. If we cannot trust our hearing, what impact might that have on recorded evidence presented at trial? Or, in a less formal matter (disagreements with a loved one)?
The New York Times did a good job of addressing the Laurel v. Yanny dispute in this article. They created a tool that allowed readers to change the frequency of the audio recording. Move the arrow all the way to the left and you clearly hear the word “Laurel.” Move it all the way to the right and you hear “Yanny.” At the positions in between, disagreement persists.
Don’t worry, I did some research for this article. Using a non-scientific methodology, which is definitely NOT generally accepted in the audio forensics community, I came to some totally unreliable but interesting conclusions. I asked friends, family, and coworkers what they heard on the recording. More women heard Yanny and more men heard Laurel. I’m certain the margin of error is enormous and I can’t recall if I asked an even number of men and women. There also appear to be some different interpretations based on age. Then of course, some people heard different words, at different times, on different devices. This really caused problems for my non-scientific research. I heard Laurel one day on one device and Yanny another day on another device.
Naturally, all of this made me wonder how we can trust our hearing. How is evidence reliable? What about witness testimony about what was heard? Of course, I also wondered if science could settle the dispute.
Is testimony about what was heard as unreliable as eyewitness identification testimony?
Eyewitness identification used to be considered incredibly strong evidence. In fact, in some US jurisdictions, it is still compelling evidence. From my experience working with expert witnesses and following the science with some interest for the last eight years, I can tell you that eyewitness identification evidence is terribly unreliable. It is frightening how often it is wrong. There are so many variables which can impact the judgement, perceptions, and memories of an eyewitness, that I would not trust it (without some strong corroborating evidence).
So, I wonder if the hearing of an eyewitness is similarly compromised. How do I know if a witness heard three or five gun shots? How do we know whether the witness heard one collision or two? What about business negotiations? Are we certain we’re all hearing the same thing and agreeing to the same terms being memorialized in the contract?
Typically, I am more inclined to believe recorded evidence because I am biased against eyewitness testimony from the scientific studies I’ve read. Or, should I say, I was more inclined to believe recorded evidence.
After the Laurel/Yanny dispute, I wondered if recorded audio evidence is reliable. If I can hear one thing and others hear something totally different, how can we rely on a recording? For insights on this phenomenon, I’ve reached out to an audio forensics expert.
Herbert Joe – Forensic Audio Video Analysis Expert Witness
Herbert Joe is a highly qualified and board-certified forensic audio and video examiner. He has three science degrees and two law degrees. He and his partner have been retained in thousands of criminal, civil, and administrative cases throughout the US and internationally. Mr. Joe has worked on many high-profile matters including the Branch Davidian case; State of Florida vs. George Zimmerman; the Associated Press (Osama bin Laden); and consultations with Dr. Phil (Manti Te’o), CSI: Miami, TMZ (Michael Jackson), the Wall Street Journal, and People Magazine (Mel Gibson). You can learn more about Mr. Joe by visiting his website: forensicscenter.com.
As I normally do for blogs, I posed several questions to Mr. Joe. Please see my questions and his answers below:
Nick: Some listeners hear Laurel and others hear Yanny. Is this a result of the recording or the listener’s hearing?
Mr. Joe: What one hears has a large subjective component, and even then the same listener may hear it differently over time, depending on a host of dynamic factors. For example, what one perceives to hear may depend largely on the mood or emotive state of that person at that time; what one perceives to hear may depend largely on what s/he is expecting or anticipating to hear; what one perceives to hear may depend largely on one’s hearing ability. Clearly, there are many other factors that determine and affect what one hears, what one interprets and what one recalls, all of which may change over time for that person, and may be very different from what another person perceives to hear.
This only scratches the surface of the area of psychoacoustics, speech production and speech perception.
Nick: Is there a correct answer to Laurel or Yanny?
Mr. Joe: Hate to sound like an attorney – as I am one – but the answer to that question depends, depends on how you phrase that question. Is there a correct answer to what one hears? Yes, it’s what one perceives. But if one clearly enunciates either name/word, then there is an objectively correct answer, namely (sorry for the pun), the word that was spoken or played back – regardless of how it was heard, if at all, by the listener(s).
Consider this analogy with light. We know that our eyes are sensitive to light within the (narrow) visible light spectrum, a small part of the entire electromagnetic spectrum. So let’s take a red apple. Sunlight or white light is made up of all the different color lights that we know of, as we learned in school – ROY G BIV: red, orange, yellow, green, blue, indigo and violet. But that apple is red, whether we perceive it that way or not. It’s red because the skin of that apple absorbs all the colors of the incoming white light except red, which is reflected and that’s why we see red. (If we shine a pure blue light on that apple and no other light is present in a closed room, then that apple will not appear red: all the blue light is absorbed, and since there is no red light to reflect, there is no light to perceive, i.e., it appears black.)
Likewise, sound is merely vibrations that propagate from the source (through the air or another medium) and can be heard when they reach a person’s or animal’s ear. That’s the objective part: the frequencies at whatever intensities at any given moment. They are there whether we can appreciate them or not.
Nick: Is there a way to determine the correct answer?
Mr. Joe: There is a correct answer if the question is whether there are linguistic and acoustic differences between the spoken words “Laurel” and “Yanny.” See answer to question #5, below.
Nick: If listeners are hearing different words, how can recorded evidence be trusted?
Mr. Joe: For the past 31 years, my partner and I have been forensically analyzing audio, acoustic, voice and video evidence in state and Federal courts, in civil, criminal and administrative cases throughout the U.S., as well as many foreign countries. Recorded evidence must first meet admissibility standards to be admitted, and is then subject to analyses and opinions that go to the weight of the evidence. A recording can be admitted if the proponent of the audio (or acoustic, voice or video) evidence can provide facts sufficient to support a reasonable jury determination that the recording is an accurate reproduction of the event that it purports to record. Where we often get retained is to show and testify, objectively and with a reasonable degree of scientific certainty, that the recording has been falsified or tampered with in one way or another so as to render the recording untrustworthy as a whole. Now if the case comes down to an interpretation or dispute of what was said in some recording, we can enhance (digital signal processing) the passage(s) of interest, allow the jurors to hear the enhanced audio (with good quality headphones), provide a reasonably accurate transcript and provide expert testimony thereon. However, the other side can also have their own transcript version of the recording, and it is up to the jury to ultimately decide what the recorded evidence says or does not say.
Nick: If the Laurel/Yanny recording was presented as evidence at trial, what analysis would you use to prove one word or the other?
Mr. Joe: We had a case in which the entire felony indictment centered on a single, monosyllabic word. The Government contended that the Defendant said “Shoot the [expletive]!” The Defendant claimed that he said “Shoot me, [expletive]!” The Government contended the former exclamation underscored intent and contentment that an officer was killed. The defense contended that the latter showed his remorse. So, we had to objectively differentiate between the /th/ sound and the /m/ sound with a reasonable degree of scientific certainty – regardless of what one perceives to hear. The /th/ sound is known as a fricative because the tip of the tongue is placed just behind the two front (central) incisors to create friction in producing the /th/ sound. The /m/ sound is known as a nasal sound since air bypasses the oral cavity because the lips are closed (and the soft palate drops) and thus passes out through the nasal passages. After enhancing the audio evidence, spectral analyses revealed that the 2nd word had lower frequency energy, which indicates the nasal sound /m/ (as opposed to the higher frequency energy that would indicate the fricative /th/ of “the”); so, that 2nd word was “me” and not “the.” The case was dismissed upon our testimony.
Likewise, phonetically, Laurel begins with the letter “L,” whereas Yanny begins with the letter “Y.” Although the letter “Y” (a/k/a a semivowel) can represent a vowel or a consonant, it is used as a consonant in “Yanny.” Therefore, on the one hand, there are common phonetic features of the consonants “L” and “Y,” e.g., they are both voiced consonants produced by directing air solely with the lungs and diaphragm and actively narrowing the vocal tract upon articulation. In making either of these sounds, air only leaves through the mouth. On the other hand, there is a substantial phonetic difference between these 2 letters. The letter “L” is a “lateral” consonant, as it is made by directing the airstream around the sides of the tongue upon articulation; the letter “Y” is a “central” consonant, because it is made by directing the airstream along the center of the tongue upon articulation.
One can “see” this substantial difference in the raw waveform, as well as the same waveform viewed as a 3-dimensional spectrogram. Below is the waveform of my enunciating “Laurel,” and then “Yanny.” Below that is a spectrogram of the exact same recording. And one certainly should be able to hear and perceive the difference if the sound source is accurate in the enunciation of each.
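For readers curious about what "comparing low and high frequency energy" can look like in practice, here is a minimal sketch in Python. It is not Mr. Joe's method or any forensic tool. The two signals are synthetic stand-ins built from sine tones (not real speech), and the sample rate, the 2000 Hz cutoff, the 0.5 threshold, and the function names are all assumptions chosen for illustration. The idea is simply to measure what fraction of a sound's energy sits above a cutoff frequency.

```python
import cmath
import math

SAMPLE_RATE = 8000  # samples per second (assumed for this toy example)
N_SAMPLES = 400     # 50 ms of audio, giving 20 Hz wide frequency bins


def tone(freq_hz, amplitude=1.0):
    """Return a pure sine tone as a list of N_SAMPLES samples."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE)
            for n in range(N_SAMPLES)]


def mix(*signals):
    """Add several equal-length signals sample by sample."""
    return [sum(values) for values in zip(*signals)]


def power_spectrum(samples):
    """Naive discrete Fourier transform; returns power for bins 0..N/2."""
    n = len(samples)
    powers = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(samples))
        powers.append(abs(coeff) ** 2)
    return powers


def high_band_fraction(samples, cutoff_hz):
    """Fraction of (non-DC) spectral energy at or above cutoff_hz."""
    powers = power_spectrum(samples)
    bin_hz = SAMPLE_RATE / len(samples)
    total = sum(powers[1:])  # skip bin 0, the constant (DC) offset
    high = sum(p for k, p in enumerate(powers)
               if k >= 1 and k * bin_hz >= cutoff_hz)
    return high / total if total > 0 else 0.0


def label(samples, cutoff_hz=2000, threshold=0.5):
    """Hypothetical rule of thumb: mostly-low energy reads as nasal-like."""
    if high_band_fraction(samples, cutoff_hz) < threshold:
        return "nasal-like"
    return "fricative-like"


# Synthetic stand-ins: energy concentrated low versus energy concentrated high.
nasal_like = mix(tone(240, 1.0), tone(3000, 0.1))
fricative_like = mix(tone(240, 0.3), tone(3000, 1.0), tone(3400, 1.0))

for name, signal in [("nasal_like", nasal_like),
                     ("fricative_like", fricative_like)]:
    fraction = high_band_fraction(signal, 2000)
    print(f"{name}: {fraction:.3f} of energy above 2000 Hz -> {label(signal)}")
```

Real speech is far messier than two or three sine tones, so an actual examination would rely on enhanced recordings, spectrograms, and expert judgment rather than a single ratio. The sketch only shows that the frequency content of a sound is measurable regardless of what any one listener perceives.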
Nick: The Laurel/Yanny recording is of a robotic voice. Are human voices less susceptible to this type of misinterpretation?
Mr. Joe: First, I’m not sure if I agree with the premise. Human voices naturally have varying degrees of emotions manifested by simultaneous changes in pitch, resonance, fluency, intonation, prosody and duration of the words and speech segments. In contrast, computer-generated, synthetic or robotic speech utilizes an algorithm that translates orthographic strings of letters into the robotic voice; however, synthetic voice is audibly missing emotive components, like the natural variations in pitch, level, and intonation.
But it’s not so much misinterpretation, as it is how the brain perceives the difference: human speech requires little effort by our auditory cortex when perceived; however, synthetic or robotic speech requires more effort when listened to. Without the emotive components in human speech, robotic speech has fewer cues to help our brains with identifying phonemes.
Nick: Do different interpretations of the Laurel/Yanny recording cast doubt on what a witness claims to have heard (ex. witness to a crime, collision, conversation)?
Mr. Joe: This question opens up a whole different Pandora’s Box. Earwitness identification, recall and the like have little to do with synthesized voices (unless of course the subject matter has to do with a synthesized voice). What one hears and perceives at the time of some acoustic event and recalls at a later time is subject to so many factors, e.g., one’s mental state at the time, how traumatic that acoustic event is, etc.
We had a case in which the reliability or trustworthiness of a witness recalling an auditory event years later was at issue. There are generally accepted academic, clinical and forensic studies in the area of the reliability of earwitness identification. For example, it is well-established that there is a temporal decay of memory for voices. In one study, 2 weeks after hearing a person’s voice without ever seeing that person, identification was only 68% correct; it was 35% correct after 3 months and only 13% correct after 5 months (less than a chance guess). The majority of forensically relevant encounters with unknown voices may well occur before the listener forms an intent to memorize.
Nick: We have no context for the Laurel/Yanny recording. Simply two words. Does context play a role in the analysis of a disputed recording? For example, a recording of a business agreement or a family law dispute.
Mr. Joe: Absolutely! Let’s take an example of the phrase “I’m going to kill you.” If that phrase appeared in a transcript with no other context, then 10 different readers may have 10 different interpretations (20 if the readers are attorneys, but I digress). If that phrase was spoken in no other context and heard by someone, the emotionality and therefore the intent of that phrase alone may be revealed. If said sarcastically and sassily, then one would likely interpret that phrase without any real concern. On the other hand, if that phrase was spoken and heard with sheer anger, then one would likely interpret that phrase with much concern. If that phrase was in the broader context of 2 boxers, for example, being interviewed the night before their championship fight, then the meaning of that phrase is materially different than the same phrase spoken in the context of 2 people viciously fighting. Clearly, one can see the context of a word or phrase can make all the difference in what was objectively meant, especially in contrast to a naked phrase with no context and completely subject to interpretation.
And another relevant issue here is the concept of top-down thinking in the context of speech perception. One can unintentionally or purposefully make someone subconsciously biased as to what s/he “should” hear in an anticipated audio recording; likewise, one’s own life experiences color what one thinks one hears or should hear. Stated another way, the results may be equally remarkable if, in a study using the same “Laurel/Yanny” audio clip, listeners were asked what they hear without either name, or any word at all, being mentioned.
By the way, based on the applicable analyses as described above, and given my 31 years of experience in critical listening of audio and acoustic evidence, and without any bias or top-down thinking, it is clear to me that the word generated in the clip from the May 16, 2018 NYTimes article is “Laurel.”
There you have it, folks! Laurel is the word that has baffled the Internet for the last three days. I want to extend a huge thank you to Herbert Joe of Yonovitz & Joe, LLP, for his exquisite scientific analysis of the Laurel/Yanny audio clip. What mystery will the Internet provide next? Only time will tell. When there is a mystery to solve, you can get your forensic scientific answers on this blog! Naturally, you’ll get some non-scientific analysis from yours truly.