Chadwick High School/Peninsula High School
Peer Reviewer: Annasimone Andrawis and Lily Ge
Professional Reviewer: Mark Nimmer
Sound Localization and Speech Detection to Assist the Hearing Impaired
Auditory assistive devices are becoming increasingly ubiquitous, and audio visualization devices represent an innovative method of helping the hearing impaired. The purpose of this project is to evaluate a method of creating an auditory visualization device that displays the locations of environmental sound sources, including sounds of different frequencies occurring simultaneously.
Development and improvement of auditory assistive devices is critical, as 11% of Americans are categorically hearing impaired (Hear-It Organization, 2009). 28.8 million American adults stand to benefit from the use of hearing aids (U.S. Department of Health and Human Services, 2018), yet only 28.5% of hearing impaired Americans use them (Hear-It Organization, 2009). From 2005 to 2008, the number of hearing impaired Americans increased by 9%, double the overall population growth of 4.5% over the same period (Hear-It Organization, 2009). Sales of hearing aids have risen accordingly, increasing by 5.3% in 2018 and exceeding what market experts deem the “normal” range of 2-4% annual growth (Hearing Review, 2019). Globally, the hearing aid and auditory assistive device retail market is expected to expand at a compound annual growth rate of 7.2% during 2019-2028 (Market Watch, 2019).
Audio visualization devices are a key innovation in how auditory assistive devices improve the lives of the hearing impaired. Visualization devices can indicate on a display where in the user's field of vision a sound originates, show icons that represent familiar sounds (e.g., ambulance sirens, car horns), or convert speech to text. Users of audio visualization devices could rely heavily upon these cues to navigate their daily environments.
Sound localization refers to the ability to detect the origin of a sound source. Human sound localization is primarily based on two factors: interaural time and intensity differences (Wightman & Kistler, 1992). Sound intensity, the more familiar of the two, refers to the “loudness” or power carried by sound waves; it is used to estimate the distance from which a noise is being produced. Interaural time difference refers to the lag between a sound's arrival at each ear. This subtle difference is used to subconsciously calculate the angle from which sound arrives.
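The geometry behind the interaural time difference can be sketched as follows. Assuming the sound arrives as a planar wavefront, the arrival angle relates to the time lag by angle = arcsin(c · Δt / d), where d is the spacing between the two ears (or microphones). All numeric values below are illustrative assumptions, not measurements from this study:

```python
import math

c = 343.0          # speed of sound in air (m/s, room-temperature assumption)
spacing = 0.2      # assumed distance between the two receivers (m)
delta_t = 0.0003   # hypothetical interaural time difference (s)

# Planar-wavefront approximation: the extra path length to the far
# receiver is c * delta_t, and sin(angle) = extra path / spacing.
angle = math.degrees(math.asin(c * delta_t / spacing))
print(round(angle, 1))  # roughly 31 degrees off-center
```

A lag of only 0.3 ms thus corresponds to a source roughly 31 degrees off the midline, which illustrates how small the time differences the auditory system (or a device) must resolve actually are.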
Both factors, sound intensity and interaural time difference, are utilized in the development of audio visualization devices. However, other factors, such as the varying frequencies of environmental sounds, also play a role in creating a usable device, making precise implementations more difficult. For example, users need to distinguish sounds of different frequencies, whether to identify a specific voice or to recognize the screech of car tires. Multiple sounds are often produced simultaneously, and ambient noise is present in nearly every real-life setting. A device that cannot distinguish between multiple sound sources would be unusable, producing an inaccurate location for what it assumes is a single sound.
Despite the clear importance of audio visualization devices, development continues to be very limited. Current prototypes cannot provide a comprehensive analysis of noise and offer little aid in real-world applications. For example, a now-abandoned industry prototype visualized sound using LED lights placed around a pair of glasses to indicate the direction of sound. The simplistic nature of this device prevents the user from understanding other characteristics of sound, including the intensity or source type. The lack of a screen or display interface prevents the device from communicating more specific spatial dimensions of environmental sounds to its user. Other prototypes included spectrograph and “positional ripples” visualizations of sound source locations (Ho-Ching, Mankoff, & Landay, 2003). The “positional ripples” display was among the most user-friendly, in that it portrayed sound location, source type, and other such necessary factors. However, each of these prototypes proved ineffective in spaces with ambient noise due to difficulty distinguishing the target sound.
The purpose of this project is to find ways to distinguish between multiple sounds occurring at the same time. The methods evaluated here are cross-correlation and frequency-domain filtering (FFT-IFFT), which are described in the following section.
PRINCIPLES & THEORY
Cross-correlation is a method of measuring the similarity of two series by evaluating the displacement of one series relative to the other. If two series x(i) and y(i), i = 0, 1, 2, …, N-1, are evaluated, the cross-correlation r at delay d can be defined as

r(d) = Σ_i (x(i) - m_x)(y(i - d) - m_y) / √[ Σ_i (x(i) - m_x)² · Σ_i (y(i - d) - m_y)² ]

where m_x and m_y are the means of the corresponding series (see Equation 1).

Equation 1. Cross-correlation r at delay d.
The output indicates the delay at which the two series best align; converted using the speed of sound, this delay gives the difference in the distances between two microphones and the sound source.
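A minimal sketch of delay estimation via cross-correlation, in pure Python. The sum below is simplified to an unnormalized form; the normalization in Equation 1 does not change which delay maximizes r. The pulse shapes and the 5-sample delay are illustrative assumptions:

```python
def best_delay(x, y, max_lag):
    """Return the lag d (positive: y lags x) maximizing sum of x[i]*y[i+d]."""
    n = len(x)
    def r(d):
        return sum(x[i] * y[i + d] for i in range(n) if 0 <= i + d < n)
    return max(range(-max_lag, max_lag + 1), key=r)

# x holds a short pulse; y is the same pulse arriving 5 samples later,
# as it would at a microphone farther from the source.
x = [0.0] * 32
y = [0.0] * 32
for offset, amp in ((0, 1.0), (1, 2.0), (2, 1.0)):
    x[10 + offset] = amp
    y[15 + offset] = amp

delay = best_delay(x, y, 8)  # recovers the 5-sample lag
# delay / sample_rate * 343 would then give the path-length difference
```

The argmax over lags is what matters for localization; the normalization in Equation 1 only rescales r so that values are comparable across recordings.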
The second technique used was FFT-IFFT, which converts time-domain audio recordings into frequency-domain data, and then converts the data back into the time domain.
FFT is an algorithm that computes the discrete Fourier transform (DFT) of a sequence, or its inverse (IDFT) (Heideman, Johnson, & Burrus, 1984). The DFT is obtained by decomposing a sequence of values into components of different frequencies (see Equation 2). An FFT rapidly computes such transformations by factorizing the DFT matrix into a product of sparse (mostly zero) factors (Heideman, Johnson, & Burrus, 1984).
X(k) = Σ_{n=0}^{N-1} x(n) e^(-i2πkn/N),  k = 0, 1, …, N-1

Equation 2. Formula for the discrete Fourier transform (DFT).
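The FFT-IFFT filtering step can be sketched with a direct DFT (O(N²), standing in for the FFT, which computes the same transform faster). The tone frequencies and band edges below are illustrative assumptions:

```python
import cmath
import math

def dft(x):
    # Direct implementation of Equation 2 (an FFT computes the same result).
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    # Inverse transform: same sum with the opposite sign, scaled by 1/N.
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

N = 64
low = [math.sin(2 * math.pi * 3 * n / N) for n in range(N)]    # bin 3
high = [math.sin(2 * math.pi * 12 * n / N) for n in range(N)]  # bin 12
mixed = [a + b for a, b in zip(low, high)]

X = dft(mixed)
# Zero every bin outside |k| <= 5, keeping the mirrored bins k >= N-5
# that a real-valued signal needs to stay real after the inverse transform.
X_low = [X[k] if (k <= 5 or k >= N - 5) else 0 for k in range(N)]
recovered = [v.real for v in idft(X_low)]
# "recovered" now closely matches the low tone alone
```

Zeroing bins in the frequency domain and transforming back is the basic mechanism behind the frequency isolation used later in this study.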
MATERIALS & METHODS
Method 1: Single Source Localization
The first set of data was collected through four microphones placed at fixed positions around a closed recording room, with a single sound source played from varied locations (see Figs. 1A and 1B).
Fig. 1A. Recording equipment: experiment setup with microphone locations
Fig. 1B. Picture of recording equipment (note: recording environment not pictured).
From recordings captured by each microphone, the difference in arrival time (Δtime) between microphones was calculated using the cross-correlation function in MATLAB, similar to the use of interaural time difference in mammals. Each Δtime was converted into Δdistance by multiplying it by the velocity of sound. Δdistance was calculated between each set of 2 microphones (mics 1 & 2, mics 1 & 3, mics 1 & 4, mics 2 & 3, mics 2 & 4, and mics 3 & 4), giving a total of six combinations of microphones.
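The Δtime-to-Δdistance conversion over the six microphone pairs can be sketched as follows; the Δtime values below are hypothetical placeholders, not measurements from this study:

```python
from itertools import combinations

SPEED_OF_SOUND = 343.0  # m/s, room-temperature assumption

# Hypothetical delta-time values (s) from cross-correlation, one per mic pair
delta_time = {
    (1, 2): 0.0012, (1, 3): -0.0004, (1, 4): 0.0007,
    (2, 3): -0.0016, (2, 4): -0.0005, (3, 4): 0.0011,
}

# One delta-distance per pair of microphones: 4 choose 2 = 6 combinations
delta_distance = {pair: SPEED_OF_SOUND * delta_time[pair]
                  for pair in combinations((1, 2, 3, 4), 2)}
```

The sign of each value matters: a negative Δdistance simply means the second microphone of the pair is closer to the source.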
The recording environment was scanned in a grid pattern: at each candidate position, the expected Δdistance for every microphone pair was computed and compared against the measured values, and the position with the smallest total mismatch was taken as the estimated source location (see Fig. 1C).
Fig. 1C. Position search algorithm: visualization of recording environment scanning.
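A minimal sketch of such a position search, assuming a 2 m × 2 m room with a microphone in each corner and simulated, noise-free Δdistance measurements; the coordinates and the 5 cm grid step are illustrative assumptions, not the actual experimental geometry:

```python
import math
from itertools import combinations

# Hypothetical 2 m x 2 m room with a microphone in each corner.
mics = {1: (0.0, 0.0), 2: (2.0, 0.0), 3: (0.0, 2.0), 4: (2.0, 2.0)}
true_source = (1.2, 0.7)  # used only to simulate the measured values

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

pairs = list(combinations(mics, 2))
# Simulated path-length differences, i.e. what c * delta-time would give.
measured = {(a, b): dist(mics[a], true_source) - dist(mics[b], true_source)
            for a, b in pairs}

# Scan the room on a 5 cm grid, keeping the candidate position whose
# predicted differences best match the measured ones (least squares).
best, best_err = None, float("inf")
for i in range(41):
    for j in range(41):
        p = (i * 0.05, j * 0.05)
        err = sum((dist(mics[a], p) - dist(mics[b], p) - measured[(a, b)]) ** 2
                  for a, b in pairs)
        if err < best_err:
            best, best_err = p, err
# "best" lands on the grid point nearest the true source
```

With noisy real-world measurements the minimum is shallower and broader, which is one way localization error of the kind reported below can arise.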
Fig. 2A. Single sound source localization. Y-axis represents
Fig. 2B. Visualization of actual sound sources. Each number shows the location of the source, and the value of each number shows the calculated angle of error.
Method 2: Frequency Selection Localization
The second set of data involved distinguishing between two sounds occurring simultaneously. A low frequency sound (male voice) and a high frequency sound (female voice) were recorded from different locations at the same time, following the same microphone setup as the Single Source Localization method. Position search was done with the same method as the Single Source Localization case, but with mixed voice sound files.
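The two-step procedure (isolate a frequency band with FFT-IFFT, then cross-correlate) can be sketched end-to-end with pure tones standing in for the two voices. The tone bins, band edges, and the 3- and 6-sample delays are illustrative assumptions; circular delays are used so the toy example stays exact:

```python
import cmath
import math

def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

def band(X, klo, khi):
    """Keep bins klo..khi (and their mirrors) of a real signal's DFT."""
    N = len(X)
    return [X[k] if (klo <= k <= khi or N - khi <= k <= N - klo) else 0
            for k in range(N)]

def circular_delay(a, b, max_d):
    """Lag d in 0..max_d-1 maximizing the circular cross-correlation."""
    N = len(a)
    return max(range(max_d),
               key=lambda d: sum(a[n] * b[(n + d) % N] for n in range(N)))

N = 128
def tone(k, shift):
    return [math.sin(2 * math.pi * k * (n - shift) / N) for n in range(N)]

# Each microphone hears both "voices"; the second microphone receives
# the low voice 3 samples late and the high voice 6 samples late.
mic_a = [l + h for l, h in zip(tone(4, 0), tone(20, 0))]
mic_b = [l + h for l, h in zip(tone(4, 3), tone(20, 6))]

Xa, Xb = dft(mic_a), dft(mic_b)
low_a = [v.real for v in idft(band(Xa, 1, 10))]
low_b = [v.real for v in idft(band(Xb, 1, 10))]
high_a = [v.real for v in idft(band(Xa, 11, 30))]
high_b = [v.real for v in idft(band(Xb, 11, 30))]

d_low = circular_delay(low_a, low_b, 32)    # recovers 3
d_high = circular_delay(high_a, high_b, 32)  # recovers 6
```

Cross-correlating the unfiltered mixtures conflates the two lags; filtering first lets each source's delay, and hence its position, be estimated independently.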
Fig. 3A. Low frequency sound source localization.
Fig. 3B. High frequency sound source localization.
The results of the Single Source Localization trials are summarized in Table 1.
Fig. 4A. Depiction of the range given by any error value θ.
Table 1. Single sound source localization. Angle represents the angle of difference between the actual sound source and the calculated sound source.
The low frequency and high frequency sound source localization results are presented in Tables 2 and 3, respectively.
Table 2. Low frequency sound source localization.
Table 3. High frequency sound source localization.
The major finding of the present study is that neither the single source localization method nor the frequency selection localization method produced highly accurate results on its own, but accuracy increased greatly when one frequency range was isolated with the FFT-IFFT technique prior to cross-correlation.
In the set of trials titled “Single Source Localization,” the majority of the sound source positions were found with less than 40 degrees of error (see Figure 4A). Given that the recording was done in a closed room, where sound reflecting off the walls can create a high level of background noise in the recordings, this error value indicates that while accuracy must still be improved, localization is possible. Though future research should take further steps to reduce the error (e.g., using a recording booth with near-zero audio rebound), the current results show that even rudimentary audio cleaning is possible using the FFT-IFFT technique.
As expected for the Frequency Selection Localization trials, calculating sound source position with mixed voice sound files did not yield accurate results. The angles of error were generally larger than the single sound source case. This indicates that the sounds occurring simultaneously caused errors when interpreted as one source.
However, upon removing one of the voices (isolating the low frequency or high frequency sound), the angles of error were greatly reduced. These results suggest that the positions of sound sources can be accurately determined using the cross-correlation technique. The frequency-isolation method, implemented with the FFT-IFFT technique, yielded much more accurate sound source position calculations and higher resolution of sound. This increase in accuracy is likely a result of isolating one voice from the original recordings: removing one of the frequencies may have reduced noise in the data (background static, sound reflecting off walls, etc.) that interfered with the position calculations in the Single Source Localization method.
The results support the hypothesis and suggest that frequency selection was successfully achieved in the cross-correlation methodology tested. Evaluating the frequency content of audio recordings allowed sound source localization and differentiation between simultaneous sounds of varied frequencies.
A novel method of sound source localization was demonstrated through isolation of sounds of different frequencies. Results indicate that future research could apply this program to devices for use as auditory aids. Work has already been done on speech-to-text, which could be used alongside sound localization software to create a usable device. Future research should focus on integrating sound localization and speech-to-text into a single real-time device suitable for everyday use.
“35 million Americans suffering from hearing loss.” (n.d.). Hear-It Organization. Retrieved from https://www.hear-it.org/35-million-Americans-suffering-from-hearing-loss
“Global Hearing Aid Retail Market To Expand with Significant CAGR of 7.2% During 2019-2028.” (2019, January 10). Market Watch. Retrieved from https://www.marketwatch.com/press-release/global-hearing-aid-retail-market-to-expand-with-significant-cagr-of-72-during-2019-2028-2019-01-10
Gorman, B. M. (2014). VisAural: A wearable sound-localisation device for people with impaired hearing. Proceedings of the 16th International ACM SIGACCESS Conference on Computers & Accessibility – ASSETS '14.
Hearing Aids Market Size, Share | Industry Analysis Report, 2019-2025. (n.d.). Retrieved from https://www.grandviewresearch.com/industry-analysis/hearing-aids-market
Heideman, M., Johnson, D., & Burrus, C. (1984). Gauss and the history of the fast Fourier transform. IEEE ASSP Magazine, 1(4), 14-21.
Ho-Ching, F. W., Mankoff, J., & Landay, J. A. (2003). Can you see what I hear? Proceedings of the Conference on Human Factors in Computing Systems – CHI '03.
Kim, K.W., Choi, J.W., & Kim, Y.H. (2013). An assistive device for direction estimation of a sound source. Assistive technology: The official journal of RESNA, 25, 216-21. 10.1080/10400435.2013.768718.
“Quick Statistics About Hearing.” (2018, October 05). Department of Health and Human Services. Retrieved from http://www.nidcd.nih.gov/health/statistics/quick-statistics-hearing
“US Hearing Aid Sales Increase by 5.3% in 2018.” (n.d.). Hearing Review. Retrieved from http://www.hearingreview.com/2019/01/us-hearing-aid-sales-increase-5-3-2018-approaches-4-million-unit-mark/
Wightman, F. L., & Kistler, D. J. (1992). The dominant role of low-frequency interaural time differences in sound localization. The Journal of the Acoustical Society of America, 91(3), 1648-1661.
We would like to acknowledge Dr. Kim for his technical guidance throughout the experimentation and Dr. Nimmer for his role in the development and background information for this project.