Sound Localization and Speech Detection to Assist the Hearing Impaired


Yeji Cho

Anthony Kim

Valmik Ranparia

Sky Shia

Young Kim
Chadwick High School/Peninsula High School

Peer Reviewer: Annasimone Andrawis and Lily Ge

Professional Reviewer: Mark Nimmer



Auditory assistive devices are becoming increasingly ubiquitous, and audio visualization devices represent an innovative method of helping the hearing impaired. The purpose of this project is to evaluate a method of creating an auditory visualization device that displays the real-time location of a sound source on a 3D grid interface. In addition to determining the location of the sound source, a functioning prototype requires the differentiation between audio frequencies for background noise reduction and identifying sound types such as human voices or passing cars. The hypothesis was that if audio frequency selection is used in cross-correlation, then sound source localization of each frequency source position could be achieved. Audio data was recorded and compared using the cross-correlation method. This method yielded a single value representing the calculated sound location. This calculated location was then compared to the actual location. This process was repeated in multiple trials with sounds of both high and low frequencies. The FFT-IFFT technique was used to remove one of the frequencies. The source position of the other frequency was then calculated. Analysis of the one-source results showed that most positions were found within a 40-degree angle of error. Analysis of the frequency-removed data supported the hypothesis and indicated that determining each sound source position can be achieved from one mixed sound signal. The results supported the hypothesis, and the sound localization technique developed in the present study can be further developed and implemented in future research to create a practical, wearable device.


Development and improvement of auditory assistive devices is critical, as 11% of Americans are categorically hearing impaired (Hear-It Organization, 2009). 28.8 million American adults stand to benefit from the use of hearing aids (U.S. Department of Health and Human Services, 2018). Only 28.5% of Americans who are hearing impaired employ the use of hearing aids (Hear-It Organization, 2009). From 2005 to 2008, the number of hearing impaired Americans increased by 9%, double the 4.5% population growth rate over the same period (Hear-It Organization, 2009). Sales of hearing aids have been on the rise accordingly, increasing by 5.3% in 2018 and exceeding what market experts deem the "normal" range of 2-4% annual growth (Hearing Review, 2019). Globally, the hearing aid and auditory assistive device retail market is expected to expand with a significant compound annual growth rate of 7.2% during 2019-2028 (Market Watch, 2019).

Audio visualization devices are a key innovation in how auditory assistive devices improve the lives of the hearing impaired. Visualization devices can show locations on a display from which sound originates in the user's field of vision, show icons that represent familiar sounds (e.g., ambulance sirens, cars honking, etc.), or convert speech to text. Users of audio visualization devices would be able to rely heavily upon the devices to provide cues that are key to navigating their daily environments.

Sound localization refers to the ability to detect the origin of a sound source. Human sound localization is primarily based on two factors: interaural time and intensity differences (Wightman & Kistler, 1992). Sound intensity is more commonly understood, and refers to the “loudness” or power carried by sound waves. This factor is used to understand the distance from which noise is being produced. Interaural time difference refers to the lag time between sound arrival to each ear. This subtle difference can be used to subconsciously calculate the angle from which sound is arriving.
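The geometry behind interaural time difference can be illustrated with a short calculation. The sketch below (Python, with assumed values for the speed of sound and ear spacing) uses the far-field approximation sin(θ) = c·Δt/d to turn a time lag into an arrival angle:

```python
import math

SPEED_OF_SOUND = 343.0  # m/s at room temperature (assumed)
EAR_SPACING = 0.215     # m, approximate distance between human ears (assumed)

def arrival_angle(itd_seconds):
    """Estimate the horizontal angle of a sound source (0 = straight ahead)
    from the interaural time difference, using sin(theta) = c * dt / d."""
    ratio = SPEED_OF_SOUND * itd_seconds / EAR_SPACING
    ratio = max(-1.0, min(1.0, ratio))  # clamp for numerical safety
    return math.degrees(math.asin(ratio))

print(arrival_angle(0.0))       # a source straight ahead gives zero ITD
print(arrival_angle(0.000627))  # the maximum physiological ITD, ~90 degrees
```

A lag of only about 0.6 milliseconds already corresponds to a source fully to one side, which is why the timing measurement must be precise.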

These two factors — sound intensity and interaural time difference — are utilized in the development of audio visualization devices. However, other factors, such as varying frequencies of environmental sounds, also play a role in the creation of a usable device, making this technology even more difficult to make precise. For example, users need to be able to distinguish sounds of different frequencies for various reasons; some examples include the need to identify specific voices and the intensity of the screech of car tires. Multiple sounds are often produced simultaneously, and ambient noise is present in nearly every real-life setting. A device that cannot distinguish between multiple sound sources would render itself unusable, producing an inaccurate location for what it assumes is only one sound.

Despite the clear importance of audio visualization devices, development continues to be very limited. Current prototypes cannot provide a comprehensive analysis of noise and provide little aid in real-world applications. For example, a now-abandoned industry prototype visualized sound using LED lights placed around a pair of glasses to indicate the direction of sound. The simplistic nature of this device prevents the user from understanding other characteristics of sound, including the intensity or source type. The lack of a screen or display interface prevents the device from communicating to its user more specific spatial dimensions of environmental sounds. Other prototypes included spectrograph and “positional ripples” visualization of sound source locations (Ho-Ching, Mankoff, & Landay, 2003). The “positional ripples” display was among the most user-friendly, in that it portrayed sound location, source type, and other such necessary factors. However, each of these prototypes proved to be ineffective in spaces with ambient noise due to difficulty in distinguishing the target sound.

The purpose of this project is to find ways to distinguish between multiple sounds occurring at the same time. Methods that currently exist to distinguish between sounds include use of amplitude, frequency, and other physical properties of sound. Devices that can measure the amplitude of sound waves in order to evaluate sounds are already available. Thus, this project focused on frequency as an identifying factor for sounds. It was hypothesized that if frequency selection was used in audio recordings, then the location of each frequency-typed sound source could be identified.


Cross-correlation is a method of measuring the similarity of two series by evaluating the displacement of one series relative to the other. For two series x(i) and y(i), with i = 0, 1, 2, … N-1, the cross-correlation r at delay d is defined as in Equation 1, where mx and my are the means of the corresponding series. If r(d) is evaluated for all delays d = 0, 1, 2, … N-1, the result is a cross-correlation series twice the length of the original series.

Equation 1. Cross-correlation r at delay d:

r(d) = Σᵢ [(x(i) − mx)(y(i − d) − my)] / √( Σᵢ (x(i) − mx)² · Σᵢ (y(i − d) − my)² )

The output is a number which can be used to determine the difference in the distances between two microphones and the sound source.
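As a sketch of this idea (not the exact Matlab code used in the study), the following Python snippet estimates the lag between two recordings from the peak of their full cross-correlation and converts it to a path-length difference:

```python
import numpy as np

def delay_and_distance(sig_a, sig_b, fs, c=343.0):
    """Estimate the lag of sig_a relative to sig_b (in seconds) from the
    peak of the full cross-correlation, then convert the lag to a
    path-length difference in metres. fs is the sample rate in Hz and
    c is the speed of sound."""
    a = sig_a - sig_a.mean()          # remove the means, as in Equation 1
    b = sig_b - sig_b.mean()
    corr = np.correlate(a, b, mode="full")   # series of length 2N - 1
    lag = np.argmax(corr) - (len(b) - 1)     # sample shift of the peak
    dt = lag / fs
    return dt, dt * c

# Toy check: the same pulse, delayed by 20 samples at 44.1 kHz.
fs = 44100
pulse = np.zeros(1000); pulse[100] = 1.0
delayed = np.zeros(1000); delayed[120] = 1.0
dt, dd = delay_and_distance(delayed, pulse, fs)
```

Here a positive dt means the sound reached the first microphone later, so that microphone sits roughly dd metres farther from the source than the second one.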

Another technique used was the FFT-IFFT technique, which converts time domain audio recordings into frequency domain data, and then converts the data back into the original time domain data.

FFT is an algorithm that computes the discrete Fourier transform (DFT) of a sequence, or its inverse (IDFT) (Heideman, Johnson, & Burrus, 1984). The DFT is obtained by decomposing a sequence of values into components of different frequencies (see Equation 2). An FFT rapidly computes such transformations by factorizing the DFT matrix into a product of sparse (mostly zero) factors (Heideman, Johnson, & Burrus, 1984).

Equation 2. Formula for the discrete Fourier transform (DFT):

X(k) = Σₙ₌₀ᴺ⁻¹ x(n) e^(−i2πkn/N),  k = 0, 1, …, N−1
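A minimal numerical check of this relationship, assuming NumPy is available: the DFT evaluated directly from the defining sum matches the output of the FFT algorithm, which computes the same quantity far more efficiently.

```python
import numpy as np

# Direct evaluation of the DFT sum (O(N^2)) versus the FFT (O(N log N)).
rng = np.random.default_rng(0)
x = rng.standard_normal(64)
N = len(x)

n = np.arange(N)
k = n.reshape(-1, 1)
dft_matrix = np.exp(-2j * np.pi * k * n / N)  # W[k, n] = e^(-i 2*pi*k*n/N)
X_direct = dft_matrix @ x

X_fft = np.fft.fft(x)
print(np.allclose(X_direct, X_fft))  # both compute the same transform
```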


Method 1: Single Source Localization

The first set of data was collected through single source localization methodology. Four microphones were arranged into a tetrahedral formation to ensure that recordings could be used to determine a specific point in the 3D grid (see Figure 1A & 1B). A single sound file was played at nine different locations (30, 60, and 90 cm horizontally from mics 1, 2, and 3) and captured using the four microphones.

Fig. 1A. Recording equipment: experiment setup with microphone locations labelled. Four microphones were placed in the formation shown in the diagram. Three microphones were placed 120 degrees apart from each other. The final microphone was placed at the center.

Fig. 1B. Picture of recording equipment (note: recording environment not pictured).

From the recordings captured by each microphone, the difference in arrival time (Δtime) between microphones was calculated using the cross-correlation function in Matlab, similar to the use of interaural time difference in mammals. Each Δtime was converted into a Δdistance by multiplying it by the velocity of sound. The Δdistance was calculated for each pair of microphones (mics 1 & 2, mics 1 & 3, mics 1 & 4, mics 2 & 3, mics 2 & 4, and mics 3 & 4), giving a total of six microphone pairs.

The recording environment was scanned in 10-centimeter increments (see Figure 1C). At each increment, the distance between that point and each of the four microphones was calculated and compared to the six Δdistance values. If the values were equal, the point was a match for the sound source. This method shall be referred to as “Position Search”.

Fig. 1C. Position search algorithm: visualization of recording environment scanning.
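The Position Search procedure above can be sketched as follows. The microphone coordinates here are hypothetical stand-ins for the tetrahedral layout of Figure 1A (the real geometry may differ), and the grid step mirrors the 10 cm increments:

```python
import itertools
import numpy as np

# Hypothetical mic coordinates (metres): three in a ring, one raised center.
MICS = np.array([[0.30, 0.00, 0.0],
                 [-0.15, 0.26, 0.0],
                 [-0.15, -0.26, 0.0],
                 [0.00, 0.00, 0.2]])
PAIRS = list(itertools.combinations(range(4), 2))  # the six mic pairs

def position_search(delta_dist, step=0.1, extent=1.0):
    """Scan space in `step`-metre increments and return the grid point
    whose pairwise mic-distance differences best match the six measured
    delta-distance values (least squared error)."""
    axis = np.arange(-extent, extent + step, step)
    best, best_err = None, np.inf
    for x in axis:
        for y in axis:
            for z in axis:
                p = np.array([x, y, z])
                d = np.linalg.norm(MICS - p, axis=1)   # point-to-mic distances
                cand = np.array([d[i] - d[j] for i, j in PAIRS])
                err = np.sum((cand - delta_dist) ** 2)
                if err < best_err:
                    best, best_err = p, err
    return best

# Toy check: delta-distances generated from a known source on the grid.
true_src = np.array([0.6, -0.3, 0.1])
d = np.linalg.norm(MICS - true_src, axis=1)
measured = np.array([d[i] - d[j] for i, j in PAIRS])
found = position_search(measured)
```

In practice the measured Δdistances are noisy, so the best-matching grid point (rather than an exact equality test) is the natural implementation of the comparison.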

The angle of error was determined from two vectors: the first was the position of the known sound source, and the second was the position of the calculated sound source. Angles of error were charted for each of the nine tested sound sources (see Figures 2A & 2B).

Fig. 2A. Single sound source localization. Y-axis represents angle of difference between actual sound source and calculated sound source. X-axis shows each of the nine tested sound sources.

Fig. 2B. Visualization of actual sound sources. Each number shows the location of the source, and the value of each number shows the calculated angle of error.
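The angle of error between the two position vectors can be computed from their normalized dot product; a minimal sketch:

```python
import numpy as np

def angle_of_error(actual, calculated):
    """Angle in degrees between the actual and calculated source position
    vectors, both measured from the origin of the microphone array."""
    a = np.asarray(actual, dtype=float)
    c = np.asarray(calculated, dtype=float)
    cos = np.dot(a, c) / (np.linalg.norm(a) * np.linalg.norm(c))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))  # clip guards rounding

print(angle_of_error([1, 0, 0], [1, 0, 0]))  # identical directions -> 0.0
print(angle_of_error([1, 0, 0], [0, 1, 0]))  # perpendicular -> 90.0
```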

Method 2: Frequency Selection Localization

The second set of data involved distinguishing between two sounds occurring simultaneously. A low frequency sound (male voice) and a high frequency sound (female voice) were recorded from different locations at the same time, following the same microphone setup as the Single Source Localization method. Position search was done with the same method as the Single Source Localization case, but with mixed voice sound files.

Using the FFT technique, the mixed voice recordings were converted into frequency domain data. First, the high frequency portion was removed, and the remainder was converted back into a time domain sound file using the IFFT technique, producing a recording with only the low frequency sound isolated. The same single voice extraction process was followed to produce a high frequency isolated sound file. Using cross-correlation, the angle of error between the exact and calculated sound source positions was determined for (1) the original mixed voice sound file, (2) the low frequency only sound file (see Figure 3A), and (3) the high frequency only sound file (see Figure 3B).

Fig. 3A. Low frequency sound source localization.

Fig. 3B. High frequency sound source localization.
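The single-voice extraction described above can be illustrated in simplified form, using two pure tones in place of the recorded voices: the mixture is transformed with the FFT, every bin outside the desired band is zeroed, and the IFFT returns a time-domain signal containing only one component. The band edges below are illustrative, not values from the study.

```python
import numpy as np

def isolate_band(signal, fs, band):
    """Keep only components whose frequency lies inside band = (lo, hi) Hz:
    FFT, zero everything outside the band, then IFFT back to time domain."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    mask = (freqs >= band[0]) & (freqs <= band[1])
    return np.fft.irfft(spectrum * mask, n=len(signal))

# Toy mixture: a 150 Hz "low voice" plus a 700 Hz "high voice".
fs = 8000
t = np.arange(fs) / fs
low = np.sin(2 * np.pi * 150 * t)
high = 0.7 * np.sin(2 * np.pi * 700 * t)
mixed = low + high

low_only = isolate_band(mixed, fs, (50, 400))     # removes the 700 Hz tone
high_only = isolate_band(mixed, fs, (400, 1500))  # removes the 150 Hz tone
```

Each isolated signal can then be fed to the cross-correlation step, so the position of each source is computed from a recording that contains only that source.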


The results of single sound source localization gave a maximum error value of 46.4 degrees and a minimum error value of 15.8 degrees (see Table 1 & Figure 2A). This angle of error was calculated by determining the angle of difference between the three-dimensional coordinates for the actual versus the calculated sound source. This angle, measured in degrees, can be visualized in a three-dimensional space (see Figure 4A).

Fig. 4A. Depiction of the range given by any error value θ.

Table 1. Single sound source localization. Angle represents the angle of difference between the actual sound source and the calculated sound source.

The low frequency and high frequency sounds were evaluated in two distinct sets of data, as each sound came from a different location at the same time. First, the coordinates of each sound source were compared to the coordinate calculated when the two sounds were evaluated together with no frequency limitations. This provided the angle of error titled “original sound.” These error values ranged from 32.5 degrees to 73.8 degrees, with one outlier of 17.5 degrees (see Tables 2 & 3). Limitations were then placed on the frequency to run the cross-correlation using only the low or high frequency sound. Again, the angle of error was evaluated, and titled “low frequency sound” or “high frequency sound.” These error values were far smaller, ranging from 8.9 degrees to 20.1 degrees (see Tables 2 & 3). A notable improvement was seen when the frequency was limited, showing that one of the sounds was successfully eliminated from evaluation. This means that limiting the frequency range allowed the cross-correlation to be taken of either the low or the high frequency sound.

Table 2. Low frequency sound source localization.

Table 3. High frequency sound source localization.


The major finding of the present study is that while neither the single source localization nor the frequency selection localization method alone produced highly accurate results, accuracy was greatly increased by isolating one frequency range with the FFT-IFFT method prior to cross-correlation.

In the set of trials titled “Single Source Localization,” the majority of the sound source positions were found with less than 40 degrees of error (see Figure 4A). Given that the recording was done in a closed room, with the potential of sound reflecting off the walls and creating a high level of background noise in the recordings, the error value indicates that while accuracy remains a prospect for future research, localization is possible. Though more effort should be taken in future research to reduce the error (i.e., using a recording booth with near-zero audio rebound), current results show that even rudimentary audio cleaning is possible using the FFT-IFFT technique.

As expected for the Frequency Selection Localization trials, calculating sound source position with mixed voice sound files did not yield accurate results. The angles of error were generally larger than the single sound source case. This indicates that the sounds occurring simultaneously caused errors when interpreted as one source.

However, upon removing one of the voices (isolating the low frequency or high frequency sound), angles of error were greatly reduced. These results suggest that positions of sound sources can be accurately determined using the cross-correlation technique. The frequency-isolation method, implemented using the FFT-IFFT technique, yielded much more accurate sound source position calculations and a cleaner signal. This increase in accuracy may be a result of isolating one voice from the original recordings. The removal of one of the frequencies may have reduced data noise (background static, sound reflecting off of walls, etc.), which may have interfered with sound position calculations in the first Single Source Localization method.

The results support the hypothesis and suggest that frequency selection was successfully achieved in cross-correlation methodologies tested. Evaluating the frequency of audio recordings allowed sound source localization and differentiation between simultaneous sounds of varied frequencies.

A novel method of sound source localization was determined through isolation of sounds of different frequencies. Results indicate that future research could involve the use of this program to create devices for use as an auditory aid. Work has been done on speech-to-text, which can be used alongside sound localization software in order to create a usable device. Future research should focus on implementation of this methodology in commercially applicable devices with good visualization of sound localization.


“35 million Americans suffering from hearing loss.” (n.d.). Hear-It Organization. Retrieved from

“Global Hearing Aid Retail Market To Expand with Significant CAGR of 7.2% During 2019- 2028.” (2019, January 10). Market Watch. Retrieved from release/global-hearing-aid-retail-market-to-expand-with-significant-cagr-of-72-during-2019- 2028-2019-01-10

Gorman, B. M. (2014). VisAural:. Proceedings of the 16th International ACM SIGACCESS Conference on Computers & Accessibility – ASSETS 14. doi:10.1145/2661334.2661410

Hearing Aids Market Size, Share | Industry Analysis Report, 2019-2025. (n.d.). Retrieved from

Heideman, M., Johnson, D., & Burrus, C. (1984). Gauss and the history of the fast Fourier transform. IEEE ASSP Magazine, 1(4), 14-21. doi:10.1109/massp.1984.1162257

Ho-Ching, F. W., Mankoff, J., & Landay, J. A. (2003). Can you see what I hear? Proceedings of the Conference on Human Factors in Computing Systems – CHI 03. doi:10.1145/642611.642641

Kim, K.W., Choi, J.W., & Kim, Y.H. (2013). An assistive device for direction estimation of a sound source. Assistive technology: The official journal of RESNA, 25, 216-21. 10.1080/10400435.2013.768718.

“Quick Statistics About Hearing.” (2018, October 05). Department of Health and Human Services. Retrieved from

“US Hearing Aid Sales Increase by 5.3% in 2018.” (n.d.). Hearing Review. Retrieved from million-unit-mark/

Wightman, F. L., & Kistler, D. J. (1992). The dominant role of low frequency interaural time differences in sound localization. The Journal of the Acoustical Society of America, 91(3), 1648- 1661. doi:10.1121/1.402445


We would like to acknowledge Dr. Kim for his technical guidance throughout the experimentation and Dr. Nimmer for his role in the development and background information for this project.

