Kaunas University of Technology
Subject: Engineering
ISSN: 1392-1215
eISSN: 2029-5731
Mehmet Cem Catalbas* / Simon Dobrisek
Citation Information: Elektronika ir Elektrotechnika, Volume 23, Issue 4, Pages 63-69, ISSN (Online) 2029-5731, DOI: 10.5755/j01.eie.23.4.18724, August 2017. © 2017.
License: CC-BY-4.0
Received: 19-November-2016 / Accepted: 04-May-2017 / Published Online: 2017
In this paper, the generic idea of the sound localization process is examined. Sound source localization is the determination of the exact locations of sound sources using appropriate signal processing techniques [1], [2]. The location of a sound source can be determined in a 2D or 3D observation space [3], [4]. For some basic applications, 2D location information is adequate [5]; however, applications such as robotics and drones require 3D position information. Before discussing the concept of sound localization, we need to explain some basic parameters of sound signals. A sound signal is, by definition, a change in air pressure over time. This pressure change can easily be converted into an electrical signal by a microphone; microphones fall into three basic types, depending on the application area [6]. The first type, the dynamic microphone, is based on the generic principle of generator systems.
In a dynamic microphone, the vibration of the air is transformed into an electrical signal via a vibrating diaphragm and coil structure. The second type, the capacitor microphone, responds to changes in capacitance caused by changes in sound pressure. The last type comprises special-purpose microphones designed for particular areas. In this work, the basic dynamic microphone structure is used. Sound signals vary with time, and processing these variations is useful for many signal processing applications, such as sound recognition and localization, voice-based emotion detection, and trajectory estimation [7], [8]. The aim of this paper is to investigate the main sound localization techniques and the complete processing chain of sound source localization. Moreover, a novel sound database that can be used for developing moving sound source localization systems is presented, and a part of this database is used to investigate several selected sound localization techniques.
In this section, some sound processing terms are explained. Sound localization is the determination of sound source positions via signal processing techniques. In the literature, this concept is implemented in different ways; generally, sound localization techniques are inspired by the human hearing system. The main terms related to sound localization are the Interaural Time Difference (ITD) and the Interaural Level Difference (ILD).
ITD is commonly used in basic works [9]. Its drawbacks are its sensitivity to ambient noise and the fact that, on its own, it does not provide the exact location of the sound source; it only indicates the azimuth angle. ILD is related to the sound pressure level [10] and also indicates the location of the sound source, via the sound level difference. However, ILD is more sensitive to environmental and hardware conditions. ITD and ILD can be used together, and this combination makes it possible to determine the exact location of the sound source [11]. The ILD and ITD parameters are shown in Fig. 1 and Fig. 2, respectively.

In this section, we explain the basic parameters of our test database. In the presented research, one of the recorded scenarios is tested. In this scenario, one person walks through the locations while simultaneously saying the ID number of each location. The data sets are recorded with four inductive (dynamic) microphones and a sound recording system. The sampling rate of the dataset is f_{s} = 44100 Hz, and a Scarlett 18i8 desktop audio interface is used for sound recording. The 2D coordinates of the microphones and the locations are shown in Fig. 3.
The second microphone is selected as the origin (reference point) of the coordinate system. The coordinates of the microphones and the locations are given in Table I and Table II, respectively.
Location | X(cm) | Y(cm) | Z(cm) |
---|---|---|---|
1 | 370 | 0 | 153 |
2 | 188 | 0 | 153 |
3 | 370 | 161 | 172 |
4 | 123 | 161 | 172 |
5 | 0 | 161 | 172 |
6 | 492 | 345 | 172 |
7 | 123 | 345 | 172 |
8 | 0 | 345 | 172 |
9 | 492 | 529 | 172 |
10 | 123 | 529 | 172 |
11 | 370 | 660 | 172 |
12 | 123 | 660 | 172 |
The 3D modelling of the test room and the environment is shown in Fig. 4.
The location of the speaker and the relevant distances are shown in Fig. 5. The distances between the ceiling and the floor, between the speaker and the microphones, and between the ceiling and the microphones are denoted by d_{cf}, d_{z}, and d_{cm}, respectively.
The front view of the recording environment used in sound source localization is shown in Fig. 6.
In this section, we explain some methods of the sound localization process. The most common and basic one is ITD-based sound localization, which uses the time lag between the receivers of the sound. Several algorithms have been developed to estimate the time difference of arrival (TDOA) under ideal propagation conditions. The best known of these is time-domain cross-correlation [12]. Cross-correlation shifts one waveform relative to the other and finds the shift that gives maximum similarity; this shift is the time lag between the signals. The cross-correlation of continuous-time signals is shown in (1), where τ is the shift parameter:

$$R_{x_{1}x_{2}}(\tau) = \int_{-\infty}^{\infty} x_{1}(t)\, x_{2}(t + \tau)\, dt.$$
The test signal is multiplied by the reference signal and integrated, while the test signal is shifted in time to analyse the similarity. At the end of this process, the delay between the two signals is obtained. However, time-domain cross-correlation is not robust enough to environmental noise and echo effects. Hence, researchers prefer the Generalized Cross-Correlation (GCC) algorithm for sound localization [13]. Cross-correlation and the GCC method are briefly explained as follows.
Assuming that there is only one (unknown) sound source in the field, the output of receiver n (n = 1, 2, ..., N) can be written as

$$x_{n}(k) = \alpha_{n}\, s(k - D_{n}) + b_{n}(k),$$
where α_{n}, which satisfies 0 ≤ α_{n} ≤ 1, is an attenuation factor due to propagation effects, D_{n} is the propagation time from the unknown sound source to receiver n, and s(k), which is often speech from a talker or a loudspeaker, is broadband in nature. The term b_{n}(k) is Gaussian random noise that is uncorrelated with both the source signal and the noise at the other sensors. The cross-correlation function between two signals is given in (4):

$$r_{x_{1}x_{2}}(p) = E\left[x_{1}(k)\, x_{2}(k + p)\right].$$
Here E[·] denotes mathematical expectation. The TDOA between the received sound signals can then be calculated by

$$\tau_{12} = \arg\max_{p}\; r_{x_{1}x_{2}}(p),$$
where p ∈ [−τ_{max}, τ_{max}] and τ_{max} is the maximum possible delay between the signals. A digital implementation of the TDOA calculation requires some approximations. Suppose that at time instant t we have a set of K observation samples of x_{n}, {x_{n}(t), x_{n}(t + 1), ..., x_{n}(t + K − 1)}, n = 1, 2; the corresponding cross-correlation function (CCF) can then be estimated by (6) and (7).
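A standard biased time-domain CCF estimator of the kind that (6) and (7) describe, followed by the peak search over p ∈ [−τ_{max}, τ_{max}], can be sketched as below. This is a minimal sketch assuming NumPy, a 1/K normalization, and a synthetic integer delay; the function names `ccf_direct` and `tdoa_direct` are hypothetical.

```python
import numpy as np


def ccf_direct(x1, x2, p):
    """Biased CCF estimate: (1/K) * sum_k x1(k) * x2(k + p) over one K-sample frame."""
    K = len(x1)
    if p >= 0:
        return np.dot(x1[:K - p], x2[p:]) / K
    return np.dot(x1[-p:], x2[:K + p]) / K  # p < 0: sum runs over k = -p .. K-1


def tdoa_direct(x1, x2, tau_max):
    """Return the lag p in [-tau_max, tau_max] that maximizes the estimated CCF."""
    lags = np.arange(-tau_max, tau_max + 1)
    r = np.array([ccf_direct(x1, x2, int(p)) for p in lags])
    return int(lags[np.argmax(r)])


# Synthetic check: x2 is x1 delayed by 7 samples, i.e. D2 - D1 = 7.
rng = np.random.default_rng(0)
s = rng.standard_normal(3000)
x1, x2 = s[17:2017], s[10:2010]
print(tdoa_direct(x1, x2, tau_max=20))  # 7
```

With this sign convention, a positive lag means the second channel receives the sound later than the first.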
Another way to estimate the cross-correlation between signals uses the Discrete Fourier Transform (DFT) and the Inverse Discrete Fourier Transform (IDFT), as shown in (8) and (9),
where k' = 0, 1, ..., K − 1 and ω_{k'} = 2πk'/K is the angular frequency of the k'-th DFT bin.
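The DFT route in (8) and (9) can be sketched with NumPy's real FFT. Because multiplying two K-point DFTs yields a circular correlation, this sketch zero-pads both frames to 2K points so that the result equals the linear estimate of (6) and (7). The name `ccf_fft` and the zero-padding choice are assumptions of this illustration.

```python
import numpy as np


def ccf_fft(x1, x2, tau_max):
    """CCF estimate via DFT/IDFT, returned for lags -tau_max..tau_max.

    Both K-sample frames are zero-padded to 2K points so that the circular
    correlation computed by the IDFT equals the linear (biased) estimate.
    """
    K = len(x1)
    L = 2 * K
    X1 = np.fft.rfft(x1, n=L)
    X2 = np.fft.rfft(x2, n=L)
    r = np.fft.irfft(np.conj(X1) * X2, n=L) / K  # r[p] = (1/K) sum_k x1(k) x2(k+p)
    lags = np.arange(-tau_max, tau_max + 1)
    return lags, r[lags]  # negative lags wrap around to the end of r


rng = np.random.default_rng(0)
s = rng.standard_normal(3000)
x1, x2 = s[17:2017], s[10:2010]  # x2 is x1 delayed by 7 samples
lags, r = ccf_fft(x1, x2, tau_max=20)
print(int(lags[np.argmax(r)]))  # 7
```

The FFT route costs O(K log K) instead of O(K · τ_{max}) for the direct sums, which matters for long frames at f_{s} = 44100 Hz.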
In the frequency domain, the DFT of x_{n}(k) takes the form [14]

$$X_{n}(\omega) = \alpha_{n}\, S(\omega)\, e^{-j\omega D_{n}} + B_{n}(\omega),$$

where the noise term B_{n}(ω) is a Gaussian random variable. For n = 1, 2, ..., N, the power spectral densities of s(k), b_{n}(k), x_{1}(k), and x_{2}(k) are given in (15)–(18), respectively.
The commonly used weighting functions of the GCC method are shown in Table III.
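As a hedged illustration of weighted GCC, the sketch below implements classical weighting functions ψ(ω) from the GCC literature: unit weighting (plain cross-correlation), Roth, SCOT, and PHAT. It is an assumption that these match the entries of Table III. Spectra are estimated from a single zero-padded frame, in which case SCOT coincides with PHAT; practical implementations usually average the spectra over several frames. The function name `gcc_tdoa` is hypothetical.

```python
import numpy as np


def gcc_tdoa(x1, x2, tau_max, weighting="phat", eps=1e-12):
    """Generalized cross-correlation delay estimate (in samples) between two frames.

    Spectra are single-frame periodogram estimates of a 2K-point zero-padded DFT.
    """
    K = len(x1)
    L = 2 * K
    X1 = np.fft.rfft(x1, n=L)
    X2 = np.fft.rfft(x2, n=L)
    G12 = np.conj(X1) * X2  # cross-spectrum estimate
    G11 = np.abs(X1) ** 2   # auto-spectrum of x1
    G22 = np.abs(X2) ** 2   # auto-spectrum of x2
    if weighting == "cc":      # plain cross-correlation
        psi = np.ones_like(G11)
    elif weighting == "roth":  # Roth processor: 1 / G11
        psi = 1.0 / (G11 + eps)
    elif weighting == "scot":  # smoothed coherence transform: 1 / sqrt(G11 * G22)
        psi = 1.0 / (np.sqrt(G11 * G22) + eps)
    elif weighting == "phat":  # phase transform: 1 / |G12|
        psi = 1.0 / (np.abs(G12) + eps)
    else:
        raise ValueError(f"unknown weighting: {weighting}")
    r = np.fft.irfft(psi * G12, n=L)
    lags = np.arange(-tau_max, tau_max + 1)
    return int(lags[np.argmax(r[lags])])


rng = np.random.default_rng(0)
s = rng.standard_normal(3000)
x1, x2 = s[17:2017], s[10:2010]  # x2 is x1 delayed by 7 samples
for w in ("cc", "roth", "scot", "phat"):
    print(w, gcc_tdoa(x1, x2, 20, weighting=w))
```

The weightings differ in how they whiten the cross-spectrum: PHAT keeps only phase, which sharpens the correlation peak in reverberant rooms at the cost of amplifying low-energy bins.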
In this paper, the sound source to be localized is moving. An Excitation Source Information (ESI) based time-delay estimation algorithm is used to determine the time lag between sound channels [14], [15].
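One way to exploit excitation source information, as described in the literature on ESI-based delay estimation, is to cross-correlate the Hilbert envelopes of the linear prediction (LP) residuals of the channels, since the LP residual emphasizes the excitation rather than the vocal-tract response. The sketch below shows only that core idea, under stated assumptions: LP order 12, autocorrelation-method LPC over the whole frame, and no further refinements. It is not the authors' implementation, and the function names are hypothetical.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import hilbert, lfilter


def lp_residual(x, order=12):
    """LP residual e(k) = x(k) - sum_i a_i x(k - i), autocorrelation method."""
    x = np.asarray(x, float) - np.mean(x)
    n = len(x)
    r = np.correlate(x, x, mode="full")[n - 1 : n + order]  # lags 0..order
    a = solve_toeplitz(r[:order], r[1 : order + 1])        # Yule-Walker equations
    return lfilter(np.concatenate(([1.0], -a)), [1.0], x)  # inverse filter A(z)


def esi_tdoa(x1, x2, tau_max, order=12):
    """Delay (samples) maximizing the correlation of LP-residual Hilbert envelopes."""
    h1 = np.abs(hilbert(lp_residual(x1, order)))
    h2 = np.abs(hilbert(lp_residual(x2, order)))
    h1, h2 = h1 - h1.mean(), h2 - h2.mean()  # remove the envelopes' DC level
    L = 2 * len(h1)  # zero-padding for linear correlation
    r = np.fft.irfft(np.conj(np.fft.rfft(h1, n=L)) * np.fft.rfft(h2, n=L), n=L)
    lags = np.arange(-tau_max, tau_max + 1)
    return int(lags[np.argmax(r[lags])])


rng = np.random.default_rng(0)
# AR(2)-coloured noise as a stand-in for a recorded sound signal
s = lfilter([1.0], [1.0, -1.3, 0.6], rng.standard_normal(3000))
print(esi_tdoa(s[17:2017], s[10:2010], tau_max=20))
```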
The optimal GCC method for this database is determined experimentally. The error percentage is computed from the difference between the true and the estimated sample delays. Using this measure, the error percentage is calculated versus Signal-to-Noise Ratio (SNR) on our sound recording database to assess the robustness of the GCC methods. The result of this comparison is shown in Fig. 7: the ESI approach gives better results than the other GCC methods on our dataset. The mean error percentages of the different GCC approaches are given in Table IV.
Method Name | Mean Error (%) |
---|---|
Cross correlation | 14.505 |
Roth filter | 13.101 |
SCOT | 20.511 |
SCOT-Modified | 15.052 |
ESI | 12.627 |
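A robustness study of this kind requires corrupting the recordings at controlled SNRs. The helper below adds white Gaussian noise scaled to an exact target SNR. The `error_percent` helper uses a relative-error definition, |true − estimated| / |true| × 100, which is an assumption, since the text defines the error only through the difference between the true and estimated sample delays. Both names are hypothetical.

```python
import numpy as np


def add_noise_at_snr(x, snr_db, rng):
    """Return x plus white Gaussian noise with 10*log10(P_x / P_noise) == snr_db."""
    x = np.asarray(x, float)
    noise = rng.standard_normal(x.shape)
    p_signal = np.mean(x ** 2)
    p_target = p_signal / 10.0 ** (snr_db / 10.0)
    noise *= np.sqrt(p_target / np.mean(noise ** 2))  # rescale to the exact target power
    return x + noise


def error_percent(true_delay, est_delay):
    """Assumed relative error of a sample-delay estimate, in percent."""
    return 100.0 * abs(true_delay - est_delay) / abs(true_delay)


rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 440 * np.arange(4410) / 44100)
y = add_noise_at_snr(x, snr_db=10.0, rng=rng)
measured = 10 * np.log10(np.mean(x ** 2) / np.mean((y - x) ** 2))
print(round(measured, 6), error_percent(10, 9))
```

Sweeping `snr_db` and averaging `error_percent` over many frames would produce an error-versus-SNR curve of the type described for Fig. 7.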
Accurate sample-delay estimation is vital for determining the exact location of sound sources. The user only needs to know the distance between the microphones and the time difference between the received sounds [16], [17]. With this information, the azimuth angle of the sound source can be calculated easily. The determination of the azimuth angle based on ITD is illustrated in Fig. 8.
One of the most important parameters for sound localization is the speed of sound, which directly affects the performance of sound source localization.
The speed of sound is highly sensitive to environmental conditions. In generic applications of sound localization techniques, the speed of sound is taken as a constant (V_{s} ≈ 343 m/s). Alternatively, it can be calculated as a function of the ambient temperature, as shown in (19). The estimation of the azimuth angle via ITD is given in (20), in which the delay in samples, the sampling period, and the distance between the receivers (in meters) are denoted by s_{d}, T_{s}, and d, respectively [18]. The parameter C_{t} denotes the ambient temperature in degrees Celsius.
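A common linear approximation for the speed of sound in air is V_{s} ≈ 331.3 + 0.606 C_{t} m/s, and a common far-field ITD model gives the azimuth as θ = arcsin(V_{s} s_{d} T_{s} / d), measured from the broadside of the microphone pair. The sketch below uses both; whether they match (19) and (20) exactly is an assumption, and the function names are hypothetical.

```python
import numpy as np


def speed_of_sound(c_t):
    """Approximate speed of sound in air (m/s) at temperature c_t in degrees Celsius."""
    return 331.3 + 0.606 * c_t


def azimuth_deg(s_d, t_s, d, v_s):
    """Far-field azimuth (degrees from broadside) for a delay of s_d samples.

    The path difference v_s * s_d * t_s is divided by d and clipped to [-1, 1]
    so that noisy delay estimates do not push arcsin outside its domain.
    """
    ratio = np.clip(v_s * s_d * t_s / d, -1.0, 1.0)
    return float(np.degrees(np.arcsin(ratio)))


fs = 44100.0
v_s = speed_of_sound(20.0)  # about 343.4 m/s at 20 degrees Celsius
print(v_s, azimuth_deg(s_d=20, t_s=1 / fs, d=0.5, v_s=v_s))
```

Because one sample at 44.1 kHz corresponds to about 7.8 mm of path difference, the angular resolution for closely spaced microphones is coarse, which is one reason accurate delay estimation matters.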
ITD-based sound localization is illustrated in Fig. 9. The intersection points of the azimuth lines provide the sound source location. In general, estimating the correct delay between sound channels is very hard due to environmental noise and echo effects, so there will be a deviation between the estimated and actual locations that depends on how well the algorithm estimates the delay. In this approach, the user only needs to know the delay between the sound channels and the locations of the microphones.
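In 2D, the intersection can be computed by treating each microphone pair as a point (for example, its midpoint) from which a bearing line leaves at a known angle, and solving a 2×2 linear system. This sketch assumes far-field bearings that have already been converted from each pair's azimuth to a common room-frame angle, with the front-back ambiguity resolved; the function name and the example geometry are hypothetical.

```python
import numpy as np


def intersect_bearings(p1, theta1, p2, theta2):
    """Intersect lines p1 + t1*u(theta1) and p2 + t2*u(theta2) in the plane.

    Angles are in radians in the room frame; raises LinAlgError if the lines are parallel.
    """
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    u1 = np.array([np.cos(theta1), np.sin(theta1)])
    u2 = np.array([np.cos(theta2), np.sin(theta2)])
    # Solve t1*u1 - t2*u2 = p2 - p1 for (t1, t2).
    t = np.linalg.solve(np.column_stack((u1, -u2)), p2 - p1)
    return p1 + t[0] * u1


src = np.array([3.0, 4.0])
a, b = np.array([0.0, 0.0]), np.array([5.0, 0.0])
theta_a = np.arctan2(src[1] - a[1], src[0] - a[0])
theta_b = np.arctan2(src[1] - b[1], src[0] - b[0])
print(intersect_bearings(a, theta_a, b, theta_b))
```

With noisy bearings the two lines still meet, but the intersection drifts; the drift grows as the lines become nearly parallel, matching the deviation described above.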
Another approach for sound source localization is based on the Time Difference of Arrival (TDOA). The generic idea of TDOA is to determine the relative differences in arrival time between receivers. The approach is easy to apply when the time differences between the receiving channels and the exact locations of the receivers are known [19], [20]. It is commonly used in military applications, sound localization, GSM, and wireless sensor networks. TDOA-based 3D sound source localization can be formulated as an optimization problem [21].
The optimal location estimate is the point that minimizes the expression in (21),
where (x_{i}, y_{i}, z_{i}) – location of microphone i, d_{i} – distance between the sound source and microphone i, (x_{s}, y_{s}, z_{s}) – location of the sound source, M – number of microphones (M ≥ 3).
This is an unconstrained optimization problem with a quadratic (least-squares) cost [22]. For four microphones, the expanded form of (21) is given in (22)–(25),
where (x_{se}, y_{se}, z_{se}) – estimated location of the sound source.
In this work, four microphones are used for sound source localization, which gives six different microphone pairs. The distances based on these combinations are given in (26).
These distance differences are converted into the sample delay differences s_{ij} between microphones i and j as in (27), where V_{s} is calculated from the ambient temperature via (19). The sampling frequency is f_{s} = 44100 Hz in this study.
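For M = 4 microphones, the six pairs are (1,2), (1,3), (1,4), (2,3), (2,4), and (3,4). A natural reading of (26) and (27) is d_{ij} = d_{i} − d_{j} and s_{ij} = d_{ij} f_{s} / V_{s}; the sketch below computes these from a candidate source position. That reading, the zero-based indices, the function name, and the microphone coordinates in the example are hypothetical.

```python
import itertools

import numpy as np


def pairwise_sample_delays(source, mics, fs=44100.0, v_s=343.0):
    """Map each microphone pair (i, j), i < j, to s_ij = (d_i - d_j) * fs / v_s."""
    mics = np.asarray(mics, float)
    d = np.linalg.norm(mics - np.asarray(source, float), axis=1)  # source-to-mic distances
    return {
        (i, j): (d[i] - d[j]) * fs / v_s
        for i, j in itertools.combinations(range(len(mics)), 2)
    }


# Hypothetical coplanar layout (metres), not the coordinates of Table I.
mics = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (2.0, 2.0, 0.0)]
delays = pairwise_sample_delays((1.2, 0.7, 1.5), mics)
for pair, s_ij in delays.items():
    print(pair, round(s_ij, 2))
```

Only three of the six delays are independent for four microphones; using all six adds redundancy that helps average out estimation noise.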
The TDOA approach allows the location of the sound source to be determined from the delays between the audio channels. Solving the positioning problem with the TDOA approach is equivalent to solving an optimization problem, which is done here with a modified Levenberg-Marquardt algorithm [22]–[24].
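As an illustration of solving this optimization numerically, the sketch below fits a source position to measured pairwise sample delays with SciPy's `least_squares(method="lm")`, which wraps the standard MINPACK Levenberg-Marquardt routine, not the modified variant used in the paper. The residual for each pair is the mismatch between the modeled range difference and V_{s} s_{ij} / f_{s}. With coplanar microphones, as in the recording setup, a source and its mirror image across the microphone plane produce identical delays, so the initial guess must lie on the correct side of that plane. The example layout is hypothetical and non-coplanar, and `locate_tdoa` is a hypothetical name.

```python
import itertools

import numpy as np
from scipy.optimize import least_squares


def locate_tdoa(mics, delays, x0, fs=44100.0, v_s=343.0):
    """Estimate (x_s, y_s, z_s) from pairwise sample delays {(i, j): s_ij}."""
    mics = np.asarray(mics, float)
    pairs = sorted(delays)
    measured = np.array([delays[p] for p in pairs]) * v_s / fs  # range differences (m)

    def residuals(pos):
        d = np.linalg.norm(mics - pos, axis=1)
        return np.array([d[i] - d[j] for i, j in pairs]) - measured

    return least_squares(residuals, np.asarray(x0, float), method="lm").x


# Hypothetical layout and source (metres); delays are simulated without noise.
mics = np.array([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (2.0, 2.0, 1.0)])
src = np.array([1.3, 0.7, 1.6])
d = np.linalg.norm(mics - src, axis=1)
delays = {(i, j): (d[i] - d[j]) * 44100.0 / 343.0
          for i, j in itertools.combinations(range(4), 2)}
print(locate_tdoa(mics, delays, x0=(1.1, 0.9, 1.4)))
```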
In this section, we describe our approach to 3D moving sound source localization, which uses several data mining and signal processing methods. The 3D sound source localization process consists of several steps. In the first step, the signals are smoothed by a Savitzky-Golay filter [25] to increase the SNR, which gives a better signal representation and a more accurate time-delay estimate. A sample result of the Savitzky-Golay filter is shown in Fig. 10. As the figure shows, the Savitzky-Golay filter smooths the signal without greatly distorting its original shape.
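SciPy provides the Savitzky-Golay filter as `scipy.signal.savgol_filter(x, window_length, polyorder)`. The window length and polynomial order below are illustrative assumptions, not values from the paper. Because the filter fits a local polynomial, it reproduces polynomial trends up to `polyorder` exactly while attenuating broadband noise, which is why the smoothed signal keeps its overall shape.

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
n = np.arange(2000)
clean = np.sin(2 * np.pi * n / 500)  # slowly varying test signal
noisy = clean + 0.3 * rng.standard_normal(n.size)

# window_length must exceed polyorder; an odd length keeps the local fit centred
smoothed = savgol_filter(noisy, window_length=51, polyorder=3)

mse_noisy = np.mean((noisy - clean) ** 2)
mse_smoothed = np.mean((smoothed - clean) ** 2)
print(f"MSE before: {mse_noisy:.4f}, after: {mse_smoothed:.4f}")
```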
The second stage of the process increases the contrast between the sound source signal and the environmental noise. A threshold is applied to the filtered signal in order to isolate the location information. Local maxima are determined by an adaptive thresholding method [26]. The filtered signal and the thresholded signal are shown in Fig. 11.
Next, a reference point is required for calculating the correct time delay between the sound channels. K-medoids clustering is applied to each thresholded sound channel output [27]. Centroid points are determined for all sound channels, and the mean of these centroids is calculated. The k-medoids algorithm serves two purposes in this work: determining the reference points for all sound channels and obtaining the exact number of sound source locations.
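A minimal sketch of adaptive thresholding for local-maximum detection, assuming a threshold equal to the local mean plus k local standard deviations over a sliding window (a common choice, not necessarily the rule of [26]); the window length `win`, the factor `k`, and the function name are hypothetical. `scipy.signal.find_peaks` accepts a per-sample array as its `height` argument, so the local maxima above the adaptive threshold are selected in one call.

```python
import numpy as np
from scipy.ndimage import uniform_filter1d
from scipy.signal import find_peaks


def adaptive_peaks(x, win=201, k=3.0):
    """Local maxima of x that exceed a sliding mean + k * sliding std threshold."""
    x = np.asarray(x, float)
    mean = uniform_filter1d(x, size=win, mode="reflect")
    mean_sq = uniform_filter1d(x ** 2, size=win, mode="reflect")
    std = np.sqrt(np.maximum(mean_sq - mean ** 2, 0.0))  # guard against round-off
    threshold = mean + k * std
    peaks, _ = find_peaks(x, height=threshold)
    return peaks, threshold


# Synthetic envelope with two utterances centred at samples 300 and 700.
n = np.arange(1000)
env = np.exp(-((n - 300) ** 2) / 50.0) + np.exp(-((n - 700) ** 2) / 50.0)
peaks, thr = adaptive_peaks(env)
print(peaks)
```

A sliding threshold follows slow changes in background level, so a quiet utterance far from the microphones is not suppressed by a loud one elsewhere in the recording.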
The determination of the exact number of locations is shown in Fig. 12. The plot shows the total distance between the medoids and the observation points, summed over all sound channels, versus the number of locations (clusters). As shown in the figure, there is an obvious elbow point, which indicates the number of locations adaptively. As expected, the number of sound locations is N = 10 in this study.
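To illustrate how k-medoids can provide both cluster reference points and the number of locations, the sketch below runs a simple alternating (Voronoi-iteration) k-medoids on one-dimensional peak positions for k = 1, ..., k_max and picks the elbow as the k with the largest second difference of the total-distance curve. The 1D formulation, the deterministic initialization, the second-difference elbow rule, and the function names are assumptions for illustration; PAM-style k-medoids searches more thoroughly.

```python
import numpy as np


def kmedoids_1d(points, k, n_iter=100):
    """Alternating k-medoids on 1D data; returns (medoids, total absolute distance)."""
    x = np.sort(np.asarray(points, float))
    # deterministic initialisation: k evenly spaced order statistics
    medoids = x[np.round(np.linspace(0, len(x) - 1, k)).astype(int)]
    for _ in range(n_iter):
        labels = np.argmin(np.abs(x[:, None] - medoids[None, :]), axis=1)
        new = medoids.copy()
        for c in range(k):
            members = x[labels == c]
            if members.size:
                # medoid: the member with the smallest summed distance to its cluster
                cost = np.abs(members[:, None] - members[None, :]).sum(axis=1)
                new[c] = members[np.argmin(cost)]
        if np.array_equal(new, medoids):
            break
        medoids = new
    labels = np.argmin(np.abs(x[:, None] - medoids[None, :]), axis=1)
    return medoids, float(np.abs(x - medoids[labels]).sum())


def elbow_k(points, k_max=8):
    """Pick k at the sharpest bend (largest second difference) of the cost curve."""
    costs = np.array([kmedoids_1d(points, k)[1] for k in range(1, k_max + 1)])
    bend = costs[:-2] - 2 * costs[1:-1] + costs[2:]  # bend[0] corresponds to k = 2
    return int(np.argmax(bend)) + 2, costs


# Hypothetical peak positions (samples) from three utterances.
pts = np.concatenate([c + np.array([-20, -10, 0, 10, 20]) for c in (1000, 5000, 9000)])
k, costs = elbow_k(pts, k_max=5)
print(k, kmedoids_1d(pts, k)[0])
```

Medoids are actual data points, so each reference point falls on a detected peak rather than on an averaged position between utterances.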
The cluster centers, which act as a mask providing a reference for every location in the sound signals, are also obtained by the clustering. These reference points are calculated as shown in (28), where the centroid points of the sound channels are denoted by c_{1i}, c_{2i}, c_{3i}, and c_{4i}, and the number of sound channels is c_{n} = 4. The size of the sample windows can then be selected from these reference points. As determined above, the number of locations is N = 10.
Here, i = 1, 2, 3, ..., N – 1. The selection of the sample window size W is given in (29): the differences between the reference points are calculated, and a small safety margin is added to avoid overlap between sample windows. The sample window size and the reference lines are shown in Fig. 13.
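A minimal sketch of the reference-point and window computation, assuming that (28) averages the per-channel centroids c_{1i}, ..., c_{4i} of each location and that W is the smallest gap between consecutive reference points reduced by a safety margin so that neighbouring windows cannot overlap. Both readings, the margin value, the example centroids, and the function names are hypothetical.

```python
import numpy as np


def reference_points(centroids):
    """Average the per-channel centroids; centroids has shape (c_n, N) in samples."""
    return np.mean(np.asarray(centroids, float), axis=0)


def window_size(refs, margin):
    """Smallest gap between consecutive reference points, minus a safety margin."""
    gaps = np.diff(np.sort(refs))  # N - 1 gaps
    return int(np.floor(gaps.min() - margin))


def window_bounds(refs, w, n_samples):
    """(start, stop) sample indices of a length-w window centred on each reference."""
    bounds = []
    for r in np.round(refs).astype(int):
        start = r - w // 2
        bounds.append((int(max(start, 0)), int(min(start + w, n_samples))))
    return bounds


# Hypothetical centroids for c_n = 4 channels and N = 3 locations.
c = [[100, 500, 900], [104, 498, 910], [96, 502, 890], [100, 500, 900]]
refs = reference_points(c)
w = window_size(refs, margin=20)
print(refs, w, window_bounds(refs, w, n_samples=1200))
```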
The estimated reference lines and sample windows are also shown in Fig. 14.
In the proposed algorithm, 3D localization of a moving sound source sequence can be summarized in the following steps:
Step 1. Savitzky-Golay pre-filtering to smooth the input sound signals.
Step 2. Determination of the local maxima of the sound signals via an adaptive threshold algorithm.
Step 3. Application of a data clustering algorithm to the calculated local maxima to determine the optimal masking parameters.
Step 4. Determination of the exact number of locations for the localization process.
Step 5. Estimation of the time delay between sound channels via excitation source information based time-delay estimation.
Step 6. Implementation of the TDOA algorithm via the modified Levenberg-Marquardt algorithm and determination of the sound source location points in 3D space as coordinates (x_{s}, y_{s}, z_{s}).
In this paper, only the ITD parameter is used for sound source localization, since ITD is much more robust and reliable than ILD. The differences between the outputs of our algorithm and the real coordinates are given in Table V.
The mean errors in 2D and 3D distance estimation are 59.814 cm and 87.902 cm, respectively. These results are acceptable given that the sound source moves during the whole process and the reference points for the time delay are estimated adaptively.
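The 2D and 3D error figures can be read as mean Euclidean distances between the estimated and the true location coordinates, using (x, y) only for 2D and (x, y, z) for 3D. A minimal sketch under that assumption (the function name and the sample coordinates are hypothetical, not values from Table V):

```python
import numpy as np


def mean_errors(estimated, true):
    """Mean 2D (x, y) and 3D (x, y, z) Euclidean errors, in the input units."""
    est = np.asarray(estimated, float)
    ref = np.asarray(true, float)
    e2d = np.linalg.norm(est[:, :2] - ref[:, :2], axis=1).mean()
    e3d = np.linalg.norm(est - ref, axis=1).mean()
    return float(e2d), float(e3d)


est = [[3.0, 4.0, 0.0], [0.0, 0.0, 12.0]]
true = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(mean_errors(est, true))
```

Per point, the 3D error can never be smaller than the 2D error, so a large gap between the two means points to errors concentrated on the Z axis.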
The 2D visualization of sound source localization results is shown in Fig. 15. The 3D illustration of localization results is shown in Fig. 16.
In this paper, we presented an algorithm for localizing a moving sound source sequence with static systems. The generic idea of this study is to define suitable observation windows via basic data mining methods and apply them to sound signals. An excitation source information based time-delay estimation algorithm is used to determine the time lag between sound channels. We described some basic terms of sound signals and sound source localization methods, and TDOA-based sound source localization is used for sound sequence localization. The Z-axis results are not accurate enough because all microphones lie on the same plane, whereas the 2D results are acceptable for moving sound sources. The TDOA problem is solved with a modified Levenberg-Marquardt algorithm in this work. After preprocessing, a clustering algorithm is used to obtain the exact number of sound source locations adaptively. In addition, we share a new database for sound source localization, separation, and determination in different types of scenarios and describe its specifications. We expect this work to be helpful for researchers working on sound processing and, especially, sound localization.