Estimating a speaker’s age from a voice recording is an important research issue, mostly because of the huge popularity of mobile and desktop devices. A voice sample is easy to collect and transfer using mobile or web services, and voice acquisition is simple and non-intrusive. The proposed procedure has a broad spectrum of potential applications, especially in cases where only a speech signal is available and the speaker’s age needs to be determined. Such a situation may occur when someone is suspected of concealing their real age.
Access to some web pages may be granted or restricted on the basis of a speech sample; for example, access to some content may be automatically limited when the user is below an age threshold. Furthermore, computer systems could adjust themselves automatically, in particular educational games and programs.
Automatic speech recognition (ASR) systems could adjust their actions to the user’s age and thus be customised without external (in particular parental) supervision.
Human-robot interaction could be improved if the machine were programmed to adapt its behaviour to the user’s age.
Age estimation of young people deserves special attention. Their age is very difficult to evaluate accurately because, during adolescence, children’s bodies change rapidly and so do their voices. Furthermore, because the tempo of growth and maturation varies widely, the structure of the vocal tract differs between early-, average- and late-maturing boys and girls of the same chronological age.
Previous works focused mainly on detecting the age group (child, adolescent, adult, senior) and gender, rather than on determining the chronological age of the subject. One earlier study proposed a method to estimate the speaker’s age more precisely, with an accuracy of 1 year; the data for evaluation were limited to the age interval of 7–12 years.
Many approaches to speaker age estimation have been proposed in the literature. Most of them use mel frequency cepstral coefficients (MFCC) as the voice signal features. In one study, formants and harmonics were used. For age classification, the most popular approaches employ the support vector machine and its modifications. Gaussian mixture models and hidden Markov models have also often been used for this purpose.
In the previously mentioned studies, the age of the person was assessed as belonging to a specific age group, e.g. a child, an adult or a senior. The aim of this work was to create a method for evaluating the exact chronological age of adolescents on the basis of their voice signal optionally supplemented with information about a person’s height.
Materials and methods
Voice samples were collected from pupils of elementary, secondary and high school. Prior to the experiments, parents’ written authorisation for performing the measurements was obtained. During all measurement sessions, 187 children and adolescents were recorded including 98 boys and 89 girls. The mean age for the whole group was 13.25±2.68 (median of age was 12.90). For girls only, the mean age was 13.42±2.76 (median was 13.40) and for boys only it was 13.21±2.82 (median was 12.70). In Table 1, the age distribution of the pupils is presented.
Chronological age of the pupils.
|Chronological age in completed years||Number of girls||Number of boys||Total number|
The samples were recorded with 16-bit resolution at a 44,100 Hz sampling rate and saved in the WAVE audio format. They were gathered using a system consisting of a microphone (Sontronics STC-80; Sontronics, Dorset, UK), a microphone pre-amplifier (IMG Stageline MPA-202; MONACOR INTERNATIONAL GmbH & Co. KG, Bremen, Germany) and a computer.
The voice samples were gathered according to the following protocol. First, each child introduced himself/herself aloud. Next, the child articulated the vowels a, e, i, o and u in extended phonation. Each vowel was displayed on the computer screen for 3 s, and the vowels appeared in a random order, with a 2-s break between vowels for taking a breath. This stage was repeated three times. The whole procedure lasted about 2 min.
Because body height is an easily measurable biological feature that changes unidirectionally during development, the voice recordings were supplemented with height measurements in order to check whether height could improve the accuracy of the proposed solution.
Voice features extraction
A set of parameters commonly used in speech analysis [such as period, fundamental frequency (F0), formants, mean autocorrelation, jitter, shimmer, noise-to-harmonic ratio (NHR), harmonic-to-noise ratio (HNR), linear prediction coefficients (LPC) and MFCC (1)] was supplemented with features used in non-speech analysis [such as tristimulus and total harmonic distortion (THD)].
Period is the length of a repeatable fragment of the speech signal. It makes it possible to distinguish between female (shorter period) and male (longer period) voices. During puberty, the voice period lengthens in both genders; however, the change is more noticeable in boys (voice mutation). A more commonly used feature is the fundamental frequency (F0), which is the inverse of the period; the two should therefore be interpreted analogously.
In speech analysis, one of the most widespread features is a set of formants. Formants are characteristic frequencies that correspond to resonances in the vocal tract during speech production. Each phone is characterised by a set of three to five formants, whose values vary within specific ranges depending on the length of the vocal tract. During puberty, when the whole body grows, the vocal tract lengthens, the larynx lowers and the vocal folds get thicker, which is reflected in the formant values.
The mean autocorrelation value depends on the periodicity of the signal. It is computed as the mean of the autocorrelation coefficients computed for each frame of the signal. For a stable adult voice (quasi-periodic signal), the mean autocorrelation is close to 1, while for voice data collected during puberty (unstable voice) it reaches lower values. Other parameters that refer to voice stability are jitter and shimmer. They describe short-term, cycle-to-cycle changes in the signal. Jitter corresponds to the absolute difference between the lengths of consecutive periods divided by the average period. Shimmer describes the analogous differences in amplitude values. The stability of the voice can also be assessed with the NHR, which is the ratio of the energy in the noise part of the signal to that in the periodic part.
More complex features are LPC and MFCC. Both are based on the spectral envelope, which changes for different phones and for different shapes of the vocal tract; therefore, the values of LPC and MFCC are significant in the assessment of voice maturity. Basic speech features were computed using the PRAAT software with the following settings:
- Window length: 30 ms
- Time step: 10 ms
- Pre-emphasis from: 50 Hz
- Pitch range: 75–600 Hz
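The jitter and shimmer definitions given above can be illustrated with a minimal Python sketch. It assumes that cycle lengths and per-cycle peak amplitudes have already been extracted from the recording; the function names and the sample values are hypothetical, and the sketch does not reproduce PRAAT's own implementation.

```python
from statistics import mean


def local_perturbation(values):
    """Mean absolute difference between consecutive values,
    divided by the mean of all values."""
    if len(values) < 2:
        raise ValueError("at least two cycles are needed")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return mean(diffs) / mean(values)


def jitter(periods_s):
    """Cycle-to-cycle variation of the period length (relative)."""
    return local_perturbation(periods_s)


def shimmer(peak_amplitudes):
    """Cycle-to-cycle variation of the amplitude (relative)."""
    return local_perturbation(peak_amplitudes)


# Hypothetical measurements, not data from the study.
periods = [0.0040, 0.0042, 0.0040, 0.0042]  # seconds per glottal cycle
amplitudes = [1.0, 1.0, 1.0, 1.0]           # perfectly stable amplitude

print(f"jitter  = {jitter(periods):.4f}")
print(f"shimmer = {shimmer(amplitudes):.4f}")
```

A perfectly periodic signal gives zero for both measures; an unstable voice, such as one recorded during mutation, would be expected to give larger values.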
Besides the commonly used features described earlier, two other parameters were computed. The first one is the tristimulus, a set of three values that enable an objective assessment of the timbre of the sound. Changes in timbre are expected to be noticeable during voice mutation. The tristimulus values express the amplitude of the fundamental ($a_1$), the sum of the amplitudes of the second ($a_2$), third ($a_3$) and fourth ($a_4$) partials, and the sum of the amplitudes of the partials higher than the fourth, each in relation to the aggregate amplitude of the established part of the signal:
$$T_1 = \frac{a_1}{\sum_{k=1}^{N} a_k}, \qquad T_2 = \frac{a_2 + a_3 + a_4}{\sum_{k=1}^{N} a_k}, \qquad T_3 = \frac{\sum_{k=5}^{N} a_k}{\sum_{k=1}^{N} a_k},$$
where $a_k$ are the amplitudes of particular harmonics, and the number of designated harmonics ($N$) was set to 10 to expose the low-frequency harmonics, which change during puberty.
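The three tristimulus values described above can be computed directly from a list of harmonic amplitudes. The following Python sketch assumes the amplitudes of the first N harmonics are already known (for example, from a spectrum); the function name and the example amplitudes are hypothetical.

```python
def tristimulus(amplitudes):
    """Return (T1, T2, T3) for harmonic amplitudes a_1..a_N, with N >= 5.

    T1: share of the fundamental,
    T2: share of harmonics 2 to 4,
    T3: share of harmonics 5 to N.
    """
    if len(amplitudes) < 5:
        raise ValueError("at least five harmonics are needed")
    total = sum(amplitudes)
    if total <= 0:
        raise ValueError("aggregate amplitude must be positive")
    t1 = amplitudes[0] / total
    t2 = sum(amplitudes[1:4]) / total
    t3 = sum(amplitudes[4:]) / total
    return t1, t2, t3


# Hypothetical amplitudes of N = 10 harmonics (not data from the study).
harmonics = [4.0, 2.0, 1.0, 1.0, 0.5, 0.5, 0.25, 0.25, 0.25, 0.25]
t1, t2, t3 = tristimulus(harmonics)
print(t1, t2, t3)
```

Because all three values share the same denominator, they always sum to 1, so a shift of energy towards higher harmonics raises the third value at the expense of the other two.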
The second parameter not commonly used in speech analysis is THD. Because it describes the level of harmonics in a waveform, it is a different measure of voice timbre. THD can be defined as the harmonic content of a waveform divided by its fundamental:
$$\mathrm{THD} = \frac{\sqrt{\sum_{k=2}^{N} a_k^{2}}}{a_1}$$
Both tristimulus and THD were computed using the Matlab environment.
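As an illustration of the THD definition, the following Python sketch uses one common form of it: the root of the summed squared amplitudes of harmonics 2 to N, divided by the amplitude of the fundamental. Which exact variant was used in the Matlab computation is an assumption here, and the function name and sample values are hypothetical.

```python
import math


def total_harmonic_distortion(amplitudes):
    """THD relative to the fundamental.

    amplitudes[0] is the fundamental (a_1); the remaining entries are
    the higher harmonics a_2..a_N.
    """
    fundamental = amplitudes[0]
    if fundamental <= 0:
        raise ValueError("fundamental amplitude must be positive")
    harmonic_content = math.sqrt(sum(a * a for a in amplitudes[1:]))
    return harmonic_content / fundamental


# Hypothetical amplitudes: fundamental 1.0, second and third harmonics.
thd_value = total_harmonic_distortion([1.0, 0.3, 0.4])
print(f"THD = {thd_value:.2f}")
```

A pure sinusoid has no higher harmonics and therefore a THD of zero.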
In order to select the features that would be the best predictors, the following procedure was applied. For each feature, the coefficient of variation $c_v$ was calculated using the following equation:
$$c_v = \frac{\sigma}{\mu},$$
where σ is the standard deviation of the feature and μ is the mean value of the feature.
Next, all features for which the value of $c_v$ was lower than 0.1 were eliminated from further analysis. From the remaining set, the best parameters were chosen using the least absolute shrinkage and selection operator (LASSO) and a 10-fold cross-validation scheme.
The predictor selection was performed for all features and for each vowel separately.
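The first selection step, the coefficient-of-variation filter, can be sketched in Python as follows. Several details are assumptions made for the illustration: the sample standard deviation is used, the absolute value of the mean is taken so that features with a negative mean are handled, and the feature names and values are hypothetical. The subsequent LASSO step is not shown.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Standard deviation divided by the (absolute) mean."""
    mu = mean(values)
    if mu == 0:
        return float("inf")  # assumption: keep zero-mean features
    return stdev(values) / abs(mu)


def filter_by_cv(features, threshold=0.1):
    """Keep only features whose coefficient of variation reaches the threshold."""
    return {
        name: column
        for name, column in features.items()
        if coefficient_of_variation(column) >= threshold
    }


# Hypothetical feature columns (one value per recorded child).
features = {
    "f0_mean": [210.0, 250.0, 180.0, 120.0],
    "nearly_constant": [1.00, 1.01, 0.99, 1.00],
}
kept = filter_by_cv(features)
print(sorted(kept))
```

A feature that barely varies across children cannot help to separate ages, which is why such columns are dropped before the more expensive LASSO selection.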
For age estimation, the random forest (RF) algorithm for regression was employed. In this approach, each tree gives a numerical output instead of a class label, and the prediction of the forest is the mean of the outputs of all trees. The RF was chosen because it is a popular and effective tool for prediction problems. Moreover, it is relatively robust to overfitting, which may occur in the case of a large feature vector.
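The averaging principle of a regression forest can be shown with a deliberately simplified Python sketch. It is a toy illustration and not the Matlab RF used in the study: each "tree" is a single-split stump trained on a bootstrap sample and one randomly chosen feature, and all names and data values are hypothetical.

```python
import random
from statistics import mean


def fit_stump(X, y, feature):
    """One-split regression tree on a single feature (split at the median)."""
    values = sorted(row[feature] for row in X)
    threshold = values[len(values) // 2]
    left = [t for row, t in zip(X, y) if row[feature] < threshold]
    right = [t for row, t in zip(X, y) if row[feature] >= threshold]
    fallback = mean(y)
    left_value = mean(left) if left else fallback
    right_value = mean(right) if right else fallback
    return lambda row: left_value if row[feature] < threshold else right_value


def fit_forest(X, y, n_trees=100, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    trees = []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        feature = rng.randrange(n_features)            # random feature choice
        trees.append(
            fit_stump([X[i] for i in sample], [y[i] for i in sample], feature)
        )
    return trees


def predict_forest(trees, row):
    """Forest output for regression: the mean of all tree outputs."""
    return mean([tree(row) for tree in trees])


# Hypothetical data: two features that change monotonically with age.
ages = list(range(9, 19)) * 2
X = [[130 + 5 * (a - 9), 300 - 10 * (a - 9)] for a in ages]
trees = fit_forest(X, ages, n_trees=50, seed=1)
print(predict_forest(trees, X[0]), predict_forest(trees, X[-1]))
```

Real regression forests grow much deeper trees and consider a random subset of features at every split; only the final averaging step is the same as in this sketch.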
The age estimation was performed on whole data sets and subsets. Separate subsets for each gender were formed. In addition, subsets containing parameters for each vowel including or excluding height were created. The database used to train and test the proposed solution consisted of 187 measurements and 236 coefficients (47 parameters for every vowel and height).
A 10-fold cross-validation was carried out 100 times, and the obtained results were averaged. The predictor selection procedure was applied in every training set in cross-validation steps. All experiments were conducted in the Matlab environment. The number of trees in the RF was set to 100 and the other values were default parameters.
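The evaluation protocol, a 10-fold cross-validation repeated several times with the results averaged, can be sketched in Python as below. The `fit` callback is a hypothetical interface: it receives only the training fold, so any data-driven step such as predictor selection stays inside it. A mean-age baseline stands in for the RF model purely to keep the example self-contained, and the data are made up.

```python
import random
from statistics import mean


def k_fold_indices(n, k, rng):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    indices = list(range(n))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def repeated_cv_mae(X, y, fit, k=10, repeats=100, seed=0):
    """Mean absolute error averaged over `repeats` runs of k-fold CV."""
    rng = random.Random(seed)
    run_maes = []
    for _ in range(repeats):
        abs_errors = []
        for test_idx in k_fold_indices(len(y), k, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(y)) if i not in held_out]
            # Feature selection and model fitting see the training fold only.
            predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
            abs_errors.extend(abs(predict(X[i]) - y[i]) for i in test_idx)
        run_maes.append(mean(abs_errors))
    return mean(run_maes)


def fit_mean_baseline(X_train, y_train):
    """Stand-in model: always predicts the mean training age."""
    mu = mean(y_train)
    return lambda row: mu


# Hypothetical data: 40 children, one feature (height in cm).
y = [9 + (i % 10) for i in range(40)]
X = [[130 + 5 * (age - 9)] for age in y]
mae = repeated_cv_mae(X, y, fit_mean_baseline, repeats=5)
print(f"baseline MAE = {mae:.2f} years")
```

Running the selection inside every training fold, as the text describes, avoids leaking information from the held-out children into the choice of predictors.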
Table 2 presents the list of the best predictors selected using the method described in the “Materials and methods” section. Because 189 parameters were chosen for the whole data set, only the features that were selected for every vowel are listed for it. The lowest number of coefficients was obtained for the vowel “a” (32), and the highest for the vowels “o” and “u” (37).
List of selected features.
|Vowel||Selected features||No. of features|
|All||Average frequency of F1, Jitter, Shimmer, HNR, standard deviation of signal period, all tristimulus values, MFCC coefficients no. 1, 3, 4, 6, 8, 10, 11 and 12, LPC coefficients no. 4, 7, 9, 11, 14–16, THD|
|a||Average frequency of F0 and formants F1, F2, F3, mean and stdev of signal period, Jitter, Shimmer, NHR, HNR, MFCC coefficients no. 1–4 and 6–12, LPC coefficients no. 2, 3, 5, 6, 8, 9, 13, 15–17, THD||32|
|e||Average frequency of F0 and formants F1, F2, F3, stdev of signal period, Jitter, Shimmer, NHR, HNR, all MFCC coefficients, LPC coefficients no. 1, 2, 6, 9, 10–16, THD, tristimulus1||34|
|i||Average frequency of F0 and formants F1, F2, stdev of signal period, Jitter, Shimmer, NHR, HNR, all MFCC coefficients, LPC coefficients no. 2, 5, 6, 10, 11, 13–16 THD, tristimulus1 and 3||34|
|o||Average frequency of F0 and formants F1, F2, F3, mean and stdev of signal period, Jitter, Shimmer, NHR, HNR, all MFCC coefficients, LPC coefficients no. 2, 3, 5–9, 11–16, THD, tristimulus1||37|
|u||Average frequency of F0 and formant F1, mean and stdev of signal period, Jitter, Shimmer, NHR, HNR, all MFCC coefficients, LPC coefficients no. 1, 4–16, THD, tristimulus3||37|
Age prediction results
Age estimation error using voice and height features (girls and boys).
|Features||Mean absolute error in years (in days)||Standard deviation of absolute error in years (in days)|
|Features for vowel “o”+height||0.35 (128)||0.21 (75)|
|Voice+height||0.40 (134)||0.26 (95)|
|Features for vowel “e”+height||0.51 (187)||0.29 (105)|
|Features for vowel “i”+height||0.51 (184)||0.31 (116)|
|Features for vowel “u”+height||0.52 (193)||0.29 (105)|
|Features for vowel “a”+height||0.66 (240)||0.32 (116)|
|Features for vowel “u”||0.79 (289)||0.55 (201)|
|Voice||0.81 (296)||0.35 (128)|
|Features for vowel “a”||0.95 (347)||0.51 (187)|
|Features for vowel “i”||1.01 (369)||0.57 (209)|
|Features for vowel “e”||1.18 (431)||0.49 (179)|
|Features for vowel “o”||1.31 (479)||0.58 (212)|
Age estimation error using voice and height features (boys).
|Features||Mean absolute error in years (in days)||Standard deviation of absolute error in years (in days)|
|Voice+height||0.37 (136)||0.28 (102)|
|Features for vowel “a”+height||0.42 (156)||0.27 (100)|
|Features for vowel “i”+height||0.43 (157)||0.29 (106)|
|Features for vowel “e”+height||0.46 (171)||0.24 (90)|
|Features for vowel “u”+height||0.56 (204)||0.29 (104)|
|Features for vowel “o”+height||0.58 (212)||0.38 (137)|
|Features for vowel “i”||0.75 (276)||0.36 (129)|
|Voice||0.79 (289)||0.48 (177)|
|Features for vowel “e”||0.86 (314)||0.46 (170)|
|Features for vowel “a”||0.90 (328)||0.44 (162)|
|Features for vowel “u”||0.92 (336)||0.52 (190)|
|Features for vowel “o”||1.04 (377)||0.53 (193)|
Age estimation error using voice and height features (girls).
|Features||Mean absolute error in years (in days)||Standard deviation of absolute error in years (in days)|
|Features for vowel “i”+height||0.40 (144)||0.18 (65)|
|Features for vowel “o”+height||0.59 (214)||0.25 (90)|
|Voice+height||0.70 (255)||0.32 (115)|
|Features for vowel “u”+height||0.73 (265)||0.41 (150)|
|Features for vowel “a”+height||0.78 (285)||0.25 (91)|
|Features for vowel “e”+height||0.87 (319)||0.48 (174)|
|Voice||1.19 (435)||0.31 (115)|
|Features for vowel “e”||1.41 (515)||0.78 (285)|
|Features for vowel “o”||1.20 (437)||0.42 (154)|
|Features for vowel “u”||1.41 (515)||0.73 (267)|
|Features for vowel “a”||1.54 (563)||0.77 (282)|
|Features for vowel “i”||1.55 (564)||0.74 (273)|
The best results for all measurements were obtained while employing all the selected voice features and height. Excluding height from the feature set worsened the results; also, when only the features of one vowel were included in the computation, the absolute error increased. Using only one vowel for both boys and girls resulted in an error between approximately 10 and 16 months. When all vowels were used, the mean absolute error was about 10 months for all voice features and about 5 months for all voice features with height.
In boys, when only one vowel combined with height was employed for the computations, the error varied from 5 months and 6 days (156 days) for the vowel “a” up to 7 months for the vowel “o”. Lower accuracy was obtained when height was excluded from the feature set: the error ranged from 9 months for the vowel “i” to 1 year and 12 days for the vowel “o”. Again, the highest accuracy was obtained when all selected voice features and height were used (4 months and 16 days, i.e. 136 days).
When the RF algorithm input was the girls’ data only, the results were worse than those for boys and for the combined group (8.5 months for all features with height). Furthermore, the errors had greater standard deviations than for boys. When only one vowel without height was used, the mean absolute error varied from 1 year 2 months 12 days (for the vowel “o”) up to 1 year 6 months 19 days (for the vowel “i”).
The standard deviation of the absolute error did not exceed 10 months. In boys, the standard deviation of the absolute error ranged from 3 to 6 months and was the smallest among all groups. The lowest value, 2 months and 5 days, was noticed for girls only when the vowel “i” and height were used. However, the highest values (9.5 months) were also noticed for girls. In the mixed group, the values ranged from 3 months 5 days to over 6 months. When the girls’ data were included in the calculations, the lowest standard deviation of the absolute error was obtained when the vowel “a” with height was used. With the boys’ data, the highest standard deviation was obtained for the vowel “o” with height excluded from the calculations. Considering girls only, the vowel “a” gave the highest standard deviation.
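The durations quoted above can be reproduced from the day counts in the tables with a small Python helper. The convention of a 365-day year and 30-day months is an inference from the quoted values (for example, 564 days given as 1 year 6 months 19 days), not something the text states, and the function name is hypothetical.

```python
def days_to_ymd(days, days_per_year=365, days_per_month=30):
    """Split a number of days into (years, months, days).

    Assumes 365-day years and 30-day months, so a remainder of
    360-364 days is reported as 12 months plus a few days.
    """
    years, rest = divmod(days, days_per_year)
    months, remaining_days = divmod(rest, days_per_month)
    return years, months, remaining_days


# Day counts taken from the result tables.
for days in (564, 437, 65, 377):
    print(days, "->", days_to_ymd(days))
```

For instance, the 437 days reported for girls with the vowel “o” correspond to 1 year 2 months 12 days under this convention.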
For children under 16, when only voice features were used, the median of the absolute error did not exceed 1 year. The lowest median error (about 6 months) was achieved for children aged 12 and the highest median value (over 1.5 years) was obtained for adolescents aged 18. In most cases, including height in the computation decreased the median error by half.
The interquartile range (IQR) of the absolute error did not exceed 11 months in any age group; this highest value was obtained for children aged 9. The lowest value (less than 4 months) was achieved for adolescents aged 18. Again, in almost all cases, including height in the feature set reduced the IQR value.
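A per-age summary of this kind can be computed as in the following Python sketch. The quartile convention (the "inclusive" method of the standard library) is an assumption, since different tools define quartiles slightly differently, and the error values are hypothetical.

```python
from collections import defaultdict
from statistics import median, quantiles


def error_summary_by_age(ages, abs_errors):
    """Return {age: (median absolute error, IQR)} for each age group."""
    groups = defaultdict(list)
    for age, err in zip(ages, abs_errors):
        groups[age].append(err)
    summary = {}
    for age, errs in sorted(groups.items()):
        # Each group needs at least two values for the quartiles.
        q1, _, q3 = quantiles(errs, n=4, method="inclusive")
        summary[age] = (median(errs), q3 - q1)
    return summary


# Hypothetical absolute errors in years for two age groups.
ages = [9] * 5 + [12] * 5
abs_errors = [0.1, 0.2, 0.3, 0.4, 0.5] + [0.2] * 5
summary = error_summary_by_age(ages, abs_errors)
for age, (med, iqr) in summary.items():
    print(f"age {age}: median = {med:.2f}, IQR = {iqr:.2f}")
```

The median and IQR are less sensitive to a few badly predicted children than the mean and standard deviation, which makes them suitable for the small per-age groups.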
The lowest absolute error (0.37±0.28 year) was obtained for boys only when all selected features were included in the prediction. In all cases, the achieved accuracy was higher for boys than for girls, which can be attributed to the fact that the change of voice with age is larger in males than in females. The highest median of the absolute error was observed for 17- and 18-year-old adolescents, which may lead to the conclusion that after puberty the age cannot be predicted accurately, because voice features and body height differ little between these two groups. The chronological age of younger children (between 9 and 12 years) was predicted more accurately than that of the older ones. Puberty in boys usually starts at age 13–14, and the changes in voice and body height are then very rapid, which results in higher error rates.
Using only one vowel for estimation resulted in a higher error. It cannot be clearly determined which of the vowels gave the best results, so any of them can be used in situations where lower accuracy is sufficient.
Adding height to the computations improved the results significantly, which suggests that it is strongly correlated with age. The relationship between these two parameters is presented in Figure 3, separately for boys and girls. The correlation is higher for boys (0.90) than for girls (0.80). This is because, in boys, body height changes significantly even in late adolescence (17–18 years), while girls usually stop growing at about the age of 15.
The other anthropometric features carry no information content that goes beyond that available from voice.
In this paper, a novel method for estimating the age of adolescents using voice features and RF has been presented. The obtained results suggest that the proposed solution is accurate. Additionally, it employs very simple sounds (vowels) only, so it is a language-independent approach. Moreover, in the described methodological approach, universal developmental events (e.g. growth spurt, changes in the speech apparatus – mutation) have been included, which proceed in the same way and with a fixed sequence of phenomena, regardless of the ethnic differences between the subjects. Therefore, the presented method, based on common directional changes in the construction of the vocal tract occurring with age, can be used to assess the age of children and adolescents from various cultural backgrounds.
This method may also be useful in life sciences research. It can be used, for example, by paediatricians and endocrinologists focusing on developmental age. The method will allow them to observe the changes occurring in the voice during adolescence and to assess the child’s development, the degree of sexual maturity and even the early effectiveness of a treatment or of the drug dose used.
Physical anthropologists will gain a new, useful and user-friendly method of biological age assessment, which will allow delay or acceleration of maturation to be determined in relation to the child’s chronological age. It can also be used on a larger scale to assess the tempo of growth and development of groups of children in population studies as a measure of the quality of life in the population, alongside measures such as age at menarche, body height or the frequency of obesity.
We would like to thank Bruce Turner for the English language corrections.
Russell M, Series RW, Wallace JL, Brown C, Skilling A. The STAR system: an interactive pronunciation tutor for young children. Comput Speech Lang 2000;14:161–75.
Kim HJ, Bae K, Yoon HS. Age and gender classification for a home-robot service. In: RO-MAN 2007 – The 16th IEEE International Symposium on Robot and Human Interactive Communication; 2007:122–6.
Bugdol MD, Bugdol MN, Lipowicz AM, Mitas AW, Bienkowska MJ, Wijata AM. Prediction of menarcheal status of girls using voice features. Comput Biol Med 2018;100:296–304.
Mirhassani SM, Zourmand A, Ting HN. Age estimation based on children’s voice: a Fuzzy-based decision fusion strategy. Sci World J 2014;2014:9.
Muller C, Burkhardt F. Combining short-term cepstral and long-term pitch features for automatic recognition of speaker age. In: Interspeech 2007, 8th Annual Conference of the International Speech Communication Association, Antwerp, Belgium; 2007:2277–80.
Metze F, Ajmera J, Englert R, Bub U, Burkhardt F, Stegmann J, et al. Comparison of four approaches to age and gender recognition for telephone applications. In: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing – Proceedings, Honolulu, HI, USA. vol. 4; 2007:IV1089–92. DOI: 10.1109/ICASSP.2007.367263.
Mahmoodi D, Marvi H, Taghizadeh M, Soleimani A, Razzazi F, Mahmoodi M. Age estimation based on speech features and support vector machine. In: CEEC’11, 3rd Computer Science and Electronic Engineering Conference, Colchester, UK; 2011:60–4. DOI: 10.1109/CEEC.2011.5995826.
Van Heerden C, Barnard E, Davel M, Van Der Walt C, Van Dyk E, Feld M, et al. Combining regression and classification methods for improving automatic speaker age recognition. In: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing – Proceedings, Dallas, TX, USA; 2010:5174–7. DOI: 10.1109/ICASSP.2010.5495006.
Li M, Han KJ, Narayanan S. Automatic speaker age and gender recognition using acoustic and prosodic level information fusion. Comput Speech Lang 2013;27:151–67.
Barkana BD, Zhou J. A new pitch-range based feature set for a speaker’s age and gender classification. Appl Acoust 2015;98:52–61.
Iseli M, Shue YL, Alwan A. Age- and gender-dependent analysis of voice source characteristics. In: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing – Proceedings, Toulouse, France. vol. 1; 2006:I389–92. DOI: 10.1109/ICASSP.2006.1660039.
Bocklet T, Maier A, Bauer JG, Burkhardt F, Nöth E. Age and gender recognition for telephone applications based on GMM supervectors and support vector machines. In: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing – Proceedings, Las Vegas, NV, USA; 2008:1605–8. DOI: 10.1109/ICASSP.2008.4517932.
Dobry G, Hecht RM, Avigal M, Zigel Y. Supervector dimension reduction for efficient speaker age estimation based on the acoustic speech signal. IEEE T Acoust Speech 2011;19:1975–85.
Meinedo H, Trancoso I. Age and gender classification using fusion of acoustic and prosodic features. In: Interspeech 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan; 2010:2818–21.
Minematsu N, Sekiguchi M, Hirose K. Automatic estimation of one’s age with his/her speech based upon acoustic modeling techniques of speakers. In: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing – Proceedings, Orlando, FL, USA. vol. 1; 2002:I/137–40. DOI: 10.1109/ICASSP.2002.5743673.
Datta AK, Singh SS, Ranjan S, Soubhik C, Kartik M, Anirban P. Signal Analysis of Hindustani Classical Music. Singapore: Springer; 2017.
Shmilovitz D. On the definition of total harmonic distortion and its effect on measurement interpretation. IEEE T Power Deliver 2005;20:526–8.
Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. Springer Series in Statistics. New York, NY, USA: Springer New York Inc.; 2001.