Theoretical Research Projects
-
Project Title:
Research and Development of Artificial Intelligence in Extraction and Identification of Spoken Language Biomarkers for Screening and Monitoring of Neurocognitive Disorders
(ONGOING)
Project Code: RGC/TBRS T45-407/19N-1
Schedule: 2020/01/01 - 2024/12/31
Description:
Population ageing is a global concern. According to WHO, the proportion of the world's population aged 60+ will nearly double to 22% by 2050, while the proportion of Hong Kong's population aged 65+ will rise to 35%. Ageing is accompanied by various high-burden geriatric syndromes, which escalate public healthcare expenditures. This situation, coupled with a shrinking workforce and narrowing tax base, jeopardizes our society's sustainability. Neurocognitive disorders (NCD) – including age-related cognitive decline, mild cognitive impairment, and various types of dementia – are particularly prominent in older adults. Dementia has an insidious onset followed by gradual, irreversible deterioration in memory, communication, judgment, and other domains; care costs are estimated at USD 1 trillion today and are expected to double by 2030. This presents a dire need for better disease screening and management.
NCD diagnoses and monitoring are largely conducted by clinical professionals face-to-face using neuropsychological tests. Such testing is limited by clinician shortages; by capturing snapshots of cognition that ignore intra-individual variability; by subjective recall of cognitive functioning; by inter-rater variability in assessment; and by language/cultural biases. To address these issues, we will develop an automated, objective, highly accessible evaluation platform based on inexpensively acquirable biomarkers for NCD screening and monitoring. Platform accessibility enables active, remote monitoring and the generation of patient alerts for prompt treatment between clinical visits. Collecting individualized "big data" over time enables flagging of subtle changes in cognition for early detection of cognitive decline. These actions will prevent under-diagnosis, enhance disease management, delay institutionalization, and lower care costs.
NCD often manifests in communicative impairments. Hence, we target spoken language biomarkers – non-intrusive alternatives to blood tests and brain scans for NCD screening and monitoring. Spoken language can be easily captured remotely, and records of speech events (e.g., latencies, dysfluencies) at millisecond resolution enable sensitive cognitive assessments. We will develop Artificial Intelligence (AI)-driven technologies to automatically extract spoken language biomarkers. Our work is novel in its comprehensive dimensional coverage of conversational spoken language dialogs (from hesitations to dialog coherence), using fit-for-purpose deep learning techniques for feature extraction and selection. Our systems will be highly adaptable across environments to ensure consistent, objective NCD assessments. Our research will offer unprecedented data and technological support for early NCD diagnoses and timely clinical care. This aligns with WHO's plan of making dementia a public health and social care priority at national and international levels. We aim to control the overwhelming burden of NCD through AI-enabled healthcare that better supports patients and caregivers in Hong Kong.
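The following minimal sketch illustrates the kind of low-level spoken language biomarker involved, computing pause statistics from a voice activity detection (VAD) segmentation; the segment format, threshold, and feature names are illustrative assumptions, not the project's actual feature set:

```python
# Minimal sketch: pause/latency statistics from a VAD segmentation.
# The segment format, threshold, and feature names are illustrative
# assumptions, not the project's actual feature set.

def pause_features(speech_segments, min_pause=0.25):
    """speech_segments: list of (start_sec, end_sec) tuples, sorted by time."""
    pauses = []
    for (s0, e0), (s1, e1) in zip(speech_segments, speech_segments[1:]):
        gap = s1 - e0
        if gap >= min_pause:          # ignore very short inter-segment gaps
            pauses.append(gap)
    total_speech = sum(e - s for s, e in speech_segments)
    return {
        "num_pauses": len(pauses),
        "mean_pause_sec": sum(pauses) / len(pauses) if pauses else 0.0,
        "pause_to_speech_ratio": sum(pauses) / total_speech if total_speech else 0.0,
    }

# Example: three speech segments separated by pauses of 0.4 s and 1.1 s.
print(pause_features([(0.0, 2.5), (2.9, 5.0), (6.1, 8.0)]))
```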
Research Assistants: Ranzo Huang
-
Project Title:
End-to-end Automatic Sign Language Recognition and Translation of the Hong Kong Sign Language
(COMPLETED)
Project Code: GRF 16200118
Schedule: 2019/01/01 - 2022/06/30
Description:
To help bridge the communication gap between the deaf minority and the hearing majority, the PI would like to develop a vision-based, large-vocabulary automatic sign language recognition and translation (ASLRT) system in this project. Although there have been research efforts on sign language recognition, most systems produce only sign glosses as output and do not translate them into the natural language of the signers' community. The PI believes that translation from a sign language to a natural language is important to facilitate communication between the deaf and the hearing communities.
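A minimal sketch of the recognize-then-translate idea: a generic CTC-based sign-to-gloss recognizer whose gloss output would then be passed to a separate gloss-to-text translation model. The layer sizes, vocabularies, and two-stage design are assumptions for illustration, not the project's actual architecture:

```python
# Illustrative sketch only: a CTC-based sign-to-gloss recognizer; the decoded
# gloss sequence would then be fed to a separate gloss-to-text translator.
import torch
import torch.nn as nn

class SignToGloss(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, num_glosses=1000):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_glosses + 1)  # +1 for the CTC blank

    def forward(self, frame_feats):           # (batch, time, feat_dim)
        enc, _ = self.encoder(frame_feats)
        return self.out(enc).log_softmax(-1)  # per-frame gloss log-probs

model = SignToGloss()
video_feats = torch.randn(2, 100, 512)        # 2 clips, 100 frames of visual features
gloss_logprobs = model(video_feats)           # decode with CTC, then translate the glosses
print(gloss_logprobs.shape)                   # torch.Size([2, 100, 1001])
```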
Research Assistants: Niu Zhe, Ronglai Zuo
-
Project Title:
Building friendly HCI for aged people in Asian countries
(COMPLETED)
Project Code: ASPIRE2019#4
Schedule: 2019/01/01 - 2021/08/31
Description:
The project team consists of experts with leading achievements in speech recognition, speaker recognition, natural language processing, and spoken dialog systems. Although substantial advances have been made in these areas for the general population, most of the work has targeted general adult users, and it is not clear whether it will also work effectively for older adults. For example, an automatic speech recognizer that works well for the general population may not work for the elderly, as the latter tend to speak more slowly, with a lower pitch, and sometimes with hesitation. Thus, much work remains to be done to develop dedicated HCI techniques for older adults. Building friendly HCI for the elderly has long-term impact and requires long-term research and collaboration among Asian universities.
Research Assistants: Yingke Zhu, Xinyuan Yu
-
Project Title:
Training Big and Deep Neural Networks for Automatic Speech Recognition
(COMPLETED)
Project Code: GRF 16215816
Schedule: 2017/01/01 - 2020/06/30
Description:
In this project, the PI will investigate detailed acoustic modeling through robust training of DNN acoustic models with a large number of tied states (some of which may be distinct states) --- in the order of tens of thousands --- for better ASR performance. The major research issues include: (a) how to train such big and deep neural networks (BDNNs) robustly; (b) how to initialize these models for DNN training; (c) how to regularize the model training, since such BDNNs contain a very large number of parameters; and (d) how to prevent the model size and the training/decoding time of such BDNNs from growing linearly with the number of output units.
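As an illustration of issue (d), one generic way to keep the parameter count from growing linearly with the number of tied-state outputs is a low-rank (bottleneck) factorization of the softmax layer. This is a hedged sketch with assumed sizes, not necessarily the approach adopted in the project:

```python
# Sketch of issue (d): a low-rank (bottleneck) factorization of the softmax
# layer so that parameters grow much more slowly with the number of outputs.
import torch
import torch.nn as nn

num_states = 30000      # tens of thousands of tied states (assumed)
hidden = 2048
rank = 256              # low-rank bottleneck before the big softmax (assumed)

full_softmax = nn.Linear(hidden, num_states)                   # direct softmax layer
factored = nn.Sequential(nn.Linear(hidden, rank, bias=False),  # 2048 x 256
                         nn.Linear(rank, num_states))          # 256 x 30000
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(full_softmax), count(factored))    # ~61.5M vs ~8.2M parameters
```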
Research Assistants: Yingke Zhu and Hengguang Huang
-
Project Title:
Speaker-Adaptive Denoising Deep Networks for SNR-Aware PLDA Speaker Verification
(Co-investigator; COMPLETED)
Project Code: GRF 15206815
Schedule: 2015/09/01 - 2018/08/31
Description:
The recent increase in identity theft has drawn public attention to the potential risks of using remote services. In fact, fraud and identity theft are on the rise in Hong Kong and mainland China because of the popularity of using mobile phones to access remote services. To save operating costs, insurance companies and banks in this region also encourage their customers to use remote services. While remote services bring great convenience and benefits to customers, they have also attracted criminals to carry out fraudulent activities. Traditionally, authentication for these services relies on usernames and passwords. This authentication method, however, is no longer adequate for safeguarding the security of remote services. Recent advances in speaker recognition technology have demonstrated that voice biometrics could be an important part of the authentication process. For example, using speaker recognition technologies, the identity of a caller can be verified in a live call, and financial transactions can be stopped if fraud is detected. Voice biometrics can also help call centers reduce the risk of leaking customers' information through social engineering fraud.
This project aims to develop a special form of deep neural network (DNN), called a denoising deep autoencoder, for robust text-independent speaker verification. Speaker-specific features, referred to as bottleneck features, are extracted from the bottleneck layer of the autoencoder. Unlike conventional autoencoders, our denoising deep autoencoders are able to reconstruct the clean speech signals even if the input signals are contaminated with noise and reverberation, which results in robust bottleneck features. Another advantage of our autoencoders is that they are adapted to the characteristics of individual speakers, which enriches the speaker-specific information in the bottleneck features. To make conventional probabilistic linear discriminant analysis (PLDA) more resilient to varying acoustic environments, two new PLDA models are proposed. These models either make explicit use of SNR information during training and scoring or incorporate the SNR variability into the generative model. The proposed bottleneck features can be readily applied to these new PLDA models. The proposed methods will be evaluated on the latest datasets provided by NIST. The proposed work will provide insight into deep learning and the extraction of speaker-dependent features from deep neural networks. It will also address the SNR and reverberation variability in speaker verification systems. The proposed framework is also valuable to other problem domains where channel and session variabilities are detrimental to performance.
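A minimal sketch of the core idea of a denoising autoencoder with a bottleneck layer: noisy acoustic features in, clean features as the reconstruction target, and the bottleneck activations taken as speaker features. The layer sizes and training setup are illustrative assumptions only:

```python
# Sketch of a denoising autoencoder with a bottleneck layer; bottleneck
# activations serve as speaker features. Sizes and data are illustrative.
import torch
import torch.nn as nn

class DenoisingBottleneckAE(nn.Module):
    def __init__(self, feat_dim=60, bottleneck=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(),
                                 nn.Linear(512, bottleneck))
        self.dec = nn.Sequential(nn.ReLU(), nn.Linear(bottleneck, 512),
                                 nn.ReLU(), nn.Linear(512, feat_dim))

    def forward(self, noisy):
        z = self.enc(noisy)            # bottleneck (speaker) features
        return self.dec(z), z

model = DenoisingBottleneckAE()
noisy, clean = torch.randn(32, 60), torch.randn(32, 60)   # toy frame batch
recon, bottleneck_feats = model(noisy)
loss = nn.functional.mse_loss(recon, clean)   # reconstruct clean from noisy input
loss.backward()
```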
Research Assistants: Yingke Zhu and Ivan Fung
-
Project Title:
Deep Multitask Learning for Automatic Speech Recognition
(COMPLETED)
Project Code: GRF 16206714
Schedule: 2014/08/01 - 2018/01/31
Description:
The PI would like to extend the concept of deep learning to deep multitask learning using deep neural networks (DNNs) in the context of ASR. Multitask learning (MTL) is a machine learning approach that aims at improving the generalization performance of a learning task by jointly learning multiple related tasks together. It has been shown that if the multiple tasks are related and share some internal representation, then by learning them together they are able to transfer knowledge to one another. As a result, the common internal representation thus learned generalizes better to future unseen data. Current acoustic modeling tries to learn phoneme likelihoods given the acoustic feature inputs in isolation. Humans, however, do not learn the sounds of a spoken language in isolation, but together with other cues such as their graphemes, the lexical context, and so forth. By identifying the related tasks and learning them together with the primary task of learning phone-state posteriors, the PI believes that the latter can be learned better. MTL can be readily implemented with an artificial neural network (ANN), and the new deep learning approach should give even better results.
The project hopes to shed new light on what other tasks may help pronunciation learning in a language. The results will further be utilized to improve current speech recognition technology.
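A minimal sketch of the MTL idea for acoustic modeling: shared hidden layers with a primary phone-state head and an auxiliary grapheme head trained jointly. The auxiliary task, layer sizes, and loss weighting are illustrative assumptions, not the project's exact configuration:

```python
# Sketch of deep multitask learning: shared layers, primary phone-state head,
# auxiliary grapheme head. Sizes, task choice, and loss weight are assumed.
import torch
import torch.nn as nn

class MTLAcousticModel(nn.Module):
    def __init__(self, feat_dim=440, hidden=1024, num_states=3000, num_graphemes=30):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, hidden), nn.ReLU())
        self.state_head = nn.Linear(hidden, num_states)        # primary task
        self.grapheme_head = nn.Linear(hidden, num_graphemes)  # auxiliary task

    def forward(self, x):
        h = self.shared(x)
        return self.state_head(h), self.grapheme_head(h)

model = MTLAcousticModel()
x = torch.randn(16, 440)                           # toy batch of spliced frames
state_logits, grapheme_logits = model(x)
state_tgt = torch.randint(0, 3000, (16,))
graph_tgt = torch.randint(0, 30, (16,))
loss = nn.functional.cross_entropy(state_logits, state_tgt) \
     + 0.3 * nn.functional.cross_entropy(grapheme_logits, graph_tgt)
loss.backward()   # gradients from both tasks update the shared layers
```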
Research Assistants: Dongpeng Chen, Wei Li, Yingke Zhu, and Hengguang Huang
-
Project Title:
Improving the Robustness and Efficiency of Exemplar-based Automatic Speech Recognition by Robust Hashing and Locality-sensitive Hashing
(COMPLETED)
Project Code: HKUST/CERG 616513
Schedule: 2013/10/01 - 2017/03/31
Description:
Hidden Markov modeling (HMM) has been the dominant technology for automatic speech recognition (ASR) for the last two decades. In spite of its success, it is well-known that HMM has its limitations due to its independence assumptions on observations and state transitions. As a result, the decoded results from an HMM are not guaranteed to be realistic trajectories of the sequence data, and HMM does not model long-span dependency among the observations in a time series. Furthermore, in 2005, Prof. Roger K. Moore of the University of Sheffield performed an analysis on the data requirements of current state-of-the-art ASR systems and human listeners over their lifetime, and concluded that if the performance of current ASR systems is to reach that of humans, they will require two to three orders of magnitude more data than a human being (which means roughly a million hours of speech training data)! The implication is that as we still cannot acquire and process such a large amount of data, one needs to make better use of the training data and to start thinking beyond HMM.
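One standard building block for speeding up exemplar-based recognition is locality-sensitive hashing. The sketch below uses random-hyperplane LSH to restrict nearest-neighbour search to exemplar frames in the same hash bucket; the hash length and bucket scheme are illustrative assumptions, not the project's specific hashing design:

```python
# Sketch of locality-sensitive hashing with random hyperplanes for fast
# approximate nearest-neighbour lookup of exemplar frames.
import numpy as np

rng = np.random.default_rng(0)
feat_dim, num_bits = 39, 16
planes = rng.standard_normal((num_bits, feat_dim))   # random hyperplanes

def lsh_key(x):
    """Map a frame to a 16-bit signature: one bit per random hyperplane."""
    bits = (planes @ x > 0).astype(int)
    return int(''.join(map(str, bits)), 2)

# Index a set of exemplar frames into hash buckets.
exemplars = rng.standard_normal((10000, feat_dim))
buckets = {}
for i, frame in enumerate(exemplars):
    buckets.setdefault(lsh_key(frame), []).append(i)

# A query frame is compared only against exemplars in its own bucket.
query = rng.standard_normal(feat_dim)
candidates = buckets.get(lsh_key(query), [])
print(len(candidates), "candidate exemplars instead of", len(exemplars))
```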
Research Assistants: Dongpeng Chen and Tom Ko
-
Project Title:
An Investigation of Using Linear Programming Method for Discriminative Training Problems in Automatic Speech Recognition
(COMPLETED)
Project Code: HKUST/CERG 617008
Schedule: 2008/09/01 - 2011/02/28
Description:
Linear weighting functions are commonly found in the decoding procedure of automatic speech recognition (ASR). For instance, in a multi-stream hidden Markov model (HMM) (which is used in discrete HMM systems, multi-band ASR, or audio-visual ASR), the state log-likelihood is usually computed as a linear combination of the per-stream state log-likelihoods; during recognition, the overall score of a test utterance is a linearly weighted sum of the acoustic score and the language score. It is known that these weights cannot be determined by maximum-likelihood estimation. Instead, discriminative training is commonly employed, by minimizing the classification errors (MCE), by maximizing the mutual information (MMI), by maximizing the entropy (MAXENT), etc.
In this project, we will estimate these linear weighting functions by casting the estimation problems as linear programming (LP) problems.
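A hedged sketch of what such an LP formulation could look like: find non-negative stream weights summing to one that maximize the worst-case score margin between each correct hypothesis and a competing one. The data, margin criterion, and solver call are illustrative assumptions rather than the project's actual formulation:

```python
# Sketch: stream-weight estimation as a linear program (maximize the minimum
# margin). Data and the margin formulation are illustrative assumptions.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
num_utts, num_streams = 50, 3
correct = rng.normal(0.5, 1.0, (num_utts, num_streams))     # per-stream scores
competitor = rng.normal(0.0, 1.0, (num_utts, num_streams))
diff = correct - competitor

# Variables x = [w_1, ..., w_K, t]; maximize t  <=>  minimize -t
c = np.r_[np.zeros(num_streams), -1.0]
A_ub = np.c_[-diff, np.ones(num_utts)]            # t - diff_i . w <= 0 for every utterance
b_ub = np.zeros(num_utts)
A_eq = [np.append(np.ones(num_streams), 0.0)]     # weights sum to one
b_eq = [1.0]
bounds = [(0, None)] * num_streams + [(None, None)]

res = linprog(c, A_ub, b_ub, A_eq, b_eq, bounds=bounds)
print("stream weights:", res.x[:num_streams], "worst-case margin:", res.x[-1])
```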
Research Assistants: Guo Li YE and Tom Ko
-
Project Title: A Study on Using Kernel Methods to Improve Speaker and Noise Adaptation for Robust Speech Recognition
(COMPLETED)
Project Code: HKUST/CERG 617507
Schedule: 2008/01/01 - 2010/06/30
Description:
Robustness against mismatched speakers, channels, and environment noise is important for the successful deployment of speech recognition systems. One solution is to adapt the acoustic models using samples from the testing environment. In the past few years, we pioneered the use of kernel methods to improve several eigenspace-based speaker adaptation algorithms, namely eigenvoice and eigenspace-based MLLR. In the machine learning community, it has already been shown that kernel methods can provide elegant nonlinear generalizations of many existing linear algorithms. In fast speaker adaptation, when only a few seconds of adaptation speech are available, we have shown that our new methods outperform their linear counterparts as well as conventional MAP and MLLR adaptation.
In this project, we seek to build on our past experience and strengths, and attempt to use kernel methods to improve other adaptation methods that are not based on an eigenspace. The reason is that, besides speaker mismatch, we would like to consider channel and noise mismatch, and an eigenspace built from training data may not be suitable for new test data when there is a mismatch in channel or noise. We will start with kernelizing MLLR and the reference speaker weighting method.
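The sketch below illustrates the nonlinear-eigenspace idea behind kernel eigenvoice methods using generic kernel PCA on speaker supervectors; it is meant only as an illustration of kernelizing a linear subspace method, not the project's actual kernelized MLLR or reference speaker weighting algorithms:

```python
# Sketch of a nonlinear speaker eigenspace via kernel PCA on supervectors
# (stacked model mean vectors). Sizes and the kernel are illustrative.
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
train_supervectors = rng.standard_normal((200, 5000))   # 200 training speakers

kpca = KernelPCA(n_components=10, kernel='rbf', gamma=1e-4)
train_coords = kpca.fit_transform(train_supervectors)   # nonlinear "eigenvoice" space

# A new speaker (estimated from a few seconds of adaptation speech) is
# represented by coordinates in the same 10-dimensional nonlinear space.
new_speaker = rng.standard_normal((1, 5000))
print(kpca.transform(new_speaker).shape)    # (1, 10)
```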
Research Assistants: Kimo LAI
-
Project Title: High Density Discrete Hidden Markov Modeling
(COMPLETED)
Project Code: HKUST/CERG 617406
Schedule: 2006/09/01 - 2009/02/28
Description:
We revisit the use of the discrete hidden Markov model (DHMM) in automatic speech recognition (ASR). The motivation is that, with the recent advances in semiconductor memory and its falling price, and the availability of very large speech corpora (of hundreds to thousands of hours of speech), we may now be able to estimate discrete densities with much larger codebook sizes without having to worry about their storage space. We call our new model the "high density discrete hidden Markov model" (HDDHMM). Our new HDDHMM differs from the traditional discrete HMM in two aspects: (1) the codebook size will be greatly increased to thousands, tens of thousands, or even more; and (2) for a d-dimensional acoustic vector, the discrete codeword can be determined in O(d) time. As in the traditional DHMM, HMM state likelihoods can still be determined quickly by O(1) table lookup.
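A minimal sketch of the two properties highlighted above: an O(d) codeword computed by quantizing each dimension independently (a product of scalar quantizers), followed by an O(1) table lookup of state log-likelihoods. The quantizer design and table contents are illustrative assumptions only:

```python
# Sketch: O(d) codeword from per-dimension scalar quantization, then O(1)
# state log-likelihood lookup. Quantizer and table values are illustrative.
import numpy as np

d, levels = 4, 8                          # dimensions, quantization levels per dimension
edges = np.linspace(-3, 3, levels - 1)    # shared scalar-quantizer bin edges (assumed)

def codeword(x):
    """O(d): quantize each dimension independently, then combine the indices."""
    idx = np.digitize(x, edges)                        # per-dimension code in [0, levels)
    return int(np.dot(idx, levels ** np.arange(d)))    # mixed-radix codeword

# Codebook size = levels ** d (4096 here); one log-likelihood per (state, code).
num_states = 10
loglik_table = np.random.default_rng(0).normal(size=(num_states, levels ** d))

frame = np.array([0.2, -1.5, 2.7, 0.0])
code = codeword(frame)              # O(d)
print(loglik_table[3, code])        # O(1) state log-likelihood lookup
```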
Research Assistants: Benny NG and Guo Li YE
-
Project Title: Joint Optimization of the Frequency-domain and Time-domain Transformations in Deriving Generalized Static and Dynamic MFCCs
Project Code: DAG05/06.EG43
Schedule: 2006/01/01 - 2008/12/31
Description:
Traditionally, static mel-frequency cepstral coefficients (MFCCs) are derived by discrete cosine transformation (DCT), and dynamic MFCCs are derived by linear regression. Their derivation may be generalized as a frequency-domain transformation of the log filter-bank energies (FBEs) followed by a time-domain transformation. In this project, we consider sequences of log FBEs as a set of spectrogram images, and investigate an image compression technique to jointly optimize the two transformations.
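For reference, a minimal sketch of the conventional two-stage derivation that this project generalizes: a frequency-domain DCT of the log FBEs gives the static MFCCs, and a time-domain linear regression over neighbouring frames gives the dynamic (delta) MFCCs. The frame counts and sizes are illustrative:

```python
# Sketch of the conventional derivation: DCT (frequency-domain transform) of
# log FBEs, then delta regression (time-domain transform). Sizes are toy values.
import numpy as np
from scipy.fftpack import dct

log_fbe = np.random.default_rng(0).normal(size=(100, 26))   # 100 frames, 26 log FBEs

static_mfcc = dct(log_fbe, type=2, axis=1, norm='ortho')[:, :13]   # frequency-domain transform

def deltas(c, N=2):
    """Time-domain regression: d_t = sum_n n*(c_{t+n} - c_{t-n}) / (2*sum_n n^2)."""
    padded = np.pad(c, ((N, N), (0, 0)), mode='edge')
    num = sum(n * (padded[N + n:len(c) + N + n] - padded[N - n:len(c) + N - n])
              for n in range(1, N + 1))
    return num / (2 * sum(n * n for n in range(1, N + 1)))

dynamic_mfcc = deltas(static_mfcc)
print(static_mfcc.shape, dynamic_mfcc.shape)   # (100, 13) (100, 13)
```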
Research Assistants: Yiu-Pong LAI
-
Project Title: Kernel Eigenvoice for Fast Speaker Adaptation in Automatic Speech Recognition
(COMPLETED)
Project Code: DAG04/05.EG09
Schedule: 2005/01/01 - 2008/10/31
Description:
To improve eigenspace-based speaker adaptation methods by the use of kernel methods.
Research Assistants: Kimo Lai, Roger HSIAO
-
Project Title: Towards Multi-Modal Human-Computer Dialog Interactions with Minimally Intrusive Biometric Security Functions
(COMPLETED)
Project Code: CA02/03.EG04
Schedule: 2003/6/28 - 2006/6/27
Description:
To improve noise/channel robustness of automatic speech recognition in a multi-modal human-computer dialogue with biometric support.
Some Collaborators:
Dr. Manhung Siu, EEE Department of HKUST;
Prof. P. C. Ching, EEE Department;
Dr. Tan Lee, EEE Department;
Dr. Helen Meng, System Engineering Department of CUHK.
Research Assistants: Kimo Lai, Roger HSIAO, Simon HO, Siu-Man CHAN
-
Project Title: Asynchronous Multi-Band Continuous Speech Recognition using HMM Composition
(COMPLETED)
Project Code: DAG01/02.EG33
Schedule: 2002/2/15 - 2006/06/30
Description:
It is an extension of the project titled "Subband-Based Robust Speech Recognition", with an emphasis on continuous speech recognition.
Research Assistants: Yik Cheung TAM, Ivan CHAN, Franco HO
-
Project Title: Discriminative Training of Non-parametric Auditory Filters for Automatic Speech Recognition
(COMPLETED)
Project Code: HKUST6201/02E
Schedule: 2002/9/1 - 2005/8/31
Description:
To design the auditory-based filters in filter-bank-based spectral analysis in a data-driven manner, with the objective of minimizing classification errors. The major difference between our proposal and a few past efforts is that we will make fewer assumptions about the functional form of the filters, and we do not assume independence between the feature extraction parameters and the model parameters.
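A hedged sketch of the non-parametric idea: each filter is a free vector of weights over FFT bins, learned jointly with the acoustic model so that gradients flow into both the front end and the classifier (cross-entropy is used here as a surrogate for classification error). The sizes and the classifier are illustrative assumptions:

```python
# Sketch: free-form (non-parametric) filterbank weights trained jointly with
# a classifier. Sizes, classifier, and loss are illustrative assumptions.
import torch
import torch.nn as nn

num_fft_bins, num_filters, num_classes = 257, 24, 40

class LearnableFilterbankFrontEnd(nn.Module):
    def __init__(self):
        super().__init__()
        # Free-form, non-negative filter shapes: no triangular/parametric form assumed.
        self.filter_weights = nn.Parameter(torch.rand(num_filters, num_fft_bins))
        self.classifier = nn.Linear(num_filters, num_classes)

    def forward(self, power_spectrum):                  # (batch, num_fft_bins)
        fbe = power_spectrum @ self.filter_weights.clamp(min=0).t()
        return self.classifier(torch.log(fbe + 1e-6))   # log filter-bank energies

model = LearnableFilterbankFrontEnd()
spec = torch.rand(8, num_fft_bins)                      # toy power spectra
labels = torch.randint(0, num_classes, (8,))
loss = nn.functional.cross_entropy(model(spec), labels)
loss.backward()   # gradients flow into both the filters and the model parameters
```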
Research Assistants: Yik Cheung TAM, Roger HSIAO, Ying-Hung AU
-
Project Title: Subband-Based Robust Speech Recognition
(COMPLETED)
Project Code: DAG00/01.EG09
Schedule: 2001/01/01 - 2003/12/31
Description:
Recently, multi-band ASR was proposed as another viable solution for robustness against band-limited noise. In this approach, the full frequency band is divided into subbands, and a separate speech recognizer is built for each band. During recognition, some decision logic is used to recombine the decisions from the individual subband recognizers. Two main issues are: (1) should the subbands be combined synchronously or asynchronously? (2) how should the subband decisions be optimally combined? In this project, we suggest that decisions from subband continuous speech recognizers can be effectively, efficiently, and optimally recombined under the hidden Markov modeling (HMM) framework using asynchronous HMM composition.
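For contrast with the asynchronous approach proposed here, a minimal sketch of the simplest synchronous, frame-level recombination: per-subband state log-likelihoods are linearly weighted and summed at each frame. The array sizes and weights are illustrative assumptions:

```python
# Sketch of synchronous, frame-level recombination of subband recognizers.
# The project itself studies the asynchronous case via HMM composition.
import numpy as np

num_frames, num_states, num_subbands = 50, 10, 4
rng = np.random.default_rng(0)
subband_loglik = rng.normal(size=(num_subbands, num_frames, num_states))
weights = np.full(num_subbands, 1.0 / num_subbands)   # one weight per subband (assumed equal)

combined = np.tensordot(weights, subband_loglik, axes=1)   # (num_frames, num_states)
print(combined.shape)
```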
Research Assistants: Yik Cheung TAM, Ivan CHAN, Franco HO
-
Project Title: Developing a Cantonese Broadcast News Speech Corpus
(COMPLETED)
Project Code: DAG97/98.EG39
Schedule: 1998/6/30 - 2001/7/29
Description:
There has been a lack of Cantonese speech corpora for developing Cantonese automatic speech recognition systems. The development of a speech corpus used to be very tedious and time-consuming --- from the design to the actual acquisition and later verification. On the other hand, radio stations nowadays have a huge archive of high-quality news reporting tapes with electronic transcriptions. In this project, Commercial Radio (Hong Kong) has kindly agreed to share their broadcast news with us to develop a Cantonese Broadcast News Corpus. The corpus will be ready in mid-2001.
Research Assistants: WONG Kwok Man