Broadcast News Segmentation Using Automatic Speech Recognition System Combination With Rescoring And Noun Unification

Date
2015-07
Authors
Ali Khalaf, Zainab
Publisher
Universiti Sains Malaysia
Abstract
Broadcast news keeps viewers informed about the latest developments, events and issues occurring in the world. Nowadays, broadcast news can be easily accessed online. There is rapid growth in the amount of news broadcast by traditional mass media such as radio, television, and cable television that is made available on the Internet. In addition, the availability of mobile phones with good cameras has allowed users to record interesting videos and share them with everyone. More than ever, there is a need for systems capable of accessing and searching the contents of broadcast news effectively and quickly. To allow searching of the spoken content in broadcast news, the spoken content must first be converted to text. Automatic processing of broadcast news sources requires an automatic speech recognition (ASR) system to decode speech into a written text transcription. A typical ASR transcription is an unstructured document that includes only words, without further formatting (i.e. punctuation and capitalization). Moreover, an ASR system produces substantial errors due to several factors that degrade its performance. These problems reduce the performance of high-level processing such as searching, summarizing, and translation. Spoken document segmentation (SDS) is a system that decodes broadcast news into a transcription and then segments the transcription into logical units before subsequent high-level processing is carried out. Manual news transcription and topic segmentation are too expensive and time-consuming. Hence, without an SDS system, access to audio archives and searches within them would be restricted to the limited number of textual documents that have been manually transcribed and segmented by humans.
Multiple hypotheses are useful because the single best recognition output still has numerous errors, even for state-of-the-art systems. Two ASR system combination approaches are proposed for automatically transcribing Malay broadcast news. These approaches combine the hypotheses produced by parallel ASR systems. Each ASR system uses a different language model: one is a generic-domain model and the other is a domain-specific model. The main idea is to take advantage of the different knowledge in each ASR system to improve the decoding result. The proposed approaches are compared with a conventional combination approach, the recognizer output voting error reduction (ROVER). The proposed approaches reduce the decoding error from 34.5% to 30.6% and 30.1%, and both outperform the conventional ROVER approach. Moreover, identifying the topic boundaries in an ASR transcription is a challenge because of the errors generated by the ASR system as well as the absence of overt punctuation and formatting. Thus, traditional topic segmentation approaches (e.g. the TextTiling algorithm) cannot work properly with the documents that result from an ASR system. To address the decoding errors in ASR transcripts, which can cause significant difficulties in word matching and interlinked relationships in topic segmentation, two approaches are proposed: noun unification and a modified TextTiling approach. Noun unification uses phonological information to identify similarly pronounced nouns and unify them; the unified nouns are then used for topic segmentation. The modified TextTiling text segmentation algorithm is based on an apriori algorithm. The topic segmentation results provide evidence that noun unification and the modified TextTiling algorithm perform better than the original TextTiling algorithm. The modified TextTiling with noun unification achieved an F-measure of 0.71; without noun unification, it achieved an F-measure of 0.62.
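To make the baseline concrete, the following is a minimal sketch of a TextTiling-style lexical cohesion segmenter applied to an unpunctuated token stream, as an ASR transcript would be. It is a simplified illustration of the general TextTiling idea only, not the modified algorithm or the noun unification step proposed in the thesis. The function names, the window size `w`, the block size `k`, and the cutoff (mean depth minus half the standard deviation) are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def gap_similarities(tokens, w=4, k=2):
    """Similarity across each gap between fixed-length pseudo-sentences.

    The transcript has no punctuation, so it is cut into windows of w tokens;
    each gap compares the k windows before it with the k windows after it.
    """
    seqs = [tokens[i:i + w] for i in range(0, len(tokens), w)]
    sims = []
    for gap in range(1, len(seqs)):
        left = Counter(t for s in seqs[max(0, gap - k):gap] for t in s)
        right = Counter(t for s in seqs[gap:gap + k] for t in s)
        sims.append(cosine(left, right))
    return sims


def depth_scores(sims):
    """Depth of each similarity valley; non-valley gaps get a depth of 0."""
    n = len(sims)
    depths = []
    for i, s in enumerate(sims):
        # A gap next to a strictly lower neighbour is not a local minimum.
        if (i > 0 and sims[i - 1] < s) or (i < n - 1 and sims[i + 1] < s):
            depths.append(0.0)
            continue
        left = i
        while left > 0 and sims[left - 1] >= sims[left]:
            left -= 1  # climb to the peak on the left
        right = i
        while right < n - 1 and sims[right + 1] >= sims[right]:
            right += 1  # climb to the peak on the right
        depths.append((sims[left] - s) + (sims[right] - s))
    return depths


def boundaries(tokens, w=4, k=2):
    """Token offsets where a topic boundary is hypothesised."""
    depths = depth_scores(gap_similarities(tokens, w, k))
    if not depths:
        return []
    mean = sum(depths) / len(depths)
    std = math.sqrt(sum((d - mean) ** 2 for d in depths) / len(depths))
    cutoff = mean - std / 2  # assumed threshold, illustrative only
    return [(i + 1) * w for i, d in enumerate(depths) if d > cutoff and d > 0]


# Toy transcript: four windows about flooding followed by four about football.
flood = ["banjir", "hujan", "sungai", "air"]
football = ["bola", "gol", "pasukan", "liga"]
print(boundaries(flood * 4 + football * 4))
```

Because the similarity is computed on exact word matches, a misrecognised word breaks the lexical link between adjacent windows. That is the weakness the abstract attributes to conventional segmenters on ASR output.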
Keywords
Broadcast news