Speech Recognition Supported by Lip Analysis

Waqqas ur Rehman Butt

Abstract

Computers have become more pervasive than ever, with a wide range of devices and multiple ways of interaction. Traditional means of human-computer interaction such as keyboards, mice and display monitors are being replaced by more natural modes such as speech, touch and gesture. The continuous progress of technology is bringing an irreversible change in the paradigms of interaction between human and machine. These modes are now used in daily life in many devices and have revolutionized the way users interact with machines. New PCs, tablets and smartphones are moving toward interaction paradigms so advanced that they will soon be completely transparent to users. Among the various modes of human-machine interaction, voice recognition is without doubt one of the most widely considered. Many attempts have been made in recent years to automate the voice communication that people use to interact with one another.

A number of researchers have shown that a speech-reading system is a beneficial complement to an audio speech recognition system, because it uses visual cues from the speaker, such as the face, in noisy environments. However, robust and precise extraction of visual features is a challenging problem in object recognition, owing to high variation in pose, lighting and facial makeup. Most existing approaches rely on constraints such as reflective markers on the subject's lips, lip movements recorded with a fixed camera position (head-mounted camera) and lip segmentation under controlled illumination conditions. Furthermore, there is no consensus on the selection of visual features or on their significance for a particular phoneme.

Speech is the natural means of communication, so it is an obvious candidate for human-computer interaction. In recent years, developments in technology, combined with a significant reduction in cost, have led to the pervasive use of automated speech recognition in a variety of systems such as telephony, human-computer interaction and robotics.

Visual speech cues are a promising source of speech information, and they are apparently unaffected by acoustic noise and by cross-talk between speakers. Visual information about the speaker, such as the area around the mouth, mouth gestures and facial expressions, is a key component of a speech recognition system.

The major problem in developing an accurate and robust speech recognition system is finding a precise visual feature extraction method. Sometimes a listener perceives the speaker incorrectly because of the effect of incompatible visual features. These visual features play a major role in the lip-reading process. These observations motivated the development of a computer speech recognition system.

I propose a speech recognition system that uses face detection, lip extraction and tracking, together with some pre-processing techniques to overcome the problems of pose and lighting variation. The proposed approach is useful for face and lip detection and tracking in a sequence of images, and for augmenting global facial features to improve recognition performance.

 

The proposed approach consists of four major parts. First, human faces and lips are detected and localized, and the lip region of interest is defined in the first frame. Second, pre-processing is applied to overcome the interference caused by illumination effects, shadow and the appearance of teeth. Third, a contour line is created from sixteen key points under geometric constraints, and the coordinates of these points are stored. Finally, the lip contour is tracked through these coordinates in the following frames. The proposed method not only adapts to lip movement but is also robust to the appearance of teeth, shadows and low-contrast environments. Extensive experiments show encouraging results and the effectiveness of the proposed method in comparison with existing methods. However, several factors were found during the experiments that may increase the error rate. The key challenge for the recognition system is to obtain precise results under different environmental conditions and disturbing visual effects such as illumination, shadow and teeth.
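The third part, building a sixteen-point contour, can be illustrated with a minimal sketch. The abstract does not say how the sixteen key points are placed, so the layout below (two lip corners plus seven upper and seven lower boundary points at evenly spaced columns) is an assumption, as are the function name `sixteen_lip_points` and the binary-mask input format.

```python
def sixteen_lip_points(mask):
    """Return 16 (x, y) key points from a binary lip mask (rows of 0/1).

    Assumed layout (not taken from the paper): left corner, 7 points on the
    upper boundary, right corner, 7 points on the lower boundary, listed
    clockwise starting from the left corner.
    """
    height, width = len(mask), len(mask[0])
    # Columns that contain at least one lip pixel.
    cols = [x for x in range(width) if any(mask[y][x] for y in range(height))]
    if not cols:
        raise ValueError("mask contains no lip pixels")

    def extent(x):
        # Topmost and bottommost lip pixel in column x.
        ys = [y for y in range(height) if mask[y][x]]
        return ys[0], ys[-1]

    def corner(x):
        # A corner is placed at the vertical middle of its column.
        top, bottom = extent(x)
        return (x, (top + bottom) // 2)

    upper, lower = [], []
    for i in range(1, 8):
        # Pick from columns known to contain lip pixels, evenly spaced.
        x = cols[round(i * (len(cols) - 1) / 8)]
        top, bottom = extent(x)
        upper.append((x, top))
        lower.append((x, bottom))

    return [corner(cols[0])] + upper + [corner(cols[-1])] + lower[::-1]


# Usage: a synthetic elliptical "lip" mask, 21 pixels wide and 11 high.
ellipse = [[1 if 25 * (x - 10) ** 2 + 100 * (y - 5) ** 2 <= 2500 else 0
            for x in range(21)] for y in range(11)]
points = sixteen_lip_points(ellipse)
```

For very narrow masks some columns are sampled more than once, so neighbouring points may coincide; a real system would need a policy for that case.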

Three pre-processing steps, namely illumination equalization, teeth detection and shadow removal, were developed after investigating edge information and global statistical characteristics, which are sensitive to uneven illumination and susceptible to the complex appearance caused by teeth and shadow. In contrast, the proposed method, which is based on local region analysis, copes successfully with complex appearance (e.g. low contrast, shadow, moustaches and teeth). A high average extraction performance is reached. The experiments also show some unsatisfactory results caused by very low contrast and a low-resolution camera.
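The abstract does not describe how illumination equalization is carried out. As a stand-in, the sketch below uses plain global histogram equalization on a grayscale image. This is a textbook technique substituted here for illustration, not the author's local-region method, and the name `equalize_histogram` is hypothetical.

```python
def equalize_histogram(image, levels=256):
    """Global histogram equalization for a grayscale image (rows of ints).

    Stand-in technique for illustration only; the paper's own illumination
    equalization step is not specified in the abstract.
    """
    # Count how often each gray level occurs.
    hist = [0] * levels
    for row in image:
        for value in row:
            hist[value] += 1

    # Cumulative distribution of gray levels.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)

    total = cdf[-1]
    cdf_min = next(c for c in cdf if c > 0)
    if total == cdf_min:
        # Only one gray level is present: nothing to stretch.
        return [row[:] for row in image]

    # Map each level so that the cumulative distribution becomes linear.
    scale = (levels - 1) / (total - cdf_min)
    lut = [max(0, round((c - cdf_min) * scale)) for c in cdf]
    return [[lut[value] for value in row] for row in image]


# Usage: a dim, low-contrast patch is stretched to the full 0..255 range.
patch = [[100, 100], [110, 120]]
equalized = equalize_histogram(patch)
flat = equalize_histogram([[7, 7], [7, 7]])
```

A global mapping like this is exactly the kind of global statistic the abstract describes as sensitive to uneven lighting, which is why a local, region-based variant would be the more faithful choice in practice.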

A standard video camera (Logitech) is used to record letters of the English alphabet uttered by users. The proposed method is an easy-to-implement and computationally efficient algorithm that is capable of locating the face, mouth and lip feature points throughout an entire image sequence. The extracted feature parameters are suitable for speech recognition and can greatly improve recognition accuracy.

An approach for detecting and tracking lip boundaries precisely is presented. The basic idea of this new approach is that it not only highlights the lips but also avoids other factors (such as false lip pixels) and recovers from failures. The new approach is implemented in the lip tracking module. From the lip boundary lines, this module builds a feature vector from a 16-point model of the speaker's lips, stores the coordinates of these points and tracks them in every image of the sequence during the utterance. The robustness of the new approach has also been evaluated by testing the system on noisy real-world facial image sequences. Experiments have shown that detecting outliers and predicting ROIs more accurately can further reduce the number of frames with localization or tracking failures.
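One simple way to picture outlier detection and recovery during tracking is sketched below. It assumes each frame yields a list of 16 (x, y) points and rejects any frame whose points jump too far from the last accepted frame, reusing the previous contour instead. The mean-displacement criterion, the `max_jump` threshold and the function name are all assumptions made for illustration; the abstract does not state the actual rule.

```python
import math


def track_with_outlier_rejection(frames, max_jump=10.0):
    """Track per-frame lip points, replacing implausible frames.

    frames: list of per-frame lists of (x, y) points, same length each.
    max_jump: hypothetical threshold on mean point displacement, in pixels.
    Returns (tracked, rejected) where rejected holds the outlier frame indices.
    """
    if not frames:
        return [], []

    tracked = [frames[0]]
    rejected = []
    for index, points in enumerate(frames[1:], start=1):
        previous = tracked[-1]
        # Mean Euclidean displacement relative to the last accepted contour.
        jump = sum(math.dist(p, q) for p, q in zip(points, previous)) / len(previous)
        if jump > max_jump:
            # Treat the frame as a tracking failure and keep the old contour.
            rejected.append(index)
            tracked.append(previous)
        else:
            tracked.append(points)
    return tracked, rejected


# Usage: the third frame jumps far away and is replaced by the second.
base = [(i, 0) for i in range(16)]
shifted = [(i + 1, 0) for i in range(16)]
wild = [(i + 100, 50) for i in range(16)]
tracked, rejected = track_with_outlier_rejection([base, shifted, wild, shifted])
```

Holding the previous contour is the simplest recovery policy; predicting the next region of interest from recent motion, as the abstract suggests, would be a natural refinement.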

 

 

Key Words: Computer Vision, Image Analysis, Illumination Equalization, Image Segmentation, Lip Detection and Tracking, Video and Image Sequence Analysis

Copyright (c) 2016 Waqqas ur Rehman Butt