Textual Echo Cancellation

Paper: arXiv

Authors: Shaojin Ding, Ye Jia, Ke Hu, Quan Wang

Abstract: In this paper, we propose Textual Echo Cancellation (TEC) - a framework for cancelling the text-to-speech (TTS) playback echo from overlapped speech recordings. Such a system can largely improve speech recognition performance and user experience for intelligent devices such as smart speakers, as the user can talk to the device while the device is still playing the TTS signal responding to the previous query. We implement this system by using a novel sequence-to-sequence model with multi-source attention that takes both the microphone mixture signal and the source text of the TTS playback as inputs, and predicts the enhanced audio. Experiments show that the textual information of the TTS playback is critical to the enhancement performance. Besides, the text sequence is much smaller in size compared with the raw acoustic signal of the TTS playback, and can be immediately transmitted to the device and the ASR server even before the playback is synthesized. Therefore, our proposed approach effectively reduces Internet communication and latency compared with alternative approaches such as acoustic echo cancellation (AEC).
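
To give a feel for the size difference between the source text and the raw playback audio mentioned in the abstract, here is a back-of-the-envelope sketch. The 16 kHz, 16-bit mono PCM format and the 2-second playback duration are assumptions made for illustration, not values taken from the paper, and compressed audio would be smaller than this.

```python
def pcm_bytes(duration_s, sample_rate_hz=16000, bytes_per_sample=2, channels=1):
    """Size of uncompressed PCM audio in bytes (format is an assumption)."""
    return int(duration_s * sample_rate_hz * bytes_per_sample * channels)


def text_bytes(text):
    """Size of the UTF-8 encoded source text in bytes."""
    return len(text.encode("utf-8"))


# One of the TTS source sentences listed on this page.
source_text = "A new school will be built."
assumed_duration_s = 2.0  # hypothetical playback length for this sentence

audio_size = pcm_bytes(assumed_duration_s)
text_size = text_bytes(source_text)
print(f"audio: {audio_size} bytes, text: {text_size} bytes, "
      f"ratio: {audio_size / text_size:.0f}x")
```

Under these assumptions the text is smaller than the audio by three orders of magnitude, which is the property the abstract relies on when it argues for sending text rather than the synthesized waveform.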

System architecture:
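
As a rough illustration of the multi-source attention idea described in the abstract, the sketch below lets one decoder query attend separately to two encoded sources (the microphone mixture and the TTS source text). The dot-product scoring, the concatenation of the two context vectors, and all names here are assumptions chosen for brevity; the paper's actual model may score and combine the sources differently.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, memory):
    """Dot-product attention: a weighted average of the memory vectors."""
    scores = [sum(q * m for q, m in zip(query, vec)) for vec in memory]
    weights = softmax(scores)
    dim = len(memory[0])
    context = [sum(w * vec[i] for w, vec in zip(weights, memory)) for i in range(dim)]
    return context, weights


def multi_source_attention(query, mixture_memory, text_memory):
    """Attend to each source separately, then concatenate the two contexts."""
    mixture_context, mixture_weights = attend(query, mixture_memory)
    text_context, text_weights = attend(query, text_memory)
    return mixture_context + text_context, mixture_weights, text_weights


# Toy encoder outputs: 3 mixture frames and 2 text tokens, each a 2-d vector.
mixture_memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_memory = [[0.5, 0.5], [2.0, 0.0]]
decoder_query = [1.0, 0.0]

context, mixture_weights, text_weights = multi_source_attention(
    decoder_query, mixture_memory, text_memory
)
print(len(context), mixture_weights, text_weights)
```

Keeping one attention per source means the decoder gets an independent alignment over the text tokens and over the audio frames at every step.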

Random audio samples from the LibriTTS test set

Data configurations under the two conditions:

Meaning of the columns in the table below:

  1. Microphone signal: The microphone mixture audio that is input to the TEC model. It is generated by summing the user's speech with a reverberated TTS playback (i.e., the interfering speech).
  2. TTS playback: The TTS playback audio. Used only by the AEC baseline systems.
  3. TTS playback source text: The source text of the TTS playback. Used only by the TEC system.
  4. TEC: The output from the proposed TEC model.
  5. AEC-NLMS: The output from a signal-processing AEC method based on normalized least mean squares (NLMS).
  6. Vanilla-Seq2seq: The output from the baseline NoSideInput model.
  7. AEC-Seq2seq: The output from the baseline AEC model.
  8. User's speech: The clean audio of the user's speech, which serves as the ground truth.
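
The microphone signal (column 1) and the AEC-NLMS baseline (column 5) can be sketched together as follows. This is a minimal toy version, not the configuration used to produce the samples: the sine wave and white noise stand in for real speech, and the room impulse response, filter length, and step size are all hypothetical values.

```python
import math
import random


def reverberate(signal, impulse_response):
    """Convolve a signal with a room impulse response, truncated to len(signal)."""
    out = [0.0] * len(signal)
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(impulse_response):
            if n - k < 0:
                break
            acc += h * signal[n - k]
        out[n] = acc
    return out


def nlms_cancel(mic, reference, num_taps=8, step=0.5, eps=1e-8):
    """Subtract an adaptively filtered copy of `reference` from `mic` (NLMS)."""
    weights = [0.0] * num_taps
    history = [0.0] * num_taps  # most recent reference samples, newest first
    enhanced = []
    for d, x in zip(mic, reference):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate  # residual, i.e. the estimate of the user's speech
        norm = sum(h * h for h in history) + eps
        weights = [w + step * error * h / norm for w, h in zip(weights, history)]
        enhanced.append(error)
    return enhanced


def power(signal):
    """Mean squared value of a signal."""
    return sum(s * s for s in signal) / len(signal)


random.seed(0)
num_samples = 4000
# Stand-ins for real audio: a quiet sine for the user, white noise for the playback.
user_speech = [0.1 * math.sin(2 * math.pi * 0.01 * n) for n in range(num_samples)]
tts_playback = [random.uniform(-1.0, 1.0) for _ in range(num_samples)]
room_ir = [0.6, 0.3, -0.2, 0.1]  # hypothetical short room impulse response

echo = reverberate(tts_playback, room_ir)
microphone = [u + e for u, e in zip(user_speech, echo)]  # column 1: the mixture
enhanced = nlms_cancel(microphone, tts_playback)  # column 5: AEC-NLMS output

tail = slice(num_samples // 2, None)  # measure after the filter has adapted
residual_before = power([m - u for m, u in zip(microphone[tail], user_speech[tail])])
residual_after = power([y - u for y, u in zip(enhanced[tail], user_speech[tail])])
print(f"echo power before: {residual_before:.4f}, after: {residual_after:.4f}")
```

Note that this baseline needs the playback waveform itself as the reference signal, whereas TEC only needs the source text.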

Single interfering voice condition

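Columns: Microphone signal | TTS playback (only for AEC) | TTS playback source text (only for TEC) | TEC (proposed) | AEC-NLMS | Vanilla-Seq2seq | AEC-Seq2seq | User's speech (ground truth)
TTS playback source text of each sample (the other columns are audio clips):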
The Presidential vehicle in use in Dallas, described in chapter 2,
Someone sitting on the box facing the window would have his palm in this position if he placed his hand alongside his right hip.
The state side contained twelve good-sized rooms,
Marina Oswald appeared before the Commission again on June 11, 1964,
The bills were sent as a matter of form to the drawer to have the date added, and the forgery was at once detected.

Multiple interfering voices condition

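Columns: Microphone signal | TTS playback (only for AEC) | TTS playback source text (only for TEC) | TEC (proposed) | AEC-NLMS | Vanilla-Seq2seq | AEC-Seq2seq | User's speech (ground truth)
TTS playback source text of each sample (the other columns are audio clips):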
It is a job creation scheme.
The Government yesterday announced its proposed new pension account.
The first thing is to assess the damage that has been done.
I am the head.
A new school will be built.