This research paper introduces TiCodec, a neural speech codec designed to make language model-based Text-to-Speech (TTS) models more efficient and effective. Such TTS models operate on the discrete token sequences produced by a codec, and when those sequences are excessively long, prediction accuracy suffers. TiCodec addresses this by encoding time-invariant information (such as speaker and acoustic characteristics that change little within an utterance) into a separate code, reducing the frame-level information that must be encoded and thereby decreasing the number of tokens. The study finds that TiCodec improves the quality of reconstructed speech with fewer tokens, and improves the similarity and naturalness while reducing the word error rate of synthesized speech.
Publication date: 4 Oct 2023
Project Page: Not provided
Paper: https://arxiv.org/pdf/2310.00014
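To make the core idea concrete, below is a minimal sketch of separating a single time-invariant (utterance-level) code from per-frame codes. The module names, dimensions, nearest-neighbour quantizer, and subtraction of the global embedding are illustrative assumptions, not the authors' actual TiCodec architecture or training objective.

```python
# Illustrative sketch only: frame-level features are quantized per frame, while a
# time-invariant embedding is pooled over time and quantized once per utterance.
import torch
import torch.nn as nn


class NearestNeighbourQuantizer(nn.Module):
    """Toy vector quantizer: maps each vector to its closest codebook entry."""

    def __init__(self, codebook_size: int, dim: int):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(codebook_size, dim))

    def forward(self, x):
        # x: (..., dim) -> squared distances to each codebook entry: (..., codebook_size)
        dists = (x.unsqueeze(-2) - self.codebook).pow(2).sum(dim=-1)
        indices = dists.argmin(dim=-1)      # discrete tokens
        quantized = self.codebook[indices]  # (..., dim)
        return quantized, indices


class TimeInvariantCodecSketch(nn.Module):
    """Separates an utterance-level code from per-frame codes (hypothetical layout)."""

    def __init__(self, dim=128, frame_codebook=1024, global_codebook=256):
        super().__init__()
        # Crude framing: one conv layer standing in for a real encoder stack.
        self.frame_encoder = nn.Conv1d(1, dim, kernel_size=320, stride=320)
        self.frame_vq = NearestNeighbourQuantizer(frame_codebook, dim)
        self.global_vq = NearestNeighbourQuantizer(global_codebook, dim)

    def forward(self, wav):                           # wav: (batch, samples)
        feats = self.frame_encoder(wav.unsqueeze(1))  # (batch, dim, frames)
        feats = feats.transpose(1, 2)                 # (batch, frames, dim)

        # Time-invariant path: pool over time, quantize once per utterance.
        global_q, global_tokens = self.global_vq(feats.mean(dim=1))

        # Frame-level path: quantize what remains after removing the global part,
        # so the per-frame tokens need to carry less information.
        frame_q, frame_tokens = self.frame_vq(feats - global_q.unsqueeze(1))
        return frame_tokens, global_tokens


if __name__ == "__main__":
    codec = TimeInvariantCodecSketch()
    frame_tokens, global_tokens = codec(torch.randn(2, 16000))  # 1 s at 16 kHz
    print(frame_tokens.shape, global_tokens.shape)              # (2, 50) and (2,)
```

The design point being illustrated is that a downstream TTS language model would only need to predict the short per-frame token stream plus one utterance-level token, rather than packing all information into long frame-level sequences.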