Training powerful text-to-speech models requires sufficiently powerful hardware. A recent study published by OpenAI drives the point home: it found that since 2012, the amount of compute used in the largest AI training runs has grown by more than 300,000 times. In pursuit of less demanding models, researchers at IBM developed a new lightweight and modular method for speech synthesis. They say it’s able to synthesize high-quality speech in real time by learning different aspects of a speaker’s voice, making it possible to adapt to new speaking styles and voices with small amounts of data.

The IBM team’s system consists of three interconnected parts: a prosody feature predictor, an acoustic feature predictor, and a neural vocoder. The prosody predictor learns the duration, pitch, and energy of speech samples, with the goal of better representing a speaker’s style. The acoustic feature predictor, for its part, creates representations of the speaker’s voice from the training or adaptation data, while the vocoder generates speech samples from the acoustic features.
All components work together to adapt the synthesized voice to a target speaker via retraining on a small amount of data from that speaker. In a test in which volunteers listened to and rated the quality of pairs of synthesized and natural voice samples, the team reports that the model maintained high quality and similarity to the original speaker for voices trained on as little as five minutes of speech.
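
To make the three-stage design concrete, here is a minimal Python sketch of how a prosody predictor, an acoustic feature predictor, and a vocoder might be chained. It is not IBM’s implementation: every name in it (Prosody, predict_prosody, predict_acoustic, vocode, synthesize) is hypothetical, and each stage is a hand-written toy rule standing in for what would be a trained neural network in a real system. It only illustrates how data could flow between modular stages.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Prosody:
    """Per-symbol prosody targets: duration (in frames), pitch (Hz), energy."""
    duration: int
    pitch: float
    energy: float


def predict_prosody(symbols: List[str]) -> List[Prosody]:
    # Toy stand-in for the prosody feature predictor (assumption: a real
    # system would use a trained model). Vowels are made longer and louder.
    vowels = {"a", "e", "i", "o", "u"}
    result = []
    for s in symbols:
        if s.lower() in vowels:
            result.append(Prosody(duration=3, pitch=120.0, energy=1.0))
        else:
            result.append(Prosody(duration=2, pitch=100.0, energy=0.5))
    return result


def predict_acoustic(prosody: List[Prosody]) -> List[List[float]]:
    # Toy stand-in for the acoustic feature predictor: emit one
    # [pitch, energy] frame for every frame of predicted duration.
    frames = []
    for p in prosody:
        frames.extend([p.pitch, p.energy] for _ in range(p.duration))
    return frames


def vocode(frames: List[List[float]],
           samples_per_frame: int = 4,
           sample_rate: int = 16000) -> List[float]:
    # Toy stand-in for the neural vocoder: render each frame as a short
    # stretch of a sine wave (no phase continuity between frames).
    wave = []
    for i in range(len(frames) * samples_per_frame):
        pitch, energy = frames[i // samples_per_frame]
        t = i / sample_rate
        wave.append(energy * math.sin(2 * math.pi * pitch * t))
    return wave


def synthesize(symbols: List[str]) -> List[float]:
    # Chain the three stages: prosody -> acoustic features -> waveform.
    prosody = predict_prosody(symbols)
    frames = predict_acoustic(prosody)
    return vocode(frames)
```

Calling synthesize(list("hi")) runs all three stages in order and returns a list of waveform samples. Because each stage only depends on the output of the one before it, a single stage could in principle be retrained or swapped without touching the others, which is the kind of modularity the researchers describe.
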
The work served as the basis for IBM’s new Watson TTS service, which can be heard here. (Select “V3” voices from the dropdown menu.)
The new research comes months after IBM scientists detailed natural language processing techniques that cut AI speech recognition training time from a week to 11 hours. Separately, in May, an IBM team took the wraps off a novel system that achieves “industry-leading” results on broadcast news captioning tasks.