Time passes quickly in the gaming world. Ten years ago, Ezio Auditore da Firenze was running across the rooftops of Renaissance Italy for the first time, Nathan Drake was fleeing snow zombies in Uncharted 2, and we’d just discovered the joys of the looter shooter in Borderlands. Today we’re two years on from the last of the Uncharted games, we’ve seen a staggering nine further Assassin’s Creed games, and we’re just a few months away from the third Borderlands game with a brand-new cast of vault hunters.
When you think about gaming in 2009 versus today, one thing becomes apparent: games are far more complex and immersive. Reviewers often focus on graphics, but sound is just as important – there’s a world of difference between Halo’s kid-friendly gunfire and the harsh realism of DayZ.
Major NPCs vs. the Crowd
Speech is similar: the characters you interact with can be among the most memorable parts of a game. Unfortunately, this often comes down to individual voice actors – Patrick Stewart’s gravitas way back in Oblivion, the earnest maturity of Ashly Burch’s Aloy in Horizon Zero Dawn, or the sheer zaniness of Tiny Tina in the Borderlands franchise (incredibly, also voiced by Ashly Burch).
But outside of a series like Fable, with its all-English voice cast, or the cockney orcs of the Shadow of Mordor series, speech can often be an afterthought. Voice actors in games are infamously badly paid and expected to deliver tens if not hundreds of hours of dialogue. Until recently, there hasn’t been a way to make this process easier or faster – unless you wanted to use a synthetic voice that sounded like the late Stephen Hawking’s.
The Science of Speech Synthesis
In recent years, this has all changed. In the 1990s, many synthetic voice systems used a technique called unit selection: voice actors would spend a long time – around thirty hours – in a recording studio, and an algorithm would break their speech down into its component units. These units would then be reassembled into artificial speech and ‘smoothed’ at the joins so that the system could pronounce almost any word. The process produced high-quality audio but took approximately three months from end to end.
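For the curious, here’s a minimal Python sketch of that idea – the tiny ‘database’, the pitch-matching selection and the crossfade ‘smoothing’ are all invented for illustration, not a real synthesis pipeline:

```python
import numpy as np

# Hypothetical unit database: each diphone maps to candidate recordings,
# stored as raw audio samples plus a simple acoustic feature (pitch in Hz).
UNIT_DB = {
    "h-e": [{"audio": np.random.randn(800), "pitch": 120.0}],
    "e-l": [{"audio": np.random.randn(700), "pitch": 118.0},
            {"audio": np.random.randn(750), "pitch": 131.0}],
    "l-o": [{"audio": np.random.randn(900), "pitch": 125.0}],
}

def select_units(diphones, target_pitch=120.0):
    """For each diphone, pick the candidate whose pitch best matches the target."""
    chosen = []
    for d in diphones:
        candidates = UNIT_DB[d]
        chosen.append(min(candidates, key=lambda u: abs(u["pitch"] - target_pitch)))
    return chosen

def concatenate(units, crossfade=80):
    """Join the selected units, 'smoothing' each boundary with a short linear crossfade."""
    out = units[0]["audio"]
    for unit in units[1:]:
        nxt = unit["audio"]
        fade = np.linspace(0.0, 1.0, crossfade)
        blended = out[-crossfade:] * (1 - fade) + nxt[:crossfade] * fade
        out = np.concatenate([out[:-crossfade], blended, nxt[crossfade:]])
    return out

speech = concatenate(select_units(["h-e", "e-l", "l-o"]))
```

A real unit-selection system works with tens of hours of recordings and far richer join costs, which is exactly why building one took months.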
Later speech synthesis technology attempted to accelerate and improve this process without requiring so much time up front. Parametric modelling, for example, uses statistical models to pick speech units from a pre-existing database of sounds; it’s faster – requiring only two or three hours of recording – but doesn’t preserve much of the individuality or tone of the original recordings.
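To make that concrete, here’s a toy sketch of statistical selection as described above – a made-up, one-feature Gaussian model per phone scoring candidate units (real systems model many more acoustic features):

```python
import math

# Hypothetical per-phone statistical model: mean and variance of one acoustic
# feature (pitch in Hz) estimated from a few hours of recordings.
PHONE_MODELS = {
    "e": {"mean": 122.0, "var": 25.0},
    "l": {"mean": 118.0, "var": 30.0},
}

def log_likelihood(pitch, model):
    """Log-probability of an observed pitch under a 1-D Gaussian model."""
    return -0.5 * (math.log(2 * math.pi * model["var"])
                   + (pitch - model["mean"]) ** 2 / model["var"])

def pick_unit(phone, candidates):
    """Choose the database unit the statistical model finds most probable."""
    model = PHONE_MODELS[phone]
    return max(candidates, key=lambda u: log_likelihood(u["pitch"], model))

best = pick_unit("e", [{"pitch": 110.0}, {"pitch": 124.0}])  # -> the 124 Hz unit
```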
Deep Neural Network (DNN) systems go a step further: the connections in the network are ‘weighted’ and trained repeatedly to improve the synthesised speech. These systems compare their output against ‘correct’ examples of real speech patterns and adjust those weights according to which parts came out right or wrong.
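In code terms, the training loop behind this looks roughly like the bare-bones sketch below – hypothetical linguistic features in, acoustic targets out, with the weights repeatedly nudged towards the ‘correct’ examples. It’s a generic illustration, not any particular vendor’s model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: rows of linguistic features (phone identity, stress, position...)
# paired with 'correct' acoustic targets (e.g. pitch, duration) from recordings.
X = rng.normal(size=(200, 8))                                        # made-up features
Y = X @ rng.normal(size=(8, 2)) + 0.1 * rng.normal(size=(200, 2))    # made-up targets

# One hidden layer; the 'weights' below are the elements that get adjusted.
W1 = rng.normal(scale=0.1, size=(8, 16))
W2 = rng.normal(scale=0.1, size=(16, 2))
lr = 0.01

for step in range(1000):
    hidden = np.tanh(X @ W1)       # forward pass
    pred = hidden @ W2
    err = pred - Y                 # how wrong each output is
    # Backpropagation: nudge the weights to reduce the error on the examples.
    grad_W2 = hidden.T @ err / len(X)
    grad_hidden = (err @ W2.T) * (1 - hidden ** 2)
    grad_W1 = X.T @ grad_hidden / len(X)
    W1 -= lr * grad_W1
    W2 -= lr * grad_W2
```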
Technology has now started to move beyond this. What we are seeing today is an evolution of that approach in WaveNet-style synthesis, in which the underlying structure and flow of speech itself is picked apart by a neural network. The result is a much smoother, cleaner listening experience. It can, however, be time-consuming to generate because of the sheer amount of analysis involved – although, that said, throwing more compute at the problem is a much easier fix than improving the algorithms themselves!
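Why so time-consuming? Because generation happens sample by sample: every new audio sample depends on the ones before it. The sketch below stubs out the network itself (the real thing is a deep stack of dilated causal convolutions) just to show that control flow – at 16 kHz, even half a second of audio means 8,000 sequential predictions:

```python
import numpy as np

rng = np.random.default_rng(0)

RECEPTIVE_FIELD = 1024   # how many past samples the model 'looks at' (illustrative)
SAMPLE_RATE = 16000      # 16 kHz output, so 16,000 predictions per second

def predict_next_sample(history):
    """Stand-in for a trained WaveNet-style network: a real system maps past
    samples to a distribution over the next one; here we just return a value
    close to the previous sample to keep the example self-contained."""
    return 0.99 * history[-1] + 0.01 * rng.normal()

def generate(seconds=1.0):
    """Autoregressive generation: each new sample depends on the ones before,
    which is why WaveNet-style synthesis is expensive to run."""
    audio = list(rng.normal(size=RECEPTIVE_FIELD) * 0.01)   # seed context
    for _ in range(int(seconds * SAMPLE_RATE)):
        audio.append(predict_next_sample(audio[-RECEPTIVE_FIELD:]))
    return np.array(audio[RECEPTIVE_FIELD:])

clip = generate(0.5)     # half a second = 8,000 sequential model calls
```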
The Role of Speech
With these innovations, we have often wondered why Apple and Google have focused on creating just one voice each, rather than the cornucopia of wonderfully varied voices we have in real life. Technical considerations aside, it’s important to remember that this doesn’t just apply to in-game speech: synthesised speech is also useful during development, for example as placeholder dialogue while developers are still building an NPC encounter. Some encounters may well be cut from the finished game, so using synthesised speech means voice actors don’t have to work (and bill) for more than is needed, while still letting developers flesh out new encounters and even whole chapters.
In-game, synthesised speech can also give non-player characters far greater variety. It was a huge hit when, in the film Captain America: The Winter Soldier, Steve Rogers’ notebook of ‘things to catch up on’ was localised for each regional release: in the UK it included The Beatles, Rocky and the 1966 World Cup Final, while Australia got Steve Irwin and Tim Tams. There’s no reason why the Scottish release of the next Elder Scrolls game couldn’t feature mostly local accents.
Using synthetic speech would also give developers far greater flexibility with scripts: it’s much easier to write a few extra lines of dialogue than to bring an actor in to read the same line twenty times. With this kind of system, 2011’s hit Skyrim could have given us ‘swords in the arm’ as well as ‘arrows in the knee’.
Personalised Voices: Into the Future
Finally, what of gamers themselves? Many game protagonists are the ‘strong and silent’ type, but many more are chattier – and it can be a tense moment for a games producer when they decide to give their character a voice. In the Dead Space series, for example, leading character Isaac Clarke was completely silent in the first game but was granted a voice in the second, adding further depth to the tormented engineer’s journey. Synthetic speech technology could allow players to customise their character’s voice at the start of a game, or – with voice cloning – even let individual players add their own voice to the game itself!
All of these developments present a great opportunity for games studios throughout the creation and production process. With ever greater demands on actors’ time, ever higher expectations of production quality and only small increases in the price of games themselves, developers should seize every opportunity to increase flexibility and reduce costs. Synthetic speech is progressing fast, and if it can restore a voice to the likes of Jamie Dupree, an American radio journalist, and Peter Scott-Morgan, a roboticist with motor neurone disease, then the opportunities it opens up in the virtual world of gaming are profound.
Credit: Matthew at CereProc