Transcription Accuracy: Adobe Sensei vs Transcriptive A.I.

Speechmatics, one of the A.I. engines we support, recently released a new speech model which promised much higher accuracy. Transcriptive Rough Cutter now supports that if you choose the Speechmatics option. Also with Premiere able to generate transcripts with Adobe Sensei, we get a lot of questions about how it compares to Transcriptive Rough Cutter.

So we figured it was a good time to do a test of the various A.I. speech engines! (Actually we do this pretty regularly, but only occasionally post the results when we feel there’s something newsworthy about them)

You can read about the A.I. testing methodology in this post if you’re interested or want to run your own tests. But, in short, Word Error Rate is what we pay most attention to. It’s simply:

NumberOfWordsMissed / NumberOfWordsInTranscript

where NumberOfWordsMissed = the number of words in the corrected transcript that the A.I. failed to recognize. If instead of the word ‘Everything’ the A.I. produced ‘Even ifrits sing’, it still missed just one word. In the reverse situation, it would count as three missed words.

We also track punctuation errors, but those can be somewhat subjective, so we put less weight on that.

What’s the big deal between 88% and 93% Accuracy?

Every 1% of additional accuracy means roughly 15% less incorrect words. A 30 minute video has, give or take, about 3000 words. So with Speechmatics you’d expect to have, on average, 210 missed words (7% error rate) and with Adobe Sensei you’d have 360 missed words (12% error rate). Every 10 words adds about 1:15 to the clean up time. So it’ll take about 18 minutes more to clean up that 30 minute transcript if you’re using Adobe Sensei.

Every additonal 1% in accuracy means 3.5 minutes less of clean up time (for a 30 minute clip). So small improvements in accuracy can make a big difference if you (or your Assistant Editor) needs to clean up a long transcript.

Of course, the above are averages. If you have a really bad recording with lots of words that are difficult to make out, it’ll take longer to clean up than a clip with great audio and you’re just fixing words that are clear to you but the A.I. got wrong. But the above numbers do give you some sense of what the accuracy value means back in the real world.

The Test Results!

All the A.I.s are great at handling well-recorded audio. If the talent is professionally mic’d and they speak well, you should get 95% or better accuracy. It’s when the audio quality drops off that Transcriptive and Speechmatics really shine (and why we include them in Transcriptive Rough Cutter). And I 100% encourage you to run your own tests with your own audio. Again, this post outlines exactly how we test and you can easily do it yourself.

Speechmatics New is the clear winner, with a couple first place finishes, no last place finishes, and at 93.3% rate overall (you can find the spreadsheet with results and the audio files further down the post). One caveat… Speechmatics takes about 5x as long to process. So a 30 minute video will take about 3 minutes with Transcriptive A.I. and 15-20 minutes with Speechmatics. If you select Speechmatics in Transcriptive, you’re getting the new A.I. model.

Adobe Sensei is the least accurate with two last place finishes and no first places, for a 88.3% accuracy overall. Google, which is another A.I. service we evaluate but currently don’t use, is all over the place. Overall, it’s 80.6%, but if you remove the worst and best examples, it’s a more pedestrian 90.3%. No idea why it failed so badly on the Bill clip, but it’s a trainwreck. The Bible clip is from a public domain reading of the bible, which I’m guessing was part of Google’s training corpus. You rarely see that kind of accuracy unless the A.I. was trained on it. Anyways, this inconsistency is why we don’t use it in Transcriptive.

Here are the clips we used for this test:

Bill Clip

Zoom clip

Bible clip

Scifi clip

Flower clip

Here’s the spreadsheet of the results (SM = Speechmatics, Green means best performance, Orange means worst). Again, mostly we’re focused on the Word Accuracy. Punctuation is a secondary consideration:

Anarchyjim

Transcription Accuracy: Adobe Sensei vs Transcriptive A.I.

Leave a Reply

Wherein Jim Tierney rants and opines about After Effects, Premiere Pro, Final Cut Pro, and other nonsense