How do We Test Speech-to-Text Services for Accuracy?

Transcriptive-A.I. doesn’t use a single A.I. services on the backend. We don’t have our own A.I., so like most companies that offer transcription we use one of the big companies (Google, Watson, Speechmatics, etc).

We initially started off with Speechmatics as the ‘high quality’ option. And they’re still very good (as you’ll see shortly), but not always. However, since we had so many users that liked them, we still give you the option to use them if you want.

However, we’ve now added Transcriptive-A.I. This uses whatever A.I. services we think is best. It might use Speechmatics, but it might also use one of a dozen other services we test.

Since we encourage users to test Transcriptive-A.I. against any service out there, I’ll give you some insight on how we test the different services and choose which to use behind the scenes.

Usually we take 5-10 audio clips of varying quality that are about one minute long. Some very well recorded, some really poorly recorded and some in between. The goal is to see which A.I. works best overall and which might work better is certain circumstances.

When grading the results, I save out a plain text file with no timecode, speakers or whatever. I’m only concerned about word accuracy and, to a lesser degree, punctuation accuracy. Word accuracy is the most important thing. (IMO) For this purpose, Word 2010 has an awesome Compare function to see the difference between the Master transcript (human corrected) and the A.I. transcript. Newer versions of Word might be better for comparing legal documents, but Word 2010 is the best for comparing A.I. accuracy.

Also, let’s talk about the rules for grading the results. You can define what an ‘error’ is however you want. But you have to be consistent about how you apply the definition. Applying them consistently matters more than the rules themselves. So here are the rules I use:

1) Every word in the Master transcript that is missed counts as one error. So ‘a reed where’ for ‘everywhere’ is just one error, but ‘everywhere’ for ‘every hair’ is two errors.
2) ah, uh, um are ignored. Some ASRs include them, some don’t. I’ll let ‘a’ go, but if an ‘uh’ should be ‘an’ it’s an error.
3) Commas are 1/2 error and full stops (period, ?) are also 1/2 error but there’s an argument for making them a full error.
4) If words are correct but the ASR tries to separate/merge them (e.g. ‘you’re’ to ‘you are’, ‘got to’ to ‘gotta’, ‘because’ to ’cause) it does not count as an error.

That’s it! We then add up the errors, divide that by the number of words that are in the clip, and that’s the error rate!

Leave a Reply

Your email address will not be published. Required fields are marked *