Testing The Accuracy of Artificial Intelligence (A.I.) Services

When A.I. works, it can be amazing. BUT you can waste a lot of time and money when it doesn’t work. Garbage in, garbage out, as they say. But what is ‘garbage’ and how do you know it’s garbage? Hopefully, that’s one of the questions this post will help answer.

Why Even Bother?

It’s a bit tedious to do the testing, but being able to identify the most accurate service will save you a lot of time in the long run. Cleaning up inaccurate transcripts, metadata, or keywords is far more tedious and problematic than doing a little testing up front. So it really is time well spent.

One caveat… There are a lot of potential ways to use A.I., and this is only going to cover Speech-to-Text because that’s what I’m most familiar with due to Transcriptive and getting A.I. transcripts in Premiere. But if you understand how to evaluate one use, you should, more or less, be able to apply your evaluation method to others. (e.g. for testing audio, you want varying audio quality among your samples; if testing images, you want varying quality (low light, blurriness, etc.) among your samples.)

At Digital Anarchy, we’re constantly evaluating a basket of A.I. services to determine what to use on the backend of Transcriptive. So we’ve had to come up with a methodology to fairly test how accurate they are. Most of the people reading this are in a bit of a different situation… testing solutions from various vendors that use A.I. instead of testing the A.I. directly. However, since different vendors use different A.I. services, this methodology will still be useful for comparing the accuracy of the A.I. at the core of those solutions. There may, of course, be other features of a given solution that affect your decision to go with one or the other, but at least you’ll be able to compare accuracy objectively.

Here’s an outline of our method:

  1. Always use new files that haven’t been processed before by any of the A.I. services.
  2. Keep them short. (1-2min)
  3. Choose files of varying quality.
  4. Use a human transcription service to create the ‘test master’ transcript.
    • Have someone do a second pass to correct any human errors.
  5. Create a set of rules for what counts as one error, a half error, or two errors (for both words and punctuation).
    • If you change them halfway through the test, you need to re-test everything.
  6. Apply them consistently. If something is ambiguous, create a rule for how it will be handled and always apply it that way.
  7. Compare the results and may the best bot win.
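The comparison step in the outline above can be sketched in code. This is a minimal illustration using a word-level diff as a stand-in for a full error count; the `word_accuracy` helper and the sample sentences are made up for this post, not taken from any actual service or tool.

```python
# Minimal sketch: score an A.I. transcript against the human 'test master'
# using a word-level diff. Real scoring would also apply punctuation rules.
import difflib

def word_accuracy(master: str, candidate: str) -> float:
    """Return the percent of master words the candidate transcript matched."""
    master_words = master.lower().split()
    candidate_words = candidate.lower().split()
    matcher = difflib.SequenceMatcher(None, master_words, candidate_words)
    # Count words in the master that were matched exactly, in order.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / len(master_words)

master = "the quick brown fox jumps over the lazy dog"
ai_guess = "the quick brown fox jumped over a lazy dog"
print(round(word_accuracy(master, ai_guess), 1))  # two word errors out of nine
```

A plain diff like this treats every substitution equally; the rules in step 5 are where you decide whether some mistakes count for more or less.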

May The Best Bot Win: Visualizing

[Chart: Accuracy rates for different A.I. services]

The main chart compares each engine on a specific file (i.e. File #1, File #2, etc.), using both word and punctuation accuracy. This is really what we use to determine which is best, as punctuation matters. It also shows where each A.I. has strengths and weaknesses. The second, smaller chart shows each service from best result to worst result, using only word accuracy. Every A.I. will eventually fall off a cliff in terms of accuracy. This chart shows you the ‘profile’ for each service and can be a slightly clearer way of seeing which is best overall, ignoring specific files.
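As a rough sketch of how that second, best-to-worst ‘profile’ view can be built: sort each service’s per-file word-accuracy scores in descending order, then rank the services. The service names and numbers below are invented for illustration.

```python
# Hypothetical per-file word-accuracy scores (percent) for three services.
results = {
    "Service A": [97.1, 94.5, 88.0, 62.3],
    "Service B": [95.8, 95.2, 91.4, 80.1],
    "Service C": [98.0, 90.2, 71.5, 55.0],
}

# The 'profile': each service's scores sorted best to worst,
# ignoring which specific file produced which score.
profiles = {name: sorted(scores, reverse=True) for name, scores in results.items()}

# Rank services by average accuracy across all test files.
ranking = sorted(profiles, key=lambda n: sum(profiles[n]) / len(profiles[n]), reverse=True)
for name in ranking:
    print(name, profiles[name])
```

Plotted as lines, these profiles make the ‘falling off a cliff’ behavior easy to see: the service whose curve stays high the longest handles your worst files best.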

First it’s important to understand how A.I. works. Machine Learning is used to ‘train’ an algorithm. Usually millions of bits of data that have been labeled by humans are used to train it. In the case of Speech-to-Text, these bits are audio files with human transcripts. This allows the A.I. to identify which audio waveforms, the word sounds, go with which bits of text. Once the algorithm has been trained, we can then send audio files to it and it makes its best guess as to which word each waveform corresponds to.

A.I. algorithms are very sensitive to what they’ve been trained on. The further you get away from what they’ve been trained on, the more inaccurate they are. For example, you can’t use an English A.I. to transcribe Spanish. Likewise, if an A.I. has been trained on perfectly recorded audio with no background noise, as soon as you add in background noise it goes off the rails. In fact, the accuracy of every A.I. eventually falls off a cliff. At that point it’s more work to clean it up than to just transcribe it manually.

Always Use New Files

Any time you submit a file to an A.I. it’s possible that the A.I. learns from that file. So you really don’t want to use the same file over and over and over again. To ensure you’re getting unbiased results it’s best to use new files every time you test.

Keep The Test Files Short

First off, comparing transcripts is tedious. Short transcripts are better than long ones. Secondly, if the two minutes you select is representative of an hour-long clip, that’s all you need. Transcribing and comparing the entire hour won’t tell you anything more about the accuracy. The accuracy of two minutes is usually the same as the accuracy of the hour.

Of course, if you’re interviewing many different people over that hour in different locations, with different audio quality (lots of background noise, no background noise, some with accents, etc)… two minutes won’t be representative of the entire hour.

Choose Files of Varying Quality

This is critical! You have to choose files that are representative of the files you’ll be transcribing. Test files with different levels of background noise, different speakers, different accents, different jargon… whatever issues typically occur in the dialog in your videos. ** This is how you’ll determine what ‘garbage’ means to the A.I. **

Use Human Transcripts for The ‘Test Master’

Send out the files to get transcribed by a person. And then have someone within your org (or you) go over them for errors. There usually are some, especially when it comes to jargon or names (turns out humans aren’t perfect either! I know… shocker.). These transcripts will be what you compare the A.I. transcripts against, so they need to be close to perfect. If you change something after you start testing, you need to re-test the transcripts you’ve already tested.

Create A Set of Rules And Apply Them Consistently

You need to figure out what you consider one error, a 1/2 error or two errors. In most cases it doesn’t matter exactly what you decide to do, only that you do it consistently. If a missing comma is 1/2 an error, great! But it ALWAYS has to be a 1/2 error. You can’t suddenly make it a full error just because you think it’s particularly egregious. You want to take judgement out of the equation as much as possible. If you’re making judgement calls, it’s likely you’ll choose the A.I. that most resembles how you see the world. That may not be the best A.I. for your customers. (OMG… they used an Oxford Comma! I hate Oxford commas! That’s at least TWO errors!).
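One way to make the rules mechanical is to write them down as fixed weights and never compute an error total any other way. The weights below are examples, not a recommendation; what matters is that they don’t change mid-test.

```python
# Hypothetical rule set: each error type gets a fixed weight, decided
# before testing starts and never changed mid-test.
ERROR_WEIGHTS = {
    "wrong_word": 1.0,
    "missing_word": 1.0,
    "extra_word": 1.0,
    "missing_comma": 0.5,      # our example rule: half an error
    "wrong_punctuation": 0.5,
    "capitalization": 0.5,
}

def total_errors(error_counts: dict) -> float:
    """Sum observed errors using the fixed weights, no judgement calls."""
    return sum(ERROR_WEIGHTS[kind] * n for kind, n in error_counts.items())

# e.g. 3 wrong words, 4 missing commas, 1 capitalization slip
print(total_errors({"wrong_word": 3, "missing_comma": 4, "capitalization": 1}))
```

If an ambiguous case comes up, add a new entry to the table, re-score everything, and keep going; the table IS the rule book.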

And NOW… The Moment You’ve ALL Been Waiting For…

Add up the errors, divide that by the number of words, put everything into a spreadsheet… and you’ve got your winner!
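The arithmetic above is just one line of math: accuracy is one minus (weighted errors divided by word count). The numbers here are illustrative.

```python
# Final score: percent accuracy from a weighted error total and word count.
def accuracy_pct(weighted_errors: float, word_count: int) -> float:
    return 100.0 * (1 - weighted_errors / word_count)

# e.g. 5.5 weighted errors across a 300-word test clip
print(round(accuracy_pct(5.5, 300), 2))
```

Do this per file, per service, drop the results into the spreadsheet, and the charts above fall out of it.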


Hopefully this post has given you some insights into how to test whatever type of A.I. services you’re looking into using. And, of course, if you haven’t checked out Transcriptive, our A.I. transcript plugin for Premiere Pro, you need to! Thanks for reading and please feel free to ask questions in the comment section below!
