A.I. Speech-to-Text: How to make sure your data isn’t being used for training

We get a fair number of questions from Transcriptive users who are concerned that the A.I. is going to use their data for training.

First off, in the Transcriptive preferences, if you select ‘Delete transcription jobs from server’ your data is deleted immediately. This will delete everything from the A.I. service’s servers and from the Digital Anarchy servers. So that’s an easy way of making sure your data isn’t kept around and used for anything.

However, generally speaking, the A.I. services don’t get more accurate with user submitted data, partly because they aren’t getting the ‘positive’ or corrected transcript.

When you edit your transcript, we aren’t sending the corrections back to the A.I. (Some services are doing this… e.g. if you correct YouTube’s captions, you’re training their A.I.)

So the audio by itself isn’t that useful. To learn, the A.I. needs the audio file, the original transcript AND the corrected transcript. So even if you don’t have the preference checked, it’s unlikely your audio file will be used for training.

This is great if you’re concerned about security BUT it’s less great if you really WANT the A.I. to learn. For example, I don’t know how many videos I’ve submitted over the last 3 years saying ‘Digital Anarchy’. And still to this day I get: Dugal Accusatorial (seriously), Digital Ariki, and other weird stuff. A.I. is great when it works, but sometimes… it definitely does not work. And people want to put this into self-driving cars? Crazy talk right there.

If you want to help the A.I. out, you can use the Speech-to-Text Glossary (click the link for a tutorial). This still won’t train the A.I., but if the A.I. is uncertain about a word, it’ll help it select the right one.

How does the glossary work? The A.I. analyzes a word sound and then comes up with possible words for that sound. Each word gets a ‘confidence score’. The one with the highest score is the one you see in your transcript. In the case above, ‘Ariki’ might have had a confidence of .6 (on a scale of 0 to 1, so .6 is pretty low) and ‘Anarchy’ might have been .53. So my transcript showed Ariki. But if I’d put Anarchy into the Glossary, then the A.I. would have seen the low confidence score for Ariki and checked if the alternatives matched any glossary terms.

So the Glossary can be very useful with proper names and the like.
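If you like to see this kind of thing as code, here’s a rough sketch of the glossary idea described above. To be clear, this is NOT the actual code the A.I. services run. The names (Candidate, pick_word) and the 0.7 ‘low confidence’ cutoff are made up for illustration; the real services don’t publish their thresholds here.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    word: str
    confidence: float  # 0.0 to 1.0, higher means the A.I. is more sure


# Hypothetical cutoff: below this, the top guess counts as "uncertain".
LOW_CONFIDENCE = 0.7


def pick_word(candidates, glossary, low_confidence=LOW_CONFIDENCE):
    """Pick the word to show in the transcript for one word sound."""
    if not candidates:
        raise ValueError("need at least one candidate word")

    # Highest confidence first.
    ranked = sorted(candidates, key=lambda c: c.confidence, reverse=True)
    best = ranked[0]

    # If the A.I. is confident, the glossary is not consulted at all.
    if best.confidence >= low_confidence:
        return best.word

    # Otherwise, prefer the highest-ranked candidate that is a glossary term.
    glossary_terms = {term.lower() for term in glossary}
    for candidate in ranked:
        if candidate.word.lower() in glossary_terms:
            return candidate.word

    # No glossary match: fall back to the original top guess.
    return best.word


candidates = [Candidate("Ariki", 0.6), Candidate("Anarchy", 0.53)]
without_glossary = pick_word(candidates, glossary=[])
with_glossary = pick_word(candidates, glossary=["Anarchy"])
```

With an empty glossary you get ‘Ariki’ (the highest score wins). With ‘Anarchy’ in the glossary you get ‘Anarchy’, because .6 falls under the made-up cutoff and one of the alternatives matches a glossary term. Note that nothing here changes the scores themselves, which is why the Glossary helps without training anything.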

But, as mentioned, nothing you do in Transcriptive is training the A.I. The only thing we’re doing with your data is storing it and we’re not even doing that if you tell us not to.

It’s possible that we will add the option in the future to submit training data to help train the A.I. But that’ll be a specific feature and you’ll need to intentionally upload that data.