28 February

Speech Recognizer API

In iOS 10 we can use the SFSpeechRecognizer API, which allows transcription of either real-time or pre-recorded audio. The outcome of such transcription is not just text: it also includes alternative interpretations of the audio, the timing and duration of spoken words, and a confidence level for each recognized word (in the range 0.0 - 1.0). The API supports more than 50 languages. Using the SFSpeechRecognizer API in an application is straightforward and boils down to four steps.

Adding appropriate keys along with their descriptions to the Info.plist file
  • a. NSSpeechRecognitionUsageDescription - a key whose value explains what the application uses speech recognition for
  • b. NSMicrophoneUsageDescription - required if you use the microphone to analyze live speech
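In the source form of Info.plist, the two entries might look like this (the description strings below are placeholders; write ones that describe your own app):

```xml
<key>NSSpeechRecognitionUsageDescription</key>
<string>Speech recognition is used to transcribe what you say.</string>
<key>NSMicrophoneUsageDescription</key>
<string>The microphone is used to capture live speech for transcription.</string>
```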
Requesting the user's permission with SFSpeechRecognizer.requestAuthorization
SFSpeechRecognizer.requestAuthorization { authStatus in
            /*
                The callback may not be called on the main thread. Add an
                operation to the main queue to update the record button's state.
            */
            OperationQueue.main.addOperation {
                switch authStatus {
                    case .authorized:
                         //..
                    case .denied:
                         //..
                    case .restricted:
                         //..
                    case .notDetermined:
                         //..
                }
            }
        }
Creating a speech recognition request

There are two types of such a request:
  • SFSpeechURLRecognitionRequest - used to analyze speech from a previously recorded audio file
  • SFSpeechAudioBufferRecognitionRequest - used to analyze live speech (using a microphone)
For both requests an instance of the SFSpeechRecognizer class is needed to perform the analysis of the speech.

//..
let speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: "pl-PL"))
//..
SFSpeechURLRecognitionRequest

Handling such a request comes down to a few lines of code:

let fileURL = URL(fileURLWithPath: Bundle.main.path(forResource: "audio", ofType: "mp3")!)
let request = SFSpeechURLRecognitionRequest(url: fileURL)
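To actually obtain the transcription, the request is then handed to the recognizer. A minimal sketch, assuming the speechRecognizer constant created earlier (note that SFSpeechRecognizer(locale:) returns an optional):

```swift
// The callback is invoked repeatedly with partial results
// until result.isFinal is true, or with an error on failure.
speechRecognizer?.recognitionTask(with: request) { result, error in
    if let result = result {
        print(result.bestTranscription.formattedString)
    } else if let error = error {
        print("Recognition failed: \(error)")
    }
}
```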
SFSpeechAudioBufferRecognitionRequest

Handling analysis of live speech requires a little more work: you additionally need AVAudioEngine to capture speech from the device's microphone. The final element necessary for speech recognition is SFSpeechRecognitionTask.

//..
recognitionRequest = SFSpeechAudioBufferRecognitionRequest()

guard let inputNode = audioEngine.inputNode else { fatalError("Audio engine has no input node") }
guard let recognitionRequest = recognitionRequest else { fatalError("Unable to create a SFSpeechAudioBufferRecognitionRequest object") }

// Configure the request so that results are returned before audio recording is finished
recognitionRequest.shouldReportPartialResults = true

// A recognition task represents a speech recognition session.
// We keep a reference to the task so that it can be cancelled.
recognitionTask = speechRecognizer.recognitionTask(with: recognitionRequest) { result, error in
    //..
}

let recordingFormat = inputNode.outputFormat(forBus: 0)
inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { (buffer: AVAudioPCMBuffer, when: AVAudioTime) in
    self.recognitionRequest?.append(buffer)
}

audioEngine.prepare()
try audioEngine.start()

SFSpeechRecognitionResult includes the bestTranscription as well as alternative transcriptions, and each transcription is made up of segments carrying the recognized text, its timing, and a confidence value.
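For example, the per-word details can be read from the segments of a transcription. A sketch, assuming result is the value delivered to the recognition task's callback:

```swift
if let result = result {
    // The interpretation the recognizer considers most likely.
    print(result.bestTranscription.formattedString)

    // Each segment carries the recognized word, its timing,
    // and a confidence score in the range 0.0 - 1.0.
    for segment in result.bestTranscription.segments {
        print("\(segment.substring): confidence \(segment.confidence), " +
              "starts at \(segment.timestamp)s, lasts \(segment.duration)s")
    }

    // Other plausible interpretations of the same audio.
    for transcription in result.transcriptions {
        print(transcription.formattedString)
    }
}
```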