This video discusses speech recognition using sequence-to-sequence models. It explains the shift from phoneme-based systems to end-to-end deep learning approaches, enabled by larger datasets (10,000+ hours of audio). Two model architectures are presented: attention models and Connectionist Temporal Classification (CTC) models. CTC handles the disparity between long audio inputs and shorter text outputs by allowing repeated characters and blank characters, which are collapsed during processing. The video concludes by previewing a simpler keyword detection system for smaller datasets.

So one of the most exciting trends in speech recognition is that, once upon a time, speech recognition systems used to be built using phonemes, and these were, I want to say, hand-engineered basic units of sound. So "the quick brown fox" would be represented as phonemes. I'm going to simplify a bit, but you could say that "the" has a "duh" and an "ee" sound and that "quick" has a "k", "w", "i", "k" sound, and linguists used to write out these basic units of sound and try to break language down into these basic units of sound. The same goes for "brown". These aren't the official phonemes, which are written with more complicated notation, but linguists used to hypothesize that writing down audio in terms of these basic units of sound called phonemes would be the best way to do speech recognition.

So how do you build a speech recognition system? In the last video, we talked about the attention model, so one thing you could do is exactly that: on the horizontal axis, you take in different time frames of the audio input, and then you have an attention model try to output the transcript, like "the quick brown fox" or whatever was said.

One other method that seems to work well is to use the CTC cost for speech recognition. CTC stands for Connectionist Temporal Classification and is due to Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. So here's the idea. Let's say the audio clip was of someone saying "the quick brown fox." We're going to use a neural network structured with an equal number of input x's and output y's, and I've drawn a simple uni-directional RNN for this, but in practice this will usually be a bidirectional LSTM or bidirectional GRU, and usually a deeper model. Notice that the number of time steps here is very large, and in speech recognition the number of input time steps is usually much bigger than the number of output time steps. For example, if you have 10 seconds of audio and your features come at 100 hertz, so 100 samples per second, then a 10-second audio clip would end up with 1,000 inputs: 100 hertz times 10 seconds gives 1,000 inputs, but your output might not have 1,000 characters. So what do you do? The CTC cost function allows the RNN to generate an output like this: "t", "t", "t", then a special character called the blank character, which we'll write as an underscore here, then "h", blank, "e", "e", "e", blank, blank, blank, then maybe a space, and then blank, blank, blank, "q", "q", "q", blank, blank. This is considered a correct output for the first part of the transcript, "the" followed by a space and the "q" of "quick". The basic rule for the CTC cost function is to collapse repeated characters not separated by a blank. To be clear, I'm using the underscore to denote the special blank character.
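To make the collapse rule concrete, here is a minimal Python sketch (an illustration added to these notes, not code from the video) that applies the two CTC post-processing steps just described: merge repeated characters that are not separated by a blank, then remove the blanks. The underscore stands in for the blank character, matching the lecture's notation.

def ctc_collapse(frame_output, blank="_"):
    """Collapse a frame-level CTC output string into its transcript.

    Rule from the lecture: merge repeated characters that are not
    separated by a blank, then drop the blank characters.
    """
    chars = []
    prev = None
    for ch in frame_output:
        if ch != prev and ch != blank:  # repeats merge; blanks are dropped
            chars.append(ch)
        prev = ch
    return "".join(chars)

# Frame-level output for the start of "the quick brown fox":
# many input frames still collapse to a short transcript.
print(ctc_collapse("ttt_h_eee___ ___qqq__"))  # -> "the q"

Note that the blank is what allows a genuinely doubled letter in the transcript: two identical consecutive characters can only survive the collapse if a blank sits between them in the frame-level output.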
Speech Recognition: The process of automatically converting spoken language into text.
Audio Clip: A segment of recorded sound.
Spectrogram: A visual representation of the frequencies of sound over time. It shows the intensity of different frequencies at different times.
Filter Bank Outputs: Similar to a spectrogram, a pre-processing step that analyzes audio data by separating it into different frequency bands.
Phonemes: The basic units of sound in a language. They are the smallest units of speech that can distinguish one word from another.
Connectionist Temporal Classification (CTC): A loss function used in training sequence-to-sequence models, especially for speech recognition. It allows the model to output a sequence of characters that can be aligned to the input audio, even when the input and output sequences differ in length. It handles variable-length sequences by introducing "blank" characters.
RNN (Recurrent Neural Network): A type of neural network designed to work with sequential data, like audio or text. It has memory of previous inputs, allowing it to consider context.
LSTM (Long Short-Term Memory): A specific type of RNN architecture that handles long-range dependencies in sequential data better than standard RNNs.
GRU (Gated Recurrent Unit): Another type of RNN architecture, similar to the LSTM, designed to address the vanishing gradient problem in RNNs.
Bi-directional RNN: An RNN that processes the input sequence in both forward and backward directions, allowing it to capture context from both past and future inputs.
Hertz (Hz): A unit of frequency, measuring cycles per second. In audio, it represents the number of sound waves per second.
Blank Character: A special character used in CTC to represent the absence of a character and to allow for variable-length outputs.
Trigger Word Detection/Keyword Detection: Identifying specific words or phrases within an audio stream.

A spectrogram visually represents audio data, plotting time on the horizontal axis and frequency on the vertical axis. The intensity of color at each point indicates the energy (loudness) of specific frequencies at specific times. Raw audio, which represents air pressure changes over time, is pre-processed to create this spectrogram, a step that mimics the human ear's frequency analysis. Spectrograms are crucial in speech recognition because they transform raw audio into a format more easily processed by machine learning algorithms; a small code sketch of this pre-processing step follows at the end of these notes.

Historically, speech recognition systems relied on phonemes, hand-engineered basic units of sound, to represent words. This involved linguists meticulously breaking down speech into these units. In contrast, end-to-end deep learning bypasses this step. These systems take in audio directly and output transcripts, learning the complex mapping between audio and text without an explicit phoneme representation. This eliminates the need for manual phoneme definition and allows for more accurate and efficient speech recognition.
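As a rough illustration of the spectrogram pre-processing described above, here is a short Python sketch (an assumption of these notes, not code from the video) that turns a raw waveform into a spectrogram with scipy. The sampling rate, window, and hop sizes are example choices, picked only so that frames arrive at 100 Hz, matching the lecture's 100 Hz times 10 seconds = 1,000 input time steps.

import numpy as np
from scipy.signal import spectrogram

# Synthetic 10-second "audio clip" sampled at 16 kHz, standing in for
# raw air-pressure measurements over time.
fs = 16000
t = np.linspace(0, 10, 10 * fs, endpoint=False)
audio = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(t.size)

# A hop of fs // 100 samples gives 100 frames per second (100 Hz features),
# so a 10-second clip yields roughly 1,000 time steps.
hop = fs // 100            # 160 samples = 10 ms between frames
win = 2 * hop              # 20 ms analysis window
freqs, times, energy = spectrogram(audio, fs=fs, nperseg=win, noverlap=win - hop)

# energy[i, j] is the intensity of frequency freqs[i] (Hz) at time times[j] (s):
# time on one axis, frequency on the other, loudness as the plotted value.
print(energy.shape)        # about (161, 999): frequency bins x time frames

The resulting array is what a model like the CTC-trained RNN above would consume, one column (or a stack of a few columns) per input time step.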