This YouTube tutorial demonstrates building a voice-activated AI assistant similar to OpenAI's ChatGPT voice mode, but with added agent functionality. Using LiveKit (a sponsor, but free and open-source), the creator shows how to connect a Python-based AI agent to a user interface, enabling voice input, speech-to-text conversion, LLM processing (using OpenAI, but adaptable to other providers), and text-to-speech output. The tutorial further expands functionality by adding custom Python functions to the agent, allowing it to manage and update simulated room temperatures.

This segment guides viewers through setting up a virtual environment, installing the necessary dependencies (including LiveKit agents, plugins, and the OpenAI integration), and creating the essential project files (a .env file, main.py, and api.py). It lays the groundwork for the coding process, explaining the purpose of each file and dependency.
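As a concrete reference, here is one plausible version of that setup. The exact package list is an assumption based on the plugins the video mentions (OpenAI for STT/LLM/TTS, Silero for VAD), so treat it as a sketch rather than the video's verbatim commands:

```
python3 -m venv venv
source venv/bin/activate   # on Windows: venv\Scripts\activate

# livekit-agents plus the plugins the video relies on (assumed list)
pip install livekit-agents livekit-plugins-openai livekit-plugins-silero python-dotenv

# the three project files created in this segment
touch .env main.py api.py
```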
Well, it's going to start the agent for us. So let's run python3 main.py start. This should start the agent, and we should get some output here in just a second. Okay, so you can see that the agent has started. Now that the agent has started, what we need to do is actually set up a new room. This will trigger the LiveKit server to send a request to this agent, so the agent can connect to the room and start listening for any of the voice messages that we send.

The way we do that is the following. First of all, I have a link in the description to the LiveKit Agents framework GitHub repository. This is specifically for Python, and it will give you some more information on how all of this works. It also specifies a URL for the hosted playground. The hosted playground is what I showed you before in the demo, and it's something we can use to connect to our agent without having to build our own front end. So I'm going to click on this URL, and it brings me to agents-playground.livekit.io. You're going to have to sign in to LiveKit for this to work. You can see that I'm already signed in, and I'll leave this link in the description so you can just click on it rather than having to type it in.

Now we're going to press on the project associated with the agent that we're building. In this case, I believe I called it AI voice tutorial, so I'm going to click on that and connect to this project. When I do that, it should connect my video as well as my audio, and I can turn off my video if I want, so let's do that. You can see that my microphone is going. Notice that it shows me (let me just mute this for a second) Room Connected: true, that an agent is connected, and the room that I'm inside of. If we go back here, we can see that we've actually connected to this room. So in a second here, if I turn on my microphone, it should actually allow me to talk and get a response from the agent. So let me do this. Hey, how's it going?

Okay, sorry for the cut there, but I actually disconnected and then reconnected, and now my agent is connected and working, and you can see it's picking up what I'm saying. As soon as I stop talking, it should grab all of this audio, send it to my agent, and then give me some kind of response. Okay, so you can see it handled that audio. It's going to take a second, obviously, and then it should give us some kind of response.

"Sounds like everything is working. How can I assist you further?"

Perfect. So I just muted my mic, because obviously it's weird when I'm commentating and trying to talk to the assistant. But the point is that this is now connected. It shows you the room, the participant, the fact that the room is connected, and the fact that the agent is connected. If you're getting any weird errors, sometimes you can just disconnect and then reconnect, and the agent will pick this up. Also, you can shut off the agent by hitting Ctrl+C on your keyboard and then rebooting it.

This segment focuses on building a Python class to add agent functionality, specifically temperature control. The presenter explains the use of enums, decorators (`@llm.ai_callable`), and type annotations to create functions callable by the LLM. The code demonstrates how to define functions to get and set temperatures in different zones, illustrating a practical application of extending the agent's capabilities beyond simple text responses.

This segment shows the integration of the newly created temperature control functions into the main application and demonstrates their functionality. The presenter tests the agent's ability to both get and set temperatures in different rooms, showcasing the successful integration of custom functions and the agent's ability to handle multiple requests and respond appropriately. The segment also includes a brief look at the logs to verify the function calls.

This segment walks through the Python code for building the AI voice assistant. It covers importing necessary libraries, loading environment variables, defining functions, connecting to the LiveKit server, and setting up voice activity detection, speech-to-text, and text-to-speech functionalities. The code integrates OpenAI's language model and demonstrates how to send and receive messages.

So let's break down what we're about to do. We're going to be building an AI voice assistant, and this voice assistant works by connecting to something known as the LiveKit server. The LiveKit server is hosted by LiveKit, but if we wanted to, we could host it ourselves using their open-source code. This server is what's responsible for the transmission of data: it takes, for example, the voice data that comes from our client or front end and sends it to our backend, the AI agent, where we can process it and come up with some kind of response.

So the server is an integral part here, and inside of the environment variable file we're going to put a few different keys or tokens that we need to connect to that server. Then we have a front end. The front end is the user-facing application; that's what you saw in the demo, where it shows us a preview of our video and gives us the text logs of what the agent is saying. It's actually pre-made by LiveKit. Obviously we could build our own front end, but to save time we're going to use the one provided by LiveKit, which lets us quickly test our application.

Quick summary: we have a front end, or client, that connects to the LiveKit server, and our agent, the AI we're going to build in this video, connects to the server as well. The server handles the transmission of data between those two components. And that's the main advantage of using LiveKit: it does really low-latency transmission of data, so we can get responses back super quickly.
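To make that walkthrough concrete, here is a minimal sketch of what the agent side (main.py) can look like. It is based on the livekit-agents 0.x API that the video appears to use; exact class and parameter names may differ in other versions, and the system prompt here is a placeholder, not the video's exact text:

```python
# main.py - a minimal sketch of the voice agent described in this walkthrough.
# Assumes the livekit-agents 0.x API; names may differ in your version.
import asyncio

from dotenv import load_dotenv
from livekit.agents import AutoSubscribe, JobContext, WorkerOptions, cli, llm
from livekit.agents.voice_assistant import VoiceAssistant
from livekit.plugins import openai, silero

load_dotenv()  # pull LIVEKIT_URL, LIVEKIT_API_KEY, etc. from the .env file


async def entrypoint(ctx: JobContext):
    # Connect to the LiveKit room, subscribing to audio tracks only.
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

    assistant = VoiceAssistant(
        vad=silero.VAD.load(),   # voice activity detection (Silero)
        stt=openai.STT(),        # speech-to-text
        llm=openai.LLM(),        # language model
        tts=openai.TTS(),        # text-to-speech
        chat_ctx=llm.ChatContext().append(
            role="system",
            text="You are a helpful voice assistant.",  # placeholder prompt
        ),
    )
    assistant.start(ctx.room)
    await asyncio.sleep(1)
    await assistant.say("Hey, how can I help you today?", allow_interruptions=True)


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```

Running `python3 main.py start`, as in the transcript above, registers this worker with the LiveKit server, which then dispatches a job to the entrypoint whenever a room needs an agent.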
So what we need to do now is go to the LiveKit website and create a new application.

Chapter: Getting Environment Secrets

I'm going to press on "Try LiveKit", make a new account, and then create a new app. Now, I already have an account and I already have an application, but let's walk through it: go to Settings, and from Settings press on Keys. From Keys, I'm going to create a new key, and I want to give it some kind of description, so I'm just going to call this "tutorial" because I'll delete it afterwards. It's going to generate this key for us, and we need to copy three pieces of data into our environment variable file.

Let's start with the WebSocket URL. Go to the .env file and type in the variable name; it's really important that you type this correctly, so just follow along with me. The first variable is LIVEKIT_URL, and we're going to make it equal to a string containing this wss (WebSocket Secure) URL. The next variable is LIVEKIT_API_SECRET; let's make sure we spell "secret" correctly. Okay, that looks good. After that we have LIVEKIT_API_KEY, which will be equal to a string as well. So let's now go and grab our secret (again, I'm going to delete this afterwards, so don't think you can copy it), paste that in for the secret, and then grab the API key and paste that in for the API key. Make sure you have these three variables, and then we're almost done, but we do need to get our OpenAI API key. So we're going to make one more variable here, OPENAI_API_KEY, and this will be equal to a string containing that key.
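With those four values in place, the finished .env file looks something like this (placeholder values shown, not real credentials; the wss:// URL shape is an assumption based on LiveKit Cloud's hosted projects, so copy whatever WebSocket URL your project's Keys page shows):

```
LIVEKIT_URL="wss://your-project.livekit.cloud"
LIVEKIT_API_SECRET="<your LiveKit API secret>"
LIVEKIT_API_KEY="<your LiveKit API key>"
OPENAI_API_KEY="<your OpenAI API key>"
```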
Key terms:

- LLM (Large Language Model): A type of artificial intelligence that can understand and generate human-like text. Examples include models from OpenAI, Llama, etc.
- LiveKit: An open-source, real-time communication platform that enables low-latency streaming of audio and video. It powers the voice mode functionality in this example.
- Virtual Environment: An isolated space for installing and managing project dependencies, preventing conflicts between different projects. Here, it is used to manage Python packages.
- LiveKit Agents: A component of LiveKit specifically designed for building AI agents that can interact with real-time audio and video streams.
- LiveKit Plugins: Extensions for LiveKit that add specific functionalities, such as speech-to-text (STT), text-to-speech (TTS), and voice activity detection (VAD).
- Voice Activity Detection (VAD): Technology that identifies when a person is speaking in an audio stream, allowing for more efficient processing and reduced latency. Silero is an example of a VAD model used here.
- Speech-to-Text (STT): Technology that converts spoken language into written text. OpenAI's STT model is used in the example.
- Text-to-Speech (TTS): Technology that converts written text into spoken language.
- Async IO: A programming paradigm that allows concurrent execution of tasks, improving performance in applications that handle multiple operations simultaneously.
- Auto Subscribe: A LiveKit feature that automatically subscribes to audio or video tracks in a room.
- Job Context: In the context of LiveKit Agents, the information and environment provided to an agent for a specific task.
- Worker Options: Configuration settings for the worker process that runs the LiveKit agent.
- Agent Functionality: The ability of an AI to interact with its environment, perform actions, and maintain state. In this case, it involves controlling simulated temperatures in different rooms.
- Enum: A data type that defines a set of named constants, used here to represent the different zones (rooms) in the temperature control example; see the sketch below.
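To make the temperature control segment concrete, here is a minimal sketch of the kind of class it describes, again assuming the livekit-agents 0.x API (`llm.FunctionContext`, `@llm.ai_callable`, `llm.TypeInfo`). The zone names and default temperatures are illustrative, not the video's exact values:

```python
# api.py - a sketch of the temperature control class described above.
# Assumes livekit-agents 0.x; zone names and defaults are illustrative.
import enum
from typing import Annotated

from livekit.agents import llm


class Zone(enum.Enum):
    LIVING_ROOM = "living_room"
    BEDROOM = "bedroom"
    KITCHEN = "kitchen"


class AssistantFnc(llm.FunctionContext):
    def __init__(self) -> None:
        super().__init__()
        # Simulated state: one temperature (degrees C) per zone.
        self._temperature = {zone: 22 for zone in Zone}

    @llm.ai_callable(description="Get the temperature in a specific room")
    def get_temperature(
        self,
        zone: Annotated[Zone, llm.TypeInfo(description="The room to check")],
    ) -> str:
        z = Zone(zone)
        return f"The temperature in the {z.value} is {self._temperature[z]}C."

    @llm.ai_callable(description="Set the temperature in a specific room")
    def set_temperature(
        self,
        zone: Annotated[Zone, llm.TypeInfo(description="The room to change")],
        temp: Annotated[int, llm.TypeInfo(description="Temperature in degrees C")],
    ) -> str:
        z = Zone(zone)
        self._temperature[z] = temp
        return f"The temperature in the {z.value} is now {temp}C."
```

In the 0.x API, an instance of this class would be wired into main.py by passing it to the VoiceAssistant (fnc_ctx=AssistantFnc()), which is what lets the LLM call these functions in response to voice requests.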