This paper presents the design and implementation of a real-time AI voice assistant using an ESP32 microcontroller and cloud-based artificial intelligence. The system captures user voice input through an INMP441 I2S microphone and
transmits the audio data to a Python-based server over a WebSocket connection. The server processes
the audio using Speech-to-Text (STT), generates intelligent responses using AI models, and converts the
output into speech using Text-to-Speech (TTS). The system employs Voice Activity Detection (VAD) to
identify speech segments and eliminate unnecessary processing of silence or noise, thereby improving
efficiency. The synthesized speech is played through a Bluetooth speaker, enabling real-time interaction. The
proposed system is cost-effective, scalable, and suitable for applications such as smart assistants, IoT
automation, and accessibility tools. Experimental results demonstrate that the system achieves accurate
speech recognition and response generation with an average response time of less than three seconds under
normal conditions.
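
As an illustration of the Voice Activity Detection step described above, the following is a minimal sketch of an energy-based detector that keeps only frames whose RMS level exceeds a threshold. It is not necessarily the method used in the proposed system: the 16 kHz sample rate, 16-bit little-endian mono PCM format, 30 ms frame length, threshold value, and the names `frame_rms` and `speech_frames` are all assumptions made for this example.

```python
import math
import struct

# All constants below are assumptions for illustration; the paper does not state them.
SAMPLE_RATE = 16000                               # samples per second (assumed)
FRAME_MS = 30                                     # frame length in milliseconds (assumed)
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000    # 480 samples per frame
RMS_THRESHOLD = 500.0                             # energy threshold, would need tuning


def frame_rms(frame: bytes) -> float:
    """Return the RMS amplitude of a frame of 16-bit little-endian mono PCM."""
    n = len(frame) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack("<%dh" % n, frame[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


def speech_frames(pcm: bytes):
    """Yield only the complete frames whose RMS is at or above the threshold."""
    step = FRAME_SAMPLES * 2  # two bytes per 16-bit sample
    for start in range(0, len(pcm) - step + 1, step):
        frame = pcm[start : start + step]
        if frame_rms(frame) >= RMS_THRESHOLD:
            yield frame
```

In a pipeline like the one described, a server could pass each received audio buffer through `speech_frames` and forward only the yielded frames to the Speech-to-Text stage, so that silent frames are never processed. A simple energy threshold is sensitive to background noise; a trained VAD model would likely be more robust.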
