The goal of the project is to develop a comprehensive voice interaction subsystem for an autonomous robot. The new voice communication module will enable a natural exchange of information between users and the robot, significantly increasing its usability in applications such as assistance for the elderly, support in public institutions, and navigation in complex indoor environments. The proposed subsystem integrates modules for audio processing, speech-to-text transcription, response generation using large language models (LLMs, e.g. GPT-4), and response playback through a custom speech synthesizer. These modules will operate as one integrated pipeline, forming a complete solution ready for practical use.
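The stage chain described above (audio in, speech-to-text, response generation, speech synthesis, audio out) can be sketched as follows. This is only an illustrative skeleton: the stage names and the stub implementations are hypothetical placeholders, not the project's actual modules.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class VoicePipeline:
    """Chains the three processing stages named in the text:
    STT -> response generation -> TTS. The callables are placeholders
    for the real audio-processing, LLM, and synthesizer modules."""
    stt: Callable[[bytes], str]   # speech-to-text transcription
    llm: Callable[[str], str]     # response generation (e.g. an LLM)
    tts: Callable[[str], bytes]   # speech synthesis

    def respond(self, audio_in: bytes) -> bytes:
        text = self.stt(audio_in)
        reply = self.llm(text)
        return self.tts(reply)

# Stub stages for illustration only: echo the input back as a "reply".
pipe = VoicePipeline(
    stt=lambda audio: audio.decode(),
    llm=lambda text: f"Echo: {text}",
    tts=lambda text: text.encode(),
)
```

Keeping each stage behind a plain callable interface makes it straightforward to swap in different recognizers or synthesizers when benchmarking them.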
Specific objectives of the project:
- Develop and implement methods for converting speech to text and vice versa;
- Evaluate speech recognition algorithms in terms of accuracy (word error rate, WER), speed (real-time factor, RTF), and robustness to noise;
- Develop test scenarios and analyze the results using the Python Speech Recognition Toolkit;
- Develop and implement methods for synthesizing speech from text;
- Develop a voice synthesizer based on deep-learning models;
- Integrate the voice interaction subsystem with the Semantic Space Orientation System.
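The two evaluation metrics named in the objectives can be computed as follows. This is a minimal sketch (word-level edit-distance WER and wall-clock RTF), shown only to make the metrics concrete; it is not the project's evaluation code.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / N,
    computed as the Levenshtein distance between word sequences divided
    by the number of reference words N."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            substitution = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-Time Factor: processing time over audio duration.
    Values below 1.0 mean recognition runs faster than real time."""
    return processing_seconds / audio_seconds
```

For example, recognizing "the hat sat" against the reference "the cat sat" yields one substitution out of three words, i.e. WER = 1/3, and transcribing 10 s of audio in 2 s gives RTF = 0.2.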
The project addresses an important research gap in analyzing the effectiveness of speech recognition and voice cloning systems in complex application environments. The work will advance understanding of how speech processing algorithms perform under acoustic interference. In addition, the project will deliver a ready-to-use AI-based voice generation solution.
Semantic Spatial Orientation System - a subsystem of voice interaction and speech processing