Playing jazz with cat sounds: NVIDIA announces next-generation AI audio synthesis technology "Fugatto".
Fugatto (Foundational Generative Audio Transformer Opus 1)
November 26, 2024
Fugatto, the AI audio model newly announced by NVIDIA, is an innovative technology that makes it possible to create sounds never heard before. The model can not only transform existing audio but also generate entirely new sound effects, opening new possibilities for music production and game development.
Fugatto: Infinite acoustic world brought by unique synthesis technology.
At the core of Fugatto is the ComposableART (Audio Representation Transformation) system, a technique that breaks new ground in audio synthesis. By controlling instructions and tasks independently and combining them freely, the system can generate audio outputs beyond the range of its training data.
Particularly noteworthy is the ability to combine different acoustic characteristics in fine detail. Using techniques such as weighted combinations across instructions, frame indices, and even the vector fields of different models, the research team has created sound effects that do not exist in the real world, such as factory machinery that moans metallically or a trumpet that barks like a dog.
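The underlying idea of such weighted combinations can be pictured with a short, self-contained sketch. This is not NVIDIA's implementation: toy_generate and blend_instructions are hypothetical stand-ins, and the latent representation is a random placeholder; only the weighted-mixing step reflects the technique described above.

```python
import numpy as np

# Toy stand-in for an instruction-conditioned generator. The real Fugatto
# model is not public; this simply maps a text prompt to a latent "track".
def toy_generate(prompt: str, num_frames: int, latent_dim: int = 8) -> np.ndarray:
    seed = sum(ord(c) for c in prompt) % (2**32)
    rng = np.random.default_rng(seed)
    return rng.standard_normal((num_frames, latent_dim))

def blend_instructions(prompts, weights, num_frames=256):
    """Mix per-instruction outputs with normalized weights (a convex combination)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    tracks = [toy_generate(p, num_frames) for p in prompts]
    # Frame-wise weighted sum produces a sound that sits "between" the prompts.
    return sum(wi * t for wi, t in zip(w, tracks))

blended = blend_instructions(
    ["factory machinery humming", "a low metallic human moan"],
    weights=[0.6, 0.4],
)
print(blended.shape)  # (256, 8)
```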
Another remarkable point is that each acoustic characteristic is treated not as an on/off switch but as a continuous scale. When blending an acoustic guitar with the sound of flowing water, finely adjusting the weight of each element yields completely different sound effects, and the same continuous control applies to a speaker's emotional expression and voice quality.
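As a rough illustration of treating an attribute as a continuous dial rather than a binary label, the sketch below shifts a latent vector along an "attribute direction" by a chosen amount. The vectors and the apply_attribute helper are invented for illustration; Fugatto's actual internal representation is not public.

```python
import numpy as np

def apply_attribute(latent: np.ndarray, direction: np.ndarray, strength: float) -> np.ndarray:
    """Shift a latent audio representation along an attribute direction.

    strength = 0.0 leaves the sound unchanged; larger values push it further
    toward the attribute (e.g. 'more water-like', 'brighter', 'happier').
    """
    unit = direction / np.linalg.norm(direction)
    return latent + strength * unit

rng = np.random.default_rng(0)
guitar_latent = rng.standard_normal(16)    # stand-in for an acoustic-guitar clip
water_direction = rng.standard_normal(16)  # stand-in for a 'flowing water' attribute

subtle = apply_attribute(guitar_latent, water_direction, strength=0.2)
heavy = apply_attribute(guitar_latent, water_direction, strength=0.9)
```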
Fugatto also performs well on existing audio processing tasks. It can take individual notes from MIDI data and replace them with sung vocals in a variety of voice qualities, or detect the beat of a song and place sound effects such as drums, dog barks, or ticking clocks in rhythm. These capabilities point to a wide range of applications, including music prototyping, dynamic scoring for video games, and international advertising production.
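The beat-synchronized placement of sound effects can be pictured with a conventional signal-processing sketch using librosa. This is not how Fugatto does it internally (Fugatto works from a text instruction), and the file names song.wav and dog_bark.wav are placeholders.

```python
import numpy as np
import librosa

# Placeholder input files: a song and a one-shot sound effect.
song, sr = librosa.load("song.wav", sr=None)
bark, _ = librosa.load("dog_bark.wav", sr=sr)

# Estimate beat positions, then convert beat frames to sample indices.
tempo, beat_frames = librosa.beat.beat_track(y=song, sr=sr)
beat_samples = librosa.frames_to_samples(beat_frames)

# Overlay the sound effect at every detected beat.
mix = np.copy(song)
for start in beat_samples:
    end = min(start + len(bark), len(mix))
    mix[start:end] += 0.5 * bark[: end - start]

mix = mix / np.max(np.abs(mix))  # simple peak normalization
```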
Music producer and composer Ido Zmishlany has compared the role of AI in music to how the electric guitar birthed rock 'n' roll and samplers birthed hip-hop, stating that AI will open a new chapter in music. However, NVIDIA positions Fugatto not as a replacement for artists' creativity but rather as a new expressive tool. This stance aims to guide the coexistence of technological innovation and artistic creativity.
Innovative learning methods and consideration for safety.
In developing Fugatto, NVIDIA's research team faced the difficult task of establishing meaningful relationships between audio and language. While conventional language models can infer from text data itself how to handle a wide range of instructions, generalizing characteristics and properties directly from audio data proved extremely difficult.
To address this challenge, the team adopted a multi-layered learning approach of its own. First, it used a large language model to generate Python scripts that produced a variety of templates and free-form instruction texts describing different speech 'personas', such as 'standard', 'youth-oriented', 'for people in their 30s', and 'professional'. It also generated both absolute instructions such as 'synthesize a bright voice' and relative instructions such as 'increase the brightness of this voice'.
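A drastically simplified sketch of this kind of template expansion is shown below. The actual scripts were themselves generated by a large language model and have not been published; the persona and attribute lists and the template wording here are illustrative assumptions.

```python
import itertools

personas = ["standard", "youth-oriented", "for people in their 30s", "professional"]
attributes = ["bright", "calm", "raspy"]

absolute_template = "synthesize a {attribute} voice in a {persona} style"
relative_template = "make this {persona} voice sound more {attribute}"

# Expand every persona/attribute combination into one absolute and one
# relative instruction string.
instructions = []
for persona, attribute in itertools.product(personas, attributes):
    instructions.append(absolute_template.format(attribute=attribute, persona=persona))
    instructions.append(relative_template.format(attribute=attribute, persona=persona))

print(len(instructions))   # 24 instruction strings
print(instructions[0])     # "synthesize a bright voice in a standard style"
```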
Innovative methods were also used to build the training dataset. Using existing speech understanding models, the team created 'synthetic captions' for training clips, describing characteristics such as gender, emotion, and voice quality in natural language. Acoustic properties such as fundamental-frequency variance and reverberation were quantified with audio processing tools.
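As a loose illustration of turning a measured acoustic property into a natural-language caption, the sketch below estimates fundamental frequency with librosa and maps its variance to a descriptive phrase. The thresholds, wording, and the caption_pitch_variability helper are assumptions made here for illustration, not the values or tools NVIDIA used.

```python
import numpy as np
import librosa

def caption_pitch_variability(path: str) -> str:
    """Return a rough natural-language description of pitch variability."""
    y, sr = librosa.load(path, sr=None)
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )
    f0 = f0[~np.isnan(f0)]                 # keep voiced frames only
    if f0.size == 0:
        return "no clear pitch detected"
    variance = float(np.var(f0))
    if variance < 200.0:                   # illustrative threshold (Hz^2)
        return "a flat, monotone delivery"
    if variance < 2000.0:
        return "moderately expressive pitch movement"
    return "highly animated speech with wide pitch variation"

# Example (placeholder path):
# print(caption_pitch_variability("clip_0001.wav"))
```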
For learning comparative relationships, the team used datasets in which one element was held fixed while others varied, such as the same text read with different emotions or the same phrase played on different instruments. This allowed the model to learn subtle distinctions, such as what makes one voice 'brighter' than another or the tonal difference between a saxophone and a flute.
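One way to picture this kind of paired data is the small data-structure sketch below, in which one element is held fixed and the change is named by a relative instruction. The RelativePair class, field names, and file paths are hypothetical; the real dataset format has not been published.

```python
from dataclasses import dataclass

@dataclass
class RelativePair:
    text: str            # element held fixed (the spoken sentence or phrase)
    source_clip: str     # e.g. a neutral reading, or a saxophone rendition
    target_clip: str     # same content, different emotion or instrument
    instruction: str     # relative instruction describing the change

pairs = [
    RelativePair(
        text="The weather is lovely today.",
        source_clip="clips/weather_neutral.wav",
        target_clip="clips/weather_happy.wav",
        instruction="make this voice sound brighter and happier",
    ),
    RelativePair(
        text="(same melodic phrase)",
        source_clip="clips/phrase_saxophone.wav",
        target_clip="clips/phrase_flute.wav",
        instruction="play the same phrase on a flute instead of a saxophone",
    ),
]
```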
The final dataset constructed through this process was vast, containing more than 20 million samples and over 50,000 hours of speech and audio data. The model, with 2.5 billion parameters and trained on 32 NVIDIA H100 Tensor Core GPUs, achieved high scores across a range of voice quality evaluations.
However, NVIDIA remains cautious about releasing Fugatto publicly. Bryan Catanzaro, NVIDIA's vice president of applied deep learning research, has pointed out the risks inherent in generative technology and emphasized the need to prevent undesirable uses. Legal risks around copyright also demand careful attention: major music companies such as Sony, Warner, and Universal have sued AI music generation startups for copyright infringement, and actress Scarlett Johansson has pursued a legal dispute with OpenAI over the alleged unauthorized replication of her voice.
The development of Fugatto is thus an attempt to balance technological innovation with responsible deployment, and it offers important guidance for bringing AI technology into society. NVIDIA's approach of exploring the potential of audio generation while keeping its impact under proper control may become a model case for future AI development.
The debut of Fugatto is a groundbreaking event that opens up entirely new forms of sound expression. Behind that innovation, however, lie copyright issues and ethical challenges: voice replication and conversion technologies in particular carry risks such as impersonation and the spread of misinformation. NVIDIA's cautious approach is appropriate, but how to democratize this technology will also be an important challenge. Ultimately, the evolution of technology cannot be stopped; what matters is controlling its power properly and harnessing it as a new possibility for creative expression.