# EnCodec: High Fidelity Neural Audio Compression
|
|
AudioCraft provides the training code for EnCodec, a state-of-the-art deep learning
based audio codec supporting both mono and stereo audio, presented in the
[High Fidelity Neural Audio Compression][arxiv] paper.
Check out our [sample page][encodec_samples].
|
|
## Original EnCodec models
|
|
The EnCodec models presented in High Fidelity Neural Audio Compression can be accessed
and used with the [EnCodec repository](https://github.com/facebookresearch/encodec).
|
|
**Note**: We do not guarantee compatibility between the AudioCraft and EnCodec codebases
and released checkpoints at this stage.
|
|
|
|
## Installation
|
|
Please follow the AudioCraft installation instructions from the [README](../README.md).
|
|
|
|
## Training
|
|
The [CompressionSolver](../audiocraft/solvers/compression.py) implements the audio reconstruction
task to train an EnCodec model. Specifically, it trains an encoder-decoder with a quantization
bottleneck - for EnCodec, a SEANet encoder-decoder with a Residual Vector Quantization
bottleneck - using a combination of objective and perceptual losses, the latter taking
the form of discriminators.
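
To build intuition for the Residual Vector Quantization (RVQ) bottleneck, here is a minimal sketch in plain numpy. The random codebooks and the `rvq_encode` helper are purely illustrative; in the actual model the codebooks are learned during training.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Quantize `x` with a cascade of codebooks, each stage coding the
    residual left over by the previous one."""
    residual = x.astype(float)
    indices = []
    quantized = np.zeros_like(residual)
    for codebook in codebooks:
        # pick the codeword closest to the current residual
        i = int(np.argmin(((codebook - residual) ** 2).sum(axis=1)))
        indices.append(i)
        quantized += codebook[i]
        residual -= codebook[i]
    return indices, quantized

rng = np.random.default_rng(0)
# 4 RVQ stages, each with 16 codewords of dimension 8
codebooks = [rng.normal(size=(16, 8)) for _ in range(4)]
x = rng.normal(size=8)
indices, quantized = rvq_encode(x, codebooks)
# `indices` (one integer per stage) is all that needs to be transmitted.
```

Dropping trailing codebooks lowers the bitrate at the cost of reconstruction quality, which is how a single RVQ model can serve several bandwidths.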
|
|
The default configuration matches a causal EnCodec training at a single, fixed bandwidth.
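
The bandwidth is a simple function of the latent frame rate and the number of RVQ codebooks. As a back-of-the-envelope check, using the figures of the 32 kHz MusicGen codec (a 50 Hz frame rate and 4 codebooks of 2048 entries; substitute the values of your own configuration):

```python
import math

frame_rate = 50        # latent frames per second (32 kHz audio with a hop of 640)
num_codebooks = 4      # RVQ stages used at this bandwidth
codebook_size = 2048   # entries per codebook, i.e. log2(2048) = 11 bits per code

bits_per_second = frame_rate * num_codebooks * math.log2(codebook_size)
print(bits_per_second / 1000)  # 2.2 kbps
```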
|
|
### Example configuration and grids
|
|
We provide sample configurations and grids for training EnCodec models.
|
|
The compression configurations are defined in
[config/solver/compression](../config/solver/compression).
|
|
The example grids are available at
[audiocraft/grids/compression](../audiocraft/grids/compression).
|
|
```shell
# base causal EnCodec on monophonic audio sampled at 24 kHz
dora grid compression.encodec_base_24khz
# EnCodec model used for MusicGen on monophonic audio sampled at 32 kHz
dora grid compression.encodec_musicgen_32khz
```
|
|
### Training and valid stages
|
|
The model is trained using a combination of objective and perceptual losses.
More specifically, EnCodec is trained with the MS-STFT (multi-scale STFT) discriminator
along with objective reconstruction losses. A loss balancer is used to weight the
different losses against each other in an intuitive manner, independently of their raw scales.
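
The gist of the balancer, as a simplified sketch (not AudioCraft's actual `Balancer` class): each loss's gradient is rescaled by its norm before the user-specified weight is applied, so the weights express relative importance directly, regardless of the magnitude of each loss.

```python
import numpy as np

def balanced_gradient(grads, weights, eps=1e-12):
    """Combine per-loss gradients so each contributes in proportion to its
    weight, independently of its raw magnitude."""
    total_weight = sum(weights.values())
    out = np.zeros_like(next(iter(grads.values())))
    for name, grad in grads.items():
        norm = np.linalg.norm(grad)
        out += (weights[name] / total_weight) * grad / (norm + eps)
    return out

# Two losses with wildly different raw gradient scales...
grads = {'l1': np.array([100.0, 0.0]), 'msspec': np.array([0.0, 0.001])}
weights = {'l1': 1.0, 'msspec': 1.0}
grad = balanced_gradient(grads, weights)
# ...end up contributing equally after balancing.
```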
|
|
### Evaluation stage
|
|
Evaluation metrics for audio generation:
* SI-SNR: Scale-Invariant Signal-to-Noise Ratio.
* ViSQOL: Virtual Speech Quality Objective Listener.
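
For reference, SI-SNR is straightforward to compute from its standard definition (a minimal numpy sketch, not the AudioCraft implementation):

```python
import numpy as np

def si_snr(ref, est, eps=1e-8):
    """Scale-Invariant SNR in dB between a reference and an estimated signal."""
    ref = ref - ref.mean()
    est = est - est.mean()
    # project the estimate onto the reference: any global gain is factored out
    target = (est @ ref) / (ref @ ref + eps) * ref
    noise = est - target
    return 10 * np.log10((target @ target) / (noise @ noise + eps))

t = np.linspace(0, 1, 16000)
ref = np.sin(2 * np.pi * 440 * t)
noisy = ref + 0.1 * np.random.default_rng(0).normal(size=t.size)
print(si_snr(ref, 0.5 * ref))  # scale-invariant: a pure rescaling scores very high
print(si_snr(ref, noisy))
```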
|
|
Note: the path to the ViSQOL binary (compiled with Bazel) must be provided in
order to run the ViSQOL metric on the reference and degraded signals.
The metric is disabled by default.
Please refer to the [metrics documentation](../METRICS.md) to learn more.
|
|
### Generation stage
|
|
The generation stage consists of reconstructing audio from samples
with the current model. The number of samples generated and the batch size used are
controlled by the `dataset.generate` configuration. The output path and audio formats
are defined in the generate stage configuration.
|
|
```shell
# generate samples every 5 epochs
dora run solver=compression/encodec_base_24khz generate.every=5
# write the generated samples to a different path inside the dora xp folder
dora run solver=compression/encodec_base_24khz generate.path=<PATH_IN_DORA_XP_FOLDER>
# limit the number of samples or use a different batch size
dora run solver=compression/encodec_base_24khz dataset.generate.num_samples=10 dataset.generate.batch_size=4
```
|
|
### Playing with the model
|
|
Once you have a trained model, you can retrieve either the entire solver or just
the trained model with the following functions:
|
|
```python
from audiocraft.solvers import CompressionSolver

# If you trained a custom model with signature SIG.
model = CompressionSolver.model_from_checkpoint('//sig/SIG')
# If you want to get one of the pretrained models with the `//pretrained/` prefix.
model = CompressionSolver.model_from_checkpoint('//pretrained/facebook/encodec_32khz')
# Or load from a custom checkpoint path
model = CompressionSolver.model_from_checkpoint('/my_checkpoints/foo/bar/checkpoint.th')


# If you only want to use a pretrained model, you can also directly get it
# from the CompressionModel base model class.
from audiocraft.models import CompressionModel

# Here do not put the `//pretrained/` prefix!
model = CompressionModel.get_pretrained('facebook/encodec_32khz')
model = CompressionModel.get_pretrained('dac_44khz')

# Finally, you can also retrieve the full Solver object, with its dataloader etc.
from audiocraft import train
from pathlib import Path
import logging
import os
import sys

# The following line enables detailed logs when loading a Solver; remove it to silence them.
logging.basicConfig(stream=sys.stderr, level=logging.INFO)
# You must always run the following function from the root directory.
os.chdir(Path(train.__file__).parent.parent)


# You can also get the full solver (only for your own experiments).
# You can provide some overrides to the parameters to make things more convenient.
solver = train.get_solver_from_sig('SIG', {'device': 'cpu', 'dataset': {'batch_size': 8}})
solver.model
solver.dataloaders
```
|
|
### Importing / Exporting models
|
|
At the moment we do not have a definitive workflow for exporting EnCodec models, for
instance to Hugging Face (HF). We are working on supporting automatic conversion between
the AudioCraft and Hugging Face implementations.
|
|
We still have some support for fine-tuning an EnCodec model coming from HF in AudioCraft,
using for instance `continue_from=//pretrained/facebook/encodec_32khz`.
|
|
An AudioCraft checkpoint can be exported in a more compact format (excluding the optimizer state, etc.)
using `audiocraft.utils.export.export_encodec`. For instance, you could run
|
|
```python
from audiocraft.utils import export
from audiocraft import train
xp = train.main.get_xp_from_sig('SIG')
export.export_encodec(
    xp.folder / 'checkpoint.th',
    '/checkpoints/my_audio_lm/compression_state_dict.bin')


from audiocraft.models import CompressionModel
model = CompressionModel.get_pretrained('/checkpoints/my_audio_lm/compression_state_dict.bin')

from audiocraft.solvers import CompressionSolver
# The two are strictly equivalent, but this function also supports loading
# from checkpoints that have not been exported yet.
model = CompressionSolver.model_from_checkpoint('//pretrained//checkpoints/my_audio_lm/compression_state_dict.bin')
```
|
|
The [MusicGen documentation](./MUSICGEN.md) then shows how to use such a model
as a tokenizer for MusicGen/AudioGen.
|
|
### Learn more
|
|
Learn more about AudioCraft training pipelines in the [dedicated section](./TRAINING.md).
|
|
|
|
## Citation
```
@article{defossez2022highfi,
  title={High Fidelity Neural Audio Compression},
  author={Défossez, Alexandre and Copet, Jade and Synnaeve, Gabriel and Adi, Yossi},
  journal={arXiv preprint arXiv:2210.13438},
  year={2022}
}
```
|
|
|
|
## License
|
|
See license information in the [README](../README.md).
|
|
[arxiv]: https://arxiv.org/abs/2210.13438
[encodec_samples]: https://ai.honu.io/papers/encodec/samples.html
|
|