# EnCodec: High Fidelity Neural Audio Compression
|
|
AudioCraft provides the training code for EnCodec, a state-of-the-art deep learning
based audio codec supporting both mono and stereo audio, presented in the
[High Fidelity Neural Audio Compression][arxiv] paper.
Check out our [sample page][encodec_samples].
|
|
## Original EnCodec models
|
|
The EnCodec models presented in High Fidelity Neural Audio Compression can be accessed
and used with the [EnCodec repository](https://github.com/facebookresearch/encodec).
|
|
**Note**: We do not guarantee compatibility between the AudioCraft and EnCodec codebases
and released checkpoints at this stage.
|
|
|
|
## Installation
|
|
Please follow the AudioCraft installation instructions from the [README](../README.md).
|
|
|
|
## Training
|
|
The [CompressionSolver](../audiocraft/solvers/compression.py) implements the audio reconstruction
task to train an EnCodec model. Specifically, it trains an encoder-decoder with a quantization
bottleneck - for EnCodec, a SEANet encoder-decoder with a Residual Vector Quantization
bottleneck - using a combination of objective and perceptual losses, the latter taking
the form of discriminators.
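
To build intuition for the Residual Vector Quantization (RVQ) bottleneck, here is a minimal sketch in plain numpy. The random codebooks and the `rvq_encode` helper are purely illustrative; in the actual model the codebooks are learned during training.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Quantize `x` with a cascade of codebooks, each stage coding the
    residual left over by the previous one."""
    residual = x.astype(float)
    indices = []
    quantized = np.zeros_like(residual)
    for codebook in codebooks:
        # pick the codeword closest to the current residual
        i = int(np.argmin(((codebook - residual) ** 2).sum(axis=1)))
        indices.append(i)
        quantized += codebook[i]
        residual -= codebook[i]
    return indices, quantized

rng = np.random.default_rng(0)
# 4 RVQ stages, each with 16 codewords of dimension 8
codebooks = [rng.normal(size=(16, 8)) for _ in range(4)]
x = rng.normal(size=8)
indices, quantized = rvq_encode(x, codebooks)
# `indices` (one integer per stage) is all that needs to be transmitted.
```

Dropping trailing codebooks lowers the bitrate at the cost of reconstruction quality, which is how a single RVQ model can serve several bandwidths.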
|
|
The default configuration matches a causal EnCodec training at a single, fixed bandwidth.
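
The bandwidth is a simple function of the latent frame rate and the number of RVQ codebooks. As a back-of-the-envelope check, using the figures of the 32 kHz MusicGen codec (a 50 Hz frame rate and 4 codebooks of 2048 entries; substitute the values of your own configuration):

```python
import math

frame_rate = 50        # latent frames per second (32 kHz audio with a hop of 640)
num_codebooks = 4      # RVQ stages used at this bandwidth
codebook_size = 2048   # entries per codebook, i.e. log2(2048) = 11 bits per code

bits_per_second = frame_rate * num_codebooks * math.log2(codebook_size)
print(bits_per_second / 1000)  # 2.2 kbps
```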
|
|
### Example configuration and grids
|
|
We provide sample configurations and grids for training EnCodec models.
|
|
The compression configurations are defined in
[config/solver/compression](../config/solver/compression).
|
|
The example grids are available at
[audiocraft/grids/compression](../audiocraft/grids/compression).
|
|
```shell
# base causal EnCodec on monophonic audio sampled at 24 kHz
dora grid compression.encodec_base_24khz
# EnCodec model used for MusicGen on monophonic audio sampled at 32 kHz
dora grid compression.encodec_musicgen_32khz
```
|
|
### Training and valid stages
|
|
The model is trained using a combination of objective and perceptual losses.
More specifically, EnCodec is trained with the MS-STFT (multi-scale STFT) discriminator
along with objective reconstruction losses. A loss balancer is used to weight the
different losses against each other in an intuitive manner, independently of their raw scales.
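
The gist of the balancer, as a simplified sketch (not AudioCraft's actual `Balancer` class): each loss's gradient is rescaled by its norm before the user-specified weight is applied, so the weights express relative importance directly, regardless of the magnitude of each loss.

```python
import numpy as np

def balanced_gradient(grads, weights, eps=1e-12):
    """Combine per-loss gradients so each contributes in proportion to its
    weight, independently of its raw magnitude."""
    total_weight = sum(weights.values())
    out = np.zeros_like(next(iter(grads.values())))
    for name, grad in grads.items():
        norm = np.linalg.norm(grad)
        out += (weights[name] / total_weight) * grad / (norm + eps)
    return out

# Two losses with wildly different raw gradient scales...
grads = {'l1': np.array([100.0, 0.0]), 'msspec': np.array([0.0, 0.001])}
weights = {'l1': 1.0, 'msspec': 1.0}
grad = balanced_gradient(grads, weights)
# ...end up contributing equally after balancing.
```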
|
|
### Evaluation stage
|
|
Evaluation metrics for audio generation:
* SI-SNR: Scale-Invariant Signal-to-Noise Ratio.
* ViSQOL: Virtual Speech Quality Objective Listener.
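
For reference, SI-SNR is straightforward to compute from its standard definition (a minimal numpy sketch, not the AudioCraft implementation):

```python
import numpy as np

def si_snr(ref, est, eps=1e-8):
    """Scale-Invariant SNR in dB between a reference and an estimated signal."""
    ref = ref - ref.mean()
    est = est - est.mean()
    # project the estimate onto the reference: any global gain is factored out
    target = (est @ ref) / (ref @ ref + eps) * ref
    noise = est - target
    return 10 * np.log10((target @ target) / (noise @ noise + eps))

t = np.linspace(0, 1, 16000)
ref = np.sin(2 * np.pi * 440 * t)
noisy = ref + 0.1 * np.random.default_rng(0).normal(size=t.size)
print(si_snr(ref, 0.5 * ref))  # scale-invariant: a pure rescaling scores very high
print(si_snr(ref, noisy))
```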
|
|
Note: the path to the ViSQOL binary (compiled with Bazel) must be provided in
order to run the ViSQOL metric on the reference and degraded signals.
The metric is disabled by default.
Please refer to the [metrics documentation](../METRICS.md) to learn more.
|
|
### Generation stage
|
|
The generation stage consists of reconstructing audio from samples
with the current model. The number of samples generated and the batch size used are
controlled by the `dataset.generate` configuration. The output path and audio formats
are defined in the generate stage configuration.
|
|
```shell
# generate samples every 5 epochs
dora run solver=compression/encodec_base_24khz generate.every=5
# write the generated samples to a different path inside the dora xp folder
dora run solver=compression/encodec_base_24khz generate.path=<PATH_IN_DORA_XP_FOLDER>
# limit the number of samples or use a different batch size
dora run solver=compression/encodec_base_24khz dataset.generate.num_samples=10 dataset.generate.batch_size=4
```
|
|
### Playing with the model
|
|
Once you have a trained model, you can retrieve either the entire solver or just
the trained model with the following functions:
|
|
```python
from audiocraft.solvers import CompressionSolver

# If you trained a custom model with signature SIG.
model = CompressionSolver.model_from_checkpoint('//sig/SIG')
# If you want to get one of the pretrained models with the `//pretrained/` prefix.
model = CompressionSolver.model_from_checkpoint('//pretrained/facebook/encodec_32khz')
# Or load from a custom checkpoint path
model = CompressionSolver.model_from_checkpoint('/my_checkpoints/foo/bar/checkpoint.th')


# If you only want to use a pretrained model, you can also directly get it
# from the CompressionModel base model class.
from audiocraft.models import CompressionModel

# Here do not put the `//pretrained/` prefix!
model = CompressionModel.get_pretrained('facebook/encodec_32khz')
model = CompressionModel.get_pretrained('dac_44khz')

# Finally, you can also retrieve the full Solver object, with its dataloader etc.
from audiocraft import train
from pathlib import Path
import logging
import os
import sys

# The following line enables detailed logs when loading a Solver; remove it to silence them.
logging.basicConfig(stream=sys.stderr, level=logging.INFO)
# You must always run the following function from the root directory.
os.chdir(Path(train.__file__).parent.parent)


# You can also get the full solver (only for your own experiments).
# You can provide some overrides to the parameters to make things more convenient.
solver = train.get_solver_from_sig('SIG', {'device': 'cpu', 'dataset': {'batch_size': 8}})
solver.model
solver.dataloaders
```
|
|
### Importing / Exporting models
|
|
At the moment we do not have a definitive workflow for exporting EnCodec models, for
instance to Hugging Face (HF). We are working on supporting automatic conversion between
the AudioCraft and Hugging Face implementations.
|
|
We still have some support for fine-tuning an EnCodec model coming from HF in AudioCraft,
using for instance `continue_from=//pretrained/facebook/encodec_32khz`.
|
|
An AudioCraft checkpoint can be exported in a more compact format (excluding the optimizer state, etc.)
using `audiocraft.utils.export.export_encodec`. For instance, you could run
|
|
```python
from audiocraft.utils import export
from audiocraft import train
xp = train.main.get_xp_from_sig('SIG')
export.export_encodec(
    xp.folder / 'checkpoint.th',
    '/checkpoints/my_audio_lm/compression_state_dict.bin')


from audiocraft.models import CompressionModel
model = CompressionModel.get_pretrained('/checkpoints/my_audio_lm/compression_state_dict.bin')

from audiocraft.solvers import CompressionSolver
# The two are strictly equivalent, but this function also supports loading
# from checkpoints that have not been exported yet.
model = CompressionSolver.model_from_checkpoint('//pretrained//checkpoints/my_audio_lm/compression_state_dict.bin')
```
|
|
The [MusicGen documentation](./MUSICGEN.md) then shows how to use such a model
as a tokenizer for MusicGen/AudioGen.
|
|
### Learn more
|
|
Learn more about AudioCraft training pipelines in the [dedicated section](./TRAINING.md).
|
|
|
|
## Citation
```
@article{defossez2022highfi,
  title={High Fidelity Neural Audio Compression},
  author={Défossez, Alexandre and Copet, Jade and Synnaeve, Gabriel and Adi, Yossi},
  journal={arXiv preprint arXiv:2210.13438},
  year={2022}
}
```
|
|
|
|
## License
|
|
See license information in the [README](../README.md).
|
|
[arxiv]: https://arxiv.org/abs/2210.13438
[encodec_samples]: https://ai.honu.io/papers/encodec/samples.html
|
|