
reazon-research/japanese-avhubert-large_noise_pt

This is an AVHuBERT (Audio-Visual Hidden Unit BERT) Large model pretrained on a Japanese audio-visual corpus with noise augmentation. AVHuBERT is a self-supervised model for AVSR (Audio-Visual Speech Recognition) that remains robust in noisy environments by leveraging both audio and visual inputs.

The model was pretrained on approximately 2,250 hours of Japanese audio-visual data, with noise augmentation drawing on environmental noise, music, babble noise, and speech mixup.
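The card does not ship the augmentation code itself, but additive mixing at a target signal-to-noise ratio is the standard way such corruption is applied. A minimal illustrative sketch in NumPy (the function name and SNR handling are assumptions, not the actual training recipe):

import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    # Tile or trim the noise so it covers the whole utterance.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    # Scale the noise so that 10 * log10(P_speech / P_noise) equals snr_db.
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

Babble noise and speech mixup follow the same additive pattern, with the noise waveform replaced by overlapping speech.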

Usage

First, install the dependencies:

$ pip install git+https://github.com/reazon-research/ReazonSpeech.git#subdirectory=pkg/avsr

Using transformers directly

from transformers import AutoFeatureExtractor, AutoModel

extractor = AutoFeatureExtractor.from_pretrained("reazon-research/japanese-avhubert-large_noise_pt", trust_remote_code=True)
model = AutoModel.from_pretrained("reazon-research/japanese-avhubert-large_noise_pt", trust_remote_code=True)

inputs = extractor(raw_audio="path/to/audio", raw_video="path/to/video")
# If the video has not already been cropped to the mouth region,
# pass `extract_mouth=True` to have the extractor perform the crop:
inputs = extractor(raw_audio="path/to/audio", raw_video="path/to/video", extract_mouth=True)

outputs = model(**inputs)
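The forward pass returns frame-level audio-visual features rather than text. A minimal sketch of inspecting them, assuming the model follows the usual transformers convention of exposing last_hidden_state:

# Fused audio-visual features per frame (assumes the standard
# transformers output convention; the shape below is illustrative).
features = outputs.last_hidden_state
print(features.shape)  # e.g. (batch, num_frames, 1024) for a Large model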

Using the reazonspeech.avsr package

If you do not want to use trust_remote_code, use the reazonspeech.avsr package (installed above) instead.

from reazonspeech.avsr import AVHubertFeatureExtractor, AVHubertModel

extractor = AVHubertFeatureExtractor.from_pretrained("reazon-research/japanese-avhubert-large_noise_pt")
model = AVHubertModel.from_pretrained("reazon-research/japanese-avhubert-large_noise_pt")

inputs = extractor(raw_audio="path/to/audio", raw_video="path/to/video")
# If the video has not already been cropped to the mouth region,
# pass `extract_mouth=True` to have the extractor perform the crop:
inputs = extractor(raw_audio="path/to/audio", raw_video="path/to/video", extract_mouth=True)

outputs = model(**inputs)
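Because this is a pretrained encoder checkpoint, a common downstream use is to pool the frame features into an utterance-level embedding or to fine-tune the model with a task head. A minimal pooling sketch (the mean-pooling choice is an assumption for illustration, not part of the release):

import torch

with torch.no_grad():
    outputs = model(**inputs)
# Mean-pool over the time axis to obtain one vector per utterance
# (the pooling strategy is illustrative only).
utterance_embedding = outputs.last_hidden_state.mean(dim=1)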

Citation

@misc{reazon-research/japanese-avhubert-large_noise_pt,
  title = {japanese-avhubert-large_noise_pt},
  author = {Sasaki, Yuta},
  url = {https://huggingface.co/reazon-research/japanese-avhubert-large_noise_pt},
  year = {2025}
}

@article{shi2022avhubert,
  author = {Bowen Shi and Wei-Ning Hsu and Kushal Lakhotia and Abdelrahman Mohamed},
  title = {Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction},
  journal = {arXiv preprint arXiv:2201.02184},
  year = {2022}
}

@article{shi2022avsr,
  author = {Bowen Shi and Wei-Ning Hsu and Abdelrahman Mohamed},
  title = {Robust Self-Supervised Audio-Visual Speech Recognition},
  journal = {arXiv preprint arXiv:2201.01763},
  year = {2022}
}

License

CC-BY-NC-4.0
