
reazon-research/japanese-avhubert-large_noise_pt

This is an AVHuBERT (Audio-Visual Hidden Unit BERT) Large model pretrained on a Japanese audio-visual corpus with noise augmentation. AVHuBERT is a self-supervised model for AVSR (Audio-Visual Speech Recognition) that remains robust in noisy environments by leveraging both audio and visual inputs.

The model was pretrained on approximately 2,250 hours of Japanese audio-visual data, with noise augmentation drawing on environmental noise, music, babble noise, and speech mixup.
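The card does not ship the augmentation code itself, but additive mixing at a target signal-to-noise ratio is the standard way such corruption is applied. A minimal illustrative sketch in NumPy (the function name and SNR handling are assumptions, not the actual training recipe):

import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    # Tile or trim the noise so it covers the whole utterance.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    # Scale the noise so that 10 * log10(P_speech / P_noise) equals snr_db.
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

Babble noise and speech mixup follow the same additive pattern, with the noise waveform replaced by overlapping speech.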

Usage

First, install the dependencies:

$ pip install git+https://github.com/reazon-research/ReazonSpeech.git#subdirectory=pkg/avsr

Using transformers directly

from transformers import AutoFeatureExtractor, AutoModel

extractor = AutoFeatureExtractor.from_pretrained("reazon-research/japanese-avhubert-large_noise_pt", trust_remote_code=True)
model = AutoModel.from_pretrained("reazon-research/japanese-avhubert-large_noise_pt", trust_remote_code=True)

inputs = extractor(raw_audio="path/to/audio", raw_video="path/to/video")
# If the video has not already been cropped to the mouth region,
# pass `extract_mouth=True` to have the extractor perform the crop:
inputs = extractor(raw_audio="path/to/audio", raw_video="path/to/video", extract_mouth=True)

outputs = model(**inputs)
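The forward pass returns frame-level audio-visual features rather than text. A minimal sketch of inspecting them, assuming the model follows the usual transformers convention of exposing last_hidden_state:

# Fused audio-visual features per frame (assumes the standard
# transformers output convention; the shape below is illustrative).
features = outputs.last_hidden_state
print(features.shape)  # e.g. (batch, num_frames, 1024) for a Large model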

Using the reazonspeech.avsr package

If you do not want to use trust_remote_code, use the reazonspeech.avsr package (installed above) instead.

from reazonspeech.avsr import AVHubertFeatureExtractor, AVHubertModel

extractor = AVHubertFeatureExtractor.from_pretrained("reazon-research/japanese-avhubert-large_noise_pt")
model = AVHubertModel.from_pretrained("reazon-research/japanese-avhubert-large_noise_pt")

inputs = extractor(raw_audio="path/to/audio", raw_video="path/to/video")
# If the video has not already been cropped to the mouth region,
# pass `extract_mouth=True` to have the extractor perform the crop:
inputs = extractor(raw_audio="path/to/audio", raw_video="path/to/video", extract_mouth=True)

outputs = model(**inputs)
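Because this is a pretrained encoder checkpoint, a common downstream use is to pool the frame features into an utterance-level embedding or to fine-tune the model with a task head. A minimal pooling sketch (the mean-pooling choice is an assumption for illustration, not part of the release):

import torch

with torch.no_grad():
    outputs = model(**inputs)
# Mean-pool over the time axis to obtain one vector per utterance
# (the pooling strategy is illustrative only).
utterance_embedding = outputs.last_hidden_state.mean(dim=1)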

Citation

@misc{reazon-research/japanese-avhubert-large_noise_pt,
  title = {japanese-avhubert-large_noise_pt},
  author = {Sasaki, Yuta},
  url = {https://huggingface.co/reazon-research/japanese-avhubert-large_noise_pt},
  year = {2025}
}

@article{shi2022avhubert,
  author = {Bowen Shi and Wei-Ning Hsu and Kushal Lakhotia and Abdelrahman Mohamed},
  title = {Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction},
  journal = {arXiv preprint arXiv:2201.02184},
  year = {2022}
}

@article{shi2022avsr,
  author = {Bowen Shi and Wei-Ning Hsu and Abdelrahman Mohamed},
  title = {Robust Self-Supervised Audio-Visual Speech Recognition},
  journal = {arXiv preprint arXiv:2201.01763},
  year = {2022}
}

License

CC-BY-NC-4.0
