This model is a UPerNet semantic segmentation model with a FiRE-ViT (Vision Transformer with Rotary Position Embeddings) backbone, trained on the ADE20K dataset.
| Metric | Value |
|---|---|
| mIoU | 24.38% |
| mAcc | 33.57% |
| aAcc | 71.33% |
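
For readers unfamiliar with the three metrics, the sketch below shows how mIoU (mean intersection over union), mAcc (mean per-class accuracy), and aAcc (overall pixel accuracy) are commonly defined. It is a dependency-free illustration on tiny label maps, not the evaluation code used to produce the numbers above; the function name `segmentation_metrics` and the `ignore_index=255` default are assumptions made for this example.

```python
def segmentation_metrics(pred, target, num_classes, ignore_index=255):
    """Return (mIoU, mAcc, aAcc) as fractions in [0, 1].

    `pred` and `target` are 2D label maps (lists of rows) of equal shape.
    Pixels whose ground-truth label equals `ignore_index` are skipped.
    """
    intersect = [0] * num_classes    # correctly predicted pixels per class
    pred_area = [0] * num_classes    # pixels predicted as each class
    target_area = [0] * num_classes  # ground-truth pixels of each class

    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if t == ignore_index:
                continue
            pred_area[p] += 1
            target_area[t] += 1
            if p == t:
                intersect[t] += 1

    ious, accs = [], []
    for c in range(num_classes):
        union = pred_area[c] + target_area[c] - intersect[c]
        if union > 0:                # class appears in pred or target
            ious.append(intersect[c] / union)
        if target_area[c] > 0:       # class appears in the ground truth
            accs.append(intersect[c] / target_area[c])

    total = sum(target_area)
    miou = sum(ious) / len(ious) if ious else 0.0
    macc = sum(accs) / len(accs) if accs else 0.0
    aacc = sum(intersect) / total if total else 0.0
    return miou, macc, aacc


# Toy 2x2 example with three classes (values are illustrative only).
example_pred = [[0, 0], [1, 2]]
example_target = [[0, 1], [1, 2]]
miou, macc, aacc = segmentation_metrics(example_pred, example_target, num_classes=3)
print(f"mIoU={miou:.4f} mAcc={macc:.4f} aAcc={aacc:.4f}")
```

Because mIoU averages over classes rather than pixels, rare classes weigh as much as common ones, which is one reason mIoU can sit well below aAcc.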
```python
from mmseg.apis import init_model, inference_model

config_file = 'upernet_fire_vit_tiny_512x512_ade20k.py'
checkpoint_file = 'best_mIoU_iter_40000.pth'

# Initialize the model
model = init_model(config_file, checkpoint_file, device='cuda:0')

# Inference on an image
result = inference_model(model, 'demo.jpg')
```
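
One way to inspect a prediction is to summarize which classes cover the image. The sketch below is a hypothetical helper, `class_pixel_fractions`, that works on any 2D label map; it assumes MMSegmentation 1.x, where `inference_model` returns a `SegDataSample` whose `pred_sem_seg.data` tensor has shape `(1, H, W)`. The label map used here is a made-up stand-in so the snippet runs without a checkpoint.

```python
from collections import Counter


def class_pixel_fractions(mask):
    """Map each class index in a 2D label map to its share of pixels."""
    counts = Counter(int(label) for row in mask for label in row)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in sorted(counts.items())}


# Stand-in label map; with a real prediction one would instead use
# (assumption, MMSegmentation 1.x):
#   mask = result.pred_sem_seg.data[0].cpu().numpy()
demo_mask = [
    [0, 0, 3],
    [3, 3, 12],
]
fractions = class_pixel_fractions(demo_mask)
for label, share in fractions.items():
    print(f"class {label}: {share:.1%} of pixels")
```

The keys are class indices; mapping them to ADE20K class names depends on the class list stored with the model's config, which is not shown here.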
The model was trained with the following configuration:
If you use this model, please cite:
```bibtex
@misc{rope-vit-segmentation,
  author = {VLG IITR},
  title = {UPerNet with FiRE-ViT for Semantic Segmentation},
  year = {2026},
  publisher = {Hugging Face},
}
```
This model is released under the Apache 2.0 license.