MIMIC: Masked Image Modeling with Image Correspondences
Abstract
Dense, pixel-specific representation learning at scale has been bottlenecked by the lack of large-scale multi-view datasets. Current methods for building effective pretraining datasets rely heavily on annotated 3D meshes, point clouds, and camera parameters from simulated environments, which prevents them from building datasets from real-world data sources where such metadata is unavailable. We introduce a pretraining dataset-curation approach that requires no additional annotations. Our method generates multi-view datasets at scale from both real-world videos and simulated environments. Specifically, we experiment with two scales, MIMIC-1M with 1.3M and MIMIC-3M with 3.1M multi-view image pairs, and train models with different masked image modeling objectives. Our experimental analysis shows that: (1) Representations trained on our automatically generated MIMIC-3M outperform those learned from expensive crowdsourced datasets (ImageNet-1K) and from synthetic environments (Multiview-Habitat) on three dense geometric tasks: depth estimation on NYUv2 (↑1.7%), surface normal estimation on Taskonomy (↓2.05%), and depth estimation on Taskonomy (↓7.5%); they perform on par with Multiview-Habitat on the Taskonomy edges and curvature tasks. (2) The larger dataset (MIMIC-3M) improves performance, which is promising since our curation method can scale arbitrarily to produce even larger datasets. The code and instructions to download, access, and use MIMIC-3M are available.
Cite
Text
Marathe et al. "MIMIC: Masked Image Modeling with Image Correspondences." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024. doi:10.1109/CVPRW63382.2024.00076
Markdown
[Marathe et al. "MIMIC: Masked Image Modeling with Image Correspondences." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024.](https://mlanthology.org/cvprw/2024/marathe2024cvprw-mimic/) doi:10.1109/CVPRW63382.2024.00076
BibTeX
@inproceedings{marathe2024cvprw-mimic,
title = {{MIMIC: Masked Image Modeling with Image Correspondences}},
author = {Marathe, Kalyani and Bigverdi, Mahtab and Khan, Nishat and Kundu, Tuhin and Howe, Patrick and S, Sharan Ranjit and Bhattad, Anand and Kembhavi, Aniruddha and Shapiro, Linda G. and Krishna, Ranjay},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
year = {2024},
pages = {718-727},
doi = {10.1109/CVPRW63382.2024.00076},
url = {https://mlanthology.org/cvprw/2024/marathe2024cvprw-mimic/}
}