Generating Construction Safety Observations via CLIP-Based Image-Language Embedding
Abstract
Safety inspections are standard practice for preventing accidents on construction sites. Traditional workflows require an inspector to document violations through photos and textual descriptions that explain the specific incident in terms of objects, actions, and context. However, the documentation process is time-consuming, and the content is inconsistent: the same violation can be captioned in many different ways, which makes downstream safety analysis difficult. Research has investigated ways to improve documentation efficiency through applications with standardized forms and to develop language-understanding models that analyze safety reports. Nevertheless, it remains challenging to streamline the entire documentation process and accurately compile reports into meaningful information. We propose an image-language embedding model that automatically generates textual safety observations through fine-tuning of Contrastive Language-Image Pre-training (CLIP) and CLIP prefix captioning designed for the construction safety context. CLIP obtains contrastive features to classify the safety attribute types of images, and CLIP prefix captioning generates a caption from the given safety attributes, images, and captions. The framework is evaluated on a construction safety report dataset and can generate reasonable textual information for safety inspectors.
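The attribute-classification step described in the abstract follows the standard CLIP zero-shot recipe: embed the site photo and one text prompt per safety attribute, then rank attributes by cosine similarity. The sketch below illustrates that scoring step with plain NumPy on placeholder embeddings; the attribute labels, embedding dimensionality, and the logit scale of 100 (CLIP's default temperature) are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def classify_safety_attribute(image_emb, text_embs, labels, logit_scale=100.0):
    """CLIP-style zero-shot classification.

    image_emb : (d,) embedding of the site photo (placeholder here).
    text_embs : (k, d) one embedding per safety-attribute prompt.
    labels    : list of k attribute names.
    Returns the best-matching label and the softmax probabilities.
    """
    # L2-normalize so the dot product is cosine similarity, as in CLIP.
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = logit_scale * (txt @ img)
    # Numerically stable softmax over the attribute logits.
    logits -= logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    return labels[int(np.argmax(probs))], probs

# Toy example with hand-made embeddings (hypothetical attribute names):
labels = ["missing guardrail", "no hard hat", "unsecured ladder"]
text_embs = np.eye(3)                      # stand-ins for encoded prompts
image_emb = np.array([0.9, 0.1, 0.0])      # stand-in for the encoded photo
best, probs = classify_safety_attribute(image_emb, text_embs, labels)
```

In the paper's pipeline, the attribute predicted this way conditions the prefix-captioning stage, which maps the CLIP image embedding to a prefix that a language model decodes into the safety observation.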
Cite
Text
Tsai et al. "Generating Construction Safety Observations via CLIP-Based Image-Language Embedding." European Conference on Computer Vision Workshops, 2022. doi:10.1007/978-3-031-25082-8_24
Markdown
[Tsai et al. "Generating Construction Safety Observations via CLIP-Based Image-Language Embedding." European Conference on Computer Vision Workshops, 2022.](https://mlanthology.org/eccvw/2022/tsai2022eccvw-generating/) doi:10.1007/978-3-031-25082-8_24
BibTeX
@inproceedings{tsai2022eccvw-generating,
title = {{Generating Construction Safety Observations via CLIP-Based Image-Language Embedding}},
author = {Tsai, Wei Lun and Lin, Jacob J. and Hsieh, Shang-Hsien},
booktitle = {European Conference on Computer Vision Workshops},
year = {2022},
pages = {366-381},
doi = {10.1007/978-3-031-25082-8_24},
url = {https://mlanthology.org/eccvw/2022/tsai2022eccvw-generating/}
}