Semantic Aware Video Transcription Using Random Forest Classifiers
Abstract
This paper focuses on transcription generation in the form of subject, verb, object (SVO) triplets for videos in the wild, given off-the-shelf visual concept detectors. This problem is challenging due to the availability of sentence-only annotations, the unreliability of concept detectors, and the lack of training samples for many words. Facing these challenges, we propose a Semantic Aware Transcription (SAT) framework based on Random Forest classifiers. It takes concept detection results as input and outputs a distribution over English words. SAT uses video-sentence pairs for training. It hierarchically learns node splits by grouping semantically similar words, with similarity measured by a continuous skip-gram language model. This not only addresses the sparsity of training samples per word, but also yields semantically reasonable errors during transcription. SAT provides a systematic way to measure the relatedness of a concept detector to real words, which helps us understand the relationship between current visual detectors and words in a semantic space. Experiments on a large video dataset with 1,970 clips and 85,550 sentences demonstrate our approach.
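The node-split idea above can be illustrated with a minimal sketch: given word vectors (the paper learns these with a continuous skip-gram model; the toy embeddings and the two-seed assignment rule below are illustrative assumptions, not the paper's exact algorithm), each candidate word is routed to the semantically closer branch, so similar words stay together.

```python
import math

# Toy 3-d word vectors standing in for skip-gram embeddings
# (hypothetical values chosen so verbs cluster apart from animal nouns).
EMBED = {
    "run":   [0.90, 0.10, 0.00],
    "jump":  [0.80, 0.20, 0.10],
    "walk":  [0.85, 0.15, 0.05],
    "dog":   [0.10, 0.90, 0.20],
    "cat":   [0.05, 0.95, 0.15],
    "horse": [0.20, 0.80, 0.10],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def split_words(words, seed_a, seed_b):
    """Assign each word to the branch whose seed word is semantically
    closer, mimicking a node split that groups similar words together."""
    left, right = [], []
    for w in words:
        sim_a = cosine(EMBED[w], EMBED[seed_a])
        sim_b = cosine(EMBED[w], EMBED[seed_b])
        (left if sim_a >= sim_b else right).append(w)
    return left, right

left, right = split_words(list(EMBED), "run", "dog")
print(left)   # verbs end up on one branch
print(right)  # animal nouns on the other
```

With these toy vectors the split sends `run`, `jump`, `walk` one way and `dog`, `cat`, `horse` the other, so even when the classifier errs at a node, the resulting confusion stays within a semantically related group.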
Cite

Text

Sun and Nevatia. "Semantic Aware Video Transcription Using Random Forest Classifiers." European Conference on Computer Vision, 2014. doi:10.1007/978-3-319-10590-1_50

Markdown

[Sun and Nevatia. "Semantic Aware Video Transcription Using Random Forest Classifiers." European Conference on Computer Vision, 2014.](https://mlanthology.org/eccv/2014/sun2014eccv-semantic/) doi:10.1007/978-3-319-10590-1_50

BibTeX
@inproceedings{sun2014eccv-semantic,
title = {{Semantic Aware Video Transcription Using Random Forest Classifiers}},
author = {Sun, Chen and Nevatia, Ram},
booktitle = {European Conference on Computer Vision},
year = {2014},
pages = {772-786},
doi = {10.1007/978-3-319-10590-1_50},
url = {https://mlanthology.org/eccv/2014/sun2014eccv-semantic/}
}