Design of Compact Architectures for Real-Time Spatiotemporal Action Detection
Author: Yu Liu

Subjects: Artificial intelligence; Deep learning; Action detection; [INFO.INFO-TS] Computer Science [cs]/Signal and Image Processing

Abstract:
This thesis tackles the spatiotemporal action detection problem from an online, efficient, real-time processing point of view. Over the last decade, the explosive growth of video content has driven a broad range of application demands for automated human action understanding. Beyond accurate detection, many real-world sensing scenarios also require incremental, instantaneous processing of scenes under restricted computational budgets. Current research and detection frameworks, however, cannot satisfy all of these criteria simultaneously. The main obstacle lies in their heavy architectural designs and detection pipelines for extracting pertinent spatial and temporal context, such as 3D Convolutional Neural Networks (CNNs) or explicit motion cues (e.g., optical flow). We hypothesize that reasoning about actions' spatiotemporal patterns can be realized far more efficiently, down to feasible deployment on resource-constrained devices, without significantly compromising detection quality.

To this end, we propose three action detection architectures that couple various spatiotemporal modeling schemes with compact 2D CNNs. We first accelerate frame-level action detection by allocating bottom-up feature extraction to only a sparse set of video frames while approximating the features of the remaining frames. This is realized by spatially warping CNN features under the guidance of the relative motion between successive frames, which we later extend to align and accumulate observations over time for modeling temporal variations of actions (see the sketch below). Building on the frame-level approach, we then explore a multi-frame detection paradigm that processes video sequences jointly and predicts the underlying action-specific bounding boxes (i.e., tubelets). Modeling an action sequence is decoupled into multi-frame feature aggregation and trajectory tracking, which enhance classification and localization, respectively. Finally, we devise a flow-like motion representation that can be computed on the fly from raw video frames, and extend the tubelet detection approach into a two-pathway CNN that jointly extracts actions' static visual and dynamic cues.

We demonstrate that our online action detectors progressively improve and achieve a superior combination of accuracy, efficiency, and speed compared with state-of-the-art methods on public benchmarks.
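The feature-warping idea behind the first contribution can be illustrated with a short sketch. This is a minimal, hypothetical example written in PyTorch style; the function name, tensor shapes, and the motion field are illustrative assumptions, not the thesis code. It only shows the core operation the abstract describes: bilinearly resampling a keyframe's feature map along an estimated displacement field instead of running the backbone on every frame.

```python
# Minimal sketch (assumption: names, shapes, and the flow source are illustrative).
# Idea: run the CNN backbone only on sparse keyframes and approximate intermediate
# frames by spatially warping the keyframe's features along a motion field.

import torch
import torch.nn.functional as F

def warp_features(key_feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp keyframe features toward the current frame.

    key_feat: (N, C, H, W) features extracted on the last keyframe.
    flow:     (N, 2, H, W) displacement (dx, dy) in pixels at feature resolution,
              pointing from the current frame back to the keyframe.
    """
    n, _, h, w = key_feat.shape
    # Base sampling grid in the normalized [-1, 1] coordinates used by grid_sample.
    ys, xs = torch.meshgrid(
        torch.linspace(-1.0, 1.0, h, device=key_feat.device),
        torch.linspace(-1.0, 1.0, w, device=key_feat.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(n, -1, -1, -1)  # (N, H, W, 2)
    # Convert pixel displacements into the normalized coordinate range.
    norm_flow = torch.stack(
        (flow[:, 0] * 2.0 / max(w - 1, 1), flow[:, 1] * 2.0 / max(h - 1, 1)), dim=-1
    )
    grid = base + norm_flow
    # Bilinear resampling of the keyframe features at the displaced locations.
    return F.grid_sample(key_feat, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

# Usage: approximate features for a non-keyframe instead of a full backbone pass.
key_feat = torch.randn(1, 256, 28, 28)   # hypothetical backbone output on the keyframe
flow = torch.randn(1, 2, 28, 28) * 2.0   # hypothetical motion field (pixels)
approx_feat = warp_features(key_feat, flow)
```

Under this scheme, the full backbone runs only on keyframes; each intermediate frame costs a lightweight motion estimate plus the warp above, which is what makes real-time, frame-level detection feasible under restricted computational budgets.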
| year | journal | country | edition | language |
|---|---|---|---|---|
| 2022-01-01 | | | | |