A Data Association Algorithm for People Re-Identification in Photo Sequences
In this paper, a new system is presented to support the user in the face annotation task. Every time a photo sequence becomes available, the system analyses it to detect faces and cluster them into sets corresponding to the same person. We propose to model the problem of people re-identification in photos as a data association problem. In this way, the system takes advantage of the assumption that each person can appear at most once in each photo. We propose a fully automated method for grouping facial images; the method requires neither initialization nor a priori knowledge of the number of persons in the photo sequence. We compare the results obtained with our method and with s…
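The at-most-once constraint can be illustrated with a small assignment problem. The sketch below is not the paper's algorithm: it is a brute-force one-to-one association over a hypothetical dissimilarity matrix `cost`, where `new_cost` is an assumed penalty for opening a new identity. Exhaustive search is only practical for a handful of faces per photo.

```python
from itertools import permutations


def associate(cost, new_cost):
    """Assign each face in a photo to at most one known identity.

    cost[i][j] is a (hypothetical) dissimilarity between face i and
    identity j. Returns a list whose i-th entry is the identity index
    chosen for face i, or None if the face starts a new identity.
    """
    n = len(cost)
    m = len(cost[0]) if n else 0
    # Pad with n dummy columns so that every face may open a new identity.
    padded = [list(row) + [new_cost] * n for row in cost]
    best, best_total = (), float("inf")
    # Each permutation uses every column at most once, which enforces
    # "a person appears at most once in each photo".
    for cols in permutations(range(m + n), n):
        total = sum(padded[i][c] for i, c in enumerate(cols))
        if total < best_total:
            best_total, best = total, cols
    return [c if c < m else None for c in best]
```

In the usage below both faces are closest to identity 0, but only one of them may take it; the other is cheaper to treat as a new person than to force onto identity 1.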
Object Matching in Distributed Video Surveillance Systems by LDA-Based Appearance Descriptors
Establishing correspondences among object instances is still challenging in multi-camera surveillance systems, especially when the cameras’ fields of view are non-overlapping. Spatiotemporal constraints can help in solving the correspondence problem but still leave a wide margin of uncertainty. One way to reduce this uncertainty is to use appearance information about the moving objects in the site. In this paper we present the preliminary results of a new method that can capture salient appearance characteristics at each camera node in the network. A Latent Dirichlet Allocation (LDA) model is created and maintained at each node in the camera network. Each object is encoded in terms of the…
3D skeleton-based human action classification: A survey
In recent years, there has been a proliferation of works on human action classification from depth sequences. These works generally present methods and/or feature representations for the classification of actions from sequences of 3D locations of human body joints and/or other sources of data, such as depth maps and RGB videos. This survey highlights motivations and challenges of this very recent research area by presenting technologies and approaches for 3D skeleton-based action classification. The work focuses on aspects such as data pre-processing, publicly available benchmarks and commonly used accuracy measurements. Furthermore, this survey introduces a categorization of the most recent…
Multi-modal Medical Image Registration by Local Affine Transformations
Image registration is the process of finding the geometric transformation that, applied to the floating image, gives the registered image with the highest similarity to the reference image. Registering a pair of images involves the definition of a similarity function in terms of the parameters of the geometric transformation that allows the registration. This paper proposes to register a pair of images by iteratively maximizing the empirical mutual information through coordinate gradient descent. Hence, the registered image is obtained by applying a sequence of local affine transformations. Rather than adopting a uniformly spaced grid to select image blocks to locally register, as done by s…
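As an illustration of the similarity function mentioned above, the following sketch estimates the empirical mutual information of two equally sized, already quantized intensity sequences from their joint histogram. It covers only the similarity measure, not the optimization or the local affine transformations; the function name and the flat-list input format are assumptions made for the example.

```python
from collections import Counter
from math import log


def mutual_information(a, b):
    """Empirical mutual information (in nats) of two aligned sequences
    of quantized intensities, e.g. flattened image blocks."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    joint = Counter(zip(a, b))      # joint histogram
    hist_a, hist_b = Counter(a), Counter(b)  # marginal histograms
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p_xy / (p_x * p_y) simplifies to count * n / (count_x * count_y)
        mi += p_xy * log(count * n / (hist_a[x] * hist_b[y]))
    return mi
```

Two blocks related by a one-to-one intensity mapping reach the maximum value (the entropy of either block), while statistically independent blocks give zero, which is why the measure suits multi-modal pairs.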
Depth-Aware Multi-object Tracking in Spherical Videos
This paper deals with the multi-object tracking (MOT) problem in videos acquired by 360-degree cameras. Targets are tracked by a frame-by-frame association strategy. At each frame, candidate targets are detected by a pre-trained state-of-the-art deep model. Associations to the targets known up to the previous frame are found by solving a data association problem considering the locations of the targets in the scene. In case of a missing detection, a Kalman filter is used to track the target. Unlike state-of-the-art works, the proposed tracker considers the depth of the targets in the scene. The distance of the targets from the camera can be estimated by geometrical facts pec…
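The role of the Kalman filter during a missing detection can be sketched with a one-dimensional constant-velocity model. This is a minimal illustration, not the tracker described in the paper: the state layout, the noise values `q` and `r`, and the simplified diagonal process noise are all assumptions.

```python
class ConstantVelocityKalman:
    """1D constant-velocity Kalman filter with state (position, velocity)."""

    def __init__(self, position, velocity=0.0, q=0.01, r=1.0):
        self.p, self.v = position, velocity
        self.P = [[1.0, 0.0], [0.0, 1.0]]  # state covariance
        self.q, self.r = q, r              # assumed process / measurement noise

    def predict(self, dt=1.0):
        self.p += self.v * dt
        (a, b), (c, d) = self.P
        # P <- F P F^T + q I, with F = [[1, dt], [0, 1]]
        self.P = [[a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
                  [c + dt * d, d + self.q]]

    def update(self, z):
        (a, b), (c, d) = self.P
        s = a + self.r             # innovation variance (only position is observed)
        k0, k1 = a / s, c / s      # Kalman gain
        y = z - self.p             # innovation
        self.p += k0 * y
        self.v += k1 * y
        self.P = [[(1 - k0) * a, (1 - k0) * b],
                  [c - k1 * a, d - k1 * b]]

    def step(self, z=None, dt=1.0):
        """Advance one frame; z=None means the detector missed the target."""
        self.predict(dt)
        if z is not None:
            self.update(z)
        return self.p
```

When `z` is `None` the target simply coasts along its predicted motion and its uncertainty grows; once a detection is associated again, the update pulls the estimate back toward the measurement.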
Online Multi-Person Tracking by Tracker Hierarchy
Tracking-by-detection is a widely used paradigm for multi-person tracking but is affected by variations in crowd density, obstacles in the scene, varying illumination, human pose variation, scale changes, etc. We propose an improved tracking-by-detection framework for multi-person tracking where the appearance model is formulated as a template ensemble updated online given detections provided by a pedestrian detector. We employ a hierarchy of trackers to select the most effective tracking strategy and an algorithm to adapt the conditions for trackers' initialization and termination. Our formulation is online and does not require calibration information. In experiments with four pedestrian t…
An on-line learning method for face association in personal photo collection
Due to the widespread use of cameras, it is very common to collect thousands of personal photos. A proper organization is needed to make the collection usable and to enable easy photo retrieval. In this paper, we present a method to organize personal photo collections based on "who" is in the picture. Our method consists of detecting the faces in the photo sequence and arranging them in groups corresponding to the probable identities. This problem can be conveniently modeled as a multi-target visual tracking problem where a set of on-line trained classifiers is used to represent the identity models. In contrast to other works where clustering methods are used, our method relies on a probabilis…
Hop: Histogram of patterns for human action representation
This paper presents a novel method for representing actions in terms of multinomial distributions of frequent sequential patterns of different length. Frequent sequential patterns are series of data descriptors that occur many times in the data. This paper proposes to learn a codebook of frequent sequential patterns by means of an apriori-like algorithm, and to represent an action with a Bag-of-Frequent-Sequential-Patterns approach. Preliminary experiments of the proposed method have been conducted for action classification on skeletal data. The method achieves state-of-the-art accuracy value in cross-subject validation.
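The idea of an apriori-like codebook followed by a bag-of-patterns histogram can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's method: patterns are restricted to contiguous runs of already quantized symbols, support is the raw occurrence count, and `min_count` and `max_len` are hypothetical parameters.

```python
from collections import Counter


def frequent_patterns(sequences, min_count, max_len):
    """Mine contiguous patterns level by level (apriori-style pruning)."""
    frequent = {}
    previous = set()
    for k in range(1, max_len + 1):
        counts = Counter()
        for seq in sequences:
            for i in range(len(seq) - k + 1):
                pattern = tuple(seq[i:i + k])
                # A pattern can be frequent only if both of its
                # length k-1 sub-patterns are frequent.
                if k == 1 or (pattern[:-1] in previous and pattern[1:] in previous):
                    counts[pattern] += 1
        previous = {p for p, c in counts.items() if c >= min_count}
        if not previous:
            break
        for p in previous:
            frequent[p] = counts[p]
    return frequent


def pattern_histogram(sequence, codebook):
    """Normalized occurrence counts of each codebook pattern in a sequence."""
    counts = []
    for pattern in codebook:
        k = len(pattern)
        counts.append(sum(1 for i in range(len(sequence) - k + 1)
                          if tuple(sequence[i:i + k]) == pattern))
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)
```

The resulting histogram is a multinomial-style descriptor over patterns of different lengths that a standard classifier could consume.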
Gesture Modeling by Hanklet-Based Hidden Markov Model
In this paper we propose a novel approach for gesture modeling. We aim at decomposing a gesture into sub-trajectories that are the output of a sequence of atomic linear time invariant (LTI) systems, and we use a Hidden Markov Model to model the transitions from one LTI system to another. For this purpose, we represent the human body motion in a temporal window as a set of body joint trajectories that we assume are the output of an LTI system. We describe the set of trajectories in a temporal window by the corresponding Hankel matrix (Hanklet), which embeds the observability matrix of the LTI system that produced it. We train a set of HMMs (one for each gesture class) with a discriminative a…
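For readers unfamiliar with the construction, a Hankel matrix built from a temporal window can be sketched as below. The example assumes scalar samples for brevity; for vector-valued joint trajectories each entry would be a block (one column vector per time step). The function name and the `rows` parameter are illustrative choices.

```python
def hankel(window, rows):
    """Build the Hankel matrix of a 1D sequence.

    Entry (i, j) holds window[i + j], so every anti-diagonal is constant
    and each column is the previous one shifted by one time step.
    """
    cols = len(window) - rows + 1
    if rows < 1 or cols < 1:
        raise ValueError("window is too short for the requested number of rows")
    return [[window[i + j] for j in range(cols)] for i in range(rows)]
```

For instance, `hankel([1, 2, 3, 4, 5], 3)` stacks the shifted sub-windows `[1, 2, 3]`, `[2, 3, 4]` and `[3, 4, 5]` as rows.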
Activity Monitoring Made Easier by Smart 360-degree Cameras
This paper proposes the use of smart 360-degree cameras for activity monitoring. By exploiting the geometric properties of these cameras and adopting off-the-shelf tracking algorithms adapted to equirectangular images, this paper shows how simple it becomes to deploy a camera network and to detect the presence of pedestrians in predefined regions of interest with minimal information on the camera, namely its height. The paper further shows that smart 360-degree cameras can enhance motion understanding in the environment and proposes a simple method to estimate the heatmap of the scene to highlight regions where pedestrians are more often present. Quantitative and qualitative results demons…
3D Object Modeling by Sharing Visual Attributes across Poses and Scales
Scene parsing aims at understanding a scene and the arrangements of the objects in it. While this is a task human beings are pretty good at [7], a machine needs to: recognize the kind of scene (indoor vs. outdoor, bedroom vs. living room, etc.) [4], detect and recognize 3D objects across multiple poses and scales [8, 5], infer the geometrical arrangement of the objects in the scene [2, 1], etc. In the proposed framework, a 3D object is modeled as a graph. Each node in the graph represents a visual attribute automatically discovered by considering features that are consistently and repeatedly present across different poses and scales. Such visual attributes are different from “parts” [5], whic…
A data association approach to detect and organize people in personal photo collections
In this paper we present a method to automatically segment a photo sequence in groups containing the same persons. Many methods in the literature accomplish this task by adopting clustering techniques. We model the problem as the search for probable associations between faces detected in subsequent photos considering the mutual exclusivity constraint: a person cannot appear twice in a photo, nor can two faces in the same photo be assigned to the same group. Associations are found considering face and clothing descriptions. In particular, a two level architecture has been adopted: at the first level, associations are computed within meaningful temporal windows (situations); at the …
Particle Filtering for Tracking in 360 Degrees Videos Using Virtual PTZ Cameras
360 degrees cameras are devices able to record spherical images of the environment. Such images can be used to generate views of the scene by projecting the spherical surface onto planes tangent to the sphere. Each of these views can be considered as the output of a virtual PTZ (vPTZ) camera with specific pan, tilt and zoom parameters. This paper proposes to formulate the visual tracking problem as the one of selecting, at each time, the vPTZ camera to foveate on the target from the unlimited set of simultaneously generated vPTZ camera views. Assuming that the selected vPTZ camera is a stochastic variable, the paper proposes to model the posterior distribution of the underlying stochastic p…
Enabling Technologies on Hybrid Camera Networks for Behavioral Analysis of Unattended Indoor Environments and Their Surroundings
This paper presents a layered network architecture and the enabling technologies for accomplishing vision-based behavioral analysis of unattended environments. Specifically, the vision network covers both the attended environment and its surroundings by means of multi-modal cameras. The layer overlooking the surroundings is deployed outdoors and tracks people, monitoring entrance/exit points. It recovers the geometry of the site under surveillance and communicates people's positions to a higher-level layer. The layer monitoring the unattended environment undertakes similar goals, with the addition of maintaining a global mosaic of the observed scene for further understanding. Moreover, it merges …
Real-Time Object Detection in Embedded Video Surveillance Systems
In this paper we report a new method to detect both moving objects and new stationary objects in video sequences. On the basis of temporal considerations, we classify pixels into three classes: background, midground and foreground, to distinguish between long-term, medium-term and short-term changes. The algorithm has been implemented on a hardware platform with limited resources and could be used in a wider system such as a wireless sensor network. Particular care has been taken in implementing the algorithm so that the limited available resources are used efficiently. Experiments have been conducted on publicly available datasets and performance measures are reported.
Deep Motion Model for Pedestrian Tracking in 360 Degrees Videos
This paper proposes a deep convolutional neural network (CNN) for pedestrian tracking in 360° videos based on the target’s motion. The tracking algorithm takes advantage of a virtual Pan-Tilt-Zoom (vPTZ) camera simulated by means of the 360° video. The CNN takes as input a motion image, i.e. the difference of two images taken by the vPTZ camera at different times with the same pan, tilt and zoom parameters. The CNN predicts the vPTZ camera parameter adjustments required to keep the target at the center of the vPTZ camera view. Experiments on a publicly available dataset performed in cross-validation demonstrate that the learned motion model generalizes, and that the proposed tracking algo…
A Decisional Multi-Agent Framework for Automatic Supply Chain Arrangement
In this work, a multi-agent system (MAS) for supply chain dynamic configuration is proposed. The brain of each agent is composed of a Bayesian Decision Network (BDN); this choice allows the agent to make the best decisions by estimating the benefits and potential risks of different strategies, and by analyzing and managing uncertain information about the collaborating companies. Each agent collects information about customers' orders and current market prices, and analyzes previous experiences of collaborations with trading partners. The agent therefore performs probabilistic inferential reasoning to filter information modeled in its knowledge base in order to achieve the best performance in the sup…
Decoding Children's Social Behavior
We introduce a new problem domain for activity recognition: the analysis of children's social and communicative behaviors based on video and audio data. We specifically target interactions between children aged 1-2 years and an adult. Such interactions arise naturally in the diagnosis and treatment of developmental disorders such as autism. We introduce a new publicly-available dataset containing over 160 sessions of a 3-5 minute child-adult interaction. In each session, the adult examiner followed a semi-structured play interaction protocol which was designed to elicit a broad range of social behaviors. We identify the key technical challenges in analyzing these behaviors, and describe met…
Ensemble of Hankel Matrices for Face Emotion Recognition
In this paper, a face emotion is considered as the result of the composition of multiple concurrent signals, each corresponding to the movements of a specific facial muscle. These concurrent signals are represented by means of a set of multi-scale appearance features that might be correlated with one or more concurrent signals. The extraction of these appearance features from a sequence of face images yields a set of time series. This paper proposes to use the dynamics regulating each appearance feature time series to discriminate among different face emotions. To this purpose, an ensemble of Hankel matrices corresponding to the extracted time series is used for emotion classification withi…
Entropy-based Localization of Textured Regions
Appearance description is a relevant field in computer vision that enables object recognition in domains such as re-identification, retrieval and classification. Important cues to describe appearance are colors and textures. However, in real cases, texture detection is challenging due to occlusions and to deformations of the clothing as the person's pose changes. Moreover, in some cases, the processed images have a low resolution and state-of-the-art texture analysis methods are not appropriate. In this paper, we deal with the problem of localizing real textures for clothing description purposes, such as stripes and/or complex patterns. Our method uses the entropy of primitive distribu…
Pedestrian Tracking in 360 Video by Virtual PTZ Cameras
Since the data acquired by a PTZ camera change while adjusting the pan, tilt and zoom parameters, the results of tracking algorithms are difficult to reproduce; such difficulty limits the development and the comparison of tracking algorithms with PTZ cameras. The recently introduced 360-degree cameras acquire spherical views of the environment, generally stored as equirectangular images. Each pixel of an equirectangular image corresponds to a point on the spherical surface. A gnomonic projection can be used to project the points on the spherical surface onto a plane tangent to the sphere. Such a tangent plane can be interpreted as the image plane of a virtual PTZ camera oriented towards th…
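The geometry described above can be sketched with the standard inverse gnomonic projection, which maps a point on the tangent plane back to the sphere and then to equirectangular pixel coordinates. The sketch assumes a unit sphere, a tangent point given as longitude `lon0` (pan) and latitude `lat0` (tilt) in radians, a plane axis `y` pointing north, and an image whose row index grows downward; these conventions and the function names are choices made for the example, not taken from the paper.

```python
from math import asin, atan, atan2, cos, hypot, pi, sin


def tangent_plane_to_sphere(x, y, lon0, lat0):
    """Inverse gnomonic projection: plane point (x, y) -> (lon, lat)."""
    rho = hypot(x, y)
    if rho == 0.0:
        return lon0, lat0  # the tangent point itself
    c = atan(rho)          # angular distance from the tangent point
    s = cos(c) * sin(lat0) + y * sin(c) * cos(lat0) / rho
    lat = asin(max(-1.0, min(1.0, s)))  # clamp against rounding errors
    lon = lon0 + atan2(x * sin(c),
                       rho * cos(lat0) * cos(c) - y * sin(lat0) * sin(c))
    return lon, lat


def sphere_to_equirectangular(lon, lat, width, height):
    """Map (lon, lat) in radians to equirectangular pixel coordinates."""
    lon = (lon + pi) % (2 * pi) - pi    # wrap longitude into [-pi, pi)
    u = (lon / (2 * pi) + 0.5) * width
    v = (0.5 - lat / pi) * height
    return u, v
```

Sampling the equirectangular image at the pixels obtained for a regular grid of `(x, y)` points would render one virtual view; changing `lon0` and `lat0` corresponds to panning and tilting, and scaling the grid extent corresponds to zooming.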
A Dataset of Annotated Omnidirectional Videos for Distancing Applications
Omnidirectional (or 360°) cameras are acquisition devices that, in the next few years, could have a big impact on video surveillance applications, research, and industry, as they can record a spherical view of a whole environment from every perspective. This paper presents two new contributions to the research community: the CVIP360 dataset, an annotated dataset of 360° videos for distancing applications, and a new method to estimate the distances of objects in a scene from a single 360° image. The CVIP360 dataset includes 16 videos acquired outdoors and indoors, annotated by adding information about the pedestrians in the scene (bounding boxes) and the distances to the camera of some point…
Hankelet-based dynamical systems modeling for 3D action recognition
This paper proposes to model an action as the output of a sequence of atomic Linear Time Invariant (LTI) systems. The sequence of LTI systems generating the action is modeled as a Markov chain, where a Hidden Markov Model (HMM) is used to model the transition from one atomic LTI system to another. In turn, the LTI systems are represented in terms of their Hankel matrices. For classification purposes, the parameters of a set of HMMs (one for each action class) are learned via a discriminative approach. This work proposes a novel method to learn the atomic LTI systems from training data, and analyzes in detail the action representation in terms of a sequence of Hankel matrices. Extensive eval…
Real-time estimation of geometrical transformation between views in distributed smart-cameras systems
In this paper, we present a method to automatically estimate the geometric relations among the different views of cameras with partially overlapping fields of view in a wireless video-surveillance system. The method uses the locations of the detected moving objects visible at the same time in two or more views. The correspondences among objects are found by comparing their appearance models based on dominant colour descriptors, while the geometric transformations are computed iteratively and may be used to solve the consistent labelling problem. As a significant part of the processing is performed on the smart cameras, the method has been conceived by taking into account the limited resources…
Joint Alignment and Modeling of Correlated Behavior Streams
The Variable Time-Shift Hidden Markov Model (VTS-HMM) is proposed for learning and modeling pairs of correlated streams. Unlike previous coupled models for time series, the VTS-HMM accounts for varying time shifts between correlated events in pairs of streams having different properties. The VTS-HMM is learned on a set of pairs of unaligned streams and, thus, learning entails simultaneous estimation of the varying time shifts and of the parameters of the model. The formulation is demonstrated in the analysis of videos of dyadic social interactions between children and adults in the Multimodal Dyadic Behavior Dataset (MMDB). In dyadic social interactions, an agent starts an interaction …
Path Modeling and Retrieval in Distributed Video Surveillance Databases
We propose a framework for querying a distributed database of video surveillance data in order to retrieve a set of likely paths of a person moving in the area under surveillance. In our framework, each camera of the surveillance system locally processes the data and stores video sequences in a storage unit and the metadata for each detected person in the distributed database. A pedestrian’s path is formulated as a dynamic Bayesian network (DBN) to model the dependencies between subsequent observations of the person as he makes his way through the camera network. We propose a tool by which the analyst can pose queries about where a certain person appeared while moving in the site duri…
A method to reduce the FP/imm number through CC and MLO views comparison in mammographic images
In this paper we propose a method to reduce the FP/imm number through CC and MLO mammographic views comparison of the same patient. The proposed solution uses the symmetry properties of the breast to compute a geometric transformation that makes it possible to represent the two images in comparable coordinate systems. Through this method, potential pathological ROIs of one of the projections are correlated with the ROIs in the second view. To show the effectiveness of the result we apply the method on a dataset composed of 112 pairs of pathological images. Experiments show that the method enables a reduction by up to 70% of the FP/imm number detected after the classification step.
Boosting Hankel matrices for face emotion recognition and pain detection
Highlights: Dynamics of face expression descriptors are modeled for emotion recognition. A set of Hankel matrices is built upon several multi-scale face representations. Boosting and random subspace projection are used for dynamics selection. Dynamics of Haar-like features and Gabor Energies are compared. Fine-grained dynamics of subtle expressions can be modeled at small spatial scales. Studies in psychology have shown that the dynamics of emotional expressions play an important role in face emotion recognition in humans. Motivated by these studies, in this paper the dynamics of face expressions are modeled and used for automatic emotion recognition and pain detection. Given a temporal sequence o…
Integrating computer vision techniques and wireless sensor networks in video surveillance systems
Nowadays, video-surveillance systems are essential tools to monitor sites and to guarantee the safety of people: automatic detection of moving objects in the scene and recognition of dangerous events are particularly interesting. Our project aims to realize tools and techniques for video surveillance systems in outdoor environments to detect people automatically and in real time, without the direct control of a human operator. The reference framework consists of distributed stationary cameras coordinated with sensor networks. In particular, wireless sensors are used to sense characteristic quantities of the monitored site, such as variations in temperature, humidity, noise, vibrations, and so o…
Using Hankel matrices for dynamics-based facial emotion recognition and pain detection
This paper proposes a new approach to model the temporal dynamics of a sequence of facial expressions. To this purpose, a sequence of Face Image Descriptors (FID) is regarded as the output of a Linear Time Invariant (LTI) system. The temporal dynamics of such sequence of descriptors are represented by means of a Hankel matrix. The paper presents different strategies to compute dynamics-based representation of a sequence of FID, and reports classification accuracy values of the proposed representations within different standard classification frameworks. The representations have been validated in two very challenging application domains: emotion recognition and pain detection. Experiments on…
Multi-modal non-rigid registration of medical images based on mutual information maximization
In this paper, a new multi-modal non-rigid registration technique for medical images is presented. First, the registration problem is outlined and some of the most common approaches are reported; then, the proposed algorithm is presented. The proposed technique is based on mutual information maximization and computes a deformation field through a suitable globally smoothed affine piecewise transformation. The algorithm has been conceived with particular attention to computational load and accuracy of results. Experimental results involving intra-patient, inter-patient and atlas images on brain CT and MR (T1, T2 and PD modalities) are reported.
Iterative Multiple Bounding-Box Refinements for Visual Tracking
Single-object visual tracking aims at locating a target in each video frame by predicting the bounding box of the object. Recent approaches have adopted iterative procedures to gradually refine the bounding box and locate the target in the image. In such approaches, the deep model takes as input the image patch corresponding to the currently estimated target bounding box, and provides as output the probability associated with each of the possible bounding box refinements, generally defined as a discrete set of linear transformations of the bounding box center and size. At each iteration, only one transformation is applied, and supervised training of the model may introduce an inherent ambig…
Concurrent photo sequence organization
Personal photo album organization is a highly demanding domain where advanced tools are required to manage large photo collections. In contrast to many previous works that try to solve the problem of organizing a single user's photo sequence, we present a new technique to address the concurrent photo sequence organization problem, that is, the problem of organizing multiple photo sequences taken during the same event. Given a set of sequences acquired at the same place during the same temporal window by several users using different cameras, our framework is intended to capture the evolution of the event and to group photos based on temporal proximity and visual content. The method automati…
Hankelet-based action classification for motor intention recognition
Powered lower-limb prostheses require a natural and easy-to-use interface for communicating the amputee’s motor intention in order to select the appropriate motor program in any given context, or simply to switch from active (powered) to passive mode of functioning. To be widely accepted, such an interface should not put additional cognitive load on the end-user, and it should be reliable and minimally invasive. In this paper we present one such interface based on a robust method for detecting and recognizing motor actions from a low-cost wearable sensor network mounted on a sound leg providing inertial (accelerometer, gyrometer and magnetometer) data in real-time. We assume that the sensor…
Tracking your detector performance: How to grow an effective training set in tracking-by-detection methods
In many tracking-by-detection approaches, a self-learning strategy is adopted to augment the training set with new positive and negative instances, and to refine the classifier weights. Previous works focus mainly on the learning algorithm and assume the detector is never wrong while classifying samples at the current frame; the most confident sample is chosen as the target, and the training set is augmented with samples selected in its surrounding area. A wrong choice of such samples may degrade the classifier parameters and cause drifting during tracking. In this paper, the focus is on how samples are chosen while retraining the classifier. A particle filtering framework is used to infer …
A Novel Time Series Kernel for Sequences Generated by LTI Systems
The recent introduction of Hankelets to describe time series relies on the assumption that the time series has been generated by a vector autoregressive model (VAR) of order p. The success of Hankelet-based time series representations prevalently in nearest neighbor classifiers poses questions about if and how this representation can be used in kernel machines without the usual adoption of mid-level representations (such as codebook-based representations). It is also of interest to investigate how this representation relates to probabilistic approaches for time series modeling, and which characteristics of the VAR model a Hankelet can capture. This paper aims at filling these gaps by: deriv…
Semiautomatic Behavioral Change-Point Detection: A Case Study Analyzing Children Interactions With a Social Agent
The study of human behaviors in cognitive sciences provides clues to understand and describe people’s personal and interpersonal functioning. In particular, the temporal analysis of behavioral dynamics can be a powerful tool to reveal events, correlations and causalities but also to discover abnormal behaviors. However, the annotation of these dynamics can be expensive in terms of temporal and human resources. To tackle this challenge, this paper proposes a methodology to semi-automatically annotate behavioral data. Behavioral dynamics can be expressed as sequences of simple dynamical processes: transitions between such processes are generally known as change-points. This paper describes th…
On the use of Deep Reinforcement Learning for Visual Tracking: a Survey
This paper aims at highlighting cutting-edge research results in the field of visual tracking by deep reinforcement learning. Deep reinforcement learning (DRL) is an emerging area combining recent progress in deep and reinforcement learning. It is showing interesting results in the computer vision field and, recently, it has been applied to the visual tracking problem, leading to the rapid development of novel tracking strategies. After providing an introduction to reinforcement learning, this paper compares recent visual tracking approaches based on deep reinforcement learning. Analysis of the state of the art suggests that reinforcement learning allows modeling varying parts of the tracki…
I-MALL: An Effective Framework for Personalized Visits. Improving the Customer Experience in Stores
In this paper we present I-MALL, an ICT hardware and software infrastructure that enables the management of services related to places such as shopping malls, showrooms, and conferences held in dedicated facilities. I-MALL offers a network of services that perform customer behavior analysis through computer vision and provide personalized recommendations made available on digital signage terminals. The user can also interact with a social robot. Recommendations are inferred on the basis of the profile of interests computed by the system analysing the history of the customer visit and his/her behavior including information from his/her appearance, the route taken inside the facility, as well…