Professor Subhas Chandra Mukhopadhyay
Exeley Inc. (New York)
Subject: Computational Science & Engineering, Engineering, Electrical & Electronic
eISSN: 1178-5608
S. Hernández* / P. Sallis
Keywords: Visual tracking, Bernoulli filter.
Citation Information: International Journal on Smart Sensing and Intelligent Systems, Volume 13, Issue 1, Pages 1-8, DOI: https://doi.org/10.21307/ijssis-2020-001
License: CC BY-NC-ND 4.0
Published Online: 01-February-2020
The efficiency and robustness of modern visual tracking systems are largely dependent on the object detection system at hand. Bernoulli and Multi-Bernoulli filters have been proposed for visual tracking without explicit detections (image observations). However, these previous approaches do not fully exploit discriminative features for tracking. In this paper, we propose a novel Bernoulli filter with determinantal point processes observations. The proposed observation model can select groups of detections with high detection scores and low correlation among the observed features; thus achieving a robust filter.
Visual tracking is a challenging computer vision task with applications in human-computer interaction, video surveillance and crowd monitoring among others. Modern visual tracking systems may use complex object detection schemes for estimating the current state of a target in any particular video frame. However, this approach does not fully exploit the temporal structure of the estimation problem. Visual tracking can be also thought of as a dynamic model with observed features and latent states representing the position/velocity of an object (Maggio and Cavallaro, 2011). In this context, the generative model for visual tracking requires not only the correct specification of the model and its parameters but also the ability to capture the variations of the system (Wang et al., 2015).
The Bernoulli filter is a powerful algorithm that allows objects to appear and disappear, using extracted features from the image as observations (Vo et al., 2010). Similar approaches for visual tracking have been proposed (also known in the literature as Track-Before-Detect). Nevertheless, these methods rely on unreliable background subtraction operations or the likelihood function being in a separable form (Hoseinnezhad et al., 2012, 2013).
Current state-of-the-art trackers are based on either correlation filters (Bolme et al., 2010), deformable parts models (Hare et al., 2016) or convolutional neural networks (Li et al., 2018). These trackers learn a discriminative model from a single frame and then update the model using new frames. Furthermore, tracking performance can be increased when using more discriminative features such as HOG (Henriques et al., 2015; Solis Montero et al., 2015; Xu et al., 2019). On the other hand, even when Bernoulli filters have demonstrated being useful models for tracking in complex scenarios, it is still hard to rely on such features for increasing their performance.
The Bernoulli filter is a specialized version of the PHD filter (Mahler, 2003), with the focus on single-target tracking. While the original PHD filter is based on a Poisson point process, several extensions have been proposed to cope with non-Poisson distributions. In particular, the Cardinalized PHD filter allows estimating the number of targets using arbitrary distributions and provides improved estimates (Mahler, 2007). The multi-Bernoulli and Poisson multi-Bernoulli mixture filters also allow approximating the cardinality distribution and are especially well suited when the mean of the multi-target posterior is higher than the variance (García-Fernández et al., 2018). All of these methods rely on first-order or second-order moments but assume that targets behave independently of each other. Therefore, the authors in Privault and Teoh (2019) propose a second-order filter that accounts for interactions between the targets. The method is based on determinantal point processes (DPP), which take into consideration the correlation among the targets through a kernel function. In Jorquera et al. (2017), the authors propose a determinantal point process for pruning the components of the Gaussian mixture PHD filter. More recently, the authors in Jorquera et al. (2019) compared the PHD filter using determinantal point process observations with other methods for visual multi-target tracking.
The contributions of this paper are twofold. First, the third section provides introductory notions of the Bernoulli filter and then we derive a novel Bernoulli filter using determinantal point process observations (B-DPP filter) for single target tracking in the fourth section. Second, in the fifth section we derive a Sequential Monte Carlo implementation of the B-DPP filter using a truncated likelihood, which can outperform other discriminative trackers in several scenarios.
A point process is a random pattern of points in a possibly multi-dimensional space (Kingman, 1993). A simple point process can be defined in one dimension, usually time, and can be used to describe the random times at which events occur, with no coincident points.
The problem of performing joint detection and estimation of multiple objects has a natural interpretation as a dynamic point process, where the stochastic intensity of the model is a space-time function λ(x), where x ∈ℝ^{d} denotes the state space of the target. If we let B = B_{1}∪ B_{2} ∪ … ∪ B_{k} represent the union of disjoint video frames B_{i}, the corresponding number of objects on each image can be written as N(B _{1}), N(B_{2}), …, N(B_{k}). The Bernoulli point process for a single object that can randomly appear or disappear takes the form:
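In the standard form (see, e.g., Ristic et al., 2013), a Bernoulli point process with probability of existence q and spatial density s(x) has density:

```latex
p(X) =
  \begin{cases}
    1 - q, & X = \emptyset, \\
    q \, s(x), & X = \{x\}.
  \end{cases}
```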
In recent years, deep learning approaches have demonstrated outstanding performance on several visual tracking benchmarks (Kristan et al., 2019). These trackers are mostly based on features extracted from a convolutional neural network and an objective loss that minimizes a localization error (Li et al., 2018). However, the detection process is not perfect, and false positives and false negatives are encountered after ranking the top proposals from the convolutional features.
In order to develop a stochastic approach for the single-object observation model, a discrete DPP can be used to capture probabilistic relationships using a kernel matrix $K: \mathcal{Z} \times \mathcal{Z} \mapsto \mathbb{R}$ that measures the similarity among different detections (Lee et al., 2016). Therefore, instead of considering independent detections in a particular frame, the DPP likelihood specifies the joint probability over all $2^{n}$ subsets of $\mathcal{Z}$ with distribution:
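Assuming the L-ensemble convention of Kulesza and Taskar (2011), the probability of observing a particular subset $A \subseteq \mathcal{Z}$ can be written as:

```latex
P(Z = A) = \frac{\det(K_{A})}{\det(K + I)},
```

where $K_{A}$ denotes the submatrix of $K$ indexed by the elements of $A$ and $I$ is the identity matrix.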
In this case, a model for detection and estimation of multiple objects can be achieved by the conditional expectation of the posterior point process (random finite set) under transformations (Ristic et al., 2013).
Let X_{k} = {x} be a Bernoulli point process and Z_{k} = {z_{1}, z_{2}, …, z_{m}} a DPP observed at frame k. The result of superposition, translation and thinning transformations is also a Bernoulli point process X_{k} ∼ p(X_{k}|X_{k−1}) (Kingman, 1993). The predicted point process can be written as the linear superposition of a π_{s}-thinned point process with Markov translation f(x|x′) and a π_{b} Bernoulli birth process. The predicted expected number of targets N_{k|k−1} for a single target with probability of survival π_{s}(x) and spontaneous birth can be written as:
The filtering density of a Bernoulli point process is completely specified by the pair (p_{k|k−1}, q_{k|k−1}), which is obtained as follows.
Using Equation (5), the probability of existence q_{k|k−1} can be written as:
And the probability of the predicted Bernoulli point process:
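For reference, the standard Bernoulli filter prediction equations (Ristic et al., 2013), written in the notation above with $b_{k}(x)$ denoting the birth density (a naming assumption), are:

```latex
\begin{aligned}
q_{k|k-1} &= \pi_{b}\,(1 - q_{k-1}) + q_{k-1} \int \pi_{s}(x')\, p_{k-1}(x')\, \mathrm{d}x', \\
p_{k|k-1}(x) &= \frac{\pi_{b}\,(1 - q_{k-1})\, b_{k}(x)
  + q_{k-1} \int \pi_{s}(x')\, f(x|x')\, p_{k-1}(x')\, \mathrm{d}x'}{q_{k|k-1}}.
\end{aligned}
```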
If we let Z_{k} be the observations that contain both false detections and target originated measurements, the update equation considers the probability of observing the target with probability of detection π_{d} under clutter (e.g. false positives). From (Mahler, 2003, 2007), the multi-target likelihood function for the standard measurement model (Poisson distributed clutter with density κ_{p}(Z_{k}) = e ^{−λ }∏_{i}λf_{c}(z_{i}) and Bernoulli probability of detection π_{d}) can be written as:
The likelihood term in Equation (8) considers all possible locations and location-to-track associations σ, so most of the terms will be canceled. The likelihood term becomes:
The Bayes update equation takes the form:
The denominator of Equation (10) can be written as:
The updated binomial point process can be derived as follows:
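For reference, under the standard measurement model the updated existence probability and spatial density take the following well-known form (Ristic et al., 2013):

```latex
\begin{aligned}
\Delta_{k} &= \langle \pi_{d},\, p_{k|k-1} \rangle
  - \sum_{z \in Z_{k}} \frac{\langle \pi_{d}\, g(z|\cdot),\, p_{k|k-1} \rangle}{\lambda f_{c}(z)}, \\
q_{k} &= q_{k|k-1}\, \frac{1 - \Delta_{k}}{1 - q_{k|k-1}\, \Delta_{k}}, \\
p_{k}(x) &\propto \left[ 1 - \pi_{d}(x) + \pi_{d}(x) \sum_{z \in Z_{k}} \frac{g(z|x)}{\lambda f_{c}(z)} \right] p_{k|k-1}(x).
\end{aligned}
```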
Let X_{k} = {x} be a Bernoulli point process and Z_{k} = {z_{1}, z_{2}, …, z_{m}} a DPP observed at frame k. The result of superposition, translation and thinning transformations is also a Bernoulli point process X_{k} ∼ p(X_{k}|X_{k−1}) (Ristic et al., 2013). The predicted point process can be written as the linear superposition of a π_{s}-thinned point process with Markov translation f(x|x′) and a π_{b} Bernoulli birth process. In order to measure the quality of the observations, we must introduce a random variable L such that p(L|Z) ∝ det(L(Z)), where L(Z) is a positive definite kernel matrix that depends on the observed features Z. The L(Z) kernel can be written as a Gram matrix:
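Consistent with the quality-diversity decomposition of Kulesza and Taskar, the Gram-matrix entries can be written as:

```latex
L_{ij}(Z) = g_{x}(z_{i})\, S(z_{i}, z_{j})\, g_{x}(z_{j}),
```

where $g_{x}(z_{i})$ measures the quality of detection $z_{i}$ and $S(z_{i}, z_{j})$ the similarity between detections.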
The function g_{x}(z_{i}) = ∑_{c}p(z_{i}|c)p(c|x) is used to model the quality of the item z_{i} and S(Z) the diversity of the set Z. If we let W be a subset of detections arising from the target (Reuter et al., 2013):
The DPP Z can be treated as the union of two independent sets Z = C ∪ W, where C = {c_{1}, …, c_{m}} represents clutter. The clutter density becomes:
The likelihood function for the standard measurement model using determinantal observations becomes:
Now, we want to derive the posterior distribution for Bernoulli point process given DPP observations:
The updated binomial point process can now be derived as follows:
In practice, it is difficult to store and compute the power set with all possible configurations of Z_{k} in the likelihood term (see Equation (20)). An approximation can be constructed by truncating the likelihood and focusing only on the most likely elements. Let Z_{k}^{*} = arg max_{Z⊆Z_{k}} η(Z|{x}) be the subset of Z_{k} whose elements are detections arising from the target. The likelihood becomes:
DPPs have been proposed in the literature as an alternative to other object refinement techniques such as non-maximum suppression (Lee et al., 2016). These methods operate over object proposals and eliminate redundant detections. For DPPs, mode finding can be tackled using the following greedy algorithm (Kulesza and Taskar, 2011):
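A minimal sketch of such a greedy mode-finding procedure, assuming the kernel is available as a NumPy array (the function name and the "stop when the determinant no longer increases" rule are illustrative, not necessarily the exact variant used in the paper):

```python
import numpy as np

def greedy_dpp_mode(L, max_items=None):
    """Greedy mode finding for a DPP with kernel matrix L.

    Starting from the empty set, repeatedly add the detection whose
    inclusion yields the largest determinant det(L_Y), and stop as soon
    as no candidate increases the determinant (the joint probability).
    """
    n = L.shape[0]
    selected = []
    best_det = 1.0                      # determinant of the empty selection
    remaining = list(range(n))
    budget = n if max_items is None else max_items
    while remaining and len(selected) < budget:
        # Determinant of L restricted to selected + {i} for each candidate i.
        gains = [(np.linalg.det(L[np.ix_(selected + [i], selected + [i])]), i)
                 for i in remaining]
        gain, best_i = max(gains)
        if gain <= best_det:
            break                       # no candidate improves the mode
        selected.append(best_i)
        best_det = gain
        remaining.remove(best_i)
    return selected
```

With a quality-diversity kernel, the first selected item is the highest-quality detection, and later items trade quality against redundancy with the already-selected set.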
In turn, by using the truncated likelihood from Equation (25), the Sequential Monte Carlo algorithm for the Bernoulli filter (Ristic, 2013) can be used to estimate the single-target posterior.
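As an illustration of the general scheme (not the authors' C++/OpenCV implementation), a one-dimensional SMC Bernoulli filter step might look as follows. All parameter names and the constant clutter term `lam_c` are assumptions, and the full standard likelihood is used in place of the truncated DPP likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)

def bernoulli_smc_step(particles, weights, q, Z, likelihood,
                       pi_s=0.99, pi_b=0.01, pi_d=0.9, lam_c=1.0,
                       motion_std=1.0):
    """One predict/update cycle of a particle Bernoulli filter (sketch).

    particles/weights approximate the spatial density p(x); q is the
    probability of existence; Z holds the measurements of the current
    frame; likelihood(z, x) evaluates g(z|x). lam_c stands for the
    clutter term lambda * f_c(z), assumed constant. Birth particles
    are omitted for brevity.
    """
    # --- predict: Bernoulli existence prediction and random-walk motion.
    q_pred = pi_b * (1.0 - q) + pi_s * q
    moved = particles + rng.normal(0.0, motion_std, size=particles.shape)
    w = weights / weights.sum()
    # --- update: standard measurement model with Poisson clutter.
    g = np.array([[likelihood(z, x) for x in moved] for z in Z])
    psi = 1.0 - pi_d + pi_d * (g.sum(axis=0) / lam_c if len(Z) else 0.0)
    delta = np.sum(w * (1.0 - psi))     # Delta_k = <1 - psi, p_{k|k-1}>
    q_new = q_pred * (1.0 - delta) / (1.0 - q_pred * delta)
    w_new = w * psi
    w_new = w_new / w_new.sum()
    # --- resample back to equal weights.
    idx = rng.choice(len(moved), size=len(moved), p=w_new)
    return moved[idx], np.full(len(moved), 1.0 / len(moved)), q_new
```

A measurement close to the particle cloud raises the probability of existence, while an empty measurement set lowers it, matching the qualitative behavior of the Bernoulli update.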
In order to demonstrate the advantages of the proposed model updating approach over other discriminative approaches, we evaluate the tracking results on six challenging video sequences from the Visual Object Tracking (VOT) 2014 data set^{1}. The proposed SMC implementation uses local binary patterns (LBP) as observed features and a simple observation model $p(z_{i}|c)\propto \exp(-\frac{D_{k}^{2}}{2\sigma_{o}^{2}})$, with D_{k} = dist[z_{c}, z_{k}] and z_{c} being a reference LBP histogram (Czyz et al., 2007). The state x_{k} is configured as a 4-dimensional rectangle comprising the top-left position, width and height of the target. The dynamic model uses a random walk, and the parameters of the model are held fixed for all sequences. The B-DPP filter is implemented in C++ using the OpenCV library. The parameters for the B-DPP filter are determined empirically and shown in Table 1.
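A sketch of this observation model, taking D_{k} to be the Bhattacharyya distance between normalized histograms (the specific histogram distance is an assumption; Czyz et al. (2007) use the Bhattacharyya distance for color histograms):

```python
import numpy as np

def observation_likelihood(z, z_ref, sigma_o=0.2):
    """Likelihood of an observed LBP histogram z given a reference z_ref.

    D_k is computed as the Bhattacharyya distance between the
    normalized histograms; sigma_o controls the sharpness of the model.
    """
    p = z / z.sum()
    q = z_ref / z_ref.sum()
    bc = np.sum(np.sqrt(p * q))          # Bhattacharyya coefficient
    d = np.sqrt(max(1.0 - bc, 0.0))      # Bhattacharyya distance D_k
    return np.exp(-d ** 2 / (2.0 * sigma_o ** 2))
```

Identical histograms give a likelihood of 1, and the likelihood decays smoothly as the observed histogram drifts away from the reference.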
The parameter settings for the Greedy Mode Finding algorithm are described in Table 2.
The sequence jogging is a challenging example containing full occlusions, rotations and background clutter. Figure 1 shows one frame of the sequence and the estimates using the proposed approach and other state-of-the-art methods.
The Bernoulli DPP filter maintains a balance between the observed features and the quality of the observations (see Figure 1). The observation model uses a simple histogram comparison and no template update is performed, so the model is not robust to object deformation or rotation. Even so, as seen in Figure 1, the Bernoulli-DPP tracker achieves good performance in cases such as full occlusion, where the other discriminative tracking methods fail. Performance is measured using the widely used precision and success metrics^{2}.
The precision metric describes the percentage of frames whose center location error is below a given threshold. Table 3 shows the overall precision metric averaged over all sequences, on five different runs, for each of the algorithms.
The success measure accounts for bounding-box overlap. Table 4 shows the number of successful frames, i.e. frames whose overlap is above a given threshold, averaged over the sequences on five different runs. Quantitative analysis shows improved performance for the proposed approach when compared to the discriminative trackers on six different video sequences.
Figure 2 shows the precision metric against the location error threshold for all six tested sequences. The red line indicates the best performing method among the four different algorithms. Since the bolt and jogging sequences contain background clutter (the background near the target has a similar appearance to the target), the proposed Bernoulli DPP tracker reduces redundant observations and improves precision.
Figure 3 shows the ratio of frames whose tracked box overlaps the ground-truth box by more than a threshold. The success metric can be associated with the tracker's ability to maintain long-term tracks. Since the Bernoulli DPP filter accounts for missed detections, the proposed approach improves the area under the curve of the success metric in 67% of the tested sequences.
In this paper, a novel algorithm for joint detection and tracking of a single object in video has been presented. The proposed approach takes into account the detection score and the similarity of the observed features. Then, a Bayesian filter using a Bernoulli point process estimates the state of the target from a diverse subset of object proposals. Experimental evaluations show that the results are comparable to other state-of-the-art techniques for visual tracking in only 6 of the 25 sequences of the data set. In this paper, we only considered a simple observation model (distance to a reference LBP histogram), which might hinder the performance of this approach on the overall data set. This observation model is not robust to scale and rotation changes, and no model updating strategies are considered in this paper. Nevertheless, our model is expected to improve its performance when using a more complex observation model (such as deep learning features), model updating and ensemble post-processing techniques for combining the output of different tracking schemes.