Visual Tracking Algorithms (5): MDNet: Learning Multi-Domain Convolutional Neural Networks for Visual Tracking

MDNet is a deep-learning-based algorithm designed specifically for visual tracking. It uses a multi-domain network structure to learn what tracking targets across different videos have in common, and improves tracking accuracy with long-term and short-term online update strategies and hard negative mining.



Original post: https://zhuanlan.zhihu.com/p/25312850
Paper link: here
Code link: github.com/HyeonseobNam

MDNet was the winner of VOT 2015. The paper actually came out at the end of 2015, and this is already my third time reading it. MDNet comes from the POSTECH team in Korea, the same group behind TCNN and CNN-SVM.

By the end of 2015, the visual tracking field, following object detection, had started bringing in CNNs. Most algorithms, however, only used networks pretrained on large datasets, such as VGG, as feature extractors, and the results did show that deep CNN features brought a substantial improvement to tracking. Designing a dedicated network for tracking is a direction anyone could think of, and the POSTECH team in Korea did exactly that with MDNet. Why, then, did so few people design their own networks? Because training data is a problem, and because MDNet's performance was nearly unbeatable at the time. Let's see how it works.

1. Motivations

  • For the tracking problem, it is more reasonable for the CNN to be trained on video tracking data. Although tracking targets belong to many different categories, they should share some commonality, and that is what the network needs to learn.
  • Training on tracking data is hard: the same object may be the target in one sequence but background in another, targets differ considerably from sequence to sequence, and they go through all kinds of challenges such as occlusion and deformation.
  • Most existing pretrained networks are built for tasks such as detection, classification, and segmentation, and they are large because they must distinguish many object categories. In tracking, the network only needs to separate two classes, target and background, and the target is usually relatively small, so such a large network is unnecessary and only adds computational burden.
To address these three points, the authors propose the Multi-Domain Network, a multi-domain learning architecture that learns the commonality shared by tracking targets.

2. Multi-Domain Network (MDNet)

2.1 Network Architecture

Let's first look at the MDNet network architecture:

  • Input: the network input is a 107x107 bounding box crop; this size is chosen so that the conv3 layer produces a 3x3 feature map.
  • Convolutional layers: conv1-conv3 are taken from the VGG-M network [1], with only the input size changed.
  • Fully connected layers: the next two fully connected layers, fc4 and fc5, each have 512 output units with ReLUs and Dropout. fc6 is a binary classification layer (the domain-specific layer); there are K of them, one branch per training video. During training only the fc6 corresponding to the current video is used, while all preceding layers are shared. A code sketch of this structure follows the list.

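To make this concrete, here is a minimal PyTorch sketch of the structure just described: shared VGG-M-style conv1-conv3, shared 512-unit fc4 and fc5 with ReLU and Dropout, and K domain-specific fc6 branches. The layer widths and the 3x3x512 conv3 output follow the description above, but the exact kernel sizes, strides, and normalization settings are assumptions for illustration, not a faithful reproduction of the released code. A 128-box mini-batch from the k-th training video would be scored with `model(batch, domain_idx=k)`.

```python
import torch
import torch.nn as nn

class MDNetSketch(nn.Module):
    def __init__(self, num_domains):
        super().__init__()
        # Shared convolutional layers (VGG-M style); a 3x107x107 input
        # yields a 512x3x3 feature map after conv3.
        self.conv_layers = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=7, stride=2), nn.ReLU(),
            nn.LocalResponseNorm(2), nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(),
            nn.LocalResponseNorm(2), nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 512, kernel_size=3, stride=1), nn.ReLU(),
        )
        # Shared fully connected layers fc4 and fc5 (512 units, ReLU + Dropout).
        self.fc_layers = nn.Sequential(
            nn.Linear(512 * 3 * 3, 512), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(512, 512), nn.ReLU(), nn.Dropout(0.5),
        )
        # K domain-specific fc6 branches, each a binary (background/target) classifier.
        self.branches = nn.ModuleList([nn.Linear(512, 2) for _ in range(num_domains)])

    def forward(self, x, domain_idx):
        feat = self.conv_layers(x)
        feat = self.fc_layers(feat.flatten(start_dim=1))
        return self.branches[domain_idx](feat)  # raw scores for (background, target)
```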
To learn the commonality of targets across videos, a domain-specific training scheme is used. Suppose K videos are used for training and N training cycles are run. Each mini-batch is built from a single video: 8 frames are sampled at random, and from those 8 frames 32 positive and 96 negative samples are drawn, so each mini-batch consists of 128 boxes from one video. In each cycle, K iterations are performed, going through the K videos' mini-batches in turn, and this is repeated for N cycles. Training uses SGD, and each video has its own fc6 layer. In this way the network learns what the targets in the different videos have in common. At test time, a new fc6 layer is created and fc4-fc6 are fine-tuned online, while the convolutional layers are kept fixed.
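A sketch of that training schedule is below. It assumes a hypothetical `sample_minibatch(video)` helper that draws 8 random frames and returns the 128 cropped boxes (32 positives, 96 negatives) with their labels; the helper, learning rate, and momentum are placeholders, not values from the original code.

```python
import torch.nn.functional as F
from torch.optim import SGD

def train_multi_domain(model, videos, sample_minibatch, num_cycles, lr=1e-4):
    optimizer = SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(num_cycles):                      # N cycles
        for k, video in enumerate(videos):           # K iterations, one video per iteration
            boxes, labels = sample_minibatch(video)  # 128 boxes (32 pos + 96 neg), 107x107 crops
            scores = model(boxes, domain_idx=k)      # only the k-th fc6 branch is used
            loss = F.cross_entropy(scores, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```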

2.2 Tracking with MDNet

2.2.1 Online Network Update Strategy

Two update schemes are used: long-term and short-term. During tracking, past tracking results are stored as positive samples, but only when their score exceeds a threshold are they added to the training set. The long-term set keeps the last 100 positive samples (discarding the oldest beyond 100) and triggers a network update at fixed intervals (every 8 frames in the released code); the short-term set keeps the last 20 and triggers an update whenever the target score drops below 0.5. Negative samples are always collected in the short-term fashion. In addition, hard negative mining [2] is used when generating negatives during training: the negatives become progressively harder to classify, which makes the network increasingly discriminative.
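The sketch below illustrates this bookkeeping: long-term and short-term positive buffers, short-term negatives, the two update triggers, and a hard negative mining step that keeps the negatives the current network finds hardest. The buffer sizes (100 and 20), the 8-frame interval, and the 0.5 score threshold follow the text; the function names and feature handling are illustrative assumptions.

```python
from collections import deque
import torch

class SampleMemory:
    def __init__(self):
        self.pos_long = deque(maxlen=100)   # long-term positives (oldest dropped beyond 100)
        self.pos_short = deque(maxlen=20)   # short-term positives
        self.neg_short = deque(maxlen=20)   # negatives are only kept short-term

    def add(self, pos_feats, neg_feats, score, threshold=0.5):
        if score > threshold:               # only confident detections become training positives
            self.pos_long.append(pos_feats)
            self.pos_short.append(pos_feats)
        self.neg_short.append(neg_feats)

def should_update(frame_idx, score, interval=8):
    long_term = frame_idx % interval == 0   # periodic long-term update (every 8 frames)
    short_term = score < 0.5                # triggered short-term update on a low target score
    return long_term or short_term

def mine_hard_negatives(model, neg_boxes, domain_idx, top_k=96):
    # Score all candidate negatives and keep the ones with the highest
    # target score, i.e. the hardest ones for the current network.
    with torch.no_grad():
        target_scores = model(neg_boxes, domain_idx)[:, 1]
    hard_idx = target_scores.topk(min(top_k, len(neg_boxes))).indices
    return neg_boxes[hard_idx]
```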

2.2.2 Target Tracking

For each new frame, 256 candidates are sampled around the target position of the previous frame from a multi-dimensional Gaussian distribution (over horizontal translation, vertical translation, and scale). Each candidate is resized to 107x107 and fed through the network.
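A sketch of this sampling step, assuming boxes are stored as (center x, center y, width, height); the standard deviations and the scale base are illustrative values, not the settings of the released code.

```python
import numpy as np

def sample_candidates(prev_box, n=256, trans_sigma=0.6, scale_sigma=0.5):
    cx, cy, w, h = prev_box
    mean_size = (w + h) / 2.0
    dx = np.random.randn(n) * trans_sigma * mean_size   # horizontal translation
    dy = np.random.randn(n) * trans_sigma * mean_size   # vertical translation
    ds = 1.05 ** (np.random.randn(n) * scale_sigma)     # multiplicative scale change
    # Each row is one candidate box; each is later cropped from the frame
    # and resized to 107x107 before being fed to the network.
    return np.stack([cx + dx, cy + dy, w * ds, h * ds], axis=1)
```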

The network output is a two-dimensional vector giving the probabilities that the input bounding box is the target or the background. The final target is the bounding box with the highest target probability:
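In the paper's notation, with $f^{+}(\mathbf{x}^{i})$ denoting the target (positive) score of the $i$-th candidate $\mathbf{x}^{i}$, the selected target state is

$$\mathbf{x}^{*} = \operatorname*{arg\,max}_{\mathbf{x}^{i}} f^{+}\left(\mathbf{x}^{i}\right)$$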

The winning candidate is not used directly as the final target: a bounding box regression step is applied on top. This is done in the same way as in R-CNN [3] and is a standard technique, so it is not described in detail here, but it does contribute to the final results, as the experiments below show.
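For completeness, here is a minimal sketch of an R-CNN-style box regression step: a linear (ridge) regressor fit on features of the first-frame samples predicts offsets (dx, dy, dw, dh) in the usual R-CNN parameterization, which are then applied to the selected candidate. Using scikit-learn's Ridge and this particular feature interface is an assumption for illustration, not the paper's exact implementation.

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_bbox_regressor(features, targets, alpha=1000.0):
    # features: (n, d) sample features; targets: (n, 4) R-CNN-style
    # regression targets (dx, dy, dw, dh) relative to each sample box.
    reg = Ridge(alpha=alpha)
    reg.fit(features, targets)
    return reg

def refine_box(reg, feature, box):
    cx, cy, w, h = box
    dx, dy, dw, dh = reg.predict(feature.reshape(1, -1))[0]
    # Invert the R-CNN parameterization to obtain the refined box.
    return np.array([cx + dx * w, cy + dy * h, w * np.exp(dw), h * np.exp(dh)])
```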

2.3 Experiments

2.3.1 OTB50 and OTB100

2.3.2 VOT2014

2.3.3 Ablation Study

The authors also validate the individual strategies used in the paper: SDNet, which is trained on a single domain instead of with multi-domain training; MDNet-BB, which drops bounding box regression; and MDNet-BB-HM, which drops both hard negative mining and bounding box regression. The results are as follows:

The improvement from the multi-domain strategy is clearly visible in both metrics, about 3% to 4%. Hard negative mining and bounding box regression also contribute to the results.

3. Conclusion

To summarize why MDNet works so well:
  • It uses CNN features from a network designed specifically for tracking and trained on tracking datasets
  • It performs online fine-tuning, which makes it slow but matters a great deal for accuracy
  • Candidate sampling also covers scale, which makes it relatively robust to scale changes
  • Hard negative mining and bounding box regression make the results more precise

[1] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In BMVC, 2014.

[2] K.-K. Sung and T. Poggio. Example-based learning for view-based human face detection. IEEE Trans. Pattern Anal. Mach. Intell., 20(1):39-51, 1998.

[3] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
