Introduction sections of some papers

I have recently started writing a short paper, so here I summarize the Introduction sections of a few papers as targeted reference.

Outline

In my view, an Introduction can be divided into three parts: introduce the broad direction of the topic -> introduce prior work in this area and point out where it falls short or could be improved -> introduce your own work and explain how it addresses these problems to some extent.

Rotate to Attend: Convolutional Triplet Attention Module

Over the years of computer vision research, convolutional neural network architectures of increasing depth have demonstrated major success in many computer vision tasks. Numerous recent works have proposed using either channel attention, or spatial attention, or both to improve the performance of these neural networks. These attention mechanisms have the capabilities of improving the feature representations generated by standard convolutional layers by explicitly building dependencies among channels or weighted spatial masks for spatial attention. The intuition behind learning attention weights is to allow the network to have the ability to learn where to attend and further focus on the target objects.

The opening states that convolutional network architectures have achieved solid results in computer vision, and that more and more work uses either channel attention, spatial attention, or both to improve network performance. It also explains that attention mechanisms work by building dependencies among channels or by producing weighted spatial masks, so that the network can focus more on the target objects to be detected.

This is the broad-direction part of the topic: neural networks are popular -> attention mechanisms are introduced -> a brief explanation of how attention works. From this we can already tell that the paper's argument is about attention mechanisms.

One of the most prominent methods is the squeeze-and-excitation networks (SENet). Squeeze and Excite (SE) module computes channel attentions and provides incremental performance gains at a considerably low cost. SENet was succeeded by Convolutional Block Attention Module (CBAM) and Bottleneck Attention Module (BAM), both of which stressed on providing robust representative attentions by incorporating spatial attention along with channel attention. They provided substantial performance gains over their squeeze-and-excite counterpart with a small computational overhead.

This paragraph introduces prior work: SENet computes channel attention and gives incremental gains at low cost, and it was later followed by CBAM and BAM, both of which incorporate spatial attention alongside channel attention, yielding a substantial accuracy improvement at a small computational overhead.
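
To make the squeeze-and-excite idea above more concrete, here is a minimal PyTorch sketch of an SE-style channel attention block. This is my own illustration, not the official SENet code; the class name SEBlock and the reduction ratio of 16 are assumptions.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Minimal squeeze-and-excite style channel attention (illustrative sketch)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # squeeze: global average pooling
        self.fc = nn.Sequential(                       # excite: bottleneck MLP over channels
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                              # x: (N, C, H, W)
        n, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(n, c)).view(n, c, 1, 1)
        return x * w                                   # reweight each channel
```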

Different from the aforementioned attention approaches that require a number of extra learnable parameters, the foundation backbone of this paper is to investigate the way of building cheap but effective attentions while maintaining similar or providing better performance. In particular, we aim to stress on the importance of capturing cross-dimension interaction while computing attention weights to provide rich feature representations. We take inspiration from the method of computing attention in CBAM which successfully demonstrated the importance of capturing spatial attention along with channel attention. In CBAM, the channel attention is computed in a similar way as that of SENet except for the usage of global average pooling (GAP) and global max pooling (GMP) while the spatial attention is generated by simply reducing the input to a single channel output to obtain the attention weights. We observe that the channel attention method within CBAM although providing significant performance improvements does not account for cross-dimension interaction which we showcase to have a favorable impact on the performance when captured. Additionally, CBAM incorporates dimensionality reduction while computing channel attention which is redundant to capture nonlinear local dependencies between channels.

This paragraph explains the shortcomings of the work mentioned above: earlier methods obtain the spatial attention weights simply by reducing the input to a single-channel output. Although this improves performance, it lacks cross-dimension interaction, which the authors' experiments show has a favorable impact on performance when captured. Moreover, CBAM uses dimensionality reduction when computing channel attention, which is redundant for capturing the nonlinear local dependencies between channels.

In short, the models above lack cross-dimension interaction, which causes certain problems (related to the so-called nonlinear local dependencies).
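
For comparison, here is a rough sketch of how CBAM-style attention is usually described: channel attention from GAP and GMP passed through a shared bottleneck MLP (the dimensionality reduction criticized above), and spatial attention obtained by pooling over channels and reducing to a single-channel mask. This is only my illustration under those assumptions, not the official CBAM implementation.

```python
import torch
import torch.nn as nn

class CBAMSketch(nn.Module):
    """Rough CBAM-style channel + spatial attention (illustrative, not official code)."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        # channel attention: GAP and GMP share a bottleneck MLP (dimensionality reduction)
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # spatial attention: pool over channels, then produce a single-channel mask
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                              # x: (N, C, H, W)
        n, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))             # GAP branch
        mx = self.mlp(x.amax(dim=(2, 3)))              # GMP branch
        x = x * torch.sigmoid(avg + mx).view(n, c, 1, 1)
        s = torch.cat([x.mean(dim=1, keepdim=True),    # reduce the input over channels
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))      # single-channel spatial mask
```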

Based on the above observation, in this paper, we propose triplet attention which accounts for cross-dimension interaction in an efficient way…

The authors then propose their own model and explain how it makes use of cross-dimension interaction.
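
Based only on the description above (rotating the tensor and pooling to capture cross-dimension interaction without dimensionality reduction), one branch of such a module might look roughly like the sketch below. The actual triplet attention module uses three such branches and averages their outputs; the pooling choice and kernel size here are assumptions that should be checked against the official code.

```python
import torch
import torch.nn as nn

class CrossDimBranch(nn.Module):
    """One cross-dimension interaction branch (sketch; details are assumptions)."""
    def __init__(self, kernel_size=7):                 # kernel size is an assumption
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                              # x: (N, C, H, W)
        # rotate so the channel dim interacts with a spatial dim: (N, H, C, W)
        y = x.permute(0, 2, 1, 3)
        # pool the rotated "channel" dim with max and mean, then concatenate
        z = torch.cat([y.amax(dim=1, keepdim=True),
                       y.mean(dim=1, keepdim=True)], dim=1)
        attn = torch.sigmoid(self.conv(z))             # attention weights, no dimensionality reduction
        return (y * attn).permute(0, 2, 1, 3)          # rotate back to (N, C, H, W)
```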

Compared to previous channel attention mechanisms, our approach offers two advantages. First, our method helps in capturing rich discriminative feature representations at a negligible computational overhead which we further empirically verify by visualizing the Grad-CAM and Grad-CAM++ results. Second, unlike our predecessors, our method stresses the importance of cross-dimension interaction with no dimensionality reduction, thus eliminating indirect correspondence between channels and weights.

This paragraph explains the advantages of the authors' model. First, it captures rich, discriminative feature representations at negligible computational cost (my understanding is that a feature representation is simply what you get after some convolutions), which they verify by visualizing Grad-CAM and Grad-CAM++ results. Second, it stresses cross-dimension interaction without dimensionality reduction, thereby eliminating the indirect correspondence between channels and weights.


The intro of this paper follows the flow very well: introduce the broad direction of the topic -> introduce prior work and where it falls short or could be improved -> introduce your own work and explain how it addresses these problems to some extent.

TSM: Temporal Shift Module for Efficient Video Understanding

Hardware-efficient video understanding is an important step towards real-world deployment, both on the cloud and on the edge. For example, there are over 10^5 hours of videos uploaded to YouTube every day to be processed for recommendation and ads ranking; tera-bytes of sensitive videos in hospitals need to be processed locally on edge devices to protect privacy. All these industry applications require both accurate and efficient video understanding. Deep learning has become the standard for video understanding over the years. One key difference between video recognition and image recognition is the need for temporal modeling. For example, to distinguish between opening and closing a box, reversing the order will give opposite results, so temporal modeling is critical. Existing efficient video understanding approaches directly use 2D CNN. However, 2D CNN on individual frames cannot well model the temporal information. 3D CNNs can jointly learn spatial and temporal features but the computation cost is large, making the deployment on edge devices difficult; it cannot be applied to real-time online video recognition. There are works to trade off between temporal modeling and computation, such as post-hoc fusion and mid-level temporal fusion. Such methods sacrifice the low-level temporal modeling for efficiency, but much of the useful information is lost during the feature extraction before the temporal fusion happens.

This paper is quite direct. It first states the importance of hardware-efficient video understanding, then briefly explains the importance of temporal modeling and reviews previous work: methods either use 2D CNNs, which cannot effectively model temporal information, or 3D CNNs, which carry a large computational cost. Some methods trade off between temporal modeling and computation (post-hoc fusion and mid-level temporal fusion), but they sacrifice low-level temporal information, which is quite important.

Here the broad direction, the related prior work, and its shortcomings are all introduced quite briefly. I feel my own paper can borrow from this: for example, first explain the importance of temporal modeling, then point out that plain 3D CNNs are time-consuming and that TSM demands a lot of memory because of its many data-movement operations, so we can pursue a lighter, more hardware-friendly model that maintains reasonable performance while training faster.

In this paper, we propose a new perspective for efficient temporal modeling in video understanding by proposing a novel Temporal Shift Module (TSM). Concretely, an activation in a video model can be represented as A ∈ R^(N×C×T×H×W), where N is the batch size, C is the number of channels, T is the temporal dimension, H and W are the spatial resolutions. Traditional 2D CNNs operate independently over the dimension T; thus no temporal modeling takes effect. In contrast, our Temporal Shift Module (TSM) shifts the channels along the temporal dimension, both forward and backward. As shown in Figure 1b, the information from neighboring frames is mingled with the current frame after shifting. Our intuition is: the convolution operation consists of shift and multiply-accumulate. We shift in the time dimension by ±1 and fold the multiply-accumulate from time dimension to channel dimension. For real-time online video understanding, future frames can’t get shifted to the present, so we use a uni-directional TSM to perform online video understanding. Despite the zero-computation nature of the shift operation, we empirically find that simply adopting the spatial shift strategy used in image classifications introduces two major issues for video understanding: (1) it is not efficient: shift operation is conceptually zero FLOP but incurs data movement. The additional cost of data movement is non-negligible and will result in latency increase. This phenomenon has been exacerbated in the video networks since they usually have a large memory consumption (5D activation). (2) It is not accurate: shifting too many channels in a network will significantly hurt the spatial modeling ability and result in performance degradation. To tackle the problems, we make two technical contributions. (1) We use a temporal partial shift strategy: instead of shifting all the channels, we shift only a small portion of the channels for efficient temporal fusion. Such strategy significantly cuts down the data movement cost. (2) We insert TSM inside residual branch rather than outside so that the activation of the current frame is preserved, which does not harm the spatial feature learning capability of the 2D CNN backbone.

This part introduces the authors' model. It shifts channels along the temporal dimension so that the model can obtain temporal information; since only shift operations are involved, it adds almost no extra computation. To address the efficiency and accuracy issues mentioned above, they shift only a small portion of the channels and place the shift inside the residual branch.
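
Below is a minimal sketch of the bi-directional partial temporal shift described above, assuming the activation has been reshaped to (N, T, C, H, W); the fraction of shifted channels (1/8 forward and 1/8 backward here) is an assumption rather than necessarily the paper's exact setting.

```python
import torch

def temporal_shift(x, shift_div=8):
    """Partial temporal shift (sketch). x has shape (N, T, C, H, W)."""
    n, t, c, h, w = x.shape
    fold = c // shift_div                                   # only a small portion of channels is shifted
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                    # first fold: shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]    # second fold: shift backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]               # remaining channels stay in place
    return out

# Usage idea (pseudo): place the shift inside the residual branch,
#   y = x + conv_branch(temporal_shift(x))
# so the identity path still carries the unshifted current-frame activation.
```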

Video Modeling with Correlation Networks

After the breakthrough of AlexNet on ImageNet, convolutional neural networks (CNNs) have become the dominant model for still-image classification. In the video domain, CNNs were initially adopted as image-based feature extractor on individual frames of the video. More recently, CNNs for video analysis have been extended with the capability of capturing not only appearance information contained in individual frames but also motion information extracted from the temporal dimension of the image sequence. This is usually achieved by one of two possible mechanisms. One strategy involves the use of a two-stream network where one stream operates on RGB frames to model appearance information and the other stream extracts motion features from optical flow provided as input. The representations obtained from these two distinct inputs are then fused, typically in a late layer of the network. An alternative strategy is to use 3D convolutions which couple appearance and temporal modeling by means of spatiotemporal kernels.

This is a fairly standard background: CNNs first succeeded on still images, then were extended to video. To capture temporal (motion) information there are generally two strategies: one is the two-stream network, the other is the 3D CNN.

In this paper we propose a new scheme based on a novel correlation operator inspired by the correlation layer in FlowNet. While in FlowNet the correlation layer is only applied once to convert the video information from the RGB pixel space to the motion displacement space, we propose a learnable correlation operator to establish frame-to-frame matches over convolutional feature maps to capture different notions of similarity in different layers of the network. Similarly to two-stream models, our model enables the fusion of explicit motion cues with appearance information. However, while in two-stream models the motion and appearance subnets are disjointly learned and fused only in a late layer of the model, our network enables the efficient integration of appearance and motion information throughout the network. Compared to 3D CNNs, which extract spatiotemporal features, our model factorizes the computation of appearance and motion, and learns distinct filters capturing different measures of patch similarity. The learned filters can match pixels moving in different directions. Through our extensive experiments on four action recognition datasets (Kinetics, Something-Something, Diving48 and Sports1M), we demonstrate that our correlation network compares favorably with widely-used 3D CNNs for video modeling, and achieves competitive results over the prominent two-stream network while being much faster to train.

The authors do not review prior papers in the usual way; instead they say their work is based on the correlation layer in FlowNet (I think this approach is feasible, I could say my paper is based on C3D). They then point out FlowNet's limitation: its correlation layer is applied only once, converting video information from RGB pixel space to the motion displacement space, whereas the proposed learnable correlation operator captures different notions of similarity between convolutional feature maps at different layers of the network. They then note that two-stream networks process motion and appearance information separately and fuse them only in a late layer, whereas their model integrates the two throughout the network. They also compare with 3D CNNs: their model factorizes the computation of appearance and motion (my understanding is that 3D CNNs convolve the feature maps jointly with 3D kernels, while the authors handle appearance and motion separately).
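
To make the correlation idea concrete, here is a naive sketch of a FlowNet-style correlation between the feature maps of two adjacent frames: for every spatial position, the channel-wise dot product with displaced positions in the next frame within a maximum displacement. This is a simplified, non-learnable illustration of the general operator; the paper's correlation operator additionally learns filters and is applied at multiple layers.

```python
import torch
import torch.nn.functional as F

def correlation(f1, f2, max_disp=3):
    """Naive FlowNet-style correlation between two frames' feature maps (sketch).

    f1, f2: (N, C, H, W) features of frame t and frame t+1.
    Returns: (N, (2*max_disp+1)**2, H, W), one similarity map per displacement.
    """
    n, c, h, w = f1.shape
    f2 = F.pad(f2, [max_disp] * 4)                       # pad so shifted windows stay in bounds
    out = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = f2[:, :, dy:dy + h, dx:dx + w]     # frame t+1 shifted by (dy, dx)
            out.append((f1 * shifted).sum(dim=1) / c)    # dot product over channels
    return torch.stack(out, dim=1)
```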


These are brief notes on the intro sections of a few papers. Personally, I feel my paper could borrow the structure of the third one: first state that my work builds on xxx, then, because of some problems with xxx, explain how I improve on it to some extent. After that, I can compare my model against two-stream networks and plain 3D CNNs.