I have recently started writing a short paper, so here I summarize the Introduction sections of a few papers with that goal in mind.
I think an Introduction can mainly be divided into three parts: introduce the broad research direction -> review prior work in this area and point out where it falls short or could be improved -> introduce your own work and explain how it addresses these problems to some extent.
Rotate to Attend: Convolutional Triplet Attention Module
Over the years of computer vision research, convolutional neural network architectures of increasing depth have demonstrated major success in many computer vision tasks. Numerous recent works have proposed using channel attention, spatial attention, or both to improve the performance of these neural networks. These attention mechanisms can improve the feature representations generated by standard convolutional layers by explicitly building dependencies among channels (channel attention) or by learning a weighted spatial mask (spatial attention). The intuition behind learning attention weights is to give the network the ability to learn where to attend and to focus further on the target objects.
This is the introduction of the broad research direction: neural networks are popular -> attention mechanisms are introduced -> a brief explanation of how attention mechanisms work. From this we can already tell that the paper's topic is attention mechanisms.
One of the most prominent methods is the Squeeze-and-Excitation network (SENet). The Squeeze-and-Excitation (SE) module computes channel attention and provides incremental performance gains at a considerably low cost. SENet was succeeded by the Convolutional Block Attention Module (CBAM) and the Bottleneck Attention Module (BAM), both of which stressed providing robust, representative attention by incorporating spatial attention along with channel attention. They provided substantial performance gains over their squeeze-and-excite counterpart at a small computational overhead.
Different from the aforementioned attention approaches, which require a number of extra learnable parameters, the foundation of this paper is to investigate how to build cheap but effective attention while maintaining similar or better performance. In particular, we aim to stress the importance of capturing cross-dimension interaction while computing attention weights, so as to provide rich feature representations. We take inspiration from the method of computing attention in CBAM, which successfully demonstrated the importance of capturing spatial attention along with channel attention. In CBAM, the channel attention is computed in a similar way to SENet except for the use of both global average pooling (GAP) and global max pooling (GMP), while the spatial attention is generated by simply reducing the input to a single-channel output to obtain the attention weights. We observe that the channel attention method within CBAM, although providing significant performance improvements, does not account for cross-dimension interaction, which we show has a favorable impact on performance when captured. Additionally, CBAM incorporates dimensionality reduction while computing channel attention, which is unnecessary for capturing the nonlinear local dependencies between channels.
In summary: the models above lack cross-dimension interaction, which causes certain problems (the nonlinear local dependencies mentioned above).
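To make the CBAM pipeline described above concrete, here is a minimal NumPy sketch of its two branches (the function names and the weight shapes `w1`/`w2` are illustrative placeholders; a real CBAM applies a learned 7×7 convolution in the spatial branch, which is replaced here by a simple average of the two pooled maps):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbam_channel_attention(x, w1, w2):
    # x: (C, H, W) feature map.
    # w1: (C//r, C) and w2: (C, C//r) are the shared MLP weights; r is the
    # dimensionality-reduction ratio that triplet attention argues against.
    gap = x.mean(axis=(1, 2))                    # global average pooling -> (C,)
    gmp = x.max(axis=(1, 2))                     # global max pooling     -> (C,)
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0)   # shared MLP with ReLU
    attn = sigmoid(mlp(gap) + mlp(gmp))          # (C,) channel weights in (0, 1)
    return x * attn[:, None, None]

def cbam_spatial_attention(x):
    # Reduce the channel dimension to single-channel average/max maps.
    avg = x.mean(axis=0, keepdims=True)          # (1, H, W)
    mx = x.max(axis=0, keepdims=True)            # (1, H, W)
    # Stand-in for CBAM's learned 7x7 conv over the concatenated maps.
    attn = sigmoid((avg + mx) / 2.0)
    return x * attn
```

Because the channel attention squeezes each channel to a scalar through the reduced MLP, channels interact with the spatial dimensions only indirectly; this is the cross-dimension interaction triplet attention aims to restore.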
Based on the above observation, in this paper, we propose triplet attention which accounts for cross-dimension interaction in an efficient way…
Compared to previous channel attention mechanisms, our approach offers two advantages. First, our method helps capture rich discriminative feature representations at a negligible computational overhead, which we further verify empirically by visualizing Grad-CAM and Grad-CAM++ results. Second, unlike its predecessors, our method stresses the importance of cross-dimension interaction without dimensionality reduction, thus eliminating the indirect correspondence between channels and weights.
The intro of this paper follows the flow very well: introduce the broad research direction -> review prior work and its shortcomings -> introduce one's own work and explain how it addresses those problems to some extent.
TSM: Temporal Shift Module for Efficient Video Understanding
Hardware-efficient video understanding is an important step towards real-world deployment, both on the cloud and on the edge. For example, over 10^5 hours of video are uploaded to YouTube every day to be processed for recommendation and ads ranking; terabytes of sensitive videos in hospitals need to be processed locally on edge devices to protect privacy. All these industry applications require both accurate and efficient video understanding. Deep learning has become the standard for video understanding over the years. One key difference between video recognition and image recognition is the need for temporal modeling. For example, to distinguish between opening and closing a box, reversing the order gives the opposite result, so temporal modeling is critical. Existing efficient video understanding approaches directly use 2D CNNs. However, 2D CNNs on individual frames cannot model temporal information well. 3D CNNs can jointly learn spatial and temporal features, but their computation cost is large, making deployment on edge devices difficult; nor can they be applied to real-time online video recognition. There are works that trade off between temporal modeling and computation, such as post-hoc fusion and mid-level temporal fusion. Such methods sacrifice low-level temporal modeling for efficiency, but much of the useful information is lost during feature extraction, before the temporal fusion happens.
This paper is quite direct. It first states the importance of hardware-friendly video understanding, then briefly explains why temporal modeling matters, and then reviews prior work: some methods use 2D CNNs and thus cannot model temporal information effectively, while others use 3D CNNs at a large computational cost. Methods that trade off between temporal modeling and computation (post-hoc fusion and mid-level temporal fusion) sacrifice low-level information, and that information turns out to be important.
In this paper, we propose a new perspective for efficient temporal modeling in video understanding with a novel Temporal Shift Module (TSM). Concretely, an activation in a video model can be represented as A ∈ R^(N×C×T×H×W), where N is the batch size, C is the number of channels, T is the temporal dimension, and H and W are the spatial resolutions. Traditional 2D CNNs operate independently over the dimension T, so no temporal modeling takes effect. In contrast, our Temporal Shift Module (TSM) shifts the channels along the temporal dimension, both forward and backward. As shown in Figure 1b, the information from neighboring frames is mingled with the current frame after shifting. Our intuition is that the convolution operation consists of shift and multiply-accumulate: we shift in the time dimension by ±1 and fold the multiply-accumulate from the time dimension into the channel dimension. For real-time online video understanding, future frames cannot be shifted to the present, so we use a uni-directional TSM to perform online video understanding. Despite the zero-computation nature of the shift operation, we empirically find that simply adopting the spatial shift strategy used in image classification introduces two major issues for video understanding: (1) it is not efficient: the shift operation is conceptually zero-FLOP but incurs data movement, and the additional cost of data movement is non-negligible, resulting in increased latency. This is exacerbated in video networks, which usually have large memory consumption (5D activations). (2) It is not accurate: shifting too many channels significantly hurts the network's spatial modeling ability and degrades performance. To tackle these problems, we make two technical contributions. (1) We use a temporal partial shift strategy: instead of shifting all the channels, we shift only a small portion of the channels for efficient temporal fusion.
Such a strategy significantly cuts down the data movement cost. (2) We insert TSM inside the residual branch rather than outside, so that the activation of the current frame is preserved and the spatial feature learning capability of the 2D CNN backbone is not harmed.
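The bidirectional partial shift can be sketched in a few lines of NumPy (the parameter name `shift_div` and the exact channel split are illustrative; the paper places this operation inside a residual branch before the 2D convolution):

```python
import numpy as np

def temporal_shift(x, shift_div=8):
    # x: (N, T, C, H, W) activation. Shift 1/shift_div of the channels
    # forward in time, another 1/shift_div backward, leave the rest in place.
    n, t, c, h, w = x.shape
    fold = c // shift_div
    out = np.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                   # forward: past -> current
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]   # backward: future -> current
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # untouched channels
    return out
```

The online (uni-directional) variant simply drops the backward line, since future frames are unavailable at inference time. Shifting only `2/shift_div` of the channels is what keeps both the data-movement cost and the loss of spatial modeling capacity small.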
Video Modeling with Correlation Networks
After the breakthrough of AlexNet on ImageNet, convolutional neural networks (CNNs) became the dominant model for still-image classification. In the video domain, CNNs were initially adopted as image-based feature extractors on individual frames of the video. More recently, CNNs for video analysis have been extended with the capability of capturing not only the appearance information contained in individual frames but also the motion information extracted from the temporal dimension of the image sequence. This is usually achieved by one of two mechanisms. One strategy involves a two-stream network, where one stream operates on RGB frames to model appearance information and the other extracts motion features from optical flow provided as input. The representations obtained from these two distinct inputs are then fused, typically in a late layer of the network. An alternative strategy is to use 3D convolutions, which couple appearance and temporal modeling by means of spatiotemporal kernels.
In this paper we propose a new scheme based on a novel correlation operator inspired by the correlation layer in FlowNet. While in FlowNet the correlation layer is applied only once, to convert the video information from the RGB pixel space to the motion displacement space, we propose a learnable correlation operator that establishes frame-to-frame matches over convolutional feature maps to capture different notions of similarity in different layers of the network. Similarly to two-stream models, our model enables the fusion of explicit motion cues with appearance information. However, while in two-stream models the motion and appearance subnets are learned disjointly and fused only in a late layer of the model, our network enables the efficient integration of appearance and motion information throughout the network. Compared to 3D CNNs, which extract spatiotemporal features, our model factorizes the computation of appearance and motion, and learns distinct filters capturing different measures of patch similarity. The learned filters can match pixels moving in different directions. Through extensive experiments on four action recognition datasets (Kinetics, Something-Something, Diving48, and Sports1M), we demonstrate that our correlation network compares favorably with widely-used 3D CNNs for video modeling, and achieves competitive results against the prominent two-stream network while being much faster to train.
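A FlowNet-style correlation operator of the kind this paper builds on can be sketched as follows (a plain dot-product version over two neighboring frames' feature maps; the learnable per-layer filters that distinguish the paper's operator are omitted, and `max_disp` is an illustrative name for the maximum displacement searched):

```python
import numpy as np

def correlation(f1, f2, max_disp=1):
    # f1, f2: (C, H, W) feature maps from two neighboring frames.
    # For each spatial location, compute normalized dot products between f1's
    # feature vector and f2's vectors within a (2*max_disp+1)^2 neighborhood,
    # producing a (K*K, H, W) similarity volume with K = 2*max_disp + 1.
    c, h, w = f1.shape
    k = 2 * max_disp + 1
    pad = np.pad(f2, ((0, 0), (max_disp, max_disp), (max_disp, max_disp)))
    out = np.empty((k * k, h, w))
    for i, dy in enumerate(range(-max_disp, max_disp + 1)):
        for j, dx in enumerate(range(-max_disp, max_disp + 1)):
            shifted = pad[:, max_disp + dy:max_disp + dy + h,
                             max_disp + dx:max_disp + dx + w]
            out[i * k + j] = (f1 * shifted).sum(axis=0) / c  # dot product / C
    return out
```

Each output channel answers "how similar is this patch to the patch displaced by (dy, dx) in the next frame?", which is why the operator turns appearance features into an explicit motion cue that later layers can consume.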