Mengshi Qi

I am currently a research scientist in the CVLab at École polytechnique fédérale de Lausanne (EPFL), where I work closely with Prof. Pascal Fua and Dr. Mathieu Salzmann.

Prior to that, I worked at Baidu Research in 2019, focusing on computer vision and deep learning in collaboration with Prof. Yi Yang. I received my PhD and Master's degrees from Beihang University (BUAA) in 2019 and 2014, respectively, advised by Prof. Yunhong Wang. From 2017 to 2018, I was a visiting PhD student at the University of Rochester, supervised by Prof. Jiebo Luo. I received my Bachelor's degree from Beijing University of Posts and Telecommunications (BUPT) in 2012.

Email  /  CV  /  Google Scholar  /  LinkedIn

Research Interests

I'm interested in computer vision and machine learning, especially scene understanding, 3D reconstruction, and multimedia analysis. Most of my research is about understanding the semantic content of images and videos and inferring physical information from them.

Publications
Semantics-Aware Spatial-Temporal Binaries for Cross-Modal Video Retrieval
Mengshi Qi, Jie Qin, Yi Yang, Yunhong Wang, Jiebo Luo
IEEE Transactions on Image Processing (TIP), 2021
pdf / press / bibtex

In this paper, we propose a novel binary representation learning framework, named Semantics-aware Spatial-temporal Binaries (S2Bin), which simultaneously considers spatial-temporal context and semantic relationships for cross-modal video retrieval. By exploiting the semantic relationships between two modalities, S2Bin can efficiently and effectively generate binary codes for both videos and texts. In addition, we adopt an iterative optimization scheme to learn deep encoding functions with attribute-guided stochastic training.
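
To give a flavor of how binary codes can be produced from continuous embeddings, here is a minimal PyTorch-style sketch (my own illustration of a standard hashing relaxation, not the actual S2Bin implementation):

    import torch

    def to_binary_codes(embedding, training=True):
        # Relax the discrete sign() with tanh() during training so that
        # gradients can flow; use the hard sign() at retrieval time.
        return torch.tanh(embedding) if training else torch.sign(embedding)

    # Hypothetical usage: embeddings come from modality-specific encoders.
    video_codes = to_binary_codes(torch.randn(8, 64), training=False)
    text_codes = to_binary_codes(torch.randn(8, 64), training=False)
    # For ±1 codes, Hamming distance reduces to (bits - dot product) / 2,
    # which is what makes binary retrieval fast.
    hamming = 0.5 * (64 - video_codes @ text_codes.t())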

Latent Memory-augmented Graph Transformer for Visual Storytelling
Mengshi Qi, Jie Qin, Di Huang, Zhiqiang Shen, Yi Yang, Jiebo Luo
ACM MM, 2021   (Oral Presentation)
pdf / press / video / bibtex

In this paper, we present a novel Latent Memory-augmented Graph Transformer (LMGT), a Transformer-based framework comprising a graph encoding module and a latent memory unit for visual story generation.

Imitative Non-Autoregressive Modeling for Trajectory Forecasting and Imputation
Mengshi Qi, Jie Qin, Yu Wu, Yi Yang
CVPR, 2020
pdf / press / video / bibtex

In this paper, we propose a novel Imitative Non-Autoregressive Modeling method to bridge the performance gap between autoregressive and non-autoregressive models for temporal sequence forecasting and imputation. Our framework leverages an imitation-learning scheme with two parts, i.e., a recurrent conditional variational autoencoder (RC-VAE) demonstrator and a non-autoregressive transformation model (NART) learner.
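
The demonstrator/learner split follows the spirit of imitation learning; a toy sketch of such a loss (a simplification of mine, not the paper's objective) looks like:

    import torch.nn.functional as F

    def imitation_loss(learner_out, demonstrator_out, target):
        # The non-autoregressive learner both fits the ground truth and
        # imitates the (frozen) autoregressive demonstrator's outputs.
        task = F.mse_loss(learner_out, target)
        mimic = F.mse_loss(learner_out, demonstrator_out.detach())
        return task + mimic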

Few-Shot Ensemble Learning for Video Classification with SlowFast Memory Networks
Mengshi Qi, Jie Qin, Xiantong Zhen, Di Huang, Yi Yang, Jiebo Luo
ACM MM, 2020
pdf / press / video / bibtex

In this paper, we address few-shot video classification by learning an ensemble of SlowFast networks augmented with memory units. Specifically, we introduce a family of few-shot learners based on SlowFast networks which are used to extract informative features at multiple rates, and we incorporate a memory unit into each network to enable encoding and retrieving crucial information instantly.
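
For readers unfamiliar with SlowFast-style processing, the sketch below illustrates multi-rate temporal sampling (a hypothetical simplification, not the published code):

    import torch

    def two_rate_pathways(clip, slow_stride=8, fast_stride=2):
        # clip: (batch, channels, time, height, width). The slow pathway
        # sees few frames (appearance); the fast pathway sees many (motion).
        return clip[:, :, ::slow_stride], clip[:, :, ::fast_stride]

    clip = torch.randn(2, 3, 32, 112, 112)  # two 32-frame clips
    slow, fast = two_rate_pathways(clip)    # 4 and 16 frames respectively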

STC-GAN: Spatio-Temporally Coupled Generative Adversarial Networks for Predictive Scene Parsing
Mengshi Qi, Yunhong Wang, Annan Li, Jiebo Luo
IEEE Transactions on Image Processing (TIP), 2020
pdf / press / bibtex

In this paper, we present a novel Generative Adversarial Networks based model (i.e., STC-GAN) for predictive scene parsing. STC-GAN captures both spatial and temporal representations from the observed frames of a video through a CNN and a convolutional LSTM network. Moreover, a coupled architecture guides the adversarial training via a weight-sharing mechanism and a feature adaptation transform between the future-frame generation model and the predictive scene parsing model.
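
The weight-sharing idea can be illustrated with a toy PyTorch module (my own simplification, not the published STC-GAN architecture), where two task heads reuse one encoder so that gradients from either task update the shared weights:

    import torch.nn as nn

    class CoupledHeads(nn.Module):
        def __init__(self, num_classes=19):
            super().__init__()
            # One shared encoder couples the two branches: the loss of
            # either branch updates these shared weights.
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
            self.frame_head = nn.Conv2d(64, 3, 1)              # future frame
            self.parsing_head = nn.Conv2d(64, num_classes, 1)  # parsing map

        def forward(self, x):
            feats = self.encoder(x)
            return self.frame_head(feats), self.parsing_head(feats)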

Sports Video Captioning via Attentive Motion Representation and Group Relationship Modeling
Mengshi Qi, Yunhong Wang, Annan Li, Jiebo Luo
IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2019   (An extension of our MMSports@MM '18 paper)
pdf / press / bibtex

In this study, we present a novel hierarchical recurrent neural network based framework with an attention mechanism for sports video captioning. A motion representation module captures individual pose attributes and dynamic trajectory cluster information with extra professional sports knowledge, and a group relationship module builds a scene graph that models players' interactions via a gated graph convolutional network.

stagNet: An Attentive Semantic RNN for Group Activity and Individual Action Recognition
Mengshi Qi, Yunhong Wang, Jie Qin, Annan Li, Jiebo Luo, Luc Van Gool
IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2019   (An extension of our ECCV '18 paper)
pdf / press / bibtex

In this paper, we present a novel attentive semantic recurrent neural network (RNN), namely stagNet, for understanding group activities and individual actions in videos, by combining a spatio-temporal attention mechanism with semantic graph modeling. Specifically, a structured semantic graph is explicitly modeled to express the spatial contextual content of the whole scene, and is then further incorporated with the temporal factor through a structural RNN.

Attentive Relational Networks for Mapping Images to Scene Graphs
Mengshi Qi*, Weijian Li*, Zhengyuan Yang, Yunhong Wang, Jiebo Luo
CVPR, 2019
pdf / arxiv / press / bibtex

In this study, we propose a novel Attentive Relational Network, consisting of two key modules on top of an object detection backbone, to approach this problem. The first is a semantic transformation module that captures semantically embedded relation features by translating visual and linguistic features into a common semantic space. The second is a graph self-attention module that embeds a joint graph representation by assigning importance weights to neighboring nodes.
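
As a rough illustration of graph self-attention (a sketch of mine that omits the learned query/key/value projections used in practice):

    import torch
    import torch.nn.functional as F

    def graph_self_attention(node_feats):
        # node_feats: (num_nodes, dim). Each node aggregates the others,
        # weighted by scaled dot-product importance scores.
        d = node_feats.size(-1)
        weights = F.softmax(node_feats @ node_feats.t() / d ** 0.5, dim=-1)
        return weights @ node_feats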

KE-GAN: Knowledge Embedded Generative Adversarial Networks for Semi-Supervised Scene Parsing
Mengshi Qi, Yunhong Wang, Jie Qin, Annan Li
CVPR, 2019
pdf / press / bibtex

In this paper, we propose novel Knowledge Embedded Generative Adversarial Networks, dubbed KE-GAN, to tackle scene parsing in a semi-supervised fashion. KE-GAN captures the semantic consistencies of different categories by devising a Knowledge Graph from a large-scale text corpus.

stagNet: An Attentive Semantic RNN for Group Activity Recognition
Mengshi Qi, Jie Qin, Annan Li, Yunhong Wang, Jiebo Luo, Luc Van Gool
ECCV, 2018
pdf / press / bibtex

We propose a novel attentive semantic recurrent neural network (RNN), dubbed as stagNet, for understanding group activities in videos, based on the spatio-temporal attention and semantic graph.

Sports Video Captioning by Attentive Motion Representation based Hierarchical Recurrent Neural Networks
Mengshi Qi, Yunhong Wang, Annan Li, Jiebo Luo
MMSports@MM, 2018   (Oral Presentation)
pdf / press / video / bibtex

In this paper, we present a novel hierarchical recurrent neural network (RNN) based framework with an attention mechanism for sports video captioning. A motion representation module is proposed to extract individual pose attribute and group-level trajectory cluster information.

Online Cross-modal Scene Retrieval by Binary Representation and Semantic Graph
Mengshi Qi, Yunhong Wang, Annan Li
ACM MM, 2017
pdf / press / bibtex

We propose a new framework for online cross-modal scene retrieval based on binary representations and a semantic graph. Specifically, we adopt cross-modal hashing based on the quantization loss of different modalities. By introducing the semantic graph, we are able to extract rich semantics and measure their correlation across different modalities.
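
The quantization loss is a standard hashing ingredient: it penalizes the gap between a real-valued embedding and its nearest binary code. A minimal sketch (my own, not the paper's exact formulation):

    import torch

    def quantization_loss(embedding):
        # Push each real-valued embedding toward its nearest ±1 code.
        binary = torch.sign(embedding).detach()
        return ((embedding - binary) ** 2).mean()

    # Summed over modalities so image and text codes are equally
    # "binarizable" before they are compared.
    loss = quantization_loss(torch.randn(8, 64)) + quantization_loss(torch.randn(8, 64))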

DEEP-CSSR: Scene Classification using Category-specific Salient Region with Deep Features
Mengshi Qi, Yunhong Wang
IEEE ICIP, 2016   (Oral Presentation)
pdf / press / bibtex

In this paper, we introduce a novel framework for scene classification using category-specific salient regions (CSSR) with deep CNN features, called Deep-CSSR.

Preprints
Unsupervised Domain Adaptation with Temporal-Consistent Self-Training for 3D Hand-Object Joint Reconstruction
Mengshi Qi, Edoardo Remelli, Mathieu Salzmann, Pascal Fua
arXiv, 2020
pdf / press / bibtex

In this paper, we introduce an effective approach to exploit 3D geometric constraints within a cycle generative adversarial network (CycleGAN) to perform domain adaptation. Furthermore, we propose to enforce short- and long-term temporal consistency to fine-tune the domain-adapted model in a self-supervised fashion.
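
To make the consistency idea concrete, the sketch below penalizes disagreement between predictions on adjacent frames (short-term) and on frames several steps apart (long-term); the paper's actual constraints differ, so treat this purely as an illustration:

    import torch

    def temporal_consistency_loss(preds, long_gap=5):
        # preds: (time, dim) per-frame predictions, e.g. 3D hand-object poses.
        short_term = ((preds[1:] - preds[:-1]) ** 2).mean()
        long_term = ((preds[long_gap:] - preds[:-long_gap]) ** 2).mean()
        return short_term + long_term

    loss = temporal_consistency_loss(torch.randn(16, 63))  # 16-frame sequence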

Service
Conference Reviewer, ICCV 2019-2021/CVPR 2020-2022/ECCV 2020/ICML 2021/ICLR 2021-2022/NeurIPS 2020-2021/ACM MM 2021

Journal Reviewer, IJCV/TIP/TMM/TCSVT/PR/ACM Computing Surveys

Senior PC Member, IJCAI 2021

PC Member, AAAI 2020-2022

IEEE Member and ACM Member




Thanks to this awesome guy.