He is currently a third-year Ph.D. student in the Research School of Computer Science, The Australian National University. On the one hand, he is a passionate newcomer to academic research with interests in many deep learning topics, particularly video understanding and human action recognition. On the other hand, he is an active full-stack web developer. He is currently working on a research project supervised by Professor Stephen Gould, Dr. Yizhak Ben-Shabat, Dr. Anoop Cherian, and Dr. Cristian Rodriguez. Before that, in 2021, he received his bachelor’s degrees in Advanced Computing (Honours) and Computer Science and Technology from The Australian National University and Shandong University, Weihai, respectively.
Ph.D. of Computer Science, 2022 - Present
The Australian National University
Bachelor of Advanced Computing (Honours), 2019 - 2021
The Australian National University
Bachelor of Computer Science and Technology, 2017 - 2019
Shandong University, Weihai
The paper “Temporally Grounding Instructional Diagrams in Unconstrained Videos” introduces a method for simultaneously localizing multiple instructional diagram queries in videos, addressing the limitations of current approaches that handle queries individually. The proposed method uses composite queries that combine visual features and positional embeddings, reducing overlaps and correcting temporal misalignment. Evaluated on the IAW and YouCook2 datasets, this approach significantly improves grounding accuracy by leveraging self-attention and cross-attention mechanisms, outperforming existing methods while preserving the temporal structure of instructional steps. (Generated by ChatGPT-4o.)
This paper introduces a supervised contrastive learning approach that learns to align videos with the subtle details of assembly diagrams, guided by a set of novel losses. To study this problem and evaluate the effectiveness of their method, the authors introduce a new dataset, IAW (IKEA Assembly in the Wild), consisting of 183 hours of videos from diverse furniture assembly collections and nearly 8,300 illustrations from the associated instruction manuals, annotated with ground-truth alignments. They define two tasks on this dataset: first, nearest-neighbor retrieval between video segments and illustrations, and second, alignment of instruction steps with the segments of each video. Extensive experiments on IAW demonstrate the superior performance of their approach against alternatives. (Generated by New Bing.)
Since 2022
Since 2017
Since 2017
Since 2017