Thesis defences

PhD Oral Exam - Ali Ghelmani Rashid Abad, Information and Systems Engineering

Vision-Based Construction Activity Recognition Using Supervised and Self-Supervised Methods


Date & time
Friday, March 28, 2025
10 a.m. – 1 p.m.
Cost

This event is free

Organization

School of Graduate Studies

Contact

Dolly Grewal

Accessible location

Yes

When studying for a doctoral degree (PhD), candidates submit a thesis that provides a critical review of the current state of knowledge of the thesis subject as well as the student’s own contributions to the subject. The distinguishing criterion of doctoral graduate research is a significant and original contribution to knowledge.

Once accepted, the candidate presents the thesis orally. This oral exam is open to the public.

Abstract

Tracking and monitoring the activities of construction entities, such as workers and equipment, on construction sites is crucial for assessing their performance and productivity. However, manual monitoring is demanding, time-consuming, and susceptible to inaccuracies. To address this, numerous automated computer vision (CV)-based methods have been developed to detect construction entities and classify their activities. Recently, single-stage activity recognition methods that simultaneously analyze spatial and temporal information have been proposed in the construction domain. While these methods outperform multi-stage approaches and alleviate their limitations, they still suffer from a significant drawback: relatively low per-frame activity recognition and localization accuracy. This limitation necessitates additional post-processing to link the per-frame detection results into the corresponding action tubes, which in turn reduces the real-time applicability of these methods for simultaneously detecting the activities of multiple construction entities.
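
To make the post-processing step mentioned above concrete, the following is a minimal sketch, not the author's implementation, of how per-frame activity detections are commonly linked into action tubes: boxes in consecutive frames are greedily matched by class label and Intersection-over-Union (IoU). All names, data structures, and thresholds here are illustrative assumptions.

# Minimal sketch (Python): greedy IoU-based linking of per-frame detections
# into action tubes. Illustrative only; thresholds and structures are assumed.

from dataclasses import dataclass, field


@dataclass
class Detection:
    box: tuple          # (x1, y1, x2, y2) in pixels
    label: str          # predicted activity class, e.g. "digging"
    score: float        # per-frame confidence


@dataclass
class Tube:
    label: str
    detections: list = field(default_factory=list)  # one Detection per linked frame


def iou(a, b):
    """Intersection-over-Union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def link_detections(frames, iou_thresh=0.5):
    """Greedily extend tubes with same-label, high-IoU detections, frame by frame."""
    tubes = []
    for dets in frames:                      # frames: list of per-frame detection lists
        unmatched = list(dets)
        for tube in tubes:
            last = tube.detections[-1]
            best = max(
                (d for d in unmatched if d.label == tube.label),
                key=lambda d: iou(d.box, last.box),
                default=None,
            )
            if best is not None and iou(best.box, last.box) >= iou_thresh:
                tube.detections.append(best)
                unmatched.remove(best)
        for d in unmatched:                  # leftover detections start new tubes
            tubes.append(Tube(label=d.label, detections=[d]))
    return tubes

Because this linking must run over the whole detection sequence, it adds latency on top of the per-frame model, which is why the abstract treats it as an obstacle to real-time use.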

Another major disadvantage of current state-of-the-art construction equipment activity recognition methods is their reliance on supervised learning, which requires large labeled datasets for each type of equipment and activity that are costly and time-consuming to create. This is particularly true for activity recognition and localization, which requires frame-level annotations. To address this challenge, many self-supervised deep learning methods have been proposed in the CV domain; they exploit abundant unlabeled data to reduce annotation cost by creating labels from the input data itself. However, the assumption that abundant unlabeled data are available limits the applicability of current self-supervised methods in the construction domain, where the number of videos covering the various activities of different construction equipment is also limited.
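
To illustrate the general self-supervised principle described above (labels created from the unlabeled videos themselves), the following is a minimal sketch assuming PyTorch. The pretext task shown, predicting whether a clip's frames are in their natural temporal order or shuffled, is a generic example of the idea, not the specific SSL method developed in the thesis; the module names and dimensions are assumptions.

# Minimal sketch (Python/PyTorch): a temporal-order pretext task as an example
# of self-supervision. Illustrative only; not the thesis's SSL method.

import torch
import torch.nn as nn


class OrderPredictionHead(nn.Module):
    """Binary classifier on top of a video backbone: ordered vs. shuffled clip."""

    def __init__(self, backbone: nn.Module, feat_dim: int):
        super().__init__()
        self.backbone = backbone          # any clip encoder returning (B, feat_dim)
        self.classifier = nn.Linear(feat_dim, 2)

    def forward(self, clips):             # clips: (B, C, T, H, W)
        return self.classifier(self.backbone(clips))


def make_pretext_batch(clips):
    """Create pseudo-labels from the data itself: 1 = ordered, 0 = shuffled."""
    b, c, t, h, w = clips.shape
    labels = torch.randint(0, 2, (b,))
    out = clips.clone()
    for i in range(b):
        if labels[i] == 0:                # shuffle frames along the time axis
            out[i] = clips[i][:, torch.randperm(t)]
    return out, labels


# Training step (sketch): no manual annotations are required.
# model = OrderPredictionHead(backbone=my_clip_encoder, feat_dim=512)  # assumed names
# x, y = make_pretext_batch(unlabeled_clips)
# loss = nn.functional.cross_entropy(model(x), y)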

To overcome the aforementioned limitations, parallel frameworks are proposed in this research for developing a general construction entity activity recognition method. The main objectives of this research are:

(1) to tackle the data annotation requirements of existing supervised methods by utilizing Self-Supervised Learning (SSL) to leverage the information available in unlabeled construction site videos;

(2) to develop and apply an SSL method specifically tailored to limited-data scenarios in the construction domain, in contrast to SSL methods developed in the CV domain that depend on the availability of abundant unlabeled data;

(3) to develop a supervised, single-stage construction equipment activity recognition and localization method with high per-frame performance, thus removing the need for a post-processing step;

(4) to improve activity recognition and localization performance for complex and fast-paced activities (e.g., excavator swinging) by incorporating the dynamic information present in the temporal-gradient modality of input videos, combined with knowledge distillation, so that per-frame performance improves without increasing inference computation (see the sketch following this list);

(5) to enhance the multi-scale and generalization performance of the supervised method developed in the third objective through a custom pyramid architecture and a novel anchor-free localization method; and

(6) to provide accurate, real-time activity recognition and localization information for a diverse set of construction entities with significant size differences.
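
The following is a minimal sketch, assuming PyTorch, of the two ideas named in objective (4): the temporal-gradient modality (approximated here as simple frame differencing) and knowledge distillation from a temporal-gradient teacher into an RGB student so that inference cost does not grow. The network definitions, temperature, and weighting are illustrative assumptions, not the thesis's exact design.

# Minimal sketch (Python/PyTorch): temporal gradients and a standard
# distillation loss. Illustrative only; not the thesis's exact architecture.

import torch
import torch.nn.functional as F


def temporal_gradient(clip):
    """Frame-to-frame difference of an RGB clip shaped (B, C, T, H, W)."""
    return clip[:, :, 1:] - clip[:, :, :-1]


def distillation_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.5):
    """Blend hard-label cross-entropy with soft-target KL divergence."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * (temperature ** 2)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard


# Usage sketch: the teacher sees temporal gradients during training only;
# at inference time only the RGB student runs, so per-frame cost is unchanged.
# with torch.no_grad():
#     teacher_logits = teacher(temporal_gradient(rgb_clip))   # assumed models
# loss = distillation_loss(student(rgb_clip), teacher_logits, activity_labels)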

The developed SSL method is found to yield a 10.1% improvement over its supervised counterpart when using only 5% of the labels in the dataset, clearly indicating the potential of the proposed SSL method to improve model performance at no additional data annotation cost. Furthermore, the developed supervised single-stage method achieved per-frame excavator activity recognition and localization accuracies of 93.6% and 79.8%, respectively, thus eliminating the need for post-processing. This method was further enhanced to provide accurate, consistent, real-time performance for simultaneous activity recognition and localization of construction entities with significant scale variations. Notably, it attained a per-frame activity recognition accuracy of 94.67% and a localization accuracy of 87.35% for simultaneous recognition of excavator and worker activities, despite their substantial size differences. These results demonstrate the effectiveness and applicability of the developed method in providing real-time monitoring information.
