Falls From Heights - A Computer Vision-Based Approach For Safety Harness Detection

Automation in Construction 91 (2018) 53–61, https://doi.org/10.1016/j.autcon.2018.02.018
Received 15 September 2017; Received in revised form 21 January 2018; Accepted 7 February 2018

⁎ Corresponding author at: Dept. of Construction Management, School of Civil Engineering and Mechanics, Huazhong University of Science and Technology, Wuhan, Hubei, China. E-mail address: [email protected] (L. Ding).
Keywords:
Convolutional neural network
Falls from height
Harness
Unsafe behavior

Abstract

Falls from heights (FFH) are major contributors to injuries and deaths in construction. Yet, despite workers being made aware of the dangers associated with not wearing a safety harness, many forget or purposefully do not wear one when working at heights. To address this problem, this paper develops an automated computer vision-based method that uses two convolutional neural network (CNN) models to determine whether workers are wearing their harness when performing tasks at heights. The algorithms developed are: (1) a Faster R-CNN to detect the presence of a worker; and (2) a deep CNN model to identify the harness. A database of photographs of people working at heights was created from activities undertaken on several construction projects in Wuhan, China. The database was then used to train and test the developed networks. The precision and recall rates for the Faster R-CNN were 99% and 95%, and for the CNN models 80% and 98%, respectively. The results demonstrate that the developed method can accurately detect workers who are not wearing their harness. Thus, the computer vision-based approach developed can be used by construction and safety managers as a mechanism to proactively identify unsafe behavior and therefore take immediate action to mitigate the likelihood of a FFH occurring.
1. Introduction

Falls from heights (FFH) are a major problem in construction [1–6]. Research has revealed that FFH account for approximately 48% of serious injuries and 30% of fatalities [7]. Numerous safety policies and procedures have been established to protect people working at heights in construction [8]. For example, scaffolds/platforms and the use of fall prevention solutions such as travel restraint systems (e.g. lines and belts) are required when working above a certain height [9].

Yet, despite the considerable amount of research that has been undertaken and the implementation of policies, procedures and protection measures, FFH remain a pervasive problem, particularly for scaffolders and roofers [10]. In China, for example, people working above a height of two metres are required by law to use fall arrest equipment [10]. There has, however, been a reluctance from scaffolders to use a harness, in spite of its use being a legal requirement and workers being cognizant of their exposure to a fall [10]. Reasons for such non-compliance have been found to be attributable to discomfort while wearing the harness and the restrictions it places on movement [10]. While such reasons may well have a degree of validity, such workers tend to have a poor awareness and perception of risk. Thus, good communication, effective consultation, improved training and reasonable adjustments can often be enough to head off objections to wearing a harness.

But more fundamentally, behavioral and cultural change is required to address the reluctance to wear personal protective equipment (PPE), and this can take a considerable amount of time to implement. To expedite and enact behavioral change, it is suggested that real-time monitoring of the harnesses worn by people working at heights can contribute to preventing falls. Construction and safety managers require practical methods to monitor and ensure workers are using their harnesses, particularly scaffolders. However, the safety inspection process can be toilsome and is often undertaken intermittently [11]. As a result, safety compliance cannot be assured and the likelihood of a FFH remains a risk.

To address this problem, the research presented in this paper develops an automatic and non-invasive computer vision-based method to monitor the use of harnesses. Computer vision-based methods have been widely used in construction, for example, to track workers on-site [12,13], and for progress monitoring [14], productivity
analysis [15], safety and health monitoring [5], automated documentation [5], and postural ergonomic assessment [16].

In comparison with sensing techniques (e.g., Radio Frequency Identification (RFID), Global Positioning Systems (GPS) and Ultra-Wide Band (UWB)), which tend to be limited to providing location data for a specific entity being monitored, computer vision can provide a rich set of information (e.g., the locations and behaviors of project entities and site conditions) by analyzing images or videos [5]. While technologies such as RFID have been widely used in construction and applied to an array of PPE types [17], there has been an absence of research that has monitored the use of harnesses.

The paper commences by reviewing existing methods that have been used to prevent FFH in construction and then introduces a novel approach, based on convolutional neural networks (CNNs), that can be used to monitor the use of safety harnesses worn by people working at heights. The technical challenges of the developed CNN approach are presented and the implications for future research are identified.

2. Falls from heights

The unique, dynamic, and complex working environment of construction sites and non-standardized design and work procedures can increase workers' exposure to hazards [5]. The prevention of FFH has received a significant amount of attention from construction safety and health management researchers and professionals [18]. Undertaking regular safety inspections and risk assessments to identify hazards has been repeatedly identified in the literature as a core activity for preventing the occurrence of falls [7,19]. A comprehensive review of the FFH literature is therefore eschewed, as this can be found in Nadhim et al. [7]. For the purposes of brevity, key studies that are aligned with the research presented in this paper are drawn upon.

Strategies to prevent and mitigate the severity of injuries can be categorized as being passive or proactive [7]. Strategies of a passive nature are based on analyzing fall accident data to develop future prevention plans; for example, identifying those factors that have contributed to fatal occupational falls from accident reports and acquiring data from regular safety inspections [20]. FFH preventive measures that have been derived from an analysis of accident and autopsy records include [20]: (1) fixed barriers; (2) travel restraint systems (e.g., belts) and fall arrest systems (e.g., harnesses); and (3) fall containment systems (e.g., nets). Factors that have been found to contribute to roofers' FFH include cognitive slips and lapses, weather, and schedule demands [21]. The emergent risk factors contributing to FFH are often prioritized and then used to develop mitigation strategies [22,23]. For example, an automated Building Information Modeling-based safety checking platform that is integrated with safety risks has been developed, which supports fall prevention planning prior to the commencement of construction [24,25].

Proactive strategies are precautionary measures that place emphasis on safety training and education; for example, the implementation of specific fall protection training programs [19], and the design of short courses, seminars and talks that focus on the risks of working at heights with the aim of improving people's safety behavior. While the enforcement of regulations may increase the use of PPE [1], this is a reactive approach to addressing the issue of safety and does not necessarily change people's behaviors [5]. Hence, it is more important to influence the mind-sets, attitudes and culture (i.e. values and beliefs) of workers than to solve specific violations [26]. It has been suggested that effective measures to enhance the use of PPE are needed, especially in the context of FFH, as scaffolders are often reluctant to use their harness [10]. The purpose of harness monitoring is to ensure that the harness is being used correctly by workers and that an organization's safety and health plans and standards are being met.

3. Computer vision-based approaches

Computer vision is an interdisciplinary field of endeavour that deals with how computers can acquire a high-level understanding from digital images or videos. From an engineering perspective, it seeks to automate tasks that the human visual system can do. Vision-based applications have been developed to capture and process video [27–29]. This has been aided by the development of new algorithms (e.g. Faster R-CNN) that can be used to detect and track resources (e.g., people, plant and equipment), as well as identify the unsafe behavior of workers [30–36].

A fundamental tenet of computer vision is action recognition, which exploits handcrafted features (e.g., shapes) extracted from images or videos. To extract features of workers' actions, descriptors such as Histograms of Oriented Gradients (HOG) [37], Histograms of Optical Flow (HOF) [38], and bags-of-features (BoF) [39] have all been employed to compute features over images or videos. Hand-crafted feature-based methods usually employ a three-stage procedure, which consists of: (1) extraction; (2) representation; and (3) classification.

Image representation that is used to recognize human actions can extract features such as shapes and temporal motions from images. Action recognition features, however, need to contain rich information so that a wide range of actions can be identified and analyzed. Techniques that can be used to analyze such features include classifier tools (e.g. Support Vector Machines (SVM)), temporal state-space models (e.g., Hidden Markov Models (HMM) and conditional random fields (CRF)), and detection-based methods (e.g., bag-of-words coding). However, the use of these approaches may lead to overfitting and therefore weaken the ability to derive generalizations from a dataset.
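To make this three-stage hand-crafted pipeline concrete, the following minimal sketch extracts HOG descriptors and trains a linear SVM classifier. It is illustrative only: it uses scikit-image and scikit-learn rather than the implementations in the studies cited above, the HOG parameters are common defaults, and the image patches and labels are random placeholders.

    import numpy as np
    from skimage.feature import hog
    from sklearn.svm import LinearSVC

    # Placeholder data: 20 grayscale image patches (64 x 64) with binary labels.
    rng = np.random.default_rng(0)
    images = rng.random((20, 64, 64))
    labels = rng.integers(0, 2, size=20)

    # (1) Extraction and (2) representation: each patch becomes a HOG descriptor.
    descriptors = np.array([
        hog(img, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
        for img in images
    ])

    # (3) Classification: a linear SVM separates the two classes of descriptors.
    classifier = LinearSVC().fit(descriptors, labels)
    predictions = classifier.predict(descriptors)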
Another approach that is often used to collect motion data from stereo videos and to reconstruct a three-dimensional (3D) skeleton model is the use of depth sensors (e.g., Kinect™) [5,40–45]. Kinect™ and multiple video cameras have been used to monitor the behavior of workers by estimating the positioning of individual joints in 3D [41–45]. This method provides a useful way to obtain accurate motion data. More specifically, it provides the ability to record, model, and analyze the human motion that occurs when an unsafe act is performed. However, monitoring the positioning of workers in 3D can require lengthy computational periods, and the captured motion may also be hampered by sensitivities to lighting [46,47].

4. Convolutional neural networks in construction

Deep learning methods that incorporate CNNs have been demonstrated to be effective for computer vision and pattern recognition [48,49]. LeCun et al. [50] developed LeNet-5 (a CNN model), which recognizes handwritten numbers, based on a dataset created by the Mixed National Institute of Standards and Technology. CNN models can effectively and automatically recognize features from static images by stacking multiple convolutional and pooling layers.

Krizhevsky et al. [49] were the first to achieve substantially high levels of image classification accuracy at the ImageNet Large Scale Visual Recognition Challenge (LSVRC) by training a deep CNN. Since the inception of Krizhevsky's deep CNN [49], almost all the effective algorithms that have been developed for image classification, object recognition, and visual tracking are based on this fundamental work. Krizhevsky et al. [49] used deep CNNs to classify 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into a thousand different classes. Hong et al. [51] proposed an online visual tracking algorithm that learns a discriminative saliency map using a CNN, which provided superior results compared to other state-of-the-art tracking algorithms (e.g., discrete Fourier transform (DSK), local sparse and K-selection (LSK), and circulant structure of tracking-by-detection with kernels (CSK)).

The success of region-based CNNs and region proposal methods has prompted advancements in object detection and their use in
construction for the purpose of visual detection [52,53]. For example, Ding et al. [54] proposed a hybrid learning model that integrated CNNs and long short-term memory (LSTM) to detect workers' unsafe behavior. Fang et al. [55] developed a computer vision method to detect non-hardhat-use by workers on jobsites. Cha et al. [56] developed a deep CNN architecture for detecting concrete cracks without extracting the defect's features. Similarly, Feng et al. [57] constructed a deep active learning system to detect and classify defects in structures from images. Roberts et al. [58] adapted a CNN to detect and classify cranes for monitoring safety hazards using Unmanned Aerial Vehicles. Notably, the use of a CNN requires a significantly large dataset for training so that it can learn [59].

A popular state-of-the-art detection network is the Faster R-CNN. It consists of a fully convolutional region proposal network (RPN) for proposing candidate regions and a downstream classifier [60]. Essentially, the Faster R-CNN is able to identify objects more accurately than any other deep learning method that has been proposed [52]. A major challenge confronting the detection of a harness is its color; it is often similar to that of a worker's clothing.

5. Research approach

In tackling this problem, a design science research approach is adopted to design and develop a CNN that can automatically detect workers who are not wearing their safety harness. Design science focuses on describing, explaining and predicting the current natural or social world, by not only understanding problems, but also designing solutions to improve human performance [61,62]. In doing so, design science can be used to develop the corresponding knowledge and applications to design and implement a product that has value to an organization [63,64]. The research process used to design and develop the CNN for detecting harness compliance by workers is presented in Fig. 1.

5.1. Design and development of the human-harness network

5.1.1. Design of the human network
The Faster R-CNN detection network is selected for this research due to its ability to accurately identify objects with a minimal time lag. The Faster R-CNN employs the Zeiler and Fergus network [65], which comprises five convolutional layers. The Faster R-CNN first acquires feature maps by extracting them from within the CNN, and then combines the RPN and the Fast R-CNN. As a result, it directly connects the proposals extracted by the RPN to the region of interest (ROI) pooling layer. This enables end-to-end target detection, thereby accelerating the speed at which the target is identified.

The core module of the Faster R-CNN is the RPN. The RPN slides an n × n spatial window over the feature map of the last convolution (Conv) layer of the original image. Each sliding window is mapped to a D-dimensional vector, which is then used as the input for two fully connected (fc) layers, namely, the box classification (cls) and box regression (reg) layers. The former provides the probability of object/non-object, and the latter provides the coordinates of the predicted object bounding box (Bbox). At each location of the n × n sliding window over the convolutional feature map, the cls layer outputs 2k scores that represent the probability of each of the k anchors belonging to the foreground or the background, and the reg layer outputs 4k coordinates that represent the transformation parameters of the real target frame.
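The shape bookkeeping of the RPN head can be illustrated with a few lines of PyTorch (used here purely for illustration; the authors implemented their networks in Caffe, and the feature-map size, channel count of 512 and D = 256 below are assumptions, not values reported in the paper). With k = 9 anchors per location, the cls branch emits 2k channels and the reg branch 4k channels over the H × W feature map:

    import torch
    import torch.nn as nn

    k = 9    # anchors per sliding-window location
    D = 256  # dimension of the intermediate feature vector (assumed)

    conv_feat = torch.randn(1, 512, 38, 50)                  # last conv feature map (H = 38, W = 50, assumed)
    sliding = nn.Conv2d(512, D, kernel_size=3, padding=1)    # the n x n (3 x 3) sliding window
    cls_layer = nn.Conv2d(D, 2 * k, kernel_size=1)           # objectness scores: 2k per location
    reg_layer = nn.Conv2d(D, 4 * k, kernel_size=1)           # box transformation parameters: 4k per location

    h = torch.relu(sliding(conv_feat))
    scores = cls_layer(h)  # shape: (1, 18, 38, 50)
    deltas = reg_layer(h)  # shape: (1, 36, 38, 50)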
5.1.2. Design of the safety harness network
In developing the process to detect a harness, the worker patches cropped from the output of the Faster R-CNN are re-entered into a purpose-built deep neural network. This is undertaken to develop the
Fig. 3. Structure of the harness detection network: the input image is processed by stacked convolutional layers and three fully connected (fc) layers to produce a probability vector.
CNN's ability to learn to classify workers wearing a harness from an image by using forward propagation and gradient descent. The process to crop an image for training the CNN is as follows. The output of the Faster R-CNN is

O_F = [[p, x1, y1, x2, y2]_1, [p, x1, y1, x2, y2]_2, ..., [p, x1, y1, x2, y2]_n]

and each detected worker is cropped from the original image:

    for i in range(len(O_F)):
        D[i] = I[x1[i]:x2[i], y1[i]:y2[i], :]

where p is the confidence of the classification result; (x1, y1) is the upper-left coordinate of the rectangle; (x2, y2) is the lower-right coordinate of the rectangle; n is the number of detected humans; I is the matrix of the original image, which has three dimensions (length, width, RGB); and D is the assembly of cropped worker image matrices.

As denoted in Fig. 2, the harness detection network consists of five convolutional layers, three fully connected layers, and one Softmax classifier layer. The Softmax function [56] used in the classification process is expressed as a probabilistic function:

P(y^{(i)} \mid x^{(i)}; W) = \begin{bmatrix} P(y^{(i)} = 1 \mid x^{(i)}; W) \\ P(y^{(i)} = 2 \mid x^{(i)}; W) \\ \vdots \\ P(y^{(i)} = n \mid x^{(i)}; W) \end{bmatrix} = \frac{1}{\sum_{j=1}^{n} e^{W_j^{T} x^{(i)}}} \begin{bmatrix} e^{W_1^{T} x^{(i)}} \\ e^{W_2^{T} x^{(i)}} \\ \vdots \\ e^{W_n^{T} x^{(i)}} \end{bmatrix} \quad (1)

where P(y^{(i)} = j \mid x^{(i)}; W) is the probability that the ith training example (out of m training examples) belongs to the jth class (out of n classes), given the weights W; and W_j^{T} x^{(i)} are the inputs of the Softmax layer.
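As a numerical illustration of Eq. (1) (a sketch with made-up values, not the authors' code): for n = 3 classes with Softmax inputs W_j^T x^(i) of (2.0, 1.0, 0.1), the class probabilities are the exponentials normalized by their sum:

    import numpy as np

    logits = np.array([2.0, 1.0, 0.1])  # W_j^T x^(i) for n = 3 classes
    probs = np.exp(logits) / np.exp(logits).sum()
    print(probs)  # approximately [0.659 0.242 0.099]; the entries sum to 1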
The structure of the proposed harness network can be seen in Fig. 3. A detailed description of the deep CNN used as the basis of the research presented in this paper can be found in Krizhevsky et al. [49]. This deep CNN architecture is selected due to its ability to accurately classify images; for example, it achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is better than previously developed state-of-the-art methods [49].

It can be seen in Fig. 2 that the processing and output dimensions of each layer of the network vary. The network accepts the original pixels of the input image and produces an output in the form of a probability vector, as noted in Fig. 3.

The convolution and pooling layers of the network play a critical role in the process of feature extraction. The convolutional layer is a feature extraction mechanism that forms the feature vector by applying a filter, or convolution kernel. Each layer applies a convolution operation and an activation function to the output of the previous layer in the forward propagation phase, which is formalized as:

X_{ij}^{k} = f\left((W^{k} \ast x)_{ij} + b_{k}\right) \quad (2)

where f is the activation function, b_k is the bias for the kth feature map, and W^k is the kernel connected to the kth feature map.

The input of the pooling layer is generally derived from the output of the previous convolution layer. Its main function is to maintain translation invariance (to transformations such as rotation, translation, and expansion) and to reduce the number of parameters so as to prevent overfitting.
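To illustrate Eq. (2) and the pooling step, the following sketch applies a single 3 × 3 kernel with a ReLU activation to a small input and then performs 2 × 2 max pooling. It is illustrative only, written in plain NumPy rather than the Caffe layers used in the actual implementation, and the kernel values are arbitrary:

    import numpy as np

    def conv2d_valid(x, W, b):
        """Eq. (2): X_ij = f((W * x)_ij + b), with f = ReLU and 'valid' padding."""
        kh, kw = W.shape
        out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(W * x[i:i + kh, j:j + kw]) + b
        return np.maximum(out, 0.0)  # ReLU activation

    def max_pool(x, size=2, stride=2):
        """Down-sample a feature map, keeping the maximum of each window."""
        h = (x.shape[0] - size) // stride + 1
        w = (x.shape[1] - size) // stride + 1
        return np.array([[x[i * stride:i * stride + size, j * stride:j * stride + size].max()
                          for j in range(w)] for i in range(h)])

    x = np.arange(36, dtype=float).reshape(6, 6)               # toy single-channel input
    W = np.array([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]])  # arbitrary edge-like kernel
    feature_map = conv2d_valid(x, W, b=0.1)                    # shape (4, 4)
    pooled = max_pool(feature_map)                             # shape (2, 2)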
6. Experiments

The developed CNN framework was tested in an experiment to detect people who were not wearing their harness (Fig. 4). All algorithms were run on a server with a 2.40 GHz Intel(R) Xeon(R) E5-2680 CPU, an NVIDIA(R) TITAN X GPU, and 64 GB RAM. The Python programming language was used for this research, along with the Caffe deep learning framework, which enabled labelled data to be fed in through a Python interface so that the calculation and updating of weights could be performed.
Fig. 4. Detection of workers not wearing a harness: (a) workers not wearing a harness; (b) workers identified as not wearing a harness.
To assess the performance of the detection algorithm, the following two measures are used: (1) precision (i.e. the fraction of relevant instances among those retrieved); and (2) recall (i.e. the fraction of relevant instances that have been retrieved over the total amount of relevant instances). These measures are widely used to determine the ability of a system to classify objects [66]. The precision and recall rates can be calculated using Eq. (3):

\text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN} \quad (3)

where TP is the number of true positives, FN is the number of false negatives, FP is the number of false positives, and TN is the number of true negatives.
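As a worked check of Eq. (3), using the worker-detection counts reported later in Table 1 (TP = 249, FP = 2, FN = 12): Precision = 249/(249 + 2) ≈ 0.99 and Recall = 249/(249 + 12) ≈ 0.95, which correspond to the 99% and 95% rates reported there.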
6.1. Experiment design and data collection

Prior to testing the algorithm, a comprehensive dataset of images of workers, equipment and materials from construction sites is needed for the purpose of training. However, there is limited access to such datasets, which has hindered the implementation of intelligent monitoring systems in construction [5]. Due to the unavailability of such datasets, one was created to overcome this limitation.

Using a monocular camera, a dataset of 770 images of people working at heights was collected from a number of construction sites in Wuhan, China. Video recordings were also collected of people working at various heights who were involved with welding steel beams and fixing reinforcement. The experimental dataset was randomly divided into two parts: (1) training and (2) testing. To avoid bias, different views, scales, occlusions, and illumination conditions needed to be considered when creating the collection of images that formed the datasets (Fig. 5). A subset of 693 randomly selected positive images (i.e. workers wearing a harness) and > 5000 negative images (i.e. workers not wearing a harness) were used to extract and generalize image features during the algorithms' training stage. A further subset of 77 images (workers wearing a harness) and 53 images (workers not wearing a harness) that included different scales, occlusions, illumination, and other characteristics was randomly selected as test data.

6.2. Human-harness system testing

6.2.1. Processing of worker detection
As noted above, a Faster R-CNN is used to accurately detect workers in real time, with an example presented in Fig. 6. The original image was input into the CNN to extract and generalize image features, which were subsequently shared by the RPN and the Fast R-CNN (FRCN) as their respective input. The RPN module was used to extract the region proposals from the convolutional network feature map, which enabled their target scores and regressed bounds to be acquired.

To deal with the different scales and aspect ratios of objects, anchors were introduced in the RPN. An anchor was placed in the center of each spatial window at each sliding location of the convolutional maps. Three different scales (128², 256², 512²) and aspect ratios (1:1, 1:2, 2:1) were set, and k = 9 anchors were placed at each location. Each proposal is parameterized to correspond to an anchor. If the size of the feature map in the last convolution layer is H × W, then the number of possible proposals in a feature map would be H × W × k.
The ROI classification and regression network modules pool and
process the feature map of the convolutional network so that target recognition and decision making can be achieved. This process is undertaken to share the same feature map between the RPN and the FRCN.

The research of Ren et al. [52] was drawn upon to train the Faster R-CNN. The parameters used to train the Faster R-CNN are: (a) a base learning rate of 0.001; (b) a step size of 10,000; and (c) a gamma of 0.1. For the purpose of tuning, the base learning rate and step size were adjusted, and after every 10,000 iterations the learning rate was dropped by the gamma factor of 0.1.
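These hyper-parameters correspond to a standard Caffe step-decay solver configuration. A sketch of what such a solver file might look like is given below; only the three parameters reported above are from the paper, while the file name and the remaining fields are illustrative assumptions:

    # solver.prototxt (illustrative sketch)
    base_lr: 0.001     # (a) base learning rate
    lr_policy: "step"  # drop the learning rate in steps
    stepsize: 10000    # (b) every 10,000 iterations...
    gamma: 0.1         # (c) ...multiply the learning rate by 0.1
    momentum: 0.9      # assumed default, not reported in the paper
    max_iter: 40000    # assumed, not reported in the paper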
6.2.2. Process to detect a harness
Fig. 6 (right side) identifies a red rectangular box surrounding the body of a person, which indicates that the program is searching for a harness. On detection of a worker, the coordinates of the rectangular box can be obtained. The pixels within the rectangular box can then be fed into the detection model as inputs. The harness in the images was manually identified to form the positive training samples. When the network accepts the original pixels of the input image, an output is generated based on the results obtained from the picture.
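Putting the two stages together, the inference loop can be sketched as follows. This is illustrative Python pseudocode: detect_workers and classify_harness are hypothetical stand-ins for the Faster R-CNN and the harness CNN, and the 0.8 acceptance threshold is the value derived in Section 6.3.

    # Illustrative two-stage pipeline: worker detection, then harness classification.
    THRESHOLD = 0.8  # acceptance threshold on the confidence p (see Section 6.3)

    def monitor(image, detect_workers, classify_harness):
        """detect_workers: image -> [(p, x1, y1, x2, y2), ...]  (Faster R-CNN)
        classify_harness: crop -> probability that a harness is worn (deep CNN)."""
        alerts = []
        for p, x1, y1, x2, y2 in detect_workers(image):
            if p < THRESHOLD:
                continue                       # reject low-confidence detections
            crop = image[x1:x2, y1:y2, :]      # worker patch, using the paper's
                                               # I[x1:x2, y1:y2, :] indexing convention
            if classify_harness(crop) < THRESHOLD:
                alerts.append((x1, y1, x2, y2))  # worker appears not to wear a harness
        return alerts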
Taking the first convolution layer of this network as an example, its input is an image with a size of 227 × 227 × 3, and it uses 96 filters with a size of 11 × 11. A total of 96 feature maps with a size of 55 × 55 can be obtained using the convolution in Eq. (2). A visualization is then produced by printing the output of the first convolution layer, as shown in the middle of Fig. 7. Each convolution kernel captures characteristics of an image, such as the directions of edges, and provides a feature map that records a different aspect of the image.

Fig. 7. Feature map visualization: (a) original map; (b) output map of the first convolution layer; (c) output map of the first pooling layer.

In the case of the first pooling layer, its input is the 96 feature maps with a size of 55 × 55. After down-sampling with a 3 × 3 pooling window, 96 feature maps with a size of 27 × 27 are obtained (Fig. 7c).
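As a check on these dimensions, the standard output-size formula floor((N − F)/S) + 1 reproduces them: for the first convolution layer, (227 − 11)/4 + 1 = 55 per side (assuming the stride of 4 used in the reference architecture of Krizhevsky et al. [49], which is not stated explicitly here), giving 96 maps of 55 × 55; for the first pooling layer, (55 − 3)/2 + 1 = 27 per side (assuming a pooling stride of 2), giving 96 maps of 27 × 27.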
6.3. Human-safety network evaluation

The results of the automatic detection of the images that were recorded are presented in Table 1. At a high confidence threshold, ambiguous detections are rejected, yielding high precision but low recall; conversely, at a low confidence threshold, ambiguous detections are accepted, yielding high recall but low precision. To obtain a good precision-recall trade-off, an acceptance threshold value of 0.8 is derived [52]. A value < 0.8 indicates that a person has not been detected or has been falsely detected. As observed in Fig. 5, each image in the dataset is taken at a unique scale and in a specific pose, illumination, and occlusion condition. A red rectangular box identifies workers without a harness, whereas a green rectangular box identifies workers wearing their harness. A sample of the harness detection results is presented in Figs. 8 and 9.

Table 1
Detection results with a testing dataset in which workers are randomly taken (detection with p > 0.8).

Metric                   | Worker detection | Harness detection
Correctly detected (TP)  | 249              | 198
Mis-detected (FP)        | 2                | 51
Not detected (FN)        | 12               | 2
Precision                | 99%              | 80%
Recall                   | 95%              | 98%

Note: TP is defined as the number of correctly detected workers/harnesses, FP is the number of incorrectly detected workers/harnesses, and FN is the number of undetected workers/harnesses.

Table 1 indicates that the developed CNN models are able to successfully detect workers and their harnesses within images. Notably, however, workers were detected more easily than their harnesses due to occlusions. There was also similarity between the colors of the workers' clothes and their harnesses. However, the use of the proposed CNNs can improve the ability to train on data collected under varying conditions and therefore increase the likelihood of detecting a harness. Several examples of TPs, FPs, and FNs for the detection of the harness are shown in Figs. 10 and 11.
It needs to be acknowledged that several limitations exist. The study was limited to a select number of activities undertaken at heights. Future research is therefore required to increase the scope of activities that are examined and to develop an automatic monitoring system that can recognize unsafe behaviors. In some instances, the models were not able to detect workers and their harnesses. This was due to a number of issues: the sample size and the harness's color had a direct effect on the CNNs' recognition ability. Learning requires numerous samples to achieve good testing results; otherwise, overfitting may arise. In the experiment, the sample of pictures for training the networks was limited to 693, which resulted in the harness not being recognized in some cases. This may, in part, be attributable to the limited number of images possessing varying scales, which may have hindered the networks' ability to learn and recognize the harness. In addition, cluttered construction sites can occlude the harness and impede its recognition [67].

9. Conclusion

A new approach is proposed to detect individual workers who do not wear their harnesses while working at heights on construction sites. Two algorithms were developed: (1) a Faster R-CNN to detect the presence of workers; and (2) CNN models to identify the harness attached to workers. The results achieved demonstrated that, by combining the CNNs, a high degree of accuracy could be attained in detecting workers who were not wearing their harness.

The precision and recall rates for the Faster R-CNN were 99% and 95%, respectively. In the case of the CNN models, the precision and recall rates were 80% and 98%, respectively. It was revealed that the Faster R-CNN provided better accuracy for detecting people than the CNN models did for the harness. The Faster R-CNN framework enabled the deep models to focus on annotating a worker. In addition, computation time was reduced by mapping the feature map and sharing it between the RPN and the FRCN. This enabled the features extracted by the RPN to be directly connected to the ROI pooling layer, ensuring the CNN could quickly detect the worker.

Despite not being able to recognize the harness with 100% accuracy, the developed deep CNN approach can provide site management with several benefits in their everyday practice. Firstly, safety behavior can be monitored without disturbing people while they are working. Secondly, a wide range of working areas can be simultaneously monitored, which can reduce the costs and time associated with inspections. Having a real-time monitoring system in place to monitor people working at heights provides a mechanism to reduce falls and improve safety.

Acknowledgments

This research is supported in part by a major project of The National Social Science Key Fund of China (Grant No. 13&ZD175), by the National Natural Science Foundation of China (Grant Nos. 71732001, 51678265 and 71301059), and by the Fundamental Research Funds for the Central Universities (Grant Nos. 2017KFYXJJ134 and 2015ZDTD023). The authors would also like to acknowledge the constructive and insightful comments provided by the Associate Editor and four anonymous reviewers, which have helped improve the quality of this manuscript.

References

[1] F.P. Rivara, D.C. Thompson, Prevention of falls in the construction industry: evidence for program effectiveness, Am. J. Prev. Med. 18 (4) (2000) 23–26, http://dx.doi.org/10.1016/S0749-3797(00)00137-9.
[2] X. Huang, J. Hinze, Analysis of construction worker fall accidents, J. Constr. Eng. Manag. 129 (3) (2003) 262–271, http://dx.doi.org/10.1061/(asce)0733-9364(2003)129:3(262).
[3] S.M. Whitaker, R.J. Graves, M. James, P. McCann, Safety with access scaffolds: development of a prototype decision aid based on accident analysis, J. Saf. Res. 34 (3) (2003) 249–261, http://dx.doi.org/10.1016/s0022-4375(03)00025-2.
[4] P. Becker, M. Fullen, M. Akladios, G. Hobbs, Prevention of construction falls by organizational intervention, Inj. Prev. 7 (Suppl. 1) (2001) i64–i67, http://dx.doi.org/10.1136/ip.7.suppl_1.i64.
[5] J. Seo, S. Han, S. Lee, H. Kim, Computer vision techniques for construction safety and health monitoring, Adv. Eng. Inform. 29 (2) (2015) 239–251, http://dx.doi.org/10.1016/j.aei.2015.02.001.
[6] M.G. Yang, An empirical investigation of the average deployment force of personal fall arrest energy absorbers, J. Constr. Eng. Manag. 141 (1) (2015), http://dx.doi.org/10.1061/(asce)co.1943-7862.0000910.
[7] E.A. Nadhim, C. Hon, B. Xia, I. Stewart, D. Fang, Falls from height in the construction industry: a critical review of the scientific literature, Int. J. Environ. Res. Public Health 13 (7) (2016) 638, http://dx.doi.org/10.3390/ijerph13070638.
[8] M.-W. Park, N. Elsafty, Z. Zhu, Hardhat-wearing detection for enhancing on-site safety of construction workers, J. Constr. Eng. Manag. 141 (9) (2015) 04015024, http://dx.doi.org/10.1061/(asce)co.1943-7862.0000974.
[9] K. Hu, H. Rahmandad, T. Smith-Jackson, W. Winchester, Factors influencing the risk of falls in the construction industry: a review of the evidence, Constr. Manag. Econ. 29 (4) (2011) 397–416, http://dx.doi.org/10.1080/01446193.2011.558104.
[10] M. Zhang, D. Fang, A cognitive analysis of why Chinese scaffolders do not use safety harnesses in construction, Constr. Manag. Econ. 31 (3) (2013) 207–222, http://dx.doi.org/10.1080/01446193.2013.764000.
[11] K. Shrestha, P.P. Shrestha, D. Bajracharya, E.A. Yfantis, Hard-hat detection for construction safety visualization, J. Constr. Eng. 2015 (2015), http://dx.doi.org/10.1155/2015/721380.
[12] M. Golparvar-Fard, A. Heydarian, J.C. Niebles, Vision-based action recognition of earthmoving equipment using spatio-temporal features and support vector machine classifiers, Adv. Eng. Inform. 27 (4) (2013) 652–663, http://dx.doi.org/10.1016/j.aei.2013.09.001.
[13] M.-W. Park, I. Brilakis, Continuous localization of construction workers via integration of detection and tracking, Autom. Constr. 72 (2016) 129–142, http://dx.doi.org/10.1016/j.autcon.2016.08.039.
[14] K.K. Han, M. Golparvar-Fard, Appearance-based material classification for monitoring of operation-level construction progress using 4D BIM and site photologs, Autom. Constr. 53 (2015) 44–57, http://dx.doi.org/10.1016/j.autcon.2015.02.007.
[15] J. Gong, C.H. Caldas, Computer vision-based video interpretation model for automated productivity analysis of construction operations, J. Comput. Civ. Eng. 24 (3) (2009) 252–263, http://dx.doi.org/10.1061/(asce)cp.1943-5487.0000027.
[16] J. Seo, K. Yin, S. Lee, Automated postural ergonomic assessment using a computer vision-based posture classification, Construction Research Congress, San Juan, Puerto Rico, USA, 2016, pp. 809–818, http://dx.doi.org/10.1061/9780784479827.082.
[17] S. Barro-Torres, T.M. Fernández-Caramés, H.J. Pérez-Iglesias, C.J. Escudero, Real-time personal protective equipment monitoring system, Comput. Commun. 36 (1) (2012) 42–50, http://dx.doi.org/10.1016/j.comcom.2012.01.005.
[18] E. Cheung, A.P. Chan, Rapid demountable platform (RDP) - a device for preventing fall from height accidents, Accid. Anal. Prev. 48 (2012) 235–245, http://dx.doi.org/10.1016/j.aap.2011.05.037.
[19] H. Jebelli, C.R. Ahn, T.L. Stentz, Fall risk analysis of construction workers using inertial measurement units: validating the usefulness of the postural stability metrics in construction, Saf. Sci. 84 (2016) 161–170, http://dx.doi.org/10.1016/j.ssci.2015.12.012.
[20] C.-F. Chi, T.-C. Chang, H.-I. Ting, Accident patterns and prevention measures for fatal occupational falls in the construction industry, Appl. Ergon. 36 (4) (2005) 391–400, http://dx.doi.org/10.1016/j.apergo.2004.09.011.
[21] H. Cakan, E. Kazan, M. Usmen, Investigation of factors contributing to fatal and nonfatal roofer fall accidents, Int. J. Constr. Educ. Res. 10 (4) (2014) 300–317, http://dx.doi.org/10.1080/15578771.2013.868843.
[22] O. Aneziris, I.A. Papazoglou, H. Baksteen, M. Mud, B. Ale, L.J. Bellamy, A.R. Hale, A. Bloemhoff, J. Post, J. Oh, Quantified risk assessment for fall from height, Saf. Sci. 46 (2) (2008) 198–220, http://dx.doi.org/10.1016/j.ssci.2007.06.034.
[23] P. Kines, Construction workers' falls through roofs: fatal versus serious injuries, J. Saf. Res. 33 (2) (2002) 195–208, http://dx.doi.org/10.1016/s0022-4375(02)00019-1.
[24] S. Zhang, J. Teizer, J.K. Lee, C.M. Eastman, M. Venugopal, Building information modeling (BIM) and safety: automatic safety checking of construction models and schedules, Autom. Constr. 29 (4) (2013) 183–195, http://dx.doi.org/10.1016/j.autcon.2012.05.006.
[25] S. Zhang, K. Sulankivi, M. Kiviniemi, I. Romo, C.M. Eastman, J. Teizer, BIM-based fall hazard identification and prevention in construction safety planning, Saf. Sci. 72 (8) (2015) 31–45, http://dx.doi.org/10.1016/j.ssci.2014.08.001.
[26] K.J. Nielsen, A comparison of inspection practices within the construction industry between the Danish and Swedish Work Environment Authorities, Constr. Manag. Econ. 35 (3) (2017) 154–169, http://dx.doi.org/10.1080/01446193.2016.1231407.
[27] T. Guan, L. Duan, J. Yu, Y. Chen, X. Zhang, Real-time camera pose estimation for wide-area augmented reality applications, IEEE Comput. Graph. Appl. 31 (3) (2011) 56–68, http://dx.doi.org/10.1109/mcg.2010.23.
[28] H. Pan, T. Guan, Y. Luo, L. Duan, Y. Tian, L. Yi, Y. Zhao, J. Yu, Dense 3D reconstruction combining depth and RGB information, Neurocomputing 175 (PA) (2016) 644–651, http://dx.doi.org/10.1016/j.neucom.2015.10.104.
[29] Z. Wang, J. Yu, Y. He, T. Guan, Affection arousal based highlight extraction for soccer video, Multimedia Tools Appl. 73 (1) (2014) 519–546, http://dx.doi.org/10.1007/s11042-013-1619-1.
[30] Y. Yu, H. Guo, Q. Ding, H. Li, M. Skitmore, An experimental study of real-time identification of construction workers' unsafe behaviors, Autom. Constr. (2017), http://dx.doi.org/10.1016/j.autcon.2017.05.002.
[31] H. Kim, K. Kim, H. Kim, Vision-based object-centric safety assessment using fuzzy inference: monitoring struck-by accidents with moving objects, J. Comput. Civ. Eng. 30 (4) (2016), http://dx.doi.org/10.1061/(asce)cp.1943-5487.0000562.
[32] Z. Zhu, X. Ren, Z. Chen, Visual tracking of construction jobsite workforce and equipment with particle filtering, J. Comput. Civ. Eng. 30 (6) (2016), http://dx.doi.org/10.1061/(asce)cp.1943-5487.0000573.
[33] L. Ma, R. Sacks, R. Zeibak-Shini, Information modeling of earthquake-damaged reinforced concrete structures, Adv. Eng. Inform. 29 (3) (2015) 396–407, http://dx.doi.org/10.1016/j.aei.2015.01.007.
[34] R. Zeibak-Shini, R. Sacks, L. Ma, S. Filin, Towards generation of as-damaged BIM models using laser-scanning and as-built BIM: first estimate of as-damaged locations of reinforced concrete frame members in masonry infill structures, Adv. Eng. Inform. 30 (3) (2016) 312–326, http://dx.doi.org/10.1016/j.aei.2016.04.001.
[35] L. Ma, R. Sacks, U. Kattel, T. Bloch, 3D object classification using geometric features and pairwise relationships, Comput. Aided Civ. Inf. Eng. (2017), http://dx.doi.org/10.1111/mice.12336.
[36] R. Sacks, A. Kedar, A. Borrmann, L. Ma, I. Brilakis, P. Hüthwohl, S. Daum, U. Kattel, R. Yosef, T. Liebich, B.E. Barutcu, S. Muhic, SeeBridge as next generation bridge inspection: overview, information delivery manual and model view definition, Autom. Constr. 90 (2018) 134–145, http://dx.doi.org/10.1016/j.autcon.2018.02.033.
[37] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, June 20–26, 2005, pp. 886–893, http://dx.doi.org/10.1109/cvpr.2005.177.
[38] H. Wang, C. Schmid, Action recognition with improved trajectories, IEEE International Conference on Computer Vision, 2014, pp. 3551–3558, http://dx.doi.org/10.1109/iccv.2013.441.
[39] T. Guan, Y. Fan, L. Duan, J. Yu, On-device mobile visual location recognition by using panoramic images and compressed sensing based visual descriptors, PLoS One 9 (6) (2014) e98806, http://dx.doi.org/10.1371/journal.pone.0098806.
[40] S.J. Ray, J. Teizer, Real-time construction worker posture analysis for ergonomics training, Adv. Eng. Inform. 26 (2) (2012) 439–455, http://dx.doi.org/10.1016/j.aei.2012.02.011.
[41] S. Han, S. Lee, A vision-based motion capture and recognition framework for behavior-based safety management, Autom. Constr. 35 (2013) 131–141, http://dx.doi.org/10.1016/j.autcon.2013.05.001.
[42] M. Liu, S. Han, S. Lee, Tracking-based 3D human skeleton extraction from stereo video camera toward an on-site safety and ergonomic analysis, Constr. Innov. 16 (3) (2016) 348–367, http://dx.doi.org/10.1108/ci-10-2015-0054.
[43] S. Han, S. Lee, F. Peña-Mora, Comparative study of motion features for similarity-based modeling and classification of unsafe actions in construction, J. Comput. Civ. Eng. 28 (5) (2014) A4014005, http://dx.doi.org/10.1061/(asce)cp.1943-5487.0000339.
[44] S.U. Han, M. Achar, S.H. Lee, F. Peña-Mora, Empirical assessment of a RGB-D sensor on motion capture and action recognition for construction worker monitoring, Vis. Eng. 1 (1) (2013) 6, http://dx.doi.org/10.1186/2213-7459-1-6.
[45] S. Han, S. Lee, F. Peña-Mora, Vision-based detection of unsafe actions of a construction worker: case study of ladder climbing, J. Comput. Civ. Eng. 27 (6) (2013) 635–644, http://dx.doi.org/10.1061/(asce)cp.1943-5487.0000279.
[46] D. Wang, F. Dai, X. Ning, Risk assessment of work-related musculoskeletal disorders in construction: state-of-the-art review, J. Constr. Eng. Manag. 141 (6) (2015), http://dx.doi.org/10.1061/(ASCE)CO.1943-7862.0000979.
[47] J. Gong, C.H. Caldas, C. Gordon, Learning and classifying actions of construction workers and equipment using Bag-of-Video-Feature-Words and Bayesian network models, Adv. Eng. Inform. 25 (4) (2011) 771–782, http://dx.doi.org/10.1016/j.aei.2011.06.002.
[48] J. Schmidhuber, Deep learning in neural networks: an overview, Neural Netw. 61 (2015) 85–117, http://dx.doi.org/10.1016/j.neunet.2014.09.003.
[49] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, International Conference on Neural Information Processing Systems, 2012, pp. 1097–1105, http://dx.doi.org/10.1145/3065386.
[50] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86 (11) (1998) 2278–2324, http://dx.doi.org/10.1109/9780470544976.ch9.
[51] S. Hong, T. You, S. Kwak, B. Han, Online tracking by learning discriminative saliency map with convolutional neural network, Comput. Sci. (2015) 597–606 (arXiv:1502.06796v1).
[52] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell. (2015) 91–99, http://dx.doi.org/10.1109/tpami.2016.2577031.
[53] Q. Fang, H. Li, X. Luo, L. Ding, T.M. Rose, W. An, Y. Yu, A deep learning-based method for detecting non-certified work on construction sites, Adv. Eng. Inf. 35 (2018) 56–68, http://dx.doi.org/10.1016/j.aei.2018.01.001.
[54] L. Ding, W. Fang, H. Luo, P.E.D. Love, B. Zhong, X. Ouyang, A deep hybrid learning model to detect unsafe behavior: integrating convolution neural networks and long short-term memory, Autom. Constr. 86 (2018) 118–124, http://dx.doi.org/10.1016/j.autcon.2017.11.002.
[55] Q. Fang, H. Li, X. Luo, L. Ding, H. Luo, T.M. Rose, W. An, Detecting non-hardhat-use by a deep learning method from far-field surveillance videos, Autom. Constr. 85 (2018) 1–9, http://dx.doi.org/10.1016/j.autcon.2017.09.018.
[56] Y.J. Cha, W. Choi, O. Büyüköztürk, Deep learning-based crack damage detection using convolutional neural networks, Comput. Aided Civ. Inf. Eng. 32 (5) (2017) 361–378, http://dx.doi.org/10.1111/mice.12263.
[57] C. Feng, M.Y. Liu, C.C. Kao, T.Y. Lee, Deep active learning for civil infrastructure defect detection and classification, ASCE International Workshop on Computing in Civil Engineering, ASCE, Seattle, Washington, USA, 25–27 June 2017, 2017, pp. 298–306, http://dx.doi.org/10.1061/9780784480823.036.
[58] D. Roberts, T. Bretl, M. Golparvar-Fard, Detecting and classifying cranes using camera-equipped UAVs for monitoring crane-related safety hazards, ASCE International Workshop on Computing in Civil Engineering 2017, ASCE, Seattle, Washington, USA, 25–27 June 2017, 2017, pp. 442–449, http://dx.doi.org/10.1061/9780784480847.055.
[59] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, International Conference on Neural Information Processing Systems, 2012, pp. 1097–1105, http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.
[60] L. Zhang, L. Lin, X. Liang, K. He, Is faster R-CNN doing well for pedestrian detection? European Conference on Computer Vision, Springer, 2016, pp. 443–457, http://dx.doi.org/10.1007/978-3-319-46475-6_28.
[61] J.E.V. Aken, Management research based on the paradigm of the design sciences: the quest for field-tested and grounded technological rules, J. Manag. Stud. 41 (2) (2004) 219–246, http://dx.doi.org/10.1111/j.1467-6486.2004.00430.x.
[62] M. Chu, J. Matthews, P.E.D. Love, Integrating mobile building information modelling and augmented reality systems: an experimental study, Autom. Constr. 85 (2018) 305–316, http://dx.doi.org/10.1016/j.autcon.2017.10.032.
[63] J.E.V. Aken, Management research as a design science: articulating the research products of mode 2 knowledge production in management, Br. J. Manag. 16 (1) (2005) 19–36, http://dx.doi.org/10.1111/j.1467-8551.2005.00437.x.
[64] G.L. Geerts, A design science research methodology and its application to accounting information systems research, Int. J. Account. Inf. Syst. 12 (2) (2011) 142–151, http://dx.doi.org/10.1016/j.accinf.2011.02.004.
[65] M.D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, European Conference on Computer Vision, Springer, 2014, pp. 818–833, http://dx.doi.org/10.1007/978-3-319-10590-1_53.
[66] D.M.W. Powers, Evaluation: from precision, recall and F-factor to ROC, informedness, markedness & correlation, J. Mach. Learn. Technol. 2 (2011) 2229–3981, http://hdl.handle.net/2328/27165.
[67] J. Yang, M.-W. Park, P.A. Vela, M. Golparvar-Fard, Construction performance monitoring via still images, time-lapse photos, and video streams: now, tomorrow, and the future, Adv. Eng. Inform. 29 (2) (2015) 211–224, http://dx.doi.org/10.1016/j.aei.2015.01.011.