A Unified Framework for Automated Person Re-Identification
Transport and Communications Science Journal, Vol. 71, Issue 7 (09/2020), 868-880

Hong Quan Nguyen (1,3), Thuy Binh Nguyen (1,4), Duc Long Tran (2), Thi Lan Le (1,2)

(1) School of Electronics and Telecommunications, Hanoi University of Science and Technology, Hanoi, Vietnam
(2) International Research Institute MICA, Hanoi University of Science and Technology, Hanoi, Vietnam
(3) Viet-Hung Industry University, Hanoi, Vietnam
(4) Faculty of Electrical-Electronic Engineering, University of Transport and Communications, Hanoi, Vietnam

ARTICLE INFO
TYPE: Research Article
Received: 31/8/2020
Revised: 26/9/2020
Accepted: 28/9/2020
Published online: 30/9/2020
https://doi.org/10.47869/tcsj.71.7.11
* Corresponding author. Email: thuybinh_ktdt@utc.edu.vn

Abstract. Along with the strong development of camera networks, video analysis systems have become more and more popular and are applied in various practical applications. In this paper, we focus on the person re-identification (person ReID) task, a crucial step in video analysis systems. The purpose of person ReID is to associate multiple images of a given person as he or she moves through a non-overlapping camera network. Many efforts have been devoted to person ReID. However, most studies on person ReID deal only with well-aligned bounding boxes that are detected manually and treated as perfect inputs for person ReID. In fact, when building a fully automated person ReID system, the quality of the two preceding steps, person detection and tracking, may strongly affect person ReID performance. The contributions of this paper are two-fold. First, a unified framework for person ReID based on deep learning models is proposed. In this framework, a deep neural network for person detection is coupled with a deep-learning-based tracking method. In addition, features extracted from an improved ResNet architecture are proposed for person representation to achieve higher ReID accuracy. Second, our self-built dataset is introduced and employed to evaluate all three steps of the fully automated person ReID framework.

Keywords. Person re-identification, human detection, tracking

© 2020 University of Transport and Communications

1. INTRODUCTION

Along with the strong development of camera networks, video analysis systems have become more and more popular and are applied in various practical applications. In the early years, these systems were operated manually, which was time-consuming and tedious; moreover, their accuracy was low and it was difficult to retrieve information when needed. Fortunately, with the great help of image processing and pattern recognition, automatic techniques can now be used to solve this problem. An automatic video analysis system normally includes four main components: object detection, tracking, person ReID, and event/activity recognition. Nowadays, such systems are deployed in airports, shopping malls, and traffic management departments [1].

In this paper, we focus on a fully automatic person ReID system, which contains only the first three steps of the full video analysis system: person detection, tracking, and re-identification. The purpose of human detection is to create a bounding box containing an object in a given image, while tracking methods aim at connecting the detected bounding boxes of the same person. Finally, person ReID associates multiple images of the same person across different camera views.
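To make this three-stage structure concrete, the following minimal Python sketch shows how the stages hand data to one another. It is an illustration only: the types and names (`Detection`, `Tracklet`, `run_pipeline`, and the `detector`, `tracker`, `reid_matcher` callables) are hypothetical, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Detection:
    frame_id: int
    box: tuple    # (x, y, w, h) person bounding box
    score: float  # detector confidence

@dataclass
class Tracklet:
    track_id: int
    detections: List[Detection] = field(default_factory=list)

def run_pipeline(frames, detector, tracker, reid_matcher):
    """Fully automated person ReID: detect -> track -> re-identify.

    `detector`, `tracker` and `reid_matcher` stand in for, e.g., YOLOv3 or
    Mask R-CNN, DeepSORT, and a ResNet-feature tracklet matcher.
    """
    tracklets: List[Tracklet] = []
    for frame_id, frame in enumerate(frames):
        detections = detector(frame, frame_id)      # step 1: person detection
        tracklets = tracker(tracklets, detections)  # step 2: online tracking
    # step 3: match every completed tracklet against a gallery of tracklets
    return {t.track_id: reid_matcher(t) for t in tracklets}
```

The quality of each stage bounds the next: missed or partial boxes in step 1 become broken tracklets in step 2 and mismatches in step 3, which is precisely the effect studied in this paper.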
Although research on person ReID has achieved some important milestones [2], the problem still has to cope with various challenges, such as variations in illumination, pose, and viewpoint. Additionally, most studies on person ReID deal only with Regions of Interest (RoIs) that are extracted manually as high-quality, well-aligned bounding boxes. Meanwhile, several challenges arise in a unified person ReID framework in which these bounding boxes are detected and tracked automatically. For example, in the detection step, a bounding box might contain only part of the human body, occlusion occurs frequently, and a detected bounding box may contain more than one person. In the tracking step, the sudden appearance or disappearance of a pedestrian causes tracklet fragmentation and identity switches (ID switches), so a pedestrian's tracklet may be broken into several fragments, or a tracklet may include more than one individual. These errors reduce person ReID accuracy. This is the motivation for our study of a fully automated person ReID framework.

The contributions of this paper are two-fold. First, a unified framework for person ReID based on deep learning models is proposed. In this framework, among different models proposed for object detection (…)

Table 2. Performance on the FAPR dataset when employing YOLOv3 as a detector and DeepSORT as a tracker. Columns FP↓, FN↓, Rcll(%)↑, Prcn(%)↑ and F1-score(%)↑ evaluate the detector; columns GT, MT↑, PT↑, ML↓, IDF1(%)↑, IDP(%)↑, IDR(%)↑, IDs↓, FM↓, MOTA(%)↑ and MOTP↓ evaluate the tracker.

Videos | FP↓ | FN↓ | Rcll | Prcn | F1 | GT | MT↑ | PT↑ | ML↓ | IDF1 | IDP | IDR | IDs↓ | FM↓ | MOTA | MOTP↓
indoor | — | — | 95.6 | 93.2 | 94.4 | 7 | 7 | 0 | 0 | 91.5 | 90.4 | 92.7 | 7 | 11 | 88.0 | 0.26
outdoor easy | 70 | 65 | 97.5 | 97.3 | 97.4 | 7 | 7 | 0 | 0 | 74.5 | 74.4 | 74.6 | 6 | 16 | 94.5 | 0.21
outdoor hard | 533 | 460 | 93.0 | 92.0 | 92.5 | 20 | 19 | 1 | 0 | 78.0 | 77.6 | 78.4 | 30 | 67 | 84.4 | 0.28
20191104 indoor left | 164 | 215 | 83.3 | 86.7 | 85.0 | 10 | 8 | 2 | 0 | 83.8 | 85.5 | 82.1 | 7 | 24 | 70.0 | 0.34
20191104 indoor right | 118 | 188 | 85.2 | 90.1 | 87.6 | 13 | 8 | 5 | 0 | 79.6 | 81.9 | 77.4 | 9 | 16 | 75.1 | 0.30
20191104 indoor cross | 142 | 244 | 76.9 | 85.1 | 80.8 | 10 | 5 | 4 | 1 | 68.0 | 71.6 | 64.7 | 12 | 29 | 62.3 | 0.29
20191104 outdoor left | 249 | 160 | 88.0 | 82.5 | 85.2 | 10 | 8 | 2 | 0 | 73.5 | 71.2 | 76.0 | 10 | 48 | 68.6 | 0.33
20191104 outdoor right | 203 | 197 | 86.0 | 85.6 | 85.8 | 11 | 7 | 3 | 1 | 70.6 | 70.5 | 70.8 | 17 | 45 | 70.3 | 0.29
20191104 outdoor cross | 213 | 134 | 85.7 | 79.1 | 82.3 | 12 | 8 | 2 | 2 | 71.9 | 69.2 | 75.0 | 14 | 33 | 61.6 | 0.30
20191105 indoor left | 66 | 276 | 81.6 | 94.9 | 87.7 | 11 | 6 | 4 | 1 | 84.1 | 90.9 | 78.2 | 14 | 34 | 76.3 | 0.29
20191105 indoor right | 106 | 291 | 74.0 | 88.7 | 80.7 | 11 | 5 | 6 | 0 | 77.4 | 85.1 | 71.0 | 7 | 49 | 63.9 | 0.32
20191105 indoor cross | 284 | 833 | 73.0 | 88.8 | 80.1 | 21 | 10 | 11 | 0 | 68.7 | 76.1 | 62.6 | 29 | 104 | 62.9 | 0.28
20191105 outdoor left | 104 | 104 | 93.4 | 93.4 | 93.4 | 11 | 10 | 1 | 0 | 92.1 | 92.1 | 92.1 | 8 | 24 | 86.2 | 0.27
20191105 outdoor right | 220 | 256 | 77.1 | 79.7 | 78.4 | 11 | 4 | 6 | 1 | 67.3 | 68.4 | 66.2 | 14 | 67 | 56.2 | 0.33
20191105 outdoor cross | 317 | 378 | 85.6 | 87.6 | 86.6 | 17 | 15 | 2 | 0 | 72.2 | 72.8 | 71.4 | 48 | 97 | 71.6 | 0.29
OVERALL | 2869 | 3852 | 86.5 | 89.6 | 88.0 | 182 | 127 | 49 | 6 | 76.6 | 77.9 | 75.3 | 232 | 664 | 75.7 | 0.28

From Tables 2 and 3, we see that Prcn ranges from 79.1% to 97.3% and from 79.9% to 94.4% when applying YOLOv3 and Mask R-CNN, respectively, while Rcll varies from 73.0% to 97.5% and from 82.8% to 98.4% in the two cases. The large spread of these results indicates the great difference in the challenge level of the individual videos.
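The tracker columns follow the standard multi-object tracking metrics of the MOT16 benchmark [18]. As a reference for reading the tables, the sketch below shows how the headline scores combine the raw counts; the formulas are the standard ones, while the function and variable names are ours.

```python
def f1_score(prcn: float, rcll: float) -> float:
    """Detection F1-score (%) from precision and recall (both in %)."""
    return 2 * prcn * rcll / (prcn + rcll)

def mota(fp: int, fn: int, id_switches: int, num_gt_boxes: int) -> float:
    """Multiple Object Tracking Accuracy (%): one minus the ratio of all
    errors (false positives, misses, ID switches) to ground-truth boxes."""
    return 100.0 * (1.0 - (fp + fn + id_switches) / num_gt_boxes)

def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """Identity F1 (%): F1-score over identity-consistent box assignments."""
    return 100.0 * 2 * idtp / (2 * idtp + idfp + idfn)

# Example: the F1 entry of the 'outdoor easy' row in Table 2.
assert round(f1_score(97.3, 97.5), 1) == 97.4
```

MOTP, by contrast, is a distance: the average dissimilarity between true positives and their assigned ground-truth targets, so lower is better.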
Table 3. Performance on the FAPR dataset when employing Mask R-CNN as a detector and DeepSORT as a tracker. The columns have the same meaning as in Table 2.

Videos | FP↓ | FN↓ | Rcll | Prcn | F1 | GT | MT↑ | PT↑ | ML↓ | IDF1 | IDP | IDR | IDs↓ | FM↓ | MOTA | MOTP↓
indoor | 87 | 18 | 98.4 | 92.9 | 95.6 | 7 | 7 | 0 | 0 | 92.7 | 90.1 | 95.5 | 2 | 6 | 90.7 | 0.22
outdoor easy | 148 | 47 | 98.2 | 94.4 | 96.3 | 7 | 7 | 0 | 0 | 93.6 | 91.8 | 95.5 | 2 | 10 | 92.3 | 0.18
outdoor hard | 569 | 226 | 96.6 | 91.7 | 94.1 | 20 | 19 | 1 | 0 | 85.3 | 83.2 | 87.5 | 13 | 29 | 87.7 | 0.26
20191104 indoor left | 128 | 93 | 92.8 | 90.3 | 91.5 | 10 | 9 | 1 | 0 | 91.0 | 89.8 | 92.2 | 5 | 18 | 82.4 | 0.31
20191104 indoor right | 175 | 46 | 96.4 | 87.5 | 91.7 | 13 | 12 | 1 | 0 | 82.8 | 78.9 | 87.0 | 12 | 14 | 81.6 | 0.26
20191104 indoor cross | 165 | 89 | 91.6 | 85.4 | 88.4 | 10 | 9 | 1 | 0 | 72.1 | 69.7 | 74.7 | 15 | 29 | 74.5 | 0.27
20191104 outdoor left | 217 | 28 | 97.9 | 85.7 | 91.4 | 10 | 10 | 0 | 0 | 91.0 | 85.3 | 97.4 | 2 | 12 | 81.5 | 0.28
20191104 outdoor right | 275 | 169 | 88.0 | 81.8 | 84.8 | 11 | 8 | 2 | 1 | 74.5 | 71.9 | 77.3 | 13 | 33 | 67.5 | 0.26
20191104 outdoor cross | 244 | 75 | 92.0 | 78.0 | 84.4 | 12 | 9 | 3 | 0 | 67.6 | 62.5 | 73.7 | 22 | 20 | 63.7 | 0.27
20191105 indoor left | 130 | 140 | 90.7 | 91.3 | 91.0 | 11 | 9 | 2 | 0 | 87.8 | 88.0 | 87.5 | 14 | 35 | 81.1 | 0.27
20191105 indoor right | 143 | 164 | 85.3 | 87.0 | 86.1 | 11 | 8 | 3 | 0 | 80.5 | 81.2 | 79.7 | 7 | 41 | 71.9 | 0.30
20191105 indoor cross | 520 | 531 | 82.8 | 83.1 | 82.9 | 21 | 14 | 7 | 0 | 74.4 | 74.4 | 74.2 | 45 | 112 | 64.5 | 0.27
20191105 outdoor left | 229 | 37 | 97.6 | 87.0 | 92.0 | 11 | 10 | 1 | 0 | 90.1 | 85.1 | 95.6 | 5 | 8 | 82.7 | 0.22
20191105 outdoor right | 240 | 164 | 85.3 | 79.9 | 82.5 | 11 | 6 | 5 | 0 | 73.8 | 71.4 | 76.3 | 12 | 59 | 62.8 | 0.31
20191105 outdoor cross | 370 | 243 | 90.7 | 86.5 | 88.6 | 17 | 17 | 0 | 0 | 75.2 | 73.2 | 77.1 | 37 | 81 | 75.2 | 0.25
OVERALL | 3640 | 2070 | 92.8 | 87.9 | 90.3 | 182 | 154 | 27 | 1 | 82.8 | 80.6 | 85.1 | 206 | 507 | 79.3 | 0.26

Among the 15 considered videos, three are the most challenging: 20191105 indoor right, 20191105 indoor cross, and 20191105 outdoor right. The mostly-tracked (MT) ratios for these videos are 45.45%, 47.62%, and 36.36% when coupling YOLOv3 with DeepSORT, and 72.73%, 66.67%, and 54.54% when coupling Mask R-CNN with DeepSORT, compared to the best result of 100%. This is also reflected in the MOTA and MOTP values: on 20191105 outdoor right, MOTA and MOTP are 56.2% and 0.33 with YOLOv3, and 62.8% and 0.31 with Mask R-CNN. This can be explained by the fact that this video contains 10 individuals, six of whom (three pairs) move together, causing serious occlusions over a long time; it is therefore really difficult to detect the human regions as well as to track the pedestrians' trajectories. One interesting point is that the best results are obtained on the outdoor easy video: MOTA and MOTP are 94.5% and 0.21 with YOLOv3, and 92.3% and 0.18 with Mask R-CNN, in both cases with DeepSORT for tracking. These values show the effectiveness of the proposed framework for both the human detection and tracking steps: high accuracy together with a small average distance between the true positives and their corresponding targets. Figures 3 and 4 show several examples of the results obtained in the human detection and tracking steps.

Figure 3. Example results of the human detection step: (a) detected boxes and their corresponding ground truth, marked with green and yellow bounding boxes, respectively; (b) typical detection errors: a box covering only part of a human body, or a box containing more than one pedestrian.

Figure 4. Example results of the tracking step: (a) a perfect tracklet, (b) an ID switch, and (c) a tracklet containing only a few bounding boxes.
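The ID switches counted above are what DeepSORT [5] is designed to suppress: each detection carries a deep appearance embedding, and detections are assigned to tracks by minimizing cosine distance. The snippet below is a minimal sketch of that appearance-association step alone, assuming L2-normalized embeddings; the real tracker additionally uses a Kalman-filter motion gate [15] and a matching cascade, which are omitted here, and the gate threshold is a placeholder.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_embs: np.ndarray, det_embs: np.ndarray,
              max_cos_dist: float = 0.2):
    """Match existing tracks to current detections by appearance.

    track_embs: (T, D) L2-normalized embeddings, one per live track
    det_embs:   (N, D) L2-normalized embeddings of this frame's detections
    Returns (track_idx, det_idx) pairs whose cosine distance passes the gate.
    """
    # for unit vectors, cosine distance = 1 - dot product
    cost = 1.0 - track_embs @ det_embs.T       # (T, N) cost matrix
    rows, cols = linear_sum_assignment(cost)   # optimal one-to-one assignment
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_cos_dist]
```

Unmatched detections start new tracks, and repeatedly unmatched tracks are terminated; this is how fragmented tracklets such as the one in Figure 4(c) arise.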
Concerning person ReID, in this study ResNet features are proposed for person representation, and similarities between tracklets are computed with the cosine distance. For the feature extraction step, ResNet-50 [20] is pre-trained on ImageNet [21], a large-scale and diverse dataset designed for visual object recognition research, and then fine-tuned on PRID-2011 [22] for the person ReID task. For tracklet representation, a ResNet feature is first extracted from every bounding box belonging to the same tracklet; the extracted features are then forwarded to a temporal feature pooling layer to generate the final feature vector. For image representation, in order to exploit both local and global information, each image is divided into seven non-overlapping regions; a feature is extracted from each region, and the extracted features are concatenated to form a large-dimensional vector representing the image. In this way, we obtain more useful information and improve the matching rate for person ReID.
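The following PyTorch sketch illustrates this representation under our own assumptions about the unstated details: the seven regions are taken as horizontal stripes pooled from the backbone feature map, the temporal pooling is an average over the tracklet's frames, and input preprocessing (resizing, ImageNet normalization) is omitted. It is a sketch of the idea, not the authors' implementation.

```python
import torch
import torch.nn.functional as F
from typing import List
from torchvision.models import resnet50

# ResNet-50 backbone without its average-pooling and classification layers;
# in the paper it would further be fine-tuned on PRID-2011.
backbone = torch.nn.Sequential(*list(resnet50(pretrained=True).children())[:-2])
backbone.eval()

@torch.no_grad()
def image_feature(img: torch.Tensor, n_stripes: int = 7) -> torch.Tensor:
    """Describe one (3, H, W) person crop by concatenating the features of
    seven non-overlapping horizontal stripes of the backbone feature map."""
    fmap = backbone(img.unsqueeze(0))                     # (1, 2048, h, w)
    pooled = F.adaptive_avg_pool2d(fmap, (n_stripes, 1))  # (1, 2048, 7, 1)
    return pooled.flatten(1).squeeze(0)                   # (7 * 2048,)

@torch.no_grad()
def tracklet_feature(boxes: List[torch.Tensor]) -> torch.Tensor:
    """Temporal pooling: average the per-box features over a tracklet."""
    return torch.stack([image_feature(b) for b in boxes]).mean(dim=0)

def rank1_match(probe: torch.Tensor, gallery: List[torch.Tensor]) -> int:
    """Index of the gallery tracklet at minimum cosine distance."""
    dists = [1.0 - F.cosine_similarity(probe, g, dim=0) for g in gallery]
    return int(torch.tensor(dists).argmin())
```

With this, re-identifying a probe tracklet amounts to calling `rank1_match` on its pooled feature against the pooled features of all gallery tracklets, mirroring the evaluation protocol below.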
For person ReID evaluation, 12 videos (listed in Table 4 below) are used; half of these videos were captured on the same day by the two cameras (one indoor and one outdoor) in three different scenarios according to movement manner (left, right, and cross). In the proposed framework for automated person ReID, the matching problem is treated as tracklet matching: the indoor tracklets are used as the probe set and the outdoor tracklets as the gallery set. The experiments are performed in both the single-view and multi-view cases. Due to limited research time, in this work we focus only on the matching rate at rank-1. The matched tracklet for a given probe tracklet is the gallery tracklet at minimum distance from the probe. The matched pairs are divided into correct and wrong matchings: a matched pair is a correct matching if the two tracklets represent the same pedestrian; conversely, if the matched pair describes different pedestrians, it is a wrong matching. Correct and wrong matchings are illustrated in Fig. 5. The matching rate at rank-1 is the ratio between the number of correct matchings and the total number of probe tracklets.

Table 4. Matching rate (%) at rank-1 for the person ReID task in different scenarios.

Scenario | Probe | Gallery | Matching rate (%)
1 | 20191104 indoor left | 20191104 outdoor left | 53.33
2 | 20191104 indoor right | 20191104 outdoor right | 64.29
3 | 20191104 indoor cross | 20191104 outdoor cross | 45.45
4 | 20191104 indoor all | 20191104 outdoor all | 58.82
5 | 20191105 indoor left | 20191105 outdoor left | 100.00
6 | 20191105 indoor right | 20191105 outdoor right | 75.00
7 | 20191105 indoor cross | 20191105 outdoor cross | 57.14
8 | 20191105 indoor all | 20191105 outdoor all | 78.57

The obtained results show that the matching rates for the last four scenarios are higher than for the others, and that ReID performance is worst when pedestrians cross each other. Additionally, even on mixed data (all movement directions of the same day), the matching rates reach 58.82% and 78.57%. These results are encouraging for building a fully automated person ReID system in practice.

Figure 5. Example results of the person ReID step: (a) a correct matching and (b) a wrong matching.

5. CONCLUSIONS

This paper proposes a unified framework for automated person ReID. The contributions of the paper are two-fold. First, deep-learning-based methods are proposed for all three steps of the framework: YOLOv3 or Mask R-CNN is combined with DeepSORT for human detection and tracking, respectively, while in the person ReID step an improved version of ResNet features with seven stripes is used for person representation. Second, the FAPR dataset was built on our own for evaluating the performance of all three steps; this dataset is as challenging as commonly used datasets. The obtained results demonstrate the feasibility of building a fully automated person ReID system in practice. However, the videos examined in this study contain only a few persons, which may make the results less objective. In future work, we will address this issue and deal with more complicated data.

ACKNOWLEDGMENT. This research is funded by the University of Transport and Communications (UTC) under grant number T2020-DT-003.

REFERENCES

[1] M. Zabłocki, K. Gościewska, D. Frejlichowski, R. Hofman, Intelligent video surveillance systems for public spaces – a survey, Journal of Theoretical and Applied Computer Science 8 (4) (2014) 13–27.
[2] Q. Leng, M. Ye, Q. Tian, A survey of open-world person re-identification, IEEE Transactions on Circuits and Systems for Video Technology 30 (2019) 1092–1108. https://doi.org/10.1109/TCSVT.2019.2898940.
[3] J. Redmon, A. Farhadi, YOLOv3: An incremental improvement, arXiv preprint arXiv:1804.02767, 2018. https://arxiv.org/pdf/1804.02767v1.pdf.
[4] K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask R-CNN, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.
[5] N. Wojke, A. Bewley, D. Paulus, Simple online and realtime tracking with a deep association metric, in: 2017 IEEE International Conference on Image Processing (ICIP), 2017, pp. 3645–3649. https://doi.org/10.1109/ICIP.2017.8296962.
[6] H.-Q. Nguyen, T.-B. Nguyen, T.-A. Le, T.-L. Le, T.-H. Vu, A. Noe, Comparative evaluation of human detection and tracking approaches for online tracking applications, in: 2019 International Conference on Advanced Technologies for Communications (ATC), IEEE, 2019, pp. 348–353. https://www.researchgate.net/publication/336719645.
[7] T. T. T. Pham, T.-L. Le, H. Vu, T. K. Dao, et al., Fully-automated person re-identification in multi-camera surveillance system with a robust kernel descriptor and effective shadow removal method, Image and Vision Computing 59 (2017) 44–62. https://doi.org/10.1016/j.imavis.2016.10.010.
[8] M. Taiana, D. Figueira, A. Nambiar, J. Nascimento, A. Bernardino, Towards fully automated person re-identification, in: 2014 International Conference on Computer Vision Theory and Applications (VISAPP), Vol. 3, IEEE, 2014, pp. 140–147. https://ieeexplore.ieee.org/document/7295073.
[9] Y.-J. Cho, J.-H. Park, S.-A. Kim, K. Lee, K.-J. Yoon, Unified framework for automated person re-identification and camera network topology inference in camera networks, in: Proceedings of the IEEE International Conference on Computer Vision Workshops, 2017, pp. 2601–2607. https://arxiv.org/abs/1704.07085.
[10] D. A. B. Figueira, Automatic person re-identification for video surveillance applications, Ph.D. thesis, University of Lisbon, Lisbon, Portugal, 2016. https://www.ulisboa.pt/prova-academica/automatic-person-re-identification-video-surveillance-applications.
[11] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: Unified, real-time object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788. https://arxiv.org/abs/1506.02640.
[12] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, in: Advances in Neural Information Processing Systems, 2015, pp. 91–99. https://arxiv.org/abs/1506.01497.
[13] S. Karanam, M. Gou, Z. Wu, A. Rates-Borras, O. Camps, R. J. Radke, A systematic evaluation and benchmark for person re-identification: Features, metrics, and datasets, IEEE Transactions on Pattern Analysis & Machine Intelligence (1) (2018) 1–1.
[14] L. Zheng, Y. Yang, A. G. Hauptmann, Person re-identification: Past, present and future, arXiv preprint arXiv:1610.02984, 2016. https://arxiv.org/pdf/1610.02984.pdf.
[15] R. E. Kalman, A new approach to linear filtering and prediction problems, Journal of Basic Engineering 82 (1) (1960) 35–45. https://doi.org/10.1109/9780470544334.ch9.
[16] M. ul Hassan, ResNet (34, 50, 101): Residual CNNs for image classification tasks. https://neurohive.io/en/popular-networks/resnet/ [Online; accessed 10 March 2020].
[17] Tzutalin, LabelImg, Git code, 2015. https://github.com/tzutalin/labelImg/ [Online; accessed 20 September 2020].
[18] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, K. Schindler, MOT16: A benchmark for multi-object tracking, arXiv preprint arXiv:1603.00831, 2016. https://arxiv.org/abs/1603.00831.
[19] X. Wang, G. Doretto, T. Sebastian, J. Rittscher, P. Tu, Shape and appearance context modeling, in: IEEE 11th International Conference on Computer Vision (ICCV 2007), IEEE, 2007, pp. 1–8. https://www.ndmrb.ox.ac.uk/research/our-research/publications/439059.
[20] J. Gao, R. Nevatia, Revisiting temporal modeling for video-based person ReID, arXiv preprint arXiv:1805.02104, 2018. https://arxiv.org/abs/1805.02104.
[21] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 248–255. https://www.bibsonomy.org/bibtex/252793859f5bcbbd3f7f9e5d083160acf/analyst.
[22] M. Hirzer, C. Beleznai, P. M. Roth, H. Bischof, Person re-identification by descriptive and discriminative classification, in: Scandinavian Conference on Image Analysis (SCIA), Springer, 2011, pp. 91–102. https://doi.org/10.1007/978-3-642-21227-7_9.