Automatic Dataset Augmentation Using Virtual Human Simulation

Marcelo C. Ghilardi, Leandro Dihl, Estevão Testa, Pedro Braga, João P. Pianta, Isabel H. Manssour, Soraia R. Musse
DaVint - Data Visualization and Interaction Lab, VHLab - Virtual Human Simulation Lab,
School of Technology, PUCRS - Pontifical Catholic University of Rio Grande do Sul, Porto Alegre, Brazil
arXiv:1905.00261v1 [cs.CV] 1 May 2019

Abstract—Virtual Human Simulation has been widely used for different purposes, such as comfort or accessibility analysis. In this paper, we investigate the possibility of using this type of technique to extend training datasets of pedestrians to be used with machine learning techniques. Our main goal is to verify whether Computer Graphics (CG) images of virtual humans with a simplistic rendering can be efficient for augmenting datasets used to train machine learning methods. From a machine learning point of view, there is a need to collect and label large datasets for ground truth, which often demands manual annotation. In addition, finding images and videos with real people, and providing ground truth for people detection and counting, is not trivial. If CG images, for which ground truth can be generated automatically, can also be used to train machine learning techniques for pedestrian detection and counting, this can certainly facilitate and optimize the whole process of event detection. In particular, we propose to parametrize virtual humans using a data-driven approach. Results demonstrate that using datasets extended with CG images outperforms using only real image sequences.

I. INTRODUCTION

In recent years, there has been growing interest in understanding the behavior of pedestrians and crowds in video sequences. This is important in many applications, but certainly one of the most relevant is the safety of pedestrians in complex buildings or at mass events. Many methodologies to detect groups and crowd events have been proposed in the literature, with results showing that groups, social behaviors and navigation aspects can be successfully detected in video sequences: for example, counting people in crowds [1], [2], abnormal behavior detection [3], [4], the study of social groups in crowds [5], [6], the understanding of group behaviors [7] and the characterization of crowd features [8]. Most of these approaches are based on individual pedestrian tracking or optical flow algorithms, and in general consider features like speed, direction, and distance over time. On the other hand, many of these applications have also been addressed from another perspective, i.e. by using huge datasets for training and testing machine learning techniques, as described in Section II, in order to reach accurate results. One of the main drawbacks in this area is the work needed to build the datasets and the respective ground truth, which is not a trivial task, mainly for pedestrians and crowds. Sometimes this ground truth is prepared manually, which is very time-consuming. Thus, computer graphics simulations started to be used to generate larger labeled datasets to serve as ground truth. LCrowdV [9] is a recent example of computer graphics technology used to generate crowd datasets from a set of provided parameters. In this paper, we investigate the efficiency of CG images generated with our framework to simulate crowds and render Virtual Humans (VH) in a simplistic way.
In particular, we are interested in semi-automatically generating CG images based on a labeled dataset, i.e. using data-driven techniques. The idea is to use a parametrized crowd simulator, where information from trajectories labeled in the dataset generates the crowd parameters. We used two crowd simulators to simulate virtual humans and automatically generate the ground truth. In addition, images were rendered using the Unity engine. We also implemented the Multi-column Convolutional Neural Network (MCNN) proposed in [10] to test the CG dataset and compare its efficiency with the well-known UCSD dataset [11]. The main contribution of this paper is the discussion and investigation of VH simulation used to extend crowd and pedestrian datasets in a data-driven way. Results indicate that this idea is indeed promising, since it reduces the work of manually labeling ground truth datasets.

This paper is organized as follows: Section II reviews the literature on crowd counting and VH simulation focused on dataset generation. The proposed model is presented in Section III, while experimental results are discussed in Section IV. Finally, Section V draws some conclusions and discusses future work.

II. RELATED WORK

In recent years, several works have addressed crowd counting [1], [2], [12]–[15] for different purposes, such as crowd control, urban planning and video surveillance [16]. This problem consists in estimating the number of people in a crowd [17], and has been addressed over the years using several approaches [11]–[13], [18], [19], such as Support Vector Machine (SVM) classifiers [20] and object detection using a boosted cascade of features [21]. Recently, aiming to improve accuracy, different Convolutional Neural Network (CNN) methods have been widely used [10], [22]–[27]. Sourtzinos et al. [27] presented a method for people counting using a CNN, tested with the publicly available Mall crowd counting dataset [28]. This dataset was annotated manually by labeling the head position of each pedestrian in all frames. Zhang et al. [10], for example, proposed a Multi-column Convolutional Neural Network (MCNN) that allows input images of arbitrary resolution. However, besides using existing datasets, they also had to collect and annotate a huge dataset to verify the effectiveness of their method. Another large-scale dataset with annotated pedestrians for crowd counting algorithms was provided by Zhao et al. [26]. Gao et al. [24] combined the Adaboost algorithm and a CNN for head detection, and used a classroom surveillance dataset, also manually annotated, to evaluate the proposed method. Due to the need for this large amount of training data, Boominathan et al. [22] augmented their training dataset by cropping patches from a multi-scale pyramidal representation of each training image. Cheung et al. [9], on the other hand, argue that manually labeling datasets is time-consuming and error-prone, besides requiring several human operators. Therefore, they proposed a procedural framework called LCrowdV to generate labeled crowd videos. Thus, although CNN methods present excellent results, there is a need to collect and label large datasets for ground truth, which often demands manual annotation. Because of this, recent research has addressed the problem of generating labeled videos, such as LCrowdV developed by Cheung et al. [9].
Synthetic data has already been used to improve image recognition [29]–[31]; however, this approach had not yet been explored in crowd/pedestrian counting solutions. One advantage of crowd simulation applications is the possibility of easily generating a huge dataset together with its ground truth, which fully eliminates the need for manually annotated ground truth, an important and relevant advantage. This advantage is further enhanced by the possibility of automatically generating labeled crowd videos similar to the real ones, in order to easily extend existing datasets used to train machine learning techniques.

III. PROPOSED MODEL

Since the focus of this paper is to discuss the automatic process of augmenting labeled datasets, we chose one state-of-the-art architecture [10] to conduct our research. We implemented the Multi-column Convolutional Neural Network (MCNN) [10] due to the contribution presented by its authors: the method can handle features at different scales together in order to accurately estimate crowd counts in different images. First, however, we used our simulators to simulate virtual humans and generate Virtual Human datasets; Section III-A describes the details of this process.

The overview of our method is presented in Figure 1, which illustrates the four datasets used. On the top left are the UCSD [11] images, which were used for training and testing. This dataset contains low-density crowds and ground truth data. The dataset called "Students" was filmed at our University and presents from 0 to 30 students seen from a top-view camera, in an environment of 9 square meters. Its goal is to provide a dataset with a different camera perspective and a different crowd density compared to UCSD. This dataset was also used to train and test our method. We tracked the people in the Students dataset using the method proposed by Bins et al. [32], visually analyzed all tracking data and manually corrected any problems, in order to generate a semi-automatic, accurate ground truth for Students as well. The other two datasets, "CG-CrowdSim" and "CG-BioCrowds", were generated through simulation in order to augment the training data, as explained below.

Fig. 1. Outline of the proposed methodology. The UCSD and Students datasets are used for training and testing, while the CG datasets are used only for training.

The CG images were used only in the training phase of the machine learning method. Trajectories and behaviors were generated using two simulators (CrowdSim and BioCrowds) and rendered in Unity. These two simulators are controlled in different ways. While CrowdSim has a graphical interface that can be used to define the simulation of interest, BioCrowds is a parametrized simulator that can be data-driven. We first used CrowdSim because it is more comparable with the LCrowdV method [9], i.e. the crowd designer manually defines initial positions, goals, speeds, etc. So, in the CrowdSim case (as in LCrowdV), people design the crowd they are interested in simulating. In this work, the goal is clearly to "imitate" the dataset being augmented; however, this imitation process is user-based, since the UCSD data is not used as parameters for CrowdSim. More details about CrowdSim are presented in Section III-A1.
In order to test BioCrowds, we read the information stored in the Students dataset and automatically parametrize the simulator, as discussed in Section III-A2. In both cases, we used a very simplistic rendering in Unity (e.g. virtual humans do not cast shadows on the floor). To generate the ground truth for the machine learning method (agent positions in image coordinates as a function of time), we exploit the fact that, in both simulators, all virtual human positions are known in world coordinates. We therefore assume a classical pinhole camera model u = Px, where u is the pixel in the image (in homogeneous coordinates), P is the 3x4 projection matrix (known in the CG world), and x is the 3D position in the world (also in homogeneous coordinates). In the next sections, we discuss some details about the crowd simulators, and then how the density maps used by the MCNN were generated.

Fig. 2. CrowdSim environment example. S1, S2, S3, and S4 represent the exits in a simulated night club.

A. Crowd Simulators

This section details the two crowd simulators used.

1) CrowdSim: CrowdSim is a rule-based crowd simulation software developed to simulate coherent motion and behaviors of virtual humans in a geometric environment. In particular, CrowdSim has been used to simulate evacuation scenarios [33]. CrowdSim simulates VHs with behaviors such as goal seeking and collision avoidance, and generates outputs to be used in post-processing phases, such as the position of each agent at each time step, which can be used to visualize the characters on other platforms. In addition, CrowdSim generates statistical data used to estimate human comfort and safety in a specific environment, e.g. densities, velocities, etc. Two key components are considered in CrowdSim, organized in distinct modules: Configuration and Simulation, which are respectively responsible for configuring the environment/population/route information and for running the simulation and events. For further details, please refer to [33]. Figure 2 illustrates a CrowdSim environment. We can see the CrowdSim contexts (walkable regions) and the connections among them (white edges) that guide agents to the pre-defined exits. S1, S2, S3, and S4 represent the exits of the simulated night club [33], which was also simulated in real life. The advantage of CrowdSim, compared to other crowd simulators in the literature, is that it has been evaluated and validated according to Galea [34] and also tested in a real scenario. The navigation graph generated by CrowdSim (edges are routes and contexts are nodes), together with the population distribution in the entry contexts (entry rooms) and the expected distributions at the decision points, form the definition of our crowd motion plans.

2) BioCrowds: An agent perceives a set of markers (dots) on the ground (described through a space subdivision method) within its observational radius, and moves toward its goal taking into account the markers that are unoccupied and closer to this agent than to any other. This is the main aspect of the BioCrowds simulator [35], which supports some of the important emergent behaviors expected in crowd simulation (illustrated in Figure 3), as also observed in other crowd simulators [36], [37].

Fig. 3. BioCrowds: emergent phenomena produced by our simulator in a manner similar to other crowd frameworks; on the left, arc formation; on the right, emergent lanes.
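To make the marker mechanism concrete, the following Python sketch implements one simplified BioCrowds-style update under our own assumptions: markers are claimed by the closest agent within a fixed observational radius, and each agent's displacement is a weighted average of the vectors to its claimed markers, with weights favoring markers aligned with the goal direction and the step length capped by a maximum speed. Function names, the clipped dot-product weighting, and all constants are illustrative and are not the exact formulation of [35].

```python
import numpy as np

def biocrowds_step(agents, goals, markers, radius=1.0, max_speed=1.3, dt=1.0 / 30.0):
    """One simplified BioCrowds-style update (illustrative only).

    agents:  (N, 2) current agent positions in meters
    goals:   (N, 2) goal positions
    markers: (M, 2) ground markers; obstacle regions simply contain no markers
    """
    agents = np.asarray(agents, dtype=float)
    goals = np.asarray(goals, dtype=float)
    markers = np.asarray(markers, dtype=float)

    # Each marker is claimed by the closest agent, but only within `radius`.
    dists = np.linalg.norm(markers[:, None, :] - agents[None, :, :], axis=2)  # (M, N)
    owner = dists.argmin(axis=1)
    owner[dists.min(axis=1) > radius] = -1  # unclaimed markers

    new_pos = agents.copy()
    for i in range(len(agents)):
        mine = markers[owner == i]
        if len(mine) == 0:
            continue  # no free space around this agent: it waits
        to_markers = mine - agents[i]
        to_goal = goals[i] - agents[i]
        to_goal = to_goal / (np.linalg.norm(to_goal) + 1e-9)

        # Weight each claimed marker by how well it points toward the goal.
        w = np.maximum(to_markers @ to_goal, 0.0)
        if w.sum() < 1e-9:
            continue
        move = (w[:, None] * to_markers).sum(axis=0) / w.sum()

        # Limit the displacement by the agent's maximum speed.
        step = min(np.linalg.norm(move), max_speed * dt)
        direction = move / (np.linalg.norm(move) + 1e-9)
        new_pos[i] = agents[i] + step * direction
    return new_pos
```

Because agents compete for markers, no two agents can claim the same free space, which is what yields collision avoidance and the emergent arcs and lanes shown in Figure 3 without any explicit avoidance rule.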
As a consequence of the BioCrowds main functions, obstacles are very easy to represent: they are simply zones without any markers in the space discretization. In order to provide data-driven control, we read information from the dataset, i.e. the number of individuals n and their positions x_i^f in image coordinates, as a function of the frame f. For the moment we only work with top-view cameras, where perspective does not need to be corrected; the only mapping performed to generate world coordinates is to convert the positions x_i^f in pixels to positions X_i^f in meters, in order to compute the BioCrowds parameters. For each person i we generate the following information: the speed s_i (in meters/frame), computed from the positions X_i^f, and the initial and final positions X_i^{if} and X_i^{ff}, where if and ff are respectively the first and last frames in which individual i appears in the video sequence. Having this data, we are able to parametrize BioCrowds as follows:

• n_B = n, where n_B is the number of agents in BioCrowds;
• for each agent j in [0; n_B], X_j^{if} = f(x_i^{if}), where i is the index of the individual in the video sequence, j is the index in BioCrowds, and f(w) is the function that maps positions from image coordinates to world coordinates;
• similarly, X_j^{ff} = f(x_i^{ff}) and s_j are computed.

Then, BioCrowds simulates the n_B agents with their parameters defined w.r.t. the input dataset. As output, BioCrowds generates the position of each agent at each frame, X_j^f; the ground truth (GT) and the Unity images with virtual humans are the final output. It is important to highlight that we chose to simulate people between their initial and final frames because we want to reproduce the same crowd pattern present in the dataset, while still being able to increase or decrease the number of agents, i.e. to vary the generated data. For this, we just need to replicate some positions coming from the dataset to serve as input information for additional agents in BioCrowds. Even if two agents have the same initial and goal positions, they adopt different motions due to the collision avoidance present in the BioCrowds method.

B. Generation of Density Maps for the Computer Graphics Datasets

As mentioned in [10], the estimated crowd density, computed from an input image and used in the training step, strongly determines the CNN performance. In order to provide CG datasets comparable to UCSD and to the results of [10], we used the same method, adapted, however, to data obtained from virtual human simulation. For the CrowdSim dataset we simulated from 0 to 20 agents, and their motion aimed to replicate the environment of the UCSD dataset. For the BioCrowds dataset we simulated exactly 22 agents, and positions from Students were used to drive the simulation. For both synthetic datasets, the rendering was processed in real time and 30 images were generated per second. Associated with each image, a file was generated containing the position X_i^f of each agent i at each frame f in world coordinates. This set of files was used to transform coordinates from world to image, given the camera used in the CG generation, thus producing the position u_i^f of each agent in image coordinates. In order to generate the density maps for the CG datasets, we use the distances among agents in each frame. We denote the distances from agent i to its k nearest neighbors (in image coordinates) as d_i^1, d_i^2, ..., d_i^k, and their average as d̄_i. To estimate the crowd density around pixel u_i, we convolve δ(u − u_i) with a Gaussian kernel whose variance γ_i is proportional to d̄_i. For more details, please refer to [10]. A simplified code sketch of this pipeline is given below.
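The following Python sketch summarizes the pipeline described in Sections III-A2 and III-B: mapping tracked image positions to world coordinates and deriving per-agent speed, projecting simulated world positions back to the image with the pinhole model u = Px, and building geometry-adaptive density maps in the spirit of [10]. The pixel-to-meter scale, the constants k and beta, and all function names are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# --- Data-driven parametrization (Sec. III-A2) -----------------------------
def pixels_to_meters(x_px, meters_per_pixel=0.02):
    """Hypothetical top-view mapping from pixels to meters (constant scale)."""
    return np.asarray(x_px, dtype=float) * meters_per_pixel

def parametrize_agent(track_px, first_frame, last_frame, meters_per_pixel=0.02):
    """Derive BioCrowds parameters (initial/final position, speed) for one
    tracked individual from its pixel trajectory between first_frame and
    last_frame (inclusive)."""
    X = pixels_to_meters(track_px, meters_per_pixel)       # world positions X_i^f
    X_init, X_final = X[0], X[-1]
    n_frames = max(last_frame - first_frame, 1)
    speed = np.linalg.norm(X_final - X_init) / n_frames    # meters per frame
    return X_init, X_final, speed

# --- Ground-truth projection with the pinhole model u = P x ----------------
def project_to_image(P, X_world):
    """Project (N, 3) world positions to (N, 2) pixel coordinates with a
    3x4 projection matrix P, using homogeneous coordinates."""
    Xw = np.asarray(X_world, dtype=float)
    Xh = np.hstack([Xw, np.ones((len(Xw), 1))])             # (N, 4)
    uh = Xh @ P.T                                           # (N, 3)
    return uh[:, :2] / uh[:, 2:3]

# --- Geometry-adaptive density maps, following the idea of [10] ------------
def density_map(heads_px, shape, k=3, beta=0.3):
    """One Gaussian per projected agent position; its spread is proportional
    to the mean distance to the k nearest neighbors (beta is illustrative)."""
    H, W = shape
    dmap = np.zeros((H, W), dtype=np.float32)
    pts = np.asarray(heads_px, dtype=float)
    ys, xs = np.mgrid[0:H, 0:W]
    for i, (u, v) in enumerate(pts):
        if len(pts) > 1:
            d = np.sort(np.linalg.norm(pts - pts[i], axis=1))[1:k + 1]
            sigma = max(beta * d.mean(), 1.0)
        else:
            sigma = 4.0  # fallback spread for an isolated agent
        g = np.exp(-((xs - u) ** 2 + (ys - v) ** 2) / (2.0 * sigma ** 2))
        dmap += g / max(g.sum(), 1e-12)  # each agent contributes unit mass
    return dmap
```

Since the simulators expose every agent's exact world position, these maps can be produced for every rendered frame without any manual annotation.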
Figure 4 illustrates images from the four datasets (left), the corresponding ground-truth density maps (middle), and the MCNN output (right).

IV. EXPERIMENTAL RESULTS

In order to evaluate the performance of CG images in the training phase, we tested two combinations of real dataset + CG images. Section IV-A discusses the results of UCSD augmented with a manually defined crowd simulation (CrowdSim), while Section IV-B shows the results when the Students dataset was augmented (BioCrowds).

Fig. 4. Some images from the UCSD, CrowdSim, Students and BioCrowds datasets. (a), (b) and (c) illustrate the original image, the processed ground truth, and the MCNN output, respectively. (d), (e) and (f) are the corresponding images for the computer graphics dataset generated with CrowdSim. (g), (h) and (i) are the corresponding images for the Students dataset. (j), (k) and (l) are the corresponding images for the computer graphics dataset generated with BioCrowds.

The total number of images is 4630: 2000 from UCSD, 1545 from CrowdSim (using UCSD as a basis), 350 from Students (filmed at our University) and 735 from BioCrowds (using Students as a basis). The experimental results aim to show that even the simplistic rendering applied in the CrowdSim and BioCrowds datasets can improve the performance of the machine learning method. As commonly done, we adopted the MAE (Mean Absolute Error) and MSE (Mean Squared Error) metrics.¹

First, we evaluated the four datasets individually, each used for both training and testing (see Table I). The MCNN performs better on the CrowdSim and BioCrowds datasets than on the others. One difference among the datasets is that the CG images are more homogeneous, given the synthetic background (see Figure 4 for an illustration of the datasets). The next sections describe the analysis performed on the augmented datasets.

¹ Differently from [10], we used MSE and not RMSE.

TABLE I
MCNN APPLIED TO THE FOUR ANALYZED DATASETS.

Training          Testing           MAE     MSE
UCSD (40%)        UCSD (60%)        1.3745  9.4633
CrowdSim (40%)    CrowdSim (60%)    0.7898  3.4966
Students (40%)    Students (60%)    1.1087  5.5069
BioCrowds (40%)   BioCrowds (60%)   1.0981  5.5776

A. UCSD and CrowdSim

We compared the evolution of the MCNN performance in two situations: i) training only with UCSD (40%), and ii) extending the training dataset with CG images from CrowdSim (from 0% to 100% of the total of 1545 images). For UCSD we used the same setting as [10], i.e. frames 601 to 1400 as training data and the remaining 1200 frames as test data. For the CrowdSim dataset, we randomly selected images until the required percentage of the dataset (from 0% to 100% of the 1545 images) was reached, to be used as additional training data. The test images were always the same UCSD set (60%) in both evaluations. Figure 5 shows the evolution in terms of computed epochs; the dataset extended with CG images clearly improved the MCNN performance. A sketch of this augmentation protocol and of the adopted metrics is given below.
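As a rough illustration of this protocol, the sketch below (with hypothetical frame lists and model; not the authors' code) assembles an augmented training list from the fixed real training split plus a chosen fraction of CG frames, and computes the MAE and MSE used throughout Section IV.

```python
import random
import numpy as np

def mix_training_set(real_train_frames, cg_frames, cg_fraction, seed=0):
    """Augmented training list: all real training frames plus a randomly
    selected fraction (0.0 to 1.0) of the CG frames."""
    rng = random.Random(seed)
    n_cg = int(round(cg_fraction * len(cg_frames)))
    return list(real_train_frames) + rng.sample(list(cg_frames), n_cg)

def mae_mse(predicted_counts, true_counts):
    """Counting metrics used in the paper (MSE rather than RMSE, cf. [10])."""
    pred = np.asarray(predicted_counts, dtype=float)
    true = np.asarray(true_counts, dtype=float)
    err = pred - true
    return float(np.abs(err).mean()), float((err ** 2).mean())

# Hypothetical usage: 40% of UCSD plus 60% of the CrowdSim frames for training,
# always testing on the same 60% UCSD split.
# train = mix_training_set(ucsd_train, crowdsim_frames, cg_fraction=0.6)
# mae, mse = mae_mse([model.count(f) for f in ucsd_test], ucsd_test_counts)
```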
Fig. 5. MCNN performance when training with UCSD only and when the dataset is extended with CG images generated using CrowdSim.

Figures 6 and 7 illustrate the results: augmenting the training dataset with CG images provided significantly better performance than training without any extension.

Fig. 6. Comparison of the MAE metric when the UCSD training dataset is extended with CG images generated using CrowdSim.

Fig. 7. Comparison of the MSE metric when the UCSD training dataset is extended with CG images generated using CrowdSim.

In addition, we computed the numerical improvement between the performance with and without augmentation of the UCSD dataset (see Table II). Considering the full set of CrowdSim images, the improvement was approximately 11% in comparison to the original, non-augmented UCSD dataset (e.g., (1.3745 - 1.2279)/1.3745 ≈ 10.7%).

TABLE II
COMPARISON OF PERFORMANCE WHEN UCSD WAS EXTENDED WITH CG IMAGES.

Training Dataset        MAE     % improvement
UCSD 40%                1.3745  -
UCSD 40% + CG 20%       1.2649  7.9745
UCSD 40% + CG 40%       1.2468  9.2914
UCSD 40% + CG 60%       1.2467  9.2987
UCSD 40% + CG 80%       1.2438  9.5097
UCSD 40% + CG 100%      1.2279  10.6665

As expected, training the MCNN using only the CrowdSim dataset was not efficient when testing with any real dataset, indicating that some real images must be added to the training set to obtain better results. These tests used 100% of the samples of each dataset (Table III).

TABLE III
TRAINING THE MCNN USING ONLY THE CROWDSIM DATASET AND TESTING WITH REAL DATASETS.

Test dataset   MAE     MSE
UCSD           4.3384  43.2086
Students       1.4851  8.3711

B. Students and BioCrowds

As in the previous section, we compared the evolution of the MCNN performance in two situations: i) training only with Students (40%), and ii) extending the training dataset with CG images from BioCrowds (from 0% to 100% of the total of 735 images). In both cases, the images used were randomly selected. Figure 8 shows the evolution in terms of computed epochs: once again, the dataset extended with CG images improved the MCNN performance. Figures 9 and 10 illustrate the results; augmenting the training dataset with CG images provided significantly better performance than training without any extension. We also computed the numerical improvement between the performance with and without augmentation of the Students dataset (see Table IV). The next section presents some final considerations about this work.

TABLE IV
COMPARISON OF PERFORMANCE WHEN STUDENTS WAS EXTENDED WITH CG IMAGES. THE LAST COLUMN PRESENTS THE IMPROVEMENT IN COMPARISON TO NON-AUGMENTED DATA.

Training Dataset            MAE     % improvement
Students 40%                1.1087  -
Students 40% + CG 20%       1.0722  3.2921
Students 40% + CG 40%       1.0615  4.2572
Students 40% + CG 60%       1.0588  4.5007
Students 40% + CG 80%       1.0613  6.3768
Students 40% + CG 100%      1.0342  7.6305

Fig. 8. MCNN performance when training with Students only and when the dataset is extended with CG images generated using BioCrowds.

Fig. 9. Comparison of the MAE metric when the Students training dataset is extended with CG images generated using BioCrowds.

V. FINAL CONSIDERATIONS

In this paper, we described the results of using images from Virtual Human Simulation to extend pedestrian datasets in order to train machine learning techniques for VH counting and detection. To evaluate our proposal, we implemented the Multi-column Convolutional Neural Network (MCNN) presented by Zhang et al. [10]. We used four datasets: UCSD [11], also used in [10]; a new one called Students, recorded at our University; a synthetic one created with virtual human simulation similar to the UCSD dataset; and another synthetic dataset based on Students, generated automatically using BioCrowds.
We trained the MCNN with different samples of the datasets: UCSD only, UCSD+CG, Students only and Students+CG. The results after training with our extended datasets outperform the results of training with just the original dataset, i.e. there was an increase in network performance when using the CG-extended datasets. In particular, by augmenting UCSD with CG images we obtained approximately 10% improvement in MAE, while the improvement for Students was approximately 7%. These results are coherent with LCrowdV, whose authors report an improvement of around 7%. Of course, this number depends on the characteristics of the augmented dataset. The performance improvement observed for both tested datasets demonstrates the good generalization of the proposed approach. Moreover, the possibility of automatically generating ground truth for labeling datasets facilitates and optimizes the process of pedestrian detection and counting, reducing the arduous task of manually labeling videos.

We also tested two crowd simulators whose main difference is the way the animations are controlled. While in CrowdSim we manually designed experiments imitating the UCSD dataset, BioCrowds was automatically parametrized based on the dataset. Although further tests are necessary, the two simulators do not present differences in the learning process, since the rendering and visualization of the humans are done on the same platform. The only difference is the work required to animate the crowds, which is much easier when using BioCrowds. For future work, we intend to create, evaluate and make available new datasets of CG images simulating several sizes and densities of crowds. We also want to provide extended versions of other known datasets, such as the Mall crowd counting dataset [28].

Fig. 10. Comparison of the MSE metric when the Students training dataset is extended with CG images generated using BioCrowds.

REFERENCES

[1] A. Chan and N. Vasconcelos, "Bayesian poisson regression for crowd counting," in 12th IEEE ICCV, Sept 2009, pp. 545–551.
[2] Z. Cai, Z. L. Yu, H. Liu, and K. Zhang, "Counting people in crowded scenes by video analyzing," in 9th IEEE ICIEA, June 2014, pp. 1841–1845.
[3] E. Ermis, V. Saligrama, P. Jodoin, and J. Konrad, "Motion segmentation and abnormal behavior detection via behavior clustering," in 15th IEEE ICIP, Oct 2008, pp. 769–772.
[4] V. Mahadevan, W. Li, V. Bhalodia, and N. Vasconcelos, "Anomaly detection in crowded scenes," in CVPR 2010, June 2010, pp. 1975–1981.
[5] J. Shao, C. Loy, and X. Wang, "Scene-independent group profiling in crowd," in IEEE CVPR, June 2014, pp. 2227–2234.
[6] A. Chandran, L. A. Poh, and P. Vadakkepat, "Identifying social groups in pedestrian crowd videos," in ICAPR, Jan 2015, pp. 1–6.
[7] B. Solmaz, B. E. Moore, and M. Shah, "Identifying behaviors in crowd scenes using stability analysis for dynamical systems," IEEE PAMI, vol. 34, no. 10, pp. 2064–2070, Oct. 2012.
[8] B. Zhou, X. Tang, H. Zhang, and X. Wang, "Measuring crowd collectiveness," IEEE PAMI, vol. 36, no. 8, pp. 1586–1599, Aug 2014.
[9] E. Cheung, T. K. Wong, A. Bera, X. Wang, and D. Manocha, LCrowdV: Generating Labeled Videos for Simulation-Based Crowd Behavior Learning. Cham: Springer International Publishing, 2016, pp. 709–727. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-48881-3_50
[10] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma, "Single-image crowd counting via multi-column convolutional neural network," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 589–597.
[11] A. B. Chan, Z.-S. J. Liang, and N. Vasconcelos, "Privacy preserving crowd monitoring: Counting people without people models or tracking," in 2008 IEEE Conference on Computer Vision and Pattern Recognition, June 2008, pp. 1–7.
[12] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, Sept 2010.
[13] L. Fiaschi, U. Koethe, R. Nair, and F. A. Hamprecht, "Learning to count with regression forest and structured labels," in Proceedings of the 21st International Conference on Pattern Recognition (ICPR 2012), Nov 2012, pp. 2685–2688.
[14] C. Zhang, H. Li, X. Wang, and X. Yang, "Cross-scene crowd counting via deep convolutional neural networks," in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 833–841.
[15] Y. Hu, H. Chang, F. Nian, Y. Wang, and T. Li, "Dense crowd counting from still images with convolutional neural networks," J. Vis. Comun. Image Represent., vol. 38, no. C, pp. 530–539, Jul. 2016. [Online]. Available: http://dx.doi.org/10.1016/j.jvcir.2016.03.021
[16] Z. Ma and A. B. Chan, "Counting people crossing a line using integer programming and local features," IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 10, pp. 1955–1969, Oct 2016.
[17] B. Sheng, C. Shen, G. Lin, J. Li, W. Yang, and C. Sun, "Crowd counting via weighted VLAD on dense attribute feature maps," IEEE Transactions on Circuits and Systems for Video Technology, vol. PP, no. 99, pp. 1–1, 2016.
[18] P. Viola, M. J. Jones, and D. Snow, "Detecting pedestrians using patterns of motion and appearance," in Proceedings Ninth IEEE International Conference on Computer Vision, Oct 2003, pp. 734–741, vol. 2.
[19] V. Rabaud and S. Belongie, "Counting crowded moving objects," in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), vol. 1, June 2006, pp. 705–711.
[20] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, Sep. 1995. [Online]. Available: http://dx.doi.org/10.1023/A:1022627411411
[21] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in 2001 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, 2001, pp. I-511–I-518.
[22] L. Boominathan, S. S. S. Kruthiventi, and R. V. Babu, "Crowdnet: A deep convolutional network for dense crowd counting," in Proceedings of the 2016 ACM on Multimedia Conference, ser. MM '16. New York, NY, USA: ACM, 2016, pp. 640–644. [Online]. Available: http://doi.acm.org/10.1145/2964284.2967300
[23] E. Walach and L. Wolf, Learning to Count with CNN Boosting. Cham: Springer International Publishing, 2016, pp. 660–676. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-46475-6_41
[24] C. Gao, P. Li, Y. Zhang, J. Liu, and L. Wang, "People counting based on head detection combining Adaboost and CNN in crowded surveillance environment," Neurocomputing, vol. 208, pp. 108–116, 2016. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0925231216304660
[25] D. Oñoro-Rubio and R. J. López-Sastre, Towards Perspective-Free Object Counting with Deep Learning. Cham: Springer International Publishing, 2016, pp. 615–629. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-46478-7_38
[26] Z. Zhao, H. Li, R. Zhao, and X. Wang, Crossing-Line Crowd Counting with Two-Phase Deep Neural Networks. Cham: Springer International Publishing, 2016, pp. 712–726. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-46484-8_43
[27] P. Sourtzinos, S. A. Velastin, M. Jara, P. Zegers, and D. Makris, People Counting in Videos by Fusing Temporal Cues from Spatial Context-Aware Convolutional Neural Networks. Cham: Springer International Publishing, 2016, pp. 655–667. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-48881-3_46
[28] K. Chen, C. C. Loy, S. Gong, and T. Xiang, "Feature mining for localised crowd counting," in BMVC, vol. 1, no. 2, 2012, p. 3.
[29] N. P. H. Thian, S. Marcel, and S. Bengio, "Improving face authentication using virtual samples," in 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), vol. 3, April 2003, pp. III-233–236.
[30] J. Zuo, N. A. Schmid, and X. Chen, "On generation and analysis of synthetic iris images," IEEE Transactions on Information Forensics and Security, vol. 2, no. 1, pp. 77–90, March 2007.
[31] J. Galbally, R. Plamondon, J. Fierrez, and J. Ortega-Garcia, "Synthetic on-line signature generation. Part I: Methodology and algorithms," Pattern Recogn., vol. 45, no. 7, pp. 2610–2621, Jul. 2012. [Online]. Available: http://dx.doi.org/10.1016/j.patcog.2011.12.011
[32] J. Bins, L. L. Dihl, and C. R. Jung, "Target tracking using multiple patches and weighted vector median filters," J. Math. Imaging Vis., vol. 45, no. 3, pp. 293–307, Mar. 2013. [Online]. Available: http://dx.doi.org/10.1007/s10851-012-0354-y
[33] V. Cassol, J. Oliveira, S. R. Musse, and N. Badler, "Analyzing egress accuracy through the study of virtual and real crowds," in 2016 IEEE Virtual Humans and Crowds for Immersive Environments (VHCIE), March 2016, pp. 1–6.
[34] E. R. Galea, "A general approach to validating evacuation models with an application to EXODUS," Journal of Fire Sciences, vol. 16, no. 6, pp. 414–436, 1998.
[35] A. de Lima Bicho, R. A. Rodrigues, S. R. Musse, C. R. Jung, M. Paravisi, and L. P. Magalhães, "Simulating crowds based on a space colonization algorithm," Computers & Graphics, vol. 36, no. 2, pp. 70–79, Apr. 2012.
[36] D. Helbing and A. Johansson, Pedestrian, Crowd and Evacuation Dynamics. Springer New York, 2011.
[37] J. van den Berg, M. Lin, and D. Manocha, "Reciprocal velocity obstacles for real-time multi-agent navigation," in IEEE International Conference on Robotics and Automation, May 2008, pp. 1928–1935.