
A multi-agent framework for visual surveillance

1999, Proceedings 10th International Conference on Image Analysis and Processing

A Multi-agent Framework for Visual Surveillance

J. Orwell, S. Massey, P. Remagnino, D. Greenhill, G.A. Jones
School of Computer Science and Electronic Systems, Kingston University,
Penrhyn Road, Kingston, Surrey KT1 2EE, United Kingdom
Tel: +44 (0)181 547 7669; fax: +44 (0)181 547 7824
e-mail: J.Orwell, P.Remagnino, [email protected]

Abstract

We describe an architecture for implementing scene understanding algorithms in the visual surveillance domain. To achieve a high-level description of events observed by multiple cameras, many inter-related, event-driven processes must be executed. We use the agent paradigm to provide a framework in which these processes can be managed. Each camera has an associated agent, which detects and tracks moving regions of interest. This is used to construct and update object agents. Each camera is calibrated so that image co-ordinates can be transformed into ground plane locations. By comparing properties, two object agents can infer that they have the same referent, i.e. that two cameras are observing the same entity, and as a consequence merge identities. Each object's trajectory is classified with a type of activity, with reference to a ground plane agent. We demonstrate objects simultaneously tracked in two cameras, which infer this shared observation. The combination of the agent framework and the visual surveillance application provides an excellent environment for the development and evaluation of scene understanding algorithms.

1 Introduction

In Britain the $5 billion video security industry is enjoying phenomenal growth. Surprisingly, however, the more sophisticated video installations are little more than multi-monitor consoles with joystick control of multiplexed PTZ cameras. Until recently there has not been affordable processing power to implement more sophisticated intelligent algorithms. However, the recent availability of inexpensive Pentium processors has made it possible to develop smart assistant technologies for security staff which can be integrated into current practice.

In the proposed car park monitoring system, the physical environment is monitored by multiple cameras with overlapping viewpoints [4]. Events such as a vehicle or person passing through the environment may be simultaneously visible in more than one camera. Moreover, the object may move in and out of the view volumes of several cameras in a sequential manner as it progresses through the car park. An agent-based architecture for performing visual surveillance in a multi-camera environment is described. The loosely-coupled nature of the agent approach, and its ability to neatly partition responsibilities, is ideal for solving such complex issues as target handover between cameras, data fusion and the description of inter-object situations.

Much of the previous work on motion detection and tracking [1, 5, 9, 8, 6] has been based on pixel-differencing techniques which assume that objects move against a static scene (and hence a static camera). In recent MPEG-4 work [3, 7], the technique has been extended to moving cameras by generating a global motion field between frames, using this optic flow field to compensate for the motion, then performing pixel differencing as before. The high computational expense of this approach precludes real-time implementation in the near future.

Agent technology, originally formalised in the seminal paper of Shoham [11], is nowadays used in many disciplines.
Intelligent user interfaces, electronic commerce and communication systems in general have put much of the agent theory into practice. Agent technology has become an artificial intelligence paradigm in its own right, and computer vision [2, 10] has adopted it to implement intelligence more formally. The advantages of the agent methodology are clear: a team of agents provides a flexible, distributed platform on which parallel processing can take place in an optimised manner.

Computer-based visual surveillance represents an ideal application for agent technology. Cameras, stationary objects and moving objects can indeed be thought of as independent agents co-operating to infer the most plausible scene understanding. The agent paradigm provides an appropriate framework for orchestrating all the agents to optimise the interpretation of stationary and dynamic visual data.

Figure 1. Agent Architecture
Figure 2. Camera 1: Extracted Regions

Overview of Agent Architecture

Each camera is managed by its own camera agent, responsible for detecting and tracking moving objects as they traverse its view volume. The tracking algorithm is described in considerable detail in Section 2. For each event detected, the camera agent creates an object agent which is responsible for progressively identifying the nature of the event and describing the underlying activity. Three-dimensional positional and colour observation data of the event are continually streamed to each existing object agent. Each object agent uses this data to continuously update internal colour and trajectory models of the event (see Section 3). In addition, the object agent attempts to classify its dynamics against a number of pre-stored activities, e.g. people walking in the car park, people getting out of a vehicle, or vehicles entering the car park and parking.

The regions of interest in the scene (parking areas, entrances, exits, and vehicle, bicycle and pedestrian routes) are semantically labelled off-line and stored in the ground plane agent. This semantic labelling is used by an object agent to classify its activity. The ground plane agent manages the community of object agents and classifies their activity with reference to this labelling. Figuratively speaking, a community of camera agents continuously detects events and invokes object agents on the ground plane. The resultant population of object agents models both their motion within the semantically labelled ground plane and their interactions with other objects.

A number of important problems can be elegantly handled by requiring that all object agents periodically communicate with each other. The handover of events between cameras, and the fusion of data from multiple overlapping cameras, are facilitated by allowing neighbouring object agents to compare trajectory and colour data. When a pair of object agents determine that they refer to the same event, they are merged to create a single object agent which inherits both sets of camera data links. The internal colour and trajectory models are then updated from both sources of information. In the context of visual surveillance applications, many interesting activities are signalled by the close proximity of objects, for instance associating cars with their owners or identifying potential car theft.
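To make this overview concrete, the minimal Python sketch below shows one possible way of organising the three agent roles and the data streamed between them. The class names, method names and the simple spawn/merge bookkeeping are illustrative assumptions on our part, not the interface of the system described in this paper.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

@dataclass
class Observation:
    """One tracked event reported by a camera agent: ground-plane position,
    its covariance, and a coarse colour description of the image region."""
    camera_id: int
    position: Tuple[float, float]      # (x, y) on the ground plane
    covariance: List[List[float]]      # 2x2 positional covariance
    colour: List[float]                # e.g. a small colour histogram

@dataclass
class ObjectAgent:
    """Maintains trajectory and colour models for a single 3D event."""
    agent_id: int
    sources: Set[int] = field(default_factory=set)   # camera agents feeding it
    track: List[Observation] = field(default_factory=list)

    def update(self, obs: Observation) -> None:
        self.sources.add(obs.camera_id)
        self.track.append(obs)         # trajectory/colour models updated here

    def merge(self, other: "ObjectAgent") -> None:
        """Fuse with another agent once both are judged to observe the same event."""
        self.sources |= other.sources
        self.track += other.track

class GroundPlaneAgent:
    """Manages the community of object agents and classifies their activity
    against the semantically labelled regions of the ground plane."""
    def __init__(self, labelled_regions):
        self.labelled_regions = labelled_regions
        self.objects: List[ObjectAgent] = []

    def spawn(self) -> ObjectAgent:
        obj = ObjectAgent(agent_id=len(self.objects))
        self.objects.append(obj)
        return obj

class CameraAgent:
    """Detects and tracks moving regions; streams observations to object agents."""
    def __init__(self, camera_id: int, ground_plane: GroundPlaneAgent):
        self.camera_id = camera_id
        self.ground_plane = ground_plane

    def report(self, obs: Observation, target: ObjectAgent) -> None:
        target.update(obs)
```

In this reading, a camera agent asks the ground plane agent to spawn an object agent for each stable tracked event, and object agents periodically compare trajectories (Section 3.1) to decide whether to merge.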
When such situations arise, the agents of the proximal objects communicate to provide a basic interpretation of the interaction. Both the camera and object agents are described in greater detail below.

2 The Camera Agent

In the proposed architecture, an agent is assigned to each camera and has the responsibility of tracking events in its field of view. To perform this function reliably, a number of tasks need to be performed:
- Maintain the planar homography between the image plane and the ground plane.
- Learn and continuously update a model of the appearance of the scene.
- Detect and track moving events within the scene.
- Invoke an object agent for each stable tracked event in the view volume.
- Integrate within the tracker the 3D positional and colour feedback information supplied by each object model.

The role of the tracker is to associate moving objects between frames with a high degree of temporal and spatial coherence. In summary, the presented tracker works as follows. First, the pixels in each new frame are classified as moving or not moving by comparing pixel grey levels with an adaptive reference image. Next, moving regions are recovered and annotated with appropriate region statistics. Finally, an inter-frame matching procedure is used to establish the temporal continuity of the moving regions. This matching procedure is made robust to the spatial and temporal fragmentation exhibited by any segmentation technique based on pixel differencing. For each successfully matched region in the current image, a linear motion model is computed and updated using a simple α, β filter. Some examples of successfully tracked events caught in two cameras with overlapping views are shown in Figures 2 and 3. At 5 Hz, these camera agents recover dense tracks ideal for trajectory estimation on the ground plane. Note that although a linear motion model has been employed, the simple α, β motion filter is capable of tracking complex motions.

Figure 3. Camera 2: Extracted Regions

3 Scene Interpretation using Object Agents

Once invoked, each object agent is responsible for describing the appearance, motion and activity of the event on the ground plane. A data link between the camera agent and the object agent is established and maintained while the 3D event is being successfully tracked. Down this connection, the tracker pushes in real time both pixel data and the computed location of the event on the ground plane. Figure 4 shows the observations of two people walking within the fields of view of two cameras: the 2D image positions have been converted into ground plane co-ordinates and overlaid along with their associated covariance ellipses. Note that each person is represented by two almost coincident tracks.

Using the 3D data stream supplied by the camera, an object agent updates its estimate of the trajectory of the 3D event. A simple second-order trajectory model is employed which is updated from 3D observations using a Kalman filter formulation. The state variable X_i is composed of an acceleration, velocity and positional term for each axis on the ground plane:

X_i = [a_i, b_i, u_i, v_i, x_i, y_i]^T    (1)

where a_i and b_i are the acceleration components in the X and Y directions, u_i and v_i are the velocity components, and x_i and y_i are the positional components. The resultant trajectory information is employed in three ways. First, the trajectory information is used to identify and fuse object agents which track the same 3D event from different cameras. Second, each object uses the trajectory to classify its activity [9].
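As an illustration of this second-order model, the sketch below implements one predict/update cycle of a constant-acceleration Kalman filter over the state ordering of Equation (1), observing only the ground-plane position. The noise parameters and function names are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def constant_acceleration_matrices(dt):
    """State transition F and observation H for X = [a, b, u, v, x, y]^T,
    where only the ground-plane position (x, y) is observed."""
    F = np.eye(6)
    # velocities integrate acceleration; positions integrate velocity (+ 0.5*a*dt^2)
    F[2, 0] = dt; F[3, 1] = dt
    F[4, 2] = dt; F[5, 3] = dt
    F[4, 0] = 0.5 * dt**2; F[5, 1] = 0.5 * dt**2
    H = np.zeros((2, 6))
    H[0, 4] = 1.0; H[1, 5] = 1.0
    return F, H

def kalman_step(X, P, z, R, dt, q=1e-2):
    """One predict/update cycle from a ground-plane observation z with covariance R."""
    F, H = constant_acceleration_matrices(dt)
    Q = q * np.eye(6)                      # illustrative process noise
    # predict
    X = F @ X
    P = F @ P @ F.T + Q
    # update
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    X = X + K @ (z - H @ X)
    P = (np.eye(6) - K @ H) @ P
    return X, P
```

At the 5 Hz tracking rate reported above, dt would be 0.2 s between successive observations from a single camera.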
3.1 Trajectory Data Fusion in Multi-Camera Environments

Each camera which successfully tracks the same event in the 3D scene creates its own object agent, to which it subsequently streams pixel and 3D positional observations of the event. Consequently, agents which correspond to the same 3D event must have some mechanism for first identifying each other and second fusing. This is achieved by requiring all agents to compare trajectory information. Two agents which have highly similar trajectories are combined to form a new agent. The original data streams from each camera are inherited by the new agent from each of the merged agents; thus positional information will arrive from multiple cameras. Such a new agent will continue as before to update its trajectory from this combined source using the standard Kalman update equations.

Identical trajectories can be identified using a hypothesis test on the Mahalanobis distance between the location estimates in six-dimensional trajectory space. Given two trajectories X and X' and their respective covariances P and P', the normalised separation \lambda is given by

\lambda = (X - X')^T (P + P')^{-1} (X - X')    (2)

Objects whose trajectories are separated by a distance less than a threshold corresponding to a confidence level of 0.95 are considered identical. In fact, merging of agents is only initiated after the trajectories have been repeatedly determined as identical for T seconds (where T is currently set to 5 seconds). Introducing such an inertial characteristic ensures that the costly process of merging agents is not initiated by occasional gross noise conditions.

Figure 4. Fusing positional observations on the ground plane

4 Conclusion

A framework for visual surveillance has been presented, consisting of camera, object and activity agents. A camera agent processes the raw bitstream of observed data into foreground and background. The foreground data is segmented into regions and tracked through the image plane. Object agents compete to claim the regions as their reference, using chromatic and spatial data as evidence. Any persistent, unclaimed foreground data is used to spawn new objects. A camera-independent location is defined for each object by using a pre-calibrated mapping between each image plane and the observed ground plane. Object agents can interrogate each other's properties, which may be used to update their own, and thus the distributed representation of the scene interpretation. As an example, a formal technique for object fusion was presented, by which two object agents, spawned by different cameras, may ascertain whether they are actually the same object.

The agent framework is well suited to scene understanding. First, it provides a mechanism for binding together a set of tasks, all relating to a particular input, object or aspect of the whole problem. Second, it allows a clear specification of the interface between these sets, which, with careful design and implementation, permits an asynchronous, multi-platform instantiation. Third, it facilitates event-driven process control: agents are separate threads which are granted some autonomy. Using this framework, we are able to solve problems such as recognising that two cameras are observing the same object, and classifying the activity of these objects. There is scope for further problem solving using this framework. For example, the fusion of shape information from multiple cameras can be used for enhanced classification of object category.
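As a concrete recap of the fusion test of Section 3.1, the sketch below evaluates Equation (2) and applies a 0.95 chi-squared gate in six dimensions, merging only after the test has held for the required persistence time. The function names and the bookkeeping of elapsed time are illustrative assumptions on our part.

```python
import numpy as np
from scipy.stats import chi2

def normalised_separation(X1, P1, X2, P2):
    """Equation (2): Mahalanobis distance between two 6D trajectory estimates."""
    d = X1 - X2
    return float(d.T @ np.linalg.inv(P1 + P2) @ d)

# 0.95 chi-squared threshold in six-dimensional trajectory space (approx. 12.59)
GATE = chi2.ppf(0.95, df=6)

def should_merge(history, X1, P1, X2, P2, dt, persistence=5.0):
    """Merge only after the trajectories have tested as identical for `persistence` seconds."""
    if normalised_separation(X1, P1, X2, P2) < GATE:
        history['identical_for'] = history.get('identical_for', 0.0) + dt
    else:
        history['identical_for'] = 0.0
    return history['identical_for'] >= persistence
```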
Similarly, an object agent can transmit information back to a camera agent to enhance its tracking capabilities. The agent framework provides a suitable environment for the implementation of these functions.

References

[1] F. Bremond and M. Thonnat. "Tracking multiple non-rigid objects in a cluttered scene". In Scandinavian Conference on Image Analysis, pages 643-650, Lappeenranta, Finland, June 1997.
[2] H. Buxton and S. Gong. "Visual Surveillance in a Dynamic and Uncertain World". Artificial Intelligence, 78(1-2):431-459, 1995.
[3] E. François. "Rigid Layers Reconstruction Based on Motion Segmentation". In Workshop on Image Analysis for Multimedia Interactive Services, Louvain-la-Neuve, June 1997.
[4] G.A. Jones and P. Giaccone. "Hierarchical Tracking of Motion in Multiple Images". In IEE Colloquium on Multiresolution Modelling and Analysis in Image Processing and Computer Vision, London, April 1995.
[5] M. Hotter, R. Mester, and M. Meyer. "Detection of moving objects using robust displacement estimation including statistical error analysis". In IEEE International Conference on Pattern Recognition, Vienna, August 1996.
[6] J. Segen and S. Pingali. "A camera-based system for tracking people in real-time". In IEEE International Conference on Pattern Recognition, Vienna, August 1996.
[7] R. Mech and M. Wollborn. "A Noise Robust Method for 2D Shape Estimation of Moving Objects in Video Sequences Considering a Moving Camera". In Workshop on Image Analysis for Multimedia Interactive Services, Louvain-la-Neuve, June 1997.
[8] P.L. Rosin and T. Ellis. "Image difference threshold strategies and shadow detection". In Proceedings of the British Machine Vision Conference, 1995.
[9] P. Remagnino, J. Orwell, and G.A. Jones. "Visual Interpretation of People and Vehicle Behaviours using a Society of Agents". In Italian Association for Artificial Intelligence, Bologna, September 1999. To appear.
[10] P. Remagnino, T. Tan, and K. Baker. "Agent Orientated Annotation in Model Based Visual Surveillance". In Proceedings of the IEEE International Conference on Computer Vision, pages 857-862, Bombay, India, January 1998.
[11] Y. Shoham. "Agent-Oriented Programming". Artificial Intelligence, 60:51-92, 1993.