Preprint submitted to SPATIAL COGNITION AND COMPUTATION
Taylor & Francis, 2010
Describing Images using Qualitative Models
and Description Logics
Zoe Falomir, Ernesto Jiménez-Ruiz
M. Teresa Escrig, Lledó Museros
University Jaume I, Castellón, Spain
Our approach describes any digital image qualitatively by detecting regions/objects inside it and describing their visual characteristics (shape
and colour) and their spatial characteristics (orientation and topology) by
means of qualitative models. The description obtained is translated into a
description logic (DL) based ontology, which gives a formal and explicit
meaning to the qualitative tags representing the visual features of the objects
in the image and the spatial relations between them. For any image, our
approach obtains a set of individuals that are classified using a DL reasoner
according to the descriptions of our ontology.
Keywords: qualitative shape, qualitative colours, qualitative orientation, topology, ontologies, computer vision
1 Introduction
Using computers to extract visual information from space and to interpret it in a meaningful way, as human beings can do, remains a challenge. As digital images
represent visual data numerically, most image processing has been carried out by
applying mathematical techniques to obtain and describe image content.
From a cognitive point of view, however, visual knowledge about space is qualitative in nature (Freksa, 1991). The retinal image of a visual object is a quantitative
image in the sense that specific locations on the retina are stimulated by light of
a specific spectrum of wavelengths and intensity. However, the knowledge about
a retinal image that can be retrieved from memory is qualitative. We cannot re-
Author Posting. (c) ’Taylor & Francis’, 2011. This is the author’s version of the work.
It is posted here by permission of ’Taylor & Francis’ for personal use, not for redistribution.
The definitive version was published in Spatial Cognition & Computation, Volume 11
Issue 1, January 2011. (http://dx.doi.org/10.1080/13875868.2010.545611)
2 Preprint submitted to Spatial Cognition and Computation
trieve absolute locations, wavelengths and intensities from memory. We can only
recover certain qualitative relationships between features within the image or between image features and memory features. Qualitative representations of this
kind are similar in many ways to “mental images” (Kosslyn, 1994; Kosslyn et al.,
2006) that people report on when they describe what they have seen from memory
or when they attempt to answer questions on the basis of visual memories.
Extracting semantic information from images as human beings can do is still
an unsolved problem in computer vision. The approach presented here can describe any digital image qualitatively and then store the results of the description
as facts according to an ontology, from which new knowledge in the application
domain can be inferred by the reasoners. The association of meaning with the
representations obtained by robotic systems, also known as the symbol-grounding
problem, is still a prominent issue within the field of Artificial Intelligence (AI)
(Kuhn et al., 2007; Williams, 2008; Williams et al., 2009). Therefore, in order to
contribute in this field, the first application on which our approach has been tested
is extracting semantic information from images captured by a robot camera in indoor environments. This semantic information will support robot self-localization
and navigation in the future.
As Palmer (1999) points out, in an image, different colours usually indicate
different objects/regions of interest. Cognitively, this is the way people process
images. Therefore, in our approach, a graph-based region segmentation method
based on intensity differences (Felzenszwalb and Huttenlocher, 2004) 1 has been
used in order to identify the relevant regions in an image. Then the visual and
spatial features of the regions are computed. The visual features of each region are
described by qualitative models of shape (Falomir et al., 2008) and colour, while
the spatial features of each region are described by qualitative models of topology
(Egenhofer and Al-Taha, 1992) and orientation (Hernández, 1991; Freksa, 1992).
We have adopted description logics (DL) (Baader et al., 2003) as the formalism
for representing the low-level information from image analysis and we have chosen OWL 2 (Horrocks et al., 2003; Cuenca Grau et al., 2008), which is based on
the description logic SROIQ (Horrocks et al., 2006), as the ontology language.
This logic-based representation enables us to formally describe the qualitative features of our images. Our system also includes a DL reasoner, enabling objects
from the images and the images themselves to be categorized according to the
definitions incorporated into the ontology schema, which enhances the qualitative
description of the images with new inferred knowledge.
Description logics are fragments of first-order logic and therefore adopt the open world assumption (OWA) (Hustadt, 1994); that is, unlike databases, they assume that knowledge of the world is incomplete. In this
paper, the suitability of the OWA for our domain is analyzed and the cases where additional reasoning services or the closed world assumption (CWA) would be necessary are detected. Moreover, a partial solution for our setting is proposed.
1 More details in: http://people.cs.uchicago.edu/~pff/segment/
2 Ontology Web Language: http://www.w3.org/TR/owl2-syntax/
The remainder of this paper is organized as follows. Section 2 describes the
related work. Section 3 summarizes our approach for qualitative description of
images. Section 4 presents the proposed ontology-based representation of qualitative description of images. Section 5 shows the tests carried out by our approach
in a scenario where a robot navigates and then the results obtained are analysed.
Finally, Section 6 explains our conclusions and future work.
2 Related Work
Related studies have been published that extract qualitative or semantic information from images representing scenes (Socher et al., 1997; Lovett et al., 2006;
Qayyum and Cohn, 2007; Oliva and Torralba, 2001; Quattoni and Torralba, 2009).
Socher et al. (1997) provide a verbal description of an image to a robotic manipulator system so it can identify and pick up an object that has been previously
modelled geometrically and then categorized qualitatively by its type, colour, size
and shape. The spatial relations between the predefined objects detected in the
image are also described qualitatively. Lovett et al. (2006) propose a qualitative
description for sketch image recognition, which describes lines, arcs and ellipses
as basic elements and also the relative position, length and orientation of their
edges. Qayyum and Cohn (2007) divide landscape images using a grid for their
description so that semantic categories (grass, water, etc.) can be identified and
qualitative relations of relative size, time and topology can be used for image description and retrieval in databases. Oliva and Torralba (2001) obtain the spatial
envelope of complex environmental scenes by analysing the discrete Fourier transform of each image and extracting perceptual properties of the images (naturalness, openness, roughness, ruggedness and expansion) which enable classification
of images in the following semantic categories: coast, countryside, forest, mountain, highway, street, close-up and tall building. Quattoni and Torralba (2009) propose an approach for classifying images of indoor scenes in semantic categories
such as bookstore, clothing store, kitchen, bathroom, restaurant, office, classroom,
etc. This approach combines global spatial properties and local discriminative information (i.e. information about objects contained in the places) and uses learning distance functions for visual recognition.
We believe that all the studies described above provide evidence for the effectiveness of using qualitative/semantic information to describe images. However,
in the approach developed by Socher et al. (1997), a previous object recognition
process is needed before qualitatively describing the image of the scene the robot
manipulator has to manage, whereas our approach is able to describe the image of
the scene in front of the robot without this prior recognition process because the
object characterization is done afterwards using the definitions of our ontology.
The approach of Lovett et al. (2006) is applied to sketches, while our approach is
applied to digital images captured from the real robot environment. Qayyum and
Cohn (2007) use a grid to divide the image and describe what is inside each grid
square (grass, water, etc.), which is adequate for their application but the objects
are divided into an artificial number of parts that depend on the size of the cell,
while our approach extracts complete objects, which could be considered more
cognitive. The approach of Oliva and Torralba (2001) is useful for distinguishing
between outdoor environments. However, as this approach does not take into account local object information, it will obtain similar spatial envelopes for similar
images corresponding to the indoor environments where our robot navigates, such
as corridors in buildings. The approach of Quattoni and Torralba (2009) performs
well in recognizing indoor scenes, however it uses a learning distance function
and, therefore, it must be trained on a dataset, while our approach does not require
training.
There are related studies in the literature that examine the possible benefits
and challenges of using description logics (DL) as knowledge representation and
reasoning systems for high-level scene interpretation (Neumann and Möller, 2008;
Dasiopoulou and Kompatsiaris, 2010). Neumann and Möller (2008) also present
the limitations of current DL reasoning services in a complete scene interpretation.
In addition, they give some useful guidelines for future extensions of current DL
systems. Nevertheless, the use of DLs in image interpretation is still presented
as an open issue (Dasiopoulou and Kompatsiaris, 2010) because of their inherent
open world semantics.
Only a few approaches, using DL-based ontologies to enhance high-level image
interpretation, can be found in the literature (Maillot and Thonnat, 2008; Johnston
et al., 2008; Schill et al., 2009; Bohlken and Neumann, 2009). Maillot and Thonnat (2008) describe images using an ontology that contains qualitative features of
shape, colour, texture, size and topology and apply this description to the classification of pollen grains. In the work by Maillot and Thonnat (2008), the regions to
describe inside an image are segmented manually using intelligent scissors within
the knowledge acquisition tool, while in our approach they are extracted automatically. Regarding the ontology back-end, Maillot and Thonnat (2008) differentiate, as in our approach, three levels of knowledge; however, they
do not tackle the open world problem of image interpretation. Johnston et al.
(2008) present an ontology-based approach to categorize objects and communicate among agents. This approach was innovatively tested at the RoboCup tournament where it was used to enable Sony AIBO robots to recognize the ball and the
goal. Similarly to our approach, the authors adopt description logics to represent
the domain entities and they use a reasoner to infer new knowledge. In contrast to
our approach, the lighting conditions are controlled in the RoboCup tournament
and the shape and colour of the objects to search for (ball and goal) are known a
priori and are easy to locate using colour segmentation techniques. Moreover, this
work does not address the problems related to the open world assumption. Schill
et al. (2009) describe an interesting scene interpretation approach that combines
a belief theory with an OWL-like ontology based on DOLCE (Gangemi et al.,
2002). Identified objects are classified into the ontology concepts with a degree
of belief or uncertainty. This approach could be considered as complementary to
ours, and future extensions may consider the introduction of uncertainty. Bohlken
and Neumann (2009) present a novel approach in which a DL ontology is combined with the use of rules to improve the definition of constraints for scene interpretation. The use of rules enables them to combine the open world semantics
of DLs with closed world constraint validation. However, the use of rules may
lead to undecidability and so their use should be restricted (Motik et al., 2005;
Krötzsch et al., 2008). Our approach implements a simpler solution, although it
would be interesting to analyze extensions involving the use of rules. Finally, it
should be noted that our DL-ontology is not designed for a particular type of robot
or scenario. It is based on a general approach for describing any kind of image
detected by a digital camera.
Other interesting approaches are those that relate qualitative spatial calculus
with ontologies (Bhatt and Dylla, 2009; Katz and Cuenca Grau, 2005). Bhatt and
Dylla (2009) modelled spatial scenes using an ontology that represents the topological calculus RCC-8 and the relative orientation calculus OPRA. In contrast
to our approach, they do not address the problem of extracting and describing objects contained in digital images and their ontology is not based on DL. Katz and
Cuenca Grau (2005) exploit the correspondences among DL, modal logics and
the Region Connection Calculus RCC-8 in order to propose a translation of the
RCC-8 into DL.
Despite all the previous studies combining the extraction of qualitative/semantic
information and its representation using ontologies, the problem of bringing together low-level sensory input and high-level symbolic representations is still a
big challenge in robotics. Our approach is a small contribution to meeting this
challenge.
3 Our Approach for Qualitative Description of Images
The approach presented in this paper describes any image qualitatively by describing the visual and spatial features of the regions or objects within it.
For each region or object in the image, the visual features (Section 3.1) and
the spatial features (Section 3.2) are described qualitatively. As visual features,
our approach describes the shape and colour of the region, which are absolute
properties that only depend on the region itself. As spatial features, our approach
describes the topology and orientation of the regions, which are properties defined
with respect to other regions (i.e. containers and neighbours of the regions).
A diagram of the qualitative description obtained by our approach is shown in
Figure 1.
Figure 1: Structure of the qualitative image description obtained by our approach.
3.1 Describing Visual Features of the Regions in the Image
In order to describe the visual features of the regions or objects in an image, we
use the qualitative model of shape description formally defined in (Falomir et al.,
2008), which is summarized in Section 3.1.1, and a new model for qualitative
colour naming based on the Hue, Saturation and Lightness (HSL) colour space,
which is defined in Section 3.1.2.
3.1.1 Qualitative Shape Description
Given a digital image containing an object, our approach for Qualitative Shape
Description (QSD) first automatically extracts the boundary of this object using
colour segmentation (Felzenszwalb and Huttenlocher, 2004).
From all the points that define the boundary of each of the closed objects extracted, a set of consecutive points are compared by calculating the slope between
them. If the slope between a point Pi and its consecutive point Pi+1 , denoted by
s1 , and the slope between Pi and Pi+2 , termed s2 , are equal, then Pi , Pi+1 and
Pi+2 belong to the same straight segment. If s1 and s2 are not equal, Pi , Pi+1 and
Pi+2 belong to a curved segment. This process is repeated for a new point Pi+3 ,
calculating the slope between Pi and Pi+3 (s3 ), and comparing that slope with s1
and so on. The process stops when all the consecutive points of the boundary are
visited.
P is considered a relevant point if: (1) it belongs to a straight segment and it
is the point at which the slope stops being constant; or (2) it belongs to a curved
segment and it is the point at which the slope changes its sign.
Note that the points of a boundary that are considered consecutive are those
separated by a pre-established granularity step. For example, if the granularity
step is k, the first point considered (Pi ) will be point 1 of the set of boundary
points (P Set(1)), Pi+1 will be point k of the set of boundary points P Set(k),
Pi+2 will be P Set(2k), Pi+3 will be P Set(3k), etc. This granularity step is set
by experimentation as a function of the edge length of the described object: if the
edges are long, the granularity step will have a larger value; if they are short, the
granularity step will have a smaller value.
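As an illustration, the following sketch (covering only the straight-segment case, with helper names that are ours, not the authors') walks a closed boundary with granularity step k and flags the points where the slope stops being constant:

    import math

    def slope(p, q):
        # use the angle of the segment to avoid dividing by zero on vertical edges
        return math.atan2(q[1] - p[1], q[0] - p[0])

    def relevant_points(boundary, k, tol=1e-2):
        """Indices of candidate relevant points on a closed boundary (straight-segment case only)."""
        candidates = []
        i = 0
        while i + 2 * k < len(boundary):
            p_i, p_i1, p_i2 = boundary[i], boundary[i + k], boundary[i + 2 * k]
            s1 = slope(p_i, p_i1)          # slope between P_i and P_{i+1}
            s2 = slope(p_i, p_i2)          # slope between P_i and P_{i+2}
            if abs(s1 - s2) > tol:
                candidates.append(i + k)   # slope stops being constant at P_{i+1}
            i += k
        return candidates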
Finally, a set of relevant points, denoted by {P0, P1, ..., PN}, determines the shape of the object. Each of those relevant points P is described by a set of four features <KEC_P, A_P or TC_P, L_P, C_P>, which are defined below.
The first feature is the Kind of Edges Connected (denoted by KEC) and it
indicates the connection occurring at the relevant point P. This feature is described
by the following tags:
- line-line, if the point P connects two straight lines;
- line-curve, if P connects a line and a curve;
- curve-line, if P connects a curve and a line;
- curve-curve, if P connects two curves; or
- curvature-point, if P is a point of curvature of a curve.
If KEC is a line-line, line-curve, curve-line or curve-curve, the second feature
to consider is the Angle (denoted by A) at the relevant point. The angle is a quantitative feature that is discretized by using the Angle Reference System or ARS = {°, A_LAB, A_INT}, where degrees (°) indicates the unit of measurement of the angles; A_LAB refers to the set of labels for the angles; and A_INT refers to the values of degrees (°) related to each label. In our approach the A_LAB and A_INT used are:
A_LAB = {very acute, acute, right, obtuse, very obtuse}
A_INT = {(0, 40], (40, 85], (85, 95], (95, 140], (140, 180]}
On the other hand, if KEC is a curvature-point, the second feature is the Type
of Curvature (denoted by TC) at P, which is defined by the Type of Curvature Reference System or TCRS = {°, TC_LAB, TC_INT}, where ° refers to the amplitude in degrees of the angle given by the relation between the distances da and db (see Figure 2(a), where the type of curvature of the relevant point Pj is shown with respect to the relevant points Pj−1 and Pj+1), that is, Angle(Pj) = 2 arctan(da/db); TC_LAB refers to the set of labels for curvature; and TC_INT refers to the values of degrees (°) related to each label. In our approach the TC_LAB and TC_INT are:
TC_LAB = {very acute, acute, semicircular, plane, very plane}
TC_INT = {(0, 40], (40, 85], (85, 95], (95, 140], (140, 180]}
Figure 2: Characterization of Pj as: (a) a point of curvature, and (b) a point
connecting two straight segments.
The third feature considered is the compared length (denoted by L) which is
defined by the Length Reference System or LRS = {UL, L_LAB, L_INT}, where UL or Unit of compared Length refers to the relation between the length of the first edge and the length of the second edge connected by P, that is, ul = (length of 1st edge)/(length of 2nd edge); L_LAB refers to the set of labels for compared length; and L_INT refers to the values of UL related to each label.
L_LAB = {much shorter (msh), half length (hl), a bit shorter (absh), similar length (sl), a bit longer (abl), double length (dl), much longer (ml)}
L_INT = {(0, 0.4], (0.4, 0.6], (0.6, 0.9], (0.9, 1.1], (1.1, 1.9], (1.9, 2.1], (2.1, ∞)}
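The discretisation performed by the angle, curvature and length reference systems can be sketched with a single interval-lookup helper. The bounds below reproduce A_INT, TC_INT and L_INT as given above; the function and table names are illustrative, not taken from the authors' implementation.

    ANGLE_LABELS = [((0, 40), "very acute"), ((40, 85), "acute"),
                    ((85, 95), "right"), ((95, 140), "obtuse"),
                    ((140, 180), "very obtuse")]
    CURVATURE_LABELS = [((0, 40), "very acute"), ((40, 85), "acute"),
                        ((85, 95), "semicircular"), ((95, 140), "plane"),
                        ((140, 180), "very plane")]
    LENGTH_LABELS = [((0, 0.4), "much shorter"), ((0.4, 0.6), "half length"),
                     ((0.6, 0.9), "a bit shorter"), ((0.9, 1.1), "similar length"),
                     ((1.1, 1.9), "a bit longer"), ((1.9, 2.1), "double length"),
                     ((2.1, float("inf")), "much longer")]

    def qualitative_label(value, table):
        """Map a quantitative value to the label of the half-open interval (low, high] containing it."""
        for (low, high), label in table:
            if low < value <= high:
                return label
        return None

    # e.g. qualitative_label(90, ANGLE_LABELS) returns 'right'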
The last feature to be considered is the Convexity (denoted by C) at point P,
which is obtained from the oriented line built from the previous point to the next
point and by ordering the qualitative description of the shape clockwise. For example, in Figure 2(b), if point Pj is on the left of the segment defined by Pj−1 and Pj+1, then Pj is convex; otherwise Pj is concave.
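This convexity test can be sketched as follows; the code assumes standard mathematical axes (y growing upwards), so with image coordinates (y growing downwards) the sign of the test is reversed.

    def convexity(p_prev, p_j, p_next):
        # P_j is convex when it lies to the left of the oriented segment P_{j-1} -> P_{j+1}
        cross = ((p_next[0] - p_prev[0]) * (p_j[1] - p_prev[1]) -
                 (p_next[1] - p_prev[1]) * (p_j[0] - p_prev[0]))
        return "convex" if cross > 0 else "concave"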
Thus, the complete shape of an object is described as a set of qualitative descriptions of relevant points as follows:
[[KEC_0, A_0 | TC_0, L_0, C_0], ..., [KEC_n-1, A_n-1 | TC_n-1, L_n-1, C_n-1]]
where n is the total number of relevant points of the object.
Finally, note that the intervals of values that define the qualitative tags representing the features angle, type of curvature and compared length (A_INT, TC_INT and L_INT, respectively) have been calibrated according to our application and
system.
3 A_i | TC_i denotes the angle or the type of curvature occurring at the point Pi.
3.1.2 Qualitative Colour Description
Our approach for Qualitative Colour Description (QCD) translates the Red, Green
and Blue (RGB) colour channels into Hue, Saturation and Lightness (HSL) coordinates, which are more suitable for dividing into intervals of values corresponding to colour names.
In contrast to the RGB model, HSL is considered a more natural colour representation model as it is broken down according to physiological criteria: hue
refers to the pure spectrum colours and corresponds to dominant colour as perceived by a human; saturation corresponds to the relative purity or the quantity
of white light that is mixed with hue; and luminance refers to the amount of light
in a colour (Sarifuddin and Missaoui, 2005). Furthermore, as the W3C mentions,
additional advantages of HSL are that it is symmetrical to lightness and darkness
(which is not the case with HSV, for example). This means that: (i) in HSL, the
saturation component takes values from fully saturated colour to the equivalent
grey, but in HSV, considering the value component at the maximum, it goes from
saturated colour to white, which is not intuitive; and (ii) the lightness in HSL always spans the entire range from black through the chosen hue to white, while in
HSV, the value component only goes halfway, from black to the chosen hue.
From the HSL colour coordinates obtained, a reference system for qualitative colour description is defined as: QCRS = {UH, US, UL, QC_LAB1..5, QC_INT1..5}, where UH is the Unit of Hue; US is the Unit of Saturation; UL is the Unit of Lightness; QC_LAB1..5 refers to the qualitative labels related to colour; and QC_INT1..5 refers to the intervals of HSL colour coordinates associated with each colour label. For our approach, the QC_LAB and QC_INT sets are the following:
QCRS_LAB1 = {black, dark grey, grey, light grey, white}
QCRS_INT1 = {[0 ul, 20 ul[, [20 ul, 30 ul[, [30 ul, 40 ul[, [40 ul, 80 ul[, [80 ul, 100 ul[ / ∀ UH ∧ US ∈ [0, 20]}
QCRS_LAB2 = {red, yellow, green, turquoise, blue, purple, pink}
QCRS_INT2 = {]335 uh, 360 uh] ∧ [0 uh, 40 uh], ]40 uh, 80 uh], ]80 uh, 160 uh], ]160 uh, 200 uh], ]200 uh, 260 uh], ]260 uh, 297 uh], ]297 uh, 335 uh] / ∀ UL ∈ ]40, 55] ∧ US ∈ ]50, 100]}
QCRS_LAB3 = {pale + QCRS_LAB2}
QCRS_INT3 = {∀ UH ∧ US ∈ ]20, 50] ∧ UL ∈ ]40, 55]}
QCRS_LAB4 = {light + QCRS_LAB2}
QCRS_INT4 = {∀ UH ∧ US ∈ ]50, 100] ∧ UL ∈ ]55, 100]}
QCRS_LAB5 = {dark + QCRS_LAB2}
QCRS_INT5 = {∀ UH ∧ US ∈ ]50, 100] ∧ UL ∈ ]20, 40]}
4 See the CSS3 specification from the W3C (http://www.w3.org/TR/css3-color/#hsl-color)
The saturation coordinate of the HSL colour space (US) determines if the colour corresponds to the grey scale or to the rainbow scale: QCRS_LAB1 and QCRS_LAB2, respectively, in our QCRS. This coordinate also determines the intensity of the colour (pale or strong). The colours in the rainbow scale are considered the strong ones, whereas the pale colours are given an explicit name in QCRS_LAB3. The hue coordinate of the HSL colour space (UH) determines the division into colour names inside each scale. Finally, the lightness coordinate (UL) determines the luminosity of the colour: light and dark colours are given an explicit name in QCRS_LAB4 and QCRS_LAB5, respectively.
Note that the intervals of HSL values that define the colour tags (QC_INT1..5) have been calibrated according to our application and system.
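A minimal sketch of the colour naming is given below; it assumes hue in [0, 360] and saturation/lightness rescaled to [0, 100], and it simplifies the ordering of the checks (the full UL/US conditions of QCRS_INT1..5 above are slightly richer).

    import colorsys

    RAINBOW = [((335, 360), "red"), ((0, 40), "red"), ((40, 80), "yellow"),
               ((80, 160), "green"), ((160, 200), "turquoise"),
               ((200, 260), "blue"), ((260, 297), "purple"), ((297, 335), "pink")]
    GREYS = [(20, "black"), (30, "dark grey"), (40, "grey"), (80, "light grey"),
             (float("inf"), "white")]

    def colour_name(r, g, b):
        h, l, s = colorsys.rgb_to_hls(r / 255.0, g / 255.0, b / 255.0)
        uh, us, ul = h * 360.0, s * 100.0, l * 100.0
        if us <= 20:                                  # grey scale (QCRS_LAB1)
            return next(name for bound, name in GREYS if ul < bound)
        base = next(name for (lo, hi), name in RAINBOW
                    if lo < uh <= hi or (lo == 0 and uh == 0))
        if us <= 50:
            return "pale " + base                     # QCRS_LAB3
        if ul > 55:
            return "light " + base                    # QCRS_LAB4
        if ul <= 40:
            return "dark " + base                     # QCRS_LAB5
        return base                                   # QCRS_LAB2

    # e.g. colour_name(200, 30, 30) returns 'red'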
Colour identification depends on illumination, but HSL colour space deals with
lighting conditions through the L coordinate, which separates the lightness of the
colour while its corresponding hue or colour spectrum remains the same.
Finally, our approach obtains the qualitative colour of the centroid of each object detected in the image and the qualitative colour of the relevant points of its
shape, and the most frequent colour is defined as the colour of the object. Note
that colour patterns are not handled at all.
3.2 Describing Spatial Features of the Regions in the Image
The spatial features considered for any region or object in the image are its topology relations (Subsection 3.2.1) and its fixed and relative orientation (Subsection 3.2.2). Orientation and topology relations describe the situation of the
objects in the two-dimensional space regardless of the proximity of the observer
(robot/person) to them. Moreover, topology relations also implicitly describe the
relative distance between the objects.
3.2.1 Topological Description
In order to represent the topological relationships of the objects in the image, the
intersection model defined by Egenhofer and Franzosa (1991) for region configurations in R2 is used.
However, as information on depth cannot be obtained from digital images, the
topological relations overlap, coveredBy, covers and equal defined in Egenhofer
and Franzosa (1991) cannot be distinguished by our approach and are all substituted by touching.
Therefore, the topology situation in space (invariant under translation, rotation
and scaling) of an object A with respect to (wrt) another object B (A wrt B), is
described by:
Topology = {disjoint, touching, completely inside, container}
Our approach determines if an object is completely inside or if it is the container of another object. It also defines the neighbours of an object as all the other
objects with the same container. The neighbours of an object can be (i) disjoint
from the object, if they do not have any edge or vertex in common; or (ii) touching
the object, if they have at least one vertex or edge in common or if the Euclidean
distance between them is smaller than a certain threshold set by experimentation.
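A minimal sketch of these topology relations is given below, assuming each region boundary is available as a polygon; the shapely library and the threshold value are illustrative choices, not taken from the paper.

    from shapely.geometry import Polygon

    TOUCH_THRESHOLD = 3.0      # pixels; the paper sets this threshold by experimentation

    def topology(a: Polygon, b: Polygon) -> str:
        """Qualitative topology of region a with respect to region b."""
        if b.contains(a):
            return "completely inside"
        if a.contains(b):
            return "container"
        if a.touches(b) or a.distance(b) < TOUCH_THRESHOLD:
            return "touching"
        return "disjoint"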
3.2.2 Fixed and Relative Orientation Description
Our approach describes the orientation of the objects in the image using the model
defined by Hernández (1991) and the model defined by Freksa (1992).
A Fixed Orientation Reference System (FORS) is defined by using the model
by Hernández (1991), which obtains the orientation of an object A with respect to
(wrt) its container or the orientation of an object A wrt an object B, neighbour of
A. This reference system divides the space into eight regions plus the centre (Figure 3(a)), which are labelled as:
FORSlabels = {front (f), back (b), left (l), right (r), left front (lf), right front (rf),
left back (lb), right back (rb), centre (c)}
In order to obtain the fixed orientation of each object wrt another or wrt the
image, our approach locates the centre of the FORS on the centroid of the reference object and its front area is fixed to the upper edge of the image. The
orientation of an object is determined by the union of all the orientation labels
obtained for each of the relevant points of the object. If an object is located in all
the regions of the reference system, it is considered to be in the centre.
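A minimal sketch of the fixed-orientation labelling is shown below; the 45-degree sector boundaries and the handling of image coordinates are assumptions for illustration.

    import math

    FORS_LABELS = ["front", "right front", "right", "right back",
                   "back", "left back", "left", "left front"]

    def fixed_orientation(point, centroid):
        dx = point[0] - centroid[0]
        dy = centroid[1] - point[1]          # image y-axis grows downwards, so flip it
        angle = math.degrees(math.atan2(dx, dy)) % 360   # 0 degrees = front (towards the upper edge)
        return FORS_LABELS[int(((angle + 22.5) % 360) // 45)]

    def object_orientation(relevant_points, centroid):
        labels = {fixed_orientation(p, centroid) for p in relevant_points}
        return {"centre"} if len(labels) == len(FORS_LABELS) else labels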
A Relative Orientation Reference System (RORS) is defined by using Freksa's (1992) double cross orientation model. This model divides the space by means of
a Reference System (RS) which is formed by an oriented line determined by two
reference points a and b. The information that can be represented by this model
is the qualitative orientation of a point c wrt the RS formed by the points a and b,
that is, c wrt ab (Figure 3(b)). This model divides the space into 15 regions which
are labelled as:
RORSlabels = {left front (lf), straight front (sf), right front (rf), left (l),
identical front (idf), right (r), left middle (lm), same middle (sm), right middle
(rm), identical back left (ibl), identical back (ib), identical back right (ibr),
back left (bl), same back (sb), back right (br)}
In order to obtain the relative orientation of an object, our approach establishes
reference systems (RORSs) between all the pairs of disjoint neighbours of that
object. The points a and b of the RORS are the centroids of the objects that make
up the RORS. The relevant points of each object are located with respect to the corresponding RORS and the orientation of an object with respect to a RORS is calculated as the union of all the orientation labels obtained for all the relevant points of the object.
Figure 3: Models of Orientation used by our Approach for Image Description. (a) Hernandez's orientation model; (b) Freksa's orientation model and its iconical representation: ’l’ is left, ’r’ is right, ’f’ is front, ’s’ is straight, ’m’ is middle, ’b’ is back and ’i’ is identical.
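The double-cross classification itself can be sketched by three sign tests, as below; mapping the resulting triple onto the fifteen RORS labels follows Freksa (1992) and is omitted here, and the left/right sign assumes standard mathematical axes.

    def side(a, b, c):
        """-1 if c is right of the oriented line a->b, 0 if on it, +1 if left (mathematical axes)."""
        cross = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
        return (cross > 0) - (cross < 0)

    def double_cross(a, b, c):
        ab = (b[0] - a[0], b[1] - a[1])
        lateral = side(a, b, c)                                       # left / on line / right
        beyond_b = (c[0] - b[0]) * ab[0] + (c[1] - b[1]) * ab[1]      # > 0: in front of b
        before_a = (c[0] - a[0]) * ab[0] + (c[1] - a[1]) * ab[1]      # < 0: behind a
        return (lateral,
                (beyond_b > 0) - (beyond_b < 0),
                (before_a > 0) - (before_a < 0))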
Note that, as we have already mentioned, in our approach, orientation relations
between the objects in the image are structured in levels of containment. The fixed
orientation (Hernández, 1991) of a region is defined with respect to its container
and neighbours of level, while the relative orientation of a region (Freksa, 1992)
is defined with respect to its disjoint neighbours of level. Therefore, as the spatial
features of the regions are relative to the other regions in the image, the number
of spatial relationships that can be described depends on the number of regions
located at the same level of containment, as shown in Table 1. The advantage
of providing a description structured in levels of containment is that the level of
detail to be extracted from an image can be selected. For example, the system can
extract all the information in the image or only the information about the objects
whose container is the image and not another object, which could be considered a
more general or abstract description of the image.
Table 1: Spatial features described depending on the number of objects at each level

Spatial Features Described                      Objects within the same container
                                                    1       2       >2
Wrt its Container      Topology                     x       x        x
                       Fixed Orientation            x       x        x
Wrt its Neighbours     Topology                             x        x
                       Fixed Orientation                    x        x
                       Relative Orientation                          x

The reason for using two models for describing the orientation of the objects or regions in the image is the different kind of information each provides. According to the classification of reference frames by Hernández (1991), we can consider that:
• the reference system or frame in the model developed by Hernández (1991) is intrinsic because the orientation is given by some inherent property of the reference object. This property is defined by our approach by fixing the object front to the upper edge of the image. Therefore, the orientations provided by this model are implicit because they refer to the intrinsic orientation of the parent object or the object of reference. Here, implicit and intrinsic orientations coincide as the front of all the objects is fixed to the same location a priori. Therefore, the point of view is influenced by the orientation of the image given by an external observer.
• in the model developed by Freksa (1992), an explicit reference system or
frame is necessary to establish the orientation of the point of view with respect to the reference objects. Moreover, this reference system is extrinsic,
since an oriented line imposes an orientation and direction on the reference
objects. However, the orientation between the objects involved is invariant
to the orientation of the image given by an external observer, because even if
the image rotates, the orientations obtained by our RORS remain the same.
Therefore, in practice, considering both models, our approach can: (i) describe
the implicit orientations of the objects in the image from the point of view of an
external observer (robot camera) and regardless of the number of objects within
the image, and (ii) describe complex objects contained in the image (which must
be composed of at least three objects or regions) in an invariant way, that is, regardless of the orientation of the image given by an external observer (which could
be very useful in a vision recognition process in the near future).
3.3 Obtaining a Qualitative Description of Any Digital Image
The approach presented here qualitatively describes any image by describing the
visual and spatial features of the main regions of interest within it.
In an image, different colours or textures usually indicate different regions of
interest to the human eye. Therefore, image region segmentation approaches are
more cognitive than edge-based image segmentation approaches because the extracted edges are defined by the boundaries between regions and all of them are
closed (Palmer, 1999).
In our approach, the regions of interest in an image are extracted by a graphbased region segmentation method (Felzenszwalb and Huttenlocher, 2004) based
on intensity differences complemented by algorithms developed to extract the
boundaries of the segmented regions. Felzenszwalb and Huttenlocher (2004) mention that the problems of image segmentation and grouping remain great challenges for computer vision because a useful segmentation method has to: (i)
capture perceptually important groupings or regions, which often reflect global
aspects of the image; and (ii) be highly efficient, running in nearly linear time in
the full number of image pixels. The approach developed by Felzenszwalb and
Huttenlocher (2004) is suitable for our approach because it meets the above criteria and it also preserves detail in low-variability image regions while ignoring
detail in high-variability regions through the adjustment of its segmentation parameters: σ, used to smooth the input image before segmenting it; k, the value for the threshold function in segmentation (the larger the value, the larger the components in the result); and min, the minimum size of the extracted regions in pixels, enforced by post-processing.
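As an illustration, the scikit-image port of this algorithm exposes the same three parameters (scale plays the role of k). The sketch below, with a hypothetical input file name and the parameter values later reported for Figure 4, extracts the labelled regions and their contours:

    import numpy as np
    from skimage import io, measure, segmentation

    image = io.imread("corridor.jpg")    # hypothetical input file
    labels = segmentation.felzenszwalb(image, sigma=0.4, scale=500, min_size=1000)

    # One contour set per segmented region; these boundary points feed the
    # qualitative shape, colour, topology and orientation descriptions.
    regions = []
    for label in np.unique(labels):
        mask = (labels == label).astype(float)
        contours = measure.find_contours(mask, 0.5)
        regions.append((label, contours))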
Once all the relevant colour regions of an image are extracted, image processing
algorithms that obtain the contour of each region are applied. Then, our approach
describes the shape, colour, topology and orientation of each region as explained
above. As an example, a digital image of the corridors of a building of our university where our offices are located is presented in Figure 4 and its qualitative
description is presented in Table 2.
Figure 4: Schema of our approach for qualitative image description.
Figure 4 shows the original image and the image obtained after extracting
the regions obtained by Felzenszwalb and Huttenlocher’s (2004) segmentation
method (using as segmentation parameters: σ = 0.4, k = 500 and min = 1000)
and the contour of these regions by our boundary extraction algorithms. Note that
some objects in the original image (such as the door sign, the door handles and
the electrical socket) are not extracted in the segmentation process because of their
small size (< 1000 pixels).
Table 2: An excerpt of the qualitative description obtained for the image in Figure
4.
[SpatialDescription,
[ 1, [Container, Image], [Orientation wrt Image: front, front left, back left, back],
[touching, 2, 8, 9, 13], [disjoint, 0, 3, 4, 5, 6, 7, 11, 12], [completely inside, 10],
[Orientation wrt Neighbours: [0, front right, right, back right, back], [2, left, back, back left], (...)
[Relative Orientation wrt Neighbours Disjoint: [[0, 4], rm], (...) [[4, 7], br, rf], (...) [[11, 12], rm, rf]]
](...)
[ 10, [[Container, 1] [Orientation wrt 1: left, back left, back],
[None Neighbours of Level] ]
](...)]
[VisualDescription,
[ 7, dark-grey,
[Boundary Shape,
[line-line, right, much shorter, convex]
[line-line, right, much longer, convex],
[line-line, right, half length, convex],
[line-line, right, much longer, convex],
[Vertices Orientation, front, back, back, front]],
], (...)
[ 10, dark-red,
[Boundary Shape,
[line-line, obtuse, half length, convex],
(...)
[line-line, very obtuse, similar length, convex]]
[Vertices Orientation, front, front, front right, right, back right (...) ]],
], (...)]
Table 2 presents an excerpt of the qualitative description obtained from the
digital image presented in Figure 4. Specifically, this table shows the qualitative
spatial description of regions 1 and 10 and the qualitative visual description of
regions 7 and 10. Note that the object identifiers correspond to those located on
the objects in Image regions.jpg in Figure 4.
The spatial description of region 1 can be intuitively read as follows: its container is the Image and it is located wrt the Image at front, front left, back,
back left. Its touching neighbours are the regions 2, 8, 9, 13 (Note that some of
these are not technically touching but are closer to region 1 than the threshold set
by experimentation for our application). Its disjoint neighbours are the regions 0,
3, 4, 5, 6, 7, 11 and 12 and finally, object 10 is completely inside 1. The fixed
orientation of region 1 wrt region 0 is front right, right, back right, back, wrt
region 2 it is left, back, back left, wrt region 3 it is back right and in a similar
way, the fixed orientation of region 1 is described wrt all its neighbours of level.
Finally, the relative orientation wrt the disjoint neighbours of region 1 is given:
from region 0 to region 4, region 1 is located right middle (rm); from region 4 to
region 7, region 1 is located back right (br) and also right front (rf), from region
11 to region 12, region 1 is located right middle (rm) and right front (rf).
The spatial description of region 10 is also given in Table 2: its container is
region 1 with respect to which it is located at left, back left, back. Region 10 has
no neighbours of level, as it is the only region contained by region 1.
The visual description of region 7 in Table 2 shows that its colour is dark grey
and that the shape of its boundary is qualitatively described as composed of four
line-line segments whose angles are all right and convex and whose compared lengths are, respectively, much shorter, much longer, half length, much longer. Finally,
the orientation of its vertices with respect to the centroid of the region is in a
clockwise direction: front, back, back, front. Note that region 10 is described
similarly.
4 Formal Representation of Qualitative Descriptions
Our approach describes any image using qualitative information, which is both
visual (e.g. shape, colour) and spatial (e.g. topology, orientation). Here the use
of ontologies is proposed in order to give a formal meaning to the qualitative
labels associated with each object. Thus, ontologies will provide a logic-based
representation of the knowledge within the robot system.
An ontology is a formal specification of a shared conceptualization (Borst et al.,
1997) providing a non-ambiguous and formal representation of a domain. Ontologies usually have specific purposes and are intended for use by computer applications rather than humans. Therefore, ontologies should provide a common vocabulary and meaning to allow these applications to communicate with each other
(Guarino, 1998).
The aim of using a description logics (DL) based ontology is to enhance image
interpretation and classification. Furthermore, the use of a common vocabulary
and semantics is also intended to facilitate potential communication between agents.
The main motives for using DL-based ontologies within our system are:
• Symbol Grounding. The association of the right qualitative concept with
quantitative data (a.k.a. symbol grounding) and the precise relationships
between qualitative concepts is still an open research line (Williams, 2008;
Williams et al., 2009). The description logic family was originally called
terminological or concept language due to its concept-centred nature. Thus,
DL-based ontologies represent a perfect formalism for providing high-level
representations of low-level data (e.g. digital image analysis).
• Knowledge sharing. The use of a common conceptualization (vocabulary
and semantics) may enhance communication between agents involved in
performing similar tasks (e.g. searching for a fire extinguisher in a university environment). Moreover, the adoption of a standard ontology language
gives our approach a mechanism for publishing our qualitative representations of images so that they can be reused by other agents.
• Reasoning. The adoption of a DL-based representation allows our approach to use DL reasoners that can infer new knowledge from explicit
descriptions. This gives some freedom and flexibility when inserting new
facts (e.g. new image descriptions), because new knowledge can be automatically classified (e.g. a captured object is a door, a captured image contains
a fire extinguisher).
In this section, we present QImageOntology, a DL-based ontology to represent qualitative descriptions of images (see Section 4.1), and how we have dealt
with the Open World Assumption (OWA) (Hustadt, 1994) in order to infer the
expected knowledge (see Section 4.2).
4.1 Three-Layer Representation for QImageOntology
QImageOntology has adopted DL and OWL as the formalisms for representing the qualitative descriptions extracted from the images. QImageOntology
was developed using the ontology editor Protégé 4. Additionally, the DL reasoner HermiT was used for classifying newly captured images and for inferring
new knowledge.
DL systems make an explicit distinction between the terminological or intensional knowledge (a.k.a. Terminological Box or TBox), which refers to the general knowledge about the domain, and the assertional or extensional knowledge
(a.k.a. Assertional Box or ABox), which represents facts about specific individuals. QImageOntology also makes a distinction between the general object
descriptions and the facts extracted from concrete images. Additionally, our approach includes a knowledge layer within the TBox dealing with contextualized
object descriptions (e.g. a UJI office door).
This three-layer architecture is consistent with our purposes and image descriptions are classified with the TBox part of QImageOntology. Moreover, the contextualized knowledge can be replaced to suit a particular scenario or environment (e.g. Jaume I University, Valencia City Council).
5 Available at: http://krono.act.uji.es/people/Ernesto/qimage-ontology/
6 Protégé: http://protege.stanford.edu
7 HermiT: http://hermit-reasoner.com/
Thus, the three-layer architecture is composed of:
• a reference conceptualization, which is intended to represent knowledge
(e.g. the description of a Triangle or the assertion of red as a Colour type)
that is supposed to be valid in any application. This layer is also known as
top level knowledge by the community;
• the contextualized knowledge, which is application oriented and is mainly
focused on the specific representation of the domain (e.g. characterization
of doors at Jaume I University) and could be in conflict with other contextbased representations; and
• the image facts, which represent the assertions or individuals extracted from
the image analysis, that is, the set of particular qualitative descriptions.
It is worth noting that the knowledge layers of QImageOntology are considered to be three different modules and they are stored in different OWL files.
Nevertheless, both the contextualized knowledge and the image facts layers are
dependent on the reference conceptualization layer, and thus they perform an explicit import of this reference knowledge.
Currently, the reference conceptualization and contextualized knowledge layers
of QImageOntology have a SHOIQ DL expressiveness and contain: 51 concepts (organized into 80 subclass axioms, 14 equivalent axioms and 1 disjointness), 46 object properties (characterized with 30 subproperty axioms, 5 property
domain axioms, 10 property range axioms, 19 inverse property axioms, and 2 transitive properties), and 51 general individuals (with 51 class assertion axioms and
1 different individual axiom).
An excerpt of the reference conceptualization of QImageOntology is presented in Table 3: partial characterizations of an Object type, the definition of
a Shape type as a set of at least 3 relevant points, the definition of a Quadrilateral
as a Shape type with exactly 4 points connecting two lines and so on.
Table 3: Excerpt from the Reference Conceptualization of QImageOntology
α1  Image type ⊑ ∃is container of.Object type
α2  Object type ⊑ ∃has colour.Colour type
α3  Object type ⊑ ∃has fixed orientation.Object type
α4  Object type ⊑ ∃is touching.Object type
α5  Object type ⊑ ∃has shape.Shape type
α6  Shape type ⊑ ≥ 3 has point.Point type
α7  Quadrilateral ≡ Shape type ⊓ = 4 has point.line line
α8  is left ⊑ has fixed orientation
α9  Colour type : red

Table 4 represents an excerpt from the contextualized knowledge of QImageOntology, where four objects are characterized: (1) the definition of the wall of our corridor (UJI Wall) as a pale yellow, dark yellow, pale red or light grey object contained by the image; (2) the definition of the floor of the corridor (UJI Floor) as a pale red object located inside the image and located at back right, back left or back but not at front, front right or front left with respect to the centre of the image; (3) the definition of an office door (UJI Office Door) as a grey or dark grey quadrilateral object located inside the image; (4) the definition of a fire extinguisher (Fire Extinguisher) as a red or dark red object located inside a UJI Wall. Note that the contextualized descriptions are rather preliminary and they should be refined in order to avoid ambiguous categorizations.
8 We have created this knowledge layer from scratch. In the near future it would be interesting to integrate our reference conceptualization with standards such as MPEG-7 for which an OWL ontology is already available (Hunter, 2001, 2006), or top-level ontologies such as DOLCE (Gangemi et al., 2002).
Table 4: Excerpt from the Contextualized Descriptions of QImageOntology
β1  UJI Wall ≡ Object type ⊓ ∃has shape.Quadrilateral ⊓ ∃is completely inside.Image type ⊓
      (∋ has colour.{pale yellow} ⊔ ∋ has colour.{dark yellow} ⊔ ∋ has colour.{pale red} ⊔ ∋ has colour.{light grey})
β2  UJI Floor ≡ Object type ⊓ ∃is completely inside.Image type ⊓
      (∋ has colour.{pale red} ⊔ ∋ has colour.{light grey}) ⊓ ∃is back.Image ⊓ ¬(∃is front.Image)
β3  UJI Office Door ≡ Object type ⊓ ∃has shape.Quadrilateral ⊓ ∃is completely inside.Image type ⊓
      (∋ has colour.{grey} ⊔ ∋ has colour.{dark grey})
β4  UJI Fire Extinguisher ≡ Object type ⊓ ∃is completely inside.UJI Wall ⊓
      (∋ has colour.{red} ⊔ ∋ has colour.{dark red})
4.2 Dealing with the Open World Assumption (OWA)
Currently one of the main problems that users face when developing ontologies is
the confusion between the Open World Assumption (OWA) and the Closed World
Assumption (CWA) (Hustadt, 1994; Rector et al., 2004). Closed world systems
such as databases or logic programming (e.g. PROLOG) consider anything that
cannot be found to be false (negation as failure). However, Description Logics
(and therefore OWL) assume an open world, that is, nothing is taken to be true or false unless it can be proved (e.g. two concepts may overlap unless they are declared as disjoint, and a fact missing from the knowledge base cannot be assumed to be false). However, some scenarios such as image interpretation, where the set of
relevant facts are known, may require closed world semantics.
In our scenario, the OWA problem arose when characterizing concepts such
as Quadrilateral (see axiom α7 from Table 3), where individuals belonging to
this class should be a Shape type and have exactly four sides (i.e. four connected
points). Intuitively, one would expect shape1 from Table 5 to be classified as a Quadrilateral according to axioms γ1 − γ7 and α7 from Table 3. However,
the reasoner cannot make such an inference. The open world semantics have a direct influence in this example since the reasoner is unable to guarantee that shape1
is not related to more points.
Table 5: Basic image facts for a shape
γ1  Object type : object1
γ2  Shape type : shape1
γ3  has shape(object1, shape1)
γ4  has point(shape1, point1)
γ5  has point(shape1, point2)
γ6  has point(shape1, point3)
γ7  has point(shape1, point4)
In the literature there are several approaches that have attempted to overcome
the OWA problem when dealing with data-centric applications. These approaches
(Grimm and Motik, 2005; Motik et al., 2009; Sirin et al., 2008; Tao et al., 2010a,b)
have mainly tried to extend the semantics of OWL with non-monotonic features
such as Integrity Constraints (IC). Thus, standard OWL axioms are used to obtain
new inferred knowledge with open world semantics whereas ICs validate instances using closed world semantics. These approaches have also tried to translate
IC validation into query answering using rules (e.g., SWRL, SPARQL) in order to
make use of the existing reasoning machinery. Nevertheless, as already discussed
in the literature, the use of rules may lead to undecidability, so the expressivity of
the rules must be restricted (Motik et al., 2005; Krötzsch et al., 2008).
Our approach has a partial and much simpler solution that overcomes the OWA
limitations for our particular setting. We have restricted the domain of interpretation for each image with the following OWL 2 constructors:
• Nominals. We consider that all the relevant facts for an image are known,
thus, for each image, QImageOntology concepts are closed using an extensional definition with nominals. For example, the class Point type is
defined as a set of all points recognized within the image (see axiom γ8
from Table 6 for an image with only 7 points).
• Negative property assertion axioms explicitly define that an individual is
not related to other individuals through a given property. In our example,
the potential quadrilateral individual must have four connected points, but
there must also be an explicit declaration that it does not have any more
associated points (see axioms γ9 − γ11 from Table 6).
• Different axioms for each individual. OWL individuals must be explicitly
defined as different with the corresponding axioms, otherwise they may be
considered as the same fact, since OWL does not follow the Unique Name
Assumption (UNA). In our example points point1-point4 should be declared
as different (see axiom γ12 ) in order to be interpreted as four different points
for the quadrilateral individual.
Table 6: Closing the world for a shape
γ8   Point type ⊑ {point1, point2, point3, point4, point5, point6, point7}
γ9   ¬has point(shape1, point5)
γ10  ¬has point(shape1, point6)
γ11  ¬has point(shape1, point7)
γ12  point1 ≉ point2 ≉ point3 ≉ point4
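The closure axioms of Table 6 can be generated mechanically for each image. The sketch below emits them as OWL 2 functional-syntax strings; the underscored entity names and the helper itself are illustrative assumptions, not the paper's implementation.

    def close_shape(shape, used_points, all_points):
        """Emit the extra axioms that close the world for one shape (OWL 2 functional syntax)."""
        axioms = []
        # Nominal closure of Point_type for this image (axiom gamma_8)
        axioms.append("SubClassOf(:Point_type ObjectOneOf({}))".format(
            " ".join(":" + p for p in all_points)))
        # Negative property assertions for the points the shape does not use (gamma_9 - gamma_11)
        for p in all_points:
            if p not in used_points:
                axioms.append(
                    "NegativeObjectPropertyAssertion(:has_point :{} :{})".format(shape, p))
        # The used points are pairwise different (gamma_12)
        axioms.append("DifferentIndividuals({})".format(
            " ".join(":" + p for p in used_points)))
        return axioms

    # e.g. close_shape("shape1", ["point1", "point2", "point3", "point4"],
    #                  ["point%d" % i for i in range(1, 8)])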
It is worth mentioning that QImageOntology also defines, in its reference conceptualization layer, disjoint, range and domain axioms in order to make explicit
that two concepts do not overlap and to restrict the use of the properties within the
proper concept (e.g. has point only links Shape type with Point type).
In summary, our approach proposes restricting/closing the world for each particular image using the above constructors within the image facts layer of QImageOntology. The number of extra axioms to add is reasonable for our setting, where processed images contain about 200 concrete individuals with 150 class assertions, 1700 object property assertions, 1000 negative property assertions and 5 different individual axioms.
9 It is well known that the use of nominals makes reasoning more difficult (Tobies, 2001); however, in this case each image contains a relatively small number of individuals.
5 Experimentation and Results
As explained in the previous sections, for any digital image, our approach obtains
a qualitative image description and a set of facts according to QImageOntology.
In Section 5.1, we present how our approach has been implemented and we also
describe the results obtained. In Section 5.2, the tests done in different situations
within our robot scenario and the evaluation method used are explained. In Section
5.3, the results obtained are analysed.
5.1 Implementation of Our Approach and Description of the Results Obtained
Figure 5 shows the structure of our approach: it obtains the main regions or objects that characterize any digital image, describes them visually and spatially by
using qualitative models of shape, colour, topology and orientation and obtains a
qualitative description of the image in a flat format (see Table 2) and also as a set
of OWL ontology facts.
Figure 5: Method overview
The ontology facts obtained (image facts layer), together with the reference
conceptualization layer and the contextualized knowledge layer, have been automatically classified using the ontology reasoner HermiT, although another reasoner (e.g. FaCT++ or Pellet) could have been used. The new inferred knowledge is intended to be reused in the near future by the robot in order to support
the decision-making process in localization and navigation tasks.
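A minimal sketch of this classification step is given below. It uses owlready2 for illustration only (its sync_reasoner call runs HermiT); the file names and the underscored class name are assumptions, not taken from the paper.

    from owlready2 import get_ontology, sync_reasoner

    reference = get_ontology("file://qimage-reference.owl").load()     # hypothetical file names
    context = get_ontology("file://qimage-uji-context.owl").load()
    facts = get_ontology("file://image-0042-facts.owl").load()

    with facts:
        sync_reasoner()           # runs HermiT and adds the inferred class memberships

    # Individuals inferred to belong to a contextualized concept
    for door in context.UJI_Office_Door.instances():
        print(door.name)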
As an example, note that from the qualitative description of the digital image in
Figure 5 (shown in Table 2) and the contextualized descriptions shown in Table 4
the reasoner infers that Object 1 is a UJI Wall as it is a pale yellow object located
completely inside the image, that Objects 7 and 8 are UJI Office doors as they are
dark grey quadrilaterals located completely inside the image, that Object 10 is a
UJI Fire Extinguisher as it is a dark red object located completely inside a UJI
Wall (Object 1), and finally, that Object 12 is a UJI Floor as it is a pale red object
situated back left and back right with respect to the centre of the image.
5.2 Testing and Evaluating Our Approach
A collection of digital images extracted from the corridors of our building at
Jaume I University (UJI) (our robot scenario) have been processed by our approach and new information has been inferred. Table 7 presents a selection of the
tests, where images containing regions classified by our reasoner as UJI Walls,
UJI Office Doors, UJI Fire Extinguishers and UJI Floors are shown.
Our testing approach and evaluation method is described next. First, our robot
explored our scenario with its camera and more than 100 photographs were taken
at different locations and points of view with the aim of finding out what kind of
objects could be segmented and described by our approach. Walls, floors, office
doors, dustbins, fire extinguishers, electrical sockets, glass windows, etc. were
properly qualitatively described. The walls, the floor, the office doors and the fire
extinguishers were selected as the objects of interest and we adjusted the parameters of the segmentation method to the specific lighting conditions of each test and
to define the minimum size of the objects to capture. Second, around 30 photos
containing those objects at different locations and points of view were selected
and described qualitatively and using description logics. The proper classification
of the ontology facts obtained in accordance with QImageOntology was checked
using Protégé as a front-end and the HermiT reasoner. Around 80% of the selected
photos (25/30) were correctly classified and some borderline cases appeared because of:
• the adjustment of segmentation parameters: in some cases a door region is joined to a wall region and extracted as a single region whose shape is not a quadrilateral; therefore, the door cannot be characterized by our approach.
• the colour identification: some extracted regions can be composed of more than one colour; for example, a fire extinguisher can contain the same quantity of dark-red and black relevant points or pixels, and therefore defining the correct colour of the object in such cases is difficult, as our approach does not deal with patterns.
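As announced above, the following is a minimal scripted sketch (not the tool chain we actually used) of how such a classification check could be reproduced with the owlready2 Python library, which invokes the HermiT reasoner; the ontology file name is hypothetical.

    # Illustrative sketch only: load the ontology facts and let HermiT classify
    # the individuals via owlready2 (HermiT is owlready2's default reasoner).
    from owlready2 import get_ontology, sync_reasoner

    onto = get_ontology("file://qimage_facts.owl").load()  # hypothetical file
    with onto:
        sync_reasoner()  # run HermiT; inferred class memberships are asserted

    for ind in onto.individuals():
        inferred = [c.name for c in ind.is_a if hasattr(c, "name")]
        print(ind.name, "->", inferred)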
5.3
Analysing Our Results
The results obtained show that our approach can characterize regions of images
in our robot scenario as walls, floors, office doors and fire extinguishers, under
different illumination conditions and from different points of view.
The extraction of the main regions in the image depends on the segmentation
parameters used (Felzenszwalb and Huttenlocher, 2004). These parameters are
adjusted in order to determine the level of detail extracted from the image. In our
tests, regions of small size (such as door signs or handles) are not extracted so
as to avoid obtaining much more detail than is needed for our characterization of
objects. However, in all the tested images, the regions that are most easily perceived by the human eye have been obtained and described without problems.
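As a hedged illustration of this kind of tuning, the snippet below uses the scikit-image re-implementation of the graph-based algorithm of Felzenszwalb and Huttenlocher (2004), rather than the original source code; the parameter values are arbitrary examples, with min_size playing the role of the minimum region size mentioned above.

    # Sketch of segmentation-parameter tuning with scikit-image's implementation
    # of the graph-based algorithm; parameter values are arbitrary examples.
    from skimage import io
    from skimage.segmentation import felzenszwalb

    image = io.imread("corridor.jpg")           # hypothetical input image
    segments = felzenszwalb(image,
                            scale=300,          # larger scale -> larger regions
                            sigma=0.8,          # Gaussian smoothing before segmenting
                            min_size=400)       # discard regions smaller than this
    print("number of regions:", segments.max() + 1)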
The characterization of qualitative colours using our approach depends on the illumination. This is the main reason that the colour of some objects is defined with different colour names, for example, when identifying doors (grey or dark grey) or walls (pale yellow, dark yellow or light grey). However, the colour names used in the characterization of an object are very similar from a human point of view, and using several colour names in an object definition is not a problem for our approach; in this way, the problems caused by different lighting conditions are resolved.
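As a minimal sketch of this tolerance (the rule structure is an assumption for illustration, not our actual ontology definitions; the colour names are those quoted above), an object category can simply accept any of a small set of qualitative colour names:

    # Illustrative sketch only: tolerant colour matching for object categories.
    ACCEPTED_COLOURS = {
        "UJI_Wall": {"pale_yellow", "dark_yellow", "light_grey"},
        "UJI_Office_Door": {"grey", "dark_grey"},
    }

    def colour_matches(category: str, qualitative_colour: str) -> bool:
        """True if the detected qualitative colour is acceptable for the category."""
        return qualitative_colour in ACCEPTED_COLOURS.get(category, set())

    print(colour_matches("UJI_Wall", "dark_yellow"))         # True
    print(colour_matches("UJI_Office_Door", "pale_yellow"))  # False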
Moreover, it should be noted that our qualitative model for image description
provides much more information than is later used in the contextualized descriptions of our ontology, which define new kinds of objects based on this information. This is an advantage, as our system could apply this extra information to
the characterization of new regions or objects present in other robot scenarios,
where more precise information may be needed in order to differentiate types of
regions or objects. For example, our approach has defined UJI Office Doors as
dark grey or grey quadrilaterals contained completely inside the image. This definition could have been extended by adding that the relevant points of the quadrilateral must be located two at front and two at back with respect to the centroid of
the quadrilateral. Although this information is not needed by the system in order
to distinguish UJI Office Doors from other objects in our scenario, it could be used
in other scenarios in the future.
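Sketched as a DL axiom, with assumed role names and OWL 2 qualified cardinality restrictions, such an extension might look roughly as follows:

    \begin{align*}
    UJI\_OfficeDoor \sqsubseteq\ &(=\!2\ hasRelevantPoint.(\exists hasOrientationWrtCentroid.Front))\ \sqcap\\
    &(=\!2\ hasRelevantPoint.(\exists hasOrientationWrtCentroid.Back))
    \end{align*}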
Table 7: Some images of the corridors of our building containing UJI Fire Extinguishers, UJI Walls, UJI Office Doors and UJI Floors. Inferred information for each image:
Image 1: Object 6 is a UJI Fire Extinguisher; Objects 1, 4, 5, 7, 8 and 9 are UJI Walls.
Image 2: Object 7 is a UJI Fire Extinguisher; Object 1 is a UJI Wall; Object 3 is a UJI Office Door.
Image 3: Object 10 is a UJI Fire Extinguisher; Objects 0-6, 9, 11, 12 and 13 are UJI Walls; Objects 7 and 8 are UJI Office Doors; Object 12 is a UJI Floor.
Image 4: Object 10 is a UJI Fire Extinguisher; Objects 1-4, 6 and 16 are UJI Walls; Objects 12 and 13 are UJI Office Doors.
Image 5: Object 6 is a UJI Fire Extinguisher; Objects 0-5 and 8 are UJI Walls; Object 2 is a UJI Office Door; Objects 9 and 10 are UJI Floors.

Finally, as future applications in robotics, we believe that our approach could be usefully applied for both general and specific robot localization purposes. By extending our approach for characterizing objects to a different scenario (e.g. laboratories, classrooms, libraries or outdoor areas), it could be used for general localization, that is, for determining the kind of scenario the robot is navigating through.
Moreover, by defining a matching process for comparing qualitative descriptions of images taken by the robot camera, we could recognize descriptions corresponding to similar or possibly the same visual landmarks, and those landmarks could be used to localize the robot specifically in the world.
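Since this matching process is left as future work, the following is only one possible, hypothetical sketch of it: two qualitative descriptions are compared as sets of (shape, colour) tags and scored by their overlap.

    # Hypothetical matching sketch (not our method): compare two qualitative
    # image descriptions as sets of (shape, colour) tags via Jaccard overlap.
    from typing import Set, Tuple

    QualitativeDescription = Set[Tuple[str, str]]  # e.g. {("quadrilateral", "dark_grey")}

    def jaccard_similarity(a: QualitativeDescription, b: QualitativeDescription) -> float:
        """Overlap of two tag sets; 1.0 means identical descriptions."""
        if not a and not b:
            return 1.0
        return len(a & b) / len(a | b)

    img1 = {("quadrilateral", "dark_grey"), ("triangle", "dark_red")}
    img2 = {("quadrilateral", "grey"), ("triangle", "dark_red")}
    print(jaccard_similarity(img1, img2))  # 0.33..., a weak match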
6
Conclusions
This paper presented a novel approach to represent the qualitative description of
images by means of a DL-based ontology. Description logics enable us to balance
the need for expressive power with good computational properties for our setting.
Our approach obtains a visual and a spatial description of all the characteristic regions/objects contained in an image. In order to obtain this description, qualitative
models of shape, colour, topology, and fixed and relative orientation are applied.
These qualitative concepts and relations are stored as instances of an ontology
and contextualized descriptions that characterize kinds of objects are defined in
the ontology schema. Currently, we do not have a contextualized ontology definition for every possible object detected in an image (e.g. printer, office desk or
chair). Nevertheless, our approach can automatically process any arbitrary image and obtain a set of DL axioms that describe it visually and spatially.
Our approach has been tested using digital images of the corridors of our building at the university (our robot scenario), and the results show that it can characterize image regions as walls, floors, office doors and fire extinguishers under different illumination conditions and from different observer viewpoints.
As future work on the qualitative description of images, we intend to: (1) extend
our model in order to introduce distances captured by the robot laser sensor for
obtaining depth information for the images described; and (2) combine our model,
which can describe unknown landmarks, with an invariant feature detector, such
as SIFT (Lowe, 2004), for detecting known landmarks in the image. Moreover,
as further work on our DL-based ontology of images we intend to (1) integrate a
reasoner into the robot system, so that the new knowledge obtained can be provided to the robot in real time; (2) reuse non-standard reasoning services such as
modularization to improve scalability when dealing with images with a large set
of objects; (3) integrate our current ontology with other domain ontologies (e.g.,
DOLCE (Gangemi et al., 2002)) and standards such as MPEG-7 (Hunter, 2006);
(4) extend our current ontology in order to characterize other objects from other
robot environments.
Acknowledgments
This work has been partially supported by Generalitat Valenciana (BFPI06/219, BFPI06/372), Spanish MCyT (TIN2008-01825/TIN) and Universitat Jaume I Fundació Bancaixa (P11A2008-14). We thank Ismael Sanz, Rafael Berlanga and the reviewers of the journal for their valuable feedback and comments, which helped us to improve this manuscript. We also thank Pedro F. Felzenszwalb for making his image segmentation source code available on his website.
References
Baader, F., Calvanese, D., McGuinness, D. L., Nardi, D., and Patel-Schneider,
P. F., editors (2003). The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press.
Bhatt, M. and Dylla, F. (2009). A qualitative model of dynamic scene analysis
and interpretation in ambient intelligence systems. Robotics for Ambient
Intelligence, International Journal of Robotics and Automation, 4(3).
Bohlken, W. and Neumann, B. (2009). Generation of rules from ontologies for
high-level scene interpretation. In International Symposium on Rule Interchange and Applications, RuleML, volume 5858 of Lecture Notes in Computer Science, pages 93–107. Springer.
Borst, P., Akkermans, H., and Top, J. L. (1997). Engineering ontologies. Int. J.
Hum.-Comput. Stud., 46(2):365–406.
Cuenca Grau, B., Horrocks, I., Motik, B., Parsia, B., Patel-Schneider, P., and
Sattler, U. (2008). OWL 2: The next step for OWL. J. Web Semantics,
6(4):309–322.
Dasiopoulou, S. and Kompatsiaris, I. (2010). Trends and issues in description
logics frameworks for image interpretation. In Artificial Intelligence: Theories, Models and Applications, 6th Hellenic Conference on AI, SETN, volume
6040 of Lecture Notes in Computer Science, pages 61–70. Springer.
Egenhofer, M. J. and Al-Taha, K. K. (1992). Reasoning about gradual changes
of topological relationships. In Frank, A. U., Campari, I., and Formentini,
U., editors, Theories and Methods of Spatio-Temporal Reasoning in Geographic Space. Intl. Conf. GIS—From Space to Territory, volume 639 of
Lecture Notes in Computer Science, pages 196–219, Berlin. Springer.
Egenhofer, M. J. and Franzosa, R. (1991). Point-set topological spatial relations.
International Journal of Geographical Information Systems, 5(2):161–174.
Falomir, Z., Almazán, J., Museros, L., and Escrig, M. T. (2008). Describing
2D objects by using qualitative models of color and shape at a fine level of
granularity. In Proc. of the Spatial and Temporal Reasoning Workshop at the
23rd AAAI Conference on Artificial Intelligence, ISBN: 978-1-57735-379-9.
Felzenszwalb, P. F. and Huttenlocher, D. P. (2004). Efficient graph-based image
segmentation. International Journal of Computer Vision, 59(2):167–181.
Freksa, C. (1991). Qualitative spatial reasoning. In Mark, D. M. and Frank,
A. U., editors, Cognitive and Linguistic Aspects of Geographic Space, NATO
Advanced Studies Institute, pages 361–372. Kluwer, Dordrecht.
Freksa, C. (1992). Using orientation information for qualitative spatial reasoning.
In Frank, A. U., Campari, I., and Formentini, U., editors, Theories and methods of spatio-temporal reasoning in geographic space, volume 639 of LNCS,
pages 162–178. Springer, Berlin.
Gangemi, A., Guarino, N., Masolo, C., Oltramari, A., and Schneider, L. (2002).
Sweetening ontologies with DOLCE. In 13th International Conference on
Knowledge Engineering and Knowledge Management, EKAW, volume 2473
of Lecture Notes in Computer Science, pages 166–181. Springer.
Grimm, S. and Motik, B. (2005). Closed world reasoning in the Semantic Web
through epistemic operators. In Proceedings of the OWLED Workshop on
OWL: Experiences and Directions, volume 188 of CEUR Workshop Proceedings. CEUR-WS.org.
Guarino, N. (1998). Formal ontology in information systems. In International
Conference on Formal Ontology in Information Systems (FOIS98), Amsterdam, The Netherlands. IOS Press.
Hernández, D. (1991). Relative representation of spatial knowledge: The 2-D
case. In Mark, D. M. and Frank, A. U., editors, Cognitive and Linguistic
Aspects of Geographic Space, NATO Advanced Studies Institute, pages 373–
385. Kluwer, Dordrecht.
Horrocks, I., Kutz, O., and Sattler, U. (2006). The even more irresistible SROIQ.
In KR 2006, pages 57–67.
Horrocks, I., Patel-Schneider, P. F., and van Harmelen, F. (2003). From SHIQ
and RDF to OWL: the making of a web ontology language. J. Web Sem.,
1(1):7–26.
Hunter, J. (2001). Adding multimedia to the Semantic Web: Building an MPEG-7 ontology. In Proceedings of the First Semantic Web Working Symposium,
SWWS, pages 261–283.
Hunter, J. (2006). Adding multimedia to the Semantic Web: Building and applying an MPEG-7 ontology. In Stamou, G. and Kollias, S., editors, Chapter 3
of Multimedia Content and the Semantic Web. Wiley.
Hustadt, U. (1994). Do we need the closed world assumption in knowledge representation? In Baader, F., Buchheit, M., Jeusfeld, M. A., and Nutt, W., editors, Reasoning about Structured Objects: Knowledge Representation Meets Databases, Proceedings of 1st Workshop KRDB'94, Saarbrücken, Germany, September 20-22, 1994, volume 1 of CEUR Workshop Proceedings. CEUR-WS.org.
Johnston, B., Yang, F., Mendoza, R., Chen, X., and Williams, M.-A. (2008).
Ontology based object categorization for robots. In PAKM ’08: Proceedings of the 7th International Conference on Practical Aspects of Knowledge
Management, volume 5345 of LNCS, pages 219–231, Berlin, Heidelberg.
Springer-Verlag.
Katz, Y. and Cuenca Grau, B. (2005). Representing qualitative spatial information
in OWL-DL. In Proceedings of the Workshop on OWL: Experiences and
Directions, OWLED, volume 188 of CEUR Workshop Proceedings. CEUR-WS.org.
Kosslyn, S. M. (1994). Image and brain: the resolution of the imagery debate.
MIT Press, Cambridge, MA, USA.
Kosslyn, S. M., Thompson, W. L., and Ganis, G. (2006). The Case for Mental
Imagery. Oxford University Press, New York, USA.
Krötzsch, M., Rudolph, S., and Hitzler, P. (2008). ELP: Tractable rules for
OWL 2. In The Semantic Web, 7th International Semantic Web Conference,
ISWC, volume 5318 of Lecture Notes in Computer Science, pages 649–664.
Springer.
Kuhn, W., Raubal, M., and Gärdenfors, P. (2007). Editorial: Cognitive semantics and spatio-temporal ontologies. Spatial Cognition & Computation: An
Interdisciplinary Journal, 7(1):3–12.
Lovett, A., Dehghani, M., and Forbus, K. (2006). Efficient learning of qualitative descriptions for sketch recognition. In 20th International Workshop on
Qualitative Reasoning.
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints.
International Journal of Computer Vision, 60:91–110.
Maillot, N. and Thonnat, M. (2008). Ontology based complex object recognition.
Image Vision Comput., 26(1):102–113.
Motik, B., Horrocks, I., and Sattler, U. (2009). Bridging the gap between OWL and relational databases. J. Web Sem., 7(2):74–89.
Motik, B., Sattler, U., and Studer, R. (2005). Query answering for OWL-DL with rules. J. Web Sem., 3(1):41–60.
Neumann, B. and Möller, R. (2008). On scene interpretation with description
logics. Image and Vision Computing, 26(1):82–101.
Oliva, A. and Torralba, A. (2001). Modeling the shape of the scene: A holistic
representation of the spatial envelope. Int. J. Comput. Vision, 42(3):145–175.
Palmer, S. (1999). Vision Science: Photons to Phenomenology. MIT Press.
Qayyum, Z. U. and Cohn, A. G. (2007). Image retrieval through qualitative representations over semantic features. In Proceedings of the 18th British Machine
Vision Conference (BMVC), pages 610–619.
Quattoni, A. and Torralba, A. (2009). Recognizing indoor scenes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 413–420.
Rector, A. L., Drummond, N., Horridge, M., Rogers, J., Knublauch, H., Stevens,
R., Wang, H., and Wroe, C. (2004). OWL pizzas: Practical experience of
teaching OWL-DL: Common errors and common patterns. In Motta, E.,
Shadbolt, N., Stutt, A., and Gibbins, N., editors, EKAW, volume 3257 of
Lecture Notes in Computer Science, pages 63–81. Springer.
Sarifuddin, M. and Missaoui, R. (2005). A new perceptually uniform color space
with associated color similarity measure for content-based image and video
retrieval. In Multimedia Information Retrieval Workshop, 28th annual ACM
SIGIR conference, pages 3–7.
Schill, K., Zetzsche, C., and Hois, J. (2009). A belief-based architecture for scene
analysis: From sensorimotor features to knowledge and ontology. Fuzzy Sets
and Systems, 160(10):1507–1516.
Sirin, E., Smith, M., and Wallace, E. (2008). Opening, closing worlds - on integrity constraints. In Proceedings of the Fifth OWLED Workshop on OWL:
Experiences and Directions. CEUR Workshop Proceedings.
Socher, G., Geleit, Z., et al. (1997). Qualitative scene descriptions from images for integrated speech and image understanding. Technical report:
http://www.techfak.uni-bielefeld.de/techfak/persons/gudrun/pub/d.ps.gz.
Tao, J., Sirin, E., Bao, J., and McGuinness, D. (2010a). Extending OWL with
integrity constraints. In Proceedings of the International Workshop on Description Logics (DL), volume 573 of CEUR Workshop Proceedings. CEURWS.org.
Tao, J., Sirin, E., Bao, J., and McGuinness, D. L. (2010b). Integrity constraints in
OWL. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial
Intelligence. AAAI Press.
Tobies, S. (2001). Complexity results and practical algorithms for logics in knowledge representation. PhD thesis, RWTH Aachen, Germany.
Williams, M. (2008). Representation = grounded information. In PRICAI ’08:
Proceedings of the 10th Pacific Rim International Conference on Artificial
Intelligence, volume 5351 of LNCS, pages 473–484, Berlin, Heidelberg.
Springer-Verlag.
Williams, M., McCarthy, J., Gärdenfors, P., Stanton, C., and Karol, A. (2009).
A grounding framework. Autonomous Agents and Multi-Agent Systems,
19(3):272–296.