Academia.eduAcademia.edu

Surveillance and Identity - Conceptual Framework and Formal Models

2017, Journal of Cybersecurity

Surveillance is recognised as a social phenomenon that is commonplace, employed by governments, companies and communities for a wide variety of reasons. Surveillance is fundamental in cybersecurity as it provides tools for prevention and detection; it is also a source of controversies related to privacy and freedom. Building on general studies of surveillance, we identify and analyse certain concepts that are central to surveillance. To do this we employ formal methods based on elementary algebra. First, we show that disparate forms of surveillance have a common structure and can be unified by abstract mathematical concepts. The model shows that (i) finding identities and (ii) sorting identities into categories are fundamental in conceptualising surveillance. Secondly, we develop a formal model that theorizes identity as abstract data that we call identifiers. The model views identity through the computational lens of the theory of abstract data types. We examine the ways identifier...

Journal of Cybersecurity, 3(3), 2017, 145–158 doi: 10.1093/cybsec/tyx010 Advance Access Publication Date: 15 December 2017 Research paper Research paper Surveillance and identity: conceptual framework and formal models 1 Senior Lecturer on Security and Cybercrime, Institute of Criminal Justice Studies, Faculty of Humanities and Social Sciences, University of Portsmouth, Portsmouth, UK; 2Professor of Computer Science, Department of Computer Science, College of Science, Swansea University, Swansea, UK *Corresponding author: E-mail: [email protected] Received 23 June 2016; accepted 26 October 2017 Abstract Surveillance is recognized as a social phenomenon that is commonplace, employed by governments, companies and communities for a wide variety of reasons. Surveillance is fundamental in cybersecurity as it provides tools for prevention and detection; it is also a source of controversies related to privacy and freedom. Building on general studies of surveillance, we identify and analyse certain concepts that are central to surveillance. To do this we employ formal methods based on elementary algebra. First, we show that disparate forms of surveillance have a common structure and can be unified by abstract mathematical concepts. The model shows that (i) finding identities and (ii) sorting identities into categories are fundamental in conceptualizing surveillance. Secondly, we develop a formal model that theorizes identity as abstract data that we call identifiers. The model views identity through the computational lens of the theory of abstract data types. We examine the ways identifiers depend upon each other; and show that the provenance of identifiers depends upon translations between systems of identifiers. Key words: surveillance, social sorting, identity, abstract data types, formal methods Introduction Surveillance is an integral part of everyday life as many technologies employed in our physical and virtual environments have long been capable of monitoring and recording our activities cf. [1]. The ubiquitous cameras that monitor our physical environment, in order to improve the safety and security of people and property, are but the most visible tip of the surveillance iceberg. The invisible bulk is made of software that record data about actions and events cf. [2]. Our professional lives have long been conducted through software systems, and recently, our personal lives have become dependent on software systems through social media. Our home and neighbourhood environments are next to succumb to software, through the internet of things, e.g. [3, 4, 5]. That our lives are being captured and represented by digital data, collected by many independent sources for different purposes, is an important sociological phenomenon. The translation of all kinds of data into digital form, and the C The Author 2017. Published by Oxford University Press. V aggregation and unification of all kinds of data sources through computer networks are important technological phenomena. Surveillance is enormously controversial as it impacts on the multitude of notions that make up privacy and freedom for individuals; on the conduct of economic and social life of societies; and on the legal, political, and military foundations of the state [6, 7]. With this broad view, David Lyon has given a general description of surveillance as ‘the focused, systematic and routine attention to personal details for purposes of influence, management, protection or detection’ [8, 14]. In establishing surveillance as a general social issue, Lyon has proposed that surveillance has three main purposes [8]: keeping control, which is the historic purpose pursued by employers, police, and government; ii. social sorting, pursued by companies in marketing and managing customers; and i. 145 This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected] Downloaded from https://academic.oup.com/cybersecurity/article/3/3/145/4748787 by guest on 07 November 2021 Victoria Wang1,* and John V. Tucker2 Journal of Cybersecurity, 112017, Vol. 3, No. 3 146 iii. mutual monitoring, pursued in peer to peer in social networks, real and virtual. Informal Definition. An identifier for an entity is data that is associated with the entity for the purposes of distinguishing it among other entities in some context and for some purpose. Identifiers are the main focus of our paper. As a starting point for our conceptual analysis, we assume that: Principle. Entities are recognised only through the data that act as their identifiers for a context. Entities are observed only through the data that represent their behaviour in a context. This hypothesis is widely applicable. First, the surveillance context is determined by selecting aspects of an entity’s behaviour that can be captured in data, and by observations made by testing for attributes of the data. Secondly, the identity of an object is reduced to measurements, and the identity of a person is reduced to forms of evidence that are also data, including records of personal testimony and formal registration, as well as biometrics [15, 16, 17]. The idea is simpler and more palatable when one considers the virtual world, which creates hugely many more contexts that are, and can only be, made of data. Users have many identities, some of which they create in a state of anonymity. The Principle is perfectly at home in the virtual world of cybersecurity. Technically, the operations and tests on identifiers combine to make systems of identifiers. Although designed for specific contexts, they often have unforeseen applications. Since identifiers are data, clearly the systems of identifiers are actually examples of what computer scientists call abstract data types [18, 19]. The theory of abstract data types characterizes data through its operations and tests, which may be specified by axioms to make them close to the application domain and independent of implementations. The theory uses algebra to model any form of data, and tools to design and build software. The idea of a general theory of identity based on abstract data types is new. 1 For example, in the UK, accurate identification of an individual usually depends on a passport [11], a driver’s licence (DVLA) and, for some, the National Identity Register [12]. Accuracy increases if identification involves fingerprints [13], iris scans [14] and DNA [15]; see also [16]. What is surveillance? Let us begin with an abstract informal description of a large class of surveillance systems. Informal Definition. A surveillance system observes the behaviour of people and objects in a context that may be real or virtual. 2 The mathematical ideas we use use to launch our models are sets, functions and relations, which are described in any number of textbooks on discrete mathematics, e.g. [20, 21]. The theory of abstract data types is more demanding algebraically, e.g. [18]. Downloaded from https://academic.oup.com/cybersecurity/article/3/3/145/4748787 by guest on 07 November 2021 Thanks to the ubiquity of digital technologies, the aims and methods of social sorting — the categorization of personal data — is becoming most prominent. In this article, we examine theoretically the general ideas of surveillance and one of its component concepts that of identity. Identity is fundamental to contemporary surveillance practices [9, 10, 11]. Surveillance technologies rely on identity management systems to provide information, which vary in accuracy.1 For instance, for social sorting to work, identity needs to be just precise enough to enable categorizations to be useful in an application. We seek completely abstract models that can be formalized and analysed mathematically. First, we develop a general definition of surveillance that captures the notion in diverse situations, and we illustrate the general definition with some disparate examples. This definition shows that the three main types of surveillance have the same structures, and that the essence of surveillance is indeed sorting and categorization. Our analysis applies to entities that are objects or people, real or virtual, belonging to a specific context. A most important component idea of our definition of surveillance is that of the identity of the people or objects observed. We introduce the general concept of identifiers, which are data designed to recognize an entity. Here is our idea: Foremost among identifiers are those that are supposed to identify people. The notion of a personal identifier proves to be as informative as it is subtle. To understand identity we need to examine the ways identifiers are issued and how they depend upon other identifiers. We show that the provenance of identifiers is an essential idea. We consider principles of how identifiers are to be compared and when they might be deemed equivalent; this requires notions of translations between different systems of identifiers. All of these concepts are motivated by some informally described examples, and then formalized mathematically using elementary algebra. The examples of surveillance and identity we use refer to situations both in everyday life and in cybersecurity. The everyday examples make the point our concepts apply to traditional forms of identity and security. The cybersecurity examples give a glimpse of the abundance of identity issues in securing software systems. Identity is a central concept in hashing, encryption, communication protocols, certification, and their roles in the maintaining the trustworthiness of transactions, encoding of access controls, tracing events, forensics, etc. In their mathematical form, the concepts create precise and general definitions that cover a great range of examples. The article is in two parts: on surveillance and on identity. The Section ‘What is surveillance?’ exemplifies the central ideas informally; their formalization begins in the Section ‘A formal model of surveillance’. But why formalize? Formalizations make notions precise. They uncover and classify the possible structures and interpretations of ideas. Our formalization can be likened to the way formal logic has long been used in philosophy to clarify the nature of truth, arguments and reasoning. Later, we reflect on the role of abstract concepts and formal methods to give insights in sociological contexts (the sub-Section ‘Formalizing identity and social theory’). Our aim is to discover general concepts and principles associated with surveillance and to analyse them, mapping their interconnections and implications by means of mathematical ideas and structures. Formalization has proved to be a fundamental practical tool in the development of software. Formal models are needed by technologists designing surveillance systems and their safeguards, in order to develop tools for testing and reasoning. Such formal methods play a role in software engineering where security by design is an objective. We believe that raising the status of identity and modelling identity using abstract data types will be useful in making and maintaining cybersecurity software. The mathematical pre-requisites are modest.2 To address a readership from different disciplines and professions, we explain some very basic mathematical ideas. While this will seem laborious and unnecessary to those familiar with formal methods, others may benefit when ideas are shown to arise naturally in thinking about identity, and appreciate more readily the benefits that formalization delivers. Journal of Cybersecurity, 112017, Vol. 3, No. 3 The surveillance system classifies behaviours by means of attributes, and identifies people and objects with those attributes. A surveillance system consists of the following components and methods: 1. Entity. Entities that possess behaviour in space and time. 2. Observable behaviour. Methods for obtaining and recording data about behaviours. 3. Attribute. Methods for defining and recognizing attributes of behaviour data. 4. Identity. Methods for generating data that identifies entities in the context. Entity: Cars Observable Behaviour: The registration mark, its time of arrival and departure at the location surveillance system communicates the registration mark to the Driver and Vehicle Licensing Agency (DVLA), or to one of its approved agents, to determine the name and address of the keeper. The output of these actions is the identity of the keeper. Indeed, in this necessary second stage, there is a transformation of identity data from the registration mark to the name and address of the keeper. Note that finding the actual driver may require further independent action. In the case of speeding, where laws are involved, the driver’s record will contain characteristics such as a driving penalty history. Example 2: Social Sorting — Customer Accounts. Consider a client’s e-account with some provider, such as a bank or shop. Typically, an account has the following components: an account number that identifies the account; a user name and password as a form of identity used to gain access to the account; a set of characteristics of the account, such as personal details and the scope and limits of services; and an account history that records past transactions and allows for new transactions, queries, preferences, etc. The account history is the behaviour of the account; it is observed to check that terms and conditions are met by the client, or that no unusual pattern of transactions has been carried out, or to generate suggestions for new products and services. Observations might also include standard monitoring data about user logins and login attempts, duration, location, etc. For example: Entities: Credit card accounts Observable Behaviour: Transactions: date, payee, location, sum, etc. Attributes: Credit transactions limit, minimum payments, unusual Identity: Credit card number Example 3: Mutual Monitoring — Social Media Accounts. Social media connect people who have personal or professional interests in common. Systems such as Facebook, Twitter, WeChat, Instagram, LinkedIn, and Academia.edu attract large numbers of users. Individuals register with a system and create an account and a network of other users to suit their needs. Abstractly, an account has a structure similar to that of a customer account for a bank or shop. The behaviour of the account is a history of postings, status updates, linkages and interactions. In social networking, individuals voluntarily reveal very detailed information about themselves to their networks, including their personal history, tastes, opinions and activities; the behaviours could be termed personas. From the point of view of surveillance, two phenomena are of interest: (i) individuals can and do watch over people in their networks, and (ii) the data of the account holders belong to companies that can collect and use the information for commercial or other purposes. Illustrating the components: Attributes: Duration of stay above a particular limit Entities: Member accounts Identity: Registration marks Observable Behaviour: Personal declarations, posts, comments, connections, location, etc. Following the ANPR stages described above, the registration mark is communicated to a database relevant to the application. For example, the database may be used to check an attribute, such as a payment (tax, charge or toll), having been made for that registration mark. The surveillance system knows the identity of the car, but not necessarily the driver. Suppose we take the entities to be people. To find the driver, an independent process involving only personal identities begins. The vehicle is registered to a person called the keeper of the vehicle, who must be located and contacted. In the UK, the operator of the Attributes: Targeted opinions in posts on specific topics, interactions with other members, unusual interactions Identity: Usernames A formal model of surveillance We have defined surveillance informally as a process that identifies entities on detecting certain properties of their behaviour. We will define this process formally. Downloaded from https://academic.oup.com/cybersecurity/article/3/3/145/4748787 by guest on 07 November 2021 In practice, what is observable about the behaviour are the attributes of data; these characterize the context for the surveillance. Depending upon the context, we may expect the attributes to be based upon laws, rules, norms, conventions, policies, practices, expectations, etc. Indeed, when the purpose of surveillance is control, they may seek to catch deviations. The definition is neutral and does not imply deviance. The definition does require precise formulations of attributes for a process of categorization. The data that is used to identify entities can be numbers, texts, sounds and images. Here are three simple examples to prepare for our abstract formalizations. Example 1: Control Motor — Vehicles. Automatic Number Plate Recognition (ANPR) is a technology that observes vehicles and records registration marks. Typical applications are checking on vehicle speed, managing car parking and collecting tolls, cf. [22]. The technology was functioning in the late 1970s. Today, ANPR is a component of hundreds of thousands of surveillance systems owned by both public and private organizations. Consider some ANPR applications in terms of our abstract definition. In such surveillance systems, the entities are vehicles at a particular location and time. The vehicles may be in motion (as with speed checks), or may be arriving or leaving a location (as with car parking and congestion zones in cities). The vehicles are observed by cameras that create images and the software that processes the images varies according to the behavioural attributes under observation (e.g. breaking an average speed limit over a stretch of road, or overstaying a parking time limit). In particular, optical character recognition establishes the registration mark of vehicles. A registration mark is an alpha-numeric name that identifies a vehicle uniquely in a human-centred way in a national register of vehicles. The mark links to information about the characteristics of the vehicle. Thus, to the surveillance system, the identity of an entity is this registration mark. For example, a surveillance system for car parking based on an ANPR has the form: 147 Journal of Cybersecurity, 112017, Vol. 3, No. 3 148 Context: entities and their behaviour Let Entity be a set of entities whose behaviour is to be observed. Let Behaviour be the set of all possible behaviours in space and time of all the entities of Entity. The nature of behaviour and its models we consider in stages. Deterministic Behaviour. Suppose that each entity e 2 Entity has one and only one behaviour in space and time, i.e. its behaviour is deterministic. In this case, there is a single-valued mapping [[_]]: Entity ! Behaviour Now in many cases, the space Behaviour of all possible behaviours of the entities can be taken to be a subset of the set Trace of all possible traces; thus, Behaviour  Trace. When applying the behaviour mapping [[_]] to an entity e 2 Entity we get a trace, which is a map [[e]]: Time ! Action. Therefore, for e 2 Entity and t 2 Time, we have such that The mapping provides a formal model or semantics for the behaviour of the entity. Taken together we have formalized a context for the surveillance as an algebraic structure: Example: Twitter. Twitter processes data called tweets. At the heart of a tweet is a message made from at most 140 characters, but a tweet is composed of more data. For simplicity, a tweet can be thought of as a vector of data drawn from sets of the following kind: Context ¼ (Entity, Behaviour j [[_]]: Entity ! Behaviour). Non-deterministic Behaviour. Suppose that each entity has more than one possible behaviour in space and time, i.e. its behaviour is non-deterministic. In this case, there is a relation [[_]]: Entity  Behaviour such that Text The text that is the status update.3 Identity A string that uniquely labels the tweet. Contributor The author(s) of the tweet. Time The time when this Tweet was created.4 Location The geographic location (longitude, latitude) of this Tweet as reported by the user or application.5 Retweet Status and number of retweets. [[e, b]] () b 2 Behaviour is a possible behaviour of the entity e 2 Entity. Favourite Number of favourites. The context for the surveillance is a relational structure Tweet ¼ Text  Identity  Contributor  Time  Location  Retweet  Favourite. We let the set of all possible tweets be Context ¼ (Entity, Behaviour j [[_]]: Entity  Behaviour). Alternately, in the non-deterministic case, if the elements of Behaviour are sets of possible behaviours of an entity, the relation can be replaced with a map returning sets. We will focus on the deterministic case. The behaviours need to be modelled formally. How might this be done? There are several options. Behaviour as streams of data A way to formalize behaviours is to think of entities performing a sequence of actions or events taking place in time. Let Time be a set of time points generated by a clock of some kind; for example, say Time ¼ {0, 1, 2, . . ., t, . . .}. Let Action be a set of actions or events characteristic of the entities. The behaviour of an entity is conceived of as a stream a(0), a(1), a(2), . . ., a(t), . . . of actions or events in time, where a(t) 2 Action for all t 2 Time. Such sequences will be termed traces: Definition. A trace is an association of actions or events to time points and is formalized by a total mapping a: Time ! Action Now Twitter generates and processes streams of tweets, i.e. sequences of tweets indexed by time. Thus, the behaviour can be modelled by traces that are streams of tweets of the form a(0), a(1), a(2), . . ., a(t), . . . 2 Tweet, which is represented by a map a: Time ! Tweet. Let Behaviour be the set of all possible traces of these kinds. Typical user operations on tweets are ‘embedding tweets’, ‘responding to tweets’, and ‘favouring’, ‘unfavouring’, and ‘deleting tweets’, which induce operations on traces. Depending upon the circumstances, monitoring tweet feeds is called ‘curation’, ‘filtering’, or ‘surveillance’. Monitoring Twitter can be done in a number of ways via application programming interfaces (APIs), which define instructions for developers to build new systems. Twitter’s Search API allows users to define criteria (keywords, usernames, locations, named places, etc.) to search among existing tweets. Twitter’s Streaming API redirects a sample of tweets, based upon a user’s criteria, as these tweets appear. The sample is less than 1% [23]. Twitter’s Firehose API delivers 100% of all publicly available tweets that match users’ criteria as they are made. The Twitter Firehose is complex and requires a subscription. Twitter’s monitoring services have tools to detect non-compliance with Twitter policies (e.g. aggressive following and unfollowing). such that for all t 2 Time a(t) ¼ the action or event taking place at time t 2 Time. Let Trace be the set of all possible traces. 3 4 Using the UTF-8 representation for Unicode. Measured by Coordinated Universal Time (UTC). Identity: identifying entities To identify entities in a context whose behaviours have certain properties, the entities need to be labelled, marked or named in some 5 Using the geoJSON standard. Downloaded from https://academic.oup.com/cybersecurity/article/3/3/145/4748787 by guest on 07 November 2021 [[e]](t) ¼ the action or event of entity e taking place at time t 2 Time. [[e]] ¼ the behaviour of the entity e 2 Entity. Journal of Cybersecurity, 112017, Vol. 3, No. 3 way. Our notion of identifier, defined in the Introduction, is designed to do just this. 6 Each entity e 2 Entity has an identifier that is used to denote the entity in a context. The association of identifiers with entities can be complicated as we will see shortly. In order to formalize surveillance, we must formalize the assignment of identifiers to entities. Definition. Let Identifier be a set of possible identifiers for the entities of Entity. There is a relation id  Identifier  Entity such that If id(i, e) then we say that identifier i names entity e. Let anon be a datum that is not in the set Identifier of identifiers for the entities; anon indicates anonymity, i.e., an entity not named. We will need the set Identifier [ {anon}. We will develop the notion of identifiers in the second part of the article (the Section ‘What is identity’? onwards). Here, let us note that since the association of identifiers to entities is a relation, thus many identifiers can be allocated to an entity and, conversely, many entities can have the same identifier. Later, in the Section ‘A formal model of identity?’, we will simplify the discussion, focussing on the case that the association is a function id: Identifier ! Entity. ¼1 if [[e]]62 Propi. Note that entities whose behaviours do not lie in Propi are mapped to the empty set 1, and are ignored and not identified, i.e. they will remain anonymous. Definition. An entity e in a context is anonymous under surveillance with attributes Surv(Prop) if Surv(Propi)(e) ¼ 1 for 1  i  k. Minimal Case. Consider surveillance that seeks just one identifier for any entity whose behaviour satisfies some Propi. This view of surveillance is reformulated thus: Definition. Surveillance is defined as follows: for 1  i  k define, Surv(Propi): Entity ! Identifier [ {anon} for e 2 Entity by Surv(Propi)(e) ¼ (some i) id(i, e) if [[e]]2 Propi ¼ anon if [[e]] 62 Propi. Thus, given the collection Prop ¼ Prop1, . . ., Propk of properties, surveillance is specified by a collection of functions: for 1  i  k, Surv(Propi): Entity ! Identifier [ {anon}. If convenient, these may be combined as a k-tuple, Surv(Prop): Entity ! (Identifier [ {anon})k where Surveillance: detecting attributes The elements of Behaviour formalize the activity of the entities under surveillance. To formalize what it is we are to detect, we suppose that Prop ¼ Prop1, . . ., Propk is a collection of sets of behaviours, i.e. for 1  i  k, Propi  Behaviour. The entities of interest are those whose behaviours lie in some Propi; in symbols, Entity(Propi) ¼ {e 2 Entity: [[e]]2 Propi}. General Case. Entities in a context are known by their identifiers. Formulations of surveillance can seek for any entity e satisfying a Propi, i. at least one identifier i for e; ii. a subset of the identifiers of e; or iii. all of the identifiers of e. These options have the form of a selection or choice operation selectid: Entity ! P(Identifier) where P(Identifier) is the set of all subsets of Identifier, and selectid(e)  {i 2 Identifier j id(i, e)}. Definition. Surveillance is formulated as follows: for 1  i  k define, Surv(Propi): Entity ! P(Identifier) for e 2 Entity by Surv(Propi)(e)¼ selectid(e) if [[e]]2 Propi 6 In computing, the term ‘identifier’ is well established. It is data made of syntax that names or labels a computational entity; commonly, it is an alphanumeric string that defines components in a programming language, such as variables, operators, procedures, programs etc. Our adoption of the word for data Surv(Prop)(e) ¼ (Surv(Prop1)(e), . . ., Surv(Propk)(e)). Combining these ideas, we define formally a very general notion of a surveillance system. Definition. A surveillance system for entities in a context is a structure of the form SurvSys(Prop) ¼ (Entity, Identifier [ {anon}, Behaviour j anon, id, [[_]], Propi, Surv(Propi) 1  i  k), consisting of the non-empty sets Entity, Identifier, Behaviour, the constant anon, and the k þ 1 relations id  Identifier  Entity, Propi, and the k þ 1 mappings [[_]]: Entity ! Behaviour Surv(Propi): Entity ! Identifier [ {anon} for 1  i  k. The definition expresses a minimal general form of a surveillance system as an algebraic structure, which is a semantic model of an abstract data type [18]. The theory of abstract data types was created to model the essential components of any computing system in a precise way. Thus, designers can use algebraic methods when thinking formally about the processes of user specification and subsequent technological implementation [25]. Of course, any actual surveillance system will involve many technologies to obtain and associated with a context is essentially a large-scale generalization. The purpose of the notion is close to that of the idea of a pure name in [24]. The term is in use occasionally in some social discussions of identity. Downloaded from https://academic.oup.com/cybersecurity/article/3/3/145/4748787 by guest on 07 November 2021 id(i, e) () the data i 2 Identifier is assigned to entity e 2 E. 149 Journal of Cybersecurity, 112017, Vol. 3, No. 3 150 process data. These technologies may suggest some new abstract components that need to be formalized and understood theoretically. Roughly speaking, system design has the following form: Design Problem. The essence of the design problem is: Surveillance and social sorting In surveillance studies, social sorting is the categorization of people and results in a classification used to treat people differently [26]. Although originally formulated to understand the social impact of surveillance by companies and institutions, our formal definition shows that sorting is essential to the abstract conception of surveillance and, therefore, that sorting is inherent in the surveillance of entities of all kinds. We will formalize the sorting of entities using simple notions of categorization and partition; however, sorting can be problematic because the sorting of identifiers is more complex than the sorting of entities. Sorting entities In our definition of surveillance the collection Prop of properties of entities lead to a categorization of entities that can be treated differently. What is a categorization? Definition. Let Entity be a set of entities. A categorization of entities is a collection of subsets S1, S2, . . ., Sk  Entity that include all the entities, i.e. S1 [ S2 [ . . . [ Sk ¼ Entity. An entity e lies in at least one of the sets and possibly several. In this loose idea, we may have categories overlapping and having interesting internal structure, e.g. they may be nested and form a hierarchy under the set inclusion ordering. Commonly, and most simply, we may want the sets not to overlap so that an entity e lies in one, and only one, of the sets: Definition. The categorization is a partition if for 1 n, m k, we have Sn\Sm ¼ 1. Sorting identifiers Surveillance observes data about behaviours of entities — not entities — and recognizes only identifiers for entities — not the entities themselves. Thus, surveillance delivers a categorization of identifiers, not a categorization of entities, which makes the notion subtle. Definition. Let Identifier be a set of identifiers. A categorization of identifiers is a collection of subsets S1, S2, . . ., Sk  Identifier that includes all the identifiers, i.e. S1 [ S2 [ . . . [Sk ¼ Identifier. if i 2 Sn and i and j name the same entity then j 2 Sn Our definition of surveillance delivers a categorisation of identifiers, namely: Sn ¼ image(Surv(Propn)) – {anon}. To make a complete categorization is a process that depends upon knowledge of the equality of identifiers for entities (see the sub-Section ‘Generating identifiers’). We now turn to theorizing identity. What is identity? Identity has become almost purely a matter of data. People and objects are named, numbered, labelled or otherwise denoted by data relevant to a context. People belong to many contexts: they can be citizens, patients, drivers, voters, employees, customers, crime suspects, etc., each with different identities managed by different kinds of identity management systems. Physical or virtual, each identity system is based on an abstract data type of some kind. To distinguish between entities in a context, identifiers need not reflect any aspect of the entity or have any meaning at all, however in practice they are loaded with information. Case studies reveal that the following processes are fundamental: i. ii. iii. iv. v. creation and re-creation of identifiers; comparison of identifiers; inter-dependence of identifiers; transformation of identifiers; revocation of identifiers. Identifiers are composite objects: identifiers are commonly built from other identifiers. Personal identifiers are those that we rely upon to distinguish uniquely a human being. They are guarantees of peoples’ identities in contexts that demand physical identity. In the UK, the basic, most rigorous personal identifiers are associated with birth, marriage and death certificates, passports, medical and dental records, driving licenses, National Insurance (NI) records, tax records, etc. Biometric data — such as photographs, fingerprints, iris scans, blood groups, and DNA — are also involved. Biometric data are physical measurements, but they are represented and processed digitally. In this section, we examine informally some concepts, principles, and examples of identity prior to providing a formal definition and the outline of a theory in the next sections. Downloaded from https://academic.oup.com/cybersecurity/article/3/3/145/4748787 by guest on 07 November 2021 1. Specification. To define the desired surveillance system by specifying an abstract data type for SurvSys(Prop). 2. Implementation. To choose technologies to generate data to a. represent the behaviours of the entities; b. represent the identities of the entities; c. observe behaviours and detect those behaviours having the attributes in Prop; d. recognize the identity of entities having the properties in Prop. Again, an identifier i lies in at least one of the sets and possibly several. A categorization of identifiers is less likely to be a partition. However, the structure must also be measured against the entities that the identifiers name. Given an entity e there can be identifiers i and j for e that lie in different sets. This means that the categorization of identifiers does not lead directly to a neat categorization of entities. Distinctions between different identifiers for the same entity may be ambiguities that are meaningful. For example, data integration combines sets of identifiers from different contexts that share the same entities. Categorizations of identifiers arise in many ways, not least by the analysis of data sets using clustering and classification techniques in machine learning [27]. Ideally, our categorization of identifiers can be transformed into one that corresponds with the entities: Definition. The categorization S1, S2, . . ., Sk of identifiers is complete for the entities if for all i, j 2 Identifier, and any 1  n  k, Journal of Cybersecurity, 112017, Vol. 3, No. 3 Recall from the Introduction that identifiers can be ‘any data intended to separate entities in a context’. What is this data? For example, a name for an entity is an identifier. By a name for an entity we commonly mean data made from symbols. In computing systems, there are many syntactic schemes for naming hardware and software entities using alphanumeric strings; usually, the aim is to make a symbolic identifier unique to the entity in a context. The relationship between entities and identifiers can be complicated. Consider these four identifier-entity ratios: Surveillance returns identifiers that can narrow the search for entities but may not pin down the particular entity of interest. Searches take place on identifiers and, as we have noted, that an identifier can easily point to many distinct entities. Thus, many-toone associations are important because: Search Principle. If an association is many-one then to find an entity, we can search for any one of a set of alternate identifiers for that entity. If an association is one-one then there is one and only one identifier for that entity. The following point about narrowing the search for identifiers is obvious but certainly is profoundly important practically: Enumeration Principle. The addition of a number, reference code, extension tag, time stamp, or hash code may turn a manyone association into a one-one association. The use of numbers to uniquely distinguish entities in a context is old and universal, helping to determine uniquely all sorts of entities, such as people (by membership numbers); invoices, orders and payments (by reference numbers); and consumer products (by serial and barcode numbers). Reference codes do the same using alphanumerical strings. The use of extension tags often structures identifiers as paths in a tree and, like time stamps, separate entities, narrow searches, and can isolate entities uniquely. Hashing produces long binary or hex numbers as code for an identifier. Example 1: Cars. Recall Example 1 in the Section ‘What is surveillance?’, which illustrates one-one and many-one associations. In the UK, each car is assigned a registration mark; the current system was introduced on 1 September 2001. In general, each registration mark consists of seven characters with a defined format. From left to right, the characters consist of: (i) a local memory tag or area code, consisting of two letters that indicates the local registration office; (ii) a two-digit age identifier, which changes twice a year, in March and September; and (iii) a three-letter sequence which uniquely distinguishes each of the cars displaying the same initial four-character area and age sequence. The association of registration marks to cars is one-one at any time. However, with permission of the DVLA, registration marks can be transferred from one vehicle to another. Thus, the marks are unique identifiers that are time7 In the Internet of Things, processors are embedded in products and places of all kinds. Thus, there is a need for many more IP addresses, prompting an dependent; they are not permanent unique identifiers for the vehicles. There are identifiers for vehicles that are permanent: in the UK, the vehicle identification number (VIN) consists of 17 characters that identify the manufacturer (three characters), the type of vehicle (six characters), and finally distinguishes each of the cars with these characteristics (eight characters). The VINs obey some international standards. A car has one and only one registered keeper. The registered keeper is the person who is legally responsible for the car, and need not to be the owner of the vehicle. One purpose of the mark is to identify the keeper: thus, the association of a registration mark to a keeper is unique. However, a person can be a registered keeper of as many cars as he/she wants. Thus, the association of registration marks to keepers is many-one. The registration document (V5) for a car identifies the car by registration mark and VIN, and its keeper. Many people have insurance policies that enable them to drive any car with the owner’s permission. Thus, the driver of a car on a particular occasion may be only loosely connected to the keeper. The association between registration marks and drivers is complicated being one-many and time dependent, and incomplete in terms of formal documentation. Example 2: Communications. This example demonstrates both many-one and many-many associations. When connecting a computer to the Internet, a number is needed called an Internet Protocol (IP) address that uniquely identifies the machine in the network; this number is 32 bits under Internet Protocol Version 4.7 In some computer networks, such as networks local to an organization or company, there is an IP address for the machine that does not change; these are called static IP addresses. In this context, the association of computers to IP addresses is one-one. More commonly, at home, IP addresses are generated by an Internet Service Provider in response to a customer’s need for Internet access. Thus, over time IP addresses can change and the association of IP addresses to a particular computer is many-one. Developing this example, if more than one computer is accessing the Internet at the same time in a period, from the same service, then the association between IP addresses and computers is many-many. The changing status seems to be natural in time-dependent associations of identifiers. However, each computer does have an identifier, called its MAC address (48 bits under IEEE 802), that identifies the device uniquely throughout its life. So, the association is one-one and time independent. Example 3: Addresses. This example demonstrates a one-many association. In the UK, between 1959 and 1974, a system of postal codes was introduced to enable the automation of postal services. Typically, each address or location is assigned at most one postcode but a postcode can be assigned to more than one unit or building. The association between postcodes and buildings/addresses is onemany. Thus, postcodes are a system of identifiers that do not uniquely determine addresses. Local authorities determine addresses. Postcodes have found many uses and are used routinely in commercial transactions, navigation, and, more significantly, in calculating insurance, designing social policy and funding, and academic social studies — all of which are examples of social sorting. For any system of identifiers for entities in a context, the questions arise: Identifier Generation. How does the system create and delete identifiers for entities? upgrade of standards from Internet Protocol Version 4 to Internet Protocol Version 6 [28]. Downloaded from https://academic.oup.com/cybersecurity/article/3/3/145/4748787 by guest on 07 November 2021 1. Many–One Associations. Each identifier is assigned to one entity, but different identifiers can be assigned to the same entity. 2. One–One Associations. Different identifiers are assigned to different entities. 3. One–Many Associations. An identifier can be assigned to more than one entity but each entity has only one identifier. 4. Many–Many Associations. An identifier is assigned to more than one entity and, vice versa, an entity can be assigned more than one identifier. 151 Journal of Cybersecurity, 112017, Vol. 3, No. 3 152 Identifier Authentication. Given two identifiers, how do we decide whether or not they are associated with the same entity? Entity Authentication. Given an entity and identifier, how do we verify whether or not the identifier is associated with the entity? A formal model of identity We now consider formally the idea of a system of identifiers for the entities under observation. There are three aspects arising from our discussion of examples: assigning identifiers, comparing identifiers and basic personal identifiers. We will continue to use the formal notations introduced earlier in our formal definition of surveillance in the sub-Section ‘Identity: identifying entities’. id: Identifier ! Entity such that id(i) ¼ the entity e 2 Entity named by the data i 2 Identifier. The structure becomes an algebra: IdSys ¼ (Identifier, Entity j id: Identifier! Entity). Recalling the Search and Enumeration Principles in the Section ‘What is identity?’, we will focus on systems having this many-one property. Since the purpose of the identifiers is to recognise the entities that we are interested in, the following equivalence relation on Identifier is basic: Definition. For any i1 and i2 2 Identifier, we say that they are entity-equivalent if they are associated with the same entity: in symbols, i1 en i2 if, and only if, id(i1) ¼ id(i2). The identifier captures and narrows down detection of entities. Thus, we can strengthen the system of identifiers if we can satisfy this condition: Definition. A system of identifiers IdSys is said to satisfy the oneto-one property if the map id satisfies: for any i1 and i2 2 Identifier, if id(i1) ¼ id(i2) then i1 ¼ i2. Assigning identifiers Definition. Let Identifiers be a non-empty set of identifiers and Entity a non-empty set of entities. Suppose that identifiers have been assigned to entities by means of a relation id  Identifier  Entity such that id(i, e) () the data i 2 I, called an identifier, is assigned to entity e 2 E. We define the set of entities named by identifier i by ent(i) ¼ {e 2 Entity j id(i, e)} The map id is one-to-one or injective, and entity-equivalence en is ¼. Example 2: Cars. Recalling Example 1 in the Section ‘What is identity?’, the association of registration marks to cars is one–one. Generating identifiers How are identifiers generated for a set of entities in practice? First, some input data is presented to the system that has to be examined and approved according to some set of rules. Definition. Let the initial data presented to a system in order to create an identifier be called a form. Let Form be the set of all possible forms for the system. The creation of an identifier is a mapping of the type: and the set of all identifiers naming entity e by generate: Form ! Identifier. id(e) ¼ {i 2 Identifier j id(i, e)}. These sets are projections of the relation id. The maps ent(i) and id(e) are needed to formalize the types of association in the Section ‘What is identity?’. This idea is our most general definition: Definition. A system of identifiers is a structure, A form f 2 Form is the background data needed to create the identifier generate(f). We can refine this idea by separating the processing of the data from the release of the identifier. Let the processing of the form be represented by a function check: Forms ! {0, 1} IdSys ¼ (Identifier, Entity j id  Identifier  Entity). Example 1: Post Codes and Passwords. Recall Example 3 in the Section ‘What is identity?’: a postcode can be assigned to more than one building so the association is a one–many relation postcode: Postcode  Address. Similarly, accounts are assigned one password, but passwords can be common to different accounts (e.g. proper names, birthdays, etc.). The association is a one–many relation password: Password  Username. Examples suggest that the following special case is most important. that tests the data in a form f 2 Form for consistency against the system’s rules. We assume that check(f) ¼ 1 means the form is accepted and check(f) ¼ 0 means the form is rejected. We represent the next stage — if and when an identifier is to be issued — by a function issue: Forms ! Identifier which uses some or all of the data in f 2 Form to make an identifier. The two stages are represented by composing the functions to make the new function Downloaded from https://academic.oup.com/cybersecurity/article/3/3/145/4748787 by guest on 07 November 2021 Entity authentication is stronger than identifier authentication. The notion is attractive but not subtle for what does it mean to be ‘given an entity’? In much theory and practice, the entity is actually ‘given’ by means of another identifier. We examine the relationship between identifiers in the Section ‘Comparing identifiers’. Example 4: Physical Verification of Entities. A biometric is an identifier that is designed to be verified by means of a physical process of identity authentication. The physical process involves instruments that make measurements, which are processed by software, and whose specifications involve probability theory. Questions arise about accuracy, equivalence across authenticating equipment, software portability, and, indeed, the probabilistic assumptions. However, the intention is clear: through biometrics, physical reality verifies personal identity. Definition. A system of identifiers IdSys is said to satisfy the many-one property if each identifier is assigned to one and only one entity but an entity may have more than one identifier. In this case, the relation becomes a single-valued mapping Journal of Cybersecurity, 112017, Vol. 3, No. 3 generate: Forms ! Identifier [ {reject} where generate(f) ¼ if check(f)¼1 then issue (f) else reject. The idea of the form is seen in the familiar procedures of enrolment and registration required when applying to join organizations, schemes and services etc. Personal identity Examples Example 1: Biometrics. Biometric identifiers are measurable qualities that can be used to describe and label the physical characteristics of individuals and enable the automatic recognition of people. Physiological and behavioural characteristics are related to the body, and there are many: some 9 leading biometrics, and a further 17 biometrics under development, are discussed in [16]. All of these physical measurements end up in software. The association of a biometric to people is expected to be highly reliable because it is expected to be one-one. Biometric digital technologies emerged in the 1960s with automatic fingerprint recognition [29, 30] – perhaps, the best understood automatic process [31]. The operational tests used to measure biometrics are of course, approximate, due to technical constraints, error margins and costs. Thus, that biometric data manifests a one–one identity association is a matter of probability, especially high probability. Increasingly accurate measurements are desirable or necessary. Although identical twins share very similar DNA, they are not identical [32]. The environment affects the genetics, possibly even in the womb. But the complexity of testing is considerable and is a research area [33]. Recently, public attention was drawn to these points when identical twins were identified by DNA evidence as suspects in a series of sexual assaults, in Marseille, France, and soon after in Reading, England. In the case of Marseille, after 10 months incarceration, one of the twins confessed [34]; in the case of Reading, mobile phone evidence revealed the offender [35]. At the time advanced DNA tests were not applied to separate the twins. Example 2: Citizenship. In the UK, for example, an individual can or must register with state organizations devoted to health, employment, citizenship, and transport, and with local government organizations devoted to residence and elections. Everyone registered with the National Health Service has his/her unique number, which is linked to his/her health record. Each NHS number is made up of 10 alpha-numerics. Everyone gets a National Insurance (NI) number just before he/she turns 16. An individual’s NI number makes sure his/her NI contributions and taxes are only recorded against her/her name. The format of the number is two prefix letters, six digits, and one suffix letter. In the new style red passport, in addition to the biometrics, there is a passport number that must be nine characters and all characters must be numeric. Finally, each driving licence has a number made up of 18 alpha-numerics, which codes part or all of (i) the surname; (ii) the date of birth; (iii) the first names; (iv) sex; (v) licence issue; and (vi) checks. In these cases of registration, numbers are added to identifiers in order to ensure that each of these associations is one–one. The ways in which the British state knows its citizens is complicated; plans in 2006 for (re-)introducing a national identity register were abandoned in 2011 [36, 37]. Formal personal identifiers We have emphasized how systems of identity are designed to separate entities in contexts, how they are established with widely varying standards of rigour, and that they are combined and compared in all sorts of unanticipated ways. The fundamental personal identifiers mentioned in the sub-Section ‘Examples’ are much used because they carry weight: with the authority of the state, people are identified in basic contexts for citizenship, employment, tax and health. Definition. A personal identity system has the form PIdSys ¼ (Identifier, Person j pid: Identifier! Person) and satisfies the uniqueness property, namely two different people are assigned different data and the function pid is one–one. In practice, the data assigned to a person invariably includes a number or alpha-numeric code precisely in order to enforce the uniqueness property. All systems of identity need to be analysed by studying comparisons that involve mapping between different systems of identity, but this is especially true of personal identity systems. Provenance of identifiers Generating identifiers using other identifiers Creating identifiers is an everyday occurrence: we open accounts, register for services, buy products, etc. For many of these actions, we rely on a handful of pre-existing identifiers. In the UK, to open a bank account, we give a proof of our identity and our current address, e.g. using a passport and a recent utility bill. To order a product or service, an address and a credit card account number are usually sufficient for the vendor to dispatch: notice the dependency on the bank identifier. At face value, the quality of a bank identifier is guaranteed by the databases of the state (passport, driver’s licence) and local organizations (utility providers, local authorities). The passport provides a high quality identifier based on a birth certificate, a photograph and possibly other biometric data. Example after example, illustrates the general point that: Principle. The creation of new identifiers is dependent upon preexisting identifiers. The quality of an identifier is essentially a matter of its reliability, which in turn depends on i. its provenance, i.e. the process involved in establishing the identifier; and ii. scope, i.e. the context(s) in which it is accepted. In the case of people, a passport and a driving licence are standard examples of high-quality identifiers with a rigorous provenance and wide application [38, 39]. In the case of a bank, where it is a now a priority to check on identity of existing customers, the process of identification can be clumsy and discriminatory, as women can experience when using both their maiden name (in their profession) and married name (in their personal life), which are often not linked rigorously in practical ways. The dependence of one identifier upon another may be illustrated in an identity dependence tree. Downloaded from https://academic.oup.com/cybersecurity/article/3/3/145/4748787 by guest on 07 November 2021 Of greatest interest is surveillance in which the entities are people. A fundamental problem is how identifiers can actually identify a specific individual. An individual’s identity involves many characteristics — social, biographical, psychological and biometric — all of which can be presented digitally. A person identifier is very special data as it is fundamental to theories of trust, privacy and surveillance. Consider some examples of assigning data to individuals. 153 154 Journal of Cybersecurity, 112017, Vol. 3, No. 3 Example 1: Bank Account. Consider the role of identifiers in opening a bank account (in the UK), which is depicted in Fig. 1. Establishing the identifier ID1 of the account holder involves providing evidence using five other identifiers: the validity of ID1 depends upon, or is reduced to, the validities of ID2–ID6. Some of these identifiers have a special status, in that they are designed to reliably denote an individual. In the example, these personal identifiers are guaranteed by the state (ID4) and biometric data (ID5); in the latter case, ID6 is used to allow a passport to be issued by post, without face-to-face interaction. ID2 is used to confirm the validity of the account holder’s address. The identifiers that appear in the nodes of the tree suggest that there can be quite complicated dependencies between systems of identifiers for the same or, more commonly, different contexts. The identifier is made by aggregating pre-existing identifiers: the bank identifier in Fig. 1 is the sum of the identifiers for current address, birth and image, etc. Since identifiers are often built from other identifiers, of central importance is the process of comparing identifiers and relating one type of identifier to another. Indeed, there must be translations between distinct systems for these identifiers for such methods to work. All of these observations and ideas can be formalized to make a precise and general mathematical framework for analysing identifiers. The identity dependence tree is a flexible notion with many more applications than proving personal identity. Example 2: Namespaces. Namespaces are sets of identifiers that use symbols to label, organize and classify entities by names. The names can have a tree structure that enables them to be reused and to form a hierarchy. For example, the names for directories, folders, files, and web domains, etc. are made by concatenating names and denote paths in a tree: the web address http://www.swansea.ac.uk/library/archive-and-research-collec tions/hocc for the History of Computing Collection is a node belonging to the archives which in turn belong to the library of Swansea University. Indeed, there is no shortage of computing contexts where identity dependence trees are used. Domain name systems (e.g. URLs), directory services for networks (e.g. Microsoft’s Active Directory), email addresses (e.g. X500), authentication systems (e.g. Kerberos), and public key infrastructures (e.g. blockchains) are natural sources of rules and structures for creating identifiers. Example 3: Identity Fraud. The creation of new personal identities requires many identifiers to be fabricated: birth certificates, driving licences, employment histories, etc. The practicalities for the USA are discussed extensively in Kevin Mitnick’s memoir [40]. When a fugitive, his method for creating a new identity in different states can be depicted as an identity dependence tree. More generally, Mitnick’s success at social engineering is based on his extensive preparation, which focussed on researching identifiers that he would use in masquerades in the technical, commercial and government contexts of phone system companies, computer and phone manufacturers, and state agencies. The complexity of computing systems suggests that tracing the provenance of a component may lead to circularity and so there may be a need for graphs of identifiers with cycles. Generating identifiers from identifiers Now suppose that to generate an identifier for an entity the input data involves other identifiers that must be presented to verify some of the new data (such as personal identity). The general ideas of the sub-Section ‘Generating identifiers’ can be reformulated with provenance in mind. We revise the processing of the form with a function with new variables: check: Forms  Identifier1  . . .  Identifierk ! {0, 1} that tests the data in a form f 2 Form and the information available from identifiers i1, . . ., ik for consistency against the system’s rules. Again, we assume that check(f, i1, . . ., ik) ¼ 1 means that the form is accepted and check(f, i1, . . ., ik) ¼ 0 means that the form is rejected. The identity of an entity with identifier i depends upon the identifiers i1, . . ., ik. This idea is formalised by re-representing the function generate: Forms ! Identifier [ {reject} (in section ‘Generating identifiers’) by the new function generate: Forms  Identifier1  . . .  Identifierk! Identifier. There are now two ways of creating the identifiers and defining generate, defined by two principles: Provenance Principle: Verification. The data in f 2 Form is sufficient to create an identifier i. The data in the identifiers i1, . . ., ik are used only to confirm or validate the data in f. Here the function has the form generate(f, i1, . . ., ik) ¼ if check(f, i1, . . ., ik)¼1 then issue (f) else reject noting that issue (f) does not need to know the validation identifiers. Secondly, we have the more demanding case: Downloaded from https://academic.oup.com/cybersecurity/article/3/3/145/4748787 by guest on 07 November 2021 Figure 1: Dependency tree of identifiers Journal of Cybersecurity, 112017, Vol. 3, No. 3 155 Provenance Principle: Inheritance. The data in f 2 Form to create an identifier i is inherited from the data in the identifiers i1, . . ., ik. In this case, the function has the form generate(f, i1, . . ., ik) ¼ if check(f, i1, . . ., ik)¼1 then issue (f, i1, . . ., ik) else reject. Comparing identifiers Reductions between systems of identifiers Consider the case where a set Entity of entities has two systems of identifiers: IdSys1 ¼ (Entity, Identifier1 j id1: Identifier1!Entity), IdSys2 ¼ (Entity, Identifier2 j id2: Identifier2!Entity). How can we relate or compare these systems? One simple case is when the identifiers in Identifier1 can be associated or matched with one or more identifiers in Identifier2, and vice versa. This means that given an identifier i 2 Identifier1 of an entity e 2 Entity, we can find corresponding identifiers in Identifier2 that are also identifiers for e. This is formalized as follows: Definition. Let IdSys1 and IdSys2 be systems of identifiers for Entity. A matching relation r: Identifier1  Identifier2 for the systems of identifiers IdSys1 and IdSys2 compares identifiers as to whether or not they are associated with the same entity in the following sense: for every i 2 Identifier1 and j 2 Identifier2, r(i, j) if, and only if, id1(i) ¼ id2(j). Different conditions on a matching relation can be found in examples, depending upon the properties of id1 and id2. An important and common case is: given an identifier i 2 Identifier1 of an entity e 2 Entity, we can find some corresponding identifier in Identifier2 that is also an identifier for e. This is formalized as follows: Definition. Let IdSys1 and IdSys2 be systems of identifiers for Entity. The system of identifiers IdSys1 is said to reduce to the system of identifiers IdSys2 if there is a single-valued reduction mapping f: Identifier1 !Identifier2 that calculates for each identifier in Identifier1 a corresponding identifier in Identifier2 for the same entity in the following sense: for every i 2 Identifier1, id1(i) ¼ id2(f(i)). We write IdSys1  IdSys2 or, more simply and conveniently, id1  id2 (see: Fig. 2). This is but one formalization of the process of comparing the identifiers of Identifier1 to those of Identifer2. Another option would be to return a selection, or all, of the equivalent identifiers. Because the notion of identifier is so abstract, the notion of reduction is very general. Mappings between identifiers are ubiquitous in computing 8 A one-way function is easy to compute but hard to invert. Figure 2: Transformation of identifiers systems and employ many algorithmic techniques. Reductions can be found in situations where alternate terms like ‘translating’, ‘binding’, ‘matching’, and ‘tracing’ are used. Example 1: Tracing. Consider the set Keep of keepers of vehicles in the UK and two systems of identity for this set of entities. Suppose, for simplicity, each keeper has one car and each keeper has a unique address. Each car has a registration mark. Let the first system be Reg ¼ (Keep, Regmk j reg: Regmk ! Keep). Every keeper has an address assigned by the postal service so let Add ¼ (Keep, Address j addr: Address ! Keep). Then the Driver and Vehicle Licensing Agency (DVLA) is responsible for the determining the keeper’s address from the registration mark, which is defined formally by the reduction map red: Regmk ! Address such that for every registration mark r 2 Regmk, reg(r) ¼ addr(red(r)). We say that the system of identities Reg is reducible to Add. Example 2: Hashing. In cybersecurity, hashing techniques provide examples of reductions. For example, consider hashing in managing passwords. Hashing involves a one-way function h: Password ! {0, 1}k where h(w) is a data used to separate w in some context.8 There are many hashing algorithms, such as the secure hash algorithms SHA-256 and SHA-512; and there are methods to enhance their security such as salting, where random strings are added to the passwords to separate common passwords from each other. Thus, hash codes are identifiers and the hash function and salting qualify as reductions. Example 3: Binding. Connections between computing entities require various degrees of reliability and, in secure contexts, trust. In computing, a binding is a mapping associating distinct entities in hardware or software. Commonly, bindings are mappings between syntactic spaces (e.g. namespaces) enabling binding to connect syntactic and semantic entities, or to create layers in software stacks, or create secure chains of identity in cryptography. The term binding has general application and several common forms of binding qualify as reductions between systems of identifiers in our sense. Example 4: Certification. Certification is a security process that seeks to increase trust in identity. It is intended to reduce risks of man-in-the-middle vulnerabilities. In communications, such as calling a webpage, certification can flag doubts about a website. In cryptography, a public key certificate is used to confirm the ownership of a public key. The certificate validates the binding of a public–private key pair to an entity, using a digital signature generated by a certificate authority. Downloaded from https://academic.oup.com/cybersecurity/article/3/3/145/4748787 by guest on 07 November 2021 Access to data belonging to different contexts is desirable in surveillance, intelligence analysis, and academic research; it is undesirable in social and personal contexts as it undermines privacy and freedom. Access is regulated by legal instruments. 156 Definition. The system of identifiers IdSys1 is said to be equivalent to the system of identifiers IdSys2 if there are reduction mappings f: Identifier1 ! Identifier2 and g: Identifier2 !Identifier1 that can exchange identifiers in Identifier1 and corresponding identifiers in Identifier2. We write IdSys1  IdSys2 or, more simply and conveniently, id1  id2. Structuring the space of identifiers Lemma. Let IdSys(Entity) be the set of all identity systems for the non-empty set Entity of entities. The reduction relation  on IdSys(Entity) is reflexive and transitive; and  is an equivalence relation on IdSys(Entity). Proof. Let IdSys ¼ (Entity, Identifier j id: Identifier ! Entity). Trivially, id  id using the identity function Identifier ! Identifier as reduction map; so reduction is reflexive. To show transitivity, let IdSys1 ¼ (Entity, Identifier1 j id1: Identifier1!Entity), IdSys2 ¼ (Entity, Identifier2 j id2: Identifier2!Entity), IdSys3 ¼ (Entity, Identifier3 j id3: Identifier3!Entity). and suppose id1  id2 by f: Identifier1! Identifier2 and id2  id3 by g: Identifier2! Identifier3. Then, for i 2 Identifier1, and j 2 Identifier2, we have id1(i) ¼ id2(f(i)) and id2(j) ¼ id3(g(j)). Composing, f and g we have, id1(i) ¼ id3(g(f(i))) and id1  id3. It is easy to show that  is symmetric. Using the equivalence relation  on IdSys(Entity), we define the set of equivalence classes: IdSys(Entity) ¼ IdSys(Entity)/. The equivalence classes have the standard form of [id] for id 2 IdSys(Entity). The ordering relation  on IdSys(Entity) induces ordering relation  on IdSys(Entity) by [id1]  [id2] () id1  id2. It is easy to check that  is a partial ordering on IdSys(Entity). Furthermore, the ordering  has the least upper bound property: for any [id1], [id2] 2 IdSys(Entity), there is an element [id] such that: i. [id] is an upper bound: [id1]  [id] and [id2]  [id]; ii. no lower element is a bound: if [id1]  [id0]  [id] then either [id1] 5 [id0] or [id0] 5 [id]. To show this we construct an identity system as follows. Given id1, id2 2 IdSys(Entity), take the disjoint union Identifier1 丣 Identifier2 of the sets of Identifier1, and Identifier2 and define id1 丣 id2: Identifier1 丣 Identifier2 ! Entity. wherein given i 2 Identifier1 丣 Identifier2, (id1 丣 id2)(i) ¼ id1(i) if i 2 Identifier1 (id1 丣 id2)(i) ¼ id2(i) if i 2 Identifier2. It is easy to show that [id1 丣 id2] satisfies conditions (i) and (ii). The construction (Identifier1 丣 Identifier2 j id1 丣 id2) is called a co-product of the identity systems. If the sets Identifier1 and Identifier2 are disjoint (as is often the case) then the carrier is their union. Example: Combining Identifiers. Integration of identity data can be tentatively explored using coproducts. Consider making a system of identifiers for entities that are contracts, for which personal identity and current location must be validated. A space of identifiers may be built using the coproducts of pairs of validating systems of identifiers: passports, driver licences, identity cards for identity, and utility bills, local tax declarations for addresses. A partial ordering with the least upper bound property is called an upper semilattice [41]. Thus, gathering together these arguments we have the theorem: Theorem. The reduction relation  on IdSys(Entity) forms an upper semilattice. Corollary. The process of creating new identifiers by inheriting existing identifiers forms an algebraic structure IdSys(Entity) that is an upper semilattice under the reduction relation. Equivalently, any upper semilattice can be reconstructed as an algebraic structure with a binary operation ^ that is associative, commutative, and idempotent [40]. In this form we would have the structure IdSys(Entity) ¼ (IdSys(Entity)/ j ^) with binary operation of least upper bound defined by [id1] ^ [id2] ¼ [id1 丣 id2]. Further properties of the upper semilattice IdSys(Entity) can be developed depending upon properties of the associations and reductions. Concluding remarks Employing simple examples and arguments from first principles, we have used formal methods to analyse precisely concepts involved in surveillance and identity. The formal analysis shows that disparate forms of surveillance can be unified by abstract mathematical definitions, and that (i) finding identities, and (ii) sorting identities into categories, are fundamental in conceptualizing surveillance. The formal analysis of identity shows that the idea of identity can be considered to be exclusively a matter of data, and its diversity can be unified by abstract mathematical definitions. It also shows that (i) comparing identifiers, and (ii) translating between systems of identifiers, are fundamental to understanding identity. Downloaded from https://academic.oup.com/cybersecurity/article/3/3/145/4748787 by guest on 07 November 2021 Reductions are an important concept that occur widely. To conclude, we introduce some concepts and propositions to reveal the richness of the reduction notion and signal the possibility of advanced classification methods. In the Section ‘Provenance of identifiers’, we discussed the combination of identifiers. The process of creating new identifiers from old introduces algebraic operations on spaces of identity systems. One choice of algebraic structure, the semilattice, organizes the space of all possible identity systems using reduction. Journal of Cybersecurity, 112017, Vol. 3, No. 3 Journal of Cybersecurity, 112017, Vol. 3, No. 3 Developing a theory of identity code: Numbers ! Identifier. Classical computability theory is a mathematical theory of what can and cannot be computed on numbers [47], especially binary and decimal, etc. It has been applied to establish the scope and limits of computation on arbitrary data using such maps as code; these maps are called numberings in computability theory [48]. Thus, there is a second ready-made theory that can applied to develop this threelayer model: id 8 code: Numbers ! Identifier ! Entity and theorize what is, and what is not, computable about systems of identifiers. The conception of identity analysed here is inspired by and abstracts ideas about abstract data types and encodings cf. [49]. surveillance tools (XKeyscore, Tempora, etc.) for target discovery and development. Let us observe that new surveillance contexts arise — or are recognized — as more of our professional and social activities are carried out by abstract technological systems rather than by direct face-to-face interactions. To make use of these systems, an individual needs to give over some of his/her identity to distinguish himself/herself from other users in the context. Thus, rather than having a single and holistic identity, individuals now have many separated and overlapping identities, which amplifies hugely the scope of a theory of identifiers. The physical and the virtual are converging; indeed, it seems that the physical world is being sucked into the virtual and a virtual world is being created that is self-contained. It certainly exerts a strong influence on the physical world, and shows signs of autonomy. Thus, the components of monitoring and surveillance — context, entity, observable behaviour, attribute and identity — will seem natural in a world held together by data and software. The multiplicity of contexts and identities, and the possibility of the autonomy of the virtual world, requires the nature of identity to be theorized. The formal framework we offer here is a rigorous analysis of the conceptual structure of surveillance; there ought to be others. What can formalization contribute? Guided by the theory of abstract data types, our formalization of identity aspires to: i. establish and explore principles that assume identity is a matter of data and their implications; ii. make precise essential concepts and classify abstractly methods of identification; iii. provide a unified point of view that illuminates the design of many real systems; iv. explore the role of identity in aspects of security studies, including monitoring and surveillance, personal privacy and trusted translations and interactions. At this stage, these aims require a great deal of further work. Finally, let us observe that if a social science topic is closely associated with abstract technologies that collect and process data effectively then the specification of the software tools — i.e. what the tools are designed to do for users — can be formalized in the same way as we have approached the problem here. Thus, sociological notions that motivate, shape and are ultimately represented in software, can be defined in a formal framework which can be mathematically analysed. In short, sociological theories about human activities that are closely associated with abstract software systems can be expected to have formal models, mathematical theories, as well as oodles of data arising from their use. To isolate, define and analyse ideas is the raison d’être of formal methods, though in new areas their mathematical nature presents obstacles to their reception and appreciation. The use of formal methods to express and analyse general notions is established in areas of philosophy and linguistics but seems to be rare in social studies. Given software’s colonization of professional and social life, and its promotion of monitoring and Big Data, the role of formal methods to theorize social concepts and problems is destined to grow. Acknowledgement Formalizing identity and social theory Ours is an investigation into ideas about surveillance and identity, wherein our models are developed from first principles. It may seem far from the world implied by the revelations of Snowden, with its We thank two anonymous referees for valuable suggestions that improved an earlier version of this paper. This research was partially supported by the EPSRC project Data Release - Trust, Identity, Privacy and Security (EP/ N028139/1 and EP/N027825/1). Downloaded from https://academic.oup.com/cybersecurity/article/3/3/145/4748787 by guest on 07 November 2021 The theoretical analysis presented here is intended to analyse examples and formalize intuitions and ideas. This formal approach is new and inevitably modest, but it can serve as a basis for further conceptual and mathematical investigations relevant to the making and regulation of surveillance systems. Two conceptual and four technical further directions seem to us to be desirable. Conceptually, our theory of identifiers could be used to theorize privacy, interpreted as the control identity. (A formal notion of anonymity was included in the sub-Section ‘Surveillance: detecting attributes’.) The second direction is to use identifiers to explore secure access control policies in computer systems (e.g. role-based access control). Turning to technical directions, first the notion of context in the sub-Section ‘Behaviour as streams of data’ can be developed with various semantic models. Further behavioural features can be formalized, such as the interaction of entities. There are options for formalizing streams – infinite or finite, total or partial streams in discrete or continuous time – and for behaviours modelled as nondeterministic and concurrent processes [42]. Secondly, logics can be used to develop specification and reasoning about attributes. There are several candidates, such as many sorted first order logic and its many derivatives and its extensions – equational, Horn, and temporal logic [43]; and many valued logics [44]. Logics bring with them tools that would expand the scope of the theory and applications. Thirdly, the idea of a system of identifiers needs to be developed mathematically. For example, identifiers commonly reference characteristics of the entities that may be essential to their deployment and application; this information is an additional component that would enrich the mathematical structure of contexts and their identifiers. The idea of the form does provide background initial information, but the characteristics of an entity may need updating because of the behaviour of the entity in time. Systems of identifiers are instances of abstract data types [25], whose extensive general theory based on many sorted algebras and equations [18, 19, 45, 46] can add significantly to the theory and practice of identity. Fourthly, identifiers are assumed to be digital objects. To model this aspect, identifiers must themselves be coded by bits, i.e. by finite binary strings over {0, 1}. This introduces a new digital layer beneath the identifiers in which all computation actually takes place, simulating functions on user data by functions on binary numbers. This digital layer is a source of constraints on the theory of identifiers. The digital layer can be modelled by maps of the form 157 158 References 26. Lyon D (ed). Surveillance as Social Sorting: Privacy, Risk and Digital Discrimination. Routledge, 2003. 27. Chen M, Mao S, Liu Y. Big data: a survey. Mobile Network and Applications 2014; 19:171–209. 28. Minoli D. Building the Internet of Things with IPv6 and MIPv6: The Evolving World of M2M Communications. Wiley, 2013. 29. Trauring M. On the automatic comparison of fingerprint ridge patterns. Nature 1963; 197:938–40. 30. Wayman JL. The scientific development of biometrics over the last 40 years. In: Leeuw K, Bergstra JA (eds), The History of Information Security: A Comprehensive Handbook. Elsevier, 2007, 263–74. 31. Maltoni D, Maio D, Jain A. et al. Handbook of Fingerprint Recognition. Springer, 2009. 32. Choi CQ. Copy that: identical twins are not genetically identical. Scientific American 2008;298:24–26. 33. Twin DNA test: Why identical criminals may no longer be safe, 5 January 2014. http://www.bbc.co.uk/news/magazine-25371014 (14 March 2017, date last accessed). 34. Mystery of which identical twin committed a series of rapes in France is finally solved as one brother confesses after he was given away by a stutter. http://www.dailymail.co.uk/news/article-3225467 (14 March 2017, date last accessed). 35. Identical twins need never be tried for same crime after DNA breakthrough. http://www.telegraph.co.uk/news/science/science-news/105110 87/Identical-twins-need-never-be-tried-for-same-crime-after-DNA-break through.html (14 March 2017, date last accessed). 36. Caplan J, Torpey J. (eds). Documenting Individual Identity: The Development of State Practices since the French Revolution. Princeton University Press, 2001. 37. Higgs E. Identifying the English: A History of Personal Identification 1500 to the Present. Bloomsbury, 2001. 38. Lloyd M. The Passport: The History of Man’s Most Travelled Document. Queen Anne’s Fan, 2016. 39. Castile M. Driver’s License. Bloomsbury, 2015. 40. Mitnick K. Ghost in the Wires. Little Brown & Company, 2011. 41. Birkhoff G. Lattice Theory, 3rd edn. American Mathematical Society, 1995. 42. Bergstra JA, Ponse A, Smolka SA. (ed). Handbook of Process Algebra. Elsevier, 2001. 43. Manzano M. Extensions of First-Order Logic. Cambridge University Press, 2005. 44. Gottwaldov S. A Treatise on Many-Valued Logics. Studies in Logic and Computation, vol. 9, Research Studies Press, 2001. 45. Meseguer J, Goguen JA. Initiality, induction, and computability. In: Nivat M, Reynolds JC (eds), Algebraic Methods in Semantics. Cambridge University Press, 1986. 46. Meinke K, Tucker JV. Universal algebra. In: Abramsky S, Gabbay D, Maibaum, T (eds), Handbook of Logic in Computer Science. Volume I: Mathematical Structures. Oxford University Press, 1992, 189–411. 47. Griffor ER. Handbook of Computability Theory. Elsevier, 1999. 48. Ershov Y. Theory of numberings. In: Griffor ER (ed.), Handbook of Computability Theory. Elsevier, 1991, 473–503. 49. Stoltenberg-Hansen V, Tucker JV. Effective algebras. In: Abramsky S, Gabbay D, Maibaum T (eds), Handbook of Logic in Computer Science. Volume IV: Semantic Modelling. Oxford University Press, 1995, 357–526. Downloaded from https://academic.oup.com/cybersecurity/article/3/3/145/4748787 by guest on 07 November 2021 1. Haggerty K, Ericson R. The surveillance assemblage. British J Sociol 2000; 51:605–662. 2. Introna L, Wood D. Picturing algorithmic surveillance: the politics of facial recognition systems. Surveillance & Society 2004; 2:177–98. 3. Evans D. The Internet of Things: how the next evolution of the internet is changing everything. CISCO White Paper, 2011. 4. Hopper A. Sentient computing. Phil Trans Royal Society 2000; 358: 2349–58. 5. Thrift N. The ‘sentient’ city and what it may portend. Big Data & Society 2014; 1:1–21. 6. Foresight. Future Identities Changing identities in the UK: The next 10 years (Final Project Report). Government Office for Science, 2013. 7. Anderson D. A Question of Trust. Report of the Investigatory Powers Review. HM Stationary Office, 2015. 8. Lyon D. Surveillance Studies: An Overview. Polity Press, 2007. 9. Ball K, Haggerty KD, Lyon D. The Routledge Handbook of Surveillance Studies. Routledge, 2012. 10. Wills D. Surveillance and Identity: Discourse, Subjectivity and the State. Ashgate, 2013. 11. Torpey J. The Invention of the Passport: Surveillance, Citizenship and State. Cambridge University Press, 2000. 12. UK Government. The Identity Card Act 2006: Elizabeth II. Chapter 11. The Stationary Office, 2006. 13. Cole S. Suspect Identities: A History of Fingerprinting and Criminal Identification. Harvard University Press, 2001. 14. Lyon D. Under my skin: from identification papers to body surveillance. In: Caplan C (ed.), Documenting Individual Identity: The Development of State Practices in the Modern World. Princeton University Press, 2001. 15. Wallace H. The UK National DNA Database – Balancing crime detection, human rights and privacy. Science & Society 2006; EMBO reports 7: S26–S30. 16. Vacca J. Biometric Technologies and Verification Systems. Elsevier, 2007. 17. van der Ploeg I. Biometrics and the body as information: narrative issues of the socio-technical coding of the body. In: Lyon D (ed.), Surveillance as Social Sorting: Privacy, Risk and Digital Discrimination. Routledge, 2003. 18. Ehrich HD, Loeckx J, Wolf M. Specification of Abstract Data Types. Wiley, 1996. 19. Goguen J, Thatcher J, Wagner E. An initial algebra approach to the specification, correctness and implementation of abstract data types. In: Yeh R (ed.), Current Trends in Programming Methodology, IV. Prentice-Hall, 1978. 20. Lipschutz S, Lipson M. Discrete Mathematics, 3rd edn. Schaum, 2009. 21. Makinson D. Sets, Logic and Maths for Computing. Springer, 2012. 22. Pieri E. Emergent policing practices: operation shop a looter and urban space securitisation in the aftermath of the Manchester 2011 riots. Surveillance & Society 2014; 12:38–54. 23. Morstatter F, Pfeffer J, Huan L, et al. Is the Sample Good Enough? Comparing Data from Twitter’s Streaming API with Twitter’s Firehose. Arxiv, 2013. http://arxiv.org/abs/1306.5204v1 (14 March 2017, date last accessed). 24. Needham RM. Names. In: Mullender S. (ed.), Distributed Systems. ACM Press, 1989, 89–101. 25. Liskov B, Zilles S. Programming with abstract data types. In: Proceedings of the ACM SIGPLAN Symposium on Very High Level Language. ACM Press, 1974, 50–59. Journal of Cybersecurity, 112017, Vol. 3, No. 3