
Graph-theoretic autofill

2015

Imagine a website that asks the user to fill in a web form and -- based on the input values -- derives a relevant figure, for instance an expected salary, a medical diagnosis or the market value of a house. How to deal with missing input values at run-time? Besides using fixed defaults, a more sophisticated approach is to use predefined dependencies (logical or correlational) between different fields to autofill missing values in an iterative way. Directed loopless graphs (in which cycles are allowed) are the ideal mathematical model to formalize these dependencies. We present two new graph-theoretic approaches to filling missing values at run-time.

MICHAEL MAYER AND DOMINIC VAN DER ZYPEN
arXiv:1512.03199v1 [cs.HC] 10 Dec 2015

1. Introduction

The Internet offers many online calculators that provide relevant figures based on the values entered in a web form. Examples are salary calculators, web-based medical advice, and tools to compute the typical rent of an apartment, just to name a few. Usually, all input fields are mandatory, i.e. cannot be left blank by the user. While this minimizes programming effort for the website provider, it might force the user to make up or guess unknown information (like the living area of a 4-room apartment whose monthly rent is to be estimated by the calculator) just for the sake of completeness, or the user may even search the web for a simpler calculator and never return.

More user-friendly web forms offer fixed default values at least for some of the fields. On the one hand, this approach is attractive thanks to its low programming effort. On the other hand, the more strongly the input fields are interrelated, the less appropriate a fixed default might be in certain cases. While a living area of 80 m² might serve as a good default for a typical apartment, it is certainly unrealistic for a studio or a 10-room penthouse with an indoor swimming pool.
What are the alternatives for handling missing input values on web forms? One option is the reduced feature approach from statistical predictive modelling, see e.g. [5] or [6]. There, for each combination of available fields, an individual calculator is applied. The drawback is obvious: even for a low number p of optional input fields, the programming effort explodes, as 2^p calculators have to be derived, implemented and supported.

An elegant compromise between offering over-simplistic fixed defaults and the infeasible reduced feature approach is to autofill missing values by prespecified functions of other input values, i.e. using dynamic defaults for certain fields. This can either be made visible to the user or applied invisibly before calling the underlying formula of the calculator.

Consider as an example a calculator for the ideal weight based on body height, sex and age. There, the autofill strategy could be as follows.

• Sex s: Mandatory (1: male, 0: female)
• Age x: Autofilled by height z in cm using the formula
\[
x = f(z) := \begin{cases} \lfloor (z - 30)/130 \cdot 16 + 1 \rfloor & \text{if } 30 \le z \le 160, \\ 40 & \text{if } z > 160, \\ 1 & \text{if } z < 30. \end{cases}
\]
• Body height z: Autofilled by age and sex by
\[
z = g(x, s) := \begin{cases} 162 + 16s & \text{if } x > 16, \\ \lfloor (x - 1)/16 \cdot 130 + 30.5 \rfloor & \text{if } x \le 16. \end{cases}
\]

Thus, the web form with the fields “age”, “sex” and “height” can be considered to be filled (in the sense that the underlying weight formula can be applied) as soon as sex and either height or age are provided. Or, to express the same thought in its negative sense: even if we have specified replacement functions for age and height, the web form cannot be autofilled if age as well as height are left blank. In more complex situations, the order in which the replacement functions are applied could also matter, e.g. if sex were autofilled by a function of the height.
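The two replacement functions of the weight-calculator example can be sketched in a few lines of Python. This is only an illustration of the formulas f and g given above; the function names and the `autofill` helper are hypothetical and not part of the original article.

```python
import math

def age_from_height(z):
    """Replacement function f: age x from body height z (in cm)."""
    if z < 30:
        return 1
    if z > 160:
        return 40
    return math.floor((z - 30) / 130 * 16 + 1)

def height_from_age_sex(x, s):
    """Replacement function g: height z from age x and sex s (1: male, 0: female)."""
    if x > 16:
        return 162 + 16 * s
    return math.floor((x - 1) / 16 * 130 + 30.5)

def autofill(sex, age=None, height=None):
    """Hypothetical helper: fill the form, or return None if it cannot be filled."""
    if sex is None or (age is None and height is None):
        return None  # sex is mandatory; age and height cannot both be blank
    if age is None:
        age = age_from_height(height)
    if height is None:
        height = height_from_age_sex(age, sex)
    return {"sex": sex, "age": age, "height": height}
```

For instance, `autofill(1, age=30)` completes the height from age and sex, while `autofill(1)` returns `None` because neither age nor height is available.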
The purpose of this article is to use the notion of directed graphs (in which cycles are allowed) to provide a theoretical framework for determining whether an arbitrarily complex web form (or any other multivariate input, e.g. the argument list of a function written in a programming language like C) can be filled by a fixed set of replacement functions and the partial input provided so far by the user. To do so, each input field is associated with a vertex in a graph and each replacement function with at least one directed edge between these vertices. Note that our considerations do not depend on how the replacement functions are defined or how accurate they are. In practice they will be chosen based on expert knowledge, literature, logical rules or by exploring statistical relationships.

Suppose that knowing the value of field A enables us to guess the value of field B. A natural way to represent this is to draw an arrow from A to B:

A → B

Notice that the relation “B can be guessed if we know A” is sometimes asymmetric, that is, it only works in one direction, as the following example illustrates. Let A stand for “gender” and B for “number of pregnancies had so far”. If we know that some observation represents a male, then we can infer that the value of B must be 0. On the other hand, if we know that the number of pregnancies is 0, the individual at hand can be either male or female, depending on the dataset at hand even with roughly equal probability.

This shows that it is natural to choose directed graphs, essentially points connected by arrows, as a model for representing the conclusions that can be drawn between fields based on replacement functions. Directed graphs and the other terms used here will be defined rigorously in the next section; for a good introduction to graph theory, we point the reader to [1]. The left side of Figure 1 shows the graph corresponding to the example of the weight-calculator.
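As a small illustration of how replacement functions translate into edges, the following Python sketch derives the edge set of the weight-calculator graph from a table listing the arguments of each replacement function. The names `REPLACEMENT_ARGS` and `edges_from_replacements` are hypothetical and chosen only for this example.

```python
# Arguments of each replacement function in the weight-calculator example:
# age is autofilled from height, height from age and sex; sex has none.
REPLACEMENT_ARGS = {
    "Age": ["Height"],
    "Height": ["Age", "Sex"],
}

def edges_from_replacements(args_by_field):
    """One directed edge (argument, field) per argument of a replacement function."""
    return {(arg, field) for field, args in args_by_field.items() for arg in args}

EDGES = edges_from_replacements(REPLACEMENT_ARGS)
```

Each pair `(a, b)` in `EDGES` reads as an arrow a → b, so the field "Sex", which never appears as a key of `REPLACEMENT_ARGS`, has no incoming arrow.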
Figure 1. On the left: Graph for the weight-calculator (vertices Sex, Age, Height). On the right: A more complex situation with the additional field Pregnant.

We consider two kinds of replacement functions. In Section 2 we look at the case of “complete determination”, that is, we are only allowed to calculate the value of a certain field Y if all the values of the fields from which an arrow leads into Y are known or already derived from other fields. Thus, a replacement function can only be evaluated if all arguments are made available in some way, as in the case of the weight-calculator. Next, Section 3 applies to the situation of “partial determination”, where we are allowed to calculate the value of a certain field Y if only some of the values of the fields from which an arrow leads into Y are known or derived from other fields, i.e. some missing or undetermined arguments in the replacement functions are allowed.

2. Filling with complete determination

A directed self-loopless graph (which we will refer to as a directed graph for simplicity) is a tuple G = (V, E) where V is a finite non-empty set and E ⊆ (V × V) \ {(v, v) : v ∈ V}. We call V the set of vertices; they represent the data fields or variables. An edge represents a determination arrow: if (v, w) ∈ E, we can draw some conclusion from the variable v about the variable w.

Definition 2.1. Let x, y ∈ V. Then there is a directed path from x to y if there is n ∈ N and a map p : {0, …, n} → V such that
(1) p(0) = x, p(n) = y;
(2) for k ∈ {0, …, n − 1} we have (p(k), p(k + 1)) ∈ E.
If n ≥ 1 and p : {0, …, n} → V is a directed path from x to x, we call p({0, …, n}) ⊆ V, that is the image of p, a cycle.

Note that, since G has no self-loops, the above definition implies that a cycle has at least 2 elements.

Definition 2.2. For v ∈ V we denote the set of incoming vertices by In(v) = {w ∈ V : (w, v) ∈ E} and we let A(G) = {v ∈ V : In(v) = ∅}.
A(G) represents the fields for which no replacement function is defined. Thus, they are always mandatory (or, without affecting our theory, need a constant default).

Definition 2.3. Let I ⊆ V and let v ∈ V. Then we say that I determines the vertex v if In(v) ≠ ∅ and In(v) ⊆ I.

In other words, I contains all arguments of the replacement function for v, which is represented in the graph by the determination arrows that come in from In(v).

Definition 2.4. We denote the set of vertices determined by I by Dtm(I). Inductively we define the “determination closure” of I ⊆ V in the following way:
(1) I_0 := I;
(2) for n ≥ 0 set I_{n+1} := I_n ∪ Dtm(I_n).
We say that I ⊆ V fills V if there is n ∈ N such that I_n = V.

Note that trivially, V itself fills V. Consequently, if all fields associated with a filling set I are entered, then the remaining fields can be autofilled iteratively by the prespecified replacement functions. The following lemmata are mainly used to derive a characterization theorem for filling subsets.

Lemma 2.5. If I ⊆ V fills V then A(G) ⊆ I.

Proof. Trivial: a vertex without incoming edges is never determined, so it can only belong to some I_n if it already belongs to I. □

Lemma 2.6. Suppose I ⊆ V and suppose that C is a cycle with I ∩ C = ∅. Then I_1 ∩ C = (I ∪ Dtm(I)) ∩ C = ∅.

Proof. We prove the contrapositive. Suppose there is c* ∈ C such that c* ∈ I_1. If c* ∈ I, we are done, so assume c* ∈ Dtm(I), that is In(c*) ⊆ I. Since C is a cycle, we have C ∩ In(c*) ≠ ∅, say d ∈ C ∩ In(c*). Therefore d ∈ In(c*) ∩ C ⊆ I ∩ C, so I ∩ C ≠ ∅. □

An inductive application of Lemma 2.6 shows that any subset that fills V intersects every cycle in G = (V, E):

Lemma 2.7. If I ⊆ V fills V and C ⊆ V is a cycle, then I ∩ C ≠ ∅.

Proof. Let I ⊆ V be any subset of V and let C ⊆ V be a cycle. Applying Lemma 2.6 inductively implies that
(⋆) if I ∩ C = ∅ then I_n ∩ C = ∅ for all n ∈ N.
So if I fills V then I_n = V for some n, and in particular I_n ∩ C ≠ ∅ for that n. The contrapositive of (⋆) directly implies I ∩ C ≠ ∅. □
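The determination closure of Definition 2.4 can be computed by direct iteration. The following Python sketch is one possible rendering under the assumption that a graph is given as a set of vertices and a set of (source, target) pairs; the function names are hypothetical.

```python
def incoming(edges, v):
    """In(v): all vertices w with an edge (w, v)."""
    return {w for (w, u) in edges if u == v}

def determined(vertices, edges, given):
    """Dtm(I): vertices v with In(v) non-empty and In(v) a subset of I."""
    return {v for v in vertices
            if incoming(edges, v) and incoming(edges, v) <= given}

def closure(vertices, edges, given):
    """Iterate I_{n+1} = I_n ∪ Dtm(I_n) until the sequence stabilizes."""
    current = set(given)
    while True:
        nxt = current | determined(vertices, edges, current)
        if nxt == current:
            return current
        current = nxt

def fills(vertices, edges, given):
    """True if the determination closure of `given` is all of V."""
    return closure(vertices, edges, given) == set(vertices)

# Weight-calculator graph from the introduction.
V = {"Sex", "Age", "Height"}
E = {("Height", "Age"), ("Age", "Height"), ("Sex", "Height")}
```

Because V is finite and the sequence I_0 ⊆ I_1 ⊆ … is increasing, the loop stops after at most |V| rounds. On the weight-calculator graph, `fills(V, E, {"Sex", "Age"})` holds, whereas `{"Sex"}` alone does not fill V.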
This helps us to prove a characterization theorem for filling subsets I ⊆ V.

Theorem 2.8. Let G = (V, E) be a self-loopless directed graph, and let I ⊆ V. Then the following statements are equivalent:
(1) I is filling;
(2) A(G) ⊆ I, and for every cycle C in (V, E) we have C ∩ I ≠ ∅.

A web form can thus be autofilled as soon as values are provided (1) for all fields without a replacement function, and (2) for at least one field in every cycle.

Proof of Theorem 2.8. (1) ⟹ (2). Taken care of by Lemmas 2.5 and 2.7.

(2) ⟹ (1). We argue by contraposition: we assume that I ⊆ V is not filling and that v ∈ I for every v ∈ V with In(v) = ∅, and construct a cycle C* that does not intersect I (i.e. C* ∩ I = ∅). Since I is not filling, we have I_n ≠ V for all n ∈ N. Since V is finite, the increasing sequence I = I_0 ⊆ I_1 ⊆ I_2 ⊆ … stabilizes, that is, there is N ∈ N such that I_k = I_N for all k ≥ N, and of course I_N ≠ V. So we pick z_0 ∈ V \ I_N. Note that In(z_0) ≠ ∅ because all vertices v with In(v) = ∅ are included in I by assumption, and z_0 ∉ I. Since I_{N+1} = I_N, the vertex z_0 is not determined by I_N, which means In(z_0) ⊈ I_N. So we pick z_1 ∈ In(z_0) \ I_N. Inductively, and similarly to above, we pick z_{k+1} ∈ In(z_k) \ I_N for all k ∈ N. Then we consider the set Z = {z_k : k ∈ N}. By construction Z has empty intersection with I_N and therefore also with I ⊆ I_N. Moreover we have (z_{k+1}, z_k) ∈ E for all k. Because V is finite, some vertex repeats in the sequence, say z_i = z_j with i < j; then z_j, z_{j−1}, …, z_i is a directed path from z_i to itself. Hence the set Z contains a cycle C*, and we have C* ∩ I = ∅. □

A cycle C is said to be minimal if it is minimal amongst all cycles with respect to ⊆ (that is, whenever C′ ⊆ C and C′ is a cycle, then C′ = C). Sometimes, minimal cycles are called elementary, for instance in [2].

Remark 2.9. It is easy to see that Theorem 2.8 can be made a bit simpler: in order to verify that a certain subset I ⊆ V is filling, it suffices to check that A(G) ⊆ I and that I intersects every minimal cycle.

Example 2.10.
In the introductory example of a weight-calculator, we have considered the three input fields “Sex”, “Age” and “Height” along with two replacement functions. As shown on the left hand side of Figure 1, the fields are represented by the three vertices in the graph G. The directed edge from “Height” to “Age” stands for the replacement function f for age based on height. Finally, the replacement function g for height depending on age and sex is represented by the two other edges pointing to “Height”. By our definitions, we can for instance make the following statements:
(1) The vertex set {Sex, Age} determines the vertex Height.
(2) The vertex set {Height} determines the vertex Age.
(3) The set A(G) equals {Sex} (no edge points to it).
(4) By Theorem 2.8, the vertex set {Sex} is not filling, i.e. by entering only sex, the other fields cannot be autofilled. The reason is that the set {Sex} does not intersect the cycle formed by the two vertices {Age, Height}.
(5) There are exactly three filling subsets for G: {Sex, Age}, {Sex, Height} and (trivially) also {Sex, Age, Height}, as they all contain A(G) and at least one element of the only cycle.

For complex graphs, the application of Theorem 2.8 might need algorithmic support to identify A(G) and particularly the cycles, in order to check at run-time whether the partial input already fills the remaining fields. To do so, the following algorithms can be applied.

Algorithm 2.11 (Identifying A(G)). A convenient representation for a directed graph is the adjacency matrix: the vertices are numbered 1, …, n and we assign a binary n × n matrix M_G ∈ {0, 1}^{n×n} to G by setting M_G[i, j] = 1 if and only if (i, j) ∈ E, and M_G[i, j] = 0 otherwise. Now A(G) is easily identified: i ∈ A(G) if and only if M_G[·, i] (that is, the i-th column vector) is the constant 0 vector.

Algorithm 2.12 (Identifying cycles). Different algorithms exist for finding cycles in a graph, see e.g. [2] for a solution and further references.
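The two algorithms above, combined with Theorem 2.8 and Remark 2.9, can be sketched in Python as follows. The sketch assumes the adjacency matrix is a list of lists with vertices numbered from 0 (not from 1 as in the text), and it enumerates elementary cycles by a plain depth-first search, which is a simple stand-in for the algorithm of [2] and can be exponential in the worst case. All function names are hypothetical.

```python
def sources(matrix):
    """A(G): indices i whose column in the adjacency matrix is all zeros."""
    n = len(matrix)
    return {i for i in range(n) if all(matrix[r][i] == 0 for r in range(n))}

def elementary_cycles(matrix):
    """Vertex sets of all elementary cycles, found by naive depth-first search.

    Each cycle is discovered from its smallest vertex, so the search from
    `start` only visits vertices with a larger index.
    """
    n = len(matrix)
    found = set()

    def extend(start, path):
        last = path[-1]
        for nxt in range(n):
            if not matrix[last][nxt]:
                continue
            if nxt == start:
                found.add(frozenset(path))      # closed the cycle
            elif nxt > start and nxt not in path:
                extend(start, path + [nxt])

    for start in range(n):
        extend(start, [start])
    return found

def is_filling(matrix, given):
    """Criterion of Theorem 2.8: A(G) ⊆ I and I meets every (elementary) cycle."""
    given = set(given)
    return sources(matrix) <= given and all(
        cycle & given for cycle in elementary_cycles(matrix))

# Weight-calculator graph with vertices 0 = Sex, 1 = Age, 2 = Height.
M = [[0, 0, 1],
     [0, 0, 1],
     [0, 1, 0]]
```

With this numbering, `sources(M)` is `{0}` (the field Sex), the only elementary cycle is `{1, 2}` (Age and Height), and `is_filling(M, {0, 1})` confirms that {Sex, Age} is filling.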
The next result links graph filling to the concept of a directed acyclic graph (DAG).

Theorem 2.13 (Connection to DAGs). Let G = (V, E) be a self-loopless directed graph, and let I ⊆ V. Furthermore denote by G′ = (V, E′) with E′ = E \ {(x, i) : x ∈ V and i ∈ I} the subgraph without edges pointing to any v ∈ I. Then the following are equivalent:
(1) I is filling in G = (V, E);
(2) I is filling in G′ = (V, E′) and G′ is a DAG.

Proof. (1) ⟹ (2). First, we prove that I is filling in G′ = (V, E′). Let I′_n be the determination closures of I in the graph G′. With I_n we denote the determination closures of I in G. For v ∈ V we denote by In′(v) the set of incoming vertices in the graph G′, and by In(v) the set of incoming vertices in G. As I is filling in G by assumption, we have I_N = V for some N ∈ N, so it suffices to show that I′_n ⊇ I_n for all n ∈ N. We proceed by induction. Clearly I′_0 = I_0 = I. Suppose we have I′_k ⊇ I_k for some k and show that I′_{k+1} ⊇ I_{k+1}. Let v ∈ I_{k+1}. If v ∈ I_k we get v ∈ I′_{k+1} automatically since v ∈ I_k ⊆ I′_k ⊆ I′_{k+1}. If v ∉ I_k then we have trivially v ∉ I. Moreover, the fact that I_k determines v in G implies in particular that In(v) ≠ ∅, so we pick some incoming vertex j ∈ In(v). Since v ∉ I we have (j, v) ∈ E′ by the definition of E′, so
(A) In′(v) ≠ ∅ in G′.
Since I_k determines v in the graph G, we get In(v) ⊆ I_k, and by definition of G′ we get In′(v) ⊆ In(v). Combining this gives In′(v) ⊆ I_k ⊆ I′_k by the induction assumption, and together with (A) this implies that I′_k determines v in the graph G′, so v ∈ I′_{k+1}, which finishes the proof that I is filling in G′.

Next, we show that G′ is a DAG. Assume that G′ contains a cycle C. By Theorem 2.8 we know that I ∩ C ≠ ∅, so pick v* ∈ I ∩ C. Because C is a cycle there is a bijection p : {0, …, n} → C such that (p(k), p(k + 1)) ∈ E′ for k ∈ {0, …, n − 1} and (p(n), p(0)) ∈ E′. We can pick p such that p(0) = v*.
But by definition of E′ we have (p(n), v*) ∈ E \ E′, a contradiction.

(2) ⟹ (1). This is easily verified by noticing that Dtm(I) in the graph G′ = (V, E′) equals Dtm(I) in G = (V, E) (we don’t even need the assumption that G′ = (V, E′) is a DAG). □

Since a DAG G′ = (V, E′) does not contain cycles, Theorem 2.8 ensures that I ⊆ V is filling in G′ if and only if A(G′) ⊆ I. This provides a second possibility to verify whether a subset I ⊆ V of vertices of a directed graph G = (V, E) is filling or not: delete all edges pointing to vertices in I and check if the resulting subgraph G′ is a DAG with A(G′) ⊆ I. If yes, I is filling.

Example 2.14. In Example 2.10 we have utilized Theorem 2.8 to show that, amongst others, the set I := {Sex, Age} is filling. Alternatively, we consider the subgraph G′ without edges pointing to I (see left side of Figure 2). Obviously, G′ is a DAG with I = A(G′) and thus, I is filling in the original graph G. The right side of Figure 2 shows that I′ = {Height} alone is not filling: although the subgraph G′′ is a DAG, the subset A(G′′) = {Sex, Height} ⊈ I′.

Figure 2. On the left: Subgraph G′ without edges pointing to the candidate set I := {Sex, Age} (highlighted). On the right: Subgraph G′′ without edges pointing to I′ = {Height}. Both subgraphs are DAGs. Since A(G′) ⊆ I, I is filling. But A(G′′) ⊈ I′, thus I′ is not filling.

While in our simple examples it is easy to verify if a (sub-)graph is a DAG, in more involved settings a systematic check is needed. A simple check builds on the following algorithm.

Algorithm 2.15 (Directed path). Whenever one deals with directed paths in graphs, Dijkstra’s Algorithm as described by himself in [3] is one of the most useful tools: it finds the shortest path (if there is any) between vertices. So let G be a directed graph on n vertices and let M_G be its adjacency matrix.
There is a directed path of length k ≥ 1 from vertex i to vertex j if and only if M_G^k[i, j] > 0, where the matrix power is computed over the integers. For distinct vertices i and j it suffices to check k ∈ {1, …, n − 1}; a directed path from a vertex back to itself may need k = n.

Algorithm 2.16 (Checking whether G is a DAG). A cycle visits at most n vertices, so if G contains a cycle, it contains a directed path of length at most n from some vertex to itself. Algorithm 2.15 therefore gives the following criterion: G is a DAG if and only if the sum of all traces of M_G^1, M_G^2, …, M_G^n is 0. Other algorithms exist to identify if a directed graph is acyclic, see e.g. [4].

Definition 2.17. We say that a filling subset I ⊆ V is minimal (with respect to ⊆) if no proper subset of I is filling.

The following example shows that two different minimal filling subsets do not necessarily have the same number of elements:

Example 2.18. Let V = {1, 2, 3} and E = {(1, 2), (2, 1)} ∪ {(2, 3), (3, 2)}, that is, the graph 1 ⇄ 2 ⇄ 3. Then I = {1, 3} is minimal filling, and so is J = {2}, but I and J differ in cardinality. Note that, even though we could visit each vertex of the graph by starting at K = {1} and following the directed edges, K is not filling: the reason is that the vertex 2 is not completely determined by K alone (there is also an edge pointing from 3 to 2, and 3 is only determined through 2). Put differently, K_1 = K ∪ Dtm({1}) = {1}, thus K_1 = K_2 = K_3 = · · · = {1}. Therefore K = {1} is not filling.

Can we choose a minimal filling subset such that it intersects every cycle (or every minimal cycle) in exactly 1 vertex? Unfortunately not:

Example 2.19. Let V = {0, 1, 2}, let E = (V × V) \ {(k, k) : k ∈ {0, 1, 2}}, and let G = (V, E). Note that A(G) = ∅. If we take K = {k} for some k ∈ {0, 1, 2}, it is easily verified that Dtm(K) = ∅. So K_n = K for all n ∈ N, and therefore K is not filling. So if I ⊆ V is minimal filling, it contains at least 2 vertices of V = {0, 1, 2}, say {k_1, k_2} ⊆ I for k_1 ≠ k_2 ∈ {0, 1, 2}. But then C = {k_1, k_2} is a minimal cycle by definition of E, and |I ∩ C| > 1.

By Theorem 2.8 or Theorem 2.13 it is not hard to verify whether a given set of vertices is filling or not.
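As an illustration, the check based on Theorem 2.13 can be sketched in Python. The sketch assumes the graph is given as a set of vertices and a set of (source, target) pairs, and it tests acyclicity with Kahn's topological-sort algorithm rather than with the matrix-power criterion of Algorithm 2.16; the function name is hypothetical.

```python
from collections import deque

def is_filling_via_dag(vertices, edges, given):
    """Theorem 2.13: I is filling iff G' is a DAG with A(G') ⊆ I,
    where G' drops every edge pointing to a vertex of I."""
    given = set(given)
    pruned = {(a, b) for (a, b) in edges if b not in given}   # edge set E'

    indegree = {v: 0 for v in vertices}
    for _, b in pruned:
        indegree[b] += 1

    no_incoming = {v for v in vertices if indegree[v] == 0}   # A(G')
    if not no_incoming <= given:
        return False

    # Kahn's algorithm: repeatedly remove vertices without incoming edges.
    queue = deque(no_incoming)
    removed = 0
    while queue:
        v = queue.popleft()
        removed += 1
        for a, b in pruned:
            if a == v:
                indegree[b] -= 1
                if indegree[b] == 0:
                    queue.append(b)
    return removed == len(vertices)   # all vertices removed iff G' is acyclic

# Graph of Example 2.18.
V18 = {1, 2, 3}
E18 = {(1, 2), (2, 1), (2, 3), (3, 2)}
```

On the graph of Example 2.18 this reproduces the statements made there: `{1, 3}` and `{2}` pass the check, while `{1}` fails because the pruned graph still contains the cycle through 2 and 3.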
However, in general there will be no simple algorithm to identify the smallest possible filling subset of V. The following greedy algorithm will, however, usually provide a good approximation.

Algorithm 2.20.
(1) Identify the set I = A(G) of all vertices without incoming edges.
(2) Choose all vertices W = {w_1, …, w_m} in all minimal cycles that do not intersect I. Denote by n_i the number of these cycles that contain w_i, 1 ≤ i ≤ m.
(3) If W ≠ ∅, pick any w_i ∈ W with maximal n_i and set I := I ∪ {w_i}.
(4) Go to Step 2 as long as W contains at least two elements. Otherwise stop with I as solution.

The algorithm will terminate after at most n − 1 iterations, with n being the number of vertices. Due to its greedy nature, it might miss the optimal solution in certain hypothetical cases.

Example 2.21. To illustrate the algorithm and also the power of Theorems 2.8 and 2.13, take the graph G on the right hand side of Figure 1. There are no vertices without incoming edges, thus A(G) is empty. There are four minimal cycles that contain the following vertices.
(1) {Age, Pregnant}
(2) {Sex, Pregnant}
(3) {Height, Age, Pregnant}
(4) {Sex, Height, Age, Pregnant}

Consequently, after the first iteration of the algorithm up to Step 3, the candidate filling set I consists of the vertex “Pregnant” (it hits the highest number of cycles and A(G) is empty). Furthermore, W consists of all vertices, so we start a second iteration, but now W is left empty and the algorithm stops. Thus, even if only the field “Pregnant” is entered, the remaining fields can be autofilled without further input. Of course there are also filling subsets that do not contain “Pregnant”, e.g. I′ = {Age, Sex} (apply Theorem 2.8). Figure 3 presents the two subgraphs without edges pointing to I = {Pregnant} (left picture) resp. to I′ (right picture), illustrating the idea of Theorem 2.13.

Figure 3.
Subgraphs resulting from removing the incoming edges to I = {Pregnant} resp. I′ = {Age, Sex} (both highlighted) in the situation of Example 2.21. Since I resp. I′ contain the vertices without incoming edges and since the subgraphs are acyclic, both I and I′ are filling in the original graph without removed edges.

3. Filling with partial determination

Given I ⊆ V and v ∈ V we say that I determines the vertex v with partial input if In(v) ∩ I ≠ ∅, that is, if at least one argument of the replacement function for v is provided. For short, we say in that case that I p-determines v. We denote the set of vertices p-determined by I by pDtm(I). Inductively we define the “p-determination closure” of I ⊆ V in the following way:
(1) I_0^(p) := I;
(2) for n ≥ 0 set I_{n+1}^(p) := I_n^(p) ∪ pDtm(I_n^(p)).
We say that I ⊆ V p-fills V if there is n ∈ N such that I_n^(p) = V.

Example 3.1. In Example 2.18, I = {1} is p-filling (but not filling) because I_1^(p) = {1, 2} and I_2^(p) = {1, 2, 3}.

There is a first, trivial characterization of p-filling subsets of a graph G = (V, E):

Proposition 3.2. Let G = (V, E) be a self-loopless directed graph, and let I ⊆ V. Then the following are equivalent:
(1) I is p-filling;
(2) A(G) ⊆ I, and for every vertex x ∈ V \ I there is a directed path from some j ∈ I to x.

However, we can do much better than that. Next, we specify in a mathematical way what it means to “collapse” points x, y ∈ V when there are directed paths between these points in either direction.

Definition 3.3. If G = (V, E) is a directed graph and x, y ∈ V we say that x, y are strongly connected, in symbols x ≃ y, if there exists a directed path from x to y and vice versa.

It is straightforward to verify that ≃ is an equivalence relation on V. The set of elements equivalent to x ∈ V is denoted by [x]≃, and we call it the strongly connected component (scc) containing x. The set of scc’s on V with respect to ≃ is denoted by V/≃.
Note that by construction, every v ∈ V lies in a unique scc, the scc’s are mutually disjoint, and all scc’s are non-empty. We put a directed graph structure on V/≃ and set G/≃ = (V/≃, E/≃) where
E/≃ = {(C, D) ∈ V/≃ × V/≃ : C ≠ D and there are c ∈ C, d ∈ D such that (c, d) ∈ E}.

The following is an elementary observation:

Lemma 3.4. If G is a directed graph, G/≃ is a DAG.

Before we show how strongly connected components come into play for finding p-filling sets, we need some basic observations.

Lemma 3.5.
(1) If G = (V, E) is a DAG, then for all v ∈ V there is x ∈ A(G) and a directed path from x to v.
(2) Let G = (V, E) be any graph. If C, D ∈ V/≃ and there is a directed path in G/≃ from C to D, then for all c ∈ C, d ∈ D there is a directed path in G from c to d.

Combining these observations leads us to the following:

Proposition 3.6. Let G = (V, E) be any graph, and let v ∈ V. Then there is a strongly connected component C_0 ∈ A(G/≃) such that for every c_0 ∈ C_0 there is a directed path in G from c_0 to v.

Proof. Let [v]≃ be the strongly connected component containing v. Lemma 3.4 says that G/≃ is a DAG. By Lemma 3.5 (1) there is C_0 ∈ A(G/≃) and a directed path in G/≃ from C_0 to [v]≃. Finally, Lemma 3.5 (2) implies that there is a directed path in G from any c_0 ∈ C_0 to v. □

Theorem 3.7. Let G = (V, E) be a directed graph, I ⊆ V, and consider the graph G/≃. Then the following statements are equivalent:
(1) I is p-filling;
(2) I intersects every strongly connected component C ∈ A(G/≃).

Proof. (1) ⟹ (2). Suppose C* ∈ A(G/≃) and let I ⊆ V \ C*. We show that I is not p-filling for G by establishing that I_n^(p) ∩ C* = ∅ for all n ∈ N. The statement is true for I_0^(p) = I. Assume that I_k^(p) ∩ C* = ∅. The fact that C* ∈ A(G/≃) means by definition of A(·) and by definition of E/≃ that there is no edge in E coming into C* from the outside; more precisely, for all x ∈ V \ C* and c ∈ C* we have (x, c) ∉ E.
Therefore, for all c ∈ C* we have c ∉ I_{k+1}^(p), that is, C* ∩ I_{k+1}^(p) = ∅. This inductive argument proves that I_n^(p) ∩ C* = ∅ for all n ∈ N, so I_n^(p) ≠ V for all n, and therefore I is not p-filling.

(2) ⟹ (1). Fix v ∈ V. By Proposition 3.2 we need to establish that if I intersects every strongly connected component in A(G/≃), then there is a directed path from some i ∈ I to v. The proposition then implies that I is p-filling. Use Proposition 3.6 to find C_0 ∈ A(G/≃) such that for every c_0 ∈ C_0 there is a directed path in G from c_0 to v. Since I intersects C_0 by assumption, pick j_0 ∈ I ∩ C_0. So there is a directed path in G from j_0 ∈ I to v. So by Proposition 3.2, I is p-filling because v was chosen arbitrarily. □

Algorithm 3.8. In [7], Tarjan introduced an algorithm that identifies for any graph G = (V, E) its strongly connected components in time O(|V| + |E|).

Note that Theorem 3.7 implies that minimal p-filling subsets intersect every member of A(G/≃) in exactly one point. So this implies:

Corollary 3.9. All minimal (with respect to ⊆) p-filling subsets have the same cardinality.

This is in sharp contrast to the situation in the previous section, where minimal filling subsets can have different cardinalities (see Example 2.18).

Moreover, Theorem 3.7 gives an efficient algorithm to find minimal p-filling sets: identify the strongly connected components that don’t have an incoming edge (that is, A(G/≃)), and pick a vertex from each.

Example 3.10.
In the introductory example of a weight-calculator with the vertices V = {Age, Height, Sex}, we could modify the replacement function
\[
g(x, s) := \begin{cases} 162 + 16s & \text{if } x > 16, \\ \lfloor (x - 1)/16 \cdot 130 + 30.5 \rfloor & \text{if } x \le 16 \end{cases}
\]
for height z based on (non-missing) age x and sex s by a function that allows one of the two arguments to be missing, for instance by
\[
g'(x, s) := \begin{cases} 162 + 16s & \text{if } (x > 16 \text{ or } x \text{ missing}) \text{ and } s \text{ non-missing}, \\ 170 & \text{if } x > 16 \text{ and } s \text{ missing}, \\ \lfloor (x - 1)/16 \cdot 130 + 30.5 \rfloor & \text{if } x \le 16. \end{cases}
\]
Then, in our graph-theoretic autofill framework, the vertex “Height” would be partially determined by “Sex” and “Age”, and “Age” (partially) determined by “Height”. Collapsing the graph on the left side of Figure 1 under ≃ yields the DAG G/≃ with the two strongly connected components {Sex} and {Age, Height} as vertices. By Theorem 3.7, any subset of V containing Sex, the only vertex of the single component in A(G/≃) = {{Sex}}, would be p-filling.

Generally, using partial determination requires more effort to properly define the replacement functions compared to complete determination, as these functions also need to treat missing input in a reasonable way. However, usually a smaller subset of input values is required to fill the remaining values.

4. Conclusion

Loopless directed graphs turn out to be the ideal framework for representing different kinds of “inference” or “implication”. In this article, we used directed graphs in the context of missing values in vectors. In web forms used today, the set of mandatory fields is fixed. Our approach offers something new: flexible and smart filling of missing values.

References

[1] R. Diestel. Graph Theory, Springer Verlag, 2010.
[2] D. Johnson. Finding all the elementary circuits of a directed graph, SIAM Journal on Computing 4(1): 77–84, 1975.
[3] E. Dijkstra. A note on two problems in connexion with graphs, Numerische Mathematik 1: 269–271, 1959.
[4] R. E. Tarjan. Edge-disjoint spanning trees and depth-first search, Acta Informatica 6(2): 171–18, 1976.
[5] M. Saar-Tsechansky and F. Provost.
Handling Missing Values when Applying Classification Models, Journal of Machine Learning Research 8: 1625–1657, 2007.
[6] D. Schuurmans and R. Greiner. Learning to classify incomplete examples, Computational Learning Theory and Natural Learning Systems IV: Making Learning Systems Practical: 87–105, MIT Press, Cambridge MA, 1997.
[7] R. E. Tarjan. Depth-first search and linear algorithms for graphs, SIAM Journal on Computing, 1(2): 146–160, 1972.

Consult AG, CH-8050 Zurich, Switzerland
E-mail address: [email protected]

Federal Office of Social Insurance, CH-3003 Bern, Switzerland
E-mail address: [email protected]