
A Framework for Controlling Non-Symbolic Numerical Stimuli


Yoel Shilat1, 2* ∙ Avishai Henik1, 2 ∙ Hanit Gallili2, 3 ∙ Shir Wasserman4 ∙ Moti Salti5

1 Department of Psychology, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel
2 The Zelman Center for Brain Science Research, Ben-Gurion University of the Negev, Beer-Sheva, Israel
3 School of Brain Sciences and Cognition, Ben-Gurion University of the Negev, Beer-Sheva, Israel
4 Behavioral Sciences Program, Ben-Gurion University of the Negev, Beer-Sheva, Israel
5 Brain Imaging Research Center, Ben-Gurion University of the Negev, Beer-Sheva, Israel
* Corresponding author: [email protected]; Phone: +972-8-6477209

Abstract

Non-symbolic numerical stimuli play a crucial role in numerical cognition. Physical properties such as surface area, density, and item circumference are inherently correlated with quantity. The correlations between physical properties and quantity mask the mechanism underlying numerical perception. Non-symbolic stimuli are generated using different generation methods (GMs) aimed at controlling physical properties. The way a GM controls physical properties affects numerical judgments. Here, using a novel data-driven approach, we provide a methodological review of non-symbolic stimuli GMs developed since 2000. Annotators tagged the GMs’ control over physical properties. Next, GMs were qualitatively analyzed for different property controls, terminology, and definitions. The tagging and qualitative analysis provided data suitable for quantitative analysis of GMs. We found that the field thrives with a wide variety of GMs aimed at tackling new methodological and theoretical ideas. However, the field lacks a common language and a method to incorporate new ideas in the existing literature. Furthermore, these shortcomings impair the comparison, replication, and reanalysis of previous studies in light of new ideas. We present guidelines for GMs that will hopefully contribute to the field.
First, researchers should define controlled properties explicitly and consistently. Second, researchers should provide the code package used to generate stimuli. Third, researchers should also provide the actual stimuli, coupled with the behavioral and neuronal responses. This last guideline would enable researchers to reanalyze previously obtained data and to incorporate new ideas in the context of prior research.

Keywords: Numerical cognition ∙ Numerosity ∙ Physical properties ∙ Reproducibility ∙ Comparability ∙ Reanalysis

Introduction

Recent numerical comparison studies struggle to control the physical properties of non-symbolic arrays. An array of items can be described by its number of items or by its physical properties. The physical properties of any item array have a natural and inherent correlation with the array’s quantity (Dehaene, 1999; Leibovich et al., 2017; Mehler & Bever, 1967). Increasing or decreasing the number of items necessarily changes at least one of the array’s physical properties (see Figure 1). Consider three large apples lying on the kitchen counter. Adding a fourth large apple would increase the total surface area and circumference of the stack (array) of apples. Early studies show an association between performance in numerical judgments and the physical properties of the presented stimuli (e.g., French, 1953; Frith & Frith, 1972; Piaget, 1968). In the contemporary study of numerical cognition, the physical properties of a numerical array are considered in two opposing manners. Some treat them as a biasing factor distracting us from the thing we want to study, namely the ability to judge numerosity (e.g., Halberda et al., 2008; Piazza et al., 2004). Others treat them as a key to understanding numerical capacity (Gebuis & Reynvoet, 2011a, 2011b; Leibovich et al., 2017). Researchers holding either view create non-symbolic numerical stimuli to study numerical cognition.
Regardless of their view of numerical cognition, researchers must address the natural correlation between the array’s physical properties and its quantity to prevent confounds. Importantly, there are abundant ways to create non-symbolic numerical stimuli relying on different theoretical and methodological prisms (DeWind et al., 2015; Gebuis & Reynvoet, 2011a, 2011b; Katzin et al., 2019; Piazza et al., 2004; Salti et al., 2017). The different ways the stimuli are created change the correlations between numerical and physical properties. These differing correlations bias behavioral and neuronal results (Clayton et al., 2015; DeWind & Brannon, 2016; Gebuis & Reynvoet, 2011a, 2011b; Kuzmina & Malykh, 2022; Leibovich et al., 2017; Salti et al., 2017; Smets et al., 2015). They also make it harder to compare different studies (Clayton et al., 2015; Smets et al., 2015), and limit the ability to examine previous studies through new theoretical prisms.

Figure 1
A Natural Correlation between Physical Properties and Numerosity
Note. Increasing or decreasing the number of items necessarily changes at least one of the array’s physical properties. The left panel depicts three items, and the right one displays four items. A list of some of the arrays’ physical properties is displayed on the left and right of the arrays. Adding one item will necessarily change one or more of the arrays’ physical properties. Increasing the number of items to four in the array in the top row would increase the total surface area and circumference of the array. In contrast, increasing the number of items to four in the array in the bottom row, while keeping the array’s total surface area and circumference constant, will decrease the items’ average diameter and average area (denoted as the item surface area).
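The trade-off described in the Figure 1 note can be checked with a few lines of arithmetic. The following is a minimal sketch, not taken from any of the reviewed GMs: the radii are hypothetical, the function names are our own, and for simplicity only the total surface area (not the circumference) is held constant in the second case, using equal-sized dots.

```python
import math


def total_surface_area(radii):
    """Sum of the dots' areas (each dot is a circle)."""
    return sum(math.pi * r ** 2 for r in radii)


def total_circumference(radii):
    """Sum of the dots' circumferences."""
    return sum(2 * math.pi * r for r in radii)


def average_diameter(radii):
    """Mean diameter of the dots."""
    return sum(2 * r for r in radii) / len(radii)


# Hypothetical radii in arbitrary units (an assumption for illustration).
three_dots = [1.0, 1.0, 1.0]

# Case 1: add a fourth dot of the same size.
four_same_size = three_dots + [1.0]

# Case 2: four equal dots whose summed area matches the three-dot array.
# Solving 4 * pi * r^2 = 3 * pi * 1^2 gives r = sqrt(3 / 4).
four_equal_area = [math.sqrt(3 / 4)] * 4

for label, radii in [
    ("three dots", three_dots),
    ("four dots, same size", four_same_size),
    ("four dots, equal total area", four_equal_area),
]:
    print(
        f"{label}: total area={total_surface_area(radii):.2f}, "
        f"total circumference={total_circumference(radii):.2f}, "
        f"average diameter={average_diameter(radii):.2f}"
    )
```

In the first case the total surface area and total circumference both grow with numerosity; in the second the total surface area is unchanged but the average diameter shrinks, so some physical property still covaries with number.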
Non-symbolic numerical stimuli are multifaceted and can be classified according to different taxonomies; for example, according to a global-local distinction (Guy & Medioni, 1993; Navon, 1977), or an intrinsic-extrinsic distinction (Gebuis & Reynvoet, 2011a; Salti et al., 2017). Individual physical properties can also be defined in multiple ways. For example, there are seven different definitions for density in the literature (see our Qualitative analysis). The diverse ways to address, control, manipulate, and analyze physical properties make it harder to achieve a common scientific ground (Stalnaker, 2002). The lack of common ground hinders efficient comparisons of results coming from different studies that used different methods to generate stimuli.

Hereafter, we refer to the generation methods used to create non-symbolic numerical stimuli as GMs. A GM is defined as a unique algorithm for creating visual non-symbolic stimuli according to predefined numerical and physical parameters. Different GMs use different experimental controls and manipulations that generate different correlations between numerical and physical properties (Gebuis & Reynvoet, 2011a; Katzin et al., 2019; Salti et al., 2017). The different correlations might bias the results of studies. Smets et al. (2015) found that two different GMs suggested by Piazza and Dehaene and by Gebuis and Reynvoet led to different Weber fractions and performance accuracy (Gebuis & Reynvoet, 2011a; Piazza et al., 2004). Clayton et al. (2015) also found that two different GMs led to different accuracy and congruity effects. When the stimuli were created using Gebuis and Reynvoet’s (2011b) GM, participants exhibited the canonical congruency effect, while the Panamath GM (Halberda et al., 2008) led to the opposite congruency effect: with a stimuli set created by the Panamath GM, the participants’ accuracy was higher in incongruent trials than in congruent trials. When Clayton et al.
(2015) compared the stimuli produced by Gebuis and Reynvoet’s (2011a) GM to Panamath GM stimuli (Halberda et al., 2008), they discovered that the relationships between the arrays’ convex hull area, total surface area, and quantity differed between the two stimuli sets. In sum, the ability to compare results from experiments whose stimuli were generated by different GMs is limited.

The variety of GMs also limits the ability to examine previous studies through new theoretical prisms. For example, Yousif and Keil (2019) suggested that arrays’ additive area has a role in numerical judgments. Previous studies have not recorded data on arrays’ additive area, nor have they provided their stimuli. Therefore, it is impossible to test this idea retrospectively because there is no way to extract the relevant information from previous studies. Another example comes from our own experience. We recently suggested that the shape of the convex hull is important for numerical judgments (Katzin et al., 2019; Shilat et al., 2021). Previous studies have not recorded data on the shape of the convex hull and have not provided their stimuli. Therefore, it is impossible to test this idea retrospectively. Science evolves and new ideas constantly emerge. To prioritize new ideas, it is best to test them using diverse methods. Testing new ideas on previously published data could aid future scientists and enable more rapid progress.

The current review

The purpose of the current review is to create common ground for future studies and to provide tools to test new ideas on already existing data. This review starts by describing the current state of the field in terms of a common language, followed by mapping the different controls employed by the various GMs. For each GM we examined how the physical properties were defined and how they were controlled.
This was followed by a quantitative analysis of the frequency distribution of these different controls. Describing and mapping the field provided a data-driven review of the different GMs.

Methods

Procedure overview

We studied GMs designed to control the physical properties of non-symbolic numerical stimuli. We started by identifying and collecting the different GMs. This provided qualitative raw data on the different ways to control physical properties. Human annotators tagged the different controls in each of the identified methods, making the qualitative data suitable for quantitative analysis of the distributions and trends in the GMs’ control over physical properties. Figure 2 presents an overview of the structure and pipeline of this review.

Figure 2
The Structure and Pipeline of this Review
Note. The workflow of constructing the current review.

We focused on GMs published between the years 2000 and 2021. Focusing on these years allowed a broad and exhaustive look on one hand, and on the other, narrowed it to relevant GMs used in contemporary research. Figure 3 reflects the growing number of relevant publications related to GMs since the year 2000. Because the term “Generation Methods” is not commonly used, we ran a search for the term “non-symbolic numerical stimuli” in Semantic Scholar (semanticscholar.org). Figure 3 presents the results of this search.

Figure 3
The Number of Publications Related to GMs has Increased Throughout the Years
Note. The figure displays the number of publications related to the GMs field between the years 1990 and 2021. The number of publications related to GMs has constantly grown over the years. Publication year linearly predicted the number of publications related to GMs (see Supplementary Information). Although this review deals with GMs from the year 2000, we wanted to provide a wider scope of results.
The results were obtained by running the query “non-symbolic numerical stimuli” in Semantic Scholar. The x-axis presents the article’s publication year. The y-axis presents the number of publications. The year 2000 is marked with a gray arrow. The blue line denotes a regression fit line predicting the number of numerical cognition publications according to their publication year. The gray shaded area denotes a 95% confidence interval for the regression line.

GMs database

The identification of the reviewed GMs list (see Figure 2) started by collecting a list of initial methods based on our knowledge of the literature (N = 9). The initial search was followed by a search for additional GMs (N = 13). We used the names of the publications in the initial methods list as input for searching for additional methods. The additional GMs search was conducted in three different ways, providing data on previous and/or follow-up GMs: (1) previous GMs were gathered by searching within the references of each of the identified GMs publications; (2) follow-up GMs were found by performing a search for the citations of the nine initial GMs via the Google Scholar academic search engine (scholar.google.com); (3) previous and follow-up GMs were searched via the Connected Papers platform (connectedpapers.com), a free web tool that builds a visual network graph of the papers related to the “origin paper” that was used as the search query input. The network graph is created by aggregating similar articles according to their semantic similarity and overlapping citations (Kaur et al., 2022). The Connected Papers platform provides data on prior and future (i.e., “derivative”) works related to the “origin paper”. The graph’s data is retrieved from Semantic Scholar, a free semantic search engine for multiple academic disciplines (Fricke, 2018; Gusenbauer, 2019; Gusenbauer & Haddaway, 2020; Jones, 2015). In total, our GMs database included the twenty-two GMs that we identified (9 initial and 13 additional).
Identifying the different controls

Different GMs were designed within different theoretical and methodological frameworks. Accordingly, they approach the problem of property control from different perspectives. The complexity of the different GMs makes it very hard to create a natural language processing (NLP) algorithm that can be used to tag the properties controlled. Moreover, the diversity of contexts in which the data can be tagged makes it impossible to create an a priori set of overarching general tagging rules and requires manual tagging. Therefore, human annotators manually tagged the different controls used by the GMs. To keep our tagging process as data-driven as possible, we designed a straightforward and parsimonious annotation scheme comprising three steps (see Figure 4). First, annotators searched for physical properties that were stated by the authors as the properties they controlled for. Second, annotators marked the property definitions in the GM articles and tested their consistency throughout the text. A definition was tagged as inconsistent if there was a discrepancy between the property definition within the manuscript and the way this property was controlled for. We did not impose previous definitions on the identified controls and relied on the authors’ property definitions whenever they were available. Each annotator independently read all GM manuscripts, and both compiled an initial list of possible controls. In the three cases in which the annotators disagreed, they discussed the issue and reached a joint decision. Whenever the annotators identified a new control that did not appear in the initial list, they added it to the tagging list if and only if the following two criteria were met: (1) they could not find an equivalent to the given property under a different name, and (2) the property’s synonyms were not found under a thesaurus entry (Danner, 2014).
The results of the tagging process are available online (see Supplementary Information).

Figure 4
Annotation Scheme for Controlled Properties
Note. The annotation scheme was used for identifying and tagging the experimental control used by the different GMs. It was designed in a data-driven manner, so that the GMs’ control was tagged based on the publications themselves and not on a priori assumptions. The process started by identifying possible properties that were controlled by the authors; first, by searching the manuscripts for explicit statements on properties controlled, and then making sure these properties were defined. Then, the annotators searched and tagged the control types used by the GMs and the properties they controlled for.

Control types. The close relationships between physical and numerical properties in non-symbolic stimuli cannot be overlooked because using different controls is associated with different behavioral and neuronal responses (Clayton et al., 2015; DeWind & Brannon, 2016; Gebuis & Reynvoet, 2011a, 2011b; Kuzmina & Malykh, 2022; Piazza et al., 2004; Salti et al., 2017; Smets et al., 2015). Accordingly, experiments must be designed so that certain properties (e.g., total surface area) do not predict other properties (usually, numerosity). There is also a need to control physical properties such that they have similar discriminability. The methods to control these factors will be denoted as Control Types (N = 5). Because numerical comparison stimuli comprise two arrays, we divided the tagged control types into Between Arrays Controls (N = 3), defined as a control on both arrays, and Within Arrays Controls (N = 2), defined as a control for each of the separate arrays. We discuss the different aspects of the different control types in Table 1.
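The requirement that one property should not predict another can be illustrated with a toy trial list. This is a minimal sketch under our own assumptions, not the procedure of any reviewed GM: each trial is reduced to two signs (which array is more numerous, and which has the larger total surface area), the trial counts are hypothetical, and `pearson` is a helper written here for self-containment.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def build_trials(n_congruent, n_incongruent):
    """Return (numerosity_sign, area_sign) pairs.

    +1 means the left array has the larger value, -1 the right array.
    Counts are per side, so both sides are more numerous equally often.
    """
    trials = []
    for numerosity_sign in (+1, -1):
        trials += [(numerosity_sign, numerosity_sign)] * n_congruent
        trials += [(numerosity_sign, -numerosity_sign)] * n_incongruent
    return trials


def area_numerosity_correlation(trials):
    numerosity = [t[0] for t in trials]
    area = [t[1] for t in trials]
    return pearson(numerosity, area)


# Hypothetical designs: equal vs. unequal proportions of congruent trials.
balanced = build_trials(n_congruent=10, n_incongruent=10)
unbalanced = build_trials(n_congruent=15, n_incongruent=5)

print("balanced:", area_numerosity_correlation(balanced))
print("unbalanced:", area_numerosity_correlation(unbalanced))
```

With equal proportions of congruent and incongruent trials the sign of the area difference carries no information about numerosity; once congruent trials dominate, area becomes a usable cue.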
Table 1
Control Types

Between arrays control

Congruency
Definition: Two arrays are congruent when they display a high degree of perceptual similarity in one or more of their dimensions (Egner, 2007).
Rationale: Congruency is used to prevent the observers from predicting one property using another one, mainly using a physical magnitude to predict numerosity.
Example: A congruent condition in which the more numerous array also covers a larger area, and an incongruent condition in which the more numerous array covers a smaller area. When congruent and incongruent stimuli appear in equal proportions, the two dimensions (e.g., total area and numerosity) cannot predict one another.

Saliency
Definition: Two stimuli are controlled for saliency when the ratios of the controlled properties are similar-to-equal. A property has a higher degree of saliency when the ratio of this property in both arrays is considerably lower than the ratios of other properties.
Rationale: A salient property may be used by the observers to perform the task even when it is not relevant to the task at hand (Salti et al., 2017).
Example: Two arrays, one with three dots and the other with four dots. Also, the item surface area of the more numerous array is ten times larger than the less numerous one. The items’ surface area is considerably more salient than numerosity as its ratio is much higher. Equalizing the numerical ratio to the items’ surface area ratios to a ratio of 1/10 instead of 3/4 will match the saliency of these properties.

Shapes heterogeneity between
Definition: Two array shapes are heterogenous-between when the array’s items’ shapes vary between the two arrays. Two array shapes are homogenous-between when both arrays are comprised of the same shapes.
Rationale: When the items’ shapes vary between the arrays, the arrays are more dissociable. Furthermore, it is harder for the observers to compute some of the arrays’ physical properties using the same algorithm, and observers must use different algorithms to compute stimuli properties.
Example: Two arrays. One is comprised of triangles and the other is comprised of crescents (Sophian & Chu, 2008).

Within arrays control

Shapes heterogeneity within
Definition: An array is heterogeneous within shapes when its items are in different shapes. An array is homogenous within shapes when it is comprised of items of the same shape.
Rationale: A shape heterogeneous array can make it harder to compute some of the stimuli properties to predict numerosity. In contrast, a homogenous array will make it easier to compute the total or average items’ sizes and circumferences and use it to predict numerosity when compared to items that vary in their shape (Aulet & Lourenco, 2021).
Example: A heterogenous within shapes array will be comprised of dots, triangles, and crescents. In contrast, in a homogenous within shapes array all items will be dots.

Sizes heterogeneity within
Definition: An array is heterogenous within sizes when its elements vary in their sizes and is homogenous within sizes when its elements are the same size.
Rationale: Using size heterogenous within arrays eliminates some of the associations between items’ individual and collective sizes to numerosity and avoids related confounds (Gebuis & Reynvoet, 2011a; Guillaume et al., 2020; Marchant et al., 2013). When the items are size heterogeneous within, it is harder for participants to assess the average items’ area and to use it for numerical decisions. In contrast, when all items are homogenous, it increases some of the natural correlations between quantity and physical properties related to the items’ spacing, sizes, and the items’ coverage of the display (Guillaume et al., 2020; Rousselle et al., 2004; but see also DeWind et al., 2015).
Example: In a heterogenous within sizes array the items in the same array have a different surface area, but two arrays of different quantities can have the same total surface area. In contrast, in a homogenous within sizes array the items in the same array have the same surface area, and two arrays of different quantities will have a different total surface area.

Note. A list of the tagged control types. The control types were divided according to within arrays control and between arrays control.

Controlled properties. The final controlled properties list included thirteen properties. Table 2 details all the physical properties, alternative terms, and definitions. The different properties were grouped according to the previously suggested taxonomy of intrinsic and extrinsic properties (Gebuis & Reynvoet, 2011a; Salti et al., 2017). Notably, Piazza et al. (2004) also discussed intrinsic and extrinsic properties but used the terms intensive and extensive parameters. Ultimately, we used Shilat et al.’s (2021) definition for the intrinsic-extrinsic taxonomy. Accordingly, intrinsic properties describe information extracted from individual array items. In contrast, extrinsic properties describe information extracted from the array as a whole (Shilat et al., 2021; similar to Xu & Spelke, 2000). Intrinsic and extrinsic properties can be relatively independent of one another, especially when using different-sized items. For example, increasing the extrinsic property of the array’s convex hull area by increasing the distance between the items will not affect the intrinsic property of the array’s total surface area, as the two are independent. The intrinsic-extrinsic distinction reflects refined nuances of stimuli control, indicating an increased focus of a GM on methodological issues.
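The independence of an extrinsic property (convex hull area) from an intrinsic one (total surface area) can be sketched in code. The example below rests on several assumptions of ours: dots are hypothetical `(x, y, radius)` triples, the hull is computed over dot centers only (ignoring the dots' radii), and the hull routine is a standard monotone-chain implementation rather than the method of any reviewed GM.

```python
import math


def convex_hull(points):
    """Monotone-chain convex hull; returns vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Shoelace formula; equivalent to summing the areas of a triangle fan."""
    total = 0.0
    for i, (x1, y1) in enumerate(vertices):
        x2, y2 = vertices[(i + 1) % len(vertices)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


def convex_hull_area(dots):
    """Extrinsic property: hull area of the dot centers (radii ignored)."""
    return polygon_area(convex_hull([(x, y) for x, y, _ in dots]))


def total_surface_area(dots):
    """Intrinsic property: summed area of the individual dots."""
    return sum(math.pi * r ** 2 for _, _, r in dots)


def spread(dots, factor):
    """Move dot centers apart by `factor` without changing dot sizes."""
    return [(x * factor, y * factor, r) for x, y, r in dots]


# Hypothetical array: (x, y, radius) in arbitrary units.
dots = [(0, 0, 0.3), (4, 0, 0.5), (4, 3, 0.2), (0, 3, 0.4), (2, 1, 0.3)]
spread_dots = spread(dots, 2)

print(convex_hull_area(dots), convex_hull_area(spread_dots))
print(total_surface_area(dots), total_surface_area(spread_dots))
```

Doubling the distances between centers quadruples the hull area while the total surface area stays the same, which is the sense in which the two properties can be manipulated separately.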
Table 2
Physical Property Definitions and Alternative Terms

Intrinsic properties

Item surface area
Formula: $\frac{\sum_{i=1}^{n} \pi r_i^2}{n}$
Definition: Individual items’ average surface area.
Alternative terms: Item size, surface individual, average dots size, average surface area.

Total surface area
Formula: $\sum_{i=1}^{n} \pi r_i^2$
Definition: The sum of the items’ surface areas.
Alternative terms: True area, total filled area, summed continuous extended, cumulated surface area, aggregate surface, total occupied area, cumulative area, overall surface.

Luminance
Formula: $\frac{cd}{m^2}$
Definition: The items’ brightness or intensity relative to the background.
Alternative terms: Contrast, total brightness.
Note: The GMs inconsistently defined luminance and referred to it in different manners. Some of the GMs consider luminance homologous to the total surface area (Guillaume et al., 2020; Piazza et al., 2004; Rousselle et al., 2004; Soltész et al., 2010; Yousif & Keil, 2019). Another portion of the GMs referred to luminance in a more nuanced and complex manner (Dakin et al., 2011; Lourenco et al., 2012; Ross, 2003). When using items of the same color and a homogenous background, luminance can be operationally defined and calculated in the same manner as the total surface area, for example, when using black dots with a white background. Nevertheless, luminance has a different theoretical meaning than the total surface area (Kadosh et al., 2008; Mareschal & Baker, 1998; Pinel et al., 2004).

Average diameter
Formula: $\frac{\sum_{i=1}^{n} 2 r_i}{n}$
Definition: Individual items’ average diameter.
Note: Average radius ($\frac{\sum_{i=1}^{n} r_i}{n}$) was considered homologous to average diameter.

Total circumference
Formula: $\sum_{i=1}^{n} 2 \pi r_i$
Definition: The sum of the items’ circumferences.
Alternative terms: Contour length, sum of the items’ perimeters.

Additive area
Formula: $\sum_{i=1}^{n} (2 r_i + 2 r_i)$
Definition: The sum of the items’ dimensions (Yousif & Keil, 2019).

Inter distance
Formula: $\frac{\sum_{i=1}^{n-1} \min\left(\sqrt{(x_{i+1} - x_i)^2 + (y_{i+1} - y_i)^2}\right)}{n-1}$
Definition: The average distance between dot centers, calculated as the average of the shortest open path connecting all the array’s dots.
Alternative terms: Inter-item spacing.

Extrinsic properties

Density
Definition: There are several definitions (see Qualitative analysis 2nd section).
Alternative terms: Coverage, element spacing.

Open space
Definition: Similar to density but was inconsistently defined (see Qualitative analysis 2nd section). Also, open space has a different theoretical meaning than density but might be considered as a proxy of density (Sophian & Chu, 2008).

Apparent closeness
Definition: The stimulus overall scaling. Increasing the apparent closeness is equivalent to zooming in on a stimulus, such that it subtends a larger visual angle without changing its relative proportions (DeWind et al., 2015).

Convex hull area
Definition: The area of the smallest convex polygon that contains all objects in the array.
Alternative terms: Area extended, total envelope, field area, global occupied field.
Note: Convex hull area can be calculated by dividing the polygon into triangles, calculating their areas, and summing them.

Average occupancy
Formula: $\frac{\text{Convex hull}}{n}$
Definition: The average space individual dots sustain within and around their physical size.
Alternative terms: Sparsity, average field area per item, the inverse of the density.

Spatial frequency
Definition: The number of spatial cycles of a visual event (such as an object or color code) within a given image area, usually measured in pixels (Boreman, 2021; De Valois & Switkes, 1980; Efford, 2000).

Note. The table lists the tagged physical properties’ textual definitions and the properties’ alternative terms, divided according to the taxonomy of intrinsic and extrinsic properties. Whenever applicable, we also provide the formula defining each property. As most of the GMs use dots, the relevant equations refer to circles. Here, $n$ denotes the number of dots, $r$ denotes the dot radii, $i$ denotes the dot index, $cd$ denotes candela units (Trezona, 2000), $m$ denotes meters, and $x$ and $y$ denote the Cartesian coordinates of the dot centers. We referred to specific GMs if the GMs used a unique definition of a property or if a property appeared only in one GM. The properties are ordered so that similar properties that rely on the same variable or constant are grouped together.

Results

Qualitative analysis

Generation methods list analysis. Figure 5 provides an overview of the different GMs (N = 22), the properties they controlled (N = 13), and the ways the properties were controlled (i.e., control types, N = 5). The properties were divided according to the intrinsic-extrinsic distinction. We named the GMs according to their respective publication authors. Some of the GM publications were co-authored by authors participating in more than one publication.

Provided data. To fully understand the results obtained using different GMs there is a need to understand the difference between the different stimuli sets and compare them in the context of a theoretical perspective. To gain knowledge regarding stimuli reproduction and the comparability of the different GMs, we tagged the different types of data supplied by the GMs. We found three types of data-sharing in the literature providing knowledge of the stimuli sets and GMs: (1) methods that shared their stimuli sets as pictures, such that they could be reanalyzed after publication (e.g., Shilat et al., 2021, which is not a GM); (2) methods that provided a way to reproduce their stimuli by providing a software package or graphical user interface (GUI) that enables stimuli reproduction (e.g., Guillaume et al., 2020); and (3) methods that shared a detailed report on the stimuli features (such as the ratio of physical properties in each picture) but did not share the pictures themselves (Yousif & Keil, 2019).
We found that none of the GMs shared their stimuli as pictures. In addition, only one GM provided detailed data without a software package or GUI to reproduce the stimuli. Therefore, we merged these different tags into a single tag and named it reproducibility. Figure 5 displays the status of the GMs’ reproducibility (N = 12).

Control Type is not related to the controlled properties. We found that the types of controls used by the GMs and the controlled properties had little to no dependency on one another. For example, a GM can control for the congruency of the average diameter to the arrays’ numerosity. However, the same method can control the stimuli saliency by imposing it on other physical properties, such that the arrays’ convex hull area will be equated to their numerosity. Any type of control the GMs impose on the different physical properties does not entail the use of other control types.

Figure 5
List of All GMs, their Control Types, and Controlled Properties
Note. The figure presents a list of all GMs and the results of tagging their control types, controlled properties, and reproducibility. The GMs controlled a wide variety of properties and a wide variety of property combinations. Each row depicts a different GM (N = 22). The GMs are ordered according to their respective publication year and alphabetical order. The rightmost column presents data on the GMs’ reproducibility, marked with an open lock. The rest of the columns present data on the GMs’ controls, marked using circles. The controls were divided into control types (N = 5) and controlled properties (N = 13). Whenever a control type was used, a blue circle appears. The controlled properties were divided into intrinsic and extrinsic properties and are marked in dark and light pink circles, respectively.

Generation methods properties definitions and terminology.

Properties definitions, and the case of density.
The definitions of the different properties are inconsistent or nonexistent. Density, for example, is inconsistently defined in the literature (Dakin et al., 2011; De Marco & Cutini, 2020). We chose to focus on density because it is the extrinsic property most of the GMs have attempted to control for (~68% of the GMs, 15/22; see Table 3). The different definitions for density rely on a combination of three characteristics of the non-symbolic array: (1) the items’ number; (2) the area on which the items are scattered; and (3) the items’ distances. Notably, density was not defined in ~33.3% (5/15) of the GMs that stated that they controlled for it (Halberda et al., 2008; Halberda & Feigenson, 2008; Huntley-Fenner & Cannon, 2000; Piazza et al., 2004; Rousselle et al., 2004). Importantly, there were inconsistencies in definition, even within a manuscript. Thirty percent (3/10) of the GMs that stated that they controlled for density and also provided a definition of it were not consistent in the way it was calculated. For example, Sophian and Chu (2008) discussed two definitions of density. The first definition is based on the total surface area. The second definition is based on individual items. However, they eventually controlled for a third definition, namely, the amount of open space in the array. We could not find a concrete definition of the term open space; it appears to refer to the aggregated space unoccupied by the array’s items. Open space might be considered as a proxy of density. Open space was manipulated by using different array configurations or different groupings that provide different levels of open space. The GMs that stated that they controlled for density but did not define it, or referred to it inconsistently within a paper, were among the earlier GMs. In later years, more of the GMs that stated that they controlled for density also defined it.
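To make concrete why the choice of definition matters, the sketch below computes density for two hypothetical arrays under three ratio-style definitions built from the characteristics named above (item number, the area over which items are scattered, and total surface area). The arrays, the field names, and the function names are illustrative assumptions, not values or code from any reviewed GM.

```python
import math
from dataclasses import dataclass


@dataclass
class Array:
    """A dot array summarized by the quantities the definitions rely on."""
    radii: list               # dot radii, arbitrary units
    display_area: float       # area in which the dots could be scattered
    convex_hull_area: float   # area of the hull enclosing the dots

    @property
    def number(self):
        return len(self.radii)

    @property
    def total_surface_area(self):
        return sum(math.pi * r ** 2 for r in self.radii)


def density_number_per_display(a):
    return a.number / a.display_area


def density_number_per_hull(a):
    return a.number / a.convex_hull_area


def density_surface_per_hull(a):
    return a.total_surface_area / a.convex_hull_area


# Two hypothetical arrays sharing the same display and hull areas.
few_large = Array(radii=[1.0] * 10, display_area=400.0, convex_hull_area=200.0)
many_small = Array(radii=[0.5] * 20, display_area=400.0, convex_hull_area=200.0)

for name, fn in [
    ("number / display area", density_number_per_display),
    ("number / convex hull area", density_number_per_hull),
    ("total surface area / convex hull area", density_surface_per_hull),
]:
    print(f"{name}: few_large={fn(few_large):.4f}, many_small={fn(many_small):.4f}")
```

Under the number-based definitions the array of many small dots is the denser one, whereas under the surface-based definition the array of few large dots is denser, so two studies "controlling density" may be equating different things.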
Table 3
Different Density Definitions of the GMs that Controlled Density

Generation Method | Control Statement | Definition Availability | Definition Consistency | Definition
Huntley-Fenner & Cannon, 2000 | ✓ | ✗ | ✗ |
Piazza et al., 2004 | ✓ | ✗ | ✗ |
Rousselle et al., 2004 | ✓ | ✗ | ✗ |
Halberda et al., 2008 | ✓ | ✗ | ✗ |
Halberda & Feigenson, 2008 | ✓ | ✗ | ✗ |
Ross, 2003 | ✓ | ✓ | ✗ | Number / Display area, when the display area is defined as the area in which the items can be scattered.
Dakin et al., 2011 | ✓ | ✓ | ✗ | The authors regarded density as element spacing but have not defined it. Otherwise, defined density as Number / Convex hull area.
Sophian & Chu, 2008 | ✓ | ✓ | ✗ | Total surface area / Display area. Also defined density as the amount of space individual items occupy. Eventually, they controlled for the open space in the array, a proxy of density (see above).
Guillaume et al., 2020 | ✓ | ✓ | ✓ |
Zanon et al., 2021 | ✓ | ✓ | ✓ |
Gebuis & Reynvoet, 2011a | ✓ | ✓ | ✓ | Total surface area / Convex hull area
Gebuis & Reynvoet, 2011b | ✓ | ✓ | ✓ | Total surface area / Convex hull area
Salti et al., 2017 | ✓ | ✓ | ✓ |
De Marco & Cutini, 2020 | ✓ | ✓ | ✓ |
DeWind et al., 2015 | ✓ | ✓ | ✓ | Item size √ Spacing. Spacing was defined according to the distance between a fixed number of items. However, in future work DeWind and Brannon (2019) defined density as Number / Convex hull area.

Note. The table displays an analysis of the different definitions for density. Eight out of 15 GMs (~53%) that stated that they controlled density have inconsistently defined density or have not defined density at all. All GMs in the current table have stated that they controlled density. The second column from the left presents the existence of a density control statement. The third column denotes the availability of a definition for density in the text. The fourth column presents the consistency of the definition of density throughout the text. Trivially, if density was not defined, its definition could not be consistent. A definition was marked inconsistent if the annotators found a discrepancy between the definition of density and the actual way it was controlled for.
The seven different definitions for density appear in the last column. A green checkmark marks a condition encoded as true: the annotators found a control statement on density, found a definition of density, or found the definition consistent. Whenever one of these conditions was not met, it was marked with a red x-mark. The GMs are ordered according to the following categories: (1) GMs that have not defined density; (2) GMs that have inconsistently regarded density; and (3) GMs that have properly defined density. Within each category the GMs are arranged in the following order: (1) definitions of density based on the items’ number; (2) definitions of density relying on the convex hull area; and (3) definitions of density relying on the items’ distances. Otherwise, the GMs are ordered chronologically.

Inconsistent terminology. We found that the 13 controlled properties were referred to by a total of 35 alternative terms (Table 2). The same property might be referred to using synonyms that share the same meaning, with some more similar than others (Danner, 2014; Lea, 2008). A property can also be referred to by different homologous terms, although these terms are not straightforward synonyms for the same property. The annotators reviewed the GM text again and discussed whether these terms refer to properties that already exist in our properties list. Total circumference provides an example of a property that could be replaced by various synonyms. For instance, the word “circumference” can be replaced with the word “perimeter” or any other synonym. The word “total” can be replaced with “sum” or any other synonym or combination of these synonyms.
We found that many GMs used different synonyms for the total circumference (De Marco & Cutini, 2020; DeWind et al., 2015; Halberda & Feigenson, 2008; Lyons et al., 2014; Price et al., 2012; Rousselle & Noël, 2008; Rousselle et al., 2004; Salti et al., 2017; Yousif & Keil, 2019; Zanon et al., 2021). Some properties were referred to by terms that are not direct synonyms and can lead the reader in a different theoretical direction. For instance, the term “area extended”, used as an alternative term for convex hull area, can be misinterpreted as related to the items’ surface area. For some terms, it was hard to know whether the different authors referred to the same property. For example, some used the term “contour length” as homologous to the term “total circumference” (DeWind et al., 2015; Gebuis & Reynvoet, 2011a, 2011b; Halberda & Feigenson, 2008; Rousselle & Noël, 2008; Rousselle et al., 2004; Soltész et al., 2010; Sophian & Chu, 2008; Yousif & Keil, 2019; Zanon et al., 2021). The situation was even less clear for extrinsic properties that do not depend on the items’ radii. For example, the convex hull area was also referred to using the term “field area” (DeWind et al., 2015), “area extended” (Gebuis & Reynvoet, 2011a), or “global occupied” area. The convex hull area was also referred to using the term “total envelope” (Halberda & Feigenson, 2008), which is also used by Soltész et al. (2010) to refer to the items’ “total circumference”.

Quantitative analysis
Stimuli reproducibility.
None of the GMs have provided their stimuli. We counted the number of methods that provided their stimuli, a way to reproduce the stimuli, or a detailed report on the stimuli’s features. Ten of the 22 GMs provided a way to reproduce their stimuli. Only one method provided detailed data on its stimuli but did not provide a way to reproduce its stimuli (Guillaume et al., 2020).
The chance that a GM will provide options for stimuli reproducibility is higher in recent years than in the earlier years of this review. We found that the year of publication predicts 43% of the variance in the proportion of publications that provided means for stimuli reproducibility, F(1, 11) = 10.129, p = .009, r = 0.692, 95% CI: [0.229, 0.9], with moderate-to-strong Bayesian evidence supporting this effect, BF10 = 7.566.

Controlled properties.
The number of controlled properties. The average number of controlled properties by each GM was 3.227, SEM = 0.271. The number of controlled intrinsic properties (MeanIntrinsic = 2.318, SEM = 0.179) was higher than the number of controlled extrinsic properties (MeanExtrinsic = 0.909, SEM = 0.236). The diversity in the number of controlled properties was calculated using Gini-Simpson's index of diversity. The Gini-Simpson's index of diversity is usually denoted using D or G but for clarity will be denoted here using the notation Diversity-index (Keylock, 2005; Lande, 1996; Simpson, 1949; Tran et al., 2021). In the current context, Diversity-index measures the probability that two GMs will control a different number of properties. The higher the Gini-Simpson's index of diversity, the higher the probability that the two compared GMs will control a different number of properties. There was medium-to-high diversity, with Diversity-index = .757 in the number of controlled properties.

Property control has increased throughout the years. As seen in Figure 6, throughout the years the average number of controlled properties has constantly increased. Publication year explains 39.7% of the variance in the number of controlled for properties, F(21) = 14.828, p < .001, r = 0.652, 95% CI: [0.319, 0.942], with strong Bayesian evidence supporting this effect, BF10 = 30.344. As the number of years increases, the number of controlled intrinsic properties increases, but the number of extrinsic properties does not increase.
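The Diversity-index described above can be sketched in a few lines of Python. The sketch assumes the textbook form of the Gini-Simpson index, one minus the sum of squared category proportions; the text does not state whether this form or the without-replacement variant was used, so both are shown. The function names and the example counts are hypothetical and are not the review's data.

```python
from collections import Counter


def gini_simpson(values):
    """Probability that two draws (with replacement) fall in different categories."""
    n = len(values)
    counts = Counter(values)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_simpson_unbiased(values):
    """Same idea, for two distinct items drawn without replacement."""
    n = len(values)
    counts = Counter(values)
    return 1.0 - sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


# Hypothetical example: number of controlled properties for six made-up GMs.
n_controlled = [2, 2, 3, 3, 4, 5]
d_with = gini_simpson(n_controlled)              # 1 - 10/36
d_without = gini_simpson_unbiased(n_controlled)  # 1 - 4/30
print(round(d_with, 3), round(d_without, 3))
```

With small samples such as 22 GMs the two variants can differ noticeably, which is one reason to report which one was computed.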
Publication year explains 16.5% of the variance in the number of controlled intrinsic properties, F(1, 21) = 5.137, p = .034, r = 0.452, 95% CI: [0.038, 0.734], with weak-to-moderate Bayesian evidence supporting this effect, BF10 = 2.144. In contrast, publication year did not explain the number of controlled extrinsic properties, r = 0.311, 95% CI: [-0.128, 0.647], p = .159. There was not enough Bayesian support for a null effect, BF01 = 0.812. Therefore, the increase in the number of controlled properties is driven by the increase in the control of intrinsic properties and is not affected by a change in the control of extrinsic properties. There was also an increase in the diversity of the properties controlled by GMs. When comparing the Gini-Simpson's index of the diversity of the GMs before and after 2011, Diversity-indexBefore = .666 and Diversity-indexAfter = .833, the diversity of the number of controlled properties increased by a factor of 1.25 after 2011.

Figure 6
The Number of Controlled Properties Throughout the Years
Note. The figure depicts the number of controlled properties (y-axis) as a function of the year in which the GMs were published. The number of controlled properties has constantly grown over the years. Each black circle depicts one GM. If two GMs were published in the same year and controlled for the same number of properties, the circles are overlaid on top of each other. For example, in 2012 two GMs (Lourenco et al., 2012; Price et al., 2012) controlled for two properties and they are represented by one circle only. Linear regression shows that the number of controlled properties has constantly increased throughout the years, r = 0.652, 95% CI: [0.319, 0.942], p < .001, BF10 = 30.344, YProperties = 0.652Xyear – 274.915, R2 = 0.397. The blue line denotes the regression fit line, and the SEM is represented by the shaded gray curve.

Total surface area is the most common controlled property.
The different methods have controlled for 13 intrinsic and extrinsic properties (see Table 2). Figure 7 displays the relative frequency of GMs that controlled specific properties. No single property was controlled by all methods. The average relative frequency of all controlled properties was ~25%, SEM = 6.274. We measured the asymmetry of the GMs’ controlled properties distribution by calculating the distribution’s skewness. Skewness can be defined as a measure of the asymmetry of a distribution of a random variable around its mean (Groeneveld & Meeden, 1984; MacGillivray, 1986; Pearson, 1895). Ordering the distributions from the largest frequency to the smallest frequency results in a high positive skewness level, Skewness = 1.142, SEskewness = 0.616. The high level of positive skewness indicates that only a small number of physical properties were controlled for in most GMs and most properties have a low probability of being controlled for. Notably, the total surface area was controlled in 77% of the GMs, and the item surface area was controlled in 55% of the GMs.

Figure 7
The Relative Frequency of the Controlled Properties
Note. The figure depicts the distribution of different controlled properties. No property was controlled for by all GMs, but the total surface area was controlled for by most of the GMs. Intrinsic properties were more frequently controlled for than extrinsic properties. The list of properties is presented on the x-axis. Bars represent the relative frequency of the controlled properties between the GMs. The controlled properties were grouped into intrinsic and extrinsic properties and are marked in dark and light pink, respectively. The properties within the intrinsic and extrinsic groups were ordered from the most frequently controlled property to the least frequently controlled property. When two properties were equally controlled by the GMs, they were ordered in the figure according to the order provided in Table 2.
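The skewness statistic reported above can be illustrated with a short Python sketch. It assumes the adjusted Fisher-Pearson coefficient that common statistics packages report; the text does not name the estimator, although the usual standard-error formula evaluated at n = 13 properties gives approximately 0.616, which matches the reported SEskewness. The function names and the example frequencies are hypothetical and are not the values behind Figure 7.

```python
import math


def adjusted_skewness(values):
    """Adjusted Fisher-Pearson skewness coefficient (often written G1)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    g1 = m3 / m2 ** 1.5
    return g1 * math.sqrt(n * (n - 1)) / (n - 2)


def skewness_standard_error(n):
    """Standard error of the skewness estimate under normality, for sample size n."""
    return math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


# Hypothetical relative frequencies (in %) of eight made-up properties.
frequencies = [80, 50, 30, 20, 10, 10, 5, 5]
skew = adjusted_skewness(frequencies)
se_13 = skewness_standard_error(13)
print(round(skew, 3), round(se_13, 3))
```

A positive value, as in this made-up example, means that a few properties are controlled often while most are controlled rarely, which is the pattern the review describes.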
High diversity in the controlled properties. There was a high diversity in the properties controlled by the GMs, Diversity-index = .876, such that there is a very high probability that two different GMs will control for different properties. Importantly, the high diversity in the controlled properties persists regardless of the two most commonly controlled properties. The Gini-Simpson's index of diversity was calculated for all possible controlled properties and stayed almost the same when we removed the total surface area and item surface area, Diversity-indexWithout = .878.

All GMs controlled for intrinsic properties. The different GMs either controlled for intrinsic properties, defined at the level of individual items, or controlled for extrinsic properties, defined at the level of the whole array. All GMs controlled for at least one intrinsic property. Half of the methods only controlled for intrinsic properties and the other half controlled for both intrinsic and extrinsic properties. Notably, not a single GM controlled for only extrinsic properties.

Low-to-medium similarity in some of the controlled properties. As seen in Figure 7, the two most commonly controlled properties were intrinsic properties: total surface area and item surface area were controlled for in 77% and 54% of the methods, respectively. Examining the union and intersection of the GMs that controlled for the total surface area and the items’ surface area, or only one of these properties, shows that approximately 36% of the GMs controlled for both properties, and 95% controlled for at least one of them. Therefore, there is a medium similarity in the controlled intrinsic properties. The most frequently controlled extrinsic properties were density and convex hull area. Only 18% of the methods controlled both the items’ density and convex hull area. Therefore, there is almost no similarity among GMs in the controlled extrinsic properties.

Control types.
Most GMs used congruency, saliency, and sizes heterogeneity within as control types. The proportion of the methods that used various control types (see Methods section) is displayed in Figure 8. No single control type was used by all methods. The average relative frequency of the different control types was approximately 46%, SEM = 13.667. Approximately 60–80% of the GMs used the same three control types: congruency, sizes heterogeneity within, and saliency. Fifty-nine percent of the GMs used congruency. Sizes heterogeneity within was controlled by 68% of the methods. Saliency was controlled by 77% of the GMs. Notably, 91% of the methods used at least one of these control types, and 82% of the methods used a combination of these control types. Therefore, there is a high similarity in the control types used by the different methods.

Figure 8
GMs’ Use of Various Control Types
Note. The figure depicts the relative frequency of use of different control types by the different GMs. Ninety-one percent of the GMs used at least one of the following control types: congruency, saliency, and sizes heterogeneity within. The x-axis presents the list of control types. The y-axis presents the relative frequency of the GMs that used these control types. The control types are ordered from the most frequently used to the least frequently used control type.

GMs employ a stable number of control types throughout the years. Throughout the years the average number of control types employed by GMs has not significantly increased, p = .863, with weak evidence for the null hypothesis, BF01 = 2.573. The difference between the average number of control types used by the GMs before and after 2011 was calculated using the Mann-Whitney U test. The Mann-Whitney U test (also known as the Wilcoxon rank sum test) is a non-parametric test for the difference between two independent samples, denoted using the statistic U (Mann & Whitney, 1947; McKnight & Najab, 2010; Wilcoxon, 1945).
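As a rough illustration of how the U statistic is obtained, the Python sketch below ranks the pooled observations (tied values receive the average of their ranks) and converts the rank sum of one sample into U. It computes the statistic only, not the p value or the Bayes factor, and the function names and example counts are hypothetical rather than the review's data.

```python
def midranks(values):
    """1-based ranks; tied values share the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b); the two statistics always sum to len(a) * len(b)."""
    n_a, n_b = len(sample_a), len(sample_b)
    ranks = midranks(list(sample_a) + list(sample_b))
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2.0
    u_b = n_a * n_b - u_a
    return u_a, u_b


# Hypothetical numbers of control types for made-up GMs before and after a cutoff year.
before = [1, 2, 2]
after = [2, 3]
u_before, u_after = mann_whitney_u(before, after)
print(u_before, u_after)
```

Conventions differ on whether the smaller of the two values or the value for the first sample is reported as U, so it helps to state which one is meant.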
All in all, there was no significant difference in the number of control types used by the GMs before and after 2011, U(13,9) = 45, p = .353, with weak Bayesian support for the null hypothesis, BF01 = 1.962.

Discussion
The current review examined various GMs with the following objectives in mind. First, we wanted to describe the current state of the field in terms of its common language and ground. Second, we wanted to accurately map the control types and properties controlled by the different GMs. Third, we wanted to test whether the field is limited in its ability to compare studies and to examine previous studies through new theoretical prisms. Finally, we wanted to provide guidelines for GMs.

Using a combination of automatic tools and human annotators who inspected the literature, we found 22 GMs (Figure 5). These GMs used five different control types (Table 2) to control 13 physical properties (Table 2). Some of the control types can be imposed on all properties and some on only a portion of them. The control type has a low-to-no dependency on which properties were controlled for. Consequently, this leads to a variety of GMs and inevitably to different stimuli sets and results (Clayton et al., 2015; DeWind & Brannon, 2016; Gebuis & Reynvoet, 2011a; Katzin et al., 2019; Kuzmina & Malykh, 2022; Salti et al., 2017; Smets et al., 2015).

The ability to compare and replicate previous results relies on proper definitions of the controlled properties. A large portion of the GM publications lack proper definitions of the properties they chose to control for. Our analysis showed that some GMs stated that they controlled certain properties but did not define these properties. While some of the properties do not need definitions (mainly those stemming from Euclidean geometry), other properties like density or the convex hull area do need an exact definition. Another problem is that some of the properties were inconsistently defined between GMs.
Even worse, some were inconsistently defined within the same publication. The lack of consistency of property definition between GMs is more harmful than the lack of definitions, as it makes the GMs that have controlled for the property incomparable. Inconsistent property definition within an article not only makes it less replicable but also makes it impossible to understand the prism through which the researchers conducted their study and the logic underlying the GM design. All in all, the lack of proper definitions creates a lack of common language, prevents proper reproduction or reanalysis of the results, and makes GMs incomparable to one another.

On the surface, the majority of GMs have used one of a few controls. All GMs but one controlled at least one of two intrinsic properties: the individual items’ surface area or the items’ total surface area (Figures 5 and 7). Notably, the items’ individual and total surface areas are similar properties that depend on the squared radius of the items (as most of the GMs use dot-shaped items). In addition, most GMs used the same control types (Figure 8), and the types of control have not changed throughout the years. Nevertheless, although most of the GMs controlled for similar properties and used similar control types, our mapping shows that the different GMs are incomparable. This is because each GM used additional controls that were highly diverse. In fact, most properties had a low probability of being controlled for. Finally, the number and diversity of the controlled properties increased throughout the years.

There is the question of whether new ideas changed the way GMs were designed. Scientific progress occurs when scientific ideas feed the scientific domain and contribute to its expansion and development (Bird, 2000; Kuhn, 1970). Notably, throughout the years the number of controlled properties has increased (Figure 6).
The increase in the number of controlled properties provides evidence for the assimilation of new ideas and findings into the field. Furthermore, during the second decade of the reviewed period, new physical properties were added by new GMs. Yet, the high diversity of controlled properties and the low probability that two GMs will control the same properties suggest that the ideas coming from previous GMs are usually not adopted by later GMs.

It is worth focusing on two different properties that reflect opposing trends in the field. On one hand, the adoption of the convex hull area shows that new ideas can be successfully implemented into the field. In 2011, Gebuis and Reynvoet showed how the convex hull area biased numerical comparisons, and since then others have highlighted the great need to control convex hull area in numerical comparison tasks (Clayton et al., 2015; Rodríguez & Ferreira, 2023). We found that since the convex hull area was introduced to the literature, the majority of the GMs have implemented it in their design (De Marco & Cutini, 2020; DeWind & Brannon, 2019; Guillaume et al., 2020; Salti et al., 2017; Zanon et al., 2021). On the other hand, the case of density provides a different picture. Although there were attempts to control it in a substantial portion of the GMs (~59%), it was not carefully defined and was therefore poorly controlled for.

Taken together, the field shows a partial capability to improve GMs’ control and/or to assimilate new ideas. We think that the progress of the numerical cognition field will be more efficient if it is possible to compare previous ideas and test novel ideas in light of previous findings. The ability to compare ideas and test new ideas in light of previous ones relies on the elaboration and precision of the methods section. Additionally, by providing the original study stimuli along with the data, it would be possible to inspect the results through the prisms of new GMs.
However, we found that only half of the GMs provided a way to reproduce their stimuli. Moreover, none of the GMs provided their actual stimuli, let alone stimuli coupled with results.

This review inspects the GMs through the prism of the physical properties they controlled for. The choice to focus on physical properties rests on two motivations. First, physical properties are intertwined with numerosity, and therefore the initial control choices have major effects on the experimental results. Second, we identified that the choice to control certain physical properties makes studies incomparable, as it creates unique correlations between different physical properties that make the stimuli incomparable. The guiding line of this review was to conduct an inductive data-driven review, as opposed to a narrative-driven review. However, our focus on physical properties necessarily limited our span and created priors for the rest of the process.

There are other aspects that should be considered when designing a GM but were not reviewed here. Throughout this review, we considered all non-symbolic numerical comparative judgments as equal, regardless of the stimuli presentation mode. However, there is a variety of ways to present the stimuli. For instance, the two compared arrays can be displayed simultaneously or sequentially. Some have shown that when the same arrays were presented using either simultaneous, sequential, or intermixed modes of presentation, participants exhibited different patterns of reaction times, accuracy, and Weber fractions (Kuzmina & Malykh, 2022; Norris & Castronovo, 2016; Price et al., 2012; Smets et al., 2016). In an intermixed presentation mode, two arrays colored in different colors are overlaid in the same display (Halberda et al., 2008). Figure 9 presents simultaneous and intermixed displays. The results obtained using these different presentation modes are not comparable (but see Smets et al.
2016), because different presentation modes might measure different types of cognitive processes, such as serial or parallel processing (Townsend, 1990).

Figure 9
Two Stimuli Presentation Modes Involve Different Stimuli Control
Note. The stimuli can be presented using simultaneous (Panel A) and intermixed (Panel B) presentation modes. Importantly, these modes of presentation involve different control of the stimuli’s physical properties.

Top-down strategic effects and learning effects might bias participants to attend to different physical properties according to task goals (Salti et al., 2019). Specifically, using the same stimuli but changing the task goals influences the results. Changing the task from a numerical task in which the participants' goal is to choose the more numerous array, to a physical task in which the participants' goal is to choose the larger array, changes the effect of physical properties on performance (Katzin et al., 2019; Leibovich et al., 2015; Leibovich-Raveh et al., 2018; Salti et al., 2017). Avitan et al. (2022) found that changing the task goal to choose a smaller magnitude (quantity or area) instead of the larger magnitude modulated participants' performance and interacted with task type (numerical or physical). Leibovich-Raveh et al. (2018) manipulated participants' emphasis on accuracy or speed during a numerical comparison task and discovered that the effect of different physical properties was dependent on participants' emphasis on accuracy or speed.

Furthermore, the experimental block design should be considered when designing a numerical comparison experiment. Pekár and Kinder (2020) generated their stimuli using Gebuis and Reynvoet’s (2011a) GM and found increased congruency effects when the different stimuli sets were mixed in the same block in comparison to when the different stimuli sets were displayed in different blocks. Tokita and Ishiguchi (2010) found that the effect of physical properties was modulated by practice.
Thus, different block designs and amounts of practice induce different effects and should be carefully compared to one another.

This review focused on the control of physical properties. We suggest paying attention to, and carefully controlling for, the stimuli presentation mode and strategic and learning effects when designing GMs and running a numerical comparison study. A robust body of evidence supports a relationship between stimuli design and top-down strategic effects, and therefore future studies should pursue this direction. The problem of controlling physical properties in non-symbolic stimuli is relevant regardless of the stimuli presentation mode, as different stimuli sets create different correlations between the numerical and physical dimensions. In fact, the discrepancy between different stimuli sets and the theoretical prisms leading their design is apparent when comparing simultaneous and intermixed displays. For example, it is not clear whether the intermixed stimuli (Figure 9, Panel B) are composed of two different arrays with different convex hull areas or whether we should only account for one convex hull composed of both the yellow and the blue dot arrays.

Guidelines
Taken together, the variety of different GMs, properties, and control types reflects a theoretical and methodological wealth. This wealth makes it hard to maintain a common language within the field, but its advantages are apparent. The field is thriving with an increasing number of articles (Figure 3), and it has a large number of avenues and ideas that could be pursued. We see the diversity in the field as a strength, and we do not advocate using one GM or any other way to impose a common language (but see De Marco & Cutini, 2020; Zanon et al., 2021, for a different opinion). Yet, currently, it is very hard to compare GMs. Accordingly, we recommend that authors provide an explicit and consistent definition of all controlled properties.
Providing definitions will improve the replicability, interpretability, and explainability of previous studies (Broniatowski, 2021), such that previous ideas will be translatable in light of newer views (Almaatouq et al., 2022; Rocca & Yarkoni, 2021). We also suggest providing access to the code package used to generate the stimuli. Providing access to the original code will allow testing further questions and issues not tested in the original work. It will also provide a way to assess how hard or unnatural producing the stimuli was, as the control of some properties requires imposing numerous constraints on the stimuli. It is very hard to predict how controlling for one or more properties might affect other properties because different properties are intercorrelated with one another in many ways (De Marco & Cutini, 2020; Salti et al., 2017). Accordingly, we suggest that authors should examine the correlations among the different properties after the stimuli are produced. In addition, we suggest providing the actual stimuli alongside the behavioral and neuronal responses (if applicable) to each stimulus. New ideas could be tested by reanalyzing previous studies when both the stimuli and their corresponding responses are provided (e.g., Shilat et al., 2021).

These recommendations are in line with open-source practices, which increase study reliability and provide community-based knowledge-sharing (AlMarzouq et al., 2005). Moreover, these recommendations will hopefully enable researchers to better understand results obtained using different methods and provide more generality and reproducibility (Schooler, 2014) to the field. Finally, these recommendations allow diversity on one hand and an ability to examine old data through new prisms on the other hand. Suggested guidelines for GMs appear in Box 3 below.

Box 3
Guidelines
1. Property definitions. Provide explicit and consistent definitions of all controlled properties or control types.
2.
Providing stimuli generation code. Provide the code package used to generate the stimuli.
3. Post-production analysis. Examine the correlations among different properties after producing the stimuli.
4. Providing stimuli and responses. Provide the actual stimuli alongside the corresponding results to each stimulus.

Acknowledgments
The authors thank Dr. Tali Leibovich-Raveh for her valuable insights. Furthermore, we thank Ms. Adi Gabzu for her important insights and help with the data curation and tagging. Finally, we wish to thank the lovely Mrs. Desiree Meloul for her enlightening insights and editing the drafts of the article.

Supplementary Information
The online version contains supplementary material.

Authors' Contributions
Conceptualization - AH, MS, YS; Methodology - YS; Investigation - HG, SW, YS; Visualization - SW, YS; Supervision - AH, MS; Writing original draft - MS, YS; Reviewing and editing - AH, HG, MS, SW, YS.

Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Data and Code Availability
The datasets generated and analyzed during this study are publicly available as a part of the Supplementary Information.

Declarations
Conflicts of Interest: The authors declare that there is no conflict of interest.
Ethics Approval: Not applicable.
Consent to Participate: Not applicable.
Consent for Publication: Not applicable.
Competing Interests: The authors declare no competing interests.

References
Almaatouq, A., Griffiths, T. L., Suchow, J. W., Whiting, M. E., Evans, J., & Watts, D. J. (2022). Beyond Playing 20 Questions with Nature: Integrative experiment design in the social and behavioral sciences. Behavioral and Brain Sciences, 1–55. https://doi.org/10.1017/S0140525X22002874
AlMarzouq, M., Zheng, L., Rong, G., & Grover, V. (2005). Open source: Concepts, benefits, and challenges. Communications of the Association for Information Systems, 16, 756–784.
https://doi.org/10.17705/1CAIS.01637
Aulet, L. S., & Lourenco, S. F. (2021). The relative salience of numerical and non-numerical dimensions shifts over development: A re-analysis of Tomlinson, DeWind, and Brannon (2020). Cognition, 210, 104610. https://doi.org/10.1016/j.cognition.2021.104610
Avitan, A., Galili, H., & Henik, A. (2022). Less is more? Instructions modulate the way we interact with continuous features in non-symbolic dot-array comparison tasks. SSRN. https://doi.org/10.2139/ssrn.4065685
Bird, A. (2000). Thomas Kuhn (1st ed.). Acumen.
Boreman, G. D. (2021). Modulation transfer function in optical and electro-optical systems (2nd ed.). SPIE Press.
Broniatowski, D. A. (2021). Psychological Foundations of Explainability and Interpretability in Artificial Intelligence. National Institute of Standards and Technology. https://doi.org/10.6028/NIST.IR.8367
Clayton, S., Gilmore, C., & Inglis, M. (2015). Dot comparison stimuli are not all alike: The effect of different visual controls on ANS measurement. Acta Psychologica, 161, 177–184. https://doi.org/10.1016/j.actpsy.2015.09.007
Dakin, S. C., Tibber, M. S., Greenwood, J. A., Kingdom, F. A. A., & Morgan, M. J. (2011). A common visual metric for approximate number and density. Proceedings of the National Academy of Sciences, 108(49), 19552–19557. https://doi.org/10.1073/pnas.1113195108
Danner, H. G. (2014). A thesaurus of English word roots. Rowman & Littlefield.
De Marco, D., & Cutini, S. (2020). Introducing CUSTOM: A customized, ultraprecise, standardization-oriented, multipurpose algorithm for generating nonsymbolic number stimuli. Behavior Research Methods, 52(4), 1528–1537. https://doi.org/10.3758/s13428-019-01332-z
De Valois, K. K., & Switkes, E. (1980). Spatial frequency specific interaction of dot patterns and gratings. Proceedings of the National Academy of Sciences, 77(1), 662–665. https://doi.org/10.1073/pnas.77.1.662
Dehaene, S. (1999). The number sense: How the mind creates mathematics (1st ed.).
Oxford University Press.
DeWind, N. K., Adams, G. K., Platt, M., & Brannon, E. (2015). Modeling the approximate number system to quantify the contribution of visual stimulus features. Cognition, 142, 247–265. https://doi.org/10.1016/j.cognition.2015.05.016
DeWind, N. K., & Brannon, E. M. (2016). Significant inter-test reliability across approximate number system assessments. Frontiers in Psychology, 7, Article 310. https://doi.org/10.3389/fpsyg.2016.00310
DeWind, N. K., & Brannon, E. M. (2019, January 7). Measuring congruence effects in nonsymbolic number comparison: The importance of the degree of congruence. Methods in Numerical Cognition Workshop, Budapest, Hungary. https://osf.io/ds2h7
Efford, N. (2000). Digital image processing: A practical introduction using Java (1st ed.). Addison-Wesley.
Egner, T. (2007). Congruency sequence effects and cognitive control. Cognitive, Affective, & Behavioral Neuroscience, 7(4), 380–390. https://doi.org/10.3758/CABN.7.4.380
French, R. S. (1953). The discrimination of dot patterns as a function of number and average separation of dots. Journal of Experimental Psychology, 46(1), 1–9. https://doi.org/10.1037/h0059543
Fricke, S. (2018). Semantic Scholar. Journal of the Medical Library Association, 106(1), 145–147. https://doi.org/10.5195/jmla.2018.280
Frith, C. D., & Frith, U. (1972). The solitaire illusion: An illusion of numerosity. Perception & Psychophysics, 11(6), 409–410. https://doi.org/10.3758/BF03206279
Gebuis, T., & Reynvoet, B. (2011a). Generating nonsymbolic number stimuli. Behavior Research Methods, 43(4), 981–986. https://doi.org/10.3758/s13428-011-0097-5
Gebuis, T., & Reynvoet, B. (2011b). The interplay between nonsymbolic number and its continuous visual properties. Journal of Experimental Psychology: General, 141(4), 642–648. https://doi.org/10.1037/a0026218
Groeneveld, R. A., & Meeden, G. (1984). Measuring skewness and kurtosis. The Statistician, 33(4), 391–399.
https://doi.org/10.2307/2987742
Guillaume, M., Schiltz, C., & Rinsveld, A. V. (2020). NASCO: A New Method and Program to Generate Dot Arrays for Non-Symbolic Number Comparison Tasks. Journal of Numerical Cognition, 6(1), 129–147. https://doi.org/10.5964/JNC.V6I1.231
Gusenbauer, M. (2019). Google Scholar to overshadow them all? Comparing the sizes of 12 academic search engines and bibliographic databases. Scientometrics, 118(1), 177–214. https://doi.org/10.1007/s11192-018-2958-5
Gusenbauer, M., & Haddaway, N. R. (2020). Which academic search systems are suitable for systematic reviews or meta-analyses? Evaluating retrieval qualities of Google Scholar, PubMed, and 26 other resources. Research Synthesis Methods, 11(2), 181–217. https://doi.org/10.1002/jrsm.1378
Guy, G., & Medioni, G. (1993). Inferring global perceptual contours from local features. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 786–787. https://doi.org/10.1109/CVPR.1993.341175
Halberda, J., & Feigenson, L. (2008). Developmental change in the acuity of the “number sense”: The approximate number system in 3-, 4-, 5-, and 6-year-olds and adults. Developmental Psychology, 44(5), 1457–1465. https://doi.org/10.1037/a0012682
Halberda, J., Mazzocco, M. M. M., & Feigenson, L. (2008). Individual differences in non-verbal number acuity correlate with maths achievement. Nature, 455(7213), 665–668. https://doi.org/10.1038/nature07246
Huntley-Fenner, G., & Cannon, E. (2000). Preschoolers’ magnitude comparisons are mediated by a preverbal analog mechanism. Psychological Science, 11(2), 147–152. https://doi.org/10.1111/1467-9280.00230
Jones, N. (2015). Artificial-intelligence institute launches free science search engine. Nature, 10. https://doi.org/10.1038/nature.2015.18703
Kadosh, R. C., Kadosh, K. C., & Henik, A. (2008). When brightness counts: The neuronal correlate of numerical–luminance interference. Cerebral Cortex, 18(2), 337–343.
https://doi.org/10.1093/cercor/bhm058
Katzin, N., Katzin, D., Rosen, A., Henik, A., & Salti, M. (2020). Putting the world in mind: The case of mental representation of quantity. Cognition, 195, Article 104088. https://doi.org/10.1016/j.cognition.2019.104088
Katzin, N., Salti, M., & Henik, A. (2019). Holistic processing of numerical arrays. Journal of Experimental Psychology: Learning, Memory, and Cognition, 45(6), 1014–1022. https://doi.org/10.1037/xlm0000640
Kaur, A., Sharma, R., Mishra, P., Sinhababu, A., & Chakravarty, R. (2022). Visual research discovery using connected papers: A use case of blockchain in libraries. The Serials Librarian, 83(2), 186–196.
Keylock, C. J. (2005). Simpson diversity and the Shannon-Wiener index as special cases of a generalized entropy. Oikos, 109(1), 203–207. https://doi.org/10.1111/j.0030-1299.2005.13735.x
Kuhn, T. S. (1970). The structure of scientific revolutions (2nd ed.). University of Chicago Press.
Kuzmina, Y., & Malykh, S. (2022). The effect of visual parameters on nonsymbolic numerosity estimation varies depending on the format of stimulus presentation. Journal of Experimental Child Psychology, 224, Article 105514. https://doi.org/10.1016/j.jecp.2022.105514
Lande, R. (1996). Statistics and partitioning of species diversity, and similarity among multiple communities. Oikos, 76(1), 5–13. https://doi.org/10.2307/3545743
Lea, D. (Ed.). (2008). Oxford learner’s thesaurus: A dictionary of synonyms (1st ed.). Oxford University Press.
Leibovich, T., Henik, A., & Salti, M. (2015). Numerosity processing is context driven even in the subitizing range: An fMRI study. Neuropsychologia, 77, 137–147. https://doi.org/10.1016/j.neuropsychologia.2015.08.016
Leibovich, T., Katzin, N., Harel, M., & Henik, A. (2017). From “sense of number” to “sense of magnitude”: The role of continuous magnitudes in numerical cognition. Behavioral and Brain Sciences, 40, e164.
https://doi.org/10.1017/S0140525X16000960
Leibovich-Raveh, T., Stein, I., Henik, A., & Salti, M. (2018). Number and continuous magnitude processing depends on task goals and numerosity ratio. Journal of Cognition, 1(1), Article 19. https://doi.org/10.5334/joc.22
Lourenco, S. F., Bonny, J. W., Fernandez, E. P., & Rao, S. (2012). Nonsymbolic number and cumulative area representations contribute shared and unique variance to symbolic math competence. Proceedings of the National Academy of Sciences, 109(46), 18737–18742. https://doi.org/10.1073/pnas.1207212109
Lyons, I. M., Price, G. R., Vaessen, A., Blomert, L., & Ansari, D. (2014). Numerical predictors of arithmetic success in grades 1-6. Developmental Science, 17(5), 714–726. https://doi.org/10.1111/desc.12152
MacGillivray, H. L. (1986). Skewness and asymmetry: Measures and orderings. The Annals of Statistics, 14(3), 994–1011. https://doi.org/10.1214/aos/1176350046
Mann, H. B., & Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, 18(1), 50–60. http://www.jstor.org/stable/2236101
Marchant, A. P., Simons, D. J., & de Fockert, J. W. (2013). Ensemble representations: Effects of set size and item heterogeneity on average size perception. Acta Psychologica, 142(2), 245–250. https://doi.org/10.1016/j.actpsy.2012.11.002
Mareschal, I., & Baker, C. L. (1998). A cortical locus for the processing of contrast-defined contours. Nature Neuroscience, 1(2), 150–154. https://doi.org/10.1038/401
McKnight, P. E., & Najab, J. (2010). Mann-Whitney U Test. In I. B. Weiner & W. E. Craighead (Eds.), The Corsini encyclopedia of psychology (1st ed.). Wiley. https://doi.org/10.1002/9780470479216.corpsy0524
Mehler, J., & Bever, T. G. (1967). Cognitive capacity of very young children. Science, 158(3797), 141–142. https://doi.org/10.1126/science.158.3797.141
Navon, D. (1977).
Forest before trees: The precedence of global features in visual perception. Cognitive Psychology, 9(3), 353–383. https://doi.org/10.1016/0010-0285(77)90012-3
Norris, J., & Castronovo, J. (2016). Dot display affects approximate number system acuity and relationships with mathematical achievement and inhibitory control. PLoS ONE, 11(5), e0155543. https://doi.org/10.1371/journal.pone.0155543
Pearson, K. (1895). X. Contributions to the mathematical theory of evolution.—II. Skew variation in homogeneous material. Philosophical Transactions of the Royal Society of London (A), 186, 343–414. https://doi.org/10.1098/rsta.1895.0010
Pekár, J., & Kinder, A. (2020). The interplay between non-symbolic number and its continuous visual properties revisited: Effects of mixing trials of different types. Quarterly Journal of Experimental Psychology, 73(5), 698–710. https://doi.org/10.1177/1747021819891068
Piaget, J. (1968). Quantification, conservation, and nativism: Quantitative evaluations of children aged two to three years are examined. Science, 162(3857), 976–979. https://doi.org/10.1126/science.162.3857.976
Piazza, M., Izard, V., Pinel, P., Le Bihan, D., & Dehaene, S. (2004). Tuning curves for approximate numerosity in the human intraparietal sulcus. Neuron, 44(3), 547–555. https://doi.org/10.1016/j.neuron.2004.10.014
Pinel, P., Piazza, M., Le Bihan, D., & Dehaene, S. (2004). Distributed and overlapping cerebral representations of number, size, and luminance during comparative judgments. Neuron, 41(6), 983–993. https://doi.org/10.1016/S0896-6273(04)00107-2
Price, G. R., Palmer, D., Battista, C., & Ansari, D. (2012). Nonsymbolic numerical magnitude comparison: Reliability and validity of different task variants and outcome measures, and their relationship to arithmetic achievement in adults. Acta Psychologica, 140(1), 50–57. https://doi.org/10.1016/j.actpsy.2012.02.008
Rocca, R., & Yarkoni, T. (2021).
Putting psychology to the test: Rethinking model evaluation through benchmarking and prediction. Advances in Methods and Practices in Psychological Science, 4(3), 1–24. https://doi.org/10.1177/25152459211026864
Rodríguez, C., & Ferreira, R. A. (2023). To what extent is dot comparison an appropriate measure of approximate number system? Frontiers in Psychology, 13, Article 1065600. https://doi.org/10.3389/fpsyg.2022.1065600
Ross, J. (2003). Visual discrimination of number without counting. Perception, 32(7), 867–870. https://doi.org/10.1068/p5029
Rousselle, L., & Noël, M.-P. (2008). The development of automatic numerosity processing in preschoolers: Evidence for numerosity-perceptual interference. Developmental Psychology, 44(2), 544–560. https://doi.org/10.1037/0012-1649.44.2.544
Rousselle, L., Palmers, E., & Noël, M.-P. (2004). Magnitude comparison in preschoolers: What counts? Influence of perceptual variables. Journal of Experimental Child Psychology, 87(1), 57–84. https://doi.org/10.1016/j.jecp.2003.10.005
Salti, M., Harel, A., & Marti, S. (2019). Conscious perception: Time for an update? Journal of Cognitive Neuroscience, 31(1), 1–7. https://doi.org/10.1162/jocn_a_01343
Salti, M., Katzin, N., Katzin, D., Leibovich, T., & Henik, A. (2017). One tamed at a time: A new approach for controlling continuous magnitudes in numerical comparison tasks. Behavior Research Methods, 49(3), 1120–1127. https://doi.org/10.3758/s13428-016-0772-7
Schooler, J. W. (2014). Metascience could rescue the ‘replication crisis.’ Nature, 515(7525), 9–9. https://doi.org/10.1038/515009a
Shilat, Y., Salti, M., & Henik, A. (2021). Shaping the way from the unknown to the known: The role of convex hull shape in numerical comparisons. Cognition, 217, Article 104893. https://doi.org/10.1016/j.cognition.2021.104893
Simpson, E. H. (1949). Measurement of diversity. Nature, 163(4148), 688–688. https://doi.org/10.1038/163688a0
Smets, K., Moors, P., & Reynvoet, B. (2016).
Effects of presentation type and visual control in numerosity discrimination: Implications for number processing? Frontiers in Psychology, 7, Article 66. https://doi.org/10.3389/fpsyg.2016.00066
Smets, K., Sasanguie, D., Szücs, D., & Reynvoet, B. (2015). The effect of different methods to construct non-symbolic stimuli in numerosity estimation and comparison. Journal of Cognitive Psychology, 27(3), 310–325. https://doi.org/10.1080/20445911.2014.996568
Soltész, F., Szűcs, D., & Szűcs, L. (2010). Relationships between magnitude representation, counting and memory in 4- to 7-year-old children: A developmental study. Behavioral and Brain Functions, 6, Article 13. https://doi.org/10.1186/1744-9081-6-13
Sophian, C., & Chu, Y. (2008). How do people apprehend large numerosities? Cognition, 107(2), 460–478. https://doi.org/10.1016/j.cognition.2007.10.009
Stalnaker, R. (2002). Common ground. Linguistics and Philosophy, 25(5/6), 701–721. https://doi.org/10.1023/A:1020867916902
Tokita, M., & Ishiguchi, A. (2010). How might the discrepancy in the effects of perceptual variables on numerosity judgment be reconciled? Attention, Perception & Psychophysics, 72(7), 1839–1853. https://doi.org/10.3758/APP.72.7.1839
Townsend, J. T. (1990). Serial vs. parallel processing: Sometimes they look like Tweedledum and Tweedledee but they can (and should) be distinguished. Psychological Science, 1(1), 46–54. https://doi.org/10.1111/j.1467-9280.1990.tb00067.x
Tran, U. S., Lallai, T., Gyimesi, M., Baliko, J., Ramazanova, D., & Voracek, M. (2021). Harnessing the fifth element of distributional statistics for psychological science: A practical primer and shiny app for measures of statistical inequality and concentration. Frontiers in Psychology, 12, Article 716164. https://doi.org/10.3389/fpsyg.2021.716164
Trezona, P. W. (2000). Luminance: Its use and misuse. Color Research & Application, 25(2), 145–147. https://doi.org/10.1002/(SICI)1520-6378(200004)25:2<145::AID-COL9>3.0.CO;2-0
Wilcoxon, F. (1945).
Individual comparisons by ranking methods. Biometrics Bulletin, 1(6), 80–83. https://doi.org/10.2307/3001968
Xu, F., & Spelke, E. S. (2000). Large number discrimination in 6-month-old infants. Cognition, 74(1), B1–B11. https://doi.org/10.1016/S0010-0277(99)00066-9
Yousif, S. R., & Keil, F. C. (2019). The additive-area heuristic: An efficient but illusory means of visual area approximation. Psychological Science, 30(4), 495–503. https://doi.org/10.1177/0956797619831617
Zanon, M., Potrich, D., Bortot, M., & Vallortigara, G. (2021). Towards a standardization of nonsymbolic numerical experiments: GeNEsIS, a flexible and user-friendly tool to generate controlled stimuli. Behavior Research Methods, 54, 146–157. https://doi.org/10.3758/s13428-021-01580-y