Wayfinding or goal-directed navigation is important for the survival of most mobile species. Against that backdrop, it may seem surprising that humans differ in their navigation skills. Nevertheless, there is considerable individual variation in wayfinding accuracy and confidence, across a variety of experimental paradigms (Newcombe, 2018). We do not yet fully understand the correlates and causes of such individual variation, but sex of participants may be one dimension that matters. There are frequently cited instances of male advantage in the literature (Dabbs, Chang, Strong, & Milun, 1998; Gagnon et al., 2018; Lawton, 1994; Lawton & Kallai, 2002). However, there are also reports of no sex differences (Driscoll, Hamilton, Yeo, Brooks & Sutherland, 2005; Herman, Kail, & Siegel, 1979; O’Laughlin & Brubaker, 1998; Saucier et al., 2002; Schmitzer-Torbert, 2007) and even studies showing a female advantage on navigation tasks under certain conditions (Burigat & Chittaro, 2007). These variations in results raise the question of whether sex differences exist in human spatial navigation. However, in popular perception and the popular press, there is little doubt; it is common to see headlines such as “Study says that men have the better sense of direction” (Gregoire, 2015).

A meta-analysis can bring together the empirical literature, integrating reports with varying sample sizes and demographic characteristics, and a wide range of methodologies (testing environments, experimental procedures, use of technology, learning perspectives, and outcome measures). Previous reviews do not address the core question of quantifying behavioral sex differences in human navigation and examining potential moderators, although they do provide valuable information. The most relevant prior effort is a narrative literature review, in which Coluccia and Louse (2004) found that, while more than 50% of papers did not find significant sex differences, the remainder found better performance by males, and no studies reported females doing better. This pattern of evidence was suggestive of a male advantage, but in the absence of a systematic meta-analysis, it is impossible to know for sure, let alone to estimate the size of sex differences and the role of potential moderators in any effects. In addition, in the years since the Coluccia and Louse review, there has been a great deal more navigation research. For example, the paper by Burigat and Chittaro (2007) showing a female advantage came out later.

There are two other relevant reviews, both of which are quantitative, although they address somewhat different questions. Jonasson (2005) conducted a meta-analysis of sex differences in rodent navigation, which can be seen as relevant to human differences given our common mammalian heritage. The analysis showed that sex differences in rodents varied with rearing environment, training protocols, and navigation task. Crucially, there was evidence of species variation, suggesting that there is no basic mammalian pattern. Mice actually demonstrated a small female advantage in the water maze task. In another review relevant to our purpose, Boccia, Nemmi, and Guariglia (2014) conducted a meta-analysis of 66 fMRI studies of the neural substrates involved in human navigation. They found interesting sex-linked variations in brain activation moderated by environment familiarity and egocentric and allocentric strategies. However, their review did not address the question of behavioral sex differences in navigation accuracy.

Accordingly, our purpose was to provide a comprehensive meta-analysis of behavioral sex differences in human spatial navigation. We aimed to quantify the overall magnitude of sex differences and examine potential moderators of the effects. To conduct this synthesis, we began by defining spatial navigation and then specified potential moderators.

Defining wayfinding

There is a wide variety of cognitive processes involved in navigation, and experimenters have developed many different small-scale and large-scale paradigms and self-report measures to investigate individual differences (Wiener, Büchner, & Hölscher, 2009). There does not seem to be an agreed-upon definition of what constitutes “spatial navigation.” Coluccia and Louse (2004) defined navigation as a “complex of all the skills used for locating themselves with respect to a point of reference or an absolute system of coordinates” (p. 329). However, this definition encompasses spatial processes at different scales (see Montello, 1993, 2010), despite substantial evidence that navigation in large-scale environments is fundamentally different from small-scale tasks in which all objects are intervisible (Hegarty, Montello, Richardson, Ishikawa, & Lovelace, 2006; Learmonth, Newcombe, Sheridan, & Jones, 2003; Montello, 1993; Padilla, Creem-Regehr, Stefanucci & Cashdan, 2017).

For the purpose of the present meta-analysis, we defined navigation as skills involved in encoding spatial relations at environmental scale and transforming them to orient and navigate in environmental space. This definition includes spatial navigation with external representations (e.g., maps), but excludes reorientation tasks conducted in small spaces and small-scale mazes with target locations visible from a single vantage point. The only exception to the scale criterion for our review was the virtual water-maze task (Astur, Ortiz, & Sutherland, 1998), based on the Morris water-maze task (Morris, 1981). This task is a well-accepted measure in the animal spatial cognition literature, where it has been shown to engage hippocampal processes that support cognitive mapping (Astur, Taylor, Mamelak, Philpott, & Sutherland, 2002). It has been widely used to study individual differences in the use of proximal and distal cues and has a known neural substrate implicated in navigation (Daugherty, Bender, Yuan, & Raz, 2015; Packard & McGaugh, 1996). Further, the hidden platform task occludes the target to be reached and thus can arguably be included within the environmental-scale definition.

Potential moderators

A systematic approach to moderator identification is likely to decrease the risk of Type I error in meta-analytic results (Lipsey & Wilson, 2001). Accordingly, we identified empirically supported or theoretically relevant factors a priori for consideration in the moderator analysis. Table 1 lists the measure-level variables that we identified for later coding. Below, we discuss the empirical and theoretical basis that led us to select these factors as potential moderators.

Table 1 Coding scheme for moderators in the present meta-analysis

Task goal

There are large variations in the task goal used to assess a participant’s navigation skills. We list these task goals in eight distinct categories (see Table 1). These eight categories may demonstrate large, small, or negligible sex differences, depending on the cognitive processes involved in completing the task goal. For example, Jansen-Osmann, Schmid, and Heil (2007) found no sex difference on a route-recall task, whereas map reading showed a large male advantage. In contrast, Allen and Wittenborg (1998) found a significant male advantage for route recall, but not for a distance task. This type of inconsistency across studies suggests that inclusion of task goals as a moderator is important. In addition, task goals form the basic level from which all other task requirements are defined, and thus we used this moderator as the basic unit of analysis in subgroup analyses.

Perspective

An important difference in navigation tasks pertains to the type of visual representation of the environment—route or survey. In a route perspective, individuals experience the environment from a first-person perspective (i.e., they locate target objects in relation to their own position). In contrast, in a survey perspective, individuals see the entire spatial layout from a bird’s-eye view. The inclusion of perspective as a moderator is particularly important in light of research suggesting that men show a preference for a Euclidean orientation strategy using cardinal directions and distances, but that women prefer a landmark strategy that relies on a sequence of turns and proximal cues (Dabbs, Chang, Strong, & Milun, 1998; Lawton, 1994; Lawton & Kallai, 2002). This observation raises the possibility that men and women may differentially benefit from a particular perspective. There are also navigation studies that use a combination of route and survey perspectives, and we categorized these studies into a third group. At a more global, theoretical level, perspective can be seen as a central defining component of spatial navigation, possibly because a change in perspective is a component of many navigation tasks (Wolbers & Hegarty, 2010). The importance of perspective as a moderator therefore manifests itself both at the empirical and theoretical level.

Outcome measure

Outcome measures are often presented as reflecting different dimensions of performance. For example, in mental rotation, Lohman (1986) presented response time as reflecting speed of processing, whereas accuracy was believed to reflect level of ability. From this perspective, findings of a larger male advantage with accuracy than with response time in a spatial navigation task by Malinowski and Gillespie (2001) emphasize sex differences in level of ability. In the context of spatial navigation, methods of scoring other than accuracy and response time have also been used to quantify performance. We therefore categorized outcome measures into four categories: accuracy, degrees of deviation, distance, and response time. Examining potential variations in the magnitude of sex differences across outcome measures is crucial to assess current navigation metrics and biases built into them and how these might affect the magnitude of sex differences.

Route direction

Depending on task goals and outcome measures, participants may need to navigate routes in the same direction as initially learned or in opposite directions. Potential variations in sex differences based on route direction could be indicative of differences in working memory and strategy flexibility. For example, individuals who depend on a rigid sequence of cues and turns may perform better in the forward direction than in the reversal condition. Given the male advantage in visuospatial working memory (Voyer, Voyer, & Saint Aubin, 2017), the magnitude of sex difference might be larger during route reversal as a result of the added working memory load. Hence, we included route directions as a potential moderator and categorized movement into three groups—forward, backward, and free choice.

Route selection

In the research we retrieved, participants were often asked to select a specific route to meet the task goal. These route selection options could be categorized into three groups: free choice, exact way, and shortcut. Empirical findings from Choi, McKillop, Ward, and L’Hirondelle (2006) showed that asking participants to use a shortcut can produce a male advantage, whereas following the exact way that was learned can produce a female advantage. Furthermore, Boone, Gong, and Hegarty (2018) reported that men are more likely than women to use shortcuts, whereas women prefer the learned route. Therefore, we included route selection as a moderator in our analysis.

Timing

Research suggests that anxiety and self-doubt negatively influence the ability to encode important spatial features and consequently navigation performance (Saucier et al., 2002). Women generally report higher spatial anxiety than men (Huang & Voyer, 2017; Lawton, 1994; Lawton & Kallai, 2002) and lower self-confidence in spatial tasks (Huang & Voyer, 2017; O’Laughlin & Brubaker, 1998). This suggests that women may be affected differentially by the imposition of time constraints. For example, it is possible that the male advantage in spatial navigation could be larger under time constraints compared with unconstrained conditions, as has been reported for mental rotation (Voyer, 2011). Hence, we included timing conditions as a potential moderator with two categories—untimed and timed.

Cues

Environments may also differ in the amount and diversity of visual cues available in them. Visual cues used to orient and navigate can be broadly divided into two main categories—proximal and distal (Chai & Jacobs, 2010; Knierim & Hamilton, 2011; O’Keefe & Nadel, 1978; Sandstrom, Kauffman & Huettel, 1998; Vorhees & Williams, 2006). Proximal and distal cues can be differentiated by the effect on parallax when individuals change their location in an environment, with the latter cue type providing a more stable bearing than the former (Knierim & Hamilton, 2011). Most work manipulating the role of visual cues, however, has been restricted to small-scale navigation paradigms with few exceptions (Padilla et al., 2017), even though large-scale environments are also obviously rich in both proximal and distal cues. Existing research suggests a male superiority in environments with limited or directional landmark cues (Astur et al., 2004; Barkley & Gabriel, 2007; Cánovas, Garcia, & Cimadevilla, 2011; Moffat, Hampson, & Hatzipantelis, 1998) and no sex differences in environments rich with proximal cues (Astur, Tropp, Sava, Constable, & Markus, 2004; Saucier et al., 2002). Sex differences in human visual cue processing, mentioned by many authors (Astur, Ortiz & Sutherland, 1998; Barkley & Gabriel, 2007; Chai & Jacobs, 2009, 2010; Choi et al., 2006; Sandstrom, Kauffman & Huettel, 1998) provide strong empirical support for the inclusion of cues as a moderator of sex differences in navigation.

The role of distal and proximal cues in spatial navigation also has theoretical implications, as these cues are seen as central to many navigation tasks, likely as a result of the much-documented link between the use of such cues and hippocampal functioning (Poulter, Hartley, & Lever, 2018). Therefore, a result showing an effect of cues on the magnitude of sex differences in spatial navigation could raise the possibility of differentiated hippocampal functions in men and women.

Environment

The environment experienced by the participant while navigating can be seen as reflecting variation in the actual scale of the surroundings. Specifically, we categorized the environment as indoor, outdoor, or a combination of indoor–outdoor to refer to a limited, unlimited, or mixed-area scale environment, respectively. More importantly, recall that although some researchers might consider them small-scale tasks, water-maze tasks were included in our sample as they essentially provide the richness of large-scale tasks within a limited area by means of a manipulation of distal and proximal cues (Poulter et al., 2018). Therefore, this type of task formed a fourth environment category. In view of the argument that water-maze tasks provide the clearest available evidence for sex differences in spatial navigation strategy selection (Boone et al., 2018), it is critical for us to consider this type of task as a central component of the environment moderator.

Familiarity

Navigation tasks can be conducted in environments that are either learned specifically for the purpose of testing (e.g., Jansen-Osmann, Schmid, & Heil, 2007; Nazareth, Weisberg, Margulis, & Newcombe, 2018) or that are already familiar to the participant (Meilinger, Frankenstein, & Bülthoff, 2013). Boccia, Nemmi, and Guariglia (2014) found that different neural correlates were involved in the processing of familiar and learned environments. It is therefore not surprising that Abu-Obeid (1998) found a significant male advantage in a new environment but not in a familiar one. Accordingly, we included familiarity as a potential moderator of sex differences in navigation.

Testing medium

Coluccia and Louse (2004) reported that males outperformed females in about 57% of the research using “simulated” (i.e., virtual) environments, very similar to the 59% found for real environments. However, this value dropped to 42% for maps. More importantly, virtual and real environments showed no cases where females outperformed males, whereas this figure was around 18% for maps. Therefore, we aimed to determine whether Coluccia and Louse’s findings on this moderator would be reflected in smaller sex differences for map-like tasks compared with virtual or real environments in our comprehensive quantitative review. Accordingly, we categorized testing medium into virtual, real, and symbolic groups, where symbolic referred to the use of a map or a similar medium.

Feedback and hints

There is evidence in the cognition literature of sex differences in attitudes and responses to failures and achievements (Dweck, 1986). Although not examined frequently in the navigation literature, the inclusion of feedback and/or hints from guides as moderators accounts for important differences in men’s and women’s response to performance feedback (Lenney, 1977) and their ability to use the feedback constructively for spatial updating. We therefore considered feedback and hints as two separate moderators. Feedback had two categories—immediate feedback and no feedback. Hints had two categories—hints and no hints—based on whether participants were given hints during the task goal.

Device assistance

We know that device assistance from a map provides a survey perspective of the environment and consequently may provide an additional support to individuals—primarily women—who rely on a route strategy (Dabbs, Chang, Strong, & Milun, 1998; Lawton, 1994; Lawton & Kallai, 2002), as is described in more detail in the section on perspective. Hence, device assistance was used as a potential moderator with two categories—device and no device.

Learning intervals

Learning intervals were broken down into three categories: immediate (no gap between learning and testing phases), short (a gap of less than 24 hours between the two phases), and long (a gap of more than 24 hours). Forgetting occurs over time, as a century of memory research shows, but spatial memory also seems to undergo consolidation during wakeful rest or sleep (Skaggs & McNaughton, 1996), in which it may actually strengthen. We are unaware of data suggesting that such “replay” differs by sex, but it seemed prudent to examine the question in our analysis, given the importance of these processes in the literature.

Age

There is gradual age-related change in navigation skills in children between 6 and 12 years of age (e.g., Acredolo, Pick, & Olsen, 1975; Allen, Kirasic, Siegel, & Herman, 1979; Heth, Cornell, & Alberts, 1997; Laurance, Learmonth, Nadel, & Jacobs, 2003; Overman, Pate, Moore, & Peuster, 1996) with spatial representations maturing near adolescence (Liben, Myers, Christensen, & Bower, 2013). Further, the large body of literature on the influence of hormone levels on spatial skills (Brake & Lacasse, 2018; Lisofsky, Riediger, Gallinat, Lindenberger, & Kühn, 2016; Puts, McDaniel, Jordan, & Breedlove, 2008) highlights the importance of participant age during testing (e.g., puberty, pregnancy, menopause). Thus, in addition to the task-related variations, we investigated the moderating influence of participant age in sex differences in navigation skills.

Current meta-analysis

The main aim of the current meta-analysis was to provide a summary of sex differences found in navigation research. To this end, we conducted an exhaustive search for relevant published and unpublished data collected in different countries with different sample populations and using different navigation paradigms. To our knowledge, no meta-analysis specifically quantifying human sex differences in navigation has been published to date, and we believe that the current meta-analysis serves to fill that gap in the navigation literature. It is important to clarify that the current meta-analysis excludes self-report measures of navigation skills as well as any navigation studies involving small-scale paradigms in which the entire testing environment can be viewed from a single vantage point. We also constrained the meta-analysis to include only studies with typical populations. Further, published research studies that failed to include numerical values for male and female sex differences and for which the corresponding author was unable to provide these data were excluded from the meta-analysis.

Not surprisingly, many of the studies we retrieved included several outcome measures relevant to our research questions or provided nonindependent measurements of their participants under different experimental conditions relevant to the potential moderators we identified. In the context of fixed effects, random effects, or mixed effects meta-analysis, such nonindependent effect sizes would violate the assumptions underlying these types of analyses (Borenstein, Hedges, Higgins, & Rothstein, 2009) and make the results uninterpretable. Accordingly, for all data analyses, we used multilevel meta-analysis, as this method is particularly well suited to handle the hierarchical nature of a meta-analytic data set and nonindependent effect sizes (Raudenbush & Bryk, 2002).

Method

Literature search and study selection

A primary search was conducted in the PsycINFO, PsycARTICLES, and ERIC databases using various Boolean combinations of the search terms wayfinding, spatial, navigation, orientation, maps, representation, cognition, and environment. The search included all available records from the year 1803 (default lower limit in the search engine) to October 2017. We refrained from using sex/gender differences as search terms in our initial search to avoid prematurely excluding papers that did not have sex/gender differences as their central theme. Our filters excluded patents, reviews, books, magazine articles, and any non-peer-reviewed sources. These searches resulted in 10,663 nonoverlapping hits, which were initially reduced to 1,164 with the help of our inclusion criteria, detailed below. Foreign language articles including an English abstract were also included in the analyses. Theses and dissertations were considered as possible sources of unpublished material, but were excluded if the same data had been published. In such cases, only the published version was included in the meta-analysis. In addition, announcements requesting unpublished data were posted to the following LISTSERVs: Cognitive Development Society (CDS), Spatial Learning Network of the Spatial Intelligence and Learning Center (SILC), and Canadian Society for Brain, Behavior and Cognitive Science (CSBBCS). Altogether, we received 47 responses to our LISTSERV announcements, although most of those were for published work.

Next, we adhered to the rules in Fig. 1 to exclude articles that did not meet our inclusion criteria but had not been excluded by our automated search engine filters. This process involved two of the four authors (a postdoctoral research fellow and a doctoral student) carefully reading through the title, abstract, and, on occasion, the entire article to determine eligibility. These additional exclusions were papers that presented studies on nonhumans, robotics, unmanned devices, and self-report measures as well as tasks that involved line orientation, object location in a room/table/virtual environment, web navigation, language, haptic orientation, tactile, grating, locomotion, optic flow, categorization, spatial frequency, grasping, and reach planning, and so forth, that did not fit our definition of spatial navigation. Studies that had either male-only or female-only human participants, as well as special populations (for example, brain-damaged, hearing, or visually impaired individuals) were excluded from our sample, although data from healthy control groups were included. Finally, papers that did not present original research (e.g., review papers) were excluded from the analyses. The next step was to contact authors of articles that cleared our screening process, but that did not report the information required for us to compute effect sizes relevant to sex differences. Papers from authors who did not respond to our e-mails or who no longer had access to the data were excluded from the meta-analysis.

Fig. 1 Flowchart illustrating the selection process of articles included in the meta-analysis

Our final sample had 694 effect sizes drawn from 266 samples. Out of the total effect sizes, 80 came from unpublished research in English (70 from dissertations, 10 from one unpublished paper). The remaining 614 effect sizes were from papers published in English, and no papers came from work published in other languages. Furthermore, 293 out of the 694 effect sizes (42.2%) originated from the United States, 135 effect sizes (19.5%) were from Germany, 73 were from Canada (10.5%), and 64 were from the United Kingdom (9.2%). The remaining effect sizes (18.6%) were from a variety of other countries. The final sample of studies included in the meta-analysis is presented in the Supplemental Material.
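The country shares above follow directly from the reported counts. As a quick arithmetic check, the short Python sketch below recomputes them; the variable and function names are ours and purely illustrative.

```python
# Counts of effect sizes as reported in the text.
total = 694
by_country = {
    "United States": 293,
    "Germany": 135,
    "Canada": 73,
    "United Kingdom": 64,
}

# Effect sizes not attributed to one of the four named countries.
other = total - sum(by_country.values())


def pct(count, total_count):
    """Share of the total as a percentage, rounded to one decimal place."""
    return round(100 * count / total_count, 1)


shares = {country: pct(n, total) for country, n in by_country.items()}
shares["Other"] = pct(other, total)

for country, share in shares.items():
    print(f"{country}: {share}%")
```

The remainder works out to 129 effect sizes, which matches the 18.6% reported for other countries.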

Inclusion criteria

The following criteria were used to determine the inclusion eligibility of a study in the meta-analysis:

  1. The study should involve empirical research with nonclinical, male AND female human populations. For papers that investigated both nonhuman and human subjects, we included data for the human subjects only. For papers that included a healthy control when investigating clinical/special populations, we included data for the healthy controls only.

  2. The study should include at least one objective navigation performance outcome in a task that met our definition of spatial navigation (presented in the Introduction). Importantly, this criterion meant that studies using only self-reports/surveys as measures of navigation performance were excluded.

  3. The navigation paradigm should be categorized at an “environmental” scale (see Montello, 1993, for a description of scales). At an environmental scale, a participant needs to move some distance within the space in order to obtain information about the spatial properties of the real or virtual environment (i.e., all spatial information cannot be obtained from a single vantage point in the environment). The only exception to this rule was the water maze task (e.g., Chamizo, Artigas, Sansa, & Banterla, 2011), as justified in the Introduction.

Coding of study variables

As a starting point, we developed a coding template that captured crucial methodological aspects of the studies along with the moderators of interest. Our coding template had categories for a number of variables not necessarily considered as moderators to provide as complete a picture as possible for each study, with an eye on later data interpretation. Therefore, the coding template involved the following study characteristics: authors, year of publication, author ID, sample ID (a crucial variable for multilevel analysis), publication status, mean age of sample, experiment number, sample origin, number of males, number of females, study location, task goals, testing medium, outcome measure(s), route direction, timing conditions, perspective, locomotion, route selection, environment, familiarity, test interval, hints, feedback, device assistance, cues, and the calculated effect size(s). From this larger set, we considered the variables at sample level and measure level detailed in the Introduction section to identify factors that might moderate sex differences in wayfinding/navigation skills. Sample-level variables reflect variables inherent to the samples themselves, such as mean age. Measure-level variables are those that are inherent to spatial navigation tasks, such as medium and outcome measure.

Sample-level variables

Undergraduate students represented 53.5% of the total effect sizes, and children represented 9.7% of the total effect sizes. The remaining effect sizes represented a wide range of participant ages, justifying our use of the mean age of the participants as the moderator of interest, coded both as a continuous variable and as a categorical variable (less than 13, 13–17, 18–29, 30–50, 50–older).

We also considered the sample-level variable year of publication. This variable is easily obtained and is routinely considered in meta-analyses of cognitive sex differences as a means to investigate Feingold’s (1988) claim that cognitive sex differences are decreasing in magnitude in recent years.

Measure-level variables

Table 1 presents a list of measure-level variables along with their respective categories. We have also included examples to clarify our classification process. In total, we had 14 measure-level variables consisting mostly of two to three groupings each. The task goals variable had eight categories, which was the highest number of categories in any variable. This large number of task goals testifies to the wide range of skills measured in the navigation literature.

To ensure coding validity, the detailed coding template mentioned earlier was used as a strong guideline. As an initial step before final coding, two of the authors of the present report (Coder 2 experienced in meta-analyses; Coder 1 experienced in navigation research and fully trained in meta-analytic coding by Coder 2) independently coded 25 studies accounting for 59 effect sizes for a total of 1,711 entries. This coding process involved 29 variables (again, not all considered in moderator analyses): authors, year of publication, author ID, sample ID (a crucial variable for multilevel analysis), publication status, mean age of sample, experiment number, sample origin, number of males, number of females, study location, task goals, testing medium, outcome measure, route direction, timing conditions, perspective, locomotion, route selection, environment, familiarity, learning interval, guide assistance, feedback, device assistance, cues and the calculated effect size. This independent coding resulted in 173 disagreements, representing 1,538 agreements over 1,711 entries (29 variables × 59 effect sizes), for a .899 agreement rate (Cohen’s kappa = .798). The two coders had extensive discussions to elucidate points of disagreement, and Coder 1 then proceeded with coding the remainder of the studies. At completion of coding, Coder 2 independently coded a new set of 50 studies (accounting for 77 effect sizes) from the final sample. In this case, the total of 2,233 entries (29 variables × 77 effect sizes) produced only nine disagreements, resulting in an interrater reliability of 99.6% (2,224 agreements/2,233 total entries; Cohen’s kappa = .992). This high interrater reliability supports the consistency of the final coding.
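The agreement rates reported above can be recomputed from the entry counts. The Python sketch below does so and also illustrates the general kappa formula, kappa = (p_o − p_e) / (1 − p_e). The chance-agreement value p_e is not reported in the text; the value of .50 used here is our assumption for illustration only, although with it the formula does return the reported kappas. Function and variable names are ours.

```python
def agreement_rate(n_variables, n_effect_sizes, n_disagreements):
    """Proportion of coded entries on which the two coders agreed."""
    total_entries = n_variables * n_effect_sizes
    return (total_entries - n_disagreements) / total_entries


def cohen_kappa(p_observed, p_expected):
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    return (p_observed - p_expected) / (1 - p_expected)


# Counts reported in the text for the two coding rounds.
p_initial = agreement_rate(29, 59, 173)  # 1,538 of 1,711 entries
p_final = agreement_rate(29, 77, 9)      # 2,224 of 2,233 entries

# ASSUMPTION: chance agreement is not reported; .50 is illustrative.
P_CHANCE = 0.5
kappa_initial = cohen_kappa(p_initial, P_CHANCE)
kappa_final = cohen_kappa(p_final, P_CHANCE)

print(f"initial: agreement = {p_initial:.3f}, kappa = {kappa_initial:.3f}")
print(f"final:   agreement = {p_final:.3f}, kappa = {kappa_final:.3f}")
```

In practice, p_e would be derived from each coder's marginal category frequencies, variable by variable, rather than fixed in advance.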

Measure of effect size

The effect size measure was the standardized mean difference calculated as the mean for males minus that for females divided by the pooled standard deviation (Cohen’s d; Cohen, 1988). The assumption is that men would perform better than women based on the literature presented so far. Thus, a positive effect size reflects a male advantage, and a negative effect size reflects a female advantage in spatial orientation tasks. The effect size calculation was based on Cohen’s (1988) formula when means and standard deviations were available, which was the case for 372 out of the 694 effect sizes (53.6%). The remaining cases were available only with an inferential statistic (typically t test, p, r, or F); for these, the formulae presented by Lipsey and Wilson (2001) were used. In all cases, effect sizes were computed by the calculator provided on David Wilson’s webpage (http://mason.gmu.edu/~dwilsonb/downloads/ES_Calculator.xls). Following recommendations by Hedges and Becker (1986), a small sample correction was applied to all effect sizes. When an effect size was not significant and no means or inferential statistics values were presented, authors were contacted by e-mail for more information. Out of the 19 authors who were contacted for that purpose, six replied and provided usable data. For the remainder, as suggested by Rosenthal (1991), we kept an effect size of zero to avoid excluding relevant work. Note, however, that in some of the cases in the table, zero was the actual effect size value.
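To make the computation concrete, the following Python sketch shows the standard textbook versions of these steps: Cohen's d from means and standard deviations, the conversion from an independent-samples t statistic, the usual small-sample correction factor 1 − 3/(4N − 9), and the large-sample sampling variance of d. These are the commonly cited forms and are assumed here; the text does not spell out the exact expressions used, and the input values below are hypothetical.

```python
import math


def pooled_sd(sd_m, n_m, sd_f, n_f):
    """Pooled standard deviation of the male and female groups."""
    return math.sqrt(
        ((n_m - 1) * sd_m ** 2 + (n_f - 1) * sd_f ** 2) / (n_m + n_f - 2)
    )


def cohens_d(mean_m, sd_m, n_m, mean_f, sd_f, n_f):
    """Male mean minus female mean, in pooled-SD units (positive = male advantage)."""
    return (mean_m - mean_f) / pooled_sd(sd_m, n_m, sd_f, n_f)


def d_from_t(t, n_m, n_f):
    """Convert an independent-samples t statistic to d."""
    return t * math.sqrt((n_m + n_f) / (n_m * n_f))


def small_sample_correction(d, n_m, n_f):
    """ASSUMED form of the correction: d * (1 - 3 / (4N - 9)), N = n_m + n_f."""
    return d * (1 - 3 / (4 * (n_m + n_f) - 9))


def sampling_variance(d, n_m, n_f):
    """Approximate sampling variance of d, used later for precision weights."""
    return (n_m + n_f) / (n_m * n_f) + d ** 2 / (2 * (n_m + n_f))


# Hypothetical study: 20 men (M = 75, SD = 10) and 20 women (M = 70, SD = 10).
d = cohens_d(75, 10, 20, 70, 10, 20)
g = small_sample_correction(d, 20, 20)
v = sampling_variance(d, 20, 20)

# The same study reported only as a t statistic yields the same d.
t_stat = d / math.sqrt(1 / 20 + 1 / 20)
d_t = d_from_t(t_stat, 20, 20)

print(f"d = {d:.3f}, corrected = {g:.3f}, variance = {v:.4f}, d from t = {d_t:.3f}")
```

The sketch assumes higher scores mean better performance; the sign convention simply follows the male-minus-female definition given above.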

Data analysis

As is a typical goal of most meta-analyses focusing on sex differences, we aimed to quantify the overall magnitude of sex differences in spatial navigation and to identify variables that might moderate these sex differences. A valid examination of specific tasks and potential moderators required us to retrieve multiple effect sizes that are nonindependent. Using these effect sizes in a fixed or random effects meta-analysis would violate the assumption that effect sizes should be independent (Borenstein et al., 2009), and this would distort the statistical analyses (Bateman & Jones, 2003). Accordingly, we relied on the multilevel linear modeling (MLM) approach to meta-analysis, as it does not require independence of effect sizes and it easily handles the type of hierarchical design represented in meta-analysis (Raudenbush & Bryk, 2002). As the standard error calculated for each effect size in a meta-analysis reflects an estimate of the variance for individual effect sizes (see Borenstein et al., 2009), multilevel meta-analysis represents a “variance-known” hierarchical linear model resulting in the precision weighted estimates of effect sizes typical of meta-analytic results (Raudenbush & Bryk, 2002).
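To make the idea of precision weighted estimates concrete, the following minimal sketch computes an inverse-variance weighted mean together with the classic heterogeneity statistics Q and I². It is deliberately a single-level simplification that treats effect sizes as independent, which is exactly the assumption the multilevel model avoids, so it illustrates the weighting principle only. The function name and the toy values are assumptions.

```python
import math

def precision_weighted_summary(effects, variances):
    """Inverse-variance weighted mean with simple heterogeneity statistics."""
    weights = [1 / v for v in variances]
    total_w = sum(weights)
    mean = sum(w * d for w, d in zip(weights, effects)) / total_w
    se = math.sqrt(1 / total_w)
    # Q: weighted sum of squared deviations from the weighted mean.
    q = sum(w * (d - mean) ** 2 for w, d in zip(weights, effects))
    df = len(effects) - 1
    # I2: percentage of variability beyond what sampling error predicts.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return {"mean": mean, "se": se, "z": mean / se, "Q": q, "df": df, "I2": i2}

# Hypothetical effect sizes and sampling variances.
summary = precision_weighted_summary([0.2, 0.4, 0.6], [0.04, 0.02, 0.04])
print(summary)
```

More precise effect sizes (smaller variances) pull the estimate toward themselves. A multilevel model may partition heterogeneity differently, so its I² need not equal the value given by this single-level formula.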

As a starting point, and similar to the approach used by Voyer, Voyer, and Saint Aubin (2017), we computed an overall analysis and moderator analysis on the whole sample. The overall multilevel analysis was computed by examining the data organized in two levels: effect sizes nested within samples. This overall structure reflected 694 effect sizes (Level 1) nested within 266 samples (Level 2). This large number of Level 1 and Level 2 units had the advantage of maximizing power for the identification of significant moderators and of providing a more complete documentation of the overall findings in the available data. The variables task, year of publication, mean age of the sample, age coded categorically, testing medium, outcome measure, route direction, timing conditions, perspective, route selection, environment, familiarity, learning interval, hints, feedback, device assistance, and cues were considered in the moderator analysis.

As a second step, after demonstrating that the different tasks differed at some basic level (as reflected in the finding that they produced effect sizes of different magnitude), we performed a moderator analysis separately for each task. This additional set of analyses was required to recognize the fact that some of the moderators might be confounded with task. For example, pointing tasks generally produce a deviation score as an outcome variable. These analyses also used the multilevel approach considering that all the task groupings included nonindependent effect sizes.

All meta-analytic computations were performed with the metafor package in the R statistical software (Viechtbauer, 2010). Effect sizes were treated as random effects whereas moderators were treated as fixed effects. As previously mentioned, the observed values obtained in this approach reflect precision weighted estimates of effect sizes (Raudenbush & Bryk, 2002). In addition, significance testing for multilevel models used robust standard errors for added precision, as they can be easily obtained with the robust command in the metafor package. However, as a small number of Level 2 clusters (i.e., samples) can bias the calculation of robust standard errors, the appropriate correction built into the robust procedure in metafor was implemented. Note that, when robust standard errors are used, an F test is reported instead of the more common between-groups Q test.

Categorical independent variables were dummy coded into k − 1 dichotomous vectors (where k represents the number of categories) for consideration in the analysis, whereas continuous moderators were mean centered. In all moderator analyses, moderators were examined one at a time in models as there was no a priori basis to justify the examination of multifactor models or interactions. In addition, only effects significant with p < .05 are presented in the Results section. This means that any moderator that is not mentioned in the results was nonsignificant.
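The two coding steps described above can be sketched in a few lines. This is an illustration only: the helper names, the choice of reference category, and the toy values are assumptions, and the actual analyses were run in R.

```python
def dummy_code(values, reference):
    """Code a categorical variable into k - 1 dichotomous (0/1) vectors."""
    levels = sorted(set(values) - {reference})
    return [{f"is_{lvl}": int(v == lvl) for lvl in levels} for v in values]

def mean_center(values):
    """Subtract the mean so that zero corresponds to the average value."""
    m = sum(values) / len(values)
    return [v - m for v in values]

# Hypothetical moderator values for four effect sizes.
environment = ["indoor", "outdoor", "water maze", "indoor"]
vectors = dummy_code(environment, reference="indoor")
print(vectors)

# Hypothetical publication years.
centered = mean_center([1995, 2005, 2015])
print(centered)
```

With k = 3 environment categories, each effect size receives k − 1 = 2 indicator values, and the reference category is the one coded zero on both.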

Results

A preliminary analysis was conducted to identify outliers. Following recommendations by Tabachnick and Fidell (2007), we defined outliers as effect size values that were more than 3.29 standard deviations away from the grand mean. Five outliers were identified based on this criterion. However, as such a number of outliers should be expected in a comprehensive sample, they were preserved as is for the sake of completeness, although they are identified by a star (*) in Table 1. The final sample, therefore, consisted of 694 effect sizes drawn from 266 independent samples, reflecting combined results from 9,435 males and 9,570 females.
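The outlier criterion above amounts to a simple screening rule. The sketch below assumes an unweighted grand mean and the sample standard deviation, neither of which is specified in the text, and uses made-up effect sizes.

```python
import statistics

def flag_outliers(effects, cutoff=3.29):
    """Flag effect sizes more than `cutoff` standard deviations from the mean."""
    mean = statistics.mean(effects)
    sd = statistics.stdev(effects)  # sample SD; an assumption
    return [abs(d - mean) / sd > cutoff for d in effects]

# Hypothetical effect sizes: 30 typical values and one extreme value.
effects = [0.3] * 15 + [0.4] * 15 + [3.0]
flags = flag_outliers(effects)
print(flags.count(True), flags[-1])
```

Flagged values can then be marked rather than dropped, which is the choice described above.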

Overall meta-analysis

Overall sex differences in spatial navigation

A null model where the test of significance for the intercept is examined (Raudenbush & Bryk, 2002) provided data on the overall magnitude of sex differences in spatial navigation based on the current sample of studies. This analysis produced a mean estimated d of 0.341 (95% confidence interval (CI) [0.302, 0.380]), indicating that males significantly outperformed females on spatial navigation tasks, z = 17.05, p < .001. Having considered this initial finding, it is important to remember that when authors reported sex differences as nonsignificant but provided no information for effect size coding, we entered an effect size of zero for these studies. Of course, we contacted authors to obtain clarifications but were still left with 62 cases where the effect size was coded as zero because of the lack of additional data. Accordingly, the estimate presented above might underestimate the actual magnitude of sex differences in spatial navigation. With this in mind, we removed these 62 effect sizes and conducted a second overall analysis. In this analysis, we found a mean estimated d of 0.381 (95% CI [0.340, 0.423]). Therefore, it might be more appropriate to state that the true estimate of sex difference in spatial navigation is found within a range from 0.341 to 0.381. In any case, the remainder of the analyses preserved all 694 effect sizes in an attempt to provide a report on the complete data set. Regardless of which sample is used, however, results also showed that the overall effect was heterogeneous, Q(693) = 1,473.11, p < .001, I² = 50.4% (for the complete data set). This fact suggests that the overall estimate of effect size is not representative of the sample of effect sizes. Accordingly, the examination of potential moderators was undertaken to attempt to account for this variability.
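As a rough consistency check on the figures above, the standard error and the test statistic can be recovered from a reported confidence interval. The sketch assumes a symmetric interval built with a normal critical value of 1.96; because the published bounds are rounded and robust standard errors were used, the result should land close to, not exactly on, the reported z.

```python
def z_from_ci(estimate, lower, upper, crit=1.96):
    """Back out the standard error and z statistic from a symmetric CI."""
    se = (upper - lower) / (2 * crit)
    return se, estimate / se

# Values reported for the complete data set.
se, z = z_from_ci(0.341, 0.302, 0.380)
print(round(se, 4), round(z, 2))
```

The same function can be applied to the second analysis (d = 0.381, 95% CI [0.340, 0.423]) to compare the precision of the two estimates.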

Moderators of sex differences in the overall sample

The moderator analysis revealed that task goal accounted for significant variance in effect sizes, F(7, 258) = 4.38, p < .001. Estimated effect sizes for this variable are presented in Table 2. The finding that none of the 95% confidence intervals contain zero indicates that a significant male advantage was observed for all task categories. In addition, multiple comparisons among means based on the robust standard errors and using the Tukey HSD method at the .05 level showed that recall and pointing tasks produced significantly larger effects than distance, learning, and verbal instructions tasks. No other differences achieved significance (all ps > .057).

Table 2 Summary for significant moderators in the overall meta-analysis

Outcome measure was also a significant moderator of the effect sizes, F(3, 262) = 7.04, p < .001, with estimated effect sizes presented again in Table 2. As we have seen in all cases so far, none of the confidence intervals contained zero, indicating a significant male advantage in all categories. Tukey HSD multiple comparisons showed that deviation scores produced significantly larger effects than accuracy and distance measures, whereas response time produced larger effects than distance. No other differences achieved significance (all ps > .11).

Timing condition was also found to contribute significantly to variance in effect sizes, F(1, 264) = 4.54, p = .034, with estimated effect sizes also presented in Table 2. Examination of Table 2 indicates that, based on confidence intervals, both timed and untimed administration produce a significant male advantage, although it appears that the sex differences are larger for timed than for untimed tasks.

Environment produced a significant effect, F(3, 262) = 4.01, p = .008, with estimated effect sizes presented in Table 2. As none of the confidence intervals contained zero, a significant male advantage was found in all categories. Tukey HSD multiple comparisons showed that water mazes and combination of indoor–outdoor environments produced significantly larger effects than studies that used exclusively indoor or outdoor environments. No other differences achieved significance (smallest p = .56).

Finally, even though mean age of the participants as a continuous variable failed to account for significant variance (p = .94), age defined categorically produced a significant contribution to variance, F(4, 261) = 4.74, p = .001. Once more, the estimated mean effect sizes are presented in Table 2. Again, none of the confidence intervals contained zero, indicating a significant male advantage in all age categories. Tukey HSD multiple comparisons showed that studies sampling participants below the age of 13 produced significantly smaller effects than participants in all other age groups. In addition, the 30 to 49 category produced a smaller effect than the 13 to 17 category. No other differences achieved significance (all ps > .07).

Task subgroup analysis

As a starting point to the task subgroup analyses, we examined the overall homogeneity statistic within each subgroup to ensure that there was significant heterogeneity in the effect sizes that required explanation by moderators before proceeding. Results of this analysis showed that learning and distance task goals produced homogeneous effect sizes (p = .330, I² = 22.8% for learning; p = .474, I² < 0.01% for distance). Accordingly, the effect sizes presented in Table 2 for these two task goals can be considered representative of the state of affairs. However, for the remaining tasks, significant heterogeneity was observed (recall/recognition: p < .001, I² = 47.4%; cardinal directions: p < .001, I² = 75.8%; landmark position: p < .001, I² = 53.9%; maps: p < .001, I² = 71.3%; pointing: p < .001, I² = 41.2%; verbal instructions: p = .015, I² = 21.8%). Accordingly, moderator analyses proceeded separately for recall/recognition, cardinal directions, pointing, landmark position, maps, and verbal instructions tasks, as presented in the following sections. However, to provide a clearer picture of what specific moderators accounted for significant variability, the results are structured as a function of moderator in what follows.

Perspective

Results showed that perspective was a significant moderator only in landmark position tasks, F(2, 11) = 9.21, p = .001. Estimated means, presented in Table 3, show that only the route perspective category produced a significant male advantage. Tukey HSD comparisons showed that a route perspective produced larger effects than a combination of route-survey perspectives. No other comparisons produced significant differences among the mean estimated effect sizes (smallest p = .24).

Table 3 Results of the analysis for perspective as a moderator

Outcome measure

Outcome measure was a significant moderator for all the task goals examined in the subgroup analysis (largest p = .006), with the estimates presented in Table 4. On pointing tasks, all the categories except response time indicated a significant male advantage. Tukey HSD comparisons showed that deviation measures produced larger sex differences than response time and distance measures. No other differences achieved significance (smallest p = .054).

Table 4 Results of the analysis for outcome measure as a moderator

For recall/recognition tasks, the estimates indicated a significant male advantage in all categories. Tukey HSD comparisons showed that response time measures produced a significantly larger effect than accuracy and distance measures. No other differences achieved significance (smallest p = .25).

For cardinal direction tasks, all the categories except response time indicated a significant male advantage. Tukey HSD comparisons showed that accuracy and deviation measures produced larger sex differences than response time measures. No other differences achieved significance (p = .56).

For map tasks, the estimates presented in Table 4 indicated a significant male advantage on accuracy and deviation scores, but not on distance and response-time measures. Tukey HSD comparisons showed that deviation measures produced significantly larger effects than distance measures. No other differences achieved significance (smallest p = .071).

Estimates relevant to landmark position tasks showed that only the “other” category, combining one each for distance and response time measures, produced no significant male advantage, despite seemingly large effects. Tukey HSD comparisons showed a larger effect for deviation than accuracy measures. There were no other significant differences among the effect sizes (smallest p = .15).

Finally, on verbal instructions tasks, accuracy and response-time measures indicated a significant male advantage, whereas deviation and distance scores did not. Tukey HSD comparisons showed that accuracy and response time produced larger sex differences than distance measures did. No other differences achieved significance (smallest p = .22).

Route direction

Route direction was a significant moderator of effect sizes for pointing tasks, F(2, 68) = 3.19, p = .048, and verbal instructions tasks, F(1, 13) = 19.93, p < .001, with estimates presented in Table 5. For pointing, results showed a significant male advantage when participants went forward on the learned route or when such instructions were not applicable, but not for the backward route. Tukey HSD comparisons showed that a backward route produced significantly smaller effects than the “not applicable” category. No other difference achieved significance (smallest p = .09).

Table 5 Results of the analysis for route direction as a moderator

For verbal instruction tasks, the estimates presented in Table 5 indicated a significant male advantage only for the “Not applicable” category. Direct interpretation of the estimated means seen in Table 5 suggests that sex differences were larger when route direction was not applicable than when the forward direction was followed.

Route selection

Route selection was a significant moderator only in cardinal direction tasks, F(3, 10) = 361.61, p < .001. Relevant estimates, seen in Table 6, indicate that the male advantage was not significant only when a free choice was allowed. Tukey HSD multiple comparisons showed that shortcuts produced significantly larger effects than all other groups. Free choice and exact way also produced smaller effects than when this moderator was not applicable. No other differences achieved significance (all ps > .055).

Table 6 Results of the analysis for route selection as a moderator

Timing condition

Timing condition was a significant moderator in map tasks, F(1, 29) = 8.16, p = .008, and verbal instructions tasks, F(1, 13) = 34.18, p < .001, with estimates presented in Table 7. For map tasks, the results reflected the overall finding that although the male advantage was significant regardless of timing conditions, it was significantly larger when a timed administration was used. In contrast, on verbal instructions tasks, timed conditions resulted in a female advantage, whereas untimed conditions produced a male advantage.

Table 7 Results of the analysis for timing condition as a moderator

Environment

Environment was a significant moderator of effect sizes in pointing tasks, F(2, 68) = 9.60, p < .001, and cardinal direction tasks, F(2, 11) = 18.34, p < .001, with estimates presented in Table 8. In both task goals, the male advantage was significant regardless of environmental conditions. For pointing, Tukey HSD comparisons showed that testing in both environments (indoor–outdoor) produced significantly larger effects than testing indoor or outdoor singly. No other difference achieved significance (smallest p = .51).

Table 8 Results of the analysis for environment as a moderator

For cardinal direction tasks, Tukey HSD comparisons showed that water-maze environments produced significantly larger sex differences than indoor or outdoor environments did. The difference between these last two categories did not achieve significance (p = .65).

Familiarity

Familiarity was only a significant moderator for cardinal direction tasks, F(2, 11) = 22.57, p < .001, with the estimates presented in Table 9, and a significant male advantage was found for all categories. Tukey HSD comparisons showed that familiar locations produced larger sex differences than unfamiliar ones did, whereas remaining differences did not achieve significance (smallest p = .07).

Table 9 Results of the analysis for familiarity as a moderator

Feedback

The presence of feedback was a significant moderator for pointing tasks, F(2, 28) = 9.68, p < .001, and map tasks, F(2, 28) = 32.56, p < .001. Estimates are presented in Table 10. For pointing, the estimated effect sizes indicated a significant male advantage in all categories. Tukey HSD comparisons showed that immediate feedback and the “not reported” category produced significantly larger effects than no feedback. The difference between the first two categories did not achieve significance (p = .43).

Table 10 Results of the analysis for feedback as a moderator

For map tasks, a significant male advantage was found under immediate or no feedback, but not for the “not reported” categories. Tukey HSD comparisons showed that immediate feedback produced significantly larger effects than no feedback or the “not reported” category did. The difference between these last two categories did not achieve significance (p = .97).

Learning interval

Learning interval produced a significant contribution to variance only in recall/recognition tasks, F(3, 127) = 4.03, p = .009. As seen in the estimates presented in Table 11, among studies for which a learning interval was applicable, only immediate testing produced a significant male advantage. Studies where the interval was not applicable also produced a significant male advantage. Formal Tukey HSD comparisons showed that effects in long time intervals were significantly smaller than those in the immediate interval or the “not applicable” case. No other differences achieved significance (smallest p = .20).

Table 11 Results of the analysis for learning interval as a moderator

Age

Age defined categorically was also a significant moderator for pointing tasks, F(3, 67) = 6.82, p = .007, and landmark position tasks, F(2, 22) = 7.53, p = .003. Estimated mean effect sizes are presented in Table 12. For pointing, results indicated a significant male advantage in all the represented age categories except in the ages 30 to 49 category. Tukey HSD multiple comparisons showed that the 13 to 17 category produced significantly larger effects than all other groups. In addition, the less than 13 category produced a smaller effect than the 18 to 29 category. No other differences achieved significance (all ps > .16).

Table 12 Results of the analysis for age categories as a moderator

In landmark position tasks, all the represented age categories showed a significant male advantage. Tukey HSD comparisons showed a significantly larger effect for the 18 to 29 and 30 to 49 age samples than for samples below the age of 13. The 18 to 29 and 30 to 49 groupings did not differ from each other (p = .71).

Publication bias and the file drawer problem

Despite our best efforts to obtain unpublished work, the present meta-analysis consists mostly of data obtained from published studies. In such a case, it is often assumed that the final sample might not be representative of the entire population of studies in existence (Rosenthal, 1979). Such a situation raises the possible influence of the “file-drawer problem” (Sterling, 1959), suggesting that studies producing nonsignificant results or in the unexpected direction (a female advantage in our case) have a lower probability of publication. This putative publication bias has the potential to affect any meta-analytic results so that by including mostly published studies, meta-analytic findings might exaggerate the magnitude of the effect under consideration.

The simplest way to examine the potential influence of the file-drawer problem is to compare the mean estimated effect sizes for samples obtained from published and unpublished research. We therefore proceeded with such an analysis in the overall sample. We also divided research into three rather than two categories. Of course, whether a data source was clearly published or unpublished formed two of the categories. However, theses and dissertations were considered as a third category because their status is uncertain in relation to the file-drawer problem. Specifically, one reason why they are not published might simply be that the author of the thesis did not pursue publication. This resulted in 614 published effect sizes (239 samples), 70 effect sizes from theses (24 samples), and only 10 effect sizes from unpublished sources (three samples). With this in mind, the analysis using publication status as moderator showed no significant influence of publication status, F(2, 263) = 0.72, p = .488. This suggests no evidence of a publication bias in the present sample.

One might sensibly argue that the small number of unpublished studies that we were able to obtain reduces the value of our examination of publication status as a moderator in testing for a publication bias. Based on this argument, and despite potential drawbacks, the Egger, Davey Smith, Schneider, and Minder (1997) approach was used as a further way to test a potential publication bias in the present sample. This method makes the assumption that studies with a small sample and a small effect size are less likely to get published (Borenstein et al., 2009). Therefore, in the presence of publication bias, plotting precision (the inverse of the variance; y-axis) against effect size (x-axis) would produce an asymmetrical distribution with few values on the bottom left-hand side of the plot, where small samples and negative effects would belong. Accordingly, the present data are shown in such a funnel plot in Fig. 2. A visual inspection of Fig. 2 reveals no sign of asymmetry. However, visual examination of the plot is not sufficient, and Egger et al. have formalized this process mathematically.

Fig. 2 Funnel plot of precision (inverse variance) as a function of Cohen’s d (observed outcome) for the whole sample

Specifically, the Egger et al. (1997) method allows examination of a possible publication bias by regressing the standard normal deviate for the effect size on precision. If there is no publication bias, the regression line should run through the origin and the intercept of the regression equation should not be significantly different from zero. Egger et al. recommended a significance level of .10 to maximize power. Following the approach proposed by Viechtbauer (see http://stats.stackexchange.com/questions/155693/metafor-package-bias-and-sensitivity-diagnostics), the inverse standard error was used as a moderator in the multilevel analysis with the standard normal deviate for the effect sizes as outcome variable. Results of the Egger et al. test showed that the intercept did not differ significantly from zero at the .10 level, with an intercept estimate of 0.065 (90% CI [−0.242, 0.375]). Therefore, the Egger et al. approach failed to support the presence of a publication bias in our data. This finding was further supported by results of a trim and fill analysis, an approach in which data points missing as a result of funnel plot asymmetry are imputed (see Duval & Tweedie, 2000). Specifically, this analysis indicated that no imputed data points were required and confirmed that the overall effect size remained unchanged from the original analysis.
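The logic of the regression test can be illustrated with an ordinary least squares sketch. This is a simplification under stated assumptions: it ignores the nesting of effect sizes within samples that the multilevel version accounts for, and the function name and toy data are hypothetical.

```python
import math

def egger_regression(effects, std_errors):
    """Regress standard normal deviates (d / SE) on precision (1 / SE)."""
    y = [d / se for d, se in zip(effects, std_errors)]
    x = [1 / se for se in std_errors]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    # Residual variance and the usual OLS standard error of the intercept.
    resid_var = sum((yi - intercept - slope * xi) ** 2
                    for xi, yi in zip(x, y)) / (n - 2)
    se_intercept = math.sqrt(resid_var * (1 / n + mx**2 / sxx))
    return intercept, slope, se_intercept

# Hypothetical data with no small-study effect: every study estimates 0.3.
intercept, slope, se_intercept = egger_regression(
    [0.3, 0.3, 0.3, 0.3], [0.1, 0.2, 0.25, 0.5]
)
print(round(intercept, 6), round(slope, 6))
```

In this toy case the line passes through the origin and the slope recovers the common effect, which is the pattern expected without publication bias. An intercept clearly different from zero would instead signal funnel plot asymmetry.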

Discussion

The present meta-analysis aimed to summarize the available literature on sex differences in human spatial navigation and to examine potential moderators of these effects. The results of the analyses are summarized in Table 13. These results should be seen as a guide to forward thinking in this area, although it is important to remember that, because of their quasi-experimental nature, meta-analytic results do not allow causal conclusions. Accordingly, our discussion provides speculations intended to stimulate empirical assessment in future work.

Table 13 Summary of results for significant moderators

Overall results

The overall effect for our 694 effect sizes was estimated as a d of 0.34 (up to 0.38 if effect sizes coded as zero are omitted). In Cohen’s (1988) classification, this effect would be considered small to medium. Interestingly, it is in line with the overall effect reported by Voyer, Voyer, and Bryden (1995) in their examination of small-scale spatial ability sex differences (overall d of 0.37). In addition, the effect size is in line with the range of sex differences reported in a large international study using the game Sea Hero Quest (Coutrot et al., 2018). This study had an extremely large sample size, but was reliant on self-reports of sex and was vulnerable to factors such as more than one player using the game. Thus, the alignment of those results with those in the meta-analysis supports the validity of the conclusion that there are sex differences, albeit of small to moderate size.

Furthermore, results showed that a publication bias is unlikely to account for the present findings. To put these estimates in context, consider that for a d of 0.34, the distributions of men and women in spatial navigation overlap by approximately 86%, which, at first glance, might seem to be a large overlap. However, it also means that about 64% of men score above the mean of women (and therefore, 36% of men score below the mean of women), and thus the difference is arguably not negligible. In a more practical fashion, assuming normal distributions with equal variances, in any pair composed of a randomly selected man and a randomly selected woman, the man would have about a 60% probability of scoring above the woman, and about a 40% probability of scoring below her. Navigation training in regular schooling or in informal activities might be helpful to narrow sex differences, in view of the malleability of spatial abilities more generally (Uttal et al., 2013), although navigation-specific training needs development. There have only been a few efforts to improve mapping and wayfinding skills, although what prior efforts exist have seen some success (Kastens & Liben, 2007; Nazareth, Newcombe, Shipley, Velazquez & Weisberg, in press). The goal of such training would be not only to narrow the sex gap but also to benefit individuals of both sexes in the long term.
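The overlap and the proportion of men above the female mean can be approximated from d alone. The sketch assumes two normal distributions with equal variances, an assumption not stated in the text, and the function names are illustrative.

```python
from statistics import NormalDist

def overlap_coefficient(d):
    """Shared area of two unit-variance normal curves whose means differ by d."""
    return 2 * NormalDist().cdf(-abs(d) / 2)

def proportion_above_other_mean(d):
    """Share of the higher-scoring group lying above the other group's mean."""
    return NormalDist().cdf(d)

d = 0.34
print(round(overlap_coefficient(d), 3), round(proportion_above_other_mean(d), 3))
```

For d = 0.34 both quantities come out close to the rounded percentages cited above. Running the same functions with d = 0.38 shows how little the picture changes across the estimated range.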

Moderator analyses

The discussion of moderator analyses builds on Table 13. Readers should refer to Tables 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, and 13 for the relevant summary values.

Task goal

Task goal was a significant moderator of sex differences in navigation skills. Although the numerically largest effects were found in cardinal direction tasks (based on a relatively small number of effect sizes; see Table 2), statistical results showed only that recall and pointing tasks produced significantly larger effects than distance, learning, and verbal instructions tasks, with no other differences achieving significance. The smaller sex difference for distance tasks might reflect a floor effect as both men and women tend to perform poorly on such tasks as a result of their inherent difficulty (Newcombe, 1985). For verbal instructions, the smaller sex difference might reflect the female advantage in language tasks and courses (Voyer & Voyer, 2014), providing them with some amount of compensation in relation to men’s performance. Finally, for learning tasks, although the male advantage is significant, the reduced magnitude of the effect size might reflect the possibility that women can catch up with men in paradigms that allow for learning.

Outcome measure

Outcome measure was a significant moderator of sex differences in navigation skills in the overall analysis (see Table 2). In addition, it was a significant moderator for all task goals considered in the subgroup analysis (see Table 4). For all task goals except for verbal instructions, deviation scores produced the largest male advantage, perhaps because deviation scores provide an unusually sensitive measure of precision and accuracy. For verbal instructions, accuracy and response time produced the largest male advantage. Although distance measures did not produce a significant sex difference under verbal instructions, the fact that they reflect a trend favoring women is intriguing, especially considering that this is one of only three effect sizes with a negative sign in the tables of results. Conclusions on this finding are limited by the small number of relevant effect sizes (k = 3; see Table 4). However, the investigation of distance measures with verbal instructions tasks might provide a fruitful avenue in future work to better understand why women might process navigation tasks efficiently in this context.

Route direction

Route direction was a significant moderator for pointing and verbal instruction task goals (see Table 5). However, the results with route direction are not particularly informative because the reverse route condition was not used as a manipulation with verbal instructions. For pointing, a similar problem arises, with only five of 107 studies using a reversed direction. The findings show that sex differences are small and not significant under such conditions. One possible way to account for this finding is that the additional cognitive load required for a route followed in the reversed direction might reach a point where it even exceeds the men’s ability to handle the extra information, resulting in floor effects.

Timing condition

Timing condition was a significant moderator of sex differences in navigation skills in the overall sample (see Table 2). It was also significant in maps and verbal instructions tasks (see Table 7). Except for verbal instructions, the male advantage was larger under timed than under untimed conditions. The role of timing condition might suggest the presence of sex differences in speed of processing for complex spatial information. However, it is important to keep in mind that timed tasks accounted for only 16% of effect sizes in the overall sample (111 out of 694) and only 8% for map tasks (five out of 62). Furthermore, the findings for verbal instructions are the opposite of what would be expected, with a significant female advantage observed for timed tasks, and a significant male advantage for untimed conditions. The findings for verbal instructions are questionable, however, because all four effect sizes for the timed condition come from the same sample in one study (Ishikawa & Kiyomoto, 2008). It is also important to consider that all effect sizes for distance and pointing were untimed, whereas for other categories the percentage of timed effect sizes was as follows: recall/recognition = 29.5%; cardinal directions = 27.8%; landmark position = 13.9%; learning = 3.6%; and verbal instructions = 12.5%. The small number of effect sizes from timed tasks in most of these task categories suggests that it is premature to draw strong conclusions concerning the role of timing conditions on sex differences in spatial navigation.

Environment

Environment was a significant moderator of sex differences in the overall sample as well as for pointing tasks (see Tables 3 and 9). Sex differences were largest when testing involved a combination of indoor and outdoor environments. Water-maze tasks stood out for producing larger sex differences than indoor environments or outdoor environments. There were no significant differences in effects between the water maze and the combined indoor–outdoor environments.

The large sex differences in combined indoor–outdoor environments might reflect the complexity of such tasks, adding to instances of high task complexity promoting larger sex differences in spatial tasks (Coluccia & Louse, 2004; Heil & Jansen-Osmann, 2008). Environments combining indoor and outdoor settings likely involve switching between egocentric and allocentric wayfinding strategies. Therefore, large sex differences in this context might partly reflect the male advantage in the ability to alternate between strategies (Wang & Carr, 2014).

Feedback

Feedback was a significant moderator of sex differences for maps and pointing tasks (see Table 10). Immediate feedback increased the magnitude of the sex difference. However, the finding that the largest effect for the pointing tasks is for the “not reported” category undermines this conclusion. In fact, it is clear that more research manipulating feedback is required, considering that studies with immediate feedback reflected only 6.5% of effect sizes for maps and 7.4% for pointing.

Age

Age was a significant moderator of sex differences in the overall sample (see Table 2) as well as in pointing and landmark position tasks (see Table 11). It is readily apparent from the data presented in the tables that the group younger than 13 years produced the smallest effect sizes, with a clear increase in magnitude for the 13 to 17 years category. Considering that adolescence is a time of increased independent navigational range (Anooshian & Young, 1981), with sex differences in how far and how frequently children travel away from home, experiential and social norms may play a role in promoting sex differences in navigation skills in the latter age group. However, it is important to note that the 18- to 29-year-old group is overrepresented in the retrieved literature, reflecting 70.5% of the effect sizes examined here (see Table 2). The 13- to 17-year-old group is interesting because it produced the largest effect size presented in Table 2; however, it is also the grouping with the smallest sample (k = 24). These data emphasize the need for more life-span developmental research on spatial navigation.

Moderators significant in only one task

Finally, a number of moderators accounted for significant variance in only one of the task categories. For instance, perspective accounted for significant variance only in landmark position tasks (see Table 3). However, this finding might once again reflect a limited number of studies and does not warrant a lengthy discussion.

Route selection achieved significance only for cardinal direction (see Table 6), showing that the male advantage was largest when a shortcut was required as part of the task response. In fact, the effect size of 1.07 for that category is the largest in all our tables. Speculatively, the use of a shortcut might require deeper processing of the route and result in a better cognitive map, suggesting that greater depth of processing advantages males. However, this reflects another case of a result based on a small number of effect sizes (k = 2). The most parsimonious conclusion here is therefore that this is a finding that requires replication in many more studies before efforts are expended to explain it.

Familiarity with the testing environment moderated cardinal directions (see Table 9), showing a significantly larger male advantage for familiar compared with new locations. This finding could be a side effect of the very small number of effect sizes (k = 2) for familiar settings in cardinal direction tasks.

Finally, learning interval was a significant moderator only for recall/recognition tasks (see Table 11). Longer time intervals did not produce a significant male advantage, whereas immediate time intervals and testing where learning interval was not relevant did. We can speculate that, for longer intervals, the memory load exceeded even men’s abilities and produced a floor effect. Short time intervals did not produce a significant male advantage despite a medium effect size (d = 0.42). However, this category had a very broad confidence interval, reflecting imprecise estimates from small samples. Nevertheless, in terms of actual magnitude, short intervals produced effects similar to those we found with immediate recall, thereby supporting the memory load account to some extent.
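
To illustrate why a medium effect such as d = 0.42 can fail to reach significance when samples are small, the following minimal sketch computes an approximate 95% confidence interval for a single standardized mean difference, using the common large-sample variance approximation. The group sizes are hypothetical and are not taken from any study in the sample; the pooled intervals reported in the tables depend on the full meta-analytic model, so this only conveys the general logic.

```python
import math


def d_confidence_interval(d, n_male, n_female, z=1.96):
    """Approximate 95% CI for Cohen's d from two independent groups.

    Uses the large-sample variance approximation
    var(d) = (n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2)).
    """
    variance = (n_male + n_female) / (n_male * n_female) + d ** 2 / (
        2 * (n_male + n_female)
    )
    se = math.sqrt(variance)
    return d - z * se, d + z * se


# Hypothetical group sizes, chosen only for illustration.
small_sample_ci = d_confidence_interval(0.42, n_male=10, n_female=10)
large_sample_ci = d_confidence_interval(0.42, n_male=100, n_female=100)

print("n = 10 per group: ", small_sample_ci)
print("n = 100 per group:", large_sample_ci)
```

Under these assumed group sizes, the interval for the small sample spans zero whereas the interval for the large sample does not, even though the point estimate is identical.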

Nonsignificant moderators of special importance

Although we found that a large number of variables were significant moderators of sex differences in spatial navigation, some of the moderators failed to achieve significance despite our expectations. A few of those are particularly noteworthy because of their theoretical or practical implications. Specifically, despite the theoretical importance often assigned to cue types (proximal, distal; Padilla, Creem-Regehr, Stefanucci, & Cashdan, 2017), this moderator was not significant. This finding contradicts the assumption of differential hippocampal function in men and women in processing environmental cues. It is, however, consistent with our earlier finding related to the environment moderator. Specifically, it is reasonable to assume that an indoor environment (e.g., closed basement maze) may offer more proximal cues and fewer distal cues in comparison to an outdoor environment (e.g., university campus). Given that there were no significant differences in effects found between indoor and outdoor testing environments, the null effect of cues should not come as a surprise. However, the significantly larger sex differences in combined indoor–outdoor settings may point to sex differences in flexibility in cue processing rather than in the ability to use one or the other cue, and this possibility should be further investigated.

It is also noteworthy that year of publication failed to account for significant variability in effect sizes both in the overall sample and in the separate analyses for each task goal. On the surface, this suggests that the magnitude of sex differences in spatial navigation is unaffected by social changes associated with year of publication (e.g., Feingold, 1988). However, in considering this finding, it is also crucial to keep in mind that year of publication reflected a limited range in our sample, from 1977 to 2018 (at least limited from a statistical perspective), despite our search parameters including research published since 1803 (the lower limit for the search engine by default). This range limitation would have adverse effects on the likelihood of obtaining a significant relation between year of publication and the magnitude of effect sizes, as is always the case in correlational designs (Tabachnick & Fidell, 2007). Accordingly, it would be premature to draw definite conclusions on the current null finding for the moderating effect of year of publication on sex differences in spatial navigation.
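
The attenuating effect of range restriction on a correlation can be illustrated with a small simulation. This is only a sketch under invented assumptions: the linear trend, the noise level, and the uniform spread of publication years are all hypothetical and say nothing about the actual literature. The simulation simply shows that the same underlying relation yields a weaker observed correlation when only years from 1977 onward are sampled.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_range_restriction(seed=0, n=5000, slope=0.01, noise_sd=1.0):
    """Return (full-range r, restricted-range r) for simulated data.

    All parameters are hypothetical: years are drawn uniformly from
    1803-2018 and the outcome has an invented linear trend plus noise.
    """
    rng = random.Random(seed)
    years = [rng.uniform(1803, 2018) for _ in range(n)]
    outcomes = [slope * (year - 1803) + rng.gauss(0, noise_sd) for year in years]
    full_r = pearson_r(years, outcomes)

    # Keep only observations in the range actually covered by the sample.
    recent = [(y, o) for y, o in zip(years, outcomes) if y >= 1977]
    restricted_r = pearson_r([y for y, _ in recent], [o for _, o in recent])
    return full_r, restricted_r


full_r, restricted_r = simulate_range_restriction()
print(f"full range r = {full_r:.2f}, restricted range r = {restricted_r:.2f}")
```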

Limitations

Of course, no comprehensive meta-analysis is without limitations. In particular, throughout the discussion section we mentioned moderator categories where there were too few effect sizes to allow solid conclusions. All these cases reflect areas that require more research and should encourage researchers to direct their efforts to elucidating factors accounting for sex differences in human spatial navigation. In particular, more studies examining cardinal direction and distance task goals might be warranted, considering that these are the categories where there are the fewest effect sizes in our sample (see Table 2). Similarly, it might be worthwhile to conduct more studies comparing timed and untimed conditions in the same experiment.

Future research should also consider manipulating the amount of feedback provided to participants and examine sex differences in improvement in navigation performance that may result from this manipulation. The inability of low performers to use feedback to self-correct their cognitive map of the environment may present an opportunity for training interventions. Finally, large-scale navigation research has been conducted predominantly with undergraduate psychology students, who represent not only a very specific demographic but also a specific stage in neural development. The emergence of sex differences around the age of 13 years emphasizes the need for more developmental research in the 13- to 17-year age group.

The fact that we were able to find only a small number of effect sizes from clearly unpublished work (k = 10) is also a limitation of our analysis, although it is a very common problem for meta-analyses. We found statistical reassurance in the findings that publication status was not a significant moderator and that Egger et al.’s (1997) test produced no evidence of publication bias. These findings are most likely a consequence of our sampling of much research that did not aim primarily at examining sex differences in spatial navigation. Accordingly, we are quite confident that our results are a valid reflection of the current state of affairs for sex differences in human spatial navigation.
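
For readers unfamiliar with the logic of the Egger regression approach, the following minimal sketch regresses each study's standardized effect (effect divided by its standard error) on its precision (the reciprocal of the standard error); an intercept far from zero indicates funnel plot asymmetry. This is not the analysis code used here: the function name and the data are hypothetical, and the sketch computes only the intercept and slope, omitting the significance test on the intercept that a complete test requires.

```python
def egger_intercept(effects, std_errors):
    """Ordinary least squares fit of effect/SE on 1/SE.

    Returns (intercept, slope). The intercept indexes small-study
    asymmetry; the slope estimates the underlying effect.
    """
    precision = [1.0 / se for se in std_errors]
    standardized = [d / se for d, se in zip(effects, std_errors)]
    n = len(precision)
    mean_p = sum(precision) / n
    mean_z = sum(standardized) / n
    sxx = sum((p - mean_p) ** 2 for p in precision)
    sxy = sum((p - mean_p) * (z - mean_z) for p, z in zip(precision, standardized))
    slope = sxy / sxx
    intercept = mean_z - slope * mean_p
    return intercept, slope


# Hypothetical standard errors, from larger (small SE) to smaller studies.
std_errors = [0.10, 0.15, 0.20, 0.30, 0.40]

# Case 1: every study shows the same effect, so there is no asymmetry.
symmetric_effects = [0.35 for _ in std_errors]

# Case 2: smaller studies report larger effects (invented asymmetry).
asymmetric_effects = [0.35 + 1.0 * se for se in std_errors]

print("symmetric: ", egger_intercept(symmetric_effects, std_errors))
print("asymmetric:", egger_intercept(asymmetric_effects, std_errors))
```

In the first invented data set the intercept is essentially zero, whereas in the second, where effect size grows with the standard error, the intercept moves away from zero.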

Conclusions

The present meta-analysis provided the first comprehensive quantitative review of sex differences in human spatial navigation. Overall, the take-home message from our results is that the male advantage in human spatial navigation is small to moderate and, with few exceptions, does not vary much. It is particularly noteworthy that the effect sizes were generally small for children. This observation is congruent with a recent meta-analysis of the development of sex differences in mental rotation (Lauer, Yhang, & Lourenco, 2019). Of course, an increasing male advantage with age could reflect either emerging biological constraints or the cumulative effect of environmental opportunities and expectations. But the pattern constrains the search for causal explanations, in that any satisfactory explanation needs to predict the pattern. For example, had this pattern not been observed, an environmental explanation would have seemed less persuasive. Future work should investigate fine-grained hypotheses, such as relations of daily exploration patterns to growth in wayfinding skills. Cross-cultural research can provide a special purchase on these questions, given that gender roles vary culturally. Studies along these lines are beginning to appear, using both anthropological techniques (e.g., Davis & Cashdan, in press; Wood et al., 2019) and large-scale wayfinding games (Coutrot et al., 2018).

An important observation arising from our results is that, in many cases, significant effects of moderators, especially when occurring in task subgroups, were compromised by small numbers of effect sizes. When such findings were theoretically unanticipated and did not have clear interpretations, they should be seen only as potentially intriguing but preliminary. Another caveat is that some of the effects of moderators may arise in cases where testing conditions or task factors promoted floor or ceiling effects, and these were often compounded by the presence of few effect sizes in the sample. Such findings create “effects” that are theoretically uninteresting and may make a male advantage either more or less pronounced.

We have emphasized in the discussion the cases that require more empirical research, either in their own right or to address issues related to small sample sizes or floor/ceiling effects. In this way, we hope that the research presented here will allow researchers to investigate promising avenues in their future work on spatial navigation and in their efforts to document how performance in such tasks is affected by sex.