Whenever a visual scene is cast onto the retina, much of it will appear degraded
due to poor resolution in the periphery; moreover, optical defocus can cause
blur in central vision. However, the pervasiveness of blurry or degraded input
A hallmark of human vision lies in its robustness to challenging or ambiguous viewing conditions. Consider the difficulties of navigating traffic in a snowstorm, detecting a pedestrian in the corner of one’s eye, or identifying a distant building that is veiled in fog. In laboratory settings, researchers have characterized the robustness of human object recognition to visual noise, blur, and other forms of image degradation1–6. Our ability to recognize objects depends on the ventral visual pathway, which extends from early visual areas (V1-V4) to higher level object-sensitive areas in the occipitotemporal cortex7–14. Neuroimaging studies have investigated the functional selectivity and topographical organization of the ventral visual system, informing our understanding of the neural bases of object recognition under clear viewing conditions9,15 as well as conditions of visual ambiguity16–18.
To understand the neurocomputational bases of object recognition, researchers have sought to develop computational models that can effectively predict the visual system’s responses to complex objects. Indeed, recent studies have found that deep convolutional neural networks (CNNs) trained on tasks of object recognition provide the best current models of the visual system, allowing for reliable prediction of visual cortical responses in humans19–24 and neuronal responses in the macaque inferotemporal cortex25–29. While these initial findings are highly promising, a mounting concern is that CNNs tend to catastrophically fail where humans do not, especially when presented with noisy, blurry or otherwise degraded visual stimuli5,6,30–32. Such findings demonstrate that the computations and learned representations of these CNNs are not truly aligned with those of the human brain.
We considered the susceptibility of CNNs to visual blur to be of particular interest, as blur is pervasive in everyday human vision33,34. A common misperception is that our visual world is entirely clear, when much of what we see is either blurry or processed with low spatial resolution. The density of cone photoreceptors, bipolar
Department of Psychology, Vanderbilt Vision Research Center, Vanderbilt University, Nashville, TN, USA.
Massachusetts Institute of Technology, Cambridge, MA, USA. 3Present address: Department of Brain and Cognitive Engineering, Korea University,
Seoul, South Korea. e-mail: [email protected]; [email protected]
neurons, and ganglion cells decreases precipitously from the fovea to the periphery; thus, only stimuli that appear near the center of gaze can be processed with high spatial resolution35. To capitalize on the much higher resolution of the fovea, humans make multiple eye movements every second to bring objects of interest to the center of gaze36,37. However, eye movements that involve large changes in vergence will cause foveated objects to initially appear blurry, due to the sluggish nature of lens accommodation38,39. Moreover, even after a central object is accurately fixated and accommodated, a parafoveal object that appears at a different depth plane may also appear blurry due to defocus aberration33,34. Thus, low-resolution vision and blur are prominent features of everyday vision. By contrast, the image datasets commonly used to train CNNs predominantly consist of clear, well-focused images40,41.
There is a general bias to consider blurry vision as suboptimal, problematic, and in need of correction. However, such assumptions may overlook the potential contributions of blur for real-world vision and object recognition. For example, humans can leverage blurry contextual information to support more accurate object recognition17,42, and both face and object recognition remain quite robust to substantial levels of blur3,6,43. Recent work has further revealed that defocus blur provides an important cue for depth perception44. Neurophysiological studies have also found that shape-selective neurons in the visual cortex can be tuned to varying degrees of blur, with some neurons preferring blurry over clear depictions of 2D object shapes45. Thus, blur appears to be an important feature that is encoded by the visual system.
Here, we evaluated the hypothesis that the omission of blurry training inputs may cause CNNs to rely excessively on high spatial frequency information for object recognition, thereby causing systematic deviations from biological vision. To address this question, we compared the performance of standard CNNs trained exclusively on clear images with CNNs trained on a combination of clear and blurry images. By testing standard versus blur-trained CNNs on a diverse set of neural, visual, and behavioral benchmarks, we show that blur-trained CNN models significantly outperform standard CNNs at predicting neural responses to object images across a variety of viewing conditions, including those that were never used for training. Unlike standard CNNs, blur-trained CNNs favor the processing of lower spatial frequency information, allowing for greater sensitivity to global object shape. Finally, contrary to the notion that CNNs are very poor at recognizing objects in novel or challenging viewing conditions, we show that CNNs trained on clear and Gaussian-blurred images exhibit greater robustness to multiple forms of blur, visual noise, as well as various types of image compression. From these findings, we conclude that instances of blurry vision are not fundamentally problematic for biological vision; instead blur may constitute a positive feature that can promote the development of more robust object recognition in both artificial and biological visual systems.
Results
Neural predictivity of standard and blur-trained CNNs
We compared the performance of 8 standard CNN models trained on clear images only with the same CNN architectures trained using either weak or strong levels of blur (see Fig. 1). For weak-blur CNN training, clear images (σ = 0) occurred with much greater frequency than blurry images to mimic the extent to which defocus blur would likely occur for future saccade targets in natural viewing tasks34. For the strong-blur CNNs, a Gaussian blur kernel of varying size (σ = 0, 1, 2, 4 or 8 pixels) was applied with equal probability to the training images. This manipulation was informed by the fact that visual acuity systematically declines from the fovea to the periphery35,46, such that varying degrees of spatial resolution are always present in one’s visual experience. Further details regarding CNN model training can be found in the Methods.
We first sought to compare standard and blur-trained CNNs on their ability to account for functional magnetic resonance imaging (fMRI) responses obtained from the human visual cortex while observers viewed clear, low-pass or high-pass filtered object images by analyzing the data from a publicly available neuroimaging dataset47. To do so, we performed representational similarity analysis (RSA) on the response patterns to the set of object images that were found in individual human visual areas and in each layer of a CNN15,19, and then computed the Pearson correlational similarity between the two RSA matrices of interest. Peak correlations were typically observed in an intermediate CNN layer (Supplementary Fig. 1) and this value was used
Fig. 1 | Examples of images used for 3 different CNN training paradigms. For weak-blur and strong-blur CNNs, object images were blurred by Gaussian kernels of varying
width and presented with varying frequencies, as indicated. Images from the authors.
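The blur manipulation illustrated in Fig. 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the training code used in the study: the function names (`gaussian_kernel_1d`, `gaussian_blur`, `sample_blurred`), the reflective padding, and the kernel truncation at 4σ are hypothetical choices. Only the strong-blur rule (σ = 0, 1, 2, 4 or 8 pixels, sampled with equal probability) is taken from the text; the weak-blur sampling frequencies are not given in this section, so they are left as a user-supplied argument.

```python
import numpy as np

# Strong-blur sigmas in pixels, sampled with equal probability (from the text).
STRONG_BLUR_SIGMAS = (0, 1, 2, 4, 8)


def gaussian_kernel_1d(sigma, truncate=4.0):
    """Normalized 1D Gaussian kernel; truncating at 4 sigma is an assumption."""
    radius = int(truncate * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    return kernel / kernel.sum()


def gaussian_blur(image, sigma):
    """Blur an (H, W) or (H, W, C) image along its two spatial axes."""
    if sigma == 0:
        return image  # sigma = 0 denotes a clear image
    kernel = gaussian_kernel_1d(sigma)
    radius = len(kernel) // 2
    out = np.asarray(image, dtype=np.float64)
    for axis in (0, 1):
        # Reflect-pad so the output keeps the input size (padding mode assumed).
        pad = [(0, 0)] * out.ndim
        pad[axis] = (radius, radius)
        padded = np.pad(out, pad, mode="reflect")
        out = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="valid"), axis, padded
        )
    return out


def sample_blurred(image, rng, sigmas=STRONG_BLUR_SIGMAS, probs=None):
    """Draw one blur level and apply it; probs=None means equal probability."""
    sigma = float(rng.choice(sigmas, p=probs))
    return gaussian_blur(image, sigma), sigma
```

A weak-blur regime could be approximated by passing unequal `probs` that favor σ = 0; any specific weights would be placeholders rather than the values used by the authors.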
Fig. 2 | Correlational similarity between CNN model responses and neural responses in different human visual areas to clear, high-pass filtered and low-pass filtered images. A set of 8 different standard CNNs (red), weak-blur CNNs (blue) and strong-blur CNNs (purple) were evaluated using human fMRI data (n = 10) obtained from Xu and Vaziri-Pashkam (2021). Error bars represent ±1 standard error of the mean (SEM). Two-tailed paired t-tests were used to compare the model performance of blur-trained CNNs versus standard CNNs (*p < 0.05, **p < 0.01, and ***p < 0.001, uncorrected for multiple comparisons; the exact p values and raw values are provided in the Source Data). Elephant images from the original publication with permission. LOT lateral occipitotemporal cortex, VOT ventral occipitotemporal cortex.
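The representational similarity analysis behind these correlations can be illustrated with a short sketch. It assumes that each visual area and each CNN layer is summarized as an (images × features) response array, and that dissimilarity is measured as 1 minus the Pearson correlation between response patterns; both choices, along with the function names, are assumptions made for illustration, since the exact procedure is described in the Methods rather than here.

```python
import numpy as np


def rdm(responses):
    """Dissimilarity matrix for an (n_images, n_features) response array.

    Uses 1 - Pearson r between the response patterns of each image pair;
    this dissimilarity measure is an assumption of the sketch.
    """
    return 1.0 - np.corrcoef(responses)


def rsa_similarity(responses_a, responses_b):
    """Pearson correlation between the off-diagonal entries of two RDMs."""
    upper = np.triu_indices(responses_a.shape[0], k=1)
    r = np.corrcoef(rdm(responses_a)[upper], rdm(responses_b)[upper])[0, 1]
    return float(r)


def peak_layer_similarity(brain_responses, layer_responses):
    """Return (index, score) of the CNN layer whose RDM best matches the brain RDM."""
    scores = [rsa_similarity(brain_responses, layer) for layer in layer_responses]
    best = int(np.argmax(scores))
    return best, scores[best]
```

Because only the pairwise dissimilarity structure is compared, the two response arrays may have different numbers of features (voxels versus CNN units), provided that they share the same set of images in the same order.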
can be seen that blur-trained CNNs significantly outperformed standard CNNs at predicting cortical responses in early visual areas (V1-V4) as well as high-level object-sensitive areas (LOT, VOT). Moreover, strong-blur CNNs outperformed weak-blur CNNs in V1 through V4, suggesting that stronger levels of blur training may be particularly beneficial for CNN models to better account for neural responses in early visual areas.
An analysis of individual viewing conditions revealed a similar advantage for blur-trained CNNs in predicting fMRI responses in the early visual cortex to clear images and to high-pass filtered images, although no differences were noted for the low-pass filtered condition. Better prediction of cortical responses to high-pass filtered images is notable, as the blur-trained CNNs were not directly trained on high-pass filtered images. This somewhat counterintuitive result was due to the fact that early visual areas exhibited highly confusable responses to the different high-pass filtered object images, whereas standard CNNs excelled at discriminating high-pass filtered stimuli (see Supplementary Fig. 2). By comparison, blur-trained CNNs exhibited more confusable responses to the high-pass filtered objects that led to closer resemblance to human cortical responses.
Fig. 3 | Correlational similarity between CNN model responses and neural responses in macaque visual areas. A Correlation between predicted and actual neuronal responses in macaque visual areas V1, V2, V4 and IT for regression models based on 8 different standard CNNs (red), weak-blur CNNs (blue) and strong-blur CNNs (purple). The Brain-Score benchmark was employed for data analysis. B Correlation between predicted and actual neuronal responses in macaque V1 to thousands of complex images. Error bars indicate ±1 SEM. Gray dots indicate predictive correlation values of individual CNN models. Two-tailed paired t-tests were performed to determine statistical significance (*p < 0.05, **p < 0.01, and ***p < 0.001, uncorrected; the exact p values and raw values are provided in the Source Data).
We next sought to determine whether blur-trained CNNs might show an advantage at predicting the responses of individual neurons recorded from the macaque visual cortex, as single neurons can exhibit far greater stimulus selectivity than is otherwise possible to obtain from fMRI measures of locally averaged neural activity. We first evaluated a popular dataset called BrainScore48 in which monkeys viewed clear images of objects on natural scene backgrounds while neuronal activity was recorded from areas V1, V2, V4 and inferotemporal cortex (IT). We adopted BrainScore’s regression-based approach of using the layer-wise activity patterns of each CNN to fit each neuron’s response to a set of training images, and then evaluated its ability to predict responses to independent test images (Supplementary Fig. 3). These analyses revealed that strong-blur CNNs were better able to predict V1 responses than standard CNNs (Fig. 3A, t(7) = 3.97, p = 0.0054, d = 1.41). Moreover, both strong-blur CNNs (t(7) = 5.35, p = 0.0011, d = 1.89) and weak-blur CNNs (t(7) = 2.97, p = 0.0208, d = 1.05) outperformed standard CNNs at predicting neuronal responses in V2. For areas V4 and IT, predictive performance was comparable across standard and blur-trained CNNs.
We also tested CNN performance on another dataset that consisted of neuronal recordings from macaque V1 during the presentation of thousands of natural and synthetic images49. In agreement with our findings above, we found that both strong-blur CNNs (t(7) = 4.53, p = 0.0027, d = 1.60) and weak-blur CNNs (t(7) = 4.65, p = 0.0024, d = 1.64) showed better neural predictivity for area V1 than standard CNNs (Fig. 3B and Supplementary Fig. 4). Across both studies, we find that blur-trained CNNs are better able to predict neuronal responses in early visual areas such as V1 and V2. These findings are noteworthy given that the monkeys were tested with clear images only, implying that CNNs that are trained on an exclusive diet of clear images acquire learned representations that deviate from biological vision.
Visual tuning properties of standard and blur-trained CNNs
How might training a CNN with a combination of blurry and clear images modify its visual tuning properties, such that it can better account for neural responses in the visual cortex? To address this question, we presented oriented gratings of varying spatial frequency to each CNN and determined which spatial frequencies led to the strongest responses for each convolutional unit in a given layer. This analysis revealed that standard CNNs prefer a much higher range of spatial frequencies, whereas weak-blur CNNs prefer intermediate spatial frequencies and strong-blur CNNs prefer the lowest range of spatial frequencies (Fig. 4A). We further assessed the bandwidth of spatial frequency tuning, a measure that reflects the range of spatial
Fig. 4 | Assessing the spatial frequency tuning of CNNs. Mean preferred spatial frequency (s.f.) (A) and spatial frequency tuning bandwidth (B) of individual convolutional units obtained from individual layers of standard CNNs (red), weak-blur CNNs (blue), and strong-blur CNNs (purple). Shaded regions indicate 95% confidence intervals. Source data are provided as a Source Data file.
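The grating-based estimate of preferred spatial frequency can be illustrated as follows. This is a simplified sketch under stated assumptions: a convolutional unit is stood in for by an arbitrary callable `unit_response` that maps an image to a scalar, and the set of orientations, phases and frequencies, as well as the function names, are hypothetical. The logic follows the description in the text: present gratings of varying spatial frequency and report the frequency that drives the unit most strongly.

```python
import numpy as np


def grating(size, cycles_per_image, orientation_deg, phase=0.0):
    """Sinusoidal grating of shape (size, size) with values in [-1, 1]."""
    y, x = np.mgrid[0:size, 0:size] / size
    theta = np.deg2rad(orientation_deg)
    carrier = x * np.cos(theta) + y * np.sin(theta)
    return np.sin(2 * np.pi * cycles_per_image * carrier + phase)


def preferred_spatial_frequency(
    unit_response,
    size,
    frequencies,
    orientations=(0, 45, 90, 135),
    phases=(0.0, np.pi / 2, np.pi, 3 * np.pi / 2),
):
    """Frequency whose best grating (over orientation and phase) drives the unit most."""
    peak = [
        max(unit_response(grating(size, f, o, p)) for o in orientations for p in phases)
        for f in frequencies
    ]
    return frequencies[int(np.argmax(peak))]
```

For a real network, `unit_response` would wrap a forward pass and read out the activation of a single unit; that wrapper depends on the deep learning framework and is omitted here.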
frequencies for which each unit is tuned. Blur training led to broader spatial frequency tuning bandwidth in most CNNs, particularly in the middle layers (Fig. 4B). Taken together, our findings provide support for the recent proposal that standard CNNs trained on tasks such as ImageNet object classification are heavily biased to emphasize the processing of high spatial frequency for their classification decisions, and are unable to learn or retain the ability to utilize low spatial frequency information for object recognition6.
Given these shifts in preferred spatial frequency following blur training, we asked whether blur-trained CNNs might exhibit greater sensitivity to object shape information. Although early studies suggested that standard CNNs do show some evidence of shape selectivity22, subsequent work has revealed that CNNs rely more on textural information than global shape in their classification of hybrid object images50,51. Two examples of such hybrid images are shown in Fig. 5B (left), which depicts the global shape of one object filled-in with the texture of a different object. As expected, standard CNNs were strongly biased to classify these hybrid images based on their textural cues, whereas weak-blur CNNs showed a small but highly consistent shift in favor of shape processing for all 16 object categories that were evaluated (Fig. 5A). These findings concur with a recent study that reported a similarly modest shift in shape sensitivity after a CNN was trained on a combination of clear and blurry images52. By comparison, our strong-blur CNNs exhibited a far more pronounced increase in shape bias, and while these networks did not reach human levels of shape bias (gray diamonds), the gap between human performance and CNN model responses was considerably reduced by strong blur training. These findings demonstrate that training CNNs with a subset of highly blurred images can strongly shift their tuning in favor of lower spatial frequency shape information, such that the CNN responses are better aligned with those of human observers.
In addition to quantifying the degree of shape bias exhibited by the CNNs’ classification responses, we visualized the image components that the CNNs tended to weigh more heavily for their decisions. We used layer-wise relevance propagation to visualize which features contributed most to the CNN’s classification response by decomposing the prediction score backward onto pixel space53. Figure 5B shows two examples of texture-shape hybrid stimuli and their layer-wise
Fig. 5 | Evaluating the shape bias of CNNs. A Proportion of shape vs. shape-plus-texture classifications made by standard (red), weak-blur (blue) and strong-blur (purple) CNNs (8 per training condition) when tested with cue-conflict stimuli. Icons indicate category of shape cue tested and bar plots on far right show average shape bias across all 16 categories. Error bars indicate ±1 SEM. Gray dots indicate shape bias score of individual CNN models. Two-tailed paired t-tests were performed to determine statistical significance (*p < 0.05, **p < 0.01, and ***p < 0.001, uncorrected; the exact p values and raw values are provided in the Source Data). B Two examples of cue-conflict stimuli (bottle or dog shape with clock texture) from Geirhos et al., 2019 (with permission), shown with corresponding layerwise relevance propagation maps depicting the image regions that were heavily weighted by VGG-19 in determining its classification response.
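The shape bias measure in panel A, the proportion of shape classifications among all shape-plus-texture classifications, can be written out explicitly. The sketch below is an illustration only; the function name and the handling of edge cases (stimuli without a cue conflict, or responses matching neither cue) are assumptions rather than details taken from the study.

```python
def shape_bias(predictions, shape_labels, texture_labels):
    """Fraction of shape decisions among all shape-or-texture decisions."""
    shape_hits = 0
    texture_hits = 0
    for pred, shape, texture in zip(predictions, shape_labels, texture_labels):
        if shape == texture:
            continue  # not a cue-conflict stimulus, so it is skipped
        if pred == shape:
            shape_hits += 1
        elif pred == texture:
            texture_hits += 1
        # responses matching neither cue do not enter the ratio
    total = shape_hits + texture_hits
    return shape_hits / total if total else float("nan")
```

For example, a bottle-shaped stimulus with clock texture that is classified as "clock" counts as one texture decision, so a model that always follows texture obtains a shape bias of 0.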
relevance propagation maps. Whereas standard CNNs tended to emphasize multiple small image patches corresponding to the texture cues that were scattered throughout the hybrid image, the strong-blur CNNs assigned greater weight to coherent diagnostic portions of the primary object, such as the bottlecap on a bottle or the head region of a dog.
Generalization to challenging out-of-distribution viewing conditions
Given that standard CNNs are strongly influenced by high spatial frequency textural information, might this account for their unusual susceptibility to visual noise5,30,31? In a recent behavioral and fMRI study, we found that standard CNNs not only fail to recognize objects in moderate levels of noise, but they also fail to capture the representational structure of human visual cortical responses to objects embedded in noise5. Here, we compared standard and blur-trained CNNs in terms of their ability to predict human neural responses to clear objects and those same objects presented in either pixelated Gaussian noise or Fourier phase-scrambled noise. Examples of such stimuli can be seen in Fig. 6 (top row). To do so, we again performed representational similarity analysis on the patterns of fMRI responses in each visual area of interest and each layer of a given CNN (Supplementary Fig. 5).
The benefits of blur training were most evident for the strong-blur CNNs, which outperformed standard CNNs at predicting human cortical responses in both early visual areas and high-level object-sensitive areas when all viewing conditions were analyzed together (Fig. 6, left panel). Focused analyses on fMRI responses to clear objects also revealed better performance for strong-blur than standard CNNs in early visual areas V1-V3, corroborating our earlier findings (Figs. 2 and 3). The strong-blur CNNs performed particularly well at accounting for cortical responses to objects in pixelated Gaussian noise, with improved neural predictivity found across low-level and high-level visual areas. However, strong-blur CNNs were also better at predicting neural responses in early visual areas (V1-V4) to objects embedded in Fourier-phase scrambled noise (sometimes called pink noise); such structured noise patterns differ greatly from Gaussian white noise as their power spectrum matches that of natural images. Taken together, we find that blur-trained CNNs can better account for human cortical responses to challenging out-of-distribution conditions involving multiple forms of visual noise. These results provide compelling evidence that blur-trained CNNs provide a better neurocomputational model of the robustness of the human visual system.
Given that blur-trained CNNs showed better prediction of visual cortical responses to clear, blurry, high-pass filtered, and noisy object images, we were motivated to compare both standard and blur-trained
Fig. 6 | Correlational similarity between CNN model responses (n = 8) and neural responses in individual human visual areas (n = 8) to clear objects and objects in visual noise. Leftmost panel shows results pooled across all stimulus conditions; subsequent panels show results for clear objects, objects in Gaussian noise and objects in Fourier phase-scrambled noise; examples of these stimuli are shown above. Error bars indicate ±1 SEM. Two-tailed paired t-tests were performed to determine statistical significance (*p < 0.05, **p < 0.01, and ***p < 0.001, uncorrected; the exact p values and raw values are provided in the Source Data). Elephant image from the authors. LOT, lateral occipitotemporal cortex; VOT, ventral occipitotemporal cortex. LOC, lateral occipital cortex; FFA, fusiform face area; PPA, parahippocampal place area.
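The paired comparisons reported throughout the Results (e.g., t(7) = 3.97, d = 1.41 for 8 matched CNN architectures) can be reproduced in form with the following sketch. It assumes that the effect size is Cohen's d computed on the paired differences, so that d = t/√n; the reported values are approximately consistent with this relation for n = 8 (3.97/√8 ≈ 1.40), but the authors' exact definition of d is not stated in this section. The p value is omitted because it requires the t distribution, which is not in the Python standard library.

```python
import math


def paired_t_and_d(scores_a, scores_b):
    """Paired t statistic, degrees of freedom, and Cohen's d for matched samples."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Sample standard deviation of the paired differences (n - 1 denominator).
    sd = math.sqrt(sum((x - mean) ** 2 for x in diffs) / (n - 1))
    d = mean / sd            # effect size on the paired differences (assumed definition)
    t = d * math.sqrt(n)     # equivalent to mean / (sd / sqrt(n))
    return t, n - 1, d
```

Here `scores_a` and `scores_b` would hold one score per CNN architecture for the two training regimes being compared, in matching order.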
CNN models on their ability to deal with a variety of forms of image degradation by employing a popular benchmark, ImageNet-C54. This benchmark consists of the 1000 object categories from ImageNet’s validation dataset presented with 19 different types of image degradation (Fig. 7A). Figure 7B shows the impact of image degradation on CNN classification accuracy with noise strength varying from 1 to 5 (i.e., weakest to strongest). We found that blurry image training proved highly effective at improving the robustness of CNNs to most forms of image degradation. Indeed, we observed a significant improvement in performance for 14/19 noise conditions (p < 0.05). Weak-blur CNNs showed a consistent increase in classification accuracy for all noise types when compared with standard CNNs, while strong-blur CNNs showed an even greater advantage in many conditions. Specifically, strong-blur CNNs exhibited much greater robustness to both Gaussian blur and other forms of blur (i.e., defocus, glass, motion, zoom). Moreover, strong-blur CNNs were far more robust to all types of pixel-based noise, including Gaussian, speckle, impulse and shot noise. We further found that strong-blur CNNs are more robust to artificial types of image degradation that are known to alter the local image structure of digital images (e.g., elastic transform, JPEG compression, and pixelate). Our findings run contrary to recent claims that CNNs trained on one form of image degradation are unable to generalize to other forms of image degradation31. However, strong blur training was not effective at improving robustness to manipulations involving contrast reduction, saturation, spatter or weather-related forms of noise (e.g., Brightness, Fog, Frost, and Snow). Thus, blur training leads to enhanced robustness to many though not all forms of image degradation.
Given that our blur-trained CNNs proved more robust to many forms of randomly generated noise, we sought to test whether they might also exhibit greater robustness to adversarial noise. Adversarial noise involves modifying the pixel values of an original object image in a purposefully deceptive manner designed to shift the CNN’s decision to an incorrect object category; even very modest levels of noise that are almost imperceptible to humans can lead CNNs astray55,56. We evaluated the adversarial robustness of each CNN by utilizing Projected Gradient Descent57 with L1 and L2 norm constraints (ϵ = 0.001 and 1, respectively). Although blur-trained CNNs remained susceptible to adversarial noise, we found that strong-blur CNNs outperformed standard CNNs with L1 of ϵ = 0.001 (19.63% vs. 13.61%, t(7) = 7.92, p = 0.0001, d = 2.88), and both strong-blur (12.04%) and weak-blur CNNs (7.14%) outperformed standard CNNs (4.47%) with L2 of ϵ = 1 (t(7) = 12.45, p < 10⁻⁵, d = 4.44 and t(7) = 4.45, p = 0.0030, d = 1.58, respectively). Thus, blur training confers greater robustness to both randomly generated noise and adversarial noise.
Correspondence with human behavioral responses to out-of-distribution data
We further sought to determine whether blur-trained CNNs might provide a better account of human behavioral responses to challenging out-of-distribution conditions by leveraging a toolbox developed by Geirhos et al. (2021). This toolbox allows for AI models to be compared with human performance on 17 different object recognition tasks, which include multiple forms of image stylization, image modification (e.g., rotation, grayscale conversion), visual noise, as well as high-pass and low-pass filtering58. Output measures include overall classification accuracy (called out-of-distribution accuracy), human-AI differences in absolute accuracy, as well as measures of the consistency or agreement between human and AI responses. This analysis revealed that blur training not only improved the out-of-distribution accuracy of CNNs (Fig. 8A), it also led to improved consistency between human and AI responses (Fig. 8B–D). For individual CNNs, improvements in human-AI agreement were most prevalent for strong-blur CNNs, followed by weak-blur CNNs, with standard CNNs performing the most poorly. These results demonstrate that blur training improves CNN correspondence with human vision, encompassing human behavioral performance across diverse image conditions.
Evaluation of recurrent network CORnet-S
We performed a further set of analyses to evaluate whether recurrent visual processing might lead to improved neural predictivity or increased robustness in blur-trained CNNs. Recent studies have found that CORnet-S59, which performs within-layer recurrent computations in its first 4 convolutional blocks, provides better predictions of neuronal responses in the monkey visual cortex than most other CNN models27,48. We compared the performance of CORnet-S with two control networks, one that matched the number of convolutional and fully connected layers of CORnet-S but lacked recurrent processing (CORnet-Shallow) and another feedforward CNN with additional convolutional blocks to match the number of feedforward and recurrent block operations performed by CORnet-S (CORnet-Deep). Our analyses revealed pronounced differences in neural predictivity between these CNNs in the studies that presented low-pass and high-pass filtered images as well as objects in visual noise (see Supplementary Fig. 6). Specifically, blur training was much more beneficial for CORnet-Deep and CORnet-S in comparison to CORnet-Shallow. Likewise,
Fig. 7 | Comparison of CNN robustness to multiple forms of image degradation. A Examples of 19 types of image degradation used by benchmark ImageNet-C to evaluate the robustness of CNNs. Original cat image obtained from https://www. and licensed under CC BY 2.0 (with permission from the copyright owner), from which image distorted versions shown here were generated by Hendrycks & Dietterich, 2019. B Mean classification accuracy of 8 different standard (red), weak-blur (blue) and strong-blur (purple) CNNs plotted as a function of noise strength for the 19 types of image degradation. Error bars indicate ±1 SEM. Source data are provided as a Source Data file.
Fig. 8 | Alignment between human and CNN responses in out-of-distribution scenarios. A Classification accuracy for standard (red), weak-blur (blue) and strong-blur CNNs (purple) based on aggregated performance for 17 out-of-distribution datasets provided by Geirhos et al. (2021). Note that 1 of 17 conditions involved blurry images, which was not out-of-distribution for the blur-trained CNNs. B Accuracy difference between humans and CNN models. C, D Consistency of responses and error responses between humans and CNNs; higher values indicate better human-AI alignment, with gray bars indicating human-to-human consistency. Source data are provided as a Source Data file.
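One common way to quantify trial-by-trial agreement between two observers is error consistency, that is, Cohen's kappa computed on whether each observer was correct or incorrect on the same images. The sketch below illustrates that measure; whether the toolbox computes the consistency values in panels C and D in exactly this form is an assumption, and the function name is hypothetical.

```python
import numpy as np


def error_consistency(correct_a, correct_b):
    """Cohen's kappa on per-image correctness of two observers.

    Returns 1 for identical error patterns, 0 for agreement at the level
    expected from the two accuracies alone, and negative values for
    systematically different errors.
    """
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    observed = np.mean(a == b)  # fraction of images with the same outcome
    p_a, p_b = a.mean(), b.mean()
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # agreement expected by chance
    if expected == 1.0:
        return float("nan")  # undefined when both observers are always right or always wrong
    return float((observed - expected) / (1 - expected))
```

Correcting for the expected agreement matters because two observers with high accuracy will agree on most images by chance alone, even if their errors are unrelated.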
strong blur training led to the highest levels of overall robustness to image degradation (i.e., ImageNet-C) and also led to greater shape bias for both CORnet-Deep and recurrent CORnet-S, with negligible difference in performance between the latter two CNNs. Our findings indicate that increased model complexity allows a CNN to acquire greater benefits from blur training but we find no additional advantage for recurrent processing over that of feedforward processing. These findings are in general agreement with the fact that the other CNN architectures, excluding CORnet-S, exhibited similarly large benefits from blur training.
Impact of blur training on visual transformer model ViT
Finally, we asked whether blur training would necessarily lead to improved robustness, shape sensitivity, or neural predictivity, if applied to a deep neural network model with an entirely different architecture. Whereas CNNs perform filtering and pooling operations designed to mimic the visual system, visual transformer models (ViT) process the information contained in local image patches and the relational information between combinatorial pairs of patch representations through a series of iterative computations60. While ViT models operate in a manner that deviates from biological visual systems, they can nevertheless achieve state-of-the-art performance on object classification tasks. We performed standard, weak blur, or strong blur training on 3 ViT models and then evaluated their performance. Blur training led to prominent trends of improved prediction of human fMRI responses, increased shape bias, and greater overall robustness to ImageNet-C, although it did not lead to better prediction of single-unit responses in the monkey (Supplementary Fig. 7). Given that ViT models are not considered to be particularly biologically plausible, the improvements in shape sensitivity and robustness to image degradation are of greater interest here. Thus, even though ViT models are believed to excel at extracting spatial-relational information, blur training still appears effective at improving their sensitivity to shape.
Discussion
In this study, we rigorously compared standard versus blur-trained CNNs on their ability to account for neural responses in the visual cortex by leveraging multiple datasets obtained from both monkeys and humans. We reasoned that the existing gap between CNN models and biological visual systems5,19,22,25,31,47,50,61,62 may be ascribed, at least in part, to inadequate diversity in the set of images that are commonly used to train CNNs. In particular, we hypothesized that blur may be a critical property of natural vision34,44,45 that contributes to the development and maintenance of robust visual systems. Although we and others have previously posited that exposure to blurry visual input may have the potential to confer some robustness to biological or artificial visual systems, the evidence to support this notion so far has been mixed6,52,63–65.
Our study reveals that blur-trained CNNs provide a much better neurocomputational model of the visual system’s responses to diverse sets of object images. Across multiple neural datasets, we found that blur-trained CNNs outperform standard CNNs at predicting neural responses to clear images in the early visual areas of monkeys48,49 and human observers47,66. Blur-trained CNNs also showed superior neural predictivity for out-of-distribution conditions, including high-pass filtering, objects in pixelated Gaussian noise, and objects in Fourier phase-scrambled noise. Moreover, when we compared CNN versus human performance on a large number of out-of-distribution image datasets58, blur-trained CNNs consistently outperformed standard CNNs in terms of their ability to account for human behavior. Thus, by incorporating blurry images into the visual diet of CNNs, we can
construct computational models that are better aligned with biological visual systems across a wide range of viewing conditions including those involving visual noise.

This improved robustness to noise is striking given that most state-of-the-art CNNs are severely impaired when Gaussian or other forms of visual noise are added to an object image5,30–32. Moreover, it has been reported that if a CNN is trained on one form of visual noise (e.g., Gaussian), one typically observes negligible benefit if it is subsequently tested with a different type of noise (e.g., salt-and-pepper noise)31 (but see also5,67). Here, we evaluated whether blur training might lead to a more generalized improvement in robustness by evaluating the performance of standard and blur-trained CNNs on ImageNet-C54. We found that Gaussian blur-trained CNNs can successfully generalize to multiple forms of blur and visual noise, as well as various forms of image compression. That said, blur training did not lead to improved robustness across all conditions; in particular those involving the simulation of noisy weather conditions remained visually challenging. Nevertheless, our findings demonstrate the efficacy of blur training for improving the robustness of CNNs to many forms of image degradation in addition to enhancing their neural predictivity.

How does blur training modify the visual representations learned by CNNs, such that they become both more robust and better aligned with the human visual system? We believe that one key factor is the shift in spatial frequency tuning to favor the processing of lower frequencies and coarser visual features. Another possible contributing factor could be the expanded frequency tuning bandwidth that arose after blur training. Excessive sensitivity to high spatial frequency information appears to be related to a CNN’s susceptibility to adversarial noise68 as well as its ability to learn arbitrary mappings from image datasets with randomly shuffled labels69. Thus, the way that standard CNNs process high spatial frequency information seems to deviate considerably from human vision.

In recent work, we have shown that if CNNs are trained on ImageNet object classification with a series of images that gradually progresses from blurry to clear, the CNNs can initially discriminate blurry objects but this ability is quickly lost as they learn to leverage higher spatial frequency information to attain superior classification performance with clearer object images6. Such catastrophic forgetting of how to recognize blurry objects clearly deviates from our own visual abilities. Moreover, the image datasets that are commonly used to train CNNs lack the diversity of biological vision as they consist almost entirely of clearly photographed images. Here, by introducing blurry images throughout the training regime of CNNs, the networks must both learn and retain their ability to utilize lower spatial frequency information in order to recognize objects.

Related to this increased sensitivity to low spatial frequency information, we found that blur-trained CNNs become more sensitive to global shape and less sensitive to texture. Several recent studies have suggested that CNNs trained on standard tasks of object classification are unduly influenced by high spatial frequency textural information6,50–52,65,68,69. For example, when CNNs are presented with cue conflict stimuli that consist of the global shape of one object and the textural properties of another, their classification decisions are strongly biased by the texture cues50,51. Our findings with weak-blur CNNs concur with another recent study, which found that training with moderate levels of blur can lead to a modest increase in shape bias, while the gap between CNN and human shape preference remains large52. Here, we found that strong-blur CNNs exhibited a far greater degree of shape bias than standard or weak-blur CNNs, such that they were predisposed to classify the cue conflict stimuli according to their shape rather than their texture over 60% of the time. While blur training alone may not be sufficient to induce the degree of shape sensitivity exhibited by human observers, it does appear to help substantially narrow the gap between artificial and biological vision.

Other methods to increase the shape sensitivity of CNNs have also been proposed. For example, large numbers of hybrid shape-texture conflict stimuli can be generated using style transfer methods70 so that CNNs can be directly trained to categorize these cue-conflict stimuli according to their shape51. Another approach is to train CNNs to become more robust to adversarial noise, which can also improve shape sensitivity and decrease texture bias71. Interestingly, a recent study found that CNNs trained with adversarial noise show shifted tuning in favor of lower spatial frequencies in a manner that seems to better match the spatial frequency preferences of V1 neurons68.

While training with such artificially generated stimuli can improve the shape sensitivity of CNNs, it is not clear how these contrived methods can explain how the human visual system acquires robust, shape-sensitive object representations. Also, although humans do encounter some forms of natural visual noise on occasion (e.g., snow, rain, dust storm), the pervasiveness of blur in everyday vision leads us to posit that blur likely has a primary role in bolstering the robustness of the human visual system.

One might further ask whether the non-uniform application of blur, say to simulate the lower spatial resolution of peripheral vision, might lead to similar improvements in robustness and neural predictivity. Motivated by this question, we conducted an exploratory analysis by training AlexNet on a mixture of clear images and images with progressively stronger blur applied to the periphery (see Methods). The model was then evaluated while withholding the application of peripheral blur. We found that peripheral-blur-trained AlexNet showed much better prediction of human fMRI responses (Supplementary Fig. 8A, B), enhanced shape bias (E), and improved robustness to image degradation (F), and also appeared to show some improvement over clear-trained AlexNet at accounting for neuronal responses in macaque V2, V4 and IT (C). (Previous studies that have explored the impact of peripheral blur training have reported more limited benefits72, though it can be difficult to compare methodology and findings across studies.) While we are cautious about interpreting the potential neuroscientific implications of these findings, as multiple computational approaches could potentially be adopted to approximate the lower spatial resolution of human vision in the periphery, these findings indicate that multiple options for blur training can be successfully adopted to improve the robustness, shape sensitivity, and neural predictivity of CNNs.

Our results have important implications for both current and future deep learning models of human vision. While considerations such as network architecture and the objective learning function are certainly important for developing more realistic neural network models of the visual system, we propose that the property of blur is likely to be a critical training ingredient for any neural network to learn human-aligned representations of the visual world. Moreover, our findings are not only relevant to the development of better neurocomputational models of the visual system, they may also inform the development of future computer vision applications that must operate in challenging real-world settings. Indeed, by simply incorporating a subset of blurry images into a CNN’s training regime, one can attain superior robustness, enhanced shape sensitivity, and much better human-AI alignment with minimal downsides in performance. A variety of image augmentations have been proposed to help bolster the performance of CNNs, including some that have become routine (e.g., random cropping and flipping) and others that are more exotic73. Based on our findings, we believe it would be suitable to recommend incorporating blur as a standard form of image augmentation for most computer vision applications. Along these lines, our CNN training code and the weights of our trained networks can be found on a publicly available website with links provided herein.
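As a rough illustration of what blur augmentation could look like in practice, the following NumPy sketch samples a Gaussian blur level for each training image and applies it with a separable filter. The blur levels and frequencies are those reported in the Methods (σ = 0 to 5 pixels at 69.4%, 21.3%, 6.5%, 2.0%, 0.6% and 0.2% for weak blur; σ = 0, 1, 2, 4 or 8 pixels for strong blur). Treating the clear condition as one of five equally likely strong-blur levels is our reading of "equal frequency" and should be taken as an assumption. The names `WEAK_BLUR`, `STRONG_BLUR`, `sample_sigma`, `gaussian_blur` and `blur_augment`, the 3σ kernel truncation, and the reflect padding are all hypothetical choices made for this sketch; it is not the released training code.

```python
import numpy as np

# Gaussian sigma (pixels) -> sampling probability, as reported in the Methods.
WEAK_BLUR = {0: 0.694, 1: 0.213, 2: 0.065, 3: 0.020, 4: 0.006, 5: 0.002}
# Assumption: clear images count as one of five equally likely levels.
STRONG_BLUR = {0: 0.2, 1: 0.2, 2: 0.2, 4: 0.2, 8: 0.2}


def sample_sigma(regime, rng):
    """Draw one blur level (sigma in pixels) from a training regime."""
    levels = np.array(list(regime.keys()))
    probs = np.array(list(regime.values()), dtype=float)
    return int(rng.choice(levels, p=probs / probs.sum()))


def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel truncated at 3 sigma (illustrative choice)."""
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()


def gaussian_blur(image, sigma):
    """Separable Gaussian blur of a 2-D grayscale array; sigma = 0 returns a copy."""
    image = np.asarray(image, dtype=float)
    if sigma == 0:
        return image.copy()
    k = gaussian_kernel(sigma)
    pad = len(k) // 2
    padded = np.pad(image, pad, mode="reflect")
    # Filter along rows first, then along columns; output keeps the input shape.
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)


def blur_augment(image, regime, rng):
    """Apply a randomly sampled blur level to one training image."""
    return gaussian_blur(image, sample_sigma(regime, rng))


rng = np.random.default_rng(0)
image = rng.random((224, 224))          # stand-in for a grayscale training image
augmented = blur_augment(image, STRONG_BLUR, rng)
```

In a PyTorch pipeline, a step of this kind would presumably sit alongside the other augmentations (rotation, horizontal flipping) and before normalization, although the exact ordering used in the study is not stated here.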
Methods
Training of convolutional neural networks
We evaluated the impact of blur training on 8 CNN architectures implemented in PyTorch: AlexNet74, VGG16 and VGG1975, GoogLeNet76, ResNet18, ResNet50 and ResNet10177, and CORnet-S59. After random initialization, the CNNs were trained to classify 1000 object categories from the training dataset of ImageNet40 for 70 epochs using stochastic gradient descent with a fixed learning rate of 0.001, momentum of 0.9, and weight decay of 0.0001. Standard CNNs were trained with clear images only, while weak-blur and strong-blur CNNs were trained with a combination of clear and blurry images. All training images were grayscaled, resized to 224 × 224 pixels, randomly rotated by ±10 degrees, and flipped horizontally on 50% of occasions. The images were then normalized using the mean and standard deviation of the pixel intensities of the ImageNet training samples.

For the weak blur condition, the distribution of blur levels was informed by empirical measures of defocus blur that were obtained from binocular eye and scene tracking data34 while observers performed 1 of 4 different everyday tasks (i.e., ordering coffee, making a sandwich, indoor or outdoor walking). By calculating a scene-based stereo-depth map (spanning 10° eccentricity) with concurrent measures of binocular fixation position, it was possible to calculate the depth distance of objects relative to fixation in each video frame. From these data (courtesy of Sprague et al.), we calculated the extent to which a future fixation target would appear blurred relative to current fixation (i.e., blur circle size) based on diopter measures of relative depth, measures of mean pupil size (~5.8 mm), and simplifying assumptions pertaining to eye size and other factors78. A frequency distribution of blur magnitudes was then obtained, with the different tasks weighted according to their estimated frequency based on Sprague et al.'s analysis of the American Time Use Survey (ATUS) from the U.S. Bureau of Labor Statistics. The weighted distribution of blur circle sizes for subsequently foveated targets was then used to inform the application of blur to the training images (224 × 224 pixels) by assuming that the images were photographed using a 35-mm camera with a 54° horizontal field of view. An exponential function was used to obtain a smoother estimate of the distribution of blur magnitudes. We also adopted a Gaussian blur kernel with FWHM matched to the distribution of blur circle diameters, as blur circles do not adequately account for additional sources of blur such as chromatic aberration. This procedure resulted in a preponderance of clear image presentations (69.4% with σ = 0) and frequencies of 21.3%, 6.5%, 2.0%, 0.6% and 0.2% for which the Gaussian blur kernel was set to a sigma value of 1, 2, 3, 4 or 5 pixels, respectively. It should be noted that our assumption of photo zoom size was fairly conservative; if certain training images were taken with a more zoomed-in view (e.g., 50–105 mm), then a greater level of blur would need to be applied to simulate defocus blur for that image.

For the strong blur condition, we presented images at various blur levels with equal frequency (see Fig. 1) based on the fact that visual resolution steadily declines as a function of eccentricity or distance from the fovea35,46. Thus, different levels of resolution remain continually present during natural vision. In addition to clear images, we presented images with Gaussian blur kernels of increasing size (σ = 1, 2, 4, or 8 pixels) to approximate how visual resolution declines from the fovea to the mid-periphery. With the largest blur kernel, the spatial frequency content of the training images (224 × 224 pixels) would be attenuated below 50% amplitude for frequencies exceeding 6 cycles per image, which would impair but not abolish human recognition performance3,6.

We performed an additional analysis to explore the effect of simulating low-resolution vision in the periphery by applying progressively stronger levels of blur as a function of distance from the center of each training image. To achieve this, we applied a linear increase to the size of the Gaussian blur kernel from the center of the image (coordinates 112, 112 pixel position) to the periphery, starting with a standard deviation of 0 pixels (i.e., clear) at the center and reaching a maximum standard deviation of 8 pixels for eccentricities of 112 pixels or more. We trained AlexNet with a combination of clear and peripheral blur images; the results are reported in Supplementary Fig. 8.

Comparisons between CNN models and human neuroimaging data
We evaluated the correspondence between CNNs and human visual cortical responses by analyzing two publicly available neuroimaging datasets; detailed information can be found in the original papers5,47. The first dataset was acquired from 10 observers who viewed clear, high-pass filtered and low-pass filtered images in a 3T MR scanner47. Images from 6 different object categories (bodies, cars, chairs, elephants, faces, and houses) were presented using a block paradigm. The high-pass filtered images had a cutoff frequency of 4.40 cycles per degree, while the low-pass filtered images had a cutoff frequency of 0.62 cycles per degree. We analyzed the fMRI data made available for 6 regions of interest: visual areas V1 through V4, lateral occipitotemporal cortex (LOT), and ventral occipitotemporal cortex (VOT). The second dataset was acquired using a 7T MRI scanner from 8 human participants (3 females) while they viewed 16 different clear object images and the same images presented in either pixelated Gaussian noise or Fourier phase-scrambled noise5. The object images were selected from 8 object categories (i.e., bear, bison, elephant, hare, jeep, sports car, table lamp, teapot) obtained from the ImageNet validation dataset. The brain regions of interest consisted of visual areas V1 through V4, lateral occipital complex (LOC), fusiform face area (FFA) and parahippocampal place area (PPA).

Representational similarity analysis (RSA) was used to assess the similarity of visual representations across CNN models and human observers. To do so, we calculated the Pearson correlational similarity of the response patterns across all relevant stimulus conditions to obtain a correlation matrix for each visual area of an observer and each layer of a CNN. We could then assess the similarity between human and CNN matrices by calculating their Pearson correlation with the main diagonal excluded. We chose to use Pearson correlation over alternative approaches such as Spearman correlation79, as the latter allows for non-linear relationships between predicted and actual response patterns that could allow for excessive model flexibility. Moreover, our analyses of the monkey neurophysiology data relied on linear regression; therefore, the use of Pearson correlation to evaluate the human fMRI data seemed more appropriate. Nevertheless, it can be noted that almost identical results were obtained when we applied Spearman correlation instead for our analyses. For the fMRI block paradigm study, the analysis was performed on the mean fMRI response patterns observed for each object category, and on the averaged CNN responses across the 10 images in each object category. For the fMRI objects-in-noise study, the analysis was performed on the mean fMRI responses for each of the 16 object images across the 3 viewing conditions (48 stimulus conditions total).

For the feedforward hierarchical CNNs (e.g., AlexNet, VGG), we performed RSA on every convolutional and fully-connected layer after ReLU non-linearity was applied. For the inception, residual and recurrent networks, we focused our analysis on the layers in which all parallel or recurrent features were combined at the end of each computational block (see Supplementary Table 1).

We calculated the Pearson correlational similarity between each CNN and the response patterns found in a given visual area for each observer and then averaged the results across observers to obtain 8 correlational similarity values (1 per CNN architecture), which allowed us to test for differences in performance between standard, weak-blur and strong-blur CNN training. Statistical tests consisted of repeated measures ANOVA applied across CNN training regimes and visual areas
of interest, followed by planned paired t-tests (two-tailed, uncorrected for multiple comparisons) to directly compare the predictive performance of the different CNN training regimes. For these statistical analyses, all correlation coefficients were first converted to z values using Fisher’s r-to-z transformation.

Comparisons between CNN models and monkey neuronal data
We evaluated the correspondence between CNNs and single-unit responses obtained from the macaque visual cortex by analyzing two publicly available datasets. The first set of analyses focused on data made available as part of the Brain-Score benchmark, a site designed to facilitate the evaluation of neural network models and their ability to account for behavioral and neural responses to visual stimuli48. We largely adopted the analysis pipeline implemented by Brain-Score to evaluate our CNNs. This involved extracting CNN responses to the object images from each layer, applying PCA to reduce the dimensionality of these responses (to 300 dimensions), and then applying linear partial least squares regression to predict neuronal responses. The Pearson correlation between actual and predicted neuronal responses was calculated using separate sets of images for training and testing, and the median predictivity score across all neurons from a visual area of interest (V1, V2, V4, IT) was then output by the Brain-Score toolbox.

Our second set of analyses focused on V1 neuronal data obtained from two alert male monkeys aged 12 and 9 years while they viewed a large set of 7250 natural and synthetic images presented parafoveally49. We followed the analysis pipeline of the original study after recoding the analysis in PyTorch. Layerwise CNN responses to the object images were normalized using batch normalization, and a regression model was fitted to the responses of each neuron by using 80% of the images for model training and 20% of the images for model testing. Specifically, a linear/non-linear regression model was trained to minimize a Poisson-based loss function via the Adam optimizer. In addition, three regularization constraints were applied to the weights of the regression model: L1-norm sparsity (λ = 0.01), spatial smoothness (λ = 0.1), and group sparsity (λ = 0.001), where λ denotes the regularization rate. The correlation between predicted and actual neuronal responses to the independent set of test images was used to evaluate the neural predictivity of the CNN models.

Spatial frequency tuning preferences of CNN models
We measured the spatial frequency tuning of the convolutional units in each layer of a CNN by presenting whole-field sinusoidal grating patterns that varied in spatial frequency (4.48, 8.96, 13.44, …, 112 cycles/stimulus), orientation (0, 12, …, 168°), and spatial phase (0, 90, 180, 270°), following previously described methods6. The spatial frequency tuning curve was then obtained for individual convolutional units (otherwise known as channels) by averaging the responses across all orientations, phases, and spatial positions. The spatial frequency that elicited the maximum response was identified as the preferred spatial frequency of that unit. We further assessed the bandwidth of spatial frequency tuning by fitting a Gaussian function to the spatial frequency response profile of each unit on a logarithmic scale, calculating the full width at half maximum, and scaling this value relative to the center frequency of the peak response.

Texture versus shape bias of CNN models
We evaluated whether CNN classification decisions were more strongly influenced by shape or texture cues by presenting shape-texture cue conflict stimuli that were generated using style transfer methods70 in a previous study51. The stimulus set consisted of 1280 images from 16 ImageNet categories that included airplane, bear, bicycle, bird, boat, bottle, car, cat, chair, clock, dog, elephant, keyboard, knife, oven, and truck (publicly available online). For each hybrid image, the category with the highest confidence response among the 16 categories was identified as the CNN’s classification response. The degree of shape bias exhibited by a CNN was then quantified as the proportion of classification decisions that corresponded with the hybrid object’s shape in comparison to the total number of shape-consistent and texture-consistent decisions made by that CNN for a given hybrid stimulus set. These CNN results could then be compared with the classification judgments of 10 human participants who were evaluated in the original study.

Layer-wise relevance propagation
We performed layer-wise relevance propagation to identify the diagnostic features of objects that account for a network’s classification decision53. This approach works best with strictly hierarchical feedforward CNNs; we therefore focused our analysis primarily on VGG-19, using methods and parameter settings we have described elsewhere5. To create pixel-wise heatmaps, the relevance score of the unit corresponding to the correct category in the last fully connected layer was set to a value of 1 while all other units were set to 0. Relevance scores were then back-propagated to the input layer to construct heatmaps in pixel space. Only positive values were used to focus on category-relevant features of the target object, and the resulting heatmap was linearly adjusted to a range of 0 to 1.

Evaluation of adversarial robustness
To evaluate the adversarial robustness of each CNN model, we performed a Projected Gradient Descent-based white-box attack57. A key feature of Projected Gradient Descent is its perturbation limit, which controls the extent of input changes. This constraint is vital for ensuring the practicality of the adversarial examples and for setting a uniform standard for comparison, allowing for the evaluation of diverse models under identical conditions. Specifically, this method generates adversarial examples by iteratively updating gradient-based image perturbations with bounded constraints, as formulated by:

$$x_{t+1} = P\left(x_t + \alpha\,\mathrm{sign}\left(\nabla_x L(x_t)\right)\right)$$

where $x_t$ is the perturbed image at the $t$-th step, $P(\cdot)$ is the projection operator to ensure that the adversarial perturbations applied to the image do not exceed a specified threshold, $\alpha$ is the step size, and $L$ is the loss function. The projection operator maps the perturbed image back onto the surface of an $\ell_p$-norm ball centered at the original image $x$ and bounded by $\|x_t - x\|_p \le \epsilon$. With a random initialization of $x$, the adversarially perturbed data were generated with 15 iterations using a step size of 0.001. We evaluated both $\ell_\infty$ and $\ell_2$ norm-bounded perturbations with $\epsilon = 0.001$ and $\epsilon = 1$, respectively.

Comparison of CNN outputs and human behavioral responses to out-of-distribution image datasets
We evaluated how closely the outputs of CNNs align with human behavioral responses under out-of-distribution conditions. This comparison was based on publicly available benchmark datasets from Geirhos et al. (2021), encompassing 17 diverse datasets. Twelve of these datasets include parametric variations such as changes in color (both color and grayscale), contrast level, high-pass and low-pass filtering, phase noise, power equalization, opponent color processing, rotation, and three types of Eidolon transformations (I, II, III), as well as uniform noise. The other five datasets focus on nonparametric image alterations, including sketches, stylized images, edges, silhouettes, and texture-shape cue conflict. The assessment extends beyond simple accuracy measurements, incorporating three additional metrics: 1) the accuracy difference, which compares CNN and human accuracy across various out-of-distribution tests; 2) observed consistency, which measures the proportion of instances where both humans and a CNN model either correctly or incorrectly identified the same sample; and 3) error
A peer review file is available.
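The first two behavioral metrics named in the out-of-distribution comparison above (accuracy difference and observed consistency) can be sketched as follows, under the assumption that CNN and human responses to the same images are coded as per-trial correct/incorrect values. The function names are hypothetical, and the signed, single-condition form of the accuracy difference is an illustrative simplification rather than the benchmark's exact definition.

```python
import numpy as np


def accuracy_difference(cnn_correct, human_correct):
    """Signed gap between CNN and human accuracy on the same set of trials."""
    return float(np.mean(cnn_correct) - np.mean(human_correct))


def observed_consistency(cnn_correct, human_correct):
    """Fraction of trials on which CNN and human are both right or both wrong."""
    cnn = np.asarray(cnn_correct, dtype=bool)
    human = np.asarray(human_correct, dtype=bool)
    return float(np.mean(cnn == human))


# Toy example: 1 = correct, 0 = incorrect, one entry per image.
cnn = [1, 1, 1, 0]
human = [1, 0, 0, 0]
print(accuracy_difference(cnn, human))   # 0.5
print(observed_consistency(cnn, human))  # 0.5
```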