An Optical Neural Network Using Less Than 1 Photon Per Multiplication
processors [2]. Optical neural networks have attracted attention as an alternative physical platform
for deep learning [3, 4], as it has been theoretically predicted that they can fundamentally achieve
higher energy efficiency than neural networks deployed on conventional digital computers [5, 6]. Here,
we experimentally demonstrate an optical neural network achieving 99% accuracy on handwritten-
digit classification using ∼3.2 detected photons per weight multiplication and ∼90% accuracy using
∼0.64 photons (∼2.4 × 10⁻¹⁹ J of optical energy) per weight multiplication. This performance was
achieved using a custom free-space optical processor that executes matrix-vector multiplications in a
massively parallel fashion, with up to ∼0.5 million scalar (weight) multiplications performed at the
same time. Using commercially available optical components and standard neural-network training
methods, we demonstrated that optical neural networks can operate near the standard quantum limit
with extremely low optical powers and still achieve high accuracy. Our results provide a proof-of-
principle for low-optical-power operation, and with careful system design including the surrounding
electronics used for data storage and control, open up a path to realizing optical processors that
require only 10⁻¹⁶ J total energy per scalar multiplication—which is orders of magnitude more
efficient than current digital processors.
I. INTRODUCTION
Much of the progress in deep learning over the past decade has been facilitated by the use of ever-larger models,
with commensurately larger computation requirements and energy consumption [2]. The widespread commercial
implementation of increasingly complex deep neural networks (DNNs) has resulted in rapid growth in the total energy
consumption of machine learning, and in large-scale deployments, 80-90% of the cost is for inference processing [7].
The rate at which the energy requirements of state-of-the-art DNNs are growing is unsustainable and urgently needs
to be addressed [2]. Both software and hardware advances are important for reducing energy consumption: DNN
models (the software) can be made more efficient [2], and hardware for executing DNN models can be made more
efficient, by specializing the hardware to the type of computations involved in DNN processing [8].
Optical processors have been proposed as deep-learning accelerators that can in principle achieve better energy
efficiency and lower latency than electronic processors [3–6, 9]. For deep learning, optical processors’ main proposed
role is to implement matrix-vector multiplications [10, 11], which are typically the most computationally intensive
operations in DNNs [8] (Fig. 1a).
Theory and simulations have suggested that optical neural networks (ONNs) built using optical matrix-vector
multipliers can exhibit extreme energy efficiency surpassing even that of irreversible digital computers operating
at the fundamental thermodynamic limit [5]. In order to achieve an energy advantage, an optical matrix-vector
multiplier needs massively parallel execution of the scalar multiplications and additions that constitute a matrix-
vector multiplication [5, 6]. It has been predicted that for sufficiently large vector sizes, matrix-vector multiplication
can be performed with an optical energy cost of less than 1 photon per scalar multiplication, assuming the standard
quantum limit for noise [5, 6]. The energy of a single photon at visible wavelengths is on the order of 10⁻¹⁹ J; this
suggests a possibility for optical processors to have an energy advantage of several orders of magnitude over electronic
processors using digital multipliers, whose energy cost is currently between 10⁻¹⁴ and 10⁻¹³ J per scalar multiplication
with equivalent precision [12, 13].
∗ tw329@cornell.edu
† pmcmahon@cornell.edu
The key to realizing an energy advantage in an optical matrix-vector multiplier is to maximize the sizes of the
matrices and vectors that are to be multiplied. With large operands, there are many constituent scalar multiplication
and accumulation operations that can be performed in parallel completely in the optical domain, and the costs of
conversions between electronic and optical signals can be amortized [6]. Optics provides several different ways to
implement operations in parallel, including using wavelength multiplexing [14, 15], spatial multiplexing in photonic
integrated circuits [10, 14, 16–19], and spatial multiplexing in 3D free-space optical processors [11, 20–29].
To date, across all multiplexing approaches and architectures, demonstrations of analog ONNs have involved small
vector-vector dot products (as a fundamental operation in implementing convolutional layers [14, 15] and fully con-
nected layers [27]) or matrix-vector multiplications (for realizing fully connected layers [10]): the vectors have been
limited [30] to sizes of at most 64. This is substantially below the scale (vector sizes >10³) at which sub-photon-
per-multiplication energy efficiency is predicted to be feasible [5, 6]. This is the fundamental reason that the optical
energy consumption in recently demonstrated optical processors is still several orders of magnitude higher than that of
theoretical predictions (10⁻¹⁴–10⁻¹³ J versus 10⁻¹⁸ J per scalar multiplication) [5, 6, 15, 27, 30]. One ONN architectural
approach that is promising for near-term explorations of large-scale ONN operation is to perform spatial multiplexing
in 3D free space, since a 2D cross section can contain [31], for example, >10⁶ modes in an area of 1 mm². While the
potential for large-scale operation exists based on the available parallelism in free-space spatial modes, this potential
has not yet been realized.
Here, we report on our experimental implementation of a 3D free-space optical processor that can perform matrix-
vector multiplications at large scale (with vector sizes of up to 0.5 million). The large scale has enabled us to
demonstrate the computation of matrix-vector products and vector-vector dot products each using less than 1 photon
per scalar multiplication. With this optical matrix-vector multiplier we constructed an ONN (Fig. 1a) that per-
forms image classification using less than 1 photon per scalar multiplication, matching theoretical predictions for the
quantum-limited optimal efficiency of an ONN [5].
The optical processor we designed and constructed uses the following scheme to perform matrix-vector multiplica-
tions y⃗ = W x⃗. Each element xj of the input vector x⃗ is encoded in the intensity of a separate spatial mode illuminated
by a pixel of a light source, and each matrix element wij is encoded as the transmissivity of a modulator pixel. We used
an organic light-emitting diode (OLED) display as the light source and a spatial light modulator (SLM) for intensity
modulation. Matrix-vector multiplications were computed in three physical steps: 1) Fan-out: The input vector’s
elements were spatially arranged into a 2D block (Fig. 1b, top left). The 2D block representing x⃗ was replicated a
number of times equal to the number of rows in the matrix W, and then tiled on the OLED display as shown in Fig.
1b (top row). 2) Element-wise Multiplication: Each OLED pixel encoding a single vector element xj was aligned and
imaged to a corresponding pixel on the SLM, whose transmissivity was set to be ∝ wij . This performs the scalar
multiplication wij xj (Fig. 1b bottom middle). 3) Optical Fan-in: The intensity-modulated pixels from each block
were physically summed by focusing the light transmitted by them onto a detector. The total number of photons
impinging on the ith detector is proportional to the element yi of the matrix-vector product y⃗ = W x⃗ (Fig. 1b bottom
right). We can interpret each yi as the dot product between the input vector x⃗ and the ith row of the matrix W.
This design computes all the scalar multiplications and additions involved in a matrix-vector multiplication in
parallel in a single pass of light through the setup. The encoding of vector elements in optical intensity constrains
the setup to performing matrix-vector multiplications with matrices and vectors that have non-negative elements.
However, we can use the system to perform matrix-vector multiplications with matrices and vectors that have signed
elements (elements that can take both positive and negative values) by using offsets and scaling to convert the
calculations to matrix-vector multiplications involving only non-negative numbers [21] (see Supplementary Information
Section 11).
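The precise scheme is given in Supplementary Information Section 11; as an illustration only, one standard construction [21] shifts both operands into the non-negative range, performs a single non-negative product, and then subtracts correction terms that are cheap to obtain (the row sums of W are known digitally, and the sum of x⃗ can be computed with one extra all-ones row):

```python
import numpy as np

def signed_matvec_via_nonneg(W, x):
    """Compute y = W @ x for signed W and x using only one non-negative
    matrix-vector product, as an intensity-encoded multiplier requires.
    Illustrative construction; the paper's exact procedure is in SI Sec. 11."""
    c = max(0.0, -W.min())        # offset making every element of W + c >= 0
    d = max(0.0, -x.min())        # offset making every element of x + d >= 0
    y_pos = (W + c) @ (x + d)     # <- the only product done "optically"

    # (W + c)(x + d) = Wx + d*rowsums(W) + c*(sum(x) + d*N); subtract the rest.
    N = x.size
    return y_pos - d * W.sum(axis=1) - c * (x.sum() + d * N)

rng = np.random.default_rng(1)
W, x = rng.normal(size=(5, 8)), rng.normal(size=8)
assert np.allclose(signed_matvec_via_nonneg(W, x), W @ x)
```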
Our 2D-block design can be, and was, implemented with spherical-lens systems, which are well-corrected for errors
in large-field-of-view imaging (the setup schematic is shown in Supplementary Information Section 1). The setup
enabled us to align an array of 711×711 pixels on the OLED display to an array of 711×711 pixels on the SLM
(Supplementary Information Sections 5, 6, and 7), allowing 711×711=505,521 scalar multiplications to be performed
in parallel. Our experimental setup was, with a single pass of light through the setup, capable of performing matrix-
vector multiplications with matrices having arbitrary dimensions, subject to the total number of matrix elements
being no larger than 505,521. In the special case of the matrix consisting of a single row, our setup performed a single
vector-vector dot product with vector sizes up to 505,521.
The ability to perform dot products between very large vectors with our block-encoded design enabled us to
achieve extremely low optical energy consumption. For each vector-vector dot product that the system computes, the
summation of the element-wise products is performed by focusing the spatial modes corresponding to the element-wise
FIG. 1. A 3D free-space optical matrix-vector multiplier that uses less than one photon per scalar multiplication
performed. a, The role of optical matrix-vector multiplication in executing the forward-pass (inference-mode) operation in a
fully connected neural network. The matrix multiplications involved in propagating each layer forward are performed optically
and the element-wise nonlinear activation functions are performed electronically. Each neuron in the middle (hidden) layer is
color-coded to show the correspondence between the mathematically abstract neurons in (a) and their optical implementation
in (b). b, A step-by-step illustration of how optical matrix-vector-multiplication can be performed by decomposing the matrix-
vector operation W x⃗ into blocks of vector-vector dot products that are implemented as scalar multiplications performed
in parallel, followed by summations (accumulations) performed in parallel. The depiction is for a 4×4 matrix and a 4-
dimensional vector, but the concept extends naturally to matrices of arbitrary dimension. The top row shows mathematically
abstract operations, and the bottom row shows the corresponding physical operations with optics. “◦” denotes element-wise
multiplication between two matrices or vectors of the same size. Individual scalar multiplications are realized optically by
encoding one operand (xj ) in the intensity of a single spatial mode (depicted as a beam for illustrative purposes), and the
other operand (wij ) in the transmissivity of a single pixel of a modulator: propagation of the beam through the modulator’s
pixel yields the scalar multiplication, the result of which is encoded in the intensity of the transmitted light. Each
summation Σj wij xj is performed optically by focusing multiple spatial modes onto a single detector. c, An illustration of how optical
matrix-vector multiplication can consume less than 1 photon per scalar multiplication when the vector size is large. A single
lens is used to sum the intensities of the spatial modes encoding the element-wise products wij xj between the ith row of W
and the vector x⃗. For a vector of size N, there are N spatial modes whose intensities are summed. If N is sufficiently large,
then even if each individual spatial mode contains <1 photon on average, the total number of photons impinging on the detector
will be ≫1, allowing high signal-to-noise-ratio measurement of the summation result yi = Σj wij xj.
products onto a single detector. If the vectors have size N , then N spatial modes are incoherently summed on the
detector.
Consequently, the detector's output, which is proportional to the dot-product answer, has a signal-to-noise ratio
(SNR) that scales as √N under the shot-noise limit [6]. If the vector size is sufficiently large then even if the individual
spatial modes each have an average photon number far less than 1, the total number of photons impinging on the
detector can be much greater than 1, and so precise readout of the dot-product answer is possible (Figure 1c).
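A short Monte-Carlo sketch (our own illustration; note it normalizes the error by the true answer rather than by the full dynamic range used in Figure 2) makes this scaling concrete:

```python
import numpy as np

rng = np.random.default_rng(2)

def shot_noise_rms_error(N, photons_per_mult, trials=500):
    """Relative RMS error of a shot-noise-limited optical dot product of
    two random non-negative vectors of size N, for a given average
    photon budget per scalar multiplication."""
    w, x = rng.random(N), rng.random(N)
    y_true = w @ x
    # Scale intensities so the mean detected photon number per spatial
    # mode equals the budget; the detector integrates the Poisson sum.
    scale = photons_per_mult / np.mean(w * x)
    counts = rng.poisson(scale * y_true, size=trials)   # total photons
    y_est = counts / scale
    return np.sqrt(np.mean((y_est - y_true) ** 2)) / y_true

for N in (100, 10_000, 1_000_000):
    err = shot_noise_rms_error(N, photons_per_mult=0.001)
    print(f"N={N:>9,}: RMS error ~ {100 * err:.1f}%")
# The total photon number is N * photons_per_mult, so the SNR grows as
# sqrt(N): at N ~ 1e6 and 0.001 photons per multiplication, ~1000 photons
# reach the detector and the error falls to a few percent.
```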
FIG. 2. Vector-vector dot products computed with high accuracy and high effective numerical precision using
as few as 0.001 photons per scalar multiplication. a, Simplified schematic of optical setup for computation of vector-
vector dot products, and characterization procedure. N -pixel images were used as test vectors by interpreting each image as
an N -dimensional vector. The setup was used to compute the dot products between many different random pairs of vectors,
with each computation producing a result ymeas (top and center rows; an example experimental measurement of the
element-wise multiplication w⃗ ◦ x⃗ was captured with a camera before optical fan-in for illustrative purposes). For each vector pair, the
dot-product ground truth ytruth was computed on a digital computer (bottom row). The error was calculated as ymeas − ytruth .
b, The root-mean-square (RMS) error of the dot product computation as a function of the average number of detected photons
per scalar multiplication. The vector length N was ∼0.5 million (711×711), which is sufficiently large that we observed an
RMS error of <6% even when only 0.001 photons per multiplication were used, and an RMS error of <1% when 0.1 photons
per multiplication were used. The insets show error histograms (over different vector pairs and repeated measurements) from
experiments using 10 and 0.001 photons per multiplication, respectively. The error bars in the main plot show 10× the standard
deviation of the RMS error, calculated using repeated measurements. c, The RMS error as a function of the vector size N ,
equal to the number of scalar multiplications needed to compute a vector-vector dot product. For each vector size, the RMS
error was computed using five different photon budgets, ranging from 0.001 to 10 photons per scalar multiplication. The shaded
column indicates data points that are also shown in Panel b. The error bars show 3× the standard deviation of the RMS error,
calculated using repeated measurements.
To understand how well our system performed in practice in the regime of low optical power consumption, we
characterized its accuracy while varying the number of photons used. In our first characterization experiments, we
computed the dot products of randomly chosen pairs of vectors (Figure 2a; see Methods). The results from our
characterization with dot-product computations apply directly to the setting of matrix-vector multiplications with
generic matrices because our setup computes matrix-vector multiplications as a set of vector-vector dot products. For
a dot-product computation, the answer is a scalar, so only a single detector was used: the optical signal encoding the
dot-product solution was measured by a sensitive photodetector capable of resolving single photons (see Methods),
and the number of photons used for each dot product was controlled by changing the detector integration time and
by inserting neutral-density filters immediately after the OLED display.
To demonstrate that our setup could perform computations using less than 1 photon per scalar multiplication for large
vector sizes, we measured the numerical precision of dot products between vectors each of size ∼0.5 million. With
0.001 photons per scalar multiplication, the error was measured to be ∼6% (Figure 2b; see Supplementary Information
Section 12 for the details of RMS-error calculation); the dominant contribution to this error was from shot noise at the
detector (Supplementary Information Section 8). As we increased the number of photons used, the error decreased
until it reached a minimum of ∼0.2% at 2 photons per multiplication or higher (Figure 2b). We hypothesize that
the dominant sources of error at high photon counts (>2 photons per multiplication) are imperfect imaging of the
OLED display pixels to SLM pixels, and crosstalk between SLM pixels. We note that the experimental runs to test the
performance of the system when using 0.001 photons per multiplication (which resulted in ∼6% error) were performed
with a detector integration time of ∼100 ns. This shows that matrix-vector multiplications can be performed with
<1 photon per multiplication at a rate of at least 10 MHz, although this is merely an experimentally demonstrated
lower bound: in principle, with sufficiently fast modulators and detectors, the system should be able to perform
matrix-vector multiplications at rates >10 GHz [6]. To enable comparison of the experimentally achieved
analog numerical precision with the numerical precision of digital processors, we can interpret each measured analog
error percentage (Figure 2b) as corresponding to an effective bit-precision for the computed dot product’s answer.
Using the metric noise-equivalent bits [6], an analog RMS error of 6% corresponds to 4 bits, and 0.2% RMS error
corresponds to ∼9 bits (see Methods).
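The conversion uses the Methods formula NEB = −log₂(RMS error); for example:

```python
import numpy as np

def noise_equivalent_bits(rms_error):
    # NEB = -log2(RMS error), the metric of Ref. [6] used in the text.
    return -np.log2(rms_error)

print(noise_equivalent_bits(0.06))   # ~4.1 bits
print(noise_equivalent_bits(0.002))  # ~9.0 bits
```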
We also verified that we could compute dot products between shorter vectors when using low numbers of photons
per scalar multiplication (Figure 2c). For photon budgets ranging from 0.001 to 0.1 photons per multiplication, the
numerical error was dominated by shot noise for all vector sizes tested. When the number of photons used was
sufficiently large, the error was no longer dominated by shot noise, which is consistent with the single-vector-size
results shown in Figure 2b. For every photon budget tested, dot products between larger vectors had lower error; we
attribute this to dot products between larger vectors involving the effective averaging of larger numbers of terms.
Having characterized the accuracy of our experimental setup for performing multiplication operations with random
vectors, we set out to demonstrate its use as the core of an experimental ONN implementation. We realized an
ONN composed of fully connected layers, where the matrix-vector multiplications between layers were performed
optically using our experimental setup, and where the nonlinearity was applied electronically (using a digital processor)
between each layer.
Our main goal was to determine the extent to which our ONN could tolerate multiplication inaccuracy resulting from
the use of a very limited photon budget. Our approach was to run a trained neural network with our setup and measure
the classification accuracy as a function of the number of photons used. We used handwritten-digit classification with
the MNIST dataset as our benchmark task and trained a four-layer fully connected multi-layer perceptron (MLP)
(Figure 3a) with a back-propagation procedure designed for use with low-precision inference hardware (quantization-
aware training—QAT [32]; see Methods).
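Our training code is available at https://github.com/mcmahon-lab/ONN-QAT-SQL; purely as a sketch, a minimal PyTorch version of activation-side QAT using a straight-through estimator might look like the following (the 5-bit weight quantization and the data augmentation described in Methods are omitted):

```python
import torch
import torch.nn as nn

class FakeQuant(nn.Module):
    """Straight-through fake quantization of activations (assumed to lie
    in [0, 1]) to 2**bits - 1 levels; a simplified stand-in for Ref. [32]."""
    def __init__(self, bits):
        super().__init__()
        self.levels = 2 ** bits - 1

    def forward(self, x):
        x = x.clamp(0, 1)
        q = torch.round(x * self.levels) / self.levels
        return x + (q - x).detach()   # quantize forward, identity gradient

# The 784-100-100-10 MLP of Figure 3a, with 4-bit activation quantization.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 100), nn.ReLU(), FakeQuant(bits=4),
    nn.Linear(100, 100), nn.ReLU(), FakeQuant(bits=4),
    nn.Linear(100, 10),
)
logits = model(torch.rand(32, 1, 28, 28))   # dummy batch of images
print(logits.shape)                          # torch.Size([32, 10])
```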
We evaluated the first 130 test images in the MNIST dataset under 5 different photon budgets: 0.03, 0.16, 0.32, 0.64,
and 3.2 photons per scalar multiplication (Figure 3b, center panel, orange dots). We found that using 3.2 photons
per multiplication led to a classification accuracy of ∼99% (Figure 3b, top-right panel), which was almost identical to
the accuracy (99%) of the same trained neural network run on a digital computer. In the sub-photon regime, using
0.64 photons per multiplication, the ONN achieved >90% classification accuracy (Figure 3b, top-middle panel). The
experimental results agree well with the results from simulations of the same neural network being executed by an
ONN that is subject to shot noise (Figure 3b, center panel, dark-blue line). To achieve an accuracy of 99%, the total
optical energy detected for each inference of a handwritten digit was ∼1 pJ (Figure 3b). For the weight matrices used
in these experiments, the average SLM transmission was ∼46%, so when considering the unavoidable loss at the SLM,
the total optical energy needed for each inference was ∼2.2 pJ. For comparison, 1 pJ is close to the energy used for
only a single scalar multiplication in electronic processors [33], and our model required 89,400 scalar multiplications
per inference (see Supplementary Information Section 14).
FIG. 3. MNIST handwritten-digit classification demonstrated with an optical neural network using less than
one photon per scalar multiplication. a, Illustration of the neural-network architecture for handwritten-digit classification
that we implemented as an ONN. The neural network is composed of a sequence of fully connected layers: an input layer
consisting of 28×28 neurons that encodes the image to be classified (left); two hidden layers consisting of 100 neurons each (middle);
and an output layer containing 10 neurons (right). The activation of each neuron is illustrated as green brightness for the
example of an input image containing the digit 3 (vertical bars, top panel). The weights of the connections between neurons
for all four layers are shown (grayscale square arrays, bottom panel); each square array shows the weights from all the neurons
in one layer to one of the neurons in the next layer. b, Classification accuracy tested using the MNIST dataset as a function
of optical energy consumption (middle panel), and confusion matrices for each corresponding experimental data point (top and
bottom panels). The optical energy per inference is defined as the total optical energy received by the photodetectors during
execution of the three matrix-vector multiplications comprising a single forward pass through the entire neural network.
V. DISCUSSION
Here we have presented experimental results that support the notion that optical neural networks can in principle
[3–6] have a fundamental energy advantage over electronic neural-network implementations. We showed that ONNs
can operate in a photon-budget regime in which the standard quantum limit (i.e., optical shot noise) governs the
achievable accuracy. In particular, we achieved high classification accuracies using our ONN even when restricted
to a photon budget of less than one photon per scalar multiplication. We used a standard neural-network model
architecture and standard training techniques, so we did not specialize the model to our hardware, nor did we need
to perform any re-training to run the model on our hardware setup. This successful separation of software and
hardware development shows that our ONN could be used as a drop-in replacement for other more conventional
neural-network-accelerator hardware [8] without the need for any major changes to the machine-learning software
workflow.
Our results have been enabled by several key design choices. The 2D-block design presented in this work, which
can be seen as a generalization of the Stanford matrix-vector multiplier [21] that avoids cylindrical lenses and their
practical limitations, takes full advantage of the number of parallel modes in 3D free space, allowing extremely large
numbers of scalar multiplications to be performed in parallel—up to ∼0.5 million in our experimental realization. This
enabled us to implement a fully parallelized optical matrix-vector multiplier for matrices having size up to 711×711.
The same optical setup could be, and was, used to compute vector-vector dot products with vectors of size up to ∼0.5
million, which is many orders of magnitude larger than the vector sizes used in previous demonstrations of optical
processors [14, 16, 27, 34]. The use of 2D blocks was also an important contributor to the achieved accuracy, since
this layout reduces the impact of crosstalk between pixels (Supplementary Information Section 10).
We have shown that the optical energy used to perform each scalar multiplication can be <1 × 10⁻¹⁸ J in a functional
ONN performing MNIST handwritten-digit classification. The ONN demonstrated in this work was of moderate size,
comprising layers with at most 784 neurons, and even better energy efficiency can be achieved for neural networks with
wider layers [5]. Our results support an estimate that an ONN with appropriate system-level design could have an
advantage of at least two orders of magnitude over electronic DNN accelerators, using a total energy across the entire
system of only 1 × 10⁻¹⁶ J per scalar multiplication, even when operating at GHz speeds (Supplementary Information
Section 15).
Our proof-of-principle results for sub-photon-per-multiplication ONN operation should readily translate to other
ONN architectures, including those using coherent light, provided that they involve summation of a sufficiently large
number of modes (be they spatial, frequency, or temporal modes) [5, 6].
One critical step towards building practical ONNs with high overall energy efficiency is to design scalable optical
fan-out and fan-in using miniaturized passive components. Optical fan-out is crucial to realizing sub-photon scalar
multiplications: an optical fan-out stage [21] would be used to spread photons from a single light source across a
number of spatial modes larger than the number of photons emitted (in the experiments reported in this paper, less
than 1 photon per mode was achieved by applying attenuation, which, while enabling our scientific demonstration,
is not suitable as an ultimate engineering solution). Both optical fan-out and optical fan-in can be implemented, for
example, with lens arrays [35, 36]; demonstrating an ONN using highly scalable optical fan-out/in stages, integrated
with high-efficiency modulators [37, 38] and detectors [39], is an important next step.
More broadly, the ability to perform low-precision matrix-vector multiplications more efficiently could find applica-
tion beyond neural networks, including in other machine-learning algorithms [40] and for combinatorial-optimization
heuristics [41–43].
The datasets generated during the characterization of dot-product accuracy (Figure 2) and the execution of the
ONN using different photon budgets (Figure 3), along with the code used to analyze them, are available at:
https://doi.org/10.5281/zenodo.4722066. The code we used to train neural networks with QAT in PyTorch is available
at: https://github.com/mcmahon-lab/ONN-QAT-SQL. The code for controlling the devices used in the experiments
is available at: https://github.com/mcmahon-lab/ONN-device-control.
ACKNOWLEDGMENTS
P.L.M. acknowledges membership of the CIFAR Quantum Information Science Program as an Azrieli Global Scholar.
T.W. and L.G.W. acknowledge partial financial support from the Mong Cornell Neurotech Fellow Program. The
authors wish to thank NTT Research for their financial and technical support. We thank Frank Wise for the loan of
spatial light modulators, and Chris Xu for the loan of imaging lenses that were used in our preliminary experiments.
We thank Ryan Hamerly for a helpful discussion on the energy scaling of optical fan-in.
Competing interests: T.W. and P.L.M. are listed as inventors on a U.S. provisional patent application (No.
63/149,974) on the techniques to implement 2D-block optical matrix-vector multipliers.
[1] Y. LeCun, Y. Bengio, and G. Hinton, Deep learning. Nature 521, 436–444 (2015).
[2] N. C. Thompson, K. Greenewald, K. Lee, and G. F. Manso, The computational limits of deep learning. arXiv:2007.05558
(2020).
[3] B. J. Shastri, A. N. Tait, T. F. de Lima, W. H. Pernice, H. Bhaskaran, C. D. Wright, and P. R. Prucnal, Photonics for
artificial intelligence and neuromorphic computing. Nature Photonics 15, 102–114 (2021).
[4] G. Wetzstein, A. Ozcan, S. Gigan, S. Fan, D. Englund, M. Soljačić, C. Denz, D. A. B. Miller, and D. Psaltis, Inference in
artificial intelligence with deep optics and photonics. Nature 588, 39–47 (2020).
[5] R. Hamerly, L. Bernstein, A. Sludds, M. Soljačić, and D. Englund, Large-scale optical neural networks based on photo-
electric multiplication. Physical Review X 9, 021032 (2019).
[6] M. A. Nahmias, T. F. De Lima, A. N. Tait, H.-T. Peng, B. J. Shastri, and P. R. Prucnal, Photonic multiply-accumulate
operations for neural networks. IEEE Journal of Selected Topics in Quantum Electronics 26, 1–18 (2020).
[7] A. Jassy, Keynote address at AWS re:Invent. URL: www.youtube.com/watch?v=7-31KgImGgU (2019).
[8] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, Efficient processing of deep neural networks: A tutorial and survey.
Proceedings of the IEEE 105, 2295–2329 (2017).
[9] H. J. Caulfield and S. Dolev, Why future supercomputing requires optics. Nature Photonics 4, 261–263 (2010).
[10] Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund,
and M. Soljačić, Deep learning with coherent nanophotonic circuits. Nature Photonics 11, 441 (2017).
[11] X. Lin, Y. Rivenson, N. T. Yardimci, M. Veli, Y. Luo, M. Jarrahi, and A. Ozcan, All-optical machine learning using
diffractive deep neural networks. Science 361, 1004–1008 (2018).
[12] A. Reuther, P. Michaleas, M. Jones, V. Gadepally, S. Samsi, and J. Kepner, Survey of machine learning accelerators.
arXiv:2009.00993 (2020).
[13] M. Horowitz, Computing’s energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits
Conference Digest of Technical Papers (ISSCC), 10–14 (2014).
[14] J. Feldmann, N. Youngblood, M. Karpov, H. Gehring, X. Li, M. Stappers, M. Le Gallo, X. Fu, A. Lukashchuk, A. S. Raja,
C. D. Wright, A. Sebastian, T. J. Kippenberg, W. H. P. Pernice, and H. Bhaskaran, Parallel convolutional processing using
an integrated photonic tensor core. Nature 589, 52–58 (2021).
[15] X. Xu, M. Tan, B. Corcoran, J. Wu, A. Boes, T. G. Nguyen, S. T. Chu, B. E. Little, D. G. Hicks, R. Morandotti,
A. Mitchell, and D. J. Moss, 11 TOPS photonic convolutional accelerator for optical neural networks. Nature 589, 44–51
(2021).
[16] A. N. Tait, T. F. De Lima, M. A. Nahmias, H. B. Miller, H.-T. Peng, B. J. Shastri, and P. R. Prucnal, Silicon photonic
modulator neuron. Physical Review Applied 11, 064043 (2019).
[17] P. Stark, F. Horst, R. Dangel, J. Weiss, and B. J. Offrein, Opportunities for integrated photonic neural networks. Nanopho-
tonics 9, 4221–4232 (2020).
[18] W. Bogaerts, D. Pérez, J. Capmany, D. A. B. Miller, J. Poon, D. Englund, F. Morichetti, and A. Melloni, Programmable
photonic circuits. Nature 586, 207–216 (2020).
[19] C. Wu, H. Yu, S. Lee, R. Peng, I. Takeuchi, and M. Li, Programmable phase-change metasurfaces on waveguides for
multimode photonic convolutional neural network. Nature Communications 12, 1–8 (2021).
[20] M. Miscuglio, Z. Hu, S. Li, J. K. George, R. Capanna, H. Dalir, P. M. Bardet, P. Gupta, and V. J. Sorger, Massively
parallel amplitude-only Fourier neural network. Optica 7, 1812–1819 (2020).
[21] J. W. Goodman, A. Dias, and L. Woody, Fully parallel, high-speed incoherent optical method for performing discrete
Fourier transforms. Optics Letters 2, 1–3 (1978).
[22] D. Psaltis, D. Brady, and K. Wagner, Adaptive optical networks using photorefractive crystals. Applied Optics 27, 1752–
1759 (1988).
[23] J. Dong, M. Rafayelyan, F. Krzakala, and S. Gigan, Optical reservoir computing using multiple light scattering for chaotic
systems prediction. IEEE Journal of Selected Topics in Quantum Electronics 26, 1–12 (2019).
[24] J. Chang, V. Sitzmann, X. Dun, W. Heidrich, and G. Wetzstein, Hybrid optical-electronic convolutional neural networks
with optimized diffractive optics for image classification. Scientific Reports 8, 1–10 (2018).
[25] M. W. Matthès, P. del Hougne, J. de Rosny, G. Lerosey, and S. M. Popoff, Optical complex media as universal reconfigurable
linear operators. Optica 6, 465–472 (2019).
[26] J. Bueno, S. Maktoobi, L. Froehly, I. Fischer, M. Jacquot, L. Larger, and D. Brunner, Reinforcement learning in a
large-scale photonic recurrent neural network. Optica 5, 756–760 (2018).
[27] J. Spall, X. Guo, T. D. Barrett, and A. Lvovsky, Fully reconfigurable coherent optical vector–matrix multiplication. Optics
Letters 45, 5752–5755 (2020).
[28] L. Bernstein, A. Sludds, R. Hamerly, V. Sze, J. Emer, and D. Englund, Freely scalable and reconfigurable optical hardware
for deep learning. Scientific Reports 11, 1–12 (2021).
[29] T. Zhou, X. Lin, J. Wu, Y. Chen, H. Xie, Y. Li, J. Fan, H. Wu, L. Fang, and Q. Dai, Large-scale neuromorphic optoelec-
tronic computing with a reconfigurable diffractive processing unit. Nature Photonics (2021). Advance online publication,
doi:10.1038/s41566-021-00796-w.
[30] C. Ramey, Silicon photonics for artificial intelligence acceleration. In Proceedings of the IEEE Hot Chips 32 Symposium
(HCS), 1–26 (2020).
[31] D. A. B. Miller, Waves, modes, communications, and optics: a tutorial. Advances in Optics and Photonics 11, 679–825
(2019).
[32] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, Quantization and training of
neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), 2704–2713 (2018).
[33] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers,
R. Boyle, P.-l. Cantin, C. Chao, C. Clark, C. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, et al.,
In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium
on Computer Architecture (ISCA), 1–12 (2017).
[34] X. Xu, M. Tan, B. Corcoran, J. Wu, T. G. Nguyen, A. Boes, S. T. Chu, B. E. Little, R. Morandotti, A. Mitchell, D. G.
Hicks, and D. J. Moss, Photonic perceptron based on a soliton crystal Kerr microcomb for high-speed, scalable, optical
neural networks. arXiv:2003.01347 (2020).
[35] W. Andregg, M. Andregg, R. T. Weverka, and L. Clermont, Wavelength multiplexed matrix-matrix multiplier. (U.S.
Patent No. 10,274,989). U.S. Patent and Trademark Office (2019).
[36] Y. Hayasaki, I. Tohyama, T. Yatagai, M. Mori, and S. Ishihara, Optical learning neural network using Selfoc microlens
array. Japanese Journal of Applied Physics 31, 1689 (1992).
[37] Y. Su, Y. Zhang, C. Qiu, X. Guo, and L. Sun, Silicon photonic platform for passive waveguide devices: materials,
fabrication, and applications. Advanced Materials Technologies 5, 1901153 (2020).
[38] G. W. Burr, R. M. Shelby, S. Sidler, C. Di Nolfo, J. Jang, I. Boybat, R. S. Shenoy, P. Narayanan, K. Virwani, E. U.
Giacometti, B. Kurdi, and H. Hwang, Experimental demonstration and tolerancing of a large-scale neural network (165
000 synapses) using phase-change memory as the synaptic weight element. IEEE Transactions on Electron Devices 62,
3498–3507 (2015).
[39] N. Youngblood, C. Chen, S. J. Koester, and M. Li, Waveguide-integrated black phosphorus photodetector with high
responsivity and low dark current. Nature Photonics 9, 247–252 (2015).
[40] C. De Sa, C. Zhang, K. Olukotun, and C. Ré, Taming the wild: A unified analysis of Hogwild!-style algorithms. In Advances
in Neural Information Processing Systems 28 (NeurIPS 2015), 28 (2015).
[41] M. Prabhu, C. Roques-Carmes, Y. Shen, N. Harris, L. Jing, J. Carolan, R. Hamerly, T. Baehr-Jones, M. Hochberg,
V. Čeperić, J. D. Joannopoulos, D. R. Englund, and M. Soljačić, Accelerating recurrent Ising machines in photonic
integrated circuits. Optica 7, 551–558 (2020).
[42] P. L. McMahon, A. Marandi, Y. Haribara, R. Hamerly, C. Langrock, S. Tamate, T. Inagaki, H. Takesue, S. Utsunomiya,
K. Aihara, R. L. Byer, M. M. Fejer, H. Mabuchi, and Y. Yamamoto, A fully programmable 100-spin coherent Ising machine
with all-to-all connections. Science 354, 614–617 (2016).
[43] T. Inagaki, Y. Haribara, K. Igarashi, T. Sonobe, S. Tamate, T. Honjo, A. Marandi, P. L. McMahon, T. Umeki, K. Enbutsu,
O. Tadanaga, H. Takenouchi, K. Aihara, K.-i. Kawarabayashi, K. Inoue, S. Utsunomiya, and H. Takesue, A coherent Ising
machine for 2000-node optimization problems. Science 354, 603–606 (2016).
[44] A. Coates, A. Ng, and H. Lee, An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the
Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 215–223 (2011).
[45] X. Glorot, A. Bordes, and Y. Bengio, Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International
Conference on Artificial Intelligence and Statistics (AISTATS), 315–323 (2011).
[46] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Des-
maison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala,
PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing
Systems 32 (NeurIPS 2019), 32 (2019).
[47] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, Binarized neural networks. In Advances in Neural
Information Processing Systems 29 (NeurIPS 2016), 29 (2016).
[48] C. Shorten and T. M. Khoshgoftaar, A survey on image data augmentation for deep learning. Journal of Big Data 6, 1–48
(2019).
METHODS
Experimental Setup
We used the OLED display of an Android phone (Google Pixel 2016) as the incoherent light source for encoding input
vectors to the matrix-vector multiplier. Only green pixels (with an emission spectrum centered around 525 nm) were
used in the experiments; the OLED display contains an array of ∼2 million (1,920×1,080) green pixels (Supplementary
Information Fig. S3). Custom Android software was developed to load bitmap images onto the OLED display through
Python scripts running on a control computer. The phone was found capable of displaying 124 distinct brightness
levels (∼7 bits) in a linear brightness ramp (Supplementary Information Fig. S3). At the beginning of each matrix-
vector-multiplication computation, the vector was reshaped into a 2D block (Figure 1a) and displayed as an image on
the phone screen for the duration of the computation. The brightness of each OLED pixel was set to be proportional to
the value of the non-negative vector element it encoded. Fan-out of the vector elements was performed by duplicating
the vector block on the OLED display.
Scalar multiplication of vector elements with non-negative numbers was performed by intensity modulation of the
light that was emitted from the OLED pixels. An intensity-modulation module was implemented by combining a
phase-only reflective liquid-crystal spatial light modulator (SLM, P1920-500-1100-HDMI, Meadowlark) with a po-
larizing beam splitter and a half-wave plate in a double-pass configuration (Supplementary Information Section 3).
An intensity look-up table (LUT) was created to map SLM pixel values to transmission percentages, with an 8-bit
resolution (Supplementary Information Section 3).
Element-wise multiplication between two vectors w⃗ and x⃗ was performed by aligning the image of each OLED
pixel (encoding an element of x⃗) to its counterpart pixel on the SLM (encoding an element of w⃗) (Figure 1b). By
implementing such pixel-to-pixel alignment, as opposed to aligning patches of pixels to patches of pixels, we maximized
the size of the matrix-vector multiplication that could be performed by this setup. A zoom-lens system (Resolv4K,
Navitar) was employed to de-magnify the image of the OLED pixels by ∼0.16× to match the pixel pitch of the SLM
(Supplementary Information Section 5). The image of each OLED pixel was diffraction-limited with a spot diameter
of ∼6.5 µm, which is smaller than the 9.2 µm size of pixels in the SLM, to avoid cross-talk between neighboring pixels.
Pixel-to-pixel alignment was achieved for ∼0.5 million pixels (Supplementary Information Section 5). This enabled the
setup to perform vector-vector dot products with 0.5-million-dimensional vectors in single passes of light through the
setup (Figure 2b). The optical fan-in operation was performed by focusing the modulated light field onto a detector,
through a 4f system consisting of the rear adapter of the zoom-lens system and an objective lens (XLFLUOR4x/340,
NA=0.28, Olympus) (Fig. S1 and Supplementary Information Section 9).
The detectors measured optical power by integrating the photon flux incident on the detector’s active area over
a specified time window. Different types of detector were employed for different experiments. A multi-pixel photon
counter (MPPC, C13366-3050GA, Hamamatsu) was used as a bucket detector for low-light-level measurements. This
detector has a large dynamic range (pW to nW) and moderately high bandwidth (∼3 MHz) (Supplementary Infor-
mation Section 4). The MPPC outputted a single voltage signal representing the integrated optical energy of the
spatial modes focused onto the detector area by the optical fan-in operation (Supplementary Information Section 9).
The MPPC is capable of resolving the arrival time of single-photon events for low photon fluxes (<10⁶ per second);
for higher fluxes that exceed the bandwidth of the MPPC (∼3 MHz), the MPPC output voltage is proportional to the
instantaneous optical power (Fig. S6 and Supplementary Information Section 4). The SNR of the measurement
of optical energy with the MPPC is about half of the shot-noise-limited SNR (Supplementary Information Section 8).
Since the MPPC does not provide spatial resolution within its active area, it effectively acts as a single-pixel detector
(Figure 1c) and consequently could only be used to read out one dot product at a time. For parallel computation of
multiple dot products (as is desirable when performing matrix-vector multiplications that are decomposed into many
vector-vector dot products), a CMOS camera (Moment-95B, monochromatic, Teledyne) was used. The intensity of
the modulated light field was captured by the camera as an image, which was divided into regions of interest (ROIs),
each representing the result of an element-wise product of two vectors. The pixels in each ROI could then be summed
digitally to obtain the total photon counts, which correspond to the value of the dot product between the two vectors.
Compared to the MPPC, the CMOS camera was able to capture the spatial distribution of the modulated light
but could not be used for the low-photon-budget experiments due to its much higher readout noise (∼2 electrons
per pixel) and long frame-exposure time (≥10 µs). Consequently the camera was only used for setup alignment and
for visualizing the element-wise products of two vectors with large optical powers, and the MPPC was used for the
principal experiments in this work—namely, vector-vector dot-product calculation and matrix-vector multiplication
involving low numbers of photons per scalar multiplication (Figure 2 and Figure 3).
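For the camera-based readout, the digital fan-in reduces to a block-wise sum over ROIs; a minimal sketch with hypothetical layout parameters:

```python
import numpy as np

def dot_products_from_frame(frame, block_shape, grid_shape):
    """Split a camera frame of element-wise products into a grid of ROIs
    (one per matrix row) and sum the pixel counts within each ROI.

    frame: (grid_h * h, grid_w * w) array of photon counts.
    block_shape: (h, w) pixels per vector block.
    grid_shape: (grid_h, grid_w) layout of blocks on the sensor.
    """
    h, w = block_shape
    gh, gw = grid_shape
    rois = frame.reshape(gh, h, gw, w)
    return rois.sum(axis=(1, 3)).ravel()   # one dot product per ROI

# Example: a 2x2 grid of 28x28 blocks -> 4 dot products from one frame.
frame = np.random.default_rng(3).poisson(5.0, size=(56, 56))
print(dot_products_from_frame(frame, (28, 28), (2, 2)))
```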
The numerical accuracy of dot products was characterized with pairs of vectors consisting of non-negative elements;
since there is a straightforward procedural modification to handle vectors whose elements are signed numbers, the
results obtained are general (Supplementary Information Section 11). The dot-product answers were normalized such
that the answers for all the vector pairs used fall between 0 and 1; this normalization was performed such that the
difference between true and measured answers could be interpreted as the achievable accuracy in comparison to the
full dynamic range of possible answers (for the equations used for the error calculation, see Supplementary Information
Section 12).
Before the accuracy-characterization experiments were performed, the setup was calibrated by recording the output
of the detector for many different pairs of input vectors and fitting the linear relationship between the dot-product
answer and the detector’s output (Supplementary Information Section 12).
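A minimal sketch of such a linear calibration, with synthetic stand-in numbers in place of the measured detector outputs:

```python
import numpy as np

rng = np.random.default_rng(4)

# Stand-in calibration data: the detector output is assumed linear in the
# true dot product, with unknown gain and offset plus measurement noise.
y_truth = rng.random(200)
detector = 3.7 * y_truth + 0.2 + rng.normal(0, 0.01, size=200)

gain, offset = np.polyfit(detector, y_truth, deg=1)  # least-squares line

def calibrated_dot_product(reading):
    """Map a raw detector reading to a calibrated dot-product value."""
    return gain * reading + offset

print(calibrated_dot_product(detector[:3]), y_truth[:3])
```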
The vector pairs used for accuracy characterization were generated from randomly chosen grayscale natural-scene
images (STL-10 dataset [44]). The error of each computed dot product was defined as the difference between the
measured dot-product result and the ground truth calculated by a digital computer (Figure 2a). The number of
photons detected for each dot product was tuned by controlling the integration time window of the detector (Sup-
plementary Information Section 12). The measurements were repeated many times to capture the error distribution
resulting from noise. For each vector size displayed in Figure 2c, the dot products for 100 vector pairs were computed.
The root-mean-square (RMS) error was calculated based on data collected for different vector pairs and multiple
measurement trials. Therefore, the RMS error includes contributions from both the systematic error and trial-to-trial
error resulting from noise. The RMS error can be interpreted as the “expected” error from a single-shot compu-
tation of a dot product with the optical processor. The noise equivalent bits were calculated using the formula [6]
NEB = −log₂(RMS error).
To perform handwritten-digit classification, we trained a neural network with 4 fully connected layers (Figure 3a).
The input layer consists of 784 neurons, corresponding to the 28 × 28 = 784 pixels in grayscale images of handwritten
digits. This is followed by two fully connected hidden layers with 100 neurons each. We used ReLU [45] as the
nonlinear activation function. The output layer has 10 neurons; each neuron corresponds to a digit from 0 to 9, and
the prediction of which digit is contained in the input image is made based on which of the output neurons had
the largest value. The neural network was implemented and trained in PyTorch [46]. The training of the neural
network was conducted exclusively on a digital computer (our optical experiments perform neural-network inference
only). To improve the robustness of the model against numerical error, we employed quantization-aware training
(QAT) [32], which was set to quantize the activations of neurons to 4 bits and weights to 5 bits. The PyTorch
implementation of QAT was adapted from Ref. [47]. In addition, we performed data augmentation: we applied small
random affine transformations and convolutions to the input images during training. This is a standard technique in
neural-network training for image-classification tasks to avoid overfitting [48] and intuitively should also improve the
model’s tolerance to potential hardware imperfections (e.g., image distortion and blurring). The training parameters
we used are documented in Supplementary Information Section 13. The training methods used not only effectively
improved model robustness against numerical errors but also helped to reduce the optical energy consumption during
inference. We note that the 4-bit quantization of neuron activations was only performed during training, and not
during the inference experiments conducted with the optical matrix-vector multiplier: the activations were loaded
onto the OLED display using the full available precision (7 bits).
To execute the trained neural network with the optical matrix-vector multiplier, we needed to perform 3 different
matrix-vector multiplications, each responsible for the forward propagation from one layer to the next. The weights
of each matrix of the MLP model were loaded onto the SLM, and the vector encoding the neuron values for a
particular layer was loaded onto the OLED display. (There is a technicality associated with the handling of negative
values, as was mentioned in the above Methods section on dot-product characterization and is explained in detail
in Supplementary Information Section 11.) We performed matrix-vector multiplication as a set of vector-vector dot
products. For each vector-vector dot product, the total photon counts (or optical energy) measured by the detector
were mapped to the answer of the dot product through a predetermined calibration curve. The calibration curve
was made using the first 10 samples of the MNIST test dataset by fitting the measured photon counts to the ground
truth of the dot products (Supplementary Information Section 14). The number of photons per multiplication was
controlled by adjusting the detector’s integration time (Supplementary Information Section 12). The measured dot-
product results were communicated to a digital computer where bias terms were added and the nonlinear activation
function (ReLU) was applied. The resulting neuron activations of each hidden layer were used as the input vector to
the matrix-vector multiplication for the next weight matrix. At the output layer, the prediction was made in a digital
computer based on the neuron with the highest value.
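Putting these Methods steps together, the hybrid forward pass can be sketched as below; optical_matvec is a placeholder for the SLM/OLED/detector hardware call (here it is an ideal digital product), and the weights and biases are hypothetical stand-ins for the trained parameters:

```python
import numpy as np

def optical_matvec(W, x):
    # Placeholder: in the real system, W is loaded onto the SLM, x onto
    # the OLED display, and each element of y is read out from calibrated
    # photon counts (with shot noise).
    return W @ x

def onn_inference(image, weights, biases):
    """Hybrid inference: optical matrix-vector products, with bias addition
    and the ReLU nonlinearity applied digitally between layers."""
    a = image.ravel()
    for i, (W, b) in enumerate(zip(weights, biases)):
        z = optical_matvec(W, a) + b                         # bias added digitally
        a = np.maximum(z, 0) if i < len(weights) - 1 else z  # ReLU on hidden layers
    return int(np.argmax(a))   # predicted digit: output neuron with highest value

rng = np.random.default_rng(5)
shapes = [(100, 784), (100, 100), (10, 100)]
weights = [rng.normal(size=s) for s in shapes]
biases = [rng.normal(size=s[0]) for s in shapes]
print(onn_inference(rng.random((28, 28)), weights, biases))
```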
Supplementary Information for "An Optical Neural Network Using Less Than 1 Photon Per Multiplication"
Part I
Experimental Setup
1. OVERVIEW OF THE EXPERIMENTAL SETUP AND COMPONENTS
The optical matrix-vector multiplier setup consists of an array of light sources, a zoom lens imaging system, an
intensity modulator, and a photodetector (Fig. S1). We used an organic light-emitting diode (OLED) display of
a commercial smartphone (Google Pixel 2016 version) as the light source for encoding input vectors. The OLED
display consists of a 1920 × 1080 pixel array, with individually controllable intensity for each pixel (for details, see
Section 2). A reflective liquid-crystal spatial light modulator (SLM, P1920-500-1100-HDMI, Meadowlark Optics)
was combined with a half-wave plate (HWP, WPH10ME-532, Thorlabs) and a polarizing beamsplitter (PBS, CCM1-
PBS251, Thorlabs) to perform intensity modulation as weight multiplication (for details, see Section 3). The SLM
has a pixel array of dimensions 1920 × 1152, with individually controllable transmission for each pixel. A zoom lens
system (Resolv4K, Navitar) was used to image the OLED display onto the SLM panel (for details, see Section 5). The
intensity-modulated light field reflected from the SLM was further de-magnified and imaged onto the detector, by a
telescope formed by the rear adapter of the zoom lens (1-81102, Navitar) and an objective lens (XLFLUOR4x/340,
Olympus). An additional band-pass filter (BPF, FF01-525/15-25, Semrock) and polarizer (LPVISE100-A, Thorlabs)
were inserted into the telescope (Fig. S1) in order to reduce the bandwidth and purify the polarization of the light
reflected by the PBS, resulting in more precise results. During alignment and troubleshooting, we used a camera
(Prime 95B Scientific CMOS Camera, Teledyne Photometrics) as a multi-pixel detector (Fig. S2a). For sensitive
measurements under extremely low photon fluxes, we used a multi-pixel photon counter (MPPC, C13366 series GA
type, Hamamatsu Photonics) as a bucket detector (Fig. S2b) (for details, see Section 9). When it was necessary to
further reduce the optical power, an additional neutral density filter (ND=0.4, NE2R04B, Thorlabs) was placed in
front of the zoom lens to attenuate light.
Fig. S1: Experimental Layout for the Optical Matrix-vector Multiplier Setup. a, The schematic of the
optical setup. An illustration of the element-wise multiplication is also shown. In this example, nine copies of an
input vector encoding a handwritten digit '3' (which our setup accepts as a 2D image) are intensity-modulated by
different weight vectors, each encoded as a 2D block on the SLM. Images of the vectors before and after
the intensity modulation were taken by a camera placed at the detector location. (PBS: polarizing beam splitter;
HWP: half-wave plate; BPF: band-pass filter) b, Photos of the core setup corresponding to the schematic shown
in panel (a).
Fig. S2: Photos of the Entire Setup. a, The setup configured with a CMOS camera as a multi-pixel detector.
b, The setup configured with an MPPC as a sensitive bucket detector.
We chose an OLED display as the light source for several reasons. As incoherent sources, OLEDs cannot encode
phase information and generally have slower modulation speeds than coherent sources (e.g., lasers). However, the
technology to integrate millions of OLED pixels is more commercially available and less expensive than integrated
laser units (e.g., VCSELs) or passive modulator arrays.
encode very large vectors, and was essential for demonstrating the optical energy advantage of our setup.
Compared to other integrated incoherent light sources (e.g., liquid-crystal displays, LCDs), OLED pixels
can be turned off completely, while LCD screens are always backlit with LED panels and thus always transmit some
residual light. The true darkness of OLED pixels allowed us to achieve high dynamic range in intensity modulation
and to reduce noise caused by background light pollution. Finally, the OLED display pixels used in this experiment
were conveniently point-shaped and arranged in the same square lattice array as the SLM pixels (Fig. S3a), which
facilitated pixel-to-pixel alignment (see Section 5). In contrast, commercial LCD pixels are typically arranged in bars,
which require optical image transformation before they can be aligned to SLM pixels.
Fig. S3: Properties of the Organic Light-Emitting Diode (OLED) Display. a, An image of the green
OLED pixels taken under an inspection microscope. The pixels form a square lattice with a pixel pitch of 57.5 µm.
Scale bar, 200 µm. b, Emission spectrum of the green OLED pixels. The shaded area indicates the transmission
band of the BPF (> 90% transmission). c, The 7-bit linear look-up table (LUT) calibrated to control the OLED
display intensity.
For our study, we used an OLED display with three different colors of pixels: red, blue, and green. We used only
the green pixels, which form a 1080 × 1920 square lattice array (∼2 × 10⁶ total pixels) as shown in Fig. S3a. The
pixel pitch was measured to be 57.5 µm. The maximum power of each pixel was measured to be ∼1 nW, emitted in
a very wide angle (> 60 degrees). Since the light emitted from the OLED screen had a rather broad spectrum, we
used a band-pass filter (FF01-525/15-25, Semrock) to reduce the bandwidth in order to improve coherence for more
precise and stable phase modulation by the SLM (Fig. S3b). The intensity of each individual pixel could be controlled
independently with 256 (8-bit) control levels. However, since the actual output intensity was not linear with the pixel
control level, we calibrated a linear look-up table (LUT) that contains 124 distinct intensity levels (∼7 bits, Fig. S3c).
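A sketch of how such a linear LUT can be built from a measured (nonlinear) brightness response; the gamma curve below is a hypothetical stand-in for the actual measurement:

```python
import numpy as np

raw_levels = np.arange(256)                    # 8-bit control levels
measured = (raw_levels / 255.0) ** 2.2         # stand-in response curve

# For each step of a desired linear ramp, pick the raw control level
# whose measured intensity is closest (assumes a monotonic response).
targets = np.linspace(0.0, measured.max(), 124)
lut = np.array([np.abs(measured - t).argmin() for t in targets])

# lut[k] is the raw control level producing the k-th linear intensity step.
print(lut[:8], lut[-8:])
```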
We converted a phase-only SLM into an intensity modulator with a half-wave plate (HWP) and a polarizing beam
splitter (PBS). The SLM pixels are made of birefringent liquid crystal layers, whose refractive index can be tuned by
applying voltage across them. By controlling the refractive index experienced by the extraordinary light, the SLM pixels introduce a
phase difference φe − φo between the extraordinary and ordinary light, whose polarizations are perpendicular to each
other. When a PBS and HWP were placed in front of a reflective SLM, the light field passed the components twice,
once during the trip towards the SLM and once after being reflected by the SLM (Fig. S1a). One of the functions of
the PBS was to separate the output from the input light: the input light (incident to the SLM) was horizontally polarized
and transmitted by the PBS, while the output light (reflected from the SLM) was vertically polarized, and therefore
reflected by the PBS. The other function of the PBS is to convert the polarization state of the output light to its
amplitude: the light modulated by the SLM was in general elliptically polarized, controlled by the phase difference
φe − φo . The amplitude of the light field (and intensity in this case too) was modulated by selecting only the vertical
component of the SLM-modulated light at the output port of the PBS. The HWP was placed with its fast axis rotated
22.5 degrees from the extraordinary axis of the SLM such that the intensity transmission could be tuned from 0 to
100%. Fig. S4a shows the calculated relationship between the intensity transmission and phase difference φe − φo .
The maximum extinction ratio of the transmission intensity was measured to be ∼50 (Fig. S4b). The SLM consists
of 1920 × 1152 ≈ 2.2 × 10⁶ pixels, each of which can be independently controlled for intensity modulation with a
256 (8-bit) LUT (Fig. S4c). Alternatively, instead of using a phase-modulation SLM, the intensity modulator can be
more compactly implemented with a monolithic LCD panel in a transmission geometry.
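To make the mapping from phase difference to transmission concrete, the following is a minimal Jones-calculus sketch of the double-pass HWP + SLM + PBS configuration, assuming ideal retarders and ignoring the handedness flip on reflection; it is an idealized model, not the calibration code used in the experiment:

```python
import numpy as np

def retarder(phase, theta):
    """Jones matrix of a linear retarder with retardance `phase` (rad)
    and fast axis at angle `theta` (rad) from horizontal."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    J = np.diag([np.exp(-1j * phase / 2), np.exp(1j * phase / 2)])
    return R @ J @ R.T

hwp = retarder(np.pi, np.deg2rad(22.5))  # HWP fast axis 22.5 deg from SLM axis
E_in = np.array([1.0, 0.0])              # horizontally polarized input (PBS transmitted)

for phi in np.linspace(0, np.pi, 5):     # phase difference phi_e - phi_o
    slm = retarder(phi, 0.0)             # SLM as a variable retarder along its own axis
    E_out = hwp @ slm @ hwp @ E_in       # double pass: HWP -> SLM -> HWP
    T = abs(E_out[1]) ** 2               # PBS selects the vertical component
    print(f"phi = {phi:.2f} rad -> T = {T:.2f}")  # T = sin^2(phi/2): tunable 0 to 100%
```

With the HWP at 22.5°, this model reproduces the full 0–100% transmission range of Fig. S4a.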
Fig. S4: Intensity Modulation with a Spatial Light Modulator (SLM). a, Simulation results of output
intensity as a function of phase difference φe − φo . α is the angle between the half-wave plate (HWP) fast axis and
the extraordinary axis of the SLM. b, The 8-bit LUT of the SLM for intensity modulation. The minimum
transmission was measured to be ∼0.02 times the maximum transmission, which is equivalent to an extinction ratio
of ∼50.
For single-photon detection, we used a multi-pixel photon counter (MPPC) as a bucket detector. We chose an
MPPC for its high signal-to-noise ratio (SNR), large measurement range, and moderately high bandwidth. The
MPPC is composed of an array of Geiger-mode photodiodes with high intrinsic gain, which enables the photodiodes
to detect single-photon events. The detection of each photon results in a spike-shaped impulse response in the output
voltage of the detector, with a sharp rising edge and an approximately exponentially decaying tail (Fig. S5a).
Fig. S5: Time and Frequency Characteristics of the Multi-Pixel Photon Counter (MPPC). a, The
impulse response of single-photon detection averaged over 500 trials. b, The frequency response and Power Spectral
Density of the MPPC.
When the photon flux is extremely low (<10⁷ photons per second), the detected photons can be enumerated by counting the number of spikes (Fig. S6a). The maximum measurable photon flux is limited by the bandwidth of the detector (Fig. S5b) and potentially by the dead time after each photon detection. To increase the maximum measurable photon flux (or optical power), the MPPC spatially multiplexes detection over a 60 × 60 photodiode array (with 50 µm × 50 µm pixel pitch), the outputs of which are pooled into a single analog voltage trace. When the photon flux far exceeds the MPPC bandwidth (10⁷ photons per second), the pulses induced by individual photons overlap in time and can no longer be resolved (Fig. S6b). In this scenario, we measured the average output voltage, which maintains an excellent linear relationship with the average optical power impinging on the detector.
The MPPC output voltage was calibrated against the power reading of a semiconductor power meter (818-UV-L-
FC/DB, Newport), and the calibration result closely agreed with the manufacturer’s specifications of the MPPC (Fig.
S6d). Therefore, we were able to use the detector as a fast power meter to measure instantaneous optical power from
pW up to several nW (Fig. S6d). In this experiment, optical power >6 pW was measured by converting the output
voltage of the MPPC to the optical power impinging on the detector. Compared to regular semiconductor power
meters without intrinsic gain—which can also measure ∼pW levels of optical power—the MPPC can maintain a high
SNR for a much higher bandwidth (∼3 MHz, Fig. S5b), since its signal is amplified to overcome the noise integrated
over the larger bandwidth.
When the MPPC is used as a power meter, the minimum power that can be measured is determined by the analog
noise floor (including dark counts, thermal noise, and other electronic noises), which was measured to be equivalent
to 1.25 pW optical power at the full bandwidth (Fig. S6c). The dark count of the MPPC was measured to be ∼10⁴
photons per second (<10 fW), which accounted for less than 1% of the total noise. Therefore, the detector can in
principle measure optical power even below the analog noise equivalent power of 1.25 pW by means of photon counting.
In our experiments, the photon counting measurement was conducted only to verify that the detector could indeed
resolve single-photon events, and to determine the minimum valid optical power it could measure (∼10 fW). Since
the optical powers involved in our experiments were higher than the analog noise floor (∼1.25 pW), they were all
measured via direct readout of the detector's output voltage.
Fig. S6: Photon Detection and Optical Power Measurement with the MPPC. a, Single-photon detection
under low photon flux. Different colors indicate instances of independent measurement trials. b, Instantaneous
optical power measurement under high photon flux. c, The detector noise floor, which determines the lowest
measurable optical power when the MPPC is used as a power meter. The left panel shows examples of independent
measurements of the baseline analog noise; the right panel shows the noise distribution and statistics. d, The linear
relationship between the average MPPC output voltage and the average optical power impinging on the detector.
Calibration of the optical power (plot group “Measurement”) was performed independently with a semiconductor
power meter.
In order to maximize the matrix-vector multiplication size—and thus maximize the energy benefits of optical
processing—we aligned as many pixels as possible from the OLED display to the SLM. Three conditions must be
satisfied for this pixel-to-pixel alignment:
1. The OLED display must be imaged onto the SLM with a precise de-magnification factor to match the pitch of
OLED pixels to that of the SLM pixels.
2. The imaging resolution needs to be high enough that the image of each OLED pixel on the SLM is no larger than an SLM pixel. This is to prevent crosstalk.
3. The image of each OLED pixel must be aligned to the corresponding pixel on the SLM, which requires fine
adjustment of the translation, rotation, pitch, and yaw of each device involved.
To match the pixel size of the OLED pixel image to the SLM pixel size, the zoom lens was set at a de-magnification
of 9.2 µm/57.5 µm = 0.16. This zoom factor was achieved when the zoom lens (Resolv4K 1-80100, Navitar) was
configured with a 0.25× lens attachment (1-81201, Navitar) and 1× rear adapter (1-81102, Navitar). The zoom
factor of the zoom lens could be mechanically tuned continuously to precisely match the OLED and SLM pixel pitch.
Under this configuration, the spot size in the object plane of the zoom lens (the OLED side) is 40.85 µm in diameter
(Rayleigh criterion) according to the manufacturer’s specifications. This spot size is smaller than the OLED pixel
pitch size of 57.5 µm. Meanwhile, the spot size in the imaging plane (the SLM side) is specified to be 6.52 µm in
diameter (Rayleigh criterion), which is also smaller than the SLM pixel pitch size of 9.2 µm. In fact, the performance
of the zoom lens system was close to the diffraction limit, and the images of OLED pixels on the SLM plane were
well separated (Fig. S7). Therefore, the setup achieved the correct de-magnification factor and possessed adequate
resolution, and both conditions (1) and (2) were satisfied.
To align each OLED pixel to the corresponding SLM pixel, mechanical alignment was performed using the following
method (Fig. S7). First, identical images of the same size were displayed on both the OLED display and SLM. The
bright pixels on the OLED display corresponded to pixels of full transmission on the SLM. The dark pixels on the
OLED display corresponded to the pixels of zero transmission on the SLM. Therefore, the SLM functioned like a
mask, with its light-transmitting parts identical in shape and size to the bright image on the OLED display. After
intensity modulation by the SLM, the image on the OLED is preserved without any clipping if and only if the OLED and SLM pixels are exactly aligned to each other, with the correct orientation (Fig. S7).
The maximum number of pixels that could be aligned was determined by the imaging error on the side of the
field-of-view (FOV) of the zoom lens. According to the specifications of the zoom lens, at most 3.6 million OLED
pixels in a square array can be aligned to the SLM. However, this estimation assumes diffraction-limited performance
across the entire FOV. In practice, we managed to align 711 × 711 ≈ 0.5 million pixels. There were three reasons for
the deterioration of pixel-to-pixel alignment towards the side of the FOV:
1. Vignetting: The optical transmission drops off towards the edge of the FOV, which causes up to a ∼90 % decrease
in intensity for the 711 × 711 pixel array. As a result, the pixels around the center of the FOV must be dimmed
in order to keep all pixels at the same brightness, as required for the computation of large dot products.
2. Image Distortion: Nonlinear image distortion (such as barrel, pincushion, or other higher-order distortions) leads to a non-uniform local zoom factor of the OLED pixel image. Although linear image distortions can be
corrected by mechanical alignment, nonlinear distortions cannot be completely fixed by alignment. Even slight
distortions can cause pixels to slowly drift away from each other until eventually there is misalignment towards
the side of the FOV.
3. Aberration: The aberration of the zoom lens increases towards the edge of its FOV and causes the focal spots
to deviate from the diffraction-limited spot size. This causes the expansion of the OLED pixel image on the
SLM plane, which couples part of the optical energy emitted from each OLED pixel into the surrounding SLM
pixels that have incorrect transmissions for weight encoding.
The non-uniform transmission caused by vignetting could be fixed by making a pixel-wise LUT of the OLED display,
which is described in Section 6. While there is no easy solution to nonlinear image distortion and aberration, we
characterize them in Section 7.
Fig. S7: Pixel-to-pixel Alignment Between the OLED and SLM Pixels. a, The image of a viewfinder
pattern on the OLED display. The scale bar measures the distance on the SLM panel, 200 µm. b, The image of the
identical viewfinder pattern modulated by the SLM. The entire SLM panel was uniformly illuminated by an ambient
light source. The SLM performed intensity modulation, functioning as a mask with a cutout of the viewfinder shape. c, Visualization of the alignment between the OLED and SLM pixels.
To enable computation of large dot products, we corrected for the intensity fall-off towards the edge of the FOV caused by vignetting of the optical system, as discussed in Section 5. The correction was especially important for optical fan-in
(discussed in Section 9), since each pixel, regardless of its position in the FOV, should contribute the same amount
of optical energy to the detector, if set at the same pixel value.
Correction was performed by making an attenuation map that compensates for different transmissions of pixels
at different locations. We first configured the OLED to display at the maximum pixel brightness uniformly across
the entire FOV, and then captured an image of the OLED display at the detector plane. Due to the vignetting
of the optical system, the intensity distribution was not uniform at the detector plane (Fig. S8, top left panel). A
region of interest (ROI) was then chosen, in which we sought to achieve uniform intensity for all OLED pixels. The
minimum intensity in the ROI was set as the target intensity value (circled areas in the top left panel of Fig. S8).
The brightness of other OLED pixels was reduced iteratively until the intensity of their images matched that of the
target value. The result of the correction is shown in Fig. S8, top right panel, where uniform intensity was achieved
in an ROI of size 720 × 720. Meanwhile, an attenuation map was established to determine the percentage by which
each OLED pixel should be attenuated in order to achieve a uniform output intensity.
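A minimal sketch of this iterative correction follows; `capture` is a hypothetical helper standing in for the display-and-measure loop of the real setup:

```python
import numpy as np

def build_attenuation_map(capture, shape=(720, 720), n_iter=20, tol=0.01):
    """Iteratively dim the OLED pixels until every pixel in the ROI reaches
    the intensity of the darkest one. `capture(atten)` is assumed to display
    a uniform maximum-brightness image scaled per pixel by `atten` and to
    return the intensity image measured at the detector plane."""
    atten = np.ones(shape)                 # start at full brightness everywhere
    measured = capture(atten)
    target = measured.min()                # darkest ROI pixel sets the target
    for _ in range(n_iter):
        # Dim each pixel in proportion to how far it overshoots the target.
        atten *= target / np.maximum(measured, target)
        measured = capture(atten)
        if measured.max() - target < tol * target:
            break                          # uniform to within the tolerance
    return atten
```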
Fig. S8: Correction of Non-uniform Transmission of the Optical System. The original intensity
distribution is shown in the top left panel (“Original”), with intensity falling off towards the edges of the region of
interest (ROI). The correction procedure reduced the intensity of pixels near the center to match the target value
near the darkest corner (720, 720) in the selected ROI. The image after correction is shown in the top right panel
(“Corrected”). The bottom panel shows intensity along the diagonal pixels of the ROI (solid white lines) before and
after correction.
We examined two aspects of the optical system’s imaging quality: walk-off of pixel alignment due to image distortion,
and degradation of focal spots due to aberration (both discussed in Section 5). To visualize these effects, we captured
an image of a sparse 2D grid of pixels displayed on the OLED screen. The grid was composed of blocks with an edge
length of 80 pixels, with only the pixel at the center of each block turned on. Fig. S9a shows how the imaging
quality of single pixels changed across an ROI of 720 × 720 pixels. For example, images of a single pixel tended to be
sharp and focused near the center of the FOV, while images of pixels towards the corner of the FOV spread out into a
streak along the radial direction. This is probably due to coma aberration, which is common to most imaging systems
(Fig. S9a). Meanwhile, the walk-off of pixel alignment could be observed by the deviation of the focus from the center
of each block (Fig. S9a). As discussed in Section 5, pixel walk-off due to linear image distortion can be corrected
with careful mechanical alignment, while nonlinear distortion cannot be eliminated. Based on Fig. S9a, the pixel
walk-off was insignificant for the ROI of 720 × 720. Incidentally, the largest possible ROI for optical matrix-vector
multiplication was determined by the trade-off between optical power transmission and imaging quality distribution
(Fig. S8 and Fig. S9a). Since the imaging quality was better on the right side, the ROI was shifted slightly to the
right of the brightest part of the intensity distribution.
Fig. S9: Imaging Quality of the OLED Pixels. a, Images of the OLED pixels in a region of interest (ROI) of size 720 × 720. Only the pixels on the grid points of a 2D grid (with an edge length of 80 pixels) were switched on. The OLED display was first imaged onto the SLM, and then relayed to the detector plane where the images were captured by a camera. The intensity inside each block was normalized to the maximum pixel value in the block. b(c), The 7 × 7 crosstalk kernels near the center (corner) of the FOV. Each entry denotes the percentage of optical power coupled into each individual SLM pixel; ideally all the power would couple into the central pixel.

Pixel walk-off and crosstalk both resulted in errors in weight modulation, due to the coupling of optical energy into neighboring SLM pixels with incorrect modulation weights. These effects were quantified and modeled as so-called
crosstalk kernels, which are similar to convolution kernels but vary gradually in space. Fig. S9b, c show the intensity
distribution of a focal spot near the center (corner) of the FOV in a 7 × 7 block of SLM pixels, with the central pixel
denoting the SLM pixel that the bright OLED pixel should align to. When both input vectors and weights were
natural images (whose pixel values vary more smoothly, as is typical of the inputs and weights of the first layer of neural networks for image classification), the error caused by walk-off and crosstalk was less severe. For applications in optical neural
networks (ONNs), such imaging errors were modeled during the training process by random affine transforms and 2D
convolution to enhance the model’s resilience to imaging errors (Section 13).
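To illustrate how such a crosstalk kernel can be folded into training, the following is a minimal PyTorch sketch; the 3 × 3 kernel values are illustrative placeholders, not the measured kernels of Fig. S9b,c:

```python
import torch
import torch.nn.functional as F

# Hypothetical 3x3 crosstalk kernel: the central pixel keeps most of the
# power, and the rest leaks into the 8 neighbors (entries sum to 1.0).
kernel = torch.tensor([[0.01, 0.02, 0.01],
                       [0.02, 0.88, 0.02],
                       [0.01, 0.02, 0.01]])

def apply_crosstalk(image):
    """Blur a 2D block (input or weight image) with the crosstalk kernel,
    mimicking the optical energy coupled into neighboring SLM pixels."""
    x = image[None, None, :, :]        # add batch and channel dimensions
    k = kernel[None, None, :, :]
    return F.conv2d(x, k, padding=1)[0, 0]
```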
We examined the temporal fluctuations of each part of the system and describe the obstacles to approaching shot-noise-limited performance. Overall, when the OLED was set at a constant brightness and the SLM at a constant transmission across all pixels, the SNR of optical power measurements was about half the shot-noise-limited SNR (Fig. S10).
Sources of excess noise, in addition to shot noise, include intensity fluctuation of the OLED display, phase instability
of the SLM, and the intrinsic noise of the detector. At high optical power, noise from external sources dominates the
SNR measurement; as the power decreases, shot noise becomes a dominant source of noise. At extremely low optical
power, the intrinsic noise of the detector is mainly responsible for deviations from the shot noise-limited performance.
There are three main components causing intensity fluctuations in OLED displays: raster scanning during screen
refreshing, pulse width modulation for brightness control, and the thermal noise of OLEDs.

Fig. S10: The Signal-to-noise Ratio (SNR) as a Function of Photon Flux, as Measured by the MPPC. The integration window was 150 ns in analog mode.

When a stationary image was shown on the OLED display, no perceivable raster-scanning pattern was observed in high-speed videos of the
display. There were occasional black scanning stripes and short bursts of flashing, which are likely to have been caused
by some refreshing mechanism. The OLED display did not seem to use pulse width modulation to adjust brightness
until very low brightness settings were reached (below ∼35 %). Therefore, we avoided setting pixels to low values, and
instead used neutral density filters to attenuate light for extremely low-light measurements. The intensity fluctuation
of the light source was mitigated by the high attenuation of the imaging system. Due to the large emission angle
of OLED pixels (> 60 degrees), most of the optical power was not collected by the zoom lens, which has a small
collection angle. It was estimated that only ∼0.7 % of light was collected by the zoom lens (Fig. S11). The high
loss brought the thermal state of the OLED light closer to a coherent state by coupling in vacuum modes [1], which improved the SNR and brought it closer to the shot-noise limit. It should be noted that even though
the light collection efficiency was low for the zoom lens, the transmission from before the SLM to the detector (where
the computation took place) was quite high at ∼22 % (Fig. S11). Thus, optical energy efficiency can be drastically
improved if the OLED and zoom lens are replaced with a stable coherent source and an optical system with high
transmission in a customized setup.
The phase fluctuation of the SLM stems from the constant switching of voltage across the liquid crystal layers. This
fluctuation could be measured by monitoring the intensity fluctuation of the laser diffraction pattern generated by
a phase grating on the SLM. According to the manufacturer, the SLM used in this experiment oscillated at 53 kHz,
with a measured peak-to-peak power ripple of 0.24% for the first-order diffraction spot. Compared to OLED intensity
fluctuations, instabilities caused by the SLM are relatively minor.
The intrinsic noise of the detector (e.g., thermal noise or dark counts) contributed to excess noise that was only
apparent when the optical power to be measured was extremely weak. As discussed in Section 4, the detector’s
intrinsic noise was negligible for high optical power (≫1 pW). The analog noise floor becomes significant for low
optical power at ∼1 pW, which necessitates photon counting for even lower photon flux.
The optical fan-in operation carries out the summation in vector-vector dot product computation by focusing optical
spatial modes onto the active area of a detector. Unlike digital summation, which sums the readouts from different
detector pixels with digital electronic circuits, optical fan-in implements summation through the accumulation of
photoelectrons generated in the same piece of detector material. Optical fan-in reduces energy consumption by
skipping the digital summation circuits and reducing the number of pixels together with associated amplification
circuits. The energy consumption caused by charging and discharging of the detector can be reduced by shrinking the size, and hence the capacitance, of the detector through optical focusing.

Fig. S11: The Optical Power Transmission at Each Stage of the Setup.

From      To        Transmission
Stage 1   Stage 2   0.007
Stage 2   Stage 3   0.07
Stage 3   Stage 4   0.22
Stage 1   Stage 4   0.00011
In this experiment, the optical fan-in was performed by an objective lens, which projected a de-magnified image of
the SLM onto the active area of the MPPC. The de-magnification factor from the SLM to the detector was 0.276×,
and the total de-magnification factor from the OLED display to the detector was ∼ 0.0442×. In other words, each
OLED pixel of original pitch 57.5 µm was imaged to a size of 2.54 µm on the detector. Therefore, the entire 711 × 711
pixel array could be imaged onto the detector within a square size of 1.806 mm × 1.806 mm, which fits into the
3 mm × 3 mm active area of the MPPC. During ideal optical fan-in, all spatial modes associated with a single dot
product should be integrated by a single piece of detector material. In this experiment, since the MPPC consists of
multiple photodiodes, not all the spatial modes were focused onto a single photodiode (however, it should be noted
that the photocurrents of these pixels were still summed as analog signals to form a single output, and there was no
digital operation involved). Each photodiode covered (50 µm/2.54 µm)² ≈ 388 spatial modes for in-pixel summation.
Even with so many spatial modes, no saturation was observed. This is because the photon flux was extremely low for
each spatial mode, and thus the simultaneous arrival of multiple photons was rare. The photoelectrons generated by
different photodiodes were superimposed as a single analog voltage signal at the detector output.
The lower bound of energy consumption of optical fan-in is determined by the capacitive noise of the detector [2],
and can be viewed as the energy cost of analog summation/accumulation. In this experiment, the MPPC worked at a
typical wall-plug power of 1.1 W. This came out to 0.3 mW consumed for each photodiode, after dividing the wall-plug
power by the total number of the photodiodes (3,600). If the data input and output rates were matched to the 3 MHz
bandwidth of the detector, each photodiode would consume 0.3 mW/3 MHz = 100 pJ per update cycle. Dividing this
by the number of spatial modes captured by each photodiode yields the energy consumed for each addition. This
comes out to 100 pJ/388 = 258 fJ.
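The arithmetic above can be checked with a short script (numbers are those quoted in this section; small rounding differences are expected):

```python
# Back-of-envelope check of the optical fan-in energy accounting above.
wall_plug_power = 1.1      # W, MPPC wall-plug power
n_photodiodes = 3600       # 60 x 60 photodiode array
bandwidth = 3e6            # Hz, detector bandwidth

power_per_pd = wall_plug_power / n_photodiodes    # ~0.3 mW per photodiode
energy_per_cycle = power_per_pd / bandwidth       # ~100 pJ per update cycle

pd_pitch = 50e-6           # m, photodiode pitch
mode_size = 2.54e-6        # m, de-magnified OLED pixel image on the detector
modes_per_pd = (pd_pitch / mode_size) ** 2        # ~388 spatial modes

energy_per_add = energy_per_cycle / modes_per_pd  # ~260 fJ (the text quotes 258 fJ)
print(f"{energy_per_add * 1e15:.0f} fJ per addition")
```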
In principle, the energy budget for each addition can be reduced by using smaller focal spots, lower detector gain,
and increasing the number of spatial modes for in-pixel summation. The focal spot can be reduced to the diffraction
limit of λ/2 in air (Abbe’s resolution limit with a numerical aperture of 1), which makes each spatial mode occupy
(λ/2)² = (532 nm/2)² ≈ 0.071 µm² of area on the detector. The energy consumption for each addition scales with area, which would give 258 fJ × (0.266 µm/2.54 µm)² ≈ 2.83 fJ. The area of each spatial mode can be further reduced by
focusing light in materials with refractive index larger than 1 (e.g., glass).
In this experiment, the photodiodes were operated with a high gain (Geiger mode) to provide extremely high
SNR for single photon detection. In practice, a lower gain can be better for two reasons: 1. The energy cost per
addition can be further reduced with a lower gain. 2. A lower gain prevents saturation of the detector, and allows the
accumulation of more terms in a dot product, which is essential for an optical energy advantage. Ideally, the detector
volume should be minimized to reduce capacitance, which in turn reduces thermal noise and potentially allows a high
16
voltage across the detector to obviate the use of amplifiers [2, 3]. Meanwhile, the number of carriers scales with the
detector volume, which results in a limited full well capacity for a small detector volume. A high gain results in each
photoelectron being amplified into many electrons and depletes the carriers in the detector volume quickly. For this
reason, the gain of the detector should be reduced to be just enough to amplify the signal to overcome thermal noise.
For a detector area of only one spatial mode (assuming a λ/2 focal spot size at 525 nm), the RMS value of the thermal noise electrons at room temperature is calculated as $\sqrt{kTC} = \sqrt{4.14 \times 10^{-21}\,\mathrm{J} \times 27.5\,\mathrm{aF}} = 3.38 \times 10^{-19}\,\mathrm{C} \approx 2.1\,e^-$ [2, 3].
At the detection level of 1 photon per spatial mode, a moderately low gain can be applied such that the number of
electrons generated by each photon is greater than the number of noise electrons.
Besides applying gain, the SNR of photon detection can also be improved by increasing the number of spatial modes for in-pixel summation. Even though both the detector area and capacitance scale with the number of spatial modes to be summed (N), the overall noise scales with √N while the total signal photons scale with N. For example, with 0.5 signal photons (without any detector gain) and 2 noise electrons per spatial mode, the SNR = 0.25 for each spatial mode; with 10,000 spatial modes, the SNR can be enhanced to 0.25 × √10,000 = 25. Therefore, even in the thermal-noise-limited (as opposed to shot-noise-limited) case, it is possible to anticipate less than 1 photon consumed for each accumulation in the dot product computation, as long as the vector size is sufficiently large.
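A minimal numerical sketch of this scaling argument, using the kTC noise figure computed above (the text's example rounds the noise to 2 e⁻, so the printed values differ slightly):

```python
import numpy as np

# Thermal (kTC) noise and SNR scaling with the number of summed spatial modes.
k_B, T = 1.380649e-23, 300.0   # J/K, K
C_mode = 27.5e-18              # F, single-mode detector capacitance (assumed)
q = 1.602176634e-19            # C, elementary charge

noise_e_per_mode = np.sqrt(k_B * T * C_mode) / q  # ~2.1 electrons RMS
signal_photons_per_mode = 0.5                     # sub-photon operation, no gain

for N in [1, 100, 10_000]:
    signal = signal_photons_per_mode * N          # photoelectrons scale with N
    noise = noise_e_per_mode * np.sqrt(N)         # kTC noise scales with sqrt(N)
    print(f"N = {N:6d}: SNR = {signal / noise:.2f}")
```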
Based on the optimization measures mentioned above, the energy consumption for optical fan-in can be reduced to 100 aJ per addition, which is sufficiently low to enable optical processors at least 10² times more energy-efficient than state-of-the-art machine-learning accelerators (e.g., at ∼10² fJ per multiply-and-accumulate operation (MAC), i.e., one multiplication plus one addition) [4].
It is noted that optical fan-in not only implements the summation with physical accumulation of electrons, but also
reduces the memory usage for storing intermediate results. For running typical neural networks, memory-associated
energy costs often account for the majority of the total energy cost, higher than the energy spent on arithmetic
operations [5]. Even though digital electronic processors are improving in energy efficiency by tailoring data flow
to machine-learning tasks [5] and by incorporating more on-chip memory [6], there is still a significant amount of
memory usage that can be further reduced. For example, in systolic arrays, vector-vector dot products are computed by adding $w_i x_i$ ($i = 1 \ldots N$) term by term to a partial sum, which requires N dedicated memory units, each reading and writing the partial sum once. Digital electronic adders usually have a small number of input operands, and cannot perform the summation of a large number of terms without saving intermediate results. In comparison, optical fan-in can implement the summation of a large number of terms (10³–10⁵) physically in a single step, which eliminates the need for any intermediate memory units, potentially leading to substantial energy savings.
Our setup can be viewed as a generalization of the classical Stanford matrix-vector multiplier, whose operation
can be summarized in three steps: fan-out, element-wise multiplication, and fan-in (Fig. S12). The major difference
between our setup and the Stanford matrix-vector multiplier lies in the geometric arrangement of the input vector
and its corresponding weights (the weight vector) in 2D blocks instead of 1D arrays, which leads to different physical implementations. Compared to 1D arrays, the 2D-block arrangement of both input and weight vectors offers advantages in scalability, especially for vector-vector dot products with a large vector size.
The 2D-block arrangement is especially suitable for applications in image classification, where the input data
are usually 2D images of high dimension. For natural-scene images, and the weights trained for them by neural networks (e.g., see Fig. S17), the pixel values usually do not vary drastically in local areas except at a few high-contrast boundaries. Preserving smooth local features in the 2D form reduces errors resulting from minor shifting or
blurring of the input and weight vectors. Therefore, our setup is more tolerant of minor errors in instrumentation,
such as imperfect imaging or crosstalk between SLM pixels (Fig. S13a). In comparison, flattening the vectors and
weights into 1D arrays, as required by the Stanford matrix-vector multiplier, would break the smoothness of local
features, which can result in more severe errors (e.g., the crosstalk level between the neighboring SLM pixels increases
with the level of contrast between them).
The 2D-block arrangement also reduces the amount of crosstalk between different input or weight vectors tiled next to each other (Fig. S13b). For each block, crosstalk occurs at the pixels in contact with adjacent blocks. When a vector of size N is wrapped into a square, the total number of pixels on its perimeter equals 4√N (Fig. S13b, top panel). If the vector were instead shaped into a 1D array, it would share a boundary of 2N pixels in total with neighboring arrays (Fig. S13b, bottom panel). For a large vector of size N, 2D blocks can therefore reduce the amount of crosstalk by over an order of magnitude relative to 1D arrays, owing to the √N versus N scaling (Fig. S13b). Although a gap could be introduced between the regions corresponding to different vectors, it would also reduce the fill factor and hence the computational throughput. Therefore, by decreasing the perimeter-to-area ratio, the 2D-block arrangement can effectively reduce crosstalk between different vectors while maintaining a high fill factor on both the OLED and SLM panels.

Fig. S12: Comparison between the Classical Stanford Matrix-vector Multiplier and our 2D-block Scheme.
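A quick check of the perimeter scaling argument (the vector sizes below are illustrative):

```python
import math

# Boundary pixels shared with neighbors: 2D block vs. flattened 1D array.
for N in [10**3, 10**4, 10**5]:
    side = math.isqrt(N)       # 2D block of side ~sqrt(N)
    perimeter_2d = 4 * side    # pixels on the block's perimeter, ~4*sqrt(N)
    boundary_1d = 2 * N        # a 1D row borders its neighbors on both sides
    print(f"N = {N:6d}: 2D perimeter = {perimeter_2d:5d}, "
          f"1D boundary = {boundary_1d:6d}, ratio = {boundary_1d / perimeter_2d:.0f}x")
```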
It should be noted that in the classical Stanford matrix-vector multiplier both optical fan-out and fan-in are
implemented with cylindrical lenses as reverse processes of each other. In the 2D-block arrangement, this symmetry
is broken because the optical fan-out process needs to create copies of the same block, while the optical fan-in focuses
all elements in each block. In our demonstration, we performed the fan-out operation digitally by displaying different
copies of the same vector on the OLED display, which was sufficient for the purpose of demonstrating sub-photon
scalar multiplication. Purely optical fan-out is necessary to achieve the total energy benefit for optical matrix-vector
multiplication (for details, see Section 15). For the 2D-block scheme, it is possible to implement optical fan-out with
several techniques, including imaging with a microlens array [7, 8] or beam splitting using an array of beam splitters
[9]. The optical fan-out operation would allow a very low number of photons in each spatial mode: it splits the photons emitted from a reasonably weak light source and distributes them among many spatial modes, each of which performs a scalar multiplication. In addition, since each spatial mode is the result of
many vacuum modes being coupled to the original spatial mode populated by a single light source, the noise of each
spatial mode can be reduced to be quite close to shot noise, even if the light source intensity is not perfectly stable.
Fig. S13: Comparison between the 2D-block and 1D-array Representation of the Input Vector and Weight Matrices. a, Error-reduction effects of the 2D-block representation of natural-scene images. The smooth local features intrinsic to the image data and corresponding weight matrices mitigate the crosstalk between neighboring pixels caused by imperfections in imaging (e.g., degradation of focal quality) or instrumentation (e.g., crosstalk between SLM pixels). Reshaping 2D images into 1D arrays breaks the continuity intrinsic to these images. b, The 2D-block representation reduces the crosstalk between neighboring vector and weight blocks. The 2D blocks share substantially shorter boundaries with each other (∝ √N) compared to 1D arrays (∝ N), which helps to reduce crosstalk between different vector and weight blocks, or to reduce the gaps between them for a higher fill factor on the SLM.
Part II
Vector-Vector Dot Product Precision
We characterized the precision of our optical matrix-vector multiplication by computing vector-vector dot products, which constitute general matrix-vector multiplication. The result $y$ (a scalar) of the dot product between input vector $\vec{x}$ and weight vector $\vec{w}$ is defined as:

$$y = \vec{w} \cdot \vec{x} = \sum_{k=1}^{N} w_k x_k. \qquad (1)$$
11. COMPUTING DOT PRODUCTS WITH SIGNED ELEMENTS USING INCOHERENT LIGHT
Our setup can only perform dot products between vectors with non-negative elements, because xk is encoded with
the intensity of each spatial mode, and wk is encoded with the transmission of each spatial mode. However, dot
products between vectors of signed elements can always be reduced to those between non-negative-valued vectors
with minimal digital processing overhead [10]. For a general vector with signed elements $\vec{x}^{\mathrm{signed}}$, where each element $x_k^{\mathrm{signed}} \in [x_{\min}^{\mathrm{signed}}, x_{\max}^{\mathrm{signed}}]$ with $x_{\min}^{\mathrm{signed}}, x_{\max}^{\mathrm{signed}} \in \mathbb{R}$, a non-negative vector $\vec{x}'$ can be obtained by adding bias terms and rescaling:

$$\vec{x}' = \frac{x'_{\max} - x'_{\min}}{x_{\max}^{\mathrm{signed}} - x_{\min}^{\mathrm{signed}}}\,\vec{x}^{\mathrm{signed}} + \frac{x_{\max}^{\mathrm{signed}}\,x'_{\min} - x_{\min}^{\mathrm{signed}}\,x'_{\max}}{x_{\max}^{\mathrm{signed}} - x_{\min}^{\mathrm{signed}}}\,\vec{1}, \qquad (2)$$

where $\vec{1} = (1, 1, \cdots, 1)$ is a constant vector of size $N$. It can be verified that $x'_k$ is tightly bounded within $[x'_{\min}, x'_{\max}]$ with $x'_{\max} \geq x'_{\min} \geq 0$. Therefore, two vectors with signed elements $\vec{w}^{\mathrm{signed}}$ and $\vec{x}^{\mathrm{signed}}$ can be converted to non-negative vectors $\vec{w}'$ and $\vec{x}'$, and the dot product $\vec{w}^{\mathrm{signed}} \cdot \vec{x}^{\mathrm{signed}}$ equals a linear combination of $\vec{w}' \cdot \vec{x}'$, $\vec{w}' \cdot \vec{1}$, $\vec{1} \cdot \vec{x}'$, and a constant term. All three dot products are between vectors with non-negative elements. The dot product between $\vec{1}$ and any vector equals the sum of all of that vector's elements, and was computed optically like any other dot product.
In machine learning applications, the input vector $\vec{x}'$ is either the input to the neural network or the neural activation of the previous layer, both of which are non-negative after the ReLU nonlinear activation function. Therefore, since $\vec{x}^{\mathrm{signed}}$ is already non-negative with $x_{\min}^{\mathrm{signed}} = 0$,

$$\vec{w}^{\mathrm{signed}} \cdot \vec{x}^{\mathrm{signed}} = c_1\,\vec{w}' \cdot \vec{x}' + c_2\,\vec{1} \cdot \vec{x}'. \qquad (3)$$

For simplicity, both $\vec{x}'$ and $\vec{w}'$ can be normalized to the range $[0, 1]$, and the coefficients in Eq. 3 can be solved as $c_1 = (w_{\max}^{\mathrm{signed}} - w_{\min}^{\mathrm{signed}})\,x_{\max}^{\mathrm{signed}}$ and $c_2 = w_{\min}^{\mathrm{signed}}\,x_{\max}^{\mathrm{signed}}$. The normalized vectors $\vec{x}'$ and $\vec{w}'$ were loaded onto the OLED display and the SLM, respectively, according to the hardware LUTs (e.g., Fig. S3c and Fig. S4b). In reality, the SLM could not achieve zero transmission, which led to a minimum modulation $w'_{\min} = \epsilon = 0.02$ (Fig. S4b). In other words, $w'_k$ was normalized to the range $[\epsilon, 1]$ instead of $[0, 1]$. In this case, $c_1 = \frac{1}{1-\epsilon}(w_{\max}^{\mathrm{signed}} - w_{\min}^{\mathrm{signed}})\,x_{\max}^{\mathrm{signed}}$ and $c_2 = \frac{1}{1-\epsilon}(w_{\min}^{\mathrm{signed}} - \epsilon\,w_{\max}^{\mathrm{signed}})\,x_{\max}^{\mathrm{signed}}$.
In summary, the dot product between two vectors with signed elements can be converted to two dot products
between non-negative vectors, which can be first computed purely with optics, and then combined with only 2 digital
multiplications and 2 digital additions. In other words, the price of conversion is a doubling of the amount of optical
computation with a constant digital overhead, independent of vector size N .
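A minimal numerical sketch of the whole conversion (Eqs. 2–3), assuming post-ReLU (non-negative) inputs and including the ε offset for the SLM's finite extinction ratio; `optical_dot` is a hypothetical stand-in for the optically measured dot product, emulated here with np.dot:

```python
import numpy as np

def signed_dot(w_signed, x, eps=0.02):
    """Compute a signed-weight dot product via two non-negative dot products."""
    optical_dot = np.dot                 # placeholder for the optical measurement
    w_min, w_max = w_signed.min(), w_signed.max()
    x_max = x.max()
    # Normalize to the modulator ranges: weights to [eps, 1], inputs to [0, 1].
    w_prime = (1 - eps) * (w_signed - w_min) / (w_max - w_min) + eps
    x_prime = x / x_max
    c1 = (w_max - w_min) * x_max / (1 - eps)
    c2 = (w_min - eps * w_max) * x_max / (1 - eps)
    ones = np.ones_like(x_prime)
    # Combine the two optical dot products with O(1) digital operations.
    return c1 * optical_dot(w_prime, x_prime) + c2 * optical_dot(ones, x_prime)

rng = np.random.default_rng(0)
w = rng.uniform(-1, 1, 1000)             # signed weights
x = rng.uniform(0, 1, 1000)              # non-negative (post-ReLU) inputs
assert np.allclose(signed_dot(w, x), np.dot(w, x))
```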
For matrix-vector multiplication, the computational overhead can be further reduced, since $\vec{1} \cdot \vec{x}'$ remains the same for the dot product between $\vec{x}$ and any row vector of the matrix. As a result, $\vec{1} \cdot \vec{x}'$ only needs to be computed once optically and can be reused afterwards. In other words, to compute a matrix of size $N' \times N$ multiplied with a vector of size $N$, in addition to the $N'$ optical dot products (which constitute $NN'$ MACs), only one additional optical dot product ($\vec{1} \cdot \vec{x}'$) is required. Thus, the digital overhead is $O(N')$.
To generate a test dataset representative of general dot products, we randomly generated vector pairs $\vec{x}$ and $\vec{w}$ based on natural-scene images from the STL10 dataset. Each vector was generated from a single color channel of
one or more images patched together, depending on the target vector size (each image of size L × L contributes
N = L2 elements to the vector). We chose natural images since they are more representative of the inputs in image
classification with globally inhomogeneous and locally smooth features. To adjust the sparsity of the vectors, different
thresholds were applied to the image pixel values such that the dot product results cover a wider range of possible
values. This was achieved by shifting the original pixel values (floating-point numbers normalized to the range 0–1) in
the entire image up or down by a certain amount, unless the value was already saturated at 1 (the maximum) or 0
(dark). For example, a shift of -1 would make the whole image dark. A shift of +0.2 would make all the pixel values
that were originally larger than 0.8 saturated, and would increase all other pixel values by 0.2. This method allowed
us to tune the overall intensity of the modulated images without losing the randomness of the distribution.
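A one-line sketch of this shifting procedure, assuming images normalized to [0, 1]:

```python
import numpy as np

def shift_and_clip(image, shift):
    """Adjust the sparsity/brightness of a normalized image by a uniform
    shift, saturating at 1 (maximum) and 0 (dark) as described above."""
    return np.clip(image + shift, 0.0, 1.0)
```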
The computation of dot products on the setup followed the same steps of element-wise multiplication and optical
fan-in, as described in the main text. Fig. S14 shows a few more examples of element-wise multiplication, similar to
Figure 2a in the main text.
Fig. S14: Example Measurement of Element-wise Multiplication between Random Vectors. The top
two rows show the corresponding input vectors on the OLED display and the weight vectors on the SLM. The
bottom row displays modulated light captured by a camera. The input and weight vectors were generated from
images in the STL10 dataset of size 64 × 64 (4,096 elements). Individual pixels are visible in the
captured images.
In order to study how dot product accuracy changes with photon budget, we used a sensitive detector (MPPC) to
measure the integrated optical energy. The optical energy consumed for each dot product computation was controlled
by tuning the detector integration times (e.g., Fig. S10 had an integration time of 150 ns). To get enough statistics
for noise distribution under low optical power, each detector readout measurement was repeated T times for each
vector pair $\vec{w}$ and $\vec{x}$. To get error statistics representative of general vector pairs, we also repeated the measurement for $S$ randomly generated vector pairs of different sparsity from randomly chosen images (as discussed in Section 12 A and Fig. S14). We call this set of vector pairs the calibration dataset, and collected a total of $S \times T$ data points.
Detector readout $v_{i,j}$ denotes the $j$th ($j = 1, 2, \ldots, T$) measurement made on the $i$th ($i = 1, 2, \ldots, S$) vector pair. For each vector pair, the mean value of the detector readouts $\bar{v}_i = \frac{1}{T}\sum_{j=1}^{T} v_{i,j}$ was calculated, with $T$ large enough to average out the noise. The detector readouts were quantified either in optical energy or in number of photons, which is the optical energy divided by the photon energy (∼0.4 aJ at 525 nm). To enable energy-efficiency comparisons between different vector sizes, the total optical energy, or number of photons, detected for each dot product was further divided by the number of multiplications in the dot product.
Fig. S15: A Calibration Curve Converting Detector Readouts to Dot Product Results. The mean detector readout ($\bar{v}_i$, x axis) is plotted against the corresponding ground truth ($y_{\mathrm{truth},i}$, y axis) for every vector pair
in the calibration dataset. The unit of detector readout is either the number of photons (bottom axis) or the
number of photons per multiplication (top axis). The vector length was 711 × 711 (N = 505, 521). The calibration
curve (red dashed line) was obtained using a least-squares fit to the data points. The shaded area indicates 3
standard deviations of the repeated measurements. The average number of photons per multiplication of the data
points in the plot is 0.0025.
A calibration model $f$ was made to convert the average detector readout $\bar{v}_i$ to the dot product result $y_{\mathrm{meas},i}$ as $y_{\mathrm{meas},i} = f(\bar{v}_i)$. The calibration involved plotting the ground truth of the dot product $y_{\mathrm{truth},i} = \vec{x}_i \cdot \vec{w}_i$ versus $\bar{v}_i$, followed by fitting the data points to a linear curve using a least-squares criterion. Fig. S15 shows an example of data points measured on vector pairs of length N = 505,521. The calibration curve $f$ is plotted as the dashed red line. The range of $y_{\mathrm{truth}}$ was normalized to [0, 1] by rescaling $\vec{x}'$ and $\vec{w}'$, based on their definitions in Eq. 2, with a multiplicative factor $1/\sqrt{N}$. With the calibration model, we could read out the dot product result based on the detector readout value. In principle, the calibration only needed to be performed once with the calibration dataset, unless the setup changed (e.g., adding extra attenuation) or drifted over time.
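A minimal sketch of the calibration fit (the readout values below are placeholders, not measured data):

```python
import numpy as np

# Least-squares linear calibration from mean detector readouts to dot products.
def fit_calibration(v_bar, y_truth):
    slope, intercept = np.polyfit(v_bar, y_truth, deg=1)  # fit a line
    return lambda v: slope * v + intercept                # calibration model f

# Hypothetical usage with placeholder calibration data:
v_bar = np.array([120.0, 340.0, 510.0, 880.0])   # mean readouts (photons)
y_truth = np.array([0.11, 0.33, 0.50, 0.86])     # digitally computed ground truth
f = fit_calibration(v_bar, y_truth)
y_meas = f(400.0)                                # convert a single readout
```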
After obtaining the calibration curve, we generated another random vector pair test dataset in order to quantify
the error statistics of dot product computation performed by our setup. Error was defined as the difference between
the measured result and ground truth ytruth − ymeas . Suppose we have S vector pairs in the data set and each is
Fig. S16: Dot Product Error Analysis at 4 Typical Photon Budgets. The dot products were computed
with vector size N = 711 × 711 = 505,521. The average number of photons per multiplication (indicated at the top
of each plot) was controlled by the integration time of the MPPC detector, and averaged over the entire test
dataset. For each vector pair, the measurement ymeas,i,j was repeated multiple times, and all the data points were
plotted. The ground truth ytruth,i is plotted against the corresponding measurements ymeas,i,j . The histogram of
errors ytruth,i − ymeas,i,j is shown in each inset. The overall accuracy representative of the dataset is characterized by
the root-mean-square error (RMSE), which is also similar to the value of the standard deviation of the error
distribution. The color code is the same as that in Figure 2 in the main text.
repeated $T$ times. For each detector readout $v_{i,j}$ ($i = 1, 2, \ldots, S$; $j = 1, 2, \ldots, T$), $\mathrm{Error}_{i,j} = y_{\mathrm{truth},i} - y_{\mathrm{meas},i,j}$, where $y_{\mathrm{meas},i,j} = f(v_{i,j})$. Unlike calibration, here we used single-shot readouts $v_{i,j}$ rather than the mean value $\bar{v}_i$.

For each vector pair $\vec{x}_i$ and $\vec{w}_i$, the root-mean-square (RMS) error over the measurement trials was calculated as $\mathrm{RMSE}_i = \sqrt{\frac{1}{T}\sum_{j=1}^{T} \mathrm{Error}_{i,j}^2}$, which can be interpreted roughly as the most likely error one would get from a single-shot computation for vector pair index $i$. The total RMS error across the different vectors in the dataset was calculated as $\mathrm{RMSE} = \sqrt{\frac{1}{S}\sum_{i=1}^{S} \mathrm{RMSE}_i^2} = \sqrt{\frac{1}{ST}\sum_{i,j} \mathrm{Error}_{i,j}^2}$, which can be interpreted as the most likely error one would get from a single-shot computation on a vector pair randomly selected from the entire test dataset. Histograms of the errors on the test dataset are shown in the insets of Fig. S16.
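The two RMSE definitions can be computed directly from the error array; a minimal sketch with placeholder data:

```python
import numpy as np

# errors[i, j]: error of the j-th repeated measurement on the i-th vector pair.
errors = np.random.default_rng(1).normal(0, 0.01, size=(100, 50))  # placeholder
rmse_per_pair = np.sqrt((errors ** 2).mean(axis=1))  # RMSE_i over trials
rmse_total = np.sqrt((errors ** 2).mean())           # total RMSE over the dataset
# The two definitions agree: RMSE = sqrt(mean of RMSE_i^2).
assert np.isclose(rmse_total, np.sqrt((rmse_per_pair ** 2).mean()))
```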
The scatter plots in Fig. S16 show how well the computed dot product results matched the ground truth under different photon budgets. The average number of photons detected differed between these experiments, since it was determined by the detector integration time. With higher photon budgets, the error decreases as the noise contribution to the error decreases. For a higher photon budget (>1 photon per multiplication), the RMS error stops decreasing and is instead limited by systematic error due to imperfections in the setup. The four scatter plots in Fig. S16 correspond to the total RMSE data points of the same colors in Figure 2b in the main text.
Part III
Optical Neural Network for Image Classification
13. TRAINING PROTOCOL OF NOISE RESILIENT OPTICAL NEURAL NETWORKS
For handwritten digit classification (MNIST database), we trained a 3-layer neural network with full connections
(i.e., a multilayer perceptron, MLP). The inputs are 8-bit grayscale images of size 28 × 28 = 784 total pixels; the input layer is followed by two fully-connected hidden layers, each comprising 100 neurons with ReLU as the nonlinear activation function.
The output layer has 10 neurons, with each neuron corresponding to a digit from 0 to 9. The neural network was
implemented and trained in PyTorch (1.7.0). To improve the robustness of the model to numerical error, we employed
several techniques during training:
1. Quantization-aware training (QAT): The activations of neurons were quantized to 4 bits and the weights to 5 bits to adapt to the numerical precision of the setup. For example, for a vector size of 784, we found the SNR of dot products to be equivalent to ∼4-bit numerical precision. Even though the SLM could be controlled with 8-bit numbers, we decided to quantize its weights to 5 bits, which matched its extinction ratio of 50 better and did not seem to have any negative impact on MLP accuracy. To reduce the numerical sensitivity caused by quantization (e.g., in the regular nearest-neighbor scheme, 0.49 is rounded down to 0, while 0.51 is rounded up to 1), we used random digitization between adjacent levels (see the sketch after this list), which was observed to improve the model's robustness against random noise. We found that a few (6–12) warm-up training epochs with full 32-bit float precision helped to protect the model from aggressive quantization in the initial stage, after which the application of quantization noise was less likely to derail the training process, but still helped to fine-tune the parameters.
2. Data augmentation with random image transform and convolution: To improve model tolerance to potential
hardware imperfections, we imitated similar kinds of errors on input images. For example, the misalignment
was modeled as random rotation (within ±5◦ ) and translation (±4% of image size in any direction), mismatched
zoom factor as random zooming (±4% of image size), and intra-pixel crosstalk as a mild convolution with a
3 × 3 blurring kernel. We observed that these measures not only helped to improve model immunity to imaging
error, but also improved overall model accuracy and robustness to photon noise by reducing overfitting with
regularization. The data augmentation was only performed on the input layer rather than all hidden layers, due
to computational complexity and the observation that hidden layers were usually sparser, making crosstalk between neighboring pixels less of an issue.
3. Optimizing training parameters: We used a stochastic gradient optimizer for training, with a learning rate typically between 0.03 and 0.05 and a momentum between 0.7 and 1. Learning-rate decay was applied every 20 epochs with a decay rate between 0.3 and 0.5. The training parameters were randomly sampled within these ranges for different training trials, and fine-tuned using the package Optuna [11].
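As referenced in item 1, a minimal sketch of random digitization (stochastic rounding) between adjacent quantization levels, assuming values normalized to [0, 1]; this is an illustration, not the authors' released training code:

```python
import torch

def stochastic_quantize(x, bits):
    """Randomly round x (in [0, 1]) to one of the two adjacent quantization
    levels, with probability proportional to proximity, so that the
    quantized value is unbiased on average."""
    levels = 2 ** bits - 1
    scaled = x * levels
    low = torch.floor(scaled)
    # Round up with probability equal to the fractional part.
    rounded = low + (torch.rand_like(x) < (scaled - low)).float()
    return rounded / levels
```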
It is important to note that the quantization of neuron activations was only performed during training on a digital computer, not during the inference stage on the ONN. We observed that even though the models were trained with noise, they still performed better when run at full precision. Therefore, the trained weights and neural activations
were loaded with the maximum allowable precision for the hardware (i.e., 7 bits for the OLED display, and 8 bits for the
SLM). The training code can be found in this GitHub repository: https://github.com/mcmahon-lab/ONN-QAT-SQL.
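For concreteness, a minimal PyTorch sketch of the architecture and augmentation described above (hyperparameter values follow the ranges quoted in the text; GaussianBlur is a stand-in for the measured 3 × 3 crosstalk kernel; see the GitHub repository above for the authors' actual training code):

```python
import torch.nn as nn
from torchvision import transforms

# Data augmentation imitating hardware imperfections on the input images.
augment = transforms.Compose([
    transforms.RandomAffine(degrees=5,               # misalignment as rotation
                            translate=(0.04, 0.04),  # and translation
                            scale=(0.96, 1.04)),     # mismatched zoom factor
    transforms.GaussianBlur(kernel_size=3),          # mild intra-pixel crosstalk
    transforms.ToTensor(),
])

# The 3-layer MLP: 784 -> 100 -> 100 -> 10, with ReLU activations.
model = nn.Sequential(
    nn.Flatten(),                        # 28 x 28 = 784 input pixels
    nn.Linear(784, 100), nn.ReLU(),
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, 10),                  # one output neuron per digit
)
```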
The trained neural network was executed with optical matrix-vector multiplication in the following steps:
1. Starting from the input layer, the matrix-vector multiplication involved in the forward propagation from the
current layer to the next layer was computed optically, according to the procedure described in Section 11.
The matrix weights loaded onto the SLM were exactly the same as those in the neural network trained on the
digital computer. The number of photons per multiplication in matrix-vector multiplication was controlled by
adjusting the number of detector samples to sum.
2. The bias terms and the nonlinear activation function were applied digitally to the matrix-vector multiplication
result, and these parameters followed exactly as those in the trained neural network, without any modification
or retraining. The resulting neuron activations were used as the input vector for the matrix-vector multiplication leading to the next layer (return to step 1), unless there is no next layer (go to step 3).
3. At the output layer, the prediction was made based on the highest score.

Fig. S17: An Example of the Matrix-vector Multiplication Results of the First Fully-connected Layer of the ONN. The images show the results of element-wise multiplication between an input vector (28 × 28 = 784 elements) and each row of the weight matrix of the first fully-connected layer (100 × 784) of our ONN model, as captured by a camera. The last row computed $\sum_j x_j$, which was used to offset the output vector for negative elements. Only the output of the first layer is shown.
For step 1, since the inputs and neural activations were both non-negative due to our choice of ReLU nonlinearity,
only the weight matrices needed to be shifted and normalized. The element-wise multiplication of the first layer
is visualized in Fig. S17b, with the weight matrix displayed on the SLM visualized in Fig. S17a for comparison. The $\vec{1} \cdot \vec{x} = \sum_j x_j$ term in Eq. 3 was computed by adding an additional input vector block and setting all the corresponding SLM pixels' transmissivity to the maximum (e.g., Fig. S17b, last row in the image; an entire row was used for illustration purposes and redundancy, while in fact only one additional block was needed for the entire layer).
To obtain the answer to the matrix-vector multiplication, element-wise modulated spatial modes in each block were
summed up by optical fan-in as described in Section 9, which is equivalent to summing all the pixels in each block
shown in the image taken by the camera in Fig. S17. The integrated optical energy was translated to the answer
of the dot product according to a calibration curve, which was made by fitting the measured optical energy to the
ground truth of the dot products using the first 10 samples of the MNIST test dataset, in a fashion similar to that
described in Section 12.
For each forward propagation of the neural network, 784 × 100 + 100 × 100 + 100 × 10 = 89,400 multiplications and 89,400 additions were performed optically. The total digital assistance for each forward propagation includes 100 + 100 + 10 = 210 additions for applying the $\vec{1} \cdot \vec{x}$ term (to shift the dot product between vectors of non-negative elements in order to obtain the dot product between vectors of signed elements), 100 + 100 + 10 = 210 additions for applying digital biases, and 100 + 100 = 200 applications of the ReLU nonlinear activation function (involving only simple operations: comparing each element to 0 and setting it to 0 if it is negative).
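A minimal sketch of the per-layer digital assistance described above (variable names are illustrative):

```python
import numpy as np

def digital_assist(optical_mvm, ones_dot_x, c1, c2, bias):
    """Digital operations per layer: the signed-weight offset (Eq. 3),
    bias addition, and ReLU. `optical_mvm` holds the N' optically measured
    dot products; `ones_dot_x` is the optically measured sum of the inputs."""
    z = c1 * optical_mvm + c2 * ones_dot_x + bias  # O(N') digital work
    return np.maximum(z, 0.0)                      # ReLU activation
```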
The total energy consumption, including optical and electrical contributions, has been analyzed in detail for var-
ious optical computing/communication systems in the literature [2, 3, 12]. Based on the methodology of previous
studies, here we discuss the energy consumption of near-future designs of spatially multiplexed optical matrix-vector
multipliers.
The total energy consumption of a general optical matrix-vector multiplier can be broken down according to its
constituent components: light sources, light modulators, detectors for optical signal, detector signal amplifiers, analog-
to-digital converters (ADCs), digital-to-analog converters (DACs), and electronic data storage and movement (Fig.
S18). The energy cost associated with each part belongs to one of two categories of energy scaling, with only one exception that will be pointed out later in the discussion:
1. Energy cost that scales with the dimension of the input (output) vector size N (N′). In this case, the energy consumption per operation scales as N/(NN′) = 1/N′ or N′/(NN′) = 1/N. Therefore, the energy consumption per operation decreases with the size of the matrix-vector multiplication.
2. Energy cost that scales with the size of the matrix NN′. In this case, the energy consumption per operation does not change with N or N′. Therefore, such energy costs cannot be amortized by larger matrix-vector multiplications.
For a near-future blueprint, we are targeting a system performing matrix-vector multiplication of size 4,096×4,096
(i.e., N = N 0 = 4,096) at a 1 GHz update rate with low-precision arithmetic (4-5 bits for both matrix weights and
vector activation). The 1 GHz rate can be readily achieved for input data encoding and output results readout with
technologies regularly used in telecommunications applications. Electro-optic modulators can achieve up to ∼100 GHz
in integrated devices compatible with CMOS platforms [13]. Here, we evaluate energy efficiency in terms of energy
per multiply-and-accumulate (MAC), which consists of one multiplication and one addition operation as commonly
used in deep learning literature [5].
Fig. S18: Schematic of a General Optical Matrix-vector Multiplier. N input units are connected to N′ output units through optical fan-out and optical fan-in; data moves over optical transmission links, metal wires, and memory.
With proper implementation, most electronic operations fall into category 1 in terms of energy scaling. For example, the energy consumed by signal amplification, ADCs, memory, and data movement all scales with the output vector size N′, and the energy consumed by DACs scales with the input vector size N. We discuss the energy consumption of each system component in more detail below:
• Transimpedance amplification is usually required to convert the current signal of photodiodes to voltage; however, if the capacitance of the photodiodes is sufficiently small (e.g., 1 fF, which allows the voltage across the capacitor to be 1 V with ∼10⁴ electron charges), the voltage can be high enough to switch subsequent ADC
circuits without amplification [3, 14, 15]. In this case, the energy consumption for each charge transfer is on the
order of $CV_{\mathrm{DD}}^2 \sim 1$ fJ, or 1 fJ/4,096 ≈ 0.24 aJ/MAC for N′ = 4,096.
• An ADC is required to convert the voltage signal of each detector to a digital readout for further processing
steps, such as applying bias and nonlinear activation functions in neural networks. A high-rate ADC can achieve
∼24 GHz conversion rate at ∼1 pJ per sample for low precision numbers (e.g., 4.5 bits) [16, 17], which means the
energy efficiency of the ADC operating at 1 GHz is at least 1 pJ/24 = 42 fJ per sample (since an ADC operating
at a higher sampling rate is usually less energy-efficient, due to increased thermal noise). The ADC energy
expense per MAC can therefore be reduced to 42 fJ/4,096 = 10 aJ for N 0 = 4,096.
• A DAC is required to convert the digital format of each input vector entry to the current (voltage) controlling
the active light source (passive light modulator) that encodes inputs to the system. In existing commercial
photonic processors, DAC costs about the same amount of energy per sample as ADC [18]. In similar electric
crossbar applications (5-bit resolution at 100 MHz), each DAC sample has been estimated [19] to consume as
little as 15 fJ per sample (3.7 aJ/MAC).
• Data transmission in metal wires costs ∼100 aJ/µm/bit on-chip [3]. With judicious wire planning, it is possible to keep the total wiring energy associated with each MAC on the order of 100 aJ, since optical transmission covers much of the distance that signals would otherwise have to traverse electronically.
• The memory is used to store the detector readouts for digital operations and as inputs to the next layers. The
energy cost is usually on the order of 1 pJ/bit for off-chip memory [5]. Therefore, without any optimization, the
energy costs associated with data storage is relatively high compared to other costs, at 4 pJ per 4-bit number.
However, much of this energy is probably consumed by signal transmission during memory access, and the energy to read/write each bit of information in transistors can be reduced to the level of 10 fJ, based on the switch energy of transistors [20]. With tight integration of memory units, the associated energy cost can hopefully be reduced to 100 fJ per 4-bit number, which has probably already been achieved, judging from the total energy efficiency reported for existing digital machine-learning accelerators (10 fJ/MAC–10 pJ/MAC) [4]. Therefore, after amortization, the memory cost can be reduced to the order of 2 × 100 fJ/4,096 ∼ 50 aJ/MAC (the factor of 2 accounts for both input and output).
Overall, with reasonably large input vector sizes (e.g., N = N 0 = 4,096), it is possible to keep energy costs in category
1 on the order of 100 aJ/MAC.
The energy consumption associated with modulating the matrix weights scales with the number of entries in the matrix
(category 2). For reconfigurable spatial light modulators, changes to the weights are either accomplished all at once
with a single investment of energy (e.g., phase-change materials, ferroelectric materials, mechanical phase modulators,
etc.), or must be maintained by constant power consumption (e.g., liquid-crystal displays, LCDs). Based on the LCDs
in existing commercial mobile devices, it is possible to power millions of pixels with <1 W; the
update rate of commercial LCDs is 60–240 Hz. For machine-learning inference, where the weights can
stay stationary most of the time, the energy cost per multiplication can be amortized by increasing the update
rate of the system's input and output devices. For example, with a 16-million-pixel LCD consuming 1 W with stationary
weights, if the other parts of the system update at 1 GHz, the energy consumption for each multiplication would be
1 W/(10⁹ updates/s × 4,096 × 4,096 MACs per update) ≈ 60 aJ/MAC.
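As a quick sanity check of this amortization (same assumptions as in the text: a 1 W, ∼16-million-pixel LCD holding stationary weights while inputs and outputs stream at 1 GHz):

```python
# Weight-modulation energy per MAC for stationary weights on an LCD:
# the panel's constant power is shared by every MAC performed while
# the weights stay fixed.
P_lcd = 1.0                      # W, ~16-million-pixel panel (text's estimate)
f_update = 1e9                   # input/output vector updates per second
macs_per_update = 4096 * 4096    # one MAC per pixel per vector update

E_weight = P_lcd / (f_update * macs_per_update)
print(f"{E_weight / 1e-18:.1f} aJ/MAC")   # ~60 aJ/MAC
```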
The energy scaling of the light source and detector is in general more complicated, and depends on whether the system
operates in a shot-noise-limited regime (category 1 for fixed output precision) or a thermal-noise-limited regime
(in between categories 1 and 2, with weaker energy benefits for larger N or N′; see Section 9). Here, we argue that
in either regime, the optical energy consumption is insignificant compared to the electronic energy costs
analyzed above. As discussed in Section 9, at extremely low photon fluxes (<1 photon per multiplication), thermal
noise becomes a more prominent factor in determining the detection SNR. Even though the detector gain can be increased
to improve the SNR, such measures would not improve the overall energy efficiency, since it is more economical to
strengthen the signal with more photons than to amplify photoelectrons. Therefore, in a realistic device, a few photons
(∼10⁻¹⁸ J) should be budgeted for each multiplication at the detector in order to ensure sufficiently high SNR. With
optical fan-out, such a low photon budget for each spatial mode can be achieved by distributing the photons generated
by a single, intense light source over a large number N of spatial modes. The use of a single light source substantially
reduces the number of light sources that need to be integrated, as well as the energy spent on distributing data among
the light sources through electronic data transmission [3]. Therefore, a 10% "wall-plug" efficiency can realistically be
expected from an active coherent light-source array (e.g., VCSELs), or from a passive modulator array (e.g., thin-film
lithium niobate modulators) with a single external laser source. Moreover, the additional energy for operating the
passive modulator array can be reduced to <10 fJ for 4-bit numbers running at >1 GHz [13]; the corresponding energy
per MAC is then <10 fJ/4,096 ≈ 2.5 aJ/MAC, which is insignificant compared to the other energy costs. Overall, it is
reasonable to expect the lumped quantum efficiency, including the light source and detector, to reach 1%, in which case
the electrical energy cost per MAC would be 10⁻¹⁸ J/1% = 100 aJ.
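The corresponding optical budget is a one-line calculation, using the text's figures of ∼10⁻¹⁸ J (a few photons) per multiplication at the detector and a 1% lumped efficiency:

```python
# Electrical cost of generating the detected optical energy: the few-photon
# budget at the detector divided by the lumped source+detector efficiency.
E_detected = 1e-18   # J/MAC, "a few photons" as budgeted in the text
eta_lumped = 0.01    # 1% lumped quantum/wall-plug efficiency (text's estimate)

E_electrical = E_detected / eta_lumped
print(f"{E_electrical / 1e-18:.0f} aJ/MAC")   # 100 aJ/MAC
```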
In summary, based on currently available technologies, it is realistic to build an optical matrix-vector multi-
plier that performs matrix-vector multiplications of size 4,096 × 4,096 at an update rate of 1 GHz, reaching a speed
of ∼17 × 10¹⁵ MACs/s (or 17 POPS). The energy consumption for each MAC can be on the order of several hundred
aJ (excluding memory), or ∼1 fJ (including memory), which leads to a total power of ∼40 W (including memory). The
energy efficiency would be 2 to 3 orders of magnitude better than that of state-of-the-art digital electronic machine-learning
accelerators (e.g., 10 fJ/MAC–10 pJ/MAC) [4, 21] or photonic chips [18], and better than the blueprint for analog
electronic computing [19]. In the case where the average power is limited by the heat-dissipation rate, energy efficiency
translates directly into computational speed, which means the optical processor could achieve computation speeds 10× to 100×
higher than digital processors with the same average power.
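For reference, the headline throughput figure follows directly from the matrix size and update rate; note that at exactly 1 fJ/MAC the total power comes to ∼17 W, so the ∼40 W figure above presumably budgets additional overhead beyond the per-MAC estimates:

```python
# Throughput and power of a 4,096 x 4,096 multiplier updating at 1 GHz.
N = N_prime = 4096
f_update = 1e9                          # vector updates per second
macs_per_s = N * N_prime * f_update     # ~1.7e16 MACs/s (~17 POPS)

E_per_mac = 1e-15                       # J, ~1 fJ/MAC including memory
print(f"{macs_per_s:.2e} MACs/s, {macs_per_s * E_per_mac:.0f} W")
```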
It is worthwhile to reiterate that the optical energy advantage stems from low-loss optical communication over
distances of 10–100 µm for on-chip platforms [3]. This enables spatial multiplexing over a region much larger than what is
possible with purely electronic circuits, which would either reach the limit of heat dissipation by scaling down element
sizes or spend most of their energy on communication across a large area. As a result, optical platforms allow
the matrix size to scale up more easily, to the level of 10³ × 10³ or even 10⁴ × 10⁴, limited only by the number of light
sources, modulators, and detectors that can be integrated. In addition, optical operations such as fan-in and fan-out
support extremely high levels of branching (10³–10⁴) that are not practical in electronics, which helps to optimize the
data flow and to save a significant portion of the energy that would otherwise be spent on storing or relaying intermediate
results (for more details, see Section 9). Overall, optical operations not only allow the energy cost per MAC to scale
down with the size of the optical matrix-vector multiplication, but also offer near-term potential to reach matrix sizes
large enough to achieve competitive energy efficiency.
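Schematically, the scaling argued throughout this section can be collected into a single per-MAC expression (our shorthand, using the categories defined above; this is a summary sketch, not a formula from the text):

$$
E_{\mathrm{MAC}} \approx \underbrace{\frac{E_{\mathrm{amp}} + E_{\mathrm{ADC}} + E_{\mathrm{mem}}}{N'}}_{\text{per output}}
+ \underbrace{\frac{E_{\mathrm{DAC}}}{N}}_{\text{per input}}
+ \underbrace{\frac{P_{\mathrm{weights}}}{f\,N N'}}_{\text{weights}}
+ E_{\mathrm{wire}}
+ \underbrace{\frac{n_{\mathrm{ph}}\,h\nu}{\eta}}_{\text{optical}},
$$

where f is the input/output update rate, n_ph the detected photons per MAC, and η the lumped source/detector efficiency. Every term except the wiring and optical budgets shrinks as N and N′ grow, which is the essence of the optical energy advantage.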
[1] I. R. Berchera and I. P. Degiovanni, Quantum imaging with sub-Poissonian light: challenges and perspectives in optical
metrology. Metrologia 56, 024001 (2019).
[2] R. Hamerly, L. Bernstein, A. Sludds, M. Soljačić, and D. Englund, Large-scale optical neural networks based on photo-
electric multiplication. Physical Review X 9, 021032 (2019).
[3] D. A. B. Miller, Attojoule optoelectronics for low-energy information processing and communications. Journal of Lightwave
Technology 35, 346–396 (2017).
[4] A. Reuther, P. Michaleas, M. Jones, V. Gadepally, S. Samsi, and J. Kepner, Survey of machine learning accelerators.
arXiv:2009.00993 (2020).
[5] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, Efficient processing of deep neural networks: A tutorial and survey.
Proceedings of the IEEE 105, 2295–2329 (2017).
[6] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers,
R. Boyle, P.-l. Cantin, C. Chao, C. Clark, C. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami et al.
In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium
on Computer Architecture (ISCA), 1–12 (2017).
[7] W. Andregg, M. Andregg, R. T. Weverka, and L. Clermont, Wavelength multiplexed matrix-matrix multiplier. (U.S.
Patent No. 10,274,989). U.S. Patent and Trademark Office (2019).
[8] Y. Hayasaki, I. Tohyama, T. Yatagai, M. Mori, and S. Ishihara, Optical learning neural network using Selfoc microlens
array. Japanese Journal of Applied Physics 31, 1689 (1992).
[9] A. Hemmi, R. Mizumura, R. Kawanishi, H. Nakajima, H. Zeng, K. Uchiyama, N. Kaneki, and T. Imato, Development of
a novel two dimensional surface plasmon resonance sensor using multiplied beam splitting optics. Sensors 13, 801–812
(2013).
[10] J. W. Goodman, A. Dias, and L. Woody, Fully parallel, high-speed incoherent optical method for performing discrete
Fourier transforms. Optics Letters 2, 1–3 (1978).
[11] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama, Optuna: A next-generation hyperparameter optimization
framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
(2019).
[12] A. N. Tait, T. F. De Lima, M. A. Nahmias, H. B. Miller, H.-T. Peng, B. J. Shastri, and P. R. Prucnal, Silicon photonic
modulator neuron. Physical Review Applied 11, 064043 (2019).
[13] C. Wang, M. Zhang, X. Chen, M. Bertrand, A. Shams-Ansari, S. Chandrasekhar, P. Winzer, and M. Lončar, Integrated
lithium niobate electro-optic modulators operating at CMOS-compatible voltages. Nature 562, 101–104 (2018).
[14] K. Nozaki, S. Matsuo, T. Fujii, K. Takeda, M. Ono, A. Shakoor, E. Kuramochi, and M. Notomi, Photonic-crystal nano-
photodetector with ultrasmall capacitance for on-chip light-to-voltage conversion without an amplifier. Optica 3, 483–492
(2016).
[15] L. Tang, S. E. Kocabas, S. Latif, A. K. Okyay, D.-S. Ly-Gagnon, K. C. Saraswat, and D. A. Miller, Nanometre-scale
germanium photodetector enhanced by a near-infrared dipole antenna. Nature Photonics 2, 226–229 (2008).
[16] B. Murmann, ADC Performance Survey 1997-2020 (http://web.stanford.edu/~murmann/adcsurvey.html). Online Ac-
cessed: 2021-02-18 (2020).
[17] B. Xu, Y. Zhou, and Y. Chiu, A 23mW 24GS/s 6b time-interleaved hybrid two-step ADC in 28nm CMOS. In Proceedings
of the IEEE Symposium on VLSI Circuits (VLSI-Circuits), 1–2 (2016).
[18] C. Ramey, Silicon photonics for artificial intelligence acceleration. In Proceedings of the IEEE Hot Chips 32 Symposium
(HCS), 1–26 (2020).
[19] S. Cosemans, B. Verhoef, J. Doevenspeck, I. Papistas, F. Catthoor, P. Debacker, A. Mallik, and D. Verkest, Towards
10000TOPS/W DNN inference with analog in-memory computing–a circuit blueprint, device options and requirements. In
2019 IEEE International Electron Devices Meeting (IEDM), 22–2 (2019).
[20] D. Caimi, M. Sousa, S. Karg, and C. B. Zota, Scaled III–V-on-Si transistors for low-power logic and memory applications.
Japanese Journal of Applied Physics 60, SB0801 (2021).
[21] IEEE, International roadmap for devices and systems 2020 edition. IEEE IRDS™, URL: https://irds.ieee.org/
editions/2020 (2020).