More Is Less: Signal Processing and the Data Deluge (Special Section)


SPECIALSECTION

need to be established if a data-sharing network is to succeed, particularly when it comes to the ethical and privacy issues surrounding patient data (23, 24).

Shifting Attitudes

Widely dispersed researchers in resource-limited countries may have few opportunities to travel to courses or attend meetings, but they can meet online and share experiences, guide each other, and access resources. Learning and knowledge sharing online could play a vital role in adjusting the imbalance in research capacity. However, this medium for learning needs to become accepted, and senior research staff need to encourage and enable their colleagues to take up the numerous free and open-access learning opportunities that are increasingly available online (13).

Undoubtedly integration and knowledge sharing can be vastly improved to make the most use of gathered data, but many organizations in global health exist to address a single disease or work in a specific sector. There is a real need for mechanisms allowing research organizations, governments, and universities to collaborate outside their usual remits and locations to maximize the impact of data and available resources.

Governance and ethical issues are also a major concern, because if mistakes are made trust will be quickly lost and enthusiasm for opening access could be stifled. A particular anxiety resulting from disparities between wealthy and resource-limited nations is the removal of data and loss of ownership. Ownership and governance arrangements need to be made transparently for fair access and maintenance of security, and whenever possible the technology should be transferred rather than the data. These issues therefore need to be tackled openly and comprehensively early in the formation of data-sharing collaborations. Groups would be advised to seek advice and obtain example policy documents (such as agreements and terms of reference) from other successful data-sharing groups.

A striking range of data sets spanning a wide range of healthcare issues, including infectious and noncommunicable diseases, are accumulating with use of new technology and online collaboration. All this stands to make real changes in the lives of people affected by diseases of poverty. While scientists are rapidly adapting and taking up these approaches, funding agencies and regulators also need to adapt to ensure that all interested communities are able to take maximum advantage of the digital environment to drive improvements in global health.

References and Notes
1. P. Mwaba, M. Bates, C. Green, N. Kapata, A. Zumla, Lancet 375, 1874 (2010).
2. E. Wenger, W. Snyder, Harv. Bus. Rev. 2000, 139 (Jan.-Feb. 2000).
3. A. de-Graft Aikins et al., Global. Health 6, 5 (2010).
4. P. Kowal et al., Glob. Health Action 3 (suppl. 2), 10.3402/gha.v3i0.5302 (2010).
5. OpenXData, www.openxdata.org.
6. EAIDSNet, www.eac.int.
7. H. F. Wertheim et al., PLoS Med. 7, e1000231 (2010).
8. The South East Asia Infectious Disease Clinical Research Network, www.seaicrn.org.
9. S. I. Hay, R. W. Snow, PLoS Med. 3, e473 (2006).
10. The Malaria Genomic Epidemiology Network, Nature 456, 732 (2008).
11. M. Parker et al., PLoS Med. 6, e1000143 (2009).
12. G. W. Fegan, T. A. Lang, PLoS Med. 5, e6 (2008).
13. T. A. Lang et al., PLoS Negl. Trop. Dis. 4, e619 (2010).
14. A. M. Dondorp et al., Lancet 376, 1647 (2010).
15. P. J. Guerin, S. J. Bates, C. H. Sibley, Curr. Opin. Infect. Dis. 22, 593 (2009).
16. M. Pirmohamed, K. N. Atuah, A. N. Dodoo, P. Winstanley, Br. Med. J. 335, 462 (2007).
17. A. S. Kanter et al., Int. J. Med. Inf. 78, 802 (2009).
18. D-Tree International, www.d-tree.org/.
19. S. F. Noormohammad et al., Int. J. Med. Inf. 79, 204 (2010).
20. B. A. Fischer, M. J. Zigmond, Sci. Eng. Ethics 16, 783 (2010).
21. E. Pisani, C. AbouZahr, Bull. W. H. O. 88, 462 (2010).
22. J. Whitworth, Bull. W. H. O. 88, 467 (2010).
23. B. Malin, D. Karp, R. H. Scheuermann, J. Investig. Med. 58, 11 (2010).
24. R. Horton, Lancet 355, 2231 (2000).
25. The author received no specific funding for this work and has no conflicts of interest to declare.

10.1126/science.1199349

Downloaded from http://science.sciencemag.org/ on February 27, 2020

PERSPECTIVE

More Is Less: Signal Processing and the Data Deluge

Richard G. Baraniuk

Department of Electrical and Computer Engineering, Rice University, Houston, TX 77251–1892, USA. E-mail: [email protected]

The data deluge is changing the operating environment of many sensing systems from data-poor to data-rich, so data-rich that we are in jeopardy of being overwhelmed. Managing and exploiting the data deluge require a reinvention of sensor system design and signal processing theory. The potential pay-offs are huge, as the resulting sensor systems will enable radically new information technologies and powerful new tools for scientific discovery.

Until recently, the scientist’s problem was a “sensor bottleneck.” Sensor systems produced scarce data, complicating subsequent information extraction and interpretation. In response to the resulting challenge of “doing more with less,” signal-processing researchers have spent the last several decades creating powerful new theory and technology for digital data acquisition (digital cameras, medical scanners), digital signal processing (machine vision; speech, audio, image, and video compression), and digital communication (high-speed modems, Wi-Fi) that have both enabled and accelerated the information age.

These hardware advances have fueled an even faster exponential explosion of sensor data produced by a rapidly growing number of sensors of rapidly growing resolution. Digital camera sensors have dropped in cost to nearly $1/megapixel; this has enabled billions of people to acquire and share high-resolution images and videos. Millions of security and surveillance cameras, including unmanned drone aircraft prowling the skies, have joined high-resolution telescopes, digital radio receivers, and many other types of sensors in the environment. As a result, a sensor data deluge is beginning to swamp many of today’s critical sensing systems.

In just a few years, the sensor data deluge has shifted the bottleneck of many data acquisition systems from the sensor back to the processing, communication, or storage subsystems (Fig. 1). To see why, consider the exponentially growing gap between global sensing and data storage capabilities. A recent report (1) found that the amount of data generated worldwide (which is now dominated by sensor data) is growing by 58% per year; in 2010 the world generated 1250 billion gigabytes of data, more bits than all of the stars in the universe. In contrast, the total amount of world data storage (in hard drives, memory chips, and tape) is growing 31% slower, at only 40% per year. A milestone was reached in 2007, when the world produced more data than could fit in all of the world’s storage; in 2011 we already produce over twice as much data as can be stored. This expanding gap between sensor data production and available data storage means that sensor systems will increasingly face a deluge of data that will be unavailable later for further analysis. Similar exponentially expanding gaps exist between sensor data production and both computational power and communication rates.

The danger is that more sensor data can lead to less efficient sensor systems. Consider two brief illustrations. The first is the Defense Advanced Research Projects Agency (DARPA) Autonomous Real-Time Ground Ubiquitous Surveillance Imaging System (ARGUS-IS) developed for military reconnaissance and real-time monitoring that features a 1.8-gigapixel digital camera constructed from hundreds of cell phone camera chips (2). Each camera image covers up to 160 km² (almost the size of greater Los Angeles) with a 30-cm ground resolution. When acquiring video at 15 frames per second, the camera produces raw data at a rate of 770 gigabits per second (Gbps). In stark contrast, the wireless communications link to the ground station (where the data are to be exploited by signal processing algorithms) has a maximum rate of just 274 megabits per second (Mbps). Even using today’s state-of-the-art video compression algorithms, the camera sensor produces hundreds of times more image and video data than can ever be communicated off the platform. Moreover, moving the ground station’s signal processing hardware up to the sensing platform is out of the question, because it occupies several large racks of computers.

The second example is the Compact Muon Solenoid (CMS) detector of the Large Hadron Collider at CERN, which will produce raw measurement data at a rate of 320 terabits per second (Tbps), far beyond the capabilities of either processing or storage systems today (3). As a stopgap measure, custom hardware carefully triages the raw data stream to a rate of 800 Gbps by selecting only the potentially “interesting” events, which are then fed into a real-time computing farm to further process and compress for storage. All other events are lost in the acquisition process.

Given the growing gap between the amount of data we produce and the amount of data we can process, communicate, and store, systems like ARGUS-IS and the CMS will become more the norm than the exception over time. Successfully navigating the data deluge calls for fundamental advances in the theory and practice of sensor design; signal processing algorithms; wideband communication systems; and compression, triage, and storage techniques.

A recent Frontiers of Engineering event examined some of the encouraging preliminary results in these directions (4). One promising direction is the design of new kinds of data acquisition systems that replace conventional sensors with compressive sensors that combine sensing, compression, and data processing in one operation. The key enabler is the recognition that the amount of information in many interesting signals is much smaller than the amount of raw data produced by a conventional sensor. More technically, many interesting signals inhabit an extremely low-dimensional subset of the high-dimensional raw sensor data space. Rather than first acquiring a massive amount of raw data and then boiling it down into information via signal processing algorithms, compressive sensors attempt to acquire the information directly.

Such low-dimensional signal structure may manifest itself in a number of different ways. In a sparse signal model, N raw data samples can be transformed to a domain where only K (much less than N) representation coefficients are nonzero (5, 6). Sparse models lie at the heart of popular compression and processing algorithms such as JPEG. In a manifold signal model, the raw data can be parameterized (nonlinearly, in general) using just K parameters (7). Such a model is natural for imaging problems involving a known object and K unknown camera parameters. Recent research on compressive sensing has led to two results that in combination promise to temper the data deluge. First, signals from both sparse and manifold models can be acquired without information loss using just on the order of K log N compressive measurements rather than N raw-data measurements (5, 6, 8). Second, a range of different signal processing algorithms can extract the salient signal characteristics directly from the low-rate compressive measurements (9). The sensing protocols that achieve this low measurement rate are inherently random and distinct from the classical Shannon-Nyquist sampling theory that dominates digital sensing theory and practice.

In another promising direction, researchers are turning the data deluge to their advantage by replacing conventional signal processing algorithms based on mathematical models with new algorithms that mine the deluge. One striking example is a tool that fuses a large collection of unorganized images of a scene (say, photos of Notre Dame cathedral from the photo-sharing Web site Flickr) and automatically computes each photo’s viewpoint and a three-dimensional model of the scene (10).

In the long run, without radical superexponential advances in computer processing, communication, and storage capabilities, the data deluge is here to stay. The next generation of sensor designs and signal processing theory will have to harness the deluge in order to do more, rather than less, with its bounty. The broader implications for science and engineering are appreciable. Can scientific conclusions be trusted when the raw experimental data are lost and the data triage or compression algorithm might be suspect? Can we resist the temptation to equate correlation with causation when mining massive data sets for scientific conclusions? Can we develop the new low-complexity mathematical models and the new practical sensing protocols that are needed to effectively extract information from the bulk of the deluge? Clearly, these are exciting times for sensor system design.

[Figure 1: block diagrams with the labels Sensor, Processing, Communication, Storage, and INFORMATION.]
Fig. 1. Dealing with the sensor data deluge. In a conventional sensing system (top), the sensor is the performance bottleneck. In a data deluge–era sensing system (bottom), the number and resolution of the sensors grow to the point that the performance bottleneck moves to the sensor data processing, communication, or storage subsystem.

References
1. J. Gantz, D. Reinsel, “The Digital Universe Decade - Are You Ready?” IDC White Paper, May 2010; http://idcdocserv.com/925.
2. DARPA ARGUS-IS program, www.darpa.mil/i2o/programs/argus/argus.asp.
3. The CMS Collaboration, J. Instrumentation 3, S08004 (2008).
4. U.S. National Academy of Engineering and Royal Academy of Engineering, Frontiers of Engineering, EU-US Symposium, Cambridge, UK, 31 August to 3 September 2010; www.raeng.org.uk/international/activities/frontiers_engineering_symposium.htm.
5. E. J. Candès, J. Romberg, T. Tao, IEEE Trans. Inf. Theory 52, 489 (2006).
6. D. L. Donoho, IEEE Trans. Inf. Theory 52, 1289 (2006).
7. J. B. Tenenbaum, V. de Silva, J. C. Langford, Science 290, 2319 (2000).
8. R. G. Baraniuk, M. B. Wakin, Found. Comput. Math. 9, 51 (2009).
9. S. Muthukrishnan, Found. Trends Theor. Comput. Sci. 1 (issue 2), 117 (2005).
10. N. Snavely, S. M. Seitz, R. Szeliski, ACM Trans. Graph. 25, 835 (2006).

10.1126/science.1197448

www.sciencemag.org SCIENCE VOL 331 11 FEBRUARY 2011 717
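The storage-gap argument in the Baraniuk article (data generation growing by 58% per year against storage growing by 40% per year) can be worked through numerically. The following is a minimal sketch, not part of the article: it assumes both growth rates stay constant, which the article does not claim, and the function names are hypothetical.

```python
import math


def relative_slowdown(data_rate: float, storage_rate: float) -> float:
    """Fraction by which the storage growth rate falls short of the data growth rate."""
    return 1.0 - storage_rate / data_rate


def gap_growth_per_year(data_rate: float, storage_rate: float) -> float:
    """Yearly factor by which the ratio (data produced / storage available) grows."""
    return (1.0 + data_rate) / (1.0 + storage_rate)


def gap_doubling_time(data_rate: float, storage_rate: float) -> float:
    """Years for the data-to-storage ratio to double, assuming constant rates."""
    return math.log(2.0) / math.log(gap_growth_per_year(data_rate, storage_rate))


# Rates quoted in the article (reference 1): 58% and 40% per year.
DATA_RATE = 0.58
STORAGE_RATE = 0.40

slowdown = relative_slowdown(DATA_RATE, STORAGE_RATE)
yearly_gap = gap_growth_per_year(DATA_RATE, STORAGE_RATE)
doubling_years = gap_doubling_time(DATA_RATE, STORAGE_RATE)

print(f"storage grows {slowdown:.0%} slower than data")
print(f"data/storage ratio grows by a factor of {yearly_gap:.3f} per year")
print(f"ratio doubles roughly every {doubling_years:.1f} years")
```

With the quoted rates, 40/58 is about 0.69, which is where the "31% slower" figure comes from. Under the constant-rate assumption the ratio of data produced to storage available grows by roughly 13% per year.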
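The two illustrations in the Baraniuk article, ARGUS-IS and the CMS detector, both come down to a ratio between the rate at which a sensor produces data and the rate the next stage can accept. The sketch below only redoes that arithmetic with the figures quoted in the text. The helper name and unit constants are hypothetical, and the bits-per-pixel value is a derived number that the article does not state.

```python
GBPS = 1e9   # bits per second in one gigabit per second
MBPS = 1e6   # bits per second in one megabit per second
TBPS = 1e12  # bits per second in one terabit per second


def reduction_factor(source_bps: float, sink_bps: float) -> float:
    """How many times faster the source produces bits than the sink accepts them."""
    return source_bps / sink_bps


# ARGUS-IS: 770 Gbps of raw video against a 274 Mbps wireless downlink.
argus_factor = reduction_factor(770 * GBPS, 274 * MBPS)

# CMS: 320 Tbps of raw measurements triaged down to 800 Gbps.
cms_factor = reduction_factor(320 * TBPS, 800 * GBPS)

# Derived check (not in the article): bits per pixel per frame implied by
# 1.8 gigapixels at 15 frames per second producing 770 Gbps.
bits_per_pixel = (770 * GBPS) / (1.8e9 * 15)

print(f"ARGUS-IS raw rate exceeds the link rate by about {argus_factor:.0f}x")
print(f"CMS hardware triage reduces the stream by {cms_factor:.0f}x")
print(f"implied raw depth: about {bits_per_pixel:.1f} bits per pixel per frame")
```

The raw ARGUS-IS ratio comes out near 2800. The article's "hundreds of times" figure applies after video compression, which this sketch does not model.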
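The compressive sensing claim in the Baraniuk article, that a K-sparse signal of length N can be acquired with on the order of K log N random measurements, can be illustrated with a small numerical experiment. The sketch below is an assumption-laden toy, not the article's method. It uses a random Gaussian measurement matrix and orthogonal matching pursuit, a standard greedy recovery algorithm that the article does not name. The sizes N = 256 and K = 5, the constant 3 in front of K log N, and the function name `omp` are all chosen here for illustration.

```python
import numpy as np


def omp(A: np.ndarray, y: np.ndarray, k: int) -> np.ndarray:
    """Orthogonal matching pursuit: greedily estimate a k-sparse x from y = A @ x."""
    n = A.shape[1]
    residual = y.copy()
    support: list[int] = []
    coef = np.zeros(0)
    for _ in range(k):
        # Pick the column most correlated with what is still unexplained.
        correlations = np.abs(A.T @ residual)
        if support:
            correlations[support] = 0.0  # never pick the same column twice
        support.append(int(np.argmax(correlations)))
        # Least-squares fit on the chosen columns, then update the residual.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x_hat = np.zeros(n)
    x_hat[support] = coef
    return x_hat


rng = np.random.default_rng(0)

N, K = 256, 5                         # signal length and sparsity (assumed)
M = int(np.ceil(3 * K * np.log(N)))   # measurements: constant 3 is an assumption

# K-sparse test signal with nonzero magnitudes between 1 and 2.
x = np.zeros(N)
nonzero_idx = rng.choice(N, size=K, replace=False)
x[nonzero_idx] = rng.choice([-1.0, 1.0], size=K) * rng.uniform(1.0, 2.0, size=K)

# Random Gaussian sensing matrix and the M compressive measurements.
A = rng.standard_normal((M, N)) / np.sqrt(M)
y = A @ x

x_hat = omp(A, y, K)
max_err = float(np.max(np.abs(x_hat - x)))

print(f"N = {N} samples, K = {K} nonzeros, M = {M} measurements")
print(f"largest reconstruction error: {max_err:.2e}")
```

Here M is about a third of N. The experiment is noiseless and the sparsity K is given to the algorithm, which is a much easier setting than a real sensor faces. It shows only that the measurement count can track K log N rather than N.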

PERSPECTIVE

Ensuring the Data-Rich Future of the Social Sciences

Gary King

Institute for Quantitative Social Science, 1737 Cambridge Street, Harvard University, Cambridge, MA 02138, USA. E-mail: [email protected]

Massive increases in the availability of informative social science data are making dramatic progress possible in analyzing, understanding, and addressing many major societal problems. Yet the same forces pose severe challenges to the scientific infrastructure supporting data sharing, data management, informatics, statistical methodology, and research ethics and policy, and these are collectively holding back progress. I address these changes and challenges and suggest what can be done.

Fifteen years ago, Science published predictions from each of 60 scientists about the future of their fields (1). The physical and natural scientists wrote about a succession of breathtaking discoveries to be made, inventions to be constructed, problems to be solved, and policies and engineering changes that might become possible. In sharp contrast, the (smaller number of) social scientists did not mention a single problem they thought might be addressed, much less solved, or any inventions or discoveries on the horizon. Instead, they wrote about social science scholarship: how we once studied this, and in the future we’re going to be studying that. Fortunately, the editor’s accompanying warning was more prescient: “history would suggest that scientists tend to underestimate the future” (2).

Indeed. What the social scientists did not foresee in 1995 was the onslaught of new social science data (enormously more informative than ever before) and what this information is now making possible. Today, huge quantities of digital information about people and their various groupings and connections are being produced by the revolution in computer technology, the analog-to-digital transformation of static records and devices into easy-to-access data sources, the competition among governments to share data and run randomized policy experiments, the new technology-enhanced ways that people interact, and the many commercial entities creating and monetizing new forms of data collection (3).

Analogous to what it must have been like when they first handed out microscopes to microbiologists, social scientists are getting to the point in many areas at which enough information exists to understand and address major previously intractable problems that affect human society. Want to study crime? Whereas researchers once relied heavily on victimization surveys, huge quantities of real-time geocoded incident reports are now available. What about the influence of citizen opinions? Adding to the venerable random survey of 1000 or so respondents, researchers can now harvest more than 100 million social media posts a day and use new automated text analysis methods to extract relevant information (4). At the same time, parts of the biological sciences are effectively becoming social sciences, as genomics, proteomics, metabolomics, and brain imaging produce large numbers of person-level variables, and researchers in these fields join in the hunt for measures of behavioral phenotypes. In parallel, computer scientists and physicists are delving into social science data with their new methods and data-collection schemes.

The potential of the new data is considerable, and the excitement in the field is palpable. The fundamental question is whether researchers can find ways of accessing, analyzing, citing, preserving, and protecting this information. Although information overload has always been an issue for scholars (5), today the infrastructural challenges in data sharing, data management, informatics, statistical methodology, and research ethics and policy risk being overwhelmed by the massive increases in informative data. Many social science data sets are so valuable and sensitive that when commercial entities collect them, external researchers are granted almost no access. Even when sensitive data are collected originally by researchers or acquired from corporations, privacy concerns sometimes lead to public policies that require the data be destroyed after the research is completed, a step that obviously makes scientific replication impossible (6) and that some think will increase fraudulent publications (7).

Indeed, we appear to be in the midst of a massive collision between unprecedented increases in data production and availability about individuals and the privacy rights of human beings worldwide, most of whom are also effectively research subjects (Fig. 1).

Consider how much more informative to researchers, and potentially intrusive to people, the new data can be. Researchers now have the possibility of continuous-time location information from cell phones, Fastlane or EZPass transponders, IP addresses, and video surveillance. We have information about political preferences from person-level voter registration, primary participation, individual campaign contributions, signature campaigns, and ballot images. Commercial information is available from credit card transactions, real estate purchases, wealth indicators, credit checks, product radio-frequency identification (RFIDs), online product searches and purchases, and device fingerprinting. Health information is being collected via electronic medical records, hospital admittances, and new devices for continuous monitoring, passive heart beat measurement, movement indicators, skin conductivity, and temperature. Extensive quantities of information in unstructured textual format are being produced in social media posts, e-mails, product reviews, speeches, government reports, and other Web sources. Satellite imagery is increasing in resolution and scholarly usefulness. Social everything (networking, bookmarking, highlighting, commenting, product reviewing, recommending, and annotating) has been sprouting up everywhere on the Web, often in research-accessible ways. Participation in online games and virtual worlds produces even more detailed data. Commercial entities are scrambling to generate data to improve their business operations through tracking employee behavior, Web site visitors, search patterns, advertising click-throughs, and every manner of cloud services that capture more and more information.

Efforts in the social sciences that make data, code, and information associated with individual published articles available to other scholars have been advancing through software, journal policies, and improved researcher practices for some time (8, 9). However, this movement is at risk of



More Is Less: Signal Processing and the Data Deluge
Richard G. Baraniuk

Science 331 (6018), 717-719.


DOI: 10.1126/science.1197448



ARTICLE TOOLS: http://science.sciencemag.org/content/331/6018/717

RELATED CONTENT: http://science.sciencemag.org/content/sci/331/6018/692.full

REFERENCES: This article cites 7 articles, 1 of which you can access for free
http://science.sciencemag.org/content/331/6018/717#BIBL

PERMISSIONS: http://www.sciencemag.org/help/reprints-and-permissions

Use of this article is subject to the Terms of Service

Science (print ISSN 0036-8075; online ISSN 1095-9203) is published by the American Association for the Advancement of
Science, 1200 New York Avenue NW, Washington, DC 20005. The title Science is a registered trademark of AAAS.
Copyright © 2011, American Association for the Advancement of Science
