IEEE International Conference on Web Services (ICWS'05), 2005
The Sloan Digital Sky Survey (SDSS) science database describes over 140 million objects and is over 1.5 TB in size. The SDSS Catalog Archive Server (CAS) provides several levels of query interface to the SDSS data via the SkyServer website. Most queries execute in seconds or minutes. However, some queries can take hours or days, either because they require non-index scans of the largest tables, or because they request very large result sets, or because they represent very complex aggregations of the data. These "monster queries" not only take a long time, they also affect response times for everyone else; one or more of them can clog the entire system. To ameliorate this problem, we developed a multi-server multi-queue batch job submission and tracking system for the CAS called CasJobs. The transfer of very large result sets from queries over the network is another serious problem. Statistics suggested that much of this data transfer is unnecessary; users would prefer to store results locally in order to allow further joins and filtering. To allow local analysis, a system was developed that gives users their own personal databases (MyDB) at the server side. Users may transfer data to their MyDB, and then perform further analysis before extracting it to their own machine. MyDB tables also provide a convenient way to share results of queries with collaborators without downloading them. CasJobs is built using SOAP XML Web services and has been in operation since May 2004.
The NVO Open SkyQuery portal allows users to query large, physically distributed databases of astronomical objects such as the Sloan Digital Sky Survey, 2MASS catalog, ROSAT All-Sky Survey, and the FIRST and NVSS radio surveys. Queries can be generated through a simple forms-based interface or through an advanced query page in which the user generates a SQL-type query. The available databases are determined by examining the NVO registry, so that as new databases are published as SkyNodes they immediately become available to Open SkyQuery.
Traditional science searched for new objects and phenomena that led to discoveries. Tomorrow's science will combine together the large pool of information in scientific archives and make discoveries. Scientists are currently keen to federate together the existing scientific databases. The major challenge in building a federation of these autonomous and heterogeneous databases is system integration. Ineffective integration will result in
Science projects are data publishers. The scale and complexity of current and future science data changes the nature of the publication process. Publication is becoming a major project component. At a minimum, a project must preserve the ephemeral data it gathers. Derived data can be reconstructed from metadata, but metadata is ephemeral. Longer term, a project should
Modern scientific repositories are growing rapidly in size. Scientists are increasingly interested in viewing the latest data as part of query results. Current scientific middleware cache systems, however, assume repositories are static. Thus, they cannot answer scientific queries with the latest data. The queries, instead, are routed to the repository until data at the cache is refreshed. In data-intensive scientific disciplines, such as astronomy, indiscriminate query routing or data refreshing often results in runaway network costs. This severely affects the performance and scalability of the repositories and makes poor use of the cache system. We present Delta, a dynamic data middleware cache system for rapidly growing scientific repositories. Delta's key component is a decision framework that adaptively decouples data objects, choosing to keep some data objects at the cache when they are heavily queried, and keeping other data objects at the repository when they are heavily updated. Our algorithm profiles incoming workload to search for optimal data decoupling that reduces network costs. It leverages formal concepts from the network flow problem, and is robust to evolving scientific workloads. We evaluate the efficacy of Delta, through a prototype implementation, by running query traces collected from a real astronomy survey.
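The placement trade-off described in this abstract can be illustrated with a small sketch. This is not Delta's algorithm: the abstract states that Delta relies on network flow concepts, whereas the code below uses a much simpler per-object greedy comparison and ignores queries that touch several objects at once. The names `ObjectStats`, `place_objects`, and `network_cost`, and the byte counts in the usage note, are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ObjectStats:
    """Profiled traffic for one data object (hypothetical structure)."""
    name: str
    query_bytes: int   # result bytes shipped over the network if the object stays at the repository
    update_bytes: int  # update bytes shipped over the network if the object is kept at the cache


def place_objects(stats):
    """Greedy simplification: put each object on whichever side costs less traffic."""
    placement = {}
    for s in stats:
        # Heavily queried objects go to the cache; heavily updated ones stay at the repository.
        placement[s.name] = "cache" if s.query_bytes > s.update_bytes else "repository"
    return placement


def network_cost(stats, placement):
    """Total bytes crossing the network under a given placement."""
    return sum(
        s.update_bytes if placement[s.name] == "cache" else s.query_bytes
        for s in stats
    )
```

For example, with `ObjectStats("hot_reads", 900, 100)` and `ObjectStats("hot_writes", 50, 700)`, `place_objects` keeps `hot_reads` at the cache and `hot_writes` at the repository, and `network_cost` for that placement is 150 bytes.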
Cross-match spatially clusters and organizes several astronomical point-source measurements from one or more surveys. Ideally, each object would be found in each survey. Unfortunately, the observation conditions and the objects themselves change continually. Even some stationary objects are missing in some observations; sometimes objects have a variable light flux and sometimes the seeing is worse. In most cases we are
Enterprise and scientific data sets double every year, forcing similar growth in storage size and power consumption. As a consequence, current architectures used to build data warehouses are hitting a power consumption wall. We propose a novel alternative architecture comprising a large number of so-called "Amdahl blades" that combine energy-efficient CPUs and GPUs with solid state disks to increase sequential I/O throughput by an order of magnitude while keeping power consumption constant. Offloading throughput-intensive analysis to GPUs integrated in the server's chipset increases the relative performance per Watt. While keeping the power consumption constant, Amdahl blades offer 13 times the throughput of a state-of-the-art computing cluster for data-intensive applications. Finally, using the scaling laws originally postulated by Amdahl, we show that systems for data-intensive computing must maintain a balance between low power consumption and per-server throughput to optimize performance per Watt. Using a mix of SSDs and low-power hard disks we achieve a remarkably balanced system with excellent I/O, power consumption and disk capacity.
The SkyServer is an Internet portal to the Sloan Digital Sky Survey Catalog Archive Server. From 2001 to 2006, there were a million visitors in 3 million sessions generating 170 million Web hits, 16 million ad-hoc SQL queries, and 62 million page views. The site currently averages 35 thousand visitors and 400 thousand sessions per month. The Web and SQL
The next-generation astronomy digital archives will cover most of the universe at fine resolution in many wavelengths, from X-rays to ultraviolet, optical, and infrared. The archives will be stored at diverse geographical locations. One of the first of these projects, the Sloan Digital Sky Survey (SDSS), will create a 5-wavelength catalog over 10,000 square degrees of the sky (see http://www.sdss.org/). The 200 million objects in the multi-terabyte database will have mostly numerical attributes, defining a space of 100+ dimensions. Points in this space have highly correlated distributions.
Proceedings of the 2006 ACM/IEEE conference on Supercomputing - SC '06, 2006
National Center for Data Mining at UIC. In our SC06 BWC entry, we will transfer SDSS (Sloan Digital Sky Survey) Data Release 5 (DR5) between the SC06 show floor in Tampa and one of the NCDM labs on the UIC campus. We will use SECTOR, our newly developed distributed data space management system, to transfer DR5 in parallel between two Linux
2006 Second IEEE International Conference on e-Science and Grid Computing (e-Science'06), 2006
In this paper, we describe a peer-to-peer storage system called Sector that is designed to access and transport large data sets over wide area high performance networks. We also describe our recent experience using Sector to distribute the Sloan Digital Sky Survey BESTDR4 catalog data.
User-generated Structured Query Language (SQL) queries are a rich source of information for database analysts, information scientists, and the end users of databases. In this study a group of scientists in astronomy and computer and information scientists work together to analyze a large volume of SQL log data generated by users of the Sloan Digital Sky Survey (SDSS) data archive in order to better understand users' data-seeking behavior. While statistical analysis of such logs is useful at aggregated levels, efficiently exploring specific patterns of queries is often a challenging task due to the typically large volume of the data, multivariate features, and data requirements specified in SQL queries. To enable and facilitate effective and efficient exploration of the SDSS log data, we designed an interactive visualization tool, called the SDSS Log Viewer, which integrates time series visualization, text visualization, and dynamic query techniques. We describe two analysis scenarios of visual exploration of SDSS log data, including understanding unusually high daily query traffic and modeling the types of data-seeking behaviors of massive query generators. The two scenarios demonstrate that the SDSS Log Viewer provides a novel and potentially valuable approach to support these targeted tasks.
Web Services form a new, emerging paradigm to handle distributed access to resources over the Internet. There are platform-independent standards (SOAP, WSDL), which make the developers' task considerably easier. This article discusses how web services could be used in the context of the Virtual Observatory. We envisage a multi-layer architecture, with interoperating services. A well-designed lower layer consisting of simple, standard services implemented by most data providers will go a long way towards establishing a modular architecture. More complex applications can be built upon this core layer. We present two prototype applications, the SdssCutout and the SkyQuery, as examples of this layered architecture.
Astronomy is posing imminent big challenges to database management systems. Projects such as Pan-STARRS will build a 300 TB database system by 2011. In order to achieve such an ambitious goal, we must divide to conquer. We present the GrayWulf framework, where computational and data resources are integrated through powerful workflow and query tools. The system is built on top of a cluster of commodity servers. Its scalable architecture makes it a great host for data-intensive applications such as large-scale cross-matching.
We present the fifth edition of the Sloan Digital Sky Survey (SDSS) Quasar Catalog, which is based upon the SDSS Seventh Data Release. The catalog, which contains 105,783 spectroscopically confirmed quasars, represents the conclusion of the SDSS-I and SDSS-II quasar survey. The catalog consists of the SDSS objects that have luminosities larger than $M_i = -22.0$ (in a cosmology with $H_0 = 70$ km s$^{-1}$ Mpc$^{-1}$, $\Omega_M = 0.3$, and $\Omega_\Lambda = 0.7$), have at least one emission line with FWHM larger than 1000 km s$^{-1}$ or have interesting/complex absorption features, are fainter than $i \approx 15.0$, and have highly reliable redshifts. The catalog covers an area of $\approx 9380$ deg$^2$. The quasar redshifts range from 0.065 to 5.46, with a median value of 1.49; the catalog includes 1248 quasars at redshifts greater than four, of which 56 are at redshifts greater than five. The catalog contains 9210 quasars with $i < 18$; slightly over half of the entries have $i < 19$. For each object the catalog presents positions accurate to better than 0.1″ rms per coordinate, five-band (ugriz) CCD-based photometry with typical accuracy of 0.03 mag, and information on the morphology and selection method. The catalog also contains radio, near-infrared, and X-ray emission properties of the quasars, when available, from other large-area surveys. The calibrated digital spectra cover the wavelength region 3800-9200 Å at a spectral resolution of $\simeq 2000$; the spectra can be retrieved from the SDSS public database using the information provided in the catalog. Over 96% of the objects in the catalog were discovered by the SDSS. We also include a supplemental list of an additional 207 quasars with SDSS spectra whose archive photometric information is incomplete.
Putting data into the public domain is not the same thing as making those data accessible for intelligent analysis. A distinguished group of editors and experts who were already engaged in one way or another with the issues inherent in making research data public came together with statisticians to initiate a dialogue about policies and practicalities of requiring published research to be accompanied by publication of the research data. This dialogue carried beyond the broad issues of the advisability, the intellectual integrity, the scientific exigencies to the relevance of these issues to statistics as a discipline and the relevance of statistics, from inference to modeling to data exploration, to science and social science policies on these issues.
In this paper, we describe two distributed, data-intensive applications that were demonstrated at iGrid 2005 (iGrid Demonstration US109 and iGrid Demonstration US121). One involves transporting astronomical data from the Sloan Digital Sky Survey (SDSS) and the other involves computing histograms from multiple high volume data streams. Both rely on newly developed data transport and data mining middleware. Specifically, we describe a new version of the UDT network protocol called Composible-UDT, a file transfer utility based upon UDT called UDT-Gateway, and an application for building histograms on high volume data flows called BESH for Best Effort Streaming Histogram. For both demonstrations, we include a summary of the experimental studies performed at iGrid 2005.
A commercial, object-oriented database engine with custom tools for data-mining the multiterabyte Sloan Digital Sky Survey archive did not meet its performance objectives. We describe the problems, technical issues, and process of migrating this large data set project to relational database technology.
Papers by Ani Thakar