
The sqlLoader Data-Loading Pipeline

2008, Computing in Science & Engineering

Using a database management system (DBMS) is essential to ensure the data integrity and reliability of large, multidimensional data sets. However, loading multiterabyte data into a DBMS is a time-consuming and error-prone task that the authors have tried to automate by developing the sqlLoader pipeline, a distributed workflow system for data loading.

SDSS SCIENCE ARCHIVE

Alex Szalay and Ani R. Thakar, Johns Hopkins University
Jim Gray, Microsoft Research

Database management systems offer considerable advantages and are essential for reliable data warehousing and integrity, not to mention acceptable data retrieval performance and data mining capabilities. Of course, we must ingest the data into the DBMS before we can realize any of these benefits. Loading large multidimensional data sets such as the Sloan Digital Sky Survey (SDSS) into a DBMS is a critical and time-consuming step that's fraught with potential missteps. In many ways, loading the data is the weakest link in archive publishing: it's typically the most error-prone and time-consuming, yet least planned and budgeted, part of archive operations.

The virtual observatory (VO)1 effort involves integrating many historical and current data sets into online repositories. Loading these data sets (and reloading them as schemas change) is an ongoing task. With the multiterabyte-to-petabyte data volumes anticipated for upcoming archives in the VO, data-loading time is expected to become a critical factor. Disk speeds won't keep up with the exponential increase in data volume, and with the limited operational resources typical of scientific archives, loading such large archives can take several days if parallelism is used, and several weeks if not.

In practice, archive data often must be reloaded several times even before it's published: errors in the data and in the data-processing pipeline cause reloads, and changes to the schema or performance considerations sometimes make reloading the data the most expedient and clean option. The efficiency and reliability of the data-loading process is therefore of paramount importance, both to minimize the operational resources required to reload data and to maximize the chance of the data being correct the first time it's published.

A crucial fact about public archives is that once the data is published, it's immutable: it can't be retracted or changed, especially after scientific research has been based on the data and papers have been published. As such, we get only one chance to get the data loaded correctly, so the loading procedure must be as reliable, thorough, and automated as possible. Accordingly, we've developed a data-loading pipeline for the SDSS data called the sqlLoader.2 We subdivided the loading process into a sequence of steps and developed a largely automated workflow system that executes these steps (in parallel where possible) on a distributed cluster of load servers. Loading the multiterabyte SDSS archive would be chaotic, if not impossible, without the automation and workflow management that the sqlLoader pipeline provides.
Loading SDSS Data

There are three main operational stages in SDSS data loading:

• Converting Flexible Image Transport System (FITS) binary data to ASCII comma-separated value (CSV) files. We store the output from the SDSS processing pipelines in binary form in the operational database (OpDB). We then export it from the OpDB into binary FITS files, a standard astronomy format for binary data and images. The FITS data must be converted to ASCII CSV format before it can be ingested into the databases (see Figure 1).
• Loading CSV files into temporary SQL Server databases. Once we convert the data to CSV, we load it into temporary SQL Server databases called task databases and validate it; that is, we check the export for any inconsistencies and errors.
• Publishing the data from the task databases to the final SDSS database. In this stage, we merge all the data from multiple task databases and write it to the database that will be available to users, also known as the publish database. This stage creates the database indices and other ancillary tables. This last part is complex and time-consuming enough that we consider it a separate "finish" substage.

The first stage is performed on the Linux side, where the FITS files are hosted. The last two stages are automated on the Windows side in the sqlLoader workflow as the load and publish/finish stages.

Figure 1. Sloan Digital Sky Survey data-loading pipelines. Flexible Image Transport System (FITS) data exported by the operational database (OpDB) is first converted to comma-separated value (CSV) format on the Linux side. It's then Samba-mounted and loaded into the SQL Server databases on the Windows side.

FITS-to-CSV Converter

When we export the raw data stored in the SDSS OpDB to FITS files, several different types of files correspond to distinct data sets within the SDSS data, including image and spectroscopic data. A unit of exported image data is a chunk, or a single resolve operation in the OpDB. Because the data is often recalibrated as the image-processing pipeline is refined, different calibrations result in different skyVersions of the image data:

• Target skyVersion is the calibration from which we chose the spectroscopic targets.
• Best skyVersion is the latest, best calibration of the data.
• Runs skyVersion is a temporary calibration that might be reclassified as Target or Best later. Until then, it's only for use in internal collaboration and isn't released to the public.

In addition to the image data sets, the public data release also includes spectroscopic plates and tiling data. A utility called sqlFits2Csv converts these FITS files into ASCII CSV files and JPEG image files for ingestion into the SQL databases. sqlFits2Csv creates one type of output file for each table in the database; it also assigns to each catalog object a unique 64-bit object ID that we use as the primary key in the corresponding database table.

The conversion to CSV also acts as a "blood-brain" barrier between the Linux and Windows worlds and between the file and database worlds; it keeps either side protected from operational (as well as some functional) problems on the other side. All the popular database systems support generic CSV file converters. The CSV files are Samba-mounted on the Windows side and imported into the databases by the sqlLoader's distributed workflow system. Figure 1 shows the data flow.
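One common way to construct such an ID is to pack the fields that identify an object (skyVersion, rerun, run, camcol, field, and the object number within the field) into a single bigint. The T-SQL sketch below only illustrates the idea; the bit positions are an assumption made for this example and not necessarily the layout sqlFits2Csv actually uses.

    -- Illustrative sketch only: pack identifying fields into one BIGINT key.
    -- The bit positions used here are assumed for illustration.
    CREATE FUNCTION dbo.fSketchObjID (
        @skyVersion INT, @rerun INT, @run INT,
        @camcol INT, @field INT, @obj INT)
    RETURNS BIGINT
    AS
    BEGIN
        RETURN CAST(@skyVersion AS BIGINT) * 576460752303423488  -- 2^59
             + CAST(@rerun      AS BIGINT) * 281474976710656     -- 2^48
             + CAST(@run        AS BIGINT) * 4294967296          -- 2^32
             + CAST(@camcol     AS BIGINT) * 536870912           -- 2^29
             + CAST(@field      AS BIGINT) * 65536               -- 2^16
             + CAST(@obj        AS BIGINT);
    END;

Because the packed value is unique per object, it can serve directly as the clustered primary key of the corresponding table.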
Workflow Management

The SQL Server agent process controls the loader workflow using two agent jobs: load and pub. The load job runs once every minute, picking up the next step from the load stage for the first active task and executing it. Similarly, the pub job wakes up once every minute and picks up the next step in the publish/merge or finish stages. Figure 2 shows the workflow stages.

Figure 2. The workflow diagrams for the sqlLoader, showing the load, publish, and finish workflows. The load workflow is the most time-consuming and can run in distributed parallel mode. The publish and finish stages must be sequential.

The loader workflow has three distinct stages:

• The load stage includes checking the input CSV files, loading them into temporary task databases, validating the data in the task databases, and making a backup of the task database.
• The publish stage involves publishing the individual task databases to the publish database, merging the various data streams (imaging, spectro, and tiling) in the publish database, and creating the indices on the main tables.
• The finish stage includes running the final tests and creating the derived tables and precomputed joins as well as the corresponding indices.

We subdivide the publish stage further into two substages: publish and finish. At the moment, we must run the publish and finish stages sequentially because they write to the same (publish) database. The load/validate stages can run in parallel on a cluster of load servers, each having its own view of the schema and its own task database. On remote load servers (slaves), the local SQL Server agent also controls the workflow using account delegation. We create a domain "loadagent" account for this purpose.

Two main parts of the sqlLoader framework, loadadmin and loadsupport, facilitate distributed and parallel data loading with a cluster of load servers. The loadadmin scripts and database control the load framework on the master load server. The loadsupport scripts and databases set up the loading on the slave load servers and build the ancillary framework to facilitate the loading process, such as setting up user privileges, the links between master and slave load servers (if applicable), spatial indexing, and utilities to manage units of the loading process (phase, step, task, and so on). Figure 3 shows the relationship between the loadadmin and loadsupport parts of the framework.

Figure 3. Master (loadadmin) and slave (loadsupport) servers. The loadsupport servers have remote views of the loadadmin schema via linked server connections. Note that the master server hosts, in practice, both loadadmin and loadsupport environments.

Each of the satellite or slave servers links only to the master, not to each other. The loadsupport database contains read-only remote views of the loadadmin database rather than copies, so that changes in the master data or schema are automatically reflected in the slaves. This feature uses SQL Server's linked server mechanism to make remote databases and tables appear to be local.
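The linked-server mechanism itself takes only a few statements. As a rough illustration (the server name, host name, and view shown here are placeholders rather than the actual sqlLoader configuration), a slave's loadsupport database could expose the master's Task table like this:

    -- Run on a slave load server. 'LOADMASTER' and the host name are placeholders.
    EXEC sp_addlinkedserver
         @server = N'LOADMASTER',
         @srvproduct = N'',
         @provider = N'SQLNCLI',
         @datasrc = N'master-host-name';
    GO

    -- A read-only remote view of the master's Task table: schema or data changes
    -- on the master are immediately visible on the slave, with no copying.
    CREATE VIEW dbo.Task AS
        SELECT * FROM LOADMASTER.loadadmin.dbo.Task;
    GO

Because the view is resolved on the master at query time, the slaves never hold a stale copy of the loadadmin metadata.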
Load and Publish Roles

The loadadmin and loadsupport servers have different roles in the distributed implementation: the loadadmin server runs in both the load and publish roles, whereas the loadsupport servers run only in the load role. The load role corresponds to loading CSV files into individual component databases (one per load unit), whereas the publish role involves validating and transferring the component database's contents into the publish database for each data set. Therefore, we can do the loading in parallel, but the final part of the publishing stage that brings all the loaded components together into the publish database must be done sequentially. We refer to this final publish step as the merge/finish step to avoid confusing it with the copying of the task databases to the publish database. Several load servers can be in the load role, but only one server can be in the pub role at a time.

On each slave server (master and slave can coexist on one server), the SQL agent load job picks up the next load task to execute from its view of the task table. Similarly, on the publish server (usually the master), the SQL agent pub job picks up the next publish task from its view of the task table.
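The polling logic behind these agent jobs is essentially a table-driven dispatcher. The sketch below is only schematic, with hypothetical column and procedure names patterned on the Task and NextStep tables described later in Table 1; it is not the actual sqlLoader code.

    -- Schematic dispatcher: find the first active task, look up the procedure
    -- for its next step in NextStep, and execute it. Names are illustrative.
    CREATE PROCEDURE dbo.spRunNextLoadStep
    AS
    BEGIN
        DECLARE @taskID INT, @proc SYSNAME;

        SELECT TOP 1 @taskID = t.taskID, @proc = n.procName
        FROM   dbo.Task AS t
        JOIN   dbo.NextStep AS n ON n.stepName = t.nextStep
        WHERE  t.active = 1
        ORDER  BY t.taskID;

        IF @proc IS NOT NULL
            EXEC @proc @taskID;   -- each step procedure takes the task ID
    END;

An agent job scheduled to run once a minute would simply call a procedure of this kind.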
Loader Command Scripts

The sqlLoader framework contains several kinds of scripts:

• Windows command scripts (.bat files) create the loading framework and invoke the SQL scripts;
• SQL scripts (.sql files) contain the SQL code to set up the loading framework and load the data;
• Data Transformation Services (DTS) packages perform special-purpose data import tasks; and
• Visual Basic (VBScript) scripts check the CSV files' syntax and parse the documentation files to load them into the databases.

The initial creation of the loader framework, including building the loadadmin and loadsupport databases, must be done via two main Windows batch/command scripts, one each to build the loadadmin (master) and loadsupport (slave) frameworks. Although for conceptual clarity we show the master hosting only a loadadmin database in Figure 3, in practice, the master server hosts both the loadadmin database and a loadsupport database.

Loader SQL Scripts

The command scripts invoke the loadadmin and loadsupport SQL scripts to perform the actual database tasks (see Table 1). We coded most of the functions required for data import, validation, and publishing in the form of SQL stored procedures in the loadsupport, task, and publish databases. These are defined in Table 1 and installed in the appropriate databases upon creation.

Table 1. sqlLoader scripts required for data import, validation, and publishing.

loadadmin-build.sql: Deletes the existing loadadmin database, if any. Turns on trace flag 1807 (an undocumented flag that lets us mount remote databases). Sets numerous database options. Turns autoshrink off because performance bogs down when autoshrink tasks run in the background.

loadadmin-schema.sql: Creates the Task, Step, and Phase tables in the loadadmin database. Inserts a null task and step to which system errors can be assigned if everything fails. Creates the NextStep table, which drives the loading sequence by specifying the procedures for the next step. Creates the ServerState table, which lets us stop the server and thus stop processing. Creates the Constants table and puts all the paths in it. Gets the server name from a global variable. (Note that the SQL Server name for the server must be the same as the Windows name.)

loadadmin-local-config.sql: Must be adapted for the local configuration before running the loader. Sets up the paths for comma-separated value files. Sets up backup paths. Sets up the loadagent user and domain so that the SQL agent can be started up.

loadsupport-build.sql: Must be edited to update the domain account names explicitly. (In the future, we will likely reorganize it to deal differently with master/slave loadservers.) Creates a webagent user account used by the Load Monitor Web interface to connect to the loadserver(s) and run loader tasks; makes sure that the Web agent is a sysadmin on the loadserver and that only the master loadserver does this.

loadsupport-schema.sql: Creates tables specific to loadsupport (at the moment, only the LoadServer table that contains the current server's name).

loadsupport-link.sql: Sets up the link between the master and slave server for this slave. Creates a two-way linked-server relationship between the slave and master servers. Sets up all the views of the loadadmin schema. Enables remote transactions.

loadsupport-sp.sql: Sets up the stored procedures for loading from the server. Creates constructors for phase, step, and task. Performs start/end steps (only on the loadserver). Kills a task, ensuring that log records are kept and files are cleaned up. Deletes the task database when the same task is resubmitted (with a new taskID).

loadsupport-utils.sql: Stored procedures for the load stage.

loadsupport-steps.sql: Controls high-level steps: each step has a stored procedure with the name "sp<name-of-step>step" associated with it, which is the meat of the step's logic, and an "sp<name-of-step>" procedure, a wrapper that calls the "sp<name-of-step>step" procedure.

loadsupport-show.sql: Displays Load Monitor screens.
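As a purely hypothetical illustration of the step/wrapper convention in loadsupport-steps.sql, a CHECK step might be structured as follows; the body is a placeholder rather than the real sqlLoader logic, and spStartStep/spEndStep are assumed names for the start/end bookkeeping mentioned above.

    -- Hypothetical example of the step/wrapper pattern. spCheckStep carries the
    -- logic of the step; spCheck is the thin wrapper the scheduler calls.
    CREATE PROCEDURE dbo.spCheckStep @taskID INT
    AS
    BEGIN
        -- Placeholder body: verify that the CSV files for @taskID exist,
        -- have the expected columns, and so on.
        PRINT 'checking CSV files for task ' + CAST(@taskID AS VARCHAR(16));
    END;
    GO

    CREATE PROCEDURE dbo.spCheck @taskID INT
    AS
    BEGIN
        -- Wrapper: record the start of the step, run it, record the end.
        EXEC dbo.spStartStep @taskID, 'CHECK';   -- assumed logging helper
        EXEC dbo.spCheckStep @taskID;
        EXEC dbo.spEndStep   @taskID, 'CHECK';   -- assumed logging helper
    END;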
Schema Files

The schema creation scripts for the Catalog Archive Server (CAS) databases are in the schema subdirectory. The contents of this subdirectory encapsulate the data model for the archive and hence will change the most when adapted to other (non-SDSS) archives. Five subdirectories at this level contain various aspects of the database schema:

• csv contains CSV outputs from the documentation generation scripts; in general, this directory contains input files for metadata tables. It also contains CSV input files for tables of non-SDSS data required to do meaningful science with the SDSS (such as the Stetson or RC3 catalogs) and derived science data tables such as the Quasi-Stellar Object (QSO) catalogs.
• doc contains the schema documentation files.
• etc includes miscellaneous SQL script files for housekeeping and utility schema-related functions.
• log holds the weblogging schema and scripts.
• sql is the main subdirectory containing the SDSS database schema files. The SQL scripts in this subdirectory create the various schema tables, views, stored procedures, and functions. The xschema.txt file in this directory drives the schema creation: it contains a list of schema files that the loader executes in the order indicated in Table 2.

Table 2. Sloan Digital Sky Survey database schema Data Definition Language (DDL) files. The files fall into seven groups: metadata; basic schema; spatial functions; region & sector; diagnostics & test; Web interface; and finish, validation, and utility functions.

FileGroups.sql: Data-partitioning tables for multivolume disk
DataConstants.sql: Data Constants table and initialization
ConstantSupport.sql: Support functions to display constants
MetadataTables.sql: Metadata table definitions
IndexMap.sql: Index definitions for all tables
Versions.sql: Database versions
PhotoTables.sql: Photometry (imaging) table definitions
spPhoto.sql: Support functions for imaging data
NeighborTables.sql: Match and Neighbors tables
SpectroTables.sql: Spectroscopic data tables
spSpectro.sql: Support functions for spectroscopic data
TilingTables.sql: Plate tiling information tables
RegionTables.sql: Sector and Region tables
FrameTables.sql: Imaging frames for JPEG display
Views.sql: Definitions of views on all tables
spHtmCSharp.sql: C# Hierarchical Triangular Mesh (HTM) library functions
spSphericalLib.sql: Support library for HTM functions
spNearby.sql: Proximity search functions
spCoordinate.sql: Coordinate conversion functions
spRegion.sql: Functions to compute regions
spBoundary.sql: Functions to compute survey boundaries
spSector.sql: Functions to compute plate sectors
spCheckDB.sql: Various diagnostics functions
spTestQueries.sql: Test query suite
spDocSupport.sql: Schema browser and online documentation support functions
spSQLSupport.sql: Web SQL query handling functions
spUrlFitsSupport.sql: Functions to compute URLs of Flexible Image Transport System file versions of the data
spNeighbor.sql: Neighbor and Match table computation functions
spSetValues.sql: Functions to initialize various computed columns, such as HTM IDs
spValidate.sql: Functions for data validation (the validate step in loading)
spPublish.sql: Functions to publish data to the pub database
spFinish.sql: Functions for finish-stage processing
spCopyDBSubset.sql: Functions to make various-sized subsets of the Best database

Data Validator

Data validation is a critical step in the data-loading process. The validation procedure has the following objectives:

• Correctness ensures that the data-loading process and SDSS data-processing pipeline don't have errors. An important function is to uncover data corruption that can occur as a result of disk failures or any problems with networking hardware.
• Completeness checks that all the data is loaded and no data is lost.
• Self-consistency checks referential integrity.

Data emerges from the data-processing pipeline as ASCII CSV files and image files in JPEG and GIF formats. Each batch of files is imported into a staging database where it's validated and merged. Once validated, the data is moved to the production or archive databases. Figure 4 shows the validation steps.

Figure 4. Data validation checks performed by the data validator: testing the uniqueness of primary keys, testing foreign keys, testing cardinalities, testing the Hierarchical Triangular Mesh (HTM) IDs used for spatial indexing, and testing parent-child consistency. The checks range from basic data integrity tests to domain-specific self-consistency checks.
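The first two kinds of checks in Figure 4 reduce to simple aggregate queries. The sketch below shows their general form, using the PhotoObjAll and PhotoProfile tables as examples; the real tests live in spValidate.sql and are more elaborate.

    -- Duplicate primary-key check: any objID appearing more than once is an error.
    SELECT objID, COUNT(*) AS cnt
    FROM   dbo.PhotoObjAll
    GROUP  BY objID
    HAVING COUNT(*) > 1;

    -- Foreign-key (referential integrity) check: profile rows whose objID has
    -- no matching photometric object are orphans.
    SELECT COUNT(*) AS orphans
    FROM   dbo.PhotoProfile AS p
    LEFT   JOIN dbo.PhotoObjAll AS o ON o.objID = p.objID
    WHERE  o.objID IS NULL;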
Photometric and spectroscopic data have different routines, but the Best, Target, and Runs data sets have approximately the same validation logic. The validator is invoked as a set of stored procedures on a particular task database; its job is to validate that database, which it does by searching the task table using the host and database names as keys. The returned record tells the validator

• the type of validation (photo or spectro);
• the type of imaging data (Best, Target, or Runs) if it's a photo job;
• the destination database; and
• a job ID that's used as a key to all future log events.

spValidate then branches to spValidatePlates or spValidatePhoto (there's also an spValidateTiles for the tiling data). When these routines complete, spValidate writes a completion message in the task table and exits. At each step, the validation routines record the test's result. The Load Monitor can watch the validation progress and assess the validation's success or failure by looking at this journal (in the load database) or by looking at the job summary record.

The photo (imaging) data validator performs the following tasks:

1. It checks the uniqueness of the primary key fields.
2. It creates temporary indices to make subsequent tests run faster.
3. It then tests a series of foreign keys.
4. It checks to see if the advertised populations match the real populations.
5. It looks at the parent objects to see which child objects were deblended from which parents. It then tests to see that the number of child objects matches the count for the respective parents for the first 1,000 non-null parents.
6. It also tests the first 1,000 image objects to see that the external Hierarchical Triangular Mesh (HTM) calculation is similar to the internal one. We allow a few errors due to rounding, but 99 percent of the results should agree exactly.
7. It computes the neighbors of each object; for a Best database, it computes a 30-arc-second neighborhood, but the Target and Runs databases use a 3-arc-second radius. The neighbor computation is the longest step of the validation process. The final neighbors computation in the publish database is done in the finish stage.
8. Finally, the validator drops the indices it created for the validation work.

spValidatePhoto then returns to spValidate. If the database holds spectroscopic data, it will be validated next. Testing spectroscopic data is simpler. This data is always destined for the spectro part of the database schema, and there are fewer tests:

1. spValidateSpectro first tests the uniqueness of the primary keys.
2. It then creates two indices to make the subsequent tests run faster.
3. Next, it tests several foreign keys.
4. It also tests the first 1,000 HTM IDs in the SpecObj table (SpecObj.htmID) to check that the external HTM calculation is similar to the internal one.
5. Finally, it drops the working indices and returns.
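Step 5 of the photo validation, the parent-child consistency test, can be sketched as a single query. The nChild and parentID columns here are meant to mirror the SDSS deblending bookkeeping, but the query is illustrative rather than the actual spValidatePhoto code.

    -- Illustrative parent-child test: for the first 1,000 non-null parents,
    -- compare the advertised number of deblended children (nChild) with the
    -- actual count of rows pointing back at the parent.
    SELECT TOP 1000
           p.objID,
           p.nChild,
           (SELECT COUNT(*) FROM dbo.PhotoObjAll AS c
            WHERE  c.parentID = p.objID) AS actualChildren
    FROM   dbo.PhotoObjAll AS p
    WHERE  p.nChild > 0
    ORDER  BY p.objID;
    -- Rows where nChild <> actualChildren indicate a deblending bookkeeping error.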
The last stage of the loading, in which everything is brought together in the publish database, the database indices are created, and all the derived and precomputed tables are created, is complex and time-consuming enough to warrant its own section.

Finish Stage

Some of the steps in the finish process are SDSS-specific, but others are relevant for any scientific archive. The finish procedures are all enshrined in the spFinish.sql file in the schema directory, although many of these procedures call other functions and procedures elsewhere in the directory. Figure 5 shows the procedures and their dependencies.

Figure 5. Dependency chart for finish functions and stored procedures. This shows the complexity and scope of the merge/finish stage in the loading; it's also the most application-dependent part of the pipeline.

The following steps are executed in the finish stage in order. Currently, all these steps must be executed sequentially on a single server:

1. Drop indices. The first step is to drop any indices as necessary before updating the data in the affected tables. For incremental loading, we usually leave the primary key (PK) indices in and drop the foreign key (FK) and other nonclustered indices.
2. Synchronize schema. This is usually necessary only for incremental loading, when there are schema changes since the last incremental load.
3. Load PhotoTag. The PhotoTag table is a vertical partition of the PhotoObjAll table and hence is created from the PhotoObjAll table during the finish processing.
4. Build primary keys. If any of the PKs (and associated indices) were dropped in the first step, they're recreated here. For the larger tables, such as PhotoObjAll (which is more than a terabyte for Data Release 6) and PhotoProfile (which has more than 10 billion rows), this step can take several days to finish.
5. Build indices. Other indices besides the PK indices are built in this step as necessary.
6. Build foreign keys. FKs and associated indices are built after all other indices have been built. (FK indices require the PK indices to be in place first.) The FKs on the larger tables, especially PhotoProfile, usually take several days to be created.
7. Build Match tables. Many objects in the SDSS data set are observed more than once, usually because they're in the overlapping boundaries between fields. The Match and MatchHead tables keep track of multiply observed objects, which are useful for time-domain studies, for example, to study variable objects over time. (The procedure for building these tables is in the SkyServer Algorithms help page at http://cas.sdss.org/dr6/en/help/docs/algorithm.asp?key=match.)
8. Compute neighbors. The Neighbors table is a precomputed table that finds the spatial neighbors within 30 arc-seconds of each object in the PhotoObjAll table for fast proximity-type searches in the database.
9. Compute regions and sectors. The Region and Sector tables are part of the Tiling group of tables that contain survey geometry constructs necessary for large-scale structure studies, for example, to gauge the completeness of spectroscopic targeting. Computing the regions and especially the sectors is a complex task, and we've written a set of stored procedures and functions to compute the tiling geometry. (We describe the algorithm in the SkyServer Algorithms help page at http://cas.sdss.org/dr6/en/help/docs/algorithm.asp?key=sector.)
10. Load Science tables. These (usually external) science tables are necessary for the SDSS data and include tables of standard stars or other well-known catalogs. Sometimes they also include derived science tables, such as those computed from SDSS data.
11. Synchronize spectra with photometry. The spectroscopic observations are matched up spatially with the photometry in this step.
12. Build finish indices. Some indices must be built at the end, after most of the finish processing completes, for example, indices on the Spectro tables and the Match and Neighbors tables.
13. Match Target and Best photometry. As we mentioned earlier, we use at least two calibrations of the data: the Target calibration from which the spectroscopic targets were chosen and the Best calibration, which is the latest, greatest calibration. This step correlates the objects from the Target skyVersion with those from the Best skyVersion.
14. Load patches. The last step is to apply any patches necessary before releasing the data to the public. Normally, patch files are listed in a predetermined file in the schema directory and automatically loaded and executed in this step.

In addition to these steps, we must tie up a couple of loose ends manually outside the finishing process, mainly because the data isn't available during the finish. Right now, the finish stage takes more than two weeks to complete. We're looking for ways to speed this up, including parallelizing some of the steps and optimizing them for incremental data ingest.
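Steps 4 through 6 are ordinary DDL, just applied to very large tables. A minimal sketch of rebuilding a primary key and a dependent foreign key follows; the constraint names are made up for this example.

    -- Illustrative only: recreate the clustered primary key on PhotoObjAll,
    -- then a foreign key from PhotoProfile back to it.
    ALTER TABLE dbo.PhotoObjAll
          ADD CONSTRAINT pk_PhotoObjAll_objID PRIMARY KEY CLUSTERED (objID);

    ALTER TABLE dbo.PhotoProfile
          ADD CONSTRAINT fk_PhotoProfile_objID
          FOREIGN KEY (objID) REFERENCES dbo.PhotoObjAll (objID);

On tables of this size, each such statement is a multi-day operation, which is why the finish stage dominates the loading timeline.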
Load Monitor or Admin Web Pages

The admin subdirectory contains the Web pages and associated files and scripts for the loader administration Web interface (the Load Monitor). This subdirectory should be copied or moved to the Web tree where the Load Monitor is to be accessed from. The Web server connects to the loadadmin server's loadsupport database. This directory contains four kinds of files:

• Active Server Pages (.asp) files that correspond to actual Web pages;
• Cascading Style Sheet (.css) files used by the Web pages to set up the overall look and feel;
• JavaScript (.js) files containing functions in JavaScript to perform loader admin procedures; and
• include (.inc) files, which are part of the other three file types.

ASP files correspond to each of the commands listed in the menu on the main sqlLoader page. Figure 6 gives an example of a Load Monitor ASP page for the All Tasks command. Each of these ASP scripts formulates and submits an SQL query to the loadsupport database on the loadadmin server, which sees the query as being submitted by the webagent user. The query can execute a stored procedure in the loadsupport database or request rows from a loadsupport table. In addition to the scripts for displaying the various pages showing task logs and information, there is an ASP script that displays a summary statistics page for loader tasks showing the average total time taken for each step in the task and for each table processed (see Figure 7).

Figure 6. Load Monitor showing the All Tasks display page.

Figure 7. Load Monitor statistics page. Average and cumulative stats for all jobs are displayed.

Task Management

The administrator launches the loading process from the Load Monitor Web interface. The basic unit of loader processing at the top level is a task. A new task must be created for each unit of the loading using the New Task or Upload pages. The loader framework divides tasks into steps and then subdivides them into phases. Steps have a well-defined start and end, but a phase does not. A step also has an SQL stored procedure associated with it. The Tasks Display pages (active, all, finished, and killed tasks) show task tables containing the task, step, and phase IDs; hence, the granularity of the task display is a single phase.
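The queries those pages submit are simple selects against the loadsupport views. The following is a hypothetical example of what an All Tasks query might look like; the column names follow the task/step/phase scheme described above, not the exact sqlLoader schema.

    -- Hypothetical "all tasks" query: one row per task with its latest step
    -- and phase, newest tasks first. Column names are illustrative.
    SELECT t.taskID, t.dataset, t.exportType, t.userName,
           s.stepID, s.stepName, p.phaseID, t.status
    FROM   dbo.Task AS t
    OUTER  APPLY (SELECT TOP 1 stepID, stepName
                  FROM dbo.Step
                  WHERE taskID = t.taskID
                  ORDER BY stepID DESC) AS s
    OUTER  APPLY (SELECT TOP 1 phaseID
                  FROM dbo.Phase
                  WHERE taskID = t.taskID
                  ORDER BY phaseID DESC) AS p
    ORDER  BY t.taskID DESC;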
The export type is the data set to which we’re exporting. The choices are Best, Runs, or Target for an imaging load; plates for spectro; tiles for tiling; and a special export type called inish for the last step that merges all the data streams (photo, spectro, and tiling) and computes the indices. The xroot is the root of the exported CSV directory tree on the Linux side (Sambamounted). in Windows notation, this is \\hostname\directory\subdir. xid is the identiier of the export or load unit— that is, the chunk, plates, or tiles that must be loaded. This is the name of the subdirectory in the CSV directory tree that contains the runs, plates, or tiles to be loaded. The user is the name of the person running this load task. Optional comments describe this load’s purpose or content. new tasks can be added singly or in bulk. The Add new Task page is for adding one task at a time. The administrator can also upload a ile containing multiple task deinitions using the Upload page. Submitting multiple load tasks (file upload). Users can submit multiple load tasks at once by building an upload ile containing the task parameters values in CSV format. here’s an example of a load ile’s contents: DR2,BEST,\\sdssdata\dr2data\csv\phCSV\ best\1-82-1176576\,JohnDoe,best34 DR2,BEST,\\sdssdata\dr2data\csv\phCSV\ best\1-86-1184468\,JohnDoe,best35 DR2,BEST,\\sdssdata\dr2data\csv\phCSV\ best\1-86-1402340\,JohnDoe,best38 DR2,BEST,\\sdssdata\dr2data\csv\phCSV\ best\1-86-1422868\,JohnDoe,best39 DR2,TARGET,\\sdssdata\dr2data\csv\ phCSV\target\0-37-718022\,JohnDoe, targ39 DR2,TARGET,\\sdssdata\dr2data\csv\ phCSV\target\0-82-1113106\,JohnDoe, targ40 Killing a task. A user can kill a task by clicking on the last column of the task display in the tasks table. After prompting the user for conirmation, the loader cleans up the information associated with that task. however, it won’t delete JANUARY/FEBRUARY 2008 Figure 8. Load Monitor Add New Task page. New tasks can be added individually or uploaded in a text file containing multiple task specifications. This is the page for adding a single new task. some iles, especially the temporary task databases, until the administrator submits the same task (with the same parameters but a different task iD) again. This is intentional because recreating a database is a time-consuming process, and we assume that, in most cases, a killed task will be rerun at a later date, so the data must be loaded into the database eventually. Of course, the administrator can always manually delete the task database in SQL Server after the task is killed. Monitoring the load. Selecting the active tasks or all tasks links in the Load monitor shows users the tasks currently running (see Figure 6). The color coding for the task status is shown below the task table. For each task, the task iD, the step iD, and the phase number are shown, along with the name of the task and step currently executing. The task display updates once every minute. For each task, users can select the steps, iles, or log links to look at the steps, iles, and phases logged (completed) for that task. The preload step of a loading task usually takes the longest time because CSV iles are loaded into the load database in this step. The largest CSV iles for each run—the PhotoObjAll*.csv iles—will take 10 to 15 minutes each to load, and the preload step for one imaging chunk can take more than an hour to complete. Users can monitor the step’s progress by selecting the display for that task. 
The Load Monitor also lets the administrator start and stop the SQL Server on each of the loadservers from the Servers page.

As we continue to use the sqlLoader to load SDSS-II (the successor to SDSS) data and look toward future archives, we're concentrating on several areas of future development. First, several steps in the finish stage recreate tables from scratch even when data is incrementally added to the existing archive. Most of these would run in a fraction of their current time if we could adapt them to work incrementally. We're also looking to fully automate every aspect of the pipeline. Parts of the loading process still require human intervention, especially when things go wrong. Precious time is lost when this happens, so the pipeline needs to be fully automated.

Another crucial goal is to make the process scalable to petabyte-scale archives. This entails data partitioning across a cluster of data nodes so we can scale the loader out to archives that are an order of magnitude or more larger than SDSS, such as the Panoramic Survey Telescope and Rapid Response System (Pan-STARRS, http://pan-starrs.ifa.hawaii.edu/) and the Large Synoptic Survey Telescope (LSST, www.lsst.org/).

Lastly, we're researching ways to generalize our current framework. The sqlLoader is tightly coupled to the SDSS schema, even though we've tried to keep all the schema files in a single directory. More work is needed to decouple the schema from the framework. In doing so, we also want to extend the pipeline to other DBMS environments beyond SQL Server. This is the most challenging task, but it's essential if the sqlLoader pipeline is to become a general-purpose data-loading pipeline for other archives in astronomy and beyond.

References
1. A. Szalay, "The National Virtual Observatory," Proc. Conf. Astronomical Data Analysis Software and Systems (ADASS) X, vol. 238, F.R. Harnden Jr., F.A. Primini, and H.E. Payne, eds., Astronomical Soc. of the Pacific, 2001, p. 3.
2. A. Thakar, A. Szalay, and J. Gray, "From FITS to SQL: Loading and Publishing the SDSS Data," Proc. Conf. Astronomical Data Analysis Software and Systems (ADASS) XIII, vol. 314, F. Ochsenbein, M.G. Allen, and D. Egret, eds., Astronomical Soc. of the Pacific, 2004, p. 38.

Alex Szalay is Alumni Centennial Professor of Physics and Astronomy at the Johns Hopkins University. His research interests include cosmology, large-scale structure of the universe, data mining, and science with large databases. Szalay has a PhD from Eotvos University, Hungary. He is a member of the American Astronomical Society and the Association for Computing Machinery (ACM). Contact him at szalay@jhu.edu.

Ani R. Thakar is a research scientist in the Center for Astrophysical Sciences at the Johns Hopkins University.
His research interests include science with large databases, data mining, and simulations of interacting galaxies. Thakar has a PhD in astronomy from the Ohio State University. He is a member of the American Astronomical Society, the Astronomical Society of the Pacific, and the IEEE Computer Society. Contact him at [email protected].

Jim Gray, prior to his disappearance in February 2007, was the Turing Award-winning distinguished engineer, researcher, and manager of Microsoft Research's eScience Group in San Francisco. His primary research interests were in databases and transaction processing systems, with a particular focus on using computers to make scientists more productive.