We will introduce the workshop and open the discussion by reviewing three possible approaches to applying database techniques and technologies to HPC platforms and applications.
Stage 1: Database software for post-hoc analysis
The most obvious application of databases in the context of HPC is to load the results of simulations into a DBMS to support ad hoc queries. This approach has been applied in materials science (Herber and Gray 2005), astrophysics (Loebman 2009), and oceanography (Howe 2008), but rarely at “full scale” or in an operational, production context. The challenge is cost and risk. Although DBMSs are routinely used at petascale in industry (XLDB 2009, 2010), such deployments require significant specialized expertise to build and operate (experienced database administrators can command a salary of $200k or more), and license fees for the database products can be on the order of $100k/TB (Monash 2009). We estimate that few HPC projects allocate this level of budget to data management. Moreover, successful deployment of this technology for HPC applications is by no means assured, given their unique data types, computing environments, and application requirements.
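As a minimal sketch of what post-hoc analysis looks like in practice, the following loads a handful of hypothetical simulation outputs into an embedded SQLite database and runs an ad hoc aggregate query. The table name, columns, and values are illustrative assumptions and are not drawn from any of the cited projects; SQLite stands in here for the production-scale DBMS discussed above.

```python
import sqlite3

# Hypothetical simulation output: (timestep, cell_id, temperature).
rows = [
    (0, 1, 10.0),
    (0, 2, 12.0),
    (1, 1, 11.0),
    (1, 2, 15.0),
]


def load_results(rows):
    """Load simulation rows into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE results (timestep INTEGER, cell_id INTEGER, temperature REAL)"
    )
    conn.executemany("INSERT INTO results VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def mean_temperature_by_timestep(conn):
    """Ad hoc query: average temperature per timestep."""
    cur = conn.execute(
        "SELECT timestep, AVG(temperature) FROM results "
        "GROUP BY timestep ORDER BY timestep"
    )
    return cur.fetchall()


conn = load_results(rows)
summary = mean_temperature_by_timestep(conn)
print(summary)
```

The point of the sketch is that the query is written after the simulation has finished and needs no change to the simulation code; the cost and expertise concerns above arise when the same pattern is scaled to production data volumes.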
Stage 2: Database techniques informing new data analysis software
A second approach is to apply the concepts and techniques pioneered in the database community to develop new software specialized for HPC environments. The promise of this approach is to raise the level of abstraction for manipulating HPC data types without forcing the use of an inappropriate or over-simplified data model such as tables. Examples include the development of new data models, query algebras, and database systems for scientific data (SciDB, Howe 2005, Kumar 2010), the adaptation of IO-oriented indexing methods to HPC (Wu 2006, Kim 2010), and even the extension of file-based data management solutions with database-style query and indexing features (Grossnik 2008).
Stage 3: Novel integrated platforms for HPC and large-scale ad hoc query
Systems of this kind erase the separation between HPC and database applications, using simulation and stored data interchangeably to answer questions. This class of system is of high interest for this workshop and raises many open questions. What is the appropriate programming interface to such a system? Can queries initiate simulations as part of their query processing strategy? What role does the new breed of languages and tools for data-intensive computing (Hadoop and friends) play? What requirements does such a system impose on the hardware platform? Will these platforms require extremely high Amdahl numbers to be effective? Can proposed exascale systems serve both sets of requirements through adaptation of main-memory database techniques?
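For readers unfamiliar with the term, the Amdahl number mentioned above is commonly defined as the number of bits of sequential I/O per second that a system can deliver, divided by the number of instructions per second it can execute; a value near 1 is usually taken to indicate a balanced system. A minimal sketch of that arithmetic follows. The function name and the hardware figures are hypothetical and are not taken from any system discussed here.

```python
def amdahl_number(io_bytes_per_sec, instructions_per_sec):
    """Bits of sequential I/O per second, per instruction per second."""
    return (io_bytes_per_sec * 8) / instructions_per_sec


# Hypothetical node: 1 GB/s of sequential I/O and 10 billion instructions/s.
balance = amdahl_number(1e9, 1e10)
print(balance)
```

Under this definition, raising the Amdahl number means adding I/O bandwidth faster than compute capacity, which is the hardware trade-off the question above asks about.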
