SciDB – a database for scientific analysis – “For the toughest problems on the planet”
The SciDB Inc. website (http://www.scidb.org/) opens with a powerful statement: “For the toughest problems on the planet”. Typically, scientists are forced to retrofit business information technologies to suit their needs, or to build their own technologies with very limited resources. To work more productively, scientists need information solutions built for science.
In March 2008 at Asilomar, a representative group of science and database experts came together to determine whether the requirements of the different scientific domains (and some large-scale commercial applications) were similar enough to justify building a database system tailored to the needs of the scientific community. The answer was “YES”.
The result was the decision to build SciDB, an open source database technology product designed specifically to satisfy the demands of data-intensive scientific analytics. It was decided that:
- SciDB will be built around analytics, with storage that is write-once, read-many.
- Bulk loads, rather than single-row inserts, will be the primary input method.
- “Load-free” access to minimally-structured data will be provided.
- Functions and procedures will execute in parallel, as close to the data being operated on as possible.
- Interfaces to common scientific tools like R and eventually MATLAB and IDL, as well as programming languages like C++ and Python, will be provided.
- Many features important to science – including versioning, provenance tracking, and support for uncertain data with error bars – will be standard with SciDB.
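To make "uncertain data with error bars" concrete, here is a minimal Python sketch of a value that carries its own standard deviation and propagates it through addition and scaling. It is not SciDB's implementation: the `Uncertain` class and its methods are hypothetical names, and the propagation rule assumes independent errors combined in quadrature.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Uncertain:
    """Hypothetical value-with-error-bar type (not a SciDB type)."""
    value: float
    sigma: float  # one standard deviation

    def __add__(self, other: "Uncertain") -> "Uncertain":
        # Assumption: the two errors are independent, so they add in quadrature.
        return Uncertain(self.value + other.value,
                         math.hypot(self.sigma, other.sigma))

    def scale(self, k: float) -> "Uncertain":
        # Multiplying by an exact constant scales the error bar by |k|.
        return Uncertain(k * self.value, abs(k) * self.sigma)


a = Uncertain(10.0, 0.3)
b = Uncertain(5.0, 0.4)
total = a + b          # value 15.0, sigma close to 0.5
doubled = a.scale(-2)  # value -20.0, sigma 0.6
```

A database that treats such pairs as a standard type can apply the same propagation inside its operators, rather than leaving it to each application.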
Other requirements to be satisfied by SciDB are elaborated in detail in the workshop report (http://www.jstage.jst.go.jp/article/dsj/7/0/88/_pdf) and can be summarized as follows:
- A data model based on multidimensional arrays (that simplifies the representation of time series and spatial grids), not sets of tuples
- A storage model based on versions and not update in place
- Built-in support for provenance (lineage), workflows, and uncertainty
- Scalability to 100s of petabytes and 1,000s of nodes with high degrees of tolerance to failures
- Support for “external” data objects so that data sets can be queried and manipulated without ever having to be loaded into the database
- Open source in order to foster a community of contributors and to ensure that data is never “locked up”, a critical requirement for scientists.
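The first two requirements, an array data model and storage based on versions rather than update in place, can be illustrated with a small Python sketch. It is a toy model, not SciDB code: `VersionedArray` and its methods are hypothetical names, and copying the whole grid for every version is a simplification chosen for clarity.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]


class VersionedArray:
    """Toy 2-D array whose writes create new versions instead of overwriting."""

    def __init__(self, rows: int, cols: int, fill: float = 0.0) -> None:
        self.shape = (rows, cols)
        # Version 0 is the initial, fully filled grid.
        self._versions: List[List[List[float]]] = [
            [[fill] * cols for _ in range(rows)]
        ]

    def write(self, updates: Dict[Cell, float]) -> int:
        """Apply updates to a copy of the latest version; return the new version id."""
        new_grid = [row[:] for row in self._versions[-1]]
        for (i, j), value in updates.items():
            new_grid[i][j] = value
        self._versions.append(new_grid)
        return len(self._versions) - 1

    def read(self, i: int, j: int, version: int = -1) -> float:
        """Read a cell by its dimension coordinates, from any stored version."""
        return self._versions[version][i][j]


grid = VersionedArray(2, 3)
v1 = grid.write({(0, 1): 21.5})
v2 = grid.write({(0, 1): 22.0, (1, 2): 19.0})
```

Cells are addressed by position along each dimension rather than by a key column, and `grid.read(0, 1, version=v1)` still returns the older value after `v2` has been written.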
Leading scientists from a variety of disciplines, including astronomy, biology, physics, oceanography, atmospheric sciences, and climatology, are involved in the design of SciDB. The one-page poster available at http://www.scidb.org/Documents/SciDB-VLDB09-poster.pdf gives an excellent view of the vision of the product.
SciDB is not meant to be a traditional database; it is optimized for managing big data and for big analytics. It is referred to as a DMAS (Data Management and Analytics Software System), as opposed to a DBMS.
SciDB falls under the NoSQL category because it:
- Is not optimized for online transaction processing (OLTP) and provides only minimal support for transactions.
- Doesn’t need to provide strict atomicity, consistency, isolation, and durability (ACID) constraints.
- Doesn’t have a rigidly-defined, difficult-to-modify schema.
SciDB is designed to run on a shared-nothing cluster of commodity servers (or nodes), each with its own local storage, interconnected by an Ethernet network; this physical architecture is also described as a Just a Bunch of Disks (JBOD) configuration. The SciDB design aims to keep operating in the face of node failure, without even restarting a long-running operation in progress.
SciDB had its first public demo in August 2009. The current release, R11.06 (named using a <year.month> convention), is effectively R1.0. SciDB currently supports a single coordinator node, which runs on node-0. All SciDB worker instances perform their own local data storage and query processing. It is recommended that the SciDB executables and configuration files be stored on a single file system shared by all the physical servers through a network file system facility (like NFS).
SciDB uses an ODBC/JDBC-like interface to connect to the SciDB server and execute commands. This interface is available from multiple programming languages. The coordinator instance is where external client applications connect, and it is responsible for query parsing, planning, and coordinating query execution over the collection of SciDB worker nodes.
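The coordinator/worker split can be sketched as a scatter/gather pattern. The Python below is an illustration only, not SciDB's protocol: threads stand in for worker instances, and `worker_partial` and `coordinator_mean` are hypothetical names. Each worker aggregates only its own local partition, and the coordinator combines the partial results.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple


def worker_partial(local_values: Sequence[float]) -> Tuple[float, int]:
    """Runs 'next to the data': returns (sum, count) for one worker's partition."""
    return (sum(local_values), len(local_values))


def coordinator_mean(partitions: List[Sequence[float]]) -> float:
    """Scatter the aggregate to all workers, then gather and combine the partials."""
    if not partitions:
        raise ValueError("no partitions to query")
    # Threads are a stand-in for separate worker instances on other nodes.
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(worker_partial, partitions))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    if count == 0:
        raise ValueError("cannot take the mean of empty data")
    return total / count


# Three workers, each holding a different slice of the data.
partitions = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
mean = coordinator_mean(partitions)
```

Only the small `(sum, count)` pairs travel back to the coordinator, which is the point of executing functions as close to the data as possible.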
The ‘files’ SciDB presents to users are logical ‘arrays’, and the pieces of each array are distributed evenly over the physical nodes. A centralized DBMS (PostgreSQL) is used to keep track of information such as which arrays exist, which nodes are where, which instances are running on which nodes, how array data is partitioned over SciDB instances, and which physical operators each instance has available to it.
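One way to picture how the pieces of a logical array might be spread over instances is the Python sketch below. It is an assumption-laden illustration, not SciDB's actual placement policy: the fixed-size chunking, the round-robin assignment by row-major chunk index, and the names `chunk_of`, `instance_for_chunk`, and `build_chunk_map` are all hypothetical.

```python
from collections import Counter
from itertools import product
from typing import Dict, Tuple

Coords = Tuple[int, ...]


def chunk_of(coords: Coords, chunk_shape: Coords) -> Coords:
    """Map a cell's coordinates to the index of the chunk that contains it."""
    return tuple(c // s for c, s in zip(coords, chunk_shape))


def instance_for_chunk(chunk: Coords, chunks_per_dim: Coords, n_instances: int) -> int:
    """Assumed policy: row-major chunk number, assigned round-robin to instances."""
    linear = 0
    for idx, count in zip(chunk, chunks_per_dim):
        linear = linear * count + idx
    return linear % n_instances


def build_chunk_map(array_shape: Coords, chunk_shape: Coords,
                    n_instances: int) -> Dict[Coords, int]:
    """Build the chunk -> instance table that a central catalog would record."""
    # Ceiling division: a partial chunk at the edge still counts as a chunk.
    chunks_per_dim = tuple(-(-n // s) for n, s in zip(array_shape, chunk_shape))
    return {
        chunk: instance_for_chunk(chunk, chunks_per_dim, n_instances)
        for chunk in product(*(range(c) for c in chunks_per_dim))
    }


# A 100x100 array cut into 25x25 chunks, spread over 4 instances.
chunk_map = build_chunk_map((100, 100), (25, 25), 4)
load = Counter(chunk_map.values())  # number of chunks held by each instance
```

In this example the 16 chunks land four per instance, and finding the instance for any cell needs only the cell's coordinates and the catalog's table.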
SciDB provides two programming interfaces:
- AQL, an array query language that is similar to SQL.
- AFL, a functional language that provides the same capabilities as AQL, as well as support for querying array metadata and additional array operators.
AFL, which is expected to be revised over time as the implementation evolves and matures, currently provides:
- Data Definition Language (DDL) that allows users to create, load, remove and modify arrays and their structure.
- Data Manipulation Language (DML) that allows users to query the contents of arrays.
- Metadata operators that provide a number of mechanisms for getting information about the database (e.g., the list of arrays, the dimensions of an array, and the attributes of an array).
SciDB currently supports user-defined functions (UDFs) referred to as plugins and user-defined types (UDTs). In the near future, SciDB is expected to include user-defined aggregates (UDAs) and user-defined operators (UDOs).
The SciDB use cases currently available at http://www.scidb.org/use/ are in the areas of optical astronomy, radio astronomy, Earth remote sensing, environmental observation and modeling, seismology, and Atmospheric Radiation Measurement (ARM) climate research.
SciDB is indeed a good initiative, and it is hoped that SciDB (and similar systems) will enable scientists to work productively toward significant progress on challenges like global warming, cancer, and communicable diseases, and hence make our lives that much better.
In addition to scientific applications such as astronomy, remote sensing, climate modeling, and bio-science information management, commercial applications such as risk management systems in the financial services sector and the analysis of web log data are also expected to benefit from using SciDB.
A good presentation on SciDB is available at http://diuf.unifr.ch/main/xi/sites/diuf.unifr.ch.main.xi/files/SciDB_CIS_2010.pdf.
“SciDB will provide a unique level of data access that will allow scientists to understand data in far deeper and more natural ways. Its array-based model inherently supports convenient access and analysis of raw data sources as well as preprocessed products in a unified fashion. We will see our data in a whole new light.” (Martin L. Perl, 1995 Physics Nobel Laureate)