Thursday, April 29, 2010

Product: SciDB - A Science-Oriented DBMS at 100 Petabytes

Scientists are doing it for themselves. Doing what? Databases. The idea is that most databases are designed to meet the needs of businesses, not science, so scientists are banding together at scidb.org to create their own Domain Specific Database, for science. The goal is to be able to handle datasets in the 100PB range and larger.

SciDB, Inc. is building an open source database technology product designed specifically to satisfy the demands of data-intensive scientific problems. With the advice of the world's leading scientists across a variety of disciplines including astronomy, biology, physics, oceanography, atmospheric sciences, and climatology, our computer scientists are currently designing and prototyping this technology.

The scientists participating in our open source project believe that the SciDB database, when completed, will dramatically improve their ability to conduct experiments faster and more efficiently, and will further improve the quality of life on our planet by enabling experiments that were previously impossible due to the limitations of existing database systems and infrastructure. Many of the world's leading computer scientists with expertise in database systems have contributed to the design and architecture of the system to meet the needs of the world's scientists.

SciDB looks like a cool project and follows what might be considered a trend: instead of beating a general tool into submission, build a specialized tool that does what you need it to do. More details about SciDB can be found in the paper A Demonstration of SciDB: A Science-Oriented DBMS. A nice succinct poster is available summarizing the product.

Some interesting bits from the paper:

  • The data model is a multi-dimensional, nested array model: array cells contain records, which in turn can contain components that are themselves multi-dimensional arrays. Arrays can have any number of named dimensions. Sparse, ragged, and unbounded arrays are supported (a rough sketch of this model follows the list).
  • Science is not a monolith. Different specialties would like different data representations. Users in chemistry, biology, and genomics would like to see a graph database. Users in solid modeling would like a mesh. But the array is a natural for astronomy, oceanography, fusion, remote sensing, climate modeling, and seismology.
  • Postgres-style user defined functions coded in C++ allow for custom operators. There are no built-in operators; all operators are UDFs.
  • No overwrite. It's not transactional. Scientists don't want to update values in place. If a cell is declared updateable, a history mechanism traces the changes made to each cell, and a delta compression mechanism keeps that history small (see the versioning sketch after the list).
  • Grid. A shared-nothing cluster scaling to 1000s of nodes. Data is split horizontally across the nodes. Partitioning is flexible: partitions are not fixed, they can change over time to adapt to changing data, and they can be based on dimensions or attributes (a partitioning sketch follows the list).
  • Queries. Queries refer to a single array and the query planner takes care of splitting the query across all the nodes. UDFs are run in parallel across the cluster.
  • "In Situ" Data. No load process required. Data analysis is often preceded by a data load phase which can take forever with large datasets. Adaptors will be created so data not in the SciDB format can be used without loading.
  • Integrated Cooking. Data often needs to be cooked/changed/transformed/normalized to be useful. The cooking operators will be integrated into SciDB.
  • Provenance. The steps for how an array was created will be remembered so scientists can show how results were derived.
  • Uncertainty. Attributes support both a value and an error component, so uncertainty can be represented natively in the database (see the error-propagation sketch after the list).
  • Open source. A non-profit has been set up to create a commercial-quality open source database for the scientific community.
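
To make the array model a little more concrete, here is a minimal Python sketch of the idea in the first bullet: a sparse, multi-dimensional array whose cells hold records, with one record field that is itself a small array. The names (Pixel, flux, spectrum) and the dictionary representation are purely illustrative; this is not SciDB's syntax or storage layout.

```python
# A sketch (not SciDB syntax) of a sparse 2-D array whose cells hold records,
# where one record field is itself a nested 1-D array.
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Pixel:
    flux: float            # scalar attribute
    error: float           # per-value error term (see the uncertainty bullet)
    spectrum: List[float]  # a nested array inside the cell

# Sparse: only occupied cells are stored, keyed by (x, y) coordinates.
Image = Dict[Tuple[int, int], Pixel]

image: Image = {
    (0, 0):     Pixel(flux=1.2, error=0.05, spectrum=[0.1, 0.4, 0.3]),
    (512, 768): Pixel(flux=3.7, error=0.10, spectrum=[0.2, 0.5]),
}
# Ragged and unbounded behaviour falls out naturally: coordinates are not
# confined to a fixed range, and nested arrays may differ in length.
```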
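
The no-overwrite bullet can be sketched the same way: updating a cell appends a new version instead of replacing the old value, so earlier states remain readable. The class and method names below are hypothetical, and real SciDB additionally delta-compresses this history.

```python
# Hypothetical versioned cell: old values are never overwritten.
from typing import List, Tuple

class VersionedCell:
    def __init__(self, initial: float):
        self.history: List[Tuple[int, float]] = [(0, initial)]  # (version, value)

    def update(self, version: int, value: float) -> None:
        self.history.append((version, value))   # append, never overwrite

    def as_of(self, version: int) -> float:
        # Latest value whose version is <= the requested one.
        return max((v, x) for v, x in self.history if v <= version)[1]

cell = VersionedCell(1.0)
cell.update(3, 2.5)
print(cell.as_of(2))  # 1.0 -- the pre-update value is still reachable
print(cell.as_of(5))  # 2.5
```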
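
For the grid bullet, a rough sketch of dimension-based partitioning: cell coordinates map to fixed-size chunks, and a mutable catalog maps chunks to nodes, so data can be rebalanced later without rewriting queries. The chunk size, placement rule, and names are assumptions for illustration; SciDB's actual scheme is more flexible and can also partition on attributes.

```python
# Illustrative chunk-to-node catalog for a 2-D array split across a cluster.
from typing import Dict, Tuple

CHUNK = 1000  # cells per chunk along each dimension (arbitrary for the sketch)

def chunk_of(x: int, y: int) -> Tuple[int, int]:
    return (x // CHUNK, y // CHUNK)

# Because placement is just a catalog entry, it can change over time
# to rebalance data without changing how queries are written.
chunk_map: Dict[Tuple[int, int], str] = {}

def node_for(x: int, y: int, n_nodes: int = 4) -> str:
    c = chunk_of(x, y)
    if c not in chunk_map:
        chunk_map[c] = f"node-{(c[0] * 31 + c[1]) % n_nodes}"
    return chunk_map[c]

print(node_for(512, 768))   # node-0: the node holding this cell's chunk
```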
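
And for the uncertainty bullet, a small sketch of an attribute that carries an error alongside its value, with an operator that propagates the error. The quadrature rule used for addition is a standard approximation for independent errors, not necessarily the propagation SciDB implements.

```python
# Value-with-error attribute; addition propagates errors in quadrature.
import math
from dataclasses import dataclass

@dataclass
class Uncertain:
    value: float
    error: float

    def __add__(self, other: "Uncertain") -> "Uncertain":
        return Uncertain(self.value + other.value,
                         math.sqrt(self.error ** 2 + other.error ** 2))

a = Uncertain(10.0, 0.3)
b = Uncertain(4.0, 0.4)
print(a + b)  # Uncertain(value=14.0, error=0.5)
```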

Not your typical looking database. The Domain Specific aspects create something unique and hopefully uniquely useful. I wonder if we'll see more Domain Specific Database efforts in the future?

Reader Comments (4)

"Not your typical looking database. The Domain Specific aspects create something unique and hopefully uniquely useful. I wonder if we'll see more Domain Specific Database efforts in the future?"

Quite possibly, but I'd like to avoid reinventing the wheel...

Ideally, the SciDB folks would be able to use an existing sharding/replication backend for normal data storage, then extend it with modules to handle parsing third-party data formats without an import, specialist numeric types that have error bars and provenance data, and so on.

Current databases are still too monolithic - there's a lot of code in common between even the most disparate types. Your massively sharded NoSQL document store and your traditional single-node SQL server still use B-Trees, and have all the usual code to handle connections and concurrency control...

April 30, 2010 | Unregistered CommenterAlaric Snell-Pym

Wow, very impressive. Standard databases do tend to have problems with massive data sets; thanks for the solution.

April 30, 2010 | Unregistered Commentersteve

For an Open Source project, they sure don't have much code publicly available.

May 3, 2010 | Unregistered CommenterMike

This is not really new. What SciDB announces has already been implemented in the rasdaman ("raster data manager") system, long before. See www.rasdaman.org for the free source code and www.earthlook.org for an interactive online demo.

Rasdaman comes with some goodies:

  • SQL-style query language for multi-dimensional arrays, with heavy server-side optimizations (heuristic rewriting, adaptive tiling, compression, just-in-time compilation, etc.)
  • complex run-time type definitions through a raster DDL
  • integrates array data with relational backend storage; this has been claimed by posters here already
  • in operational use under industrial conditions for 5+ years, on raster objects of dozens of terabytes
  • not read-only (as SciDB plans to be), but raster data can be modified arbitrarily (in other words: select, update, insert, and delete statements are implemented)
  • dedicated support for a series of geo-scientific raster data standards (OGC WMS, WCS, WCPS, WCS-T, WPS)
  • commercial support by a spin-off company

November 4, 2010 | Unregistered CommenterPeter Baumann
