« Stuff The Internet Says On Scalability For May 22nd, 2015 | Main | Paper: FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs »
Wednesday
May202015

Database Scaling Redefined: Scaling Demanding Queries, High Velocity Data Modifications and Fast Indexing All At Once for Big Data

This is a guest post by Cihan Biyikoglu, Director of Product Management at Couchbase.

Question: A few million people are out looking for a setup to efficiently live and interact. What is the most optimized architecture they can use?

  1. Build one giant high-rise for everyone,
  2. Build many single-family homes OR
  3. Build something in between?

Schools, libraries, retail stores, corporate HQs, homes are all there to optimize variety of interactions. Sizes of groups and type of exchange vary drastically… Turns out, what we have chosen to do is, to build all of the above. To optimize different interactions, different architectures make sense.

While high rises can be effective for interactions with high density of people in a small amount of land, it is impractical to build 500 story buildings. It is also hard to add/remove floors as you need them. So high-rises feel awfully like scaling-up – cluster of processors communicating over fast memory to compute fast but limited scale ceiling and elasticity.

As your home, single-family architecture work great. Nice backyard to play and private space for family dinners... You may need to get in your car to interact with other families, BUT it is easy to build more single family houses: so easy elasticity and scale. Single-family structure feels awfully like scaling-out, doesn't it? Cluster of commodity machines that communicate over slower networks and come with great elasticity.

“How does this all relate to database scalability?” you ask… Databases optimize your interaction with the data and manages a structures to help keep that interaction efficient much like the housing examples above. Executing low latency, interactive queries vs indexing high velocity data vs large volume of data mutations (inserts, updates and deletes)... Each one of these have different demands on the computational resources. The HW you would choose for best mutation throughput and performance, will be different than used for Queries. For one, you need faster storage, while complex queries demand lots of processing power for example..

More importantly, each of these workloads operate best in a different scalability models. Core data mutations given a set of keys can scale smoothly with scale-out. However an interactive app running a low latency query with complex joins will suffer higher latency with a wider fan-out to more nodes because networks are slow and local processing in memory is fast!

With all this in hand however, to date, nosql and big data systems have scaled out to an homogenous set of nodes. Slicing equal portion of processing to each node… This fine and simple. Couchbase Server and others products in big data and nosql space can all do this. These modern computing architectures have built cities with only a single type of building. This means all the workloads from indexing to query to core data operations have to compete on every node and when it is time to add node #9 below, they all have to follow along and stretch.

9C356DDF-6892-49A9-935F-F90C674DC48A.jpg.jpeg

If you are looking for better latencies and throughput, there is a better way to build scalable database architectures. We need a way to optimize each one of these workloads independently. Queries should get the best HW that fit their needs... Core data operations should not slow down with each index you create… Indexes ought to be independently partitioned and placed to provide the best indexing vs lookup latency and throughput… You need to be able to mix types of buildings to optimize various interactions.

The picture below depicts a better layout of these distinct workloads for best latency and throughput. With this model, you create "zones" within the cluster for query processing, index processing and data modifications. Each zone can get its own HW. You can independently decide the scale-out vs scale-up factor for each zone.

52331023-DBC6-4BC0-BAA4-CC3E7EFD7818.jpg.jpeg

This capability is called "Multi-dimensional Scalability" and we are providing this facility with Couchbase Server 4.0 (now available as a developer preview). Multi-dimensional scalability will allow independent scaling of services to match the workload (data, index or query) with the best scalability model. Just like we have chosen to build different building structures for different interactions, core data operations, indexing and query processing can choose HW independently in your Couchbase Server cluster.

Let's dig deeper into services and Couchbase Server architecture.

The figure below shows the cluster architecture overview. Each node contains the cluster manager, managed cache and storage engines. Each node also contains the optional service: data, index and query that can be enabled or disabled depending on the topology you’d like to achieve: for homogeneous scaling all services can be enabled on all nodes. For independent scaling, one can enable one service per node to achieve isolated zones within the cluster for each server.

Figure below shows the zoomed in version: it details the single node architecture further.

Cluster Manager: In each node Couchbase Server carries the cluster manager to manage coordination across nodes.

Managed Cache and Storage Engine: As the name suggest these components manage the caching and storage of data in each node.

Data Service: Data Service houses the core data storage module and can be enabled or disabled per node. Its architecture is designed to optimize core data operations (CRUD). It is designed to perform the operations with top speed, with flexible durability guarantees and consistency vs availability dials. Data Service can house multiple buckets (a.k.a databases).

Index Service: Is the most interesting service of all 3. Just like the data service, index service is an optional service that can be enabled on any given subset of nodes to achieve homogeneous or independent scaling.

Index Service houses the indexes and manages two main workloads:

  • index maintenance which keep the index fresh,

  • query response which answers query engine asks with basic operators like filters, scans, lookups etc.

When core data operations and indexing happen on the same node, competing processes steal cycles from each other to process both incoming mutations to data and to maintain a fresh index for queries. With MDS, index service can be isolated to its own “zone” in the cluster.

Index Service responds to queries as well. For scaling the query responses, indexing service benefits greatly from local processing. Partitioning an index to many nodes may degrade index scan performance for certain queries (think range scan on a hash partitioned index). With the flexible deployment model in Couchbase Server MDS, instead of wider fanouts to many nodes for index scans, indexing service can place indexes more efficiently in fewer nodes with larger memory utilizing efficient local, in-memory processing.

Index Service also is efficient in maintaining fresh indexes. It activates the router and projector on data service nodes to optimize index maintenance. With Projector and Router, changes to data are picked up and filtered to ensure only the relevant updates are sent to the index service nodes. This ensures when Index Service is deployed in an independent zone within the cluster, the data over the network that is required to maintain the index is minimal.

Query Service: Query service houses the query execution engine. SQL parser, optimizer and executor reside here to compile a plan and execute your SQL query to completion with the desired consistency guarantees developer express through their queries. Query service also likes local parallelisation of queries and can benefit greatly from higher number of cores. With the independent scaling in MDS, one can separate out query processing to ensure query processing does not interfere with core data operations or index maintenance.

Why is Isolation and Independent Scaling is Important?

Building complex systems such as customer-360 apps, fraud detection systems or low latency ad-serving applications require both demanding queries to be executed with low latency, indexes combined with high velocity data modifications. With multi-dimensional scalability, you can build these systems without sacrificing performance at scale. You can pick commodity servers for data modifications and scale out to fan out the mutations. You can combine that with nodes with higher core counts and larger memory and distribute your indexes independent from data nodes. This would give your indexes an edge to answer queries without wide fan outs to many nodes but parallelize more locally.

One last word before I close: You can try MDS and Couchbase Server 4.0 today by downloading the developer preview. MDS provides the utmost flexibility to achieve either the simple mode with homogeneous scaling or more fine tuned independent scaling. However I am confident that as the big data volume and velocity increase, independent scaling will enable developers to achieve new heights in performance and scalability to rise to the challenge.

Related Articles

Reader Comments

There are no comments for this journal entry. To create a new comment, use the form below.

PostPost a New Comment

Enter your information below to add a new comment.
Author Email (optional):
Author URL (optional):
Post:
 
Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>