Entries in APQ (2)

Monday, Nov 28, 2016

How to Make Your Database 200x Faster Without Having to Pay More?

This is a guest repost by Barzan Mozafari, an assistant professor at University of Michigan and an advisor to a new startup, snappydata.io, that recently launched an open source OLTP + OLAP Database built on Spark.

Almost everyone these days is complaining about performance in one way or another. Database administrators and programmers constantly find themselves in situations where their servers are maxed out or their queries are taking forever. This frustration is all too common. The solutions are varied. The most typical one is squinting at the query and blaming the programmer for not writing it more cleverly. Maybe they could have used the right index or materialized view, or simply rewritten the query in a better way. Other times, you might have to spin up a few more nodes if your company is using a cloud service. In other cases, when your servers are overloaded with too many slow queries, you might set different priorities for different queries so that at least the more urgent ones (e.g., the CEO’s queries) finish faster. When the DB does not support priority queues, your admin might even cancel your queries to free up some resources for the more urgent ones.

No matter which of these experiences you’ve had, you’re probably familiar with the pain of waiting for slow queries, paying for more cloud instances, or buying faster and bigger servers. Most people are familiar with traditional database tuning and query optimization techniques, which come with their own pros and cons, so we’re not going to talk about those here. Instead, in this post, we’re going to talk about more recent techniques that are far less well known and that in many cases lead to much better performance and greater cost savings.

To start, consider these scenarios:

Click to read more ...

Thursday, Feb 25, 2016

When Should Approximate Query Processing Be Used?

This is a guest repost by Barzan Mozafari, an assistant professor at University of Michigan and an advisor to a new startup, snappydata.io, that recently launched an open source OLTP + OLAP Database built on Spark.

The growing market for Big Data has created a lot of interest around approximate query processing (AQP) as a means of achieving interactive response times (e.g., sub-second latencies) when faced with terabytes and petabytes of data. At the same time, there is a lot of misinformation about this technology and what it can or cannot do.

Having been involved in building a few academic prototypes and industrial engines for approximate query processing, I have heard many interesting statements about AQP and/or sampling techniques (from both DB vendors and end-users):

Myth #1. Sampling is only useful when you know your queries in advance
Myth #2. Sampling misses out on rare events or outliers in the data
Myth #3. AQP systems cannot handle join queries
Myth #4. It is hard for end-users to use approximate answers
Myth #5. Sampling is just like indexing
Myth #6. Sampling will break the BI tools
Myth #7. There is no point approximating if your data fits in memory

Although there is a grain of truth behind some of these myths, none of them are actually accurate. There are many different forms of sampling, approximation, and error quantification, and their nuances are missed by these blanket statements. In other words, many of these impressions are simply based on wrong assumptions and/or misunderstanding of basic AQP terminology.
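To make "sampling" and "error quantification" concrete, here is a minimal Python sketch of one of the simplest forms: estimating an average from a uniform random sample and attaching an approximate 95% confidence interval. It is only an illustration, not how any particular AQP engine works. The function name `approximate_mean`, the synthetic `table`, and the sample size are all assumptions made up for this example, and the interval relies on the normal approximation (it ignores the finite-population correction).

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, seed=0):
    """Estimate the mean of `values` from a uniform random sample.

    Returns (estimate, margin), where estimate +/- margin is an
    approximate 95% confidence interval (normal approximation).
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)  # uniform, without replacement
    estimate = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, 1.96 * std_err


# Hypothetical "table": 200,000 synthetic values standing in for a column.
data_rng = random.Random(42)
table = [data_rng.gauss(100.0, 15.0) for _ in range(200_000)]

# Approximate answer from a 5% sample vs. the exact answer from a full scan.
estimate, margin = approximate_mean(table, sample_size=10_000)
exact = statistics.fmean(table)

print(f"approximate AVG: {estimate:.2f} +/- {margin:.2f}")
print(f"exact AVG:       {exact:.2f}")
```

The point of the sketch is that the answer comes with an error bar: the sample touches only a fraction of the rows, and the margin tells the user how much to trust the result.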

Instead of going over each of these statements and explaining why they are categorically wrong, in this post I’d like to answer the positive question: When can (and should) one use approximate answers? Note that by asking this question, I am implicitly giving away that I don’t think approximate answers are always useful. A perfect example of where you don’t want to use approximation is a billing department. (Although every time I look at my own Internet bill, I start to think that even this example has its own exceptions. I’m too afraid to mention my Internet provider’s name here, but I am sure you can guess.)

Anyhow, let’s discuss the key reasons and use-cases for approximate answers.

1. Use AQP when you care about interactive response times

Click to read more ...