Recommend When Should Approximate Query Processing Be Used? (Email)

This action will generate an email recommending this article to the recipient of your choice. Note that your email address and your recipient's email address are not logged by this system.

EmailEmail Article Link

The email sent will contain a link to this article, the article title, and an article excerpt (if available). For security reasons, your IP address will also be included in the sent email.

Article Excerpt:

This is a guest repost by Barzan Mozafari, an assistant professor at University of Michigan and an advisor to a new startup, snappydata.io, that recently launched an open source OLTP + OLAP Database built on Spark.

The growing market for Big Data has created a lot of interest around approximate query processing (AQP) as a means of achieving interactive response times (e.g., sub-second latencies) when faced with terabytes and petabytes of data. At the same time, there is a lot of misinformation about this technology and what it can or cannot do.

Having been involved in building a few academic prototypes and industrial engines for approximate query processing, I have heard many interesting statements about AQP and/or sampling techniques (from both DB vendors and end-users):

Myth #1. Sampling is only useful when you know your queries in advance
Myth #2. Sampling misses out on rare events or outliers in the data
Myth #3. AQP systems cannot handle join queries
Myth #4. It is hard for end-users to use approximate answers
Myth #5. Sampling is just like indexing
Myth #6. Sampling will break the BI tools
Myth #7. There is no point approximating if your data fits in memory

Although there is a grain of truth behind some of these myths, none of them are actually accurate. There are many different forms of sampling, approximation, and error quantification, and their nuances are missed by these blanket statements. In other words, many of these impressions are simply based on wrong assumptions and/or misunderstanding of basic AQP terminology.

Anyhow, instead of going over each of these statements and explaining why they are categorically wrong, in this post I’d like to answer the positive question: When can (and should) one use approximate answers? Note that by asking this question, I am implicitly giving away that I don’t think approximate answers are always useful. A perfect example where you don’t want to use approximation is in billing departments. (Although every time I look at my own Internet bill, I start to think that even this example has its own exceptions. I’m too afraid to mention my Internet provider’s name here but I am sure you can guess).

Anyhow, let’s discuss the key reasons and use-cases for approximate answers.

1. Use AQP when you care about interactive response times


Article Link:
Your Name:
Your Email:
Recipient Email:
Message: