Saturday, April 19, 2008

How to build a real-time analytics system?

Hello everybody!

I am a developer of a website with a lot of traffic. Right now we run the whole site on Perl + PostgreSQL + FastCGI + memcached + MogileFS + lighttpd + round-robin DNS distributed over 5 servers, and I must say it works like a charm: the load is stable, everything is very fast, and we are recording about 8 million pageviews per day.

The only problem is the Postgres database, since it is installed on just one server; if that server goes down, the whole "cluster" goes down. That's why we have master-to-slave replication, so we still have a backup database, but when the master goes down all inserts/updates are disabled and the whole website becomes read-only.

But this is not a real problem for us, since the configuration works and we have had no issues with it.

Right now we are planning to build our own analytics service, customized for our needs. We tried various software packages but were not satisfied with any of them.

We want to build something like Google Analytics: it should let us create reports in real time, with "drill-down" capability for interactive reports. We don't need real-time data in the reports themselves - we just need to be able to generate different reports very quickly. The data can be pre-processed.

For example, right now we log requests to plain-text log files in the following format:
date | hour | user_id | site_id | action_id | some_other_attributes..

There are about 8-9 million requests per day, and we want to build real-time reports such as:
- number of hits per day (the simplest)
- number of hits by unique users per day
- number of hits by unique users on specific site per day
- number of distinct actions by users on specific site during defined period (e.g. one month, period of X months...)
etc.

You should be able to build any type of report by combining different columns and counting either all occurrences or only distinct occurrences of certain attributes.

I know how to parse these log files and calculate any report I want, but it takes time. There are about 9 million rows in each daily log file, so a monthly report means parsing all the daily files for that month - almost 300 million lines - counting what I need and then displaying the summary. This can take hours, and sometimes it has to be done in more than one step (e.g. counting the users who visited site_id=1 but not site_id=2 means exporting the users of site 1, exporting the users of site 2, and then comparing the two sets and counting the difference).
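
For illustration only: if those same log lines were loaded into a single Postgres table (called requests here purely as an assumption, mirroring the log format above), that export-and-compare step could be written as one set-difference query, though over ~300 million rows it would still be slow:

-- Hypothetical table mirroring the log format: requests(date, hour, user_id, site_id, action_id, ...)
SELECT count(*) FROM (
    SELECT DISTINCT user_id FROM requests WHERE site_id = 1
    EXCEPT
    SELECT DISTINCT user_id FROM requests WHERE site_id = 2
) AS users_only_on_site_1;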

If you look at Google Analytics, it calculates any similar report in real time. How do they do it? How would you design a database that could do something like that? If I put 300 million rows (one month of requests) into a Postgres/MySQL table, selects are even slower than parsing the plain-text log files with Perl...

I know they have a huge number of servers, but they also handle an even bigger number of hits per day. We have the option of storing and processing this kind of analytics data on multiple servers at the same time, but I don't know enough about how to design software and a database that could do a job like this.

Does anybody have any suggestions? A simple example would be great! We already managed to build some sort of database for the site_id + action_id drill-down, but the problem is "unique users", which is THE figure we need all the time. To count unique users over a period you have to count all the distinct user_ids in that period, e.g. select count(distinct user_id) from ... where date>='2008-04-10' and date<='2008-04-18' - with 9 million rows per day this statement takes about two minutes to complete, and we are not satisfied with that.

Thank you for any hint!

Reader Comments (18)

Howdy, you might find some help in http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data. They had a similar problem and solved it with Hadoop.

December 31, 1999 | Unregistered CommenterTodd Hoff

I wrote something about it in this thread:
http://highscalability.com/how-update-video-views-count-effectively

That's only half of it. If you are content with that approach, let me know and I can explain the second half (the analytics).

best regards

December 31, 1999 | Unregistered Commenteratif.ghaffar

Hi!

atif.ghaffar: I read your thread and I don't think it has much in common with my problem. That thread is about a real-time video view counter. As I said, we have a very large website that already does all of that without any problems, using memcached (temporary storage) + MogileFS (permanent storage), so that's not the issue. But we collect a large amount of plain-text log files that need to be analyzed, both for us and for our users. We want to offer our users detailed statistics with "drill-down" capability - right now they only get the summaries we pre-process (daily, weekly and monthly).

Todd Hoff: Your solution seems more similar to our problem - I will take a closer look at it! Does anybody have any hints for Hadoop beginners? I have never used it and I am not familiar with Hadoop or the whole MapReduce concept at all. Where should I start? How fast is this approach? The MapReduce thread you linked says that "making a report is a matter of hours"... well, I am looking for a real-time solution like the one Urchin, which powers Google Analytics, provides.

Maybe I wasn't clear enough in my first post, so I will try to explain in detail what we want to do (and sorry if my English is bad - I hope you understand what I'm talking about)...

For example, if we want to display the number of views of some content by UNIQUE users in, let's say, March 2008, we need:
- date
- user's ID
- content ID
and we can select something like
select count (distinct user_id) from logtable where content_id='1234' and date>='2008-03-01' and date < '2008-04-01';

This type of SQL statement takes a couple of minutes to complete because the table is so large (like I said, about 300 million log rows per month)... That's why we don't use SQL for logging - we parse the plain-text files with Perl and store everything in a summary table, e.g.:

month | content | unique_user_count
2008-03 | 1234 | 987654

If a user is interested in a report for a different time period (say 5th to 10th March 2008), we can't give him a date picker and generate the report on the fly, since that would kill our server (and the user doesn't want to wait five minutes for the report either - he wants it immediately).
We also calculate the daily number of unique views, which we store in another database table:
date | content | unique_user_count
2008-03-01 | 1234 | 43744
2008-03-02 | 1234 | 56312
2008-03-03 | 1234 | 47639
etc.

But the unique user count for a period of 5 days is not equal to the sum of the daily unique user counts for that period! It's lower, because the same users can view that content on several different days.

So the problem is how to calculate, store and analyze unique-user data in relation to date/time. If we only need ALL views (not views by unique users) there is no problem - we calculate daily summaries and sum them up on the fly. But unique-user reports are a problem, since you cannot simply sum the daily figures to get the total for an arbitrary time period.
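
For scale, one shape the pre-aggregation could take (a sketch with assumed names, not something from the thread): keep one deduplicated row per user per content per day, so the raw log is scanned only once a day and range reports count distinct over far fewer rows.

-- Sketch (assumed table and column names): one row per user per content per day.
CREATE TABLE daily_content_users (
    day        date    NOT NULL,
    content_id integer NOT NULL,
    user_id    bigint  NOT NULL,
    PRIMARY KEY (day, content_id, user_id)
);

-- Any date range is then answered by counting distinct user_ids over at most
-- one row per user per day, instead of one row per hit in the raw log.
SELECT count(DISTINCT user_id)
FROM daily_content_users
WHERE content_id = 1234
  AND day BETWEEN '2008-03-05' AND '2008-03-10';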

December 31, 1999 | Unregistered Commenterj99

Noob here, but a thought - from your text I can't tell whether you are already doing this or how far it goes.

The first thing I thought of when reading this was a data warehouse. That's usually too expensive just for log analysis, but the idea of pre-calculating reports may speed up the reporting. It's different from a normalized database design, but you are optimizing for speed of delivery.

1). Create a table for your most granular search
2). Create a table for one step above
n). Repeat for the levels/reports you have

INSERT INTO table1 SELECT month, week, date, hour, content, user_id, COUNT(*) AS hits FROM hourly_log_dump GROUP BY month, week, date, hour, content, user_id;
INSERT INTO table2 SELECT month, week, date, content, user_id, SUM(hits) FROM table1 GROUP BY month, week, date, content, user_id;
INSERT INTO table3 SELECT month, week, content, user_id, SUM(hits) FROM table2 GROUP BY month, week, content, user_id;
INSERT INTO table4 SELECT month, content, user_id, SUM(hits) FROM table3 GROUP BY month, content, user_id;
This is the simplest form. Of course you will have other tables based on your reporting needs, but the point is that you pre-calculate the data (once) before the user wants to analyze it, thereby making it fast.

Some of these INSERT statements will actually be updates, depending on the measures (date fields) you are working with.

December 31, 1999 | Unregistered CommenterRob

Avinash Kaushik, in "Is Real-Time Really Relevant?" (http://www.kaushik.net/avinash/2006/10/is-real-time-really-relevant.html), questions the need for real-time analytics. So maybe the best strategy is to do nothing. That's an easy system to build :-)

December 31, 1999 | Unregistered CommenterTodd Hoff

I agree with Rob: a data mart for web usage analysis would probably cover your reporting needs; I've done it successfully on smaller data sets. You'd need a single fact table for page views (a degenerate dimension holding the user's IP address would suffice for analyzing unique visits by IP), populated automatically at a given periodicity by some ETL software, either pre-packaged or developed in-house. You can review Ralph Kimball's work on this subject in "The Data Webhouse Toolkit" or, in condensed form, in "The Data Warehouse Toolkit".
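
A minimal sketch of the kind of page-view fact table being described here, with purely illustrative names (a real design would follow Kimball's dimensional modelling):

CREATE TABLE fact_page_view (
    date_key   integer     NOT NULL,  -- points at a row in a date dimension
    time_key   integer     NOT NULL,  -- points at a row in a time-of-day dimension
    page_key   integer     NOT NULL,  -- points at a row in a page/content dimension
    user_ip    varchar(45) NOT NULL,  -- degenerate dimension used for unique-visit analysis
    page_views integer     NOT NULL DEFAULT 1
);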

December 31, 1999 | Unregistered CommenterJG

Try bulk-loading your data into MySQL in chunks and then do something like:

INSERT INTO stats table SELECT summed stats FROM temp table GROUP BY group fields ... ON DUPLICATE KEY UPDATE stats

on an InnoDB table whose primary or unique key is the date plus whatever other grouping columns you need (e.g. site_id). Primary key clustering in InnoDB makes this extremely fast if your aggregated tables stay relatively small (below the size of main memory) and your summary fields are additive. I've seen this approach work on a single off-the-shelf server processing over 200M log events a day for "real-time" reporting.
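
Spelled out with hypothetical table and column names, that pattern looks roughly like this:

CREATE TABLE daily_site_stats (
    day     date    NOT NULL,
    site_id integer NOT NULL,
    hits    bigint  NOT NULL,
    PRIMARY KEY (day, site_id)
) ENGINE=InnoDB;

-- Aggregate the freshly bulk-loaded chunk and merge it into the summary table.
INSERT INTO daily_site_stats (day, site_id, hits)
SELECT DATE(ts), site_id, COUNT(*)
FROM temp_hits
GROUP BY DATE(ts), site_id
ON DUPLICATE KEY UPDATE hits = hits + VALUES(hits);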

For unique-user-type counting you'll probably have to scan your entire dataset for the time range of interest, but you can probably do that separately and less frequently with a script, or even with Kettle or another ETL process. At 8-9M page views a day, I think Hadoop might be overkill.

December 31, 1999 | Unregistered CommenterAlex Tomic

A couple of years ago I was searching for a simple, scalable and easy-to-integrate analytics solution and ended up building one. Some suggestions:
1. High-performance data collection can be achieved with MySQL MyISAM tables and INSERT DELAYED queries for bulk loads (see the sketch after this list).
2. Flush data to the database in batched, buffered mode.
3. If you write raw data to files during collection, buffer it in memory and dump to the file system in multiples of the disk block size. In Java, use java.io.RandomAccessFile to write the log files. Avoid log4j for collecting raw data - log4j is a beautiful tool, but it was built to solve a different use case.
4. When aggregating data in the database, reuse computation by defining appropriate common subexpressions. Do incremental aggregation and computation for linear queries.
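
As a rough sketch of point 1, with an assumed table name and made-up values (INSERT DELAYED applies only to MyISAM-type engines and queues the write so the collector never blocks):

CREATE TABLE raw_hits (
    ts        datetime NOT NULL,
    user_id   bigint   NOT NULL,
    site_id   int      NOT NULL,
    action_id int      NOT NULL
) ENGINE=MyISAM;

-- Buffer hits in the application and flush them as one multi-row, queued insert.
INSERT DELAYED INTO raw_hits (ts, user_id, site_id, action_id) VALUES
    ('2008-04-19 10:00:01', 101, 1, 7),
    ('2008-04-19 10:00:01', 102, 1, 3),
    ('2008-04-19 10:00:02', 101, 2, 7);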

Our custom analytics solution is scaling very well for our shopping site.

VRL Swamy Vadali

December 31, 1999 | Unregistered Commentervrl_swamy

Google Analytics is not real time; it has at least a 24-hour delay.
I think every Google application, including AdSense and AdWords, has a delay of around 3 hours.

December 31, 1999 | Unregistered CommenterMarcelb

We offer an analytics service, and what we do is record every hit in binary format, not text. Another process then reads all the data for one hour into memory, groups it by different criteria and inserts the summary counts into different MySQL tables. For unique users, we keep separate daily and monthly files of unique user IDs for each criterion. Those files are merged every time the 'update stats' process runs, and the number of rows in each file is inserted into the tables. This lets us process around 250M daily hits with an average delay of one hour on a 4-core server with 4GB of RAM. We have around 30 different grouping criteria.
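
The unique-ID-file merge can also be sketched in plain SQL (table and column names invented here; the commenter actually uses flat binary files rather than MySQL for this step):

CREATE TABLE daily_unique_users (
    day      date        NOT NULL,
    criteria varchar(64) NOT NULL,
    user_id  bigint      NOT NULL,
    PRIMARY KEY (day, criteria, user_id)
);

-- Each 'update stats' run merges the latest hour's IDs; duplicates are dropped.
INSERT IGNORE INTO daily_unique_users (day, criteria, user_id)
SELECT DATE(ts), 'site:1234', user_id FROM hourly_hits WHERE site_id = 1234;

-- The unique-user count per criterion is then just the number of rows.
SELECT day, COUNT(*) AS unique_users
FROM daily_unique_users
WHERE criteria = 'site:1234'
GROUP BY day;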

Before this, we used to load every single hit into MySQL and create the summary tables with GROUP BY, but it did not scale.

We are now working on storing the binary files in GlusterFS so we can run the 'update stats' process on many boxes at the same time. We tried Hadoop but it was too slow. We tried KFS but it didn't work very well in our tests. GlusterFS seems to work fine, but the fact that all coordination between servers is done by the clients imposes some limitations.

Recently we tested a multithreaded qsort() to speed up sorts, but ran into some mutex deadlocks. We ended up developing our own, which is not as fast but does not fail.

December 31, 1999 | Unregistered CommenterAnonymous

We have a solution for precisely that problem. You can use SQL on flat files to get the data out in real time.

I will be more than happy to discuss this with you.

December 31, 1999 | Unregistered CommenterAnonymous

What do you mean by SQL on flat files? Do you mean a parser that runs SQL queries against the logs?

December 31, 1999 | Unregistered CommenterAnonymous

One simple answer is to use Sybase IQ. I've cut and pasted the performance stats from the Sybase website below. They tested it with 1,000 TB of loaded data. Yes, the petabyte test was done on a large machine ;) but you can get snazzy results with a 2- or 4-way box running a cheap Linux distro. One of the keys is that it's not I/O bound. Second, you get around 70% data compression. Third, it's column-based, so fewer table scans when searching down the tables. Fourth, it has non-blocking writes, so reads are not affected. Fifth, data is indexed as it is loaded.

You can just dump your schema in and give it a go. However, given its analytics focus, you don't want to do single inserts. It's better to batch and load from your main OLTP system. You could batch-load every three seconds and load 50K rows at a time if you want - just don't do it as single inserts. Load from a file, via BCP or named pipes, or use a staging DB to batch-load. Any reasonable query on large datasets, or any complex query, should beat a traditional OLTP database by an order of magnitude (you can find queries running 100-1000 times faster).

It's not real time in the sub-millisecond sense of a real-time trading system, but you should be able to get answers to really complex questions in seconds.

They have some large customers using it for real-time web analytics: Shopzilla, Nielsen Media Research and comScore, for example.

Shopzilla.com allows thousands of shoppers to search 30 million products from 40,000 stores. As a result, Shopzilla is an information factory, gathering millions of data points about shoppers and making that information available as a product.

As for me personally: I'm building my web 2.0, blah blah, super user-generated-content website with a high-availability setup (not the URL I've mentioned in my profile, that's just my hobby site), hence my interest in this site! It's due to launch in November. I plan to use Sybase IQ for some spiffy analytics on the back end. I'm still deciding on all the usual questions - APC, round robin, master/slave etc. - but for large-scale analytics it really is a no-brainer.

----------------------------------------
Petabyte Data Warehouse Using Solaris™ 10 OS, Sybase® IQ and BMMsoft DataFusion™:
* It loaded 1 Petabyte of raw transactional data (6 Trillion stock quote records) in a fully indexed star schema; creating a new, independently verified record for the world’s largest data warehouse.
* It reached a load speed of 285 billion rows per day (3 Million rows per second) and sustained that database population pace for a period of over 3 weeks.
* It showed an 85% data compression ratio by storing a Petabyte of raw transactional data in less than 260 Terabytes of actual disk space.
* It demonstrated an average Ready-Time of less than 2 seconds for data freshly added to the data warehouse.
* It replaced half of the “T” (Transactional) data with over 72 Terabytes of “EDM” (Emails, Documents and Multimedia) data, creating a data warehouse populated with 572 Terabytes of raw “EDMT” data.
* It showed a load speed of 26 Terabytes per day when populating the data warehouse with 185 million documents (emails, attachments and other unstructured documents.)
* It reached loading rates of 2 million emails per hour and 6 million documents per hour while consuming less than 7% of the available CPU power, leaving 93% of the M9000’s CPUs available for other activities.
* It demonstrated a substantial reduction in the number of disk drives needed for storage, translating directly into at least 90% reduction in CO2 emission over the lifetime of the Sun Data Warehouse Reference Architecture using Solaris 10 OS, Sybase IQ and BMMsoft DataFusion.

December 31, 1999 | Unregistered CommenterJustin List

The answer to your problem is easy: use Google Analytics (http://www.kaushik.net/avinash/2006/09/how-to-choose-a-web-analytics-tool-a-radical-alternative.html). You haven't said anything to suggest why you need to reinvent the wheel, and Urchin is a decent enough product.

December 31, 1999 | Unregistered CommenterAnonymous

I'm in a vaguely similar position - I'll soon be building the databases behind a website targeting 1,000,000+ page views per day, and one of the requirements is real-time info. I haven't decided how it will all work yet, but rather than trying to solve the problem of processing lots of data in real time, I'm working on ways of providing 'enough' info without processing lots of data (the easiest task is the one you don't have to do).

My thinking at the moment is to have some form of period-metrics table, say MyTable (date+time period, metric 1, metric 2, metric 3, etc.), where the date+time period represents, for example, 17th July, 10pm to 10:15pm, metric 1 is page views, metric 2 is sessions, metric 3 is unique visitors today, and so on. Metrics are gathered by each web server in the farm and periodically written to MyTable. In the meantime all the usual detailed clickstream data around sessions and page views gets written to the normal tables - that data simply isn't used for real-time analytics. Periodically the clickstream data is moved from the main database to a data warehouse and pre-processed accordingly.

From MyTable I can now produce a reasonable amount of info with virtually no effort; 15-minute intervals produce 96 rows per day - manageable ;) More complicated analysis can be done by cross-referencing MyTable with pre-processed data from the data warehouse.
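
A rough shape for that idea, with assumed names and MySQL-flavoured upsert syntax (the design itself is still on the drawing board):

CREATE TABLE period_metrics (
    period_start datetime NOT NULL PRIMARY KEY,  -- e.g. '2008-07-17 22:00:00' for the 10pm-10:15pm bucket
    page_views   int NOT NULL DEFAULT 0,
    sessions     int NOT NULL DEFAULT 0
);

-- Each web server in the farm periodically flushes its counters for the current bucket.
INSERT INTO period_metrics (period_start, page_views, sessions)
VALUES ('2008-07-17 22:00:00', 1250, 310)
ON DUPLICATE KEY UPDATE
    page_views = page_views + VALUES(page_views),
    sessions   = sessions   + VALUES(sessions);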

How well this works depends on how practical it is for the individual metrics to be recognised and counted without reference to a main datastore. This is my first project for a large website and I'm still in the requirements-analysis phase, so this idea may or may not make it off the drawing board - I just thought I'd throw it out there.

Good luck,
Rhys

PS Even though MS SQL Server is my main tool I'm also a big fan of Sybase IQ (as mentioned by a previous poster) - it can be a moody bugger to work with but when you use it properly it's amazing.

December 31, 1999 | Unregistered CommenterRhys

I would agree with the commenter above regarding Google Analytics. It gives you more accurate real-time stats than anything else.

December 31, 1999 | Unregistered Commenterfarhaj

How about instead of this:
date | hour | user_id | site_id | action_id | some_other_attributes..

use:
date | user_id | site_id | action_id | count
Basically, if the user goes to the same site and performs the same action on a given day, just increment a count instead of adding another row. I am assuming you don't need granularity finer than a day. This will shrink the table by a factor of X if the average user visits X pages of your site on a given day. Don't bother with this if X is too small for you.
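
In MySQL-flavoured SQL that could look something like this, with an assumed table and a unique key over the grouping columns:

CREATE TABLE daily_actions (
    day       date    NOT NULL,
    user_id   bigint  NOT NULL,
    site_id   int     NOT NULL,
    action_id int     NOT NULL,
    hit_count int     NOT NULL DEFAULT 1,
    PRIMARY KEY (day, user_id, site_id, action_id)
);

-- Repeats of the same (day, user, site, action) bump the counter instead of adding a row.
INSERT INTO daily_actions (day, user_id, site_id, action_id)
VALUES ('2008-04-19', 101, 1, 7)
ON DUPLICATE KEY UPDATE hit_count = hit_count + 1;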

December 31, 1999 | Unregistered CommenterAnonymous

You actually can use PostgreSQL. The truth is your content is static, so you can pre-process every day, keep a small summary of how much of each thing you have, and then sum it on the fly.

The solution is very, very simple. You have input like this, as you said:

date | hour | user_id | site_id | action_id | some_other_attributes..

So you can create a temporary table using this exact structure.


CREATE TEMP TABLE daily (
    date      date,
    hour      smallint,
    user_id   bigint,
    site_id   integer,
    action_id integer
    -- ...plus the other attributes from the log line
);

After that, you answer your own question by "summing" all the values for the day, so you have pre-processed information:


CREATE TABLE log_static (
    day                   date,    -- the date these summaries are for
    site_id               integer,
    hits                  bigint,  -- number of hits per day (the simplest)
    unique_users          bigint,  -- number of hits by unique users per day
    unique_users_on_site  bigint,  -- number of hits by unique users on the specific site per day
    distinct_user_actions bigint   -- number of distinct actions by users on the specific site
    -- etc. - one column per pre-computed figure you need
);

After this you can generate elaborate reports very, very fast.
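
For example, a hit-count report for an arbitrary period becomes a cheap sum over the pre-processed rows (using the assumed column names from the sketch above; the unique-user columns, as the original poster notes, cannot be summed across days this way):

SELECT site_id, SUM(hits) AS total_hits
FROM log_static
WHERE day BETWEEN '2008-03-05' AND '2008-03-10'
GROUP BY site_id;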

You don't need "Big Data" for this simple stuff.

December 17, 2012 | Unregistered CommenterDaniel Mantovani
