Strategy: Don't Use Polling for Real-time Feeds
Ivan Zuzak wrote a fascinating article on Real-time feed processing and filtering using Google App Engine to build Feed-buster, a service that inserts MediaRSS tags into feeds that don't have them. He talks about using polling and PubSubHubbub (real-time) to process FriendFeed feeds. Ivan is also trying to devise separate filtering and processing services, where:
- filtering services should be applied as close to the publisher as possible so notifications that nobody wants don't waste network resources.
- processing services should be applied as close to the subscriber as possible so that the original update may travel through the network as a single notification for as long as possible (see the sketch after this list).
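As a rough illustration of that split (all class and function names here are hypothetical, not taken from Feed-buster or PubSubHubbub), a hub sitting near the publisher could apply a subscriber-independent filter once before fan-out, while each subscriber runs its own processing after delivery:

```python
# Hypothetical sketch of "filter near the publisher, process near the subscriber".
# None of these names come from Feed-buster or PubSubHubbub; they only show
# where each step would sit in the pipeline.

from typing import Callable, List

Entry = dict          # a feed entry, e.g. {"title": ..., "tags": [...]}
Filter = Callable[[Entry], bool]
Processor = Callable[[Entry], Entry]

class Hub:
    """Sits close to the publisher: drops unwanted entries before fan-out."""
    def __init__(self, entry_filter: Filter):
        self.entry_filter = entry_filter
        self.subscribers: List["Subscriber"] = []

    def publish(self, entry: Entry) -> None:
        if not self.entry_filter(entry):
            return                     # filtered out once, never crosses the network
        for sub in self.subscribers:   # a single notification fans out to all subscribers
            sub.notify(entry)

class Subscriber:
    """Sits close to the consumer: per-subscriber processing runs after delivery."""
    def __init__(self, processor: Processor):
        self.processor = processor
        self.inbox: List[Entry] = []

    def notify(self, entry: Entry) -> None:
        self.inbox.append(self.processor(entry))
```

The point is only that the filter runs once, at the hub, while each processor runs on the subscriber's side, so one notification travels the network for as long as possible.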
The article is interesting throughout, but Ivan makes a particularly insightful observation on the nature of using polling in combination with metered infrastructure/platform services:
Polling is bad because App Engine applications have a fixed free daily quota for consumed resources; when the number of feeds the service processed increased, the daily quota was exhausted before the end of the day because FriendFeed polls the service for each feed every 45 minutes.
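To see why the quota disappears so quickly, here is some back-of-the-envelope math. The per-request CPU cost and the daily budget below are purely illustrative assumptions, not App Engine's actual limits; only the 45-minute polling interval comes from the article.

```python
# Rough illustration of why per-feed polling exhausts a fixed daily quota.
# CPU_MS_PER_REQUEST and FREE_DAILY_CPU_MS are made-up assumptions, not
# actual App Engine figures.

POLL_INTERVAL_MIN = 45                                   # FriendFeed polls each feed every 45 minutes
POLLS_PER_FEED_PER_DAY = 24 * 60 // POLL_INTERVAL_MIN    # = 32 polls per feed per day

CPU_MS_PER_REQUEST = 200                  # assumed average CPU cost of handling one poll
FREE_DAILY_CPU_MS = 6.5 * 3600 * 1000     # assumed free daily CPU budget

def feeds_supported() -> int:
    """How many feeds can be polled before the assumed daily budget runs out."""
    cost_per_feed_per_day = POLLS_PER_FEED_PER_DAY * CPU_MS_PER_REQUEST
    return int(FREE_DAILY_CPU_MS // cost_per_feed_per_day)

if __name__ == "__main__":
    print(f"{POLLS_PER_FEED_PER_DAY} polls per feed per day")
    print(f"~{feeds_supported()} feeds fit in the assumed free quota")
```

The quota cost scales linearly with the number of feeds even if nothing in those feeds has changed, which is exactly the waste that a push model like PubSubHubbub avoids.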
Reader Comments (2)
Thanks Todd! I've been following your blog for some time now and it's really great to be mentioned here.
I agree with you on the points that algorithms and system design will be changed and affected by the new billing models of cloud computing platforms. The Big $ notation you imagined can be tied in with cloud interoperability (http://cloudforum.org/): if I develop an application with a $(A) for a cloud infrastructure X, and then wish to move to infrastructure Y (e.g. moving from GAE to Amazon), then not only will the program probably need to be rewritten in a new programming language for Y, but a design with $(B) != $(A) will possibly be needed as well. And, really going into sci-fi territory here, this could possibly be done automatically. A cross-cloud compiler, maybe? :)
Your idea of decoupling the service from the consumer with a request queue and then polling the queue for work items on the service side is indeed great for controlling resource consumption. I have a feeling that these kinds of infrastructure mechanisms *must* be a part of every cloud computing platform, together with reflective APIs which provide detailed insight into how the application is consuming resources. When combined, these two enable the application to programmatically and dynamically scale itself. I believe this is the motivation behind Task Queues in GAE, for example, but it covers only a part of the needed functionality (since, AFAIK, there is no way to determine how many resources the application has consumed from within the application).
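A minimal sketch of that decoupling idea, using only Python's standard queue module rather than any specific cloud API (the batch-size cap and function names are arbitrary assumptions): requests land in a queue immediately, and the service drains a bounded batch on its own schedule, so resource consumption per run is capped no matter how fast requests arrive.

```python
# Minimal sketch of decoupling consumers from the service with a request queue.
# Standard library only; not tied to App Engine Task Queues or any cloud API.

import queue

work_queue: "queue.Queue[str]" = queue.Queue()

def enqueue_request(feed_url: str) -> None:
    """Consumer side: accepting a request is cheap, just an enqueue."""
    work_queue.put(feed_url)

def process_feed(feed_url: str) -> None:
    """Placeholder for the real per-feed work (fetching, inserting MediaRSS tags, ...)."""
    pass

def drain_batch(max_items: int = 50) -> int:
    """Service side: called on a schedule (e.g. cron); processes at most
    max_items per run so resource consumption per run is bounded."""
    processed = 0
    while processed < max_items:
        try:
            feed_url = work_queue.get_nowait()
        except queue.Empty:
            break
        process_feed(feed_url)
        processed += 1
    return processed
```

This is roughly the shape the comment attributes to GAE Task Queues: accept work cheaply up front, then let the service pull and process it at a rate it controls.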
http://rsscloud.org/ + http://realtimerss.org/