High Scalability -

8 Comments |

Permalink |

AWS,

Example,

Postgres,

Product,

RoR,

cloud

Thursday

Apr232009

Which Key value pair database to be used

Thursday, April 23, 2009 at 5:39AM

My Table has 2 columsn .Column1 is id,Column2 contains information given by user about item in Column1 .User can give 3 types of information about item.I separate the opinion of single user by comma,and opinion of another user by ;. Example- 23-34,us,56;78,in,78 I need to calculate opinions of all users very fast.My idea is to have index on key so the searching would be very fast.Currently i m using mysql .My problem is that maximum column size is below my requirement .If any overflow occurs i make new row with same id and insert data into new row. Practically I would have around maximum 5-10 for each row. I think if there is any database which removes this application code. I just learn about key value pair database which is exactly i needed . But which doesn't put constraint(i mean much better than RDMS on column size. This application is not in production.

d17may |

Permalink |

General Discussion,

key-value store

Wednesday

Apr222009

Gear6 Web cache - the hardware solution for working with Memcache

Wednesday, April 22, 2009 at 9:41AM

The Gear6 Web Cache hybrid DRAM-flash memory architecture allows for 5-10 times more memcache memory per unit of rack space than DRAM-only configurations, and cuts memory costs by 50%. Other software enhancements include a slab allocator that is more efficient than traditional memcache implementations due to its fine-grained bucket sizing. Gear6 Web Cache also supports object sizes greater than 1 megabyte and manages evictions based on the cost of replacing objects, depending on the size and frequency of object access. It intelligently places cache instances across DRAM and flash, taking into account their different characteristics, while at the same time monitoring their health and detecting and de‐allocating faulty or failing memory.

Gear6 Web Cache is a Memcached protocol compliant solution that scales and accelerates web applications, reduces memory footprint, enhances availability and implements comprehensive Memcached management features. Designed to work with all popular memcache clients, Gear6 Web Cache integrates seamlessly into existing deployments and immediately provides a scalable, high density caching solution for your web application environment.

Some of the web services which are using Gear6 are Answers.com (wiki answers), Veoh.com (online video), myYearBook.com (social network).

Read more about Gear6 hardware and customer cases studies on Gear6 website

mg1313 |

Permalink |

gear6,

web cache

Tuesday

Apr212009

Thread Pool Engine in MS CLR 4, and Work-Stealing scheduling algorithm

Tuesday, April 21, 2009 at 9:36PM

I just saw this article in HFadeel blog that spaek about Parallelism in .NET Framework 4, and how Thread Pool work, and the most faoums scheduling algorithm : Work-stealing algorithm. With preisnation to see it in action.

JeffD |

Permalink |

Highscalability CDN Tag Cloud

Parallelism,

multithreading

Tuesday

Apr212009

What CDN would you recommend?

Tuesday, April 21, 2009 at 3:31PM

Update 10: The Value of CDNs by Mike Axelrod of Google. Google implements a distributed content cache from within large ISPs. This allows them to serve content from the edge of the network and save bandwidth on the ISPs backbone. Update 9: Just Jump: Start using Clouds and CDNs. Bob Buffone gives a really nice and practical tutorial of how to use CloudFront as your CDN. Update 8: Akamai’s Services Become Affordable for Anyone! Blazing Web Site Performance by Distribution Cloud. Distribution Cloud starts at $150 per month for access to the best content distribution network in the world and the leader of Content Distribution Networks. Update 7: Where Amazon’s Data Centers Are Located, Expanding the Cloud: Amazon CloudFront. Why Amazon's CDN Offering Is No Threat To Akamai, Limelight or CDN Pricing. Amazon has launched their CDN with "“low latency, high data transfer speeds, and no commitments.” The perfect relationship for many. The majority of the locations are in North America, but some are in Europe and Asia. Update 6: Amazon Launching New Content Delivery Network: No Threat To Major CDNs, Yet. All the Amazon will kill all other CDNs is a bit overblown. As usual Dan Rayburn sets us straight: The offering won't support streaming, live broadcasting, or provide many of the other products and services that video content owners need...the real story here is that Amazon is going to offer a high performance method of distributing content with low latency and high data transfer rates. Update 5: When It Comes To Content Delivery Networks, What Is The "Edge"?. Dan Rayburn is on edge about the misuse of the term edge: closest location to the user does not guarantee quality, often content is not delivered from the closest location, all content is not replicated at every "edge" location. Lots of other essential information. Update 4: David Cancel runs a great test to see if you should be Using Amazon S3 as a CDN?. Conclusion: "CacheFly performed the best but only slightly better than EdgeCast. The S3 option was the worst with the Nginx/DIY option performing just over 100 ms faster." Also take look at Part 2 - Cacheability? Update 3: Mr. Rayburn takes A Detailed Look At Akamai's Application Delivery Product . They create a "bi-nodal overlay network" where users and servers are always within 5 to 10 milliseconds of each other. Your data center hosted app can't compete. The problem is that people (that is, me) can understand the data center model. I don't yet understand how applications as a CDN will work. Update 2: Dan Rayburn starts an interesting series of articles on Highlights Of My Day In Cambridge With Akamai. Akamai is moving strong into the application distribution business. That would make an interesting cloud alternative.. Update: Streamingmedia links to new CDN DF Splash that specializes in instant-on TV-quality video streaming. A question was raised on the forum asking for a CDN recommendation. As usual there are no definitive answers, but here are three useful articles that may help your deliberations.

First, Tony Chang shows how to drive down response times using edge acceleration strategies.

Then Pingdom gives a nice overview and introduction to CDNs.

And last but not least, Dan Rayburn from StreamingMedia.com gives a master class in how much you should pay for your CDN, what you should be getting for your money, and how to find the right provider for your needs. Lots and lots of good stuff to learn, even if you didn't roll out of bed this morning pondering the deeper mysteries of content delivery networks and the Canadian dollar.

Edge Acceleration Strategies: Akamai by Tony Chang

The edge network is the "network physically closest to the end user and the 'origin' is where the application(s) is hosted." Tony talks about how you use CDNs to manage the user experience through meeting millisecond+ level SLAs using edge acceleration services. He does this in an interesting way. He follows a request through its life cycle and shows how to turn your caterpillar into a butterfly at each stage:

An edge DNS means a name server closest to the end user will serve the DNS request.

Static content is easily cached on the edge.

Dynamic content is moving to the edge using what Akamai calls Web Application Accelerators.

And something I've never heard of is to use your CDN to improve routing performance by up to 33%. The service bypasses BGP using its own more optimized route tables to decrease latency.

Pingdom's A look at Content Delivery Networks, or “how to serve lots of content really fast”

CDNs are the hidden powerhouse of the internet. The unsung mitochondria powering bits forward. Cost, convenience and performance are the reasons people turn to CDNs. A CDN does what you can't, it put lots of servers in lots of different places. Panther Express, for example, puts 800 servers in 22 different geographical locations. Since CDNs sell delivery capacity capacity planning is one of their big challenges. And Pingdom would like you to recognize the importance of monitoring for detecting and quickly reacting to problems :-) The future of CDNs lies in larger caches for HD video, better locality, and more integration with hosting providers.

Video on Content Delivery Network Pricing, Costs for Outsourced Video Delivery by Dan Rayburn

Also CDN Pricing Data: Average Cost Per GB Declines In Q4 Due To Startups. It's evident Dan really knows his stuff. His articles and presentations are highly educational for anyone interested in the complex and confusing CDN world. Dan sees hundreds of real-life customer-CDN vendor contracts a year and he reports on real prices averaged over all the contracts he has seen. One of the hardest things as a consumer is knowing what a good price is for your basket of goods and Dan gives you the edge, so to speak. What I learned:

You decide who is a CDN.There's no central agency setting a standard. Dan's minimal definition is a service delivering live video in the US and Europe.

CDN market has gone from 10 to 30 vendors. VCs are pumping hundreds of millions into the space.

CDN providers provide a wide variety of services: application caching, static caching, streaming video, progressive video, etc. Dan concentrates only on video delivery.

You can't say vendor A is better than vendor B. It depends on your needs.

When comparing vendors you need to do an "apples to apples" comparison. He really likes that phrase. You can't compare vendors, only like products between vendors.

Video serving is complex because there are few standards in the market. There are multiple platforms, multiple encoding standards, etc.

Customer's don't buy on price alone. Delivery of bits over a network is a commodity. Buy on SLA, customer service, product, format, support, geographic reach, and performance.

There appears to be no way to compare vendors on the performance of their network. There are too many variables in play to make an accurate comparison.He's quite adament about this. Performance could mean: SLA, customer service, upload content, buffering, etc. No way to measure performance network performance across networks. Static image performance is very different than streaming performance. People all over the globe are accessing your content so what is the "performance" of that?

A trend this year is demand for P2P pricing and services.

To price your video delivery you need to answer 4 questions: 1) How many hours? 2) How many users? 3) How long will they watch? 4) What encoding and what bit rate?

Price varies on product bundle. Vendors need to specialize so they can move themselves out of the commodity market. If you would pay 8 cents a gig for delivered video your price will be different if you add application and static caching.

Contracts are at 12 months. Maybe 2 years when bundling services.

P2P is not necessarily cheaper so compare. Pricing is very confusing.

You can sometimes get a lower price by using the vendor's player.

Flash streaming is more expensive because of licensing fees. The number varies because each vendor cuts their own licensing deals. Could be 20% more, or it could be double, depends on volume.

When signing a vendor think if you need global reach or is regional reach sufficient? Use a regional service provider if you need a CDN just once in a while. It's matter of picking based on your needs. You can often get a less expensive deal and get a quarterly commit versus a montly commit.

Storage costs have really fallen. High of $10/gig and low of 10 cents per gig.

Most CDNs will pull from your origin storage and cache, which reduces your storage cost.

CDNs don't want to get paid with promises of ad sharing.

Pick a CDN vendor that will take the time to educate you. They should ask you about your business first, they shouldn't talk about themselves first. He mentions this point a few times and it makes a lot of sense.

Consider a dual vendor strategy where you pick one vendor for video and another for application.

Quality in the industry is very high. People rarely complain about the network. Customers want better support and reporting. Poor reporting is the #1 complaint. Run away if a vendor wants to charge for reporting. Lots of good stuff.

Edge Acceleration Strategies: Akamai by Tony Chang

A look at Content Delivery Networks, or “how to serve lots of content really fast”

Content Delivery Network Pricing, Costs for Outsourced Video Delivery

CDN Pricing Data: Average Cost Per GB Declines In Q4 Due To Startups

A Taxonomy and Survey of Content Delivery Networks

Content Delivery Networks (CDN) Research Directory

16 Comments |

Permalink |

CDN

Monday

Apr202009

Some things about Memcached from a Twitter software developer

Monday, April 20, 2009 at 5:14AM

Memcached is generally treated as a black box. But what if you really need to know what's in there? Not for runtime purposes, but for optimization and capacity planning?

Read more on Evan Weaver, a software developer working for Twitter (a contributor for Rails core and Mongrel).

mg1313 |

1 Comment |

Permalink |

ruby on rails

Thursday

Apr162009

Serving 250M quotes/day at CNBC.com with aiCache

Thursday, April 16, 2009 at 3:48AM

As traffic to cnbc.com continued to grow, we found ourselves in an all-too-familiar situation where one feels that a BIG change in how things are done was in order, the status-quo was a road to nowhere. The spending on HW, amount of space and power required to host additional servers, less-than-stellar response times, having to resort to frequent "micro"-caching and similar tricks to try to improve code performance - all of these were surfacing in plain sight, hard to ignore. While code base could clearly be improved, the limited Dev resources and having to innovate to stay competitive always limits ability to go about refactoring. So how can one go about addressing performance and other needs without a full blown effort across the entire team ? For us, the answer was aiCache - a Web caching and application acceleration product (aicache.com). The idea behind caching is simple - handle the requests before they ever hit your regular Apache<->JK<->Java<->Database response generation train (we're mostly a Java shop). Of course, it could be Apache-PHP-Database or some other backend system, with byte-code and/or DB-result-set caching. In our case we have many more caching sub-systems, aimed at speeding up access to stock and company-related information. Developing for such micro-caching and having to maintain systems with such micro-caching sprinkled throughout is not an easy task. Nor is troubleshooting. But we digress... aiCache takes this basic idea of caching and front-ending the user traffic to your Web environment to a whole new level. I don't believe any of aiCache's features are revolutionary in nature, rather it is the sheer number of features it offers that seems to address our every imaginable need. We've also discovered that aiCache provides virtually unlimited performance, combined with incredible configuration flexibility and support for real-time reporting and alerting. In interest of space, here're some quick facts about our experience with the product, in no particular order: · Runs on any Linux distro, our standard happens to be RedHat 5, 64bit on HP DL360G5 · The responses are cached in the RAM, not on disk. No disk IO, ever (well, outside of access and error logging, but even that is configurable). No latency for cached responses - stress tests show TTFB at 0 ms. Extremely low resource utilization - aiCache servers serving in excess of 2000 req/sec are reported to be 99% idle ! Being not a trusting type, I verified the vendor's claim and stress tested these to about 25,000 req/sec per server - with load averages of about 2 (!). · We cache both GET and POST results, with query and parameter busting (selectively removing those semi-random parameters that complicate caching) · For user comments, we use response-driven expiration to refresh comment threads when a new comment is posted. · Had a chance to use site-fallback feature (where aiCache serves cached responses and shields origin servers from any traffic) to expedite service recovery · Used origin-server tagging a few times to get us out of code-deployment-gone-bad situations. · We average about 80% caching ratios across about 10 different sub-domains, with some as high as 97% cache-hit-ratio. Have already downsized a number of production Web farms, having offloaded so much traffic from origin server infrastructure, we see much lower resource utilization across Web, DB and other backend systems · Keynote reports significant improvement in response times - about 30%. · Everyone just loves real-time traffic reporting, this is a standard window on many a desktop now. You get to see req/sec, response time, number of good/bad origin servers, client and origin server connections, input and output BW and so on - all reported per cached sub-domain. Any of these can be alerted on. · We have wired up Nagios to read/chart some of aiCache extensive statistics via SNMP, pretty much everything imaginable is available as an OID. · Their CLI interface is something I like a lot too: you see the inventory of responses, can write out any response, expire responses, report responses sorted by request, size, fill time, refreshes and so on, in real time, no log crunching is required. Some commands are cluster-aware, so you only execute them on one node and they are applied across. Again, the list above is a small sample of product features that we use, there're many more that we use or explore using. Their admin guide weighs in at 140 pages (!) - and it is all hard-core technical stuff that I happen to enjoy. Some details about our network setup . We use F5 load balancers and have configured the virtual IPs to have both aiCache servers _and origin server enabled at the same time. Using F5's VIP priority feature, we direct all of the traffic to aiCache servers, as long as at least one is available, but have ability to automatically, or on demand, failover all of the traffic to origin servers. We also use a well known CDN to serve auxiliary content - Javascript, CSS and imagery. I stumbled upon the product following a Wikipedia link, requested a trial download and was up and running in no time. It probably helped that I have experience with other caching products - going back to circa 2000, using Novell ICS. But it all mostly boils down to knowing what URLs can be cached and for how long. And lastly - when you want stress test aiCache, make sure to hit it directly, right by server's IP - otherwise you will most likely melt down one or more of other network infrastructure components ! A bit about myself: an EE major, have been working with Internet infrastructures since 1992 - from an ISP in Russia (uucp over MNP-5 2400b modem seemed blazing fast back then!) to designing and running infrastructures of some of the busier sites for CNBC and NBC - cnbc.com, NBC's Olympics website and others. Rashid Karimov, Platform, CNBC.com

13 Comments |

Permalink |

aicache

Thursday

Apr162009

Paper: The End of an Architectural Era (It’s Time for a Complete Rewrite)

Thursday, April 16, 2009 at 1:16AM

Update 3: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks. Although the process to load data into and tune the execution of parallel DBMSs took much longer than the MR system, the observed performance of these DBMSs was strikingly better. Update 2: H-Store: A Next Generation OLTP DBMS is the project implementing the ideas in this paper: The goal of the H-Store project is to investigate how these architectural and application shifts affect the performance of OLTP databases, and to study what performance benefits would be possible with a complete redesign of OLTP systems in light of these trends. Our early results show that a simple prototype built from scratch using modern assumptions can outperform current commercial DBMS offerings by around a factor of 80 on OLTP workloads. Update: interesting related thread on Lamda the Ultimate. A really fascinating paper bolstering many of the anti-RDBMS threads the have popped up on the intertube lately. The spirit of the paper is found in the following excerpt: In summary, the current RDBMSs were architected for the business data processing market in a time of different user interfaces and different hardware characteristics. Hence, they all include the following System R architectural features: * Disk oriented storage and indexing structures * Multithreading to hide latency * Locking-based concurrency control mechanisms * Log-based recovery Of course, there have been some extensions over the years, including support for compression, shared-disk architectures, bitmap indexes, support for user-defined data types and operators, etc. However, no system has had a complete redesign since its inception. This paper argues that the time has come for a complete rewrite. Of particular interest the discussion of H-store, which seems like a nice database for the data center. H-Store runs on a grid of computers. All objects are partitioned over the nodes of the grid. Like C-Store [SAB+05], the user can specify the level of K-safety that he wishes to have. At each site in the grid, rows of tables are placed contiguously in main memory, with conventional B-tree indexing. B-tree block size is tuned to the width of an L2 cache line on the machine being used. Although conventional B-trees can be beaten by cache conscious variations [RR99, RR00], we feel that this is an optimization to be performed only if indexing code ends up being a significant performance bottleneck. Every H-Store site is single threaded, and performs incoming SQL commands to completion, without interruption. Each site is decomposed into a number of logical sites, one for each available core. Each logical site is considered an independent physical site, with its own indexes and tuple storage. Main memory on the physical site is partitioned among the logical sites. In this way, every logical site has a dedicated CPU and is single threaded. The paper goes through how databases should be written with modern CPU, memory, and network resources. It's a fun an interesting read. Well worth your time.

10 Comments |

Permalink |

Database,

Paper

Wednesday

Apr152009

Implementing large scale web analytics

Wednesday, April 15, 2009 at 1:04AM

Does anyone know of any articles or papers that discuss the nuts and bolts of how web analytics is implemented at organizations with large volumes of web traffic and a critcal business need to analyze that data - e.g. places like Amazon.com, eBay, and Google? Just as a fun project I'm planning to build my own web log analysis app that can effectively index and query large volumes of web log data (i.e. TB range). But first I'd like to learn more about how it's done in the organizations whose lifeblood depends on this stuff. Even just a high level architectural overview of their approaches would be nice to have.

robw |

Permalink |

General Discussion

Wednesday

Apr152009

Using HTTP cache headers effectively

Wednesday, April 15, 2009 at 12:29AM

Hi, Some time ago , martin fowler bloged about how HTTP cache headers can be very effectively used in web site design. http://www.martinfowler.com/bliki/SegmentationByFreshness.html How actively HTTP cache headers are considered in web site design? I think it is a great tool to reduce lot of load on server and should be considered before designing any complex caching strategy. Thoughts? Thanks, Unmesh