Yao Yue has worked on Twitter’s Cache team since 2010. She recently gave a really great talk: Scaling Redis at Twitter. It’s about Redis of course, but it's not just about Redis.
Yao has worked at Twitter for a few years. She's seen some things. She’s watched the cache service at Twitter explode from serving just one project to serving nearly a hundred. That's many thousands of machines, many clusters, and many terabytes of RAM.
It's clear from her talk that she's coming from a place of real personal experience, and that shines through in the practical way she explores issues. It's a talk well worth watching.
As you might expect, Twitter has a lot of cache.
Timeline Service for one datacenter using Hybrid List:
- ~40TB allocated heap
- ~30MM qps
- >6,000 instances
Use of BTree in one datacenter:
- ~65TB allocated heap
- ~9MM qps
- >4,000 instances
You'll learn more about BTree and Hybrid List later in the post.
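To get a feel for these numbers, here is a rough back-of-the-envelope sketch of the average load per instance. It assumes decimal units (1 TB = 1,000 GB) and treats the approximate figures above as exact. Because the instance counts are lower bounds (">6,000", ">4,000"), the results are best read as upper bounds on the averages, not as figures from the talk. The function name `per_instance` is made up for this illustration.

```python
def per_instance(heap_tb, qps_millions, instances):
    """Return (average heap in GB, average qps) per instance.

    Assumes 1 TB = 1,000 GB; inputs are the rounded figures quoted above.
    """
    heap_gb = heap_tb * 1000 / instances
    qps = qps_millions * 1_000_000 / instances
    return heap_gb, qps

# Timeline Service (Hybrid List): ~40TB heap, ~30MM qps, >6,000 instances
hybrid_list = per_instance(40, 30, 6000)

# BTree: ~65TB heap, ~9MM qps, >4,000 instances
btree = per_instance(65, 9, 4000)

print(f"Hybrid List: {hybrid_list[0]:.2f} GB, {hybrid_list[1]:.0f} qps per instance")
print(f"BTree:       {btree[0]:.2f} GB, {btree[1]:.0f} qps per instance")
```

Under these assumptions, a Hybrid List instance averages at most roughly 6.7 GB of heap and 5,000 qps, while a BTree instance averages at most roughly 16 GB and 2,250 qps.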
A couple of points stood out:
- Redis is a brilliant idea because it takes underutilized resources on servers and turns them into a valuable service.
- Twitter specialized Redis with two new data types that fit their use cases perfectly. So they got the performance they needed, but it locked them into an older code base and made it hard to merge in new features. I have to wonder, why use Redis for this sort of thing? Just create a timeline service using your own data structures. Does Redis really add anything to the party?
- Summarize large chunks of log data on the node, using your local CPU power, before saturating the network.
- If you want something that’s high performance, separate the fast path, which is the data path, from the slow path, which is the command and control path.
- Twitter is moving towards a container environment with Mesos as the job scheduler. This is still a new approach so it's interesting to hear about how it works. One issue is the Mesos wastage problem that stems from the requirement to specify hard resource usage limits in a complicated runtime world.
- A central cluster manager is really important to keep a cluster in a state that’s easy to understand.
- The JVM is slow and C is fast. Their cache proxy layer is moving back to C/C++.
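The point above about summarizing log data on the node before it hits the network can be illustrated with a minimal sketch. This is not Twitter's implementation: the log format (`<timestamp> <command> <key>`) and the function name `summarize_log_lines` are assumptions made up for this example. The idea is simply that a handful of counters is far cheaper to ship than every raw line.

```python
from collections import Counter

def summarize_log_lines(lines):
    """Collapse raw log lines into per-command counts on the local node.

    Assumes a hypothetical "<timestamp> <command> <key>" line format;
    lines with fewer than two fields are skipped as malformed.
    """
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # malformed line, ignore it
        counts[parts[1]] += 1
    return dict(counts)

# Hypothetical raw log lines as they might accumulate on one node.
raw = [
    "1400000000 GET user:1",
    "1400000001 GET user:2",
    "1400000002 SET user:1",
    "garbage",
]

# Only this small summary would need to cross the network.
summary = summarize_log_lines(raw)
print(summary)
```

Here four raw lines collapse into two counters. On a real node the ratio would be much larger, which is where the network savings come from.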
With that in mind, let's learn more about how Redis is used at Twitter:
Why Redis?