Wordnik - 10 million API Requests a Day on MongoDB and Scala
Tuesday, February 15, 2011 at 9:04AM
Wordnik is an online dictionary and language resource that has both a website and an API component. Their goal is to show you as much information as possible, as fast as they can find it, for every word in English, and to give you a place where you can make your own opinions about words known. As cool as that is, what is really cool is the information they share in their blog about their experiences building a web service. They've written an excellent series of articles and presentations you may find useful:
- What has technology done for words lately?
- Eventual consistency. Using an eventually consistent model they can do work in parallel: they count as many words as possible as they go, and add them all up when there's a lag. The count is always in the ballpark, and they never have to stop.
- Document-oriented storage. Dictionary entries are more naturally modeled as hierarchical documents, and using that model has made data quicker to find and easier to develop against.
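To make the hierarchical-document point concrete, here is a minimal sketch of a dictionary entry as a single nested document, the shape a document store like MongoDB encourages. The field names and content are hypothetical, not Wordnik's actual schema:

```python
# Illustrative only: one dictionary entry as a single hierarchical document.
# Field names are hypothetical, not Wordnik's real schema.
entry = {
    "word": "ballpark",
    "partsOfSpeech": [
        {
            "pos": "noun",
            "definitions": [
                {"text": "A field where baseball is played."},
            ],
        },
        {
            "pos": "adjective",
            "definitions": [
                {"text": "Approximate; close to the actual figure."},
            ],
        },
    ],
    "citations": [{"text": "the count's always in the ballpark", "year": 2011}],
}

# One lookup returns the whole entry -- no joins across separate entry,
# sense, definition, and citation tables as a normalized relational
# schema would require.
first_definition = entry["partsOfSpeech"][0]["definitions"][0]["text"]
```

A relational design would spread this across several tables and reassemble it with joins on every read; the document model fetches it in one operation.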
- 12 Months with MongoDB
- The primary driver for migrating to MongoDB was performance. MySQL didn't work for them.
- Mongo serves an average of 500k requests/hour. Peak traffic is 4x that.
- > 12 billion documents in Mongo, storage is ~3TB per node
- Can easily sustain an insert speed of 8k documents/second, often burst to 50k/sec
- A single java client can sustain 10MB/sec read over the backend (gigabit) network to one mongod. Four readers from the same client pull 40MB/sec over the same pipe
- Every type of retrieval has become significantly faster than the MySQL implementation:
- example fetch time reduced from 400ms to 60ms
- dictionary entries from 20ms to 1ms
- document metadata from 30ms to 0.1ms
- spelling suggestions from 10ms to 1.2ms
- Mongo's built-in caching allowed them to remove the memcached layer and speed up calls by 1-2ms.
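A quick back-of-envelope check of the traffic figures above, converting the quoted hourly rates to per-second rates:

```python
# Back-of-envelope check of the traffic figures quoted above.
avg_per_hour = 500_000             # average requests/hour served by Mongo
peak_per_hour = 4 * avg_per_hour   # peak traffic is 4x average

avg_per_sec = avg_per_hour / 3600    # ~139 requests/sec on average
peak_per_sec = peak_per_hour / 3600  # ~556 requests/sec at peak
```

That peak figure (~556 req/sec) is where the "550 req/sec" number in the comment thread below comes from.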
- From MySQL to MongoDB - Migrating to a Live Application by Tony Tam
- An explanation of their experiences moving from MySQL to MongoDB.
- Wordnik stores a corpus of words, hierarchical data, and user data. The MySQL design was far more complex and required a complex caching layer to perform well. With MongoDB the system is 20x faster: there are no joins and no need for a caching layer. The whole system is simpler.
- Wordnik is primarily a read-only system and performance is limited mainly by disk speed.
- They use dual quad-core 2.4GHz Intel CPUs with 72GB RAM. The servers are physical, run in master-slave mode, and use 5.3TB LUNs on DAS. They found virtual servers didn't have the I/O performance they needed.
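The read-throughput numbers quoted earlier (10MB/sec per Java client reader, 40MB/sec with four readers) scale linearly, which suggests the gigabit network, not mongod, was the next ceiling. A rough sketch of that arithmetic, with the 125 MB/s gigabit figure being an idealized assumption that ignores protocol overhead:

```python
# Rough reader-scaling arithmetic from the figures quoted above.
per_reader_mb_s = 10    # each Java client reader pulls ~10 MB/sec from one mongod
gigabit_mb_s = 125      # 1 Gbit/s ~= 125 MB/s, ignoring protocol overhead (assumption)

four_reader_throughput = 4 * per_reader_mb_s            # 40 MB/sec, as observed
readers_to_saturate = gigabit_mb_s // per_reader_mb_s   # ~12 readers would fill the pipe
```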
- Keeping the Lights On with MongoDB by Tony Tam
- A presentation on how they use and manage MongoDB.
- Wordnik API
- They've rewritten their REST API in Scala. Scala has helped them remove a lot of code and standardize "traits" throughout the API.
- MongoDB Admin Tools
- Wordnik has built some tools to manage large deployments of MongoDB and has open sourced them.
- Wordnik Bypasses Processing Bottleneck with Hadoop
- They add 8,000 words per second to their corpus.
- Map-reduce jobs are run on Hadoop to offload MongoDB and prevent any blocking queries. Data is append-only so there's no reason to hit MongoDB.
- Incremental updates to their data are stored in flat files, which are periodically imported into HDFS.
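Wordnik's actual jobs run on Hadoop over HDFS; as a purely illustrative stand-in for that append-only processing pattern, here is a minimal in-process map/reduce word count over flat-file lines, the classic shape of such a job:

```python
from collections import Counter
from itertools import chain

# Illustrative stand-in for a Hadoop word-count style job: map each line
# of an append-only flat file to (word, 1) pairs, then reduce by summing
# counts per word. Wordnik's real jobs run on Hadoop, not in Python.

def map_line(line):
    """Map step: emit a (word, 1) pair for each word in the line."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

lines = ["the quick brown fox", "the lazy dog", "The fox"]
counts = reduce_counts(chain.from_iterable(map_line(l) for l in lines))
```

Because the corpus data is append-only, jobs like this can run entirely off flat files and HDFS without issuing a single query against MongoDB.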
Overall impressions:
- Wordnik had a very specific problem to solve and set out to find the best tool that would help them solve that problem. They were willing to code around any faults they found and figure out how to make MongoDB work best for them. Performance requirements drove everything.
- After performance, the naturalness of the document data model seemed to be the biggest win for them. The ability to easily model complex data hierarchically, and have it perform well, reverberated across the system.
- Code is now: faster, more flexible, and dramatically smaller.
- They have settled on specialized tools for the job. MongoDB is responsible for document storage for runtime data. Hadoop is responsible for analytics, data processing, reporting, and time-based aggregation.
Reader Comments (4)
As far as I can tell, Wordnik is a read-mostly application which doesn't need much consistency. "They have a specific problem", as quoted in the article. NoSQL fan-boys, please don't simply use this to attack MySQL. :-)
Peak usage is 2M req/hr.
That's only 550 req/sec. It's really not that much. I've seen MySQL doing MUCH higher throughput than that.
If they couldn't even make MySQL do just 550 req/sec, I bet they just had a badly designed database.
@John, the issue we had with mysql was the combination of heavy reads plus writes. Combine 550 req/sec + 8k inserts/sec and the equation changes dramatically.
I've said many times, it is very likely that someone could have tuned our mysql deployment to support this. We did it on our own with mongodb easily, and in the process got a number of huge benefits, as covered in the blog posts.
Didn't it read 500k/s not 500/s?