Entries in SimpleDB (5)

Saturday
May 10, 2008

Hitting 300 SimpleDB Requests Per Second on a Small EC2 Instance

High Performance Multithreaded Access to Amazon SimpleDB is a great follow up to the idea in How SimpleDB Differs from a RDBMS that more programming is the price paid for performance in SimpleDB. It shows how much work and infrastructure is required to batter better performance out of SimpleDB. Remember, in SimpleDB queries return keys to records, so if you want all the fields of a record you need to make separate requests. Since SimpleDB isn't exactly a speed demon, the obvious strategy is to parallelize. Even if a job takes 100 msecs you can get a lot done in a little time if you can execute enough jobs in parallel. Parallelization is the approach taken by Haakon@AWS in his Java code example of how to get the most out of SimpleDB. You can find the code at Indexing and Querying Amazon S3 Metadata with Amazon SimpleDB. We'll also consider how a back-end service architecture built on Erlang may be a better fit with cloud computing.

Two general mechanisms of parallelism are available: threads and boxes. To get the most bang out of a single machine you need threads (events, etc). To scale beyond the load handled by a single machine you need multiple boxes. The example code uses Java's Executor thread pool for parallelism within a program. Thread pools are a pretty common idiom by now. Amazon's queue service SQS was used to distribute work amongst boxes. Work was queued to SQS in batches of 1000 work items. The items were pulled by the thread pool and processed. Why 1000? The idea is to balance processing overhead with work overhead. You don't want popping items off SQS to dominate your processing time, so you have to do enough work in each pass to make it worth the investment.

The architecture uses two thread pools: one to run queries and one to get record values. Applications must carefully tune the number of threads in each pool so the queries don't overwhelm the gets. Using a query thread pool with 2 threads and a get thread pool with 32 threads it was possible to perform 300 TPS on a small EC2 instance. Theoretically the advantage of this architecture is that it will scale to any size you need. SQS is your work distribution backbone and you just spin up the number of thread pool instances you need. The disadvantage is that this is a lot of programmer effort. Granted, if you had to do some serious processing on each record you would need something like this approach anyway to scale out the processing. But for simple aggregation operations it's total overkill, which is why more time needs to be spent on the write side of the equation with SimpleDB/BigTable than on the read side, as we are used to doing with a RDBMS.
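
To make the two-pool layout concrete, here's a minimal sketch along the lines of the approach described above. Only the java.util.concurrent pieces are standard; the SimpleDbClient interface and its methods are hypothetical stand-ins for whatever SimpleDB wrapper you use, not the actual AWS library.

    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Hypothetical wrapper around SimpleDB's REST API -- not the real AWS client.
    interface SimpleDbClient {
        List<String> query(String domain, String queryExpression);               // returns item names only
        Map<String, List<String>> getAttributes(String domain, String itemName); // one round trip per record
    }

    public class ParallelReader {
        // Few query threads, many get threads: tune so the queries don't overwhelm the gets.
        private final ExecutorService queryPool = Executors.newFixedThreadPool(2);
        private final ExecutorService getPool   = Executors.newFixedThreadPool(32);
        private final SimpleDbClient sdb;

        public ParallelReader(SimpleDbClient sdb) { this.sdb = sdb; }

        // In the full architecture each (domain, query) work item would be pulled
        // from SQS in batches of 1000 rather than submitted directly.
        public void submit(final String domain, final String queryExpression) {
            queryPool.submit(() -> {
                // Step 1: the query returns item names only...
                for (final String itemName : sdb.query(domain, queryExpression)) {
                    // Step 2: ...so each record body costs a separate GetAttributes
                    // call, which is why the get pool needs far more threads.
                    getPool.submit(() -> process(sdb.getAttributes(domain, itemName)));
                }
            });
        }

        private void process(Map<String, List<String>> attributes) {
            // application-specific work on one record goes here
        }
    }
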
What's the best way to go parallel? On the front-end life is simple. Go shared nothing and compose your pages from scalable back-end services. This is how Amazon does it and it's how Google AppEngine does it. GAE completely punts on the back-end service layer architecture. Unfortunately we still need to create a back-end architecture for more complex applications. Thread pools and SQS is one parallelization approach. Instead of thread pools, something like Java's fork/join framework could be used. Initially I thought piling more low level threading primitives into Java was the wrong way to go. Yes, it is a "multicore-friendly lightweight parallel framework" that supports a style of parallel programming where problems are recursively split into smaller fragments, solved in parallel and recombined, but it's also a style of programming that is very difficult to program correctly. If cloud architectures rely on these primitives for efficiency then I think we have regressed. The Erlang-style architecture described by Luke Hoersten in Scalable Web Apps: Erlang + Python is a simpler, more reliable programming model. An event driven, actor based approach is much harder to screw up than closely cooperating threads in a shared memory space. Erlang originally ran in embedded systems where the requirement was to reliably squeeze the most work possible out of limited CPU and other compute resources. Oddly enough, the embedded node of old closely parallels your basic cloud VM. Start your workhorse Erlang (or other similar system) instances and let them efficiently chew up your workloads. Erlang's scheduling model fits perfectly with a service centric job engine cloud instance. It will get more work done than your typical thread based system ever would.


Monday
Apr 21, 2008

The Search for the Source of Data - How SimpleDB Differs from a RDBMS

Update 2: Yurii responds with the Top 10 Reasons to Avoid Document Databases FUD. Update: Top 10 Reasons to Avoid the SimpleDB Hype by Ryan Park provides a well written counter take. Am I really that fawning? If so, doesn't that make me a dear?

All your life you've used a relational database. At the tender age of five you banged out your first SQL query to track your allowance. Your RDBMS allegiance was just assumed, like your politics or religion would have been assumed 100 years ago. They now say--you know them--that relations won't scale and we have to do things differently. New databases like SimpleDB and BigTable are what's different. As a long time RDBMS user what can you expect of SimpleDB? That's what Alex Tolley of MyMeemz.com set out to discover. Like many brave explorers before him, Alex gave a report of his adventures to the Royal Society of the AWS Meetup. Alex told a wild, almost unbelievable tale of cultures and practices so different from our own you almost could not believe him. But Alex brought back proof.

Using a relational database is a no-brainer when you have a big organization behind you. Someone else worries about the scaling, the indexing, backups, and so on. When you are out on your own there's no one to hear you scream when your site goes down. In these circumstances you just want a database that works and that you never have to worry about again. That's what attracted Alex to SimpleDB. It's trivial to set up and use, no schema is required, you can insert data on the fly with no upfront preparation, and it will scale with no work on your part. You become free from DIAS (Database Induced Anxiety Syndrome). You don't have to think about or babysit your database anymore. It will just work. And from a business perspective your database becomes a variable cost rather than a high fixed cost, which is excellent for the angel food funding. Those are very nice features in a database. But for those with a relational database background there are some major differences that take getting used to:

  • No schema. You don't have to define a schema before you use the database. SimpleDB is an attribute-value store and you can use any attributes you like, any time you like. It doesn't care. Very different from the Victorian world of the RDBMS.
  • No joins. In relational theory the goal is to minimize update and deletion anomalies by normalizing your data into separate tables related by keys. You then join those tables together when you need the data back. In SimpleDB there are no joins. For many-to-1 relationships this works out great: attributes can have multiple values, so there's no need to do a join to recover all the values. They are stored together. For many-to-many relationships life is not so simple. You must code them by hand in your program. This is a common theme in SimpleDB. What the RDBMS does for you automatically must generally be coded by hand with SimpleDB. The wages of scale are more work for the programmer. What a surprise.
  • Two step query process. In a RDBMS you can select which columns are returned in a query. Not so in SimpleDB. A query returns only record IDs, not the values of the records. You need to make another trip to the database to get the record contents. So to minimize your latency you would need to spawn off multiple threads. See, more work for the programmer.
  • No sorting. Records are not returned in a sorted order. Values for multi-valued attributes are not returned in sorted order. That means if you want sorted results you must do the sorting, and it also means you must get all the results back before you can do the sorting. More work for the programmer.
  • Broken cursor. SimpleDB only returns 250 results at a time. When there are more results you cursor through the result set using a token mechanism. The kicker is you must iterate through the result set sequentially, so iterating through a large result set will take a while, and you can't use your secret EC2 weapon of massive cheap CPU to parallelize the process. More work for the programmer, because you have to move logic to the write part of the process instead of the read part; you'll never be able to read fast enough to perform your calculations in a low latency environment.
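
Here's a minimal sketch of what the two step query, the token cursor, and the client side sort look like in code. The SimpleDbClient and QueryResult types are hypothetical stand-ins for whatever SimpleDB wrapper you use, and sorting on a "name" attribute is just an example.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    // Hypothetical types standing in for a SimpleDB wrapper -- not the real SDK.
    class QueryResult {
        List<String> itemNames;   // one page of record ids
        String nextToken;         // null when there are no more pages
    }

    interface SimpleDbClient {
        QueryResult query(String domain, String expr, String nextToken);          // page of ids
        Map<String, List<String>> getAttributes(String domain, String itemName);  // record body
    }

    public class SortedFetch {
        static List<Map<String, List<String>>> fetchSortedByName(SimpleDbClient sdb,
                                                                 String domain, String expr) {
            List<Map<String, List<String>>> records = new ArrayList<>();
            String token = null;
            do {
                QueryResult page = sdb.query(domain, expr, token);     // step 1: ids only, ~250 per page
                for (String itemName : page.itemNames) {
                    records.add(sdb.getAttributes(domain, itemName));  // step 2: one trip per record
                }
                token = page.nextToken;                                // pages must be walked sequentially
            } while (token != null);
            // Step 3: sort client side -- only possible once every result is back.
            records.sort((a, b) -> a.get("name").get(0).compareTo(b.get("name").get(0)));
            return records;
        }
    }
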
The promise of scaling is fulfilled. Alex tested retrieving 10 record ids from 3 different database sizes. Using a 1K record database it took an average of 141 msecs to retrieve the 10 record ids. For a 100K record database it took 266 msecs on average. For a 1000K record database it took an average of 433 msecs. It's not fast, but it is relatively consistent. That seems to be a theme with these databases; BigTable isn't exactly a speed demon either. One could conclude that for certain needs at least, SimpleDB scales sufficiently well that you can feel comfortable your database won't bottleneck your system or cause it to crash under load.

If you have a complex OLAP style database SimpleDB is not for you. But if you have a simple structure, you want ease of use, and you want it to scale without your ever lifting a finger again, then SimpleDB makes sense. The cost is that everything you currently know about using databases is useless, and all the cool things we take for granted that a database does, SimpleDB does not do. SimpleDB shifts work out of the database and onto programmers, which is why the SimpleDB programming model sucks: it requires a lot more programming to do simple things. I'll argue however that this is the kind of suckiness programmers like. Programmers like problems they can solve with more programming. We don't even care how twisted and inelegant the code is, because we can make it work. And as long as we can make it work we are happy. What programmers can't do is make the database scalable through more programming. Making a database scalable is not a solvable problem through more programming. So for programmers the right trade off was made: a scalable database you don't have to worry about, for more programming work you already know how to do. How does that sound?

Related Articles

  • The new attack on the RDBMS by techno.blog("Dion")
  • The End of an Architectural Era (It’s Time for a Complete Rewrite) - A really fascinating paper bolstering many of the anti-RDBMS threads that have popped up on the intertube.


Monday
Mar 17, 2008

Microsoft's New Database Cloud Ready to Rumble with Amazon

Update: ZDNet says Ozzie signals Microsoft’s surrender to the cloud. CD-ROMs are to the internet as the internet is to the cloud, and Microsoft aims to scratch and claw its way into this paradigm shift as well. The gloves are off. The tag line for Microsoft's new SQL Server Data Services is Your Data, Any Place, Any Time. Thems fighten' words. Microsoft is itch'n for a fight! Who will be Amazon's second?

The service description: SQL Server Data Services (SSDS) are highly scalable, on-demand data storage and query processing utility services. Built on robust SQL Server database and Windows Server technologies, these services provide high availability, security and support standards-based web interfaces for easy programming and quick provisioning. Sounds like a fast uppercut aimed squarely at SimpleDB's jaw. As a developer what do you need to know?

  • Highly available and highly scalable.
  • Targeted at applications that can tolerate high internet latencies. Primarily read-mostly data sets "storing data that is naturally partitionable into disjoint data sets requiring little or no cross-correlations."
  • Pricing isn't yet available.
  • No announced SLA, though you will be notified of downtime.
  • Access is via REST or SOAP.
  • Data is redundantly stored and backed up.
  • Geo-redundant data copies are maintained. Data is stored in large storage clusters in various Microsoft data centers located across North America, with plans to expand to more areas later.
  • The data model is flexible. No schemas required. "Simply add new attributes to your data set when needed, and the system will automatically store, index, and query your data accordingly."
  • String, numeric, datetime, and boolean data types are supported. All attributes are indexed.
  • The data model is: Customer → SSDS account (1..N) → Authority (1..N) → Container (0..N) → Entity (0..N). (A minimal modeling sketch appears after this list.)
    - Authorities give billing entities a way to organize their usage for accounting, security and co-location purposes. All containers under a single authority are provisioned within the same data center, so authorities are the unit of geo-scale and geo-location. For example: Seattle or San Francisco.
    - Containers create contexts and scope for entity storage and query. For example, within its authorities, operations could choose to assign each member their own container, intended to contain a set of personal data for that member. Containers are the unit of consistency in the Microsoft SSDS service. For example: Autos for Sale, Services Offered.
    - Entities are the fundamental unit of storage in the system. Entities are a bag of scalars with no enforced type. For example, an individual member's jobs, educational institutions, contacts, recommendations, etc. could all be modeled as entities.
  • Data manipulation operations include:
    - Creation and deletion of containers. There are no updatable container properties.
    - Creation, replacement, and deletion of entities.
    - Retrieval of a single container in a serialized format.
    - Retrieval of a single entity in a serialized format.
  • A text based query language is supported that follows the LINQ pattern for C#. Only complete entities are returned. For example, a query addressed to a container to retrieve all entities having a “City” property equal to “Seattle” and a “State” property equal to “WA” would be written as:
        from e in entities where e[“City”] == “Seattle” && e[“State”] == “WA” select e
  • Resource queries are supported. This URI returns all the entities in that container: http://mydomain.ssds.microsoft.com/ChildrensBooksContainer1
  • Customers will be able to associate entities with blobs which could be accessed as URL addressable resources.
  • SSDS is not just a hosted version of SQL Server. It's SQL Server deconstructed, running on blade servers with SATA drives.
  • A local on-premise version of SSDS will also be available to "allow users to better synchronize between the enterprise and the cloud, especially when handling reporting, analytics and business-intelligence tasks."
  • Not sure if you can do a full text search on a string property.
  • Not sure how large attributes can be. Should I store my 25MB documents as a string field so they can be searched? Or is that too expensive? Or is it even supported?

Where Microsoft is behind Amazon is that they don't have EC2 and S3. Microsoft is having you traverse the internet for each database access. Now let's say you have chunks of large storage stored off in a storage cloud. That's a lot of overhead. A major benefit of AWS is the free bandwidth and higher performance within the AWS cloud. If your application is completely hosted within the cloud the only slow part is from the server to the user's browser. All-in-all SSDS seems very comparable to and competitive with SimpleDB. Hard to say without solid pricing information. But it will be interesting to see how Microsoft's cloud strategy evolves and how we can all benefit from the competition.
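
To make the authority / container / entity hierarchy above concrete, here's a minimal modeling sketch using plain Java maps. It only illustrates the shape of the data model (typeless bags of scalars grouped into containers and authorities); every name in it is made up and it is not the SSDS API.

    import java.util.HashMap;
    import java.util.Map;

    public class SsdsModelSketch {
        // An entity is a bag of scalars with no enforced type: attribute name -> value.
        static Map<String, Object> entity(Object... pairs) {
            Map<String, Object> e = new HashMap<>();
            for (int i = 0; i < pairs.length; i += 2) {
                e.put((String) pairs[i], pairs[i + 1]);
            }
            return e;
        }

        public static void main(String[] args) {
            // A container scopes entities and is the unit of consistency.
            Map<String, Map<String, Object>> autosForSale = new HashMap<>();
            autosForSale.put("car-42", entity("Make", "Honda", "Year", 2006, "Sold", false));

            // An authority groups containers for billing and geo-location;
            // all of its containers live in the same data center.
            Map<String, Map<String, Map<String, Object>>> seattleAuthority = new HashMap<>();
            seattleAuthority.put("AutosForSale", autosForSale);

            System.out.println(seattleAuthority);
        }
    }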

Related Articles

  • Home Page for SSDS
  • The Current Pros and Cons List for SimpleDB
  • Microsoft SQL Server Data Services
  • SQL Server Data Services FAQ
  • Microsoft launches its alternative to Amazon’s SimpleDB


Friday
Dec 14, 2007

The Current Pros and Cons List for SimpleDB

Not surprisingly, opinions on SimpleDB vary from "it sucks, don't take my database" to "it will change the world, who needs a database anyway?" From a quick survey of the blogosphere, here's where SimpleDB stands at the moment:

SimpleDB Cons

  • No SLA. We don't know how reliable it will be, how fast it will be, or how consistent the performance will be.
  • Consistency constraints are relaxed. Reading data immediately after a write may not reflect the latest updates. To programmers used to transactions, this may be surprising, but many people think this is one of the tradeoffs that needs to be made to scale.
  • Database is a core competency. If you don't control your database you can't out-compete your competition.
  • When your database is out of your control you can't guarantee it will work properly. You can't create the proper indexes and other optimizations.
  • No join or IN operator. You'll need to do multiple client side calls to simulate joins, which will be slow.
  • No stored procedures, referential integrity, and other relational goodies. This is not a professional product.
  • Attribute size limited to 1024 bytes. It's not designed for content serving.
  • Latency from outside Amazon will be high.
  • Setting up and maintaining a database is cheap and easy these days, so why bother? It costs too much compared to running your own servers.
  • What happens when you need to super scale to very large datasets?
  • No API support from common languages like PHP, Ruby, etc.
  • All your existing code and infrastructure needs to be rewritten.
  • Not geographically distributed with nearest datacenter routing.
  • Queries are lexicographic, so you’ll need to store data in lexicographical order. This means, says Inside Looking Out, zero-padding your integers, adding positive offsets to negative integer sets, and converting dates into something like ISO 8601 (see the encoding sketch after this list).
  • Attribute values are typeless which could lead to a lot of typing related errors and inefficient queries.
  • The 10 GB maximum per domain is too limiting.
  • It's not Dynamo. Amazon is keeping the really good stuff to themselves.
  • Text searching is not supported. You'll need to construct your own fast search indexes.
  • Queries are limited to 5 seconds running time. It's only for getting and setting, nothing more SQLish.
  • No cloned APIs for unit testing. Need to be able to develop locally against other data stores.
  • Your data is under Amazon's control, so there could be security and privacy problems.
  • The XML based protocol unnecessarily increases overhead, latency, and cost.
  • Lockin. If you decide to leave Amazon’s cloud how do you move all your data and get a similar system up and working outside the cloud?
  • Open cash register. Since SDB is charged per use, a malicious user can simply set up a loop to query your site, which costs you an unbounded amount of money.
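
A minimal sketch of the lexicographic encodings mentioned above: zero-pad integers, shift negative ranges into positive space before padding, and store dates as ISO 8601 so string order matches the order you actually want. The widths and offsets are arbitrary examples.

    import java.time.Instant;
    import java.time.format.DateTimeFormatter;

    public class LexEncoding {
        // Pad to a fixed width so "2" sorts before "10" ("0000000002" < "0000000010").
        static String encodeUnsigned(long value, int width) {
            return String.format("%0" + width + "d", value);
        }

        // Add an offset so the whole range is non-negative before padding,
        // e.g. values in [-1000, 1000] map to [0, 2000].
        static String encodeWithOffset(long value, long offset, int width) {
            return encodeUnsigned(value + offset, width);
        }

        // ISO 8601 UTC timestamps sort chronologically as plain strings.
        static String encodeTimestamp(Instant t) {
            return DateTimeFormatter.ISO_INSTANT.format(t);
        }

        public static void main(String[] args) {
            System.out.println(encodeUnsigned(42, 10));           // 0000000042
            System.out.println(encodeWithOffset(-17, 1000, 10));  // 0000000983
            System.out.println(encodeTimestamp(Instant.parse("2008-05-10T12:00:00Z")));
        }
    }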

SimpleDB Pros

  • SimpleDB is not a relational database. Relational databases are too complex and don't scale well. Keeping data access simple is a selling point, not a weakness.
  • Low setup costs and pay-as-you-go expansion make it perfect for startups. The price is reasonable given the functionality and the hands off admin.
  • Setting up and maintaining a highly available clustered database that is constantly growing is extremely difficult. Building your application on a building block that does all this for you adds a lot of value.
  • Setting up a database inside EC2 is a pain. SimpleDB makes getting basic database functionality trivial. No need to worry about scaling, capacity planning, or partitioning.
  • It has a decent query language, which is unusual for this type of data store.
  • Data is stored across multiple nodes, which supports parallel query execution.
  • It's built on Erlang and that's cool.
  • You don't need to seek funding to hire a database team and buy hardware.

Depending on how you weight each factor, SimpleDB could be way behind or way ahead of other options. What's interesting is to see what people think is important. For many people the only real database is relational, and if it doesn't have transactions, joins, etc., it's not real. Databases, like beauty, seem to be in the eye of the beholder.


Thursday
Dec 13, 2007

Amazon SimpleDB - Scalable Cloud Database

Amazon has announced the limited beta of Amazon SimpleDB - a simple web services interface to create and store multiple data sets, query your data easily, and return the results. Together with the Simple Storage Service (S3), Elastic Compute Cloud (EC2) and other web services, Amazon offers a complete utility computing platform. SimpleDB was the missing piece of AWS - the scalable structured database. Check out my blog entry: http://innowave.blogspot.com/2007/12/amazon-simpledb-scalable-cloud-database.html I was waiting for this one :-) Geekr
