Entries by Todd Hoff (380)

Wednesday
Jun 10, 2009

Dealing with multi-partition transactions in a distributed KV solution

I've been getting asked about this a lot lately so I figured I'd just blog about it. Products like WebSphere eXtreme Scale work by taking a dataset, partitioning it using a key, and then assigning those partitions, each usually consisting of a primary and a replica 'shard', to a number of JVMs. A transactional application typically interacts with the data on a single partition at a time, which means the transaction is executed in a single JVM. A server box can do M of those transactions per second, and it scales because N boxes do M×N transactions per second. Increase N and you get more transactions per second. Availability is very good because a transaction only depends on 1 of the N servers that are currently online; any of the other N-1 servers can go down or fail with no impact on the transaction. So, single partition transactions can scale indefinitely from a throughput point of view, offer very consistent response times, and are very available because each one only touches a small part of the grid at once.
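To make the single-partition path concrete, here's a minimal conceptual sketch in Python. It is not the WebSphere eXtreme Scale API; the class and method names are made up for illustration, and each in-memory dict stands in for a shard hosted in some JVM.

    # Conceptual sketch of key-based partitioning; not the WebSphere eXtreme Scale API.
    import hashlib

    class PartitionedGrid:
        def __init__(self, num_partitions):
            # each dict stands in for a primary shard hosted in some JVM
            self.partitions = [dict() for _ in range(num_partitions)]

        def partition_for(self, key):
            # deterministic key -> partition mapping: hash the key, mod the partition count
            digest = hashlib.md5(str(key).encode()).hexdigest()
            return int(digest, 16) % len(self.partitions)

        def put(self, key, value):
            self.partitions[self.partition_for(key)][key] = value

        def get(self, key):
            # a single-partition transaction: only one shard (one server) is involved
            return self.partitions[self.partition_for(key)].get(key)

    grid = PartitionedGrid(num_partitions=8)
    grid.put("1234", {"username": "billy", "password_hash": "..."})
    print(grid.get("1234"))   # touches exactly one partition

Because each get or put is routed by the account key to exactly one partition, adding boxes adds independent units of work, which is where the M×N throughput comes from.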

All-partition transactions are different. A simple example might be that we are storing bank accounts in a grid. The account key is the bank account number. The value is an account object with the user's online username and password, address, portal profile, bank account information, etc. Almost all access to the account is by account number. Now, let's look at the login process for the bank's portal. The user doesn't log in with their account number, they log in with their username. We have not partitioned on username, we partitioned on account number, and did so for good reason: every other transaction type is keyed on account number.

So, given that we can't easily look up a record using the username, what can we do? Option 1: do a parallel search across all partitions to find account objects whose username attribute is 'billy'. We can use a MapGridAgent in WebSphere eXtreme Scale to do this. The agent code will be executed in parallel across all partitions. Within each partition it will run a query to find any accounts with a username of 'billy'. One account object should match across the whole grid, and the client which called the agent should receive the account number as the result. Problem solved!
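Conceptually the agent-based lookup is a scatter-gather: the same predicate runs in every partition and the client waits for all of the answers. A rough Python sketch, continuing the hypothetical PartitionedGrid above (illustrative only, not the MapGridAgent API):

    # Scatter-gather search across every partition; the client blocks until the
    # slowest partition has answered. Illustrative only, not the WXS agent API.
    from concurrent.futures import ThreadPoolExecutor

    def search_partition(partition, username):
        # runs "inside" one partition: scan that partition's accounts for the username
        return [key for key, acct in partition.items() if acct.get("username") == username]

    def parallel_search(grid, username):
        with ThreadPoolExecutor(max_workers=len(grid.partitions)) as pool:
            futures = [pool.submit(search_partition, p, username) for p in grid.partitions]
            matches = []
            for f in futures:          # waiting on every future = waiting on the slowest partition
                matches.extend(f.result())
            return matches             # at most one account number should match

    print(parallel_search(grid, "billy"))   # -> ["1234"]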

Not so fast. Let's examine this parallel search. How long does it take to run? The client instructs each partition to execute the search code. These searches run in parallel and the client blocks until they all return, so the client basically waits for the slowest partition or server before it continues. How many of these lookup transactions can the grid perform per second? As many as the slowest box can do. If the number of accounts were to double, we could double the size of the grid. This lets us store twice as many accounts, but what about the effect on our parallel search? It's true we are searching twice as fast as before (double the CPUs), but there is also twice as much data to search through, so we are probably seeing the same response time as before. What about throughput? We can still only do as many of these transactions per second as the slowest machine, even though we doubled the size of the grid. The grid is scaling in terms of account capacity and records searched per second, but the throughput of the search operation is not scaling at all.
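A toy model of the arithmetic, under the assumption that every box can process M operations per second:

    # Toy throughput model: M = ops/sec one box can do, N = number of boxes.
    M = 1000
    for N in (10, 20, 40):
        single_partition_tps = M * N   # each transaction touches one box, so boxes add up
        all_partition_tps = M          # each transaction touches every box, so the whole
                                       # grid still completes only ~M of them per second
        print(N, single_partition_tps, all_partition_tps)
    # Doubling N doubles single-partition throughput but leaves all-partition throughput flat.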

Availability is also impacted when compared with single partition transactions. A single partition transaction only uses a single partition/server. An every-partition transaction needs the whole grid to be up to complete, so the failure of a single box will delay the transaction from completing. Now, products like WebSphere eXtreme Scale will recover from a failure very quickly (typically sub-second), but on a large enough grid you'll see response time glitches where maybe a second or so is added if, say, the admins are cycling through servers doing maintenance. This delay is much less likely to hit a single partition transaction: there's a 1/N chance of it happening, much better than the 100% chance with an every-partition transaction.

This lack of throughput scalability for every-partition transactions is a problem, because login is an operation whose throughput needs to go up as the web site becomes more popular. So, using parallel search for an operation which needs to scale from a throughput point of view is a bad idea. What else can we do?

We could partition using username instead of account number, but then we'd have the same search problem for all the account-number-based transactions, which are the bulk of all transactions. Besides, users like being able to change their username, which would be a nightmare if everything was keyed on usernames.

We could cache the results of looking up usernames with parallel searches. The cache would be a Map whose key is the username and whose value is the account number. A Loader attached to the Map would do a parallel search with a MapGridAgent when its Loader#get method is called on a cache miss. The problem here is that while we warm up the cache we'll be getting a lot of cache misses and therefore a lot of parallel searches. Not good either.

Or, we could maintain a persistent reverse index. This index is a Map whose key is the username and whose value is the account id. The Map is backed by a database table or other long-term persistence mechanism. Now, when a user logs in, we simply do a Map.get("billy") and receive the account id with a single partition transaction, and the throughput of those does scale with grid size. We do have to maintain this reverse index: if the user changes their username, the index entry must be updated as well, and so on.

Login is now a matter of looking up the username in the reverse index map (revMap.get("billy") returning 1234) and then retrieving the account object with a second get to check the password (accMap.get("1234") returning the account object with the password). This is a much better solution than a parallel search. It is effectively a query cache: we are caching the results of the parallel search using a persistent map. We have converted a parallel transaction into a single partition transaction, and as a result our login operation is now throughput scalable.
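Here is what the registration and login paths look like with the reverse index, again as an illustrative Python sketch reusing the hypothetical PartitionedGrid from above (map and field names are made up):

    # Login via a persistent reverse index: two single-partition gets instead of
    # a scatter-gather over every partition. Illustrative sketch only.

    def register(acc_map, rev_map, account_id, username, password_hash):
        acc_map.put(account_id, {"username": username, "password_hash": password_hash})
        rev_map.put(username, account_id)      # maintain the reverse index on every change

    def login(acc_map, rev_map, username, password_hash):
        account_id = rev_map.get(username)     # single-partition get on the reverse index
        if account_id is None:
            return None
        account = acc_map.get(account_id)      # single-partition get on the account map
        if account and account["password_hash"] == password_hash:
            return account_id
        return None

    acc_map, rev_map = PartitionedGrid(8), PartitionedGrid(8)
    register(acc_map, rev_map, "1234", "billy", "...")
    print(login(acc_map, rev_map, "billy", "..."))   # -> "1234"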

Multi-partition transactions can be great for searching large amounts of data in parallel. The records searched per second does increase with the grid size. Larger grids can store larger amounts of data, but the throughput typically stays the same as the grid grows (assuming the data size grows linearly with grid size). This means using parallel operations for something whose throughput must grow as your application scales up is a mistake: that throughput has nothing to do with the grid size, it's limited to the throughput of the slowest box.

You need to convert that parallel search operation to a single partition get if you want the system to scale from a throughput point of view. Caching the parallel searches OR using a reverse index (effectively this is a disk persistent query cache) is the normal way to handle this conversion.

How can you make an every-partition operation scale from a throughput point of view if you can't use reverse indexes? Use multiple grids which are all identical and round-robin the requests over them. Each grid will be able to do M transactions per second and N grids give you M×N per second. If you need throughput-scalable every-partition transactions then this is probably the only way to make them scale. Ever wonder why Google needs millions of servers...
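A sketch of the round-robin-over-identical-grids idea, again purely illustrative and building on the earlier sketches:

    # Round-robin all-partition searches over R identical grids. Each grid can do
    # roughly M such searches per second, so R grids give roughly M * R per second.
    import itertools

    class ReplicatedGrids:
        def __init__(self, grids):
            self.grids = grids                       # each entry is a full copy of the data
            self._round_robin = itertools.cycle(self.grids)

        def search(self, username):
            grid = next(self._round_robin)           # pick the next grid in rotation
            return parallel_search(grid, username)   # scatter-gather within that one grid

    # In practice each entry would be a separately loaded, identical grid;
    # here we just reuse the sketch's grid twice for illustration.
    replicas = ReplicatedGrids([grid, grid])
    print(replicas.search("billy"))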

This article is really talking about transactions that involve every partition, like a search. Some transactions may use two partitions, or some other small number relative to the total, but that's for another blog entry...

Wednesday
Jun 10, 2009

Paper: Graph Databases and the Future of Large-Scale Knowledge Management

Relational databases, document databases, and distributed hash tables get most of the hype these days, but there's another option: graph databases. Back to the future it seems. Here's a really interesting paper by Marko A. Rodriguez introducing the graph model and its extension to representing the world wide web of data.

Modern day open source and commercial graph databases can store on the order of 1 billion relationships with some databases reaching the 10 billion mark. These developments are making the graph database practical for applications that require large-scale knowledge structures. Moreover, with the Web of Data standards set forth by the Linked Data community, it is possible to interlink graph databases across the web into a giant global knowledge structure. This talk will discuss graph databases, their underlying data model, their querying mechanisms, and the benefits of the graph data structure for modeling and analysis.

Friday
Jun 5, 2009

Google Wave Architecture

Update: Good Vibrations by Radovan Semančík. Lots of interesting questions about how Wave works, scalability, security, RESTyness, and so on.

Google Wave is a new communication and collaboration platform based on hosted XML documents (called waves) supporting concurrent modifications and low-latency updates. This platform enables people to communicate and work together in new, convenient and effective ways. We will offer these benefits to users of Google Wave and we also want to share them with everyone else by making waves an open platform that everybody can share. We welcome others to run wave servers and become wave providers, for themselves or as services for their users, and to "federate" waves, that is, to share waves with each other and with Google Wave. In this way users from different wave providers can communicate and collaborate using shared waves. We are introducing the Google Wave Federation Protocol for federating waves between wave providers on the Internet.

Here are the initial white papers that are available to complement the Google Wave Federation Protocol:

  • Google Wave Federation Architecture

  • Google Wave Data Model and Client-Server Protocol

  • Google Wave Operational Transform

  • General Verifiable Federation

The Google Wave APIs are documented here.

Friday
Jun 5, 2009

HotPads Shows the True Cost of Hosting on Amazon

Mather Corgan, president of HotPads, gave a great talk on how HotPads uses AWS to run their real estate search engine. I loved the presentation for a few reasons:

  • It gives real costs on their servers: how many servers they have, what they are used for, and exactly how they use S3, EBS, CloudFront and other AWS services. This is great information for anybody trying to architect a system and wondering where to run it.
  • HotPads is a "real" application. It's a small company, and at 4.5 million page-views/month it's large but not super large. It has custom server-side components like indexing engines, image processing, and background database update engines for syncing new real estate data. It also stores a lot of images and has low latency requirements.

This is a really good example of the mix many companies have, or would like to have, in their applications.

Their total costs are about $11K/month, which is about what they were paying at their previous provider. I found this a little surprising as I thought the cloud would be more expensive, but they only pay for what they need instead of having to over-provision for transient uses like testing. And some servers aren't necessary anymore: EBS handles backups, so database slave servers dedicated to backups are no longer required.

    There are lots more lessons like this that I've abstracted down below.

    Site: http://hotpads.com - a map-based real estate search engine, listing homes for sale, apartments, condos, and rental houses.

    Stats

  • 800,000 visits/month
  • 4.5 million page-views/month
  • 3.5 million real-estate listings updated daily

    Platform

  • Java
  • MySQL
  • AWS

    Costs

  • EC2 - $7400/month - they run 20 instances of various sizes at any one time. Most work is in the background processing of images, not web serving.
    * $150: 2 Small HAProxy load balancers - 2 for failover; these have the elastic IPs, and round-robin DNS points at the elastic IPs.
    * $1,200: 3-5 Large Tomcat Web Servers - an array of 3 run at night and 5 during the day.
    * $1,500: 5 Large Tomcat Job Servers
    * $900: 1 X-Large, 1 Large Index Server - used to power property search; has several GB of RAM for the JVM
    * $1,200: 1 X-Large 2 Large MySQL masters
    * $1,200: 1 X-Large 2 Large MySQL slaves
    * $300: 1 Large Messaging Server ActiveMQ - will be replaced with SQS
    * $300: 1 Large map tile creation server (Tilecache)
    * $600: Development/testing/migration/ servers
  • S3 - $1500/month - a few hundred million objects holding map files and real-estate listing photos. 4TB of database backups stored as EBS snapshot diffs ($600/month of the total).
  • Elastic Block Storage - $500/month
  • CloudFront - $460/month - used to serve static files, map tiles, and listing photos throughout the world.
  • Elastic IP Addresses - $8/month
  • RightScale - $500/month - used for management and deployment.

    Lessons Learned

  • Major reason for choosing EC2 was the cloud API which allows adding servers at any time. With their previous hosting service they had to prepay for a month at a time, so they would order the minimum necessary to get by that month. That doesn't leave room for development and test servers, preview servers for customers, or live database server upgrades (which require 2x the servers).
  • Overall cost is about the same as with the previous hosting site, but the speed of development and ease of management are night and day different. They get more servers and lots more flexibility.
  • HotPads is a small company and doesn't think the added trouble of colocation is worth it for them yet.
  • Advantage of Amazon over something like Google App Engine is that Amazon allows you to innovate by building your own services on your own machines.
  • S3 is better for larger objects because for small files that are not viewed often the cost of PUTs outweighs everything else. It's not a cache to use for short-lived objects because the PUT costs start to dominate.
    * A 67 KB object (a 600 px image) is roughly the break-even point where the one-time cost of putting the image into S3 equals the cost of storing it.
    * For a 6.7 KB object (a 15 px thumbnail) the PUT cost (the small fee for putting an object into S3) is 10x the storage and transfer costs.
  • Costs have to be figured into the algorithms you use (a worked version of this arithmetic appears after this list).
    * In April, 330 GB of images downloaded at $0.15/GB cost $49. 55 million GETs at $1 per million cost $55. 42 million PUTs at $0.01 per 1,000 cost $420!
    * About $100 for downloads and GETs of map tiles.
    * So S3 is very cheap for larger files; watch out for lots of short-lived small files.
  • CloudFront is 10 times faster than S3 but is more expensive for infrequently viewed files.
    * Makes frequently viewed listings faster.
    * For infrequently viewed listings CloudFront has to go to S3 to get the file the first time, which means you pay twice for a file that will be viewed only once.
  • EBS
    * Used on the database servers because it's faster than local storage (especially for random writes), blocks of data are stored redundantly, and it supports easy backups and versioning via cloning.
    * Only 10% cost overhead.
    * Allowed them to get rid of a second set of slaves: backups were so CPU intensive that they had needed dedicated slaves just to run them. EBS allows snapshots of running drives, so the extra slaves are unnecessary.
    * Databases are I/O bound and the CPU is vastly underutilized so there's extra capacity when you need it.
  • SimpleDB - not using it; it's pretty proprietary. It may be of value because you only pay for what you use, given how underutilized your own database servers can be.
  • Reserved Instances
    * 1 year for the cost of 6 months, and you're guaranteed to get an instance (they were denied an instance one time without a reservation).
    * The con is that a reservation is tied to an instance type, and they want the flexibility to choose instance types as their software changes and to take advantage of new instance types as they are released.
  • Rather than having dedicated memcached machines they've scavenged 8 GB of memory from their existing servers.
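As referenced in the cost lesson above, here is a worked version of the April S3 arithmetic as a small Python sketch. The rates are approximate 2009 S3 prices; note the PUT rate is taken as $0.01 per 1,000 requests so the numbers line up with the $420 figure from the talk.

    # Rough reconstruction of the April S3 bill arithmetic.
    # Assumed rates (approximate 2009 S3 pricing): $0.15/GB transfer out,
    # $1 per million GETs, $0.01 per 1,000 PUTs.
    gb_downloaded = 330
    gets = 55_000_000
    puts = 42_000_000

    transfer_cost = gb_downloaded * 0.15     # ~$49.50
    get_cost = gets / 1_000_000 * 1.00       # ~$55
    put_cost = puts / 1_000 * 0.01           # ~$420

    print(f"transfer ${transfer_cost:.0f}, GETs ${get_cost:.0f}, PUTs ${put_cost:.0f}")
    # PUTs dominate: lots of small, short-lived objects make S3 expensive.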

    Related Sites

  • AWS Start-Up Event DC 2009: HotPads On AWS Slideshow.
  • Cloud Programming Directly Feeds Cost Allocation Back into Software Design
  • AWS Elastic Load Balancer Tutorial

Monday
Jun 1, 2009

    Guess How Many Users it Takes to Kill Your Site?

Update: Here's the first result. Good response time until 400 users. At 1,340 users the response time was 6 seconds. And at 2,000 users the site was effectively dead. An interesting point was that errors that could harm a site's reputation started at 1,000 users. Cheers to the company that had the guts to give this a try.

    That which doesn't kill your site makes it stronger. Or at least that's the capacity planning strategy John Allspaw recommends (not really, but I'm trying to make a point here) in The Art of Capacity Planning:

Using production traffic to define your resource ceilings in a controlled setting allows you to see firsthand what would happen when you run out of capacity in a particular resource. Of course I'm not suggesting that you run your site into the ground, but better to know what your real (not simulated) loads are while you're watching, than find out the hard way. In addition, a lot of unexpected systemic things can happen when load increases in a particular cluster or resource, and playing "find the butterfly effect" is a worthwhile exercise.

    The problem is how do you ever test to such a scale? That's where Randy Hayes of CapCal--a distributed performance testing system--comes in. Randy first contacted me asking for volunteers to try a test of a million users, which sounded like a great High Scalability sort of thing to do. Unfortunately he already found a volunteer so the idea now is to test how many users it takes to find a weakness in your site.

If anyone wants to test their system to the breaking point, the process goes like this:

  • Guess how many users it will take to bring your average response time to two seconds.
  • Contact Randy at randy@capcal.com.
  • Download the CapCal client, record the test, and upload it to the server.
  • At a scheduled time the test will be run by ramping up virtual users until average response time >= 2 seconds
  • You will get a link to the results on the CapCal server. Here's an example result.
  • How close was your guess?
  • The cost will be whatever Amazon charges. An hour's worth of tests with virtual user counts up to 10,000 is about $45.

In the past test generators were fun to write, but it was always difficult to get enough boxes to generate sufficient load. Maybe you remember installing test agents on people's work computers in cubeland so tests could be run overnight while everyone was sleeping?

    The cloud has changed all that. Testing-as-a-Service is one very obvious and solid use of the cloud. You need load? We got your load right here. Spin up more machines and you can drive your site into oblivion, but not in a denial-of-service attack sort of way :-)

Randy has a nice write-up of how their system works in CapCal Architecture and Background. It's similar in concept to other distributed testing frameworks you may have used, only this one operates on AWS and not on your own servers.

    Not everyone is Google or Yahoo with zillions of users to test their software against. If you are interested in testing your site please contact Randy and give it a go. And when you are done it would be fun to have an experience report here about what you learned and what changes you needed to make.
Friday
May 29, 2009

    Is Eucalyptus ready to be your private cloud?


Update: Eucalyptus Goes Commercial with $5.5M Funding Round. This removes my objection that it's an academic project only. Go team go!

Rich Wolski, professor of Computer Science at the University of California, Santa Barbara, gave a spirited talk on Eucalyptus to a large group of very interested cloudsters at the Eucalyptus Cloud Meetup. If Rich could teach computer science at every school the state of the computer science industry would be stratospheric. Rich is dynamic, smart, passionate, and visionary. It's that vision that prompted him to create Eucalyptus in the first place. Rich and his group are experts in grid and distributed computing, having a long and glorious history in that space. When he saw cloud computing on the rise he decided the best way to explore it was to implement what everyone accepted as a real cloud, Amazon's API. In a remarkably short time they implemented Eucalyptus and have been improving it and tracking Amazon's changes ever since.

    The question I had going into the meetup was: should Eucalyptus be used to make an organization's private cloud? The short answer is no. Wait wait, it's now yes, see the update at the beginning of the article.

    The project is of high quality, the people are of the highest quality, but in the end Eucalyptus is a research project from a university. As an academic project Eucalyptus is subject to changes in funding and the research interests of the team. When funding sources dry up so does the project. If the team finds another research area more interesting, or if they get tired of chasing a continuous stream of new Amazon features, or no new grad students sign on, which will happen in a few years, then the project goes dark.

Fears over continuity have at least two solutions: community support and commercial support. Eucalyptus could become a community-supported open source project. This is unlikely to happen though, as it conflicts with the research intent of Eucalyptus. The Eucalyptus team plans to control the core for research purposes and encourage external development of add-on services like SQS. Eucalyptus won't go commercial as university projects must steer clear of commercial pretensions. Amazon is "no comment" on Eucalyptus, so it's not clear what they would think of commercial development should it occur.

Taken together these concerns imply Eucalyptus is not a good base for an enterprise-quality private cloud. Which they readily admit; it's not enterprise ready, Rich repeats. It's not that the quality isn't there. It is and will be. And some will certainly base their private cloud on Eucalyptus, but when making a decision of this type you have to be sure your cloud infrastructure will be around for the long haul. With Eucalyptus that is not necessarily the case. Eucalyptus is still a good choice for its original research purpose, or as a cheap staging platform for Amazon, or as a base for temporary clouds, but as your rock solid private cloud infrastructure of the future Eucalyptus isn't the answer.

    The long answer is a little more nuanced and interesting.

    The primary purpose for Eucalyptus is research. It was never meant to be our little untethered private Amazon cloud. But if it works, why not?

    Eucalyptus is Not a Full Implementation of the Amazon Stack

    Eucalyptus implements most of EC2 and a little of S3. They hope to get community support for the rest. That of course makes Eucalyptus far less interesting as a development platform. But if your use for Eucalyptus is as an instant provisioning framework you are still in the game. Their emulation of EC2 is so good RightScale was able to operate on top of Eucalyptus. Impressive.

But even in the EC2 arena I have to wonder for how long they'll track Amazon development. If you are a researcher, implementing every new Amazon feature is going to get mighty old after a while. It will be time to move on, and if you are dependent on Eucalyptus you are in trouble. Sure, you can move to Amazon, but what about that $1 million data center buildout?

If you are developing software not tied to the Amazon service stack then Eucalyptus would work great.

    As an Amazon developer I would want my code to work without too much trouble in both environments. Certainly you can mock the different services for testing or create a service layer to hide different implementations, but that's not ideal and makes Eucalyptus as an Amazon proxy less attractive.

One of the uses for Eucalyptus is to make Amazon development cheaper and easier by testing code locally without having to deploy into Amazon all the time. Given the size of images, the bandwidth and storage costs add up after a while, so this could make Eucalyptus a valuable part of the development process.

    Eucalyptus is Not as Scalable as Amazon

    No kidding. Amazon has an army of sysadmins, network engineers, and programmers to make their system work at such ginormous scales. Eucalyptus was built on smarts, grit and pizza. It will never scale as well as Amazon, but Eucalyptus is scalable to 256 nodes right now. Which is not bad.

    Rich thinks with some work they already know about it could scale to 5000 nodes. Not exactly Amazon scale, but good enough for many data center dreams.

One big limit Eucalyptus has is the self-imposed requirement to work well in any environment. It's just a tarball you can install on top of any network. They rightly felt this was necessary for adoption. Telling potential customers that they need to set up a special network before they can test this software tends to slow down adoption. By making Eucalyptus work as an overlay they soothed a lot of early adopter pain.

    But by giving up control of the machines, the OS, the disk, and the network they limited how scalable they can be. There's more to scalability than just software. Amazon has total control and that gives them power. Eucalyptus plans to make more invasive and more scalable options available in the future.

    Lacks Some Private Cloud Features

    Organizations interested in a private cloud are often interested in:

  • Control
  • Privacy and Security
  • Utility Chargeback System
  • Instant Provisioning Framework
  • Multi-tenancy
  • Temporary Infrastructure for Proof of Concept for "Real" Provisioning
  • Cloud Management Infrastructure

    Eucalyptus satisfies many of these needs, but a couple are left wanting:
  • The Utility Chargeback System allows companies to bill departments for the resources they use and is a great way to get around a rigid provisioning process while still providing accountability back to the budgeting process. Eucalyptus won't do this for you.
  • A first class Cloud Management Infrastructure is not part of Eucalyptus because it's not part of Amazon's API. Amazon doesn't expose their internal management process. Eucalyptus is adding some higher level management tools, but they'll be pretty basic.

    These features may or may not be important to you.

    Clouds vs Grids

    Endless pixels have been killed defining clouds, grids, and how they are different enough that there's really a whole new market to sell into. Rich actually makes a convincing argument that grids and clouds are different and do require a completely different infrastructure. The differences:

    Cloud

  • Full private cluster is provisioned
  • Individual user can only get a tiny fraction of the total resource pool
  • No support for cloud federation except through the client interface
  • Opaque with respect to resources

    Grid

  • Built so that individual users can get most, if not all of the resources in a single request
  • Middleware approach takes federation as a first principle
  • Resources are exposed, often as bare metal

    Related Articles

  • Get Off of My Cloud by M. Jagger and K. Richards.
  • Rich Wolski's Home Page
  • Enomaly
  • Nimbus
Thursday
May 28, 2009

    Scaling PostgreSQL using CUDA

Combining GPU power with PostgreSQL

PostgreSQL is one of the world's leading open source databases and it provides enormous flexibility as well as extensibility. One of the key features of PostgreSQL is that users can define their own procedures and functions in basically any known programming language, which makes it possible to write basically any server-side code easily. Now, all this extensibility is not new. What does it have to do with scaling, then? Well, imagine a world where the data in your database and enormous computing power are tightly integrated. Imagine a world where data inside your database has direct access to hundreds of FPUs. Welcome to the world of CUDA, NVIDIA's way of making the power of graphics cards available to normal, high-performance applications.

When it comes to complex computations, databases might very well turn out to be a bottleneck. Depending on your application it might easily happen that adding more CPU power does not improve the overall performance of your system – the reason being that bringing data from your database to the units which actually do the computations is way too slow (maybe because of remote calls and so on). Especially when data is flowing over a network, copying a lot of data might be limited by network latency or simply bandwidth. What if this bottleneck could be avoided?

CUDA is C / C++

Basically a CUDA program is simply a C program with some small extensions. The CUDA subsystem transforms your CUDA program into normal C code which can then be compiled and linked nicely with existing code. This also means that CUDA code can easily be used inside a PostgreSQL stored procedure. The advantages of this mechanism are obvious:

  • GPUs can do matrix and FPU-related operations hundreds of times faster than any CPU
  • the GPU is used inside the database, so no data has to be transported over slow lines
  • basically any NVIDIA graphics card can be used
  • you get enormous computing power for virtually zero cost
  • you can even build functional indexes on top of CUDA stored procedures
  • not as many boxes are needed because one box is much faster

How to make it work?

The goal for this simplistic example is to generate a set of random numbers on the CPU, copy it to the GPU, and make the code callable from PostgreSQL.
Here is the function to generate random numbers and copy them to the GPU:

    /* implement random generator and copy to CUDA */
    nn_precision* generate_random_numbers(int number_of_values)
    {
        nn_precision *cuda_float_p;

        /* allocate host memory and CUDA memory */
        nn_precision *host_p = (nn_precision *)pg_palloc(sizeof(nn_precision) * number_of_values);
        CUDATOOLS_SAFE_CALL( cudaMalloc( (void**) &cuda_float_p,
            sizeof(nn_precision) * number_of_values));

        /* create random numbers */
        for (int i = 0; i < number_of_values; i++)
        {
            host_p[i] = (nn_precision) drand48();
        }

        /* copy data to CUDA and return pointer to CUDA structure */
        CUDATOOLS_SAFE_CALL( cudaMemcpy(cuda_float_p, host_p,
            sizeof(nn_precision) * number_of_values, cudaMemcpyHostToDevice) );
        return cuda_float_p;
    }

Now we can go and call this function from a PostgreSQL stored procedure:

    /* import postgres internal stuff */
    #include "postgres.h"
    #include "fmgr.h"
    #include "funcapi.h"
    #include "utils/memutils.h"
    #include "utils/elog.h"
    #include "cuda_tools.h"

    PG_MODULE_MAGIC;

    /* prototypes to silence compiler */
    extern Datum test_random(PG_FUNCTION_ARGS);

    /* define function to allocate N random values (0 - 1.0)
       and put it into the CUDA device */
    PG_FUNCTION_INFO_V1(test_random);
    Datum test_random(PG_FUNCTION_ARGS)
    {
        int number = PG_GETARG_INT32(0);
        nn_precision *p = generate_random_numbers(number);
        cuda_free_array(p);
        PG_RETURN_VOID();
    }

This code can then be compiled just like any other PostgreSQL C extension. The test_random function can be called like this:

    SELECT test_random(1000);

Of course this is just a brief introduction to see how things can practically be done. A more realistic application will need more thinking and can be integrated into the database even more closely.

More information:

  • Professional CUDA programming
  • Professional PostgreSQL services
  • The official PostgreSQL Website
  • The official CUDA site


    Wednesday
May 20, 2009

    Paper: Flux: An Adaptive Partitioning Operator for Continuous Query Systems

At the core of the new real-time web, which is really really old, are continuous queries. I like how this paper proposes to handle dynamic demand and dynamic resource availability by making the underlying system adaptable, which seems like a very cloudy kind of thing to do. Abstract:

The long-running nature of continuous queries poses new scalability challenges for dataflow processing. CQ systems execute pipelined dataflows that may be shared across multiple queries. The scalability of these dataflows is limited by their constituent, stateful operators – e.g. windowed joins or grouping operators. To scale such operators, a natural solution is to partition them across a shared-nothing platform. But in the CQ context, traditional, static techniques for partitioned parallelism can exhibit detrimental imbalances as workload and runtime conditions evolve. Long-running CQ dataflows must continue to function robustly in the face of these imbalances. To address this challenge, we introduce a dataflow operator called Flux that encapsulates adaptive state partitioning and dataflow routing. Flux is placed between producer-consumer stages in a dataflow pipeline to repartition stateful operators while the pipeline is still executing. We present the Flux architecture, along with repartitioning policies that can be used for CQ operators under shifting processing and memory loads. We show that the Flux mechanism and these policies can provide several factors improvement in throughput and orders of magnitude improvement in average latency over the static case.


    Sunday
May 17, 2009

    Scaling Django Web Apps by Mike Malone

Film buffs will recognize Django as a classic 1966 spaghetti western that spawned hundreds of imitators. Web heads will certainly first think of Django as the classic Python-based web framework that has also spawned hundreds of imitators and has become the gold standard framework for the web. Mike Malone, who worked on Pownce, a blogging tool now owned by Six Apart, tells in this very informative EuroDjangoCon presentation how Pownce scaled using Django in the real world. I was surprised to learn how large Pownce was: hundreds of requests/sec, thousands of DB operations/sec, millions of user relationships, millions of notes, and terabytes of static data. Django has a lot of functionality in the box to help you scale, but if you want to scale large it turns out Django has some limitations and Mike tells you what these are and also provides some code to get around them. Mike's talk--although Django specific--will really help anyone creating applications on the web. There's a lot of useful Django-specific advice and a lot of general good design ideas as well. The topics covered in the talk are:

  • Django uses a shared-nothing architecture.
    * The database is responsible for scaling state.
    * Application servers are horizontally scalable because they are stateless.
  • Scalability vs Performance. Performance is not the same as scalability. Scalability: a scalable system doesn't need to change when the size of the problem changes.
  • Types of scalability:
    * Vertical - buy bigger hardware
    * Horizontal - the ability to increase a system's capacity by adding more processing units (servers)
  • Cache to remove load from the database server.
  • Built-in Django Caching: Per-site caching, per-view cache, template fragment cache - not so effective on heavily personalized pages
  • Low-level Cache API is used to cache at any level of granularity.
  • Pownce cached individual objects and lists of object IDs.
  • The hard part of caching is invalidation. How do you know when a value changes such that the cache should be updated so readers see valid values? (A sketch of this pattern appears after this list.)
    * Invalidate when a model is saved or deleted.
    * Invalidate post_save, not pre_save.
    * This leaves a small race condition, so:
      ** Instead of deleting, set the cache key to None for a short period of time
      ** Instead of using set to cache objects, use add, which fails if there's already something stored for the key
  • Pownce ran memcached on their web servers.
    * Their servers were not CPU bound, they were IO and memory bound, so they compressed objects before caching.
  • Work is spread between multiple application servers using a load balancer.
  • Best way to reduce load on your app servers: don’t use them to do hard stuff.
  • Pownce used software load balancing.
    * Hardware load balancers are expensive ($35K) and you need two for redundancy.
    * Software load balancers are cheap and easy.
    * Some options: Perlbal, Pound, HAProxy, Varnish, Nginx.
    * They chose a single Perlbal server. This was a single point of failure but they didn't have the money for hardware. They liked Perlbal's reproxying feature.
  • Used a ghetto queuing solution (MySQL + cron) to process work asynchronously in the background.
  • At scale their system needed to have high availability and be partitionable.
    * The RDBMS's consistency requirements get in our way.
    * Most sharding / federation schemes are kludges that trade away consistency.
    * There are many non-relational databases (CouchDB, Cassandra, Tokyo Cabinet) but they aren't easy to use with Django.
  • Rules for denormalization:
    * Start with a normalized database.
    * Selectively denormalize things as they become bottlenecks.
    * Denormalized counts, copied fields, etc. can be updated in signal handlers.
  • Joins are evil and Django makes it really easy to do joins.
  • Database Read Performance
    * Since your typical web app is 80% to 90% reads, adding MySQL master-slave replication can solve a lot of problems.
    * Django doesn't support multiple database connections, but there's a library, linked to at the end of this post, to help.
    * A big problem is slave lag. When you write to the master it takes time for the state to be transferred to the read slaves, so readers may see an old value on the read.
  • Database Write Performance
    * Federate: split tables across different servers. Not well supported by Django.
    * Vertical partitioning: split tables that aren't joined across database servers.
    * Horizontal partitioning: split a single table across databases (e.g., the user table). The problem is autoincrement no longer works, and Django uses autoincrement for primary keys.
  • Monitoring - you can't improve what you don't measure.
    * Products: Ganglia and Munin
  • Measure:
    * Server load, CPU usage, I/O
    * Database QPS
    * Memcache QPS, hit rate, evictions
    * Queue lengths
    * Anything else interesting
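As mentioned in the invalidation item above, here is a minimal sketch of that pattern using Django's low-level cache API in a post_save handler. The Note model, key format, and timeouts are made up for illustration.

    # Sketch of the invalidation pattern: park None in the key briefly instead of
    # deleting, and repopulate with add() so a stale value can't clobber the marker.
    from django.core.cache import cache
    from django.db.models.signals import post_save
    from django.dispatch import receiver

    from myapp.models import Note   # hypothetical model

    def note_key(note_id):
        return f"note:{note_id}"

    @receiver(post_save, sender=Note)
    def invalidate_note(sender, instance, **kwargs):
        # Invalidate after the save has hit the database (post_save, not pre_save),
        # and set the key to None for a few seconds rather than deleting it.
        cache.set(note_key(instance.pk), None, 5)

    def get_note(note_id):
        note = cache.get(note_key(note_id))
        if note is None:
            note = Note.objects.get(pk=note_id)
            # add() fails silently if something is already stored for the key,
            # so it won't overwrite the short-lived None marker set above.
            cache.add(note_key(note_id), note, 60)
        return note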

    Related Articles

  • Interview with Leah Culver: The Making of Pownce
  • Django Caching Code
  • Django Multidb Code
  • EuroDjangoCon Presentations


Sunday
May 17, 2009

    Product: Hadoop

Update 5: Hadoop Sorts a Petabyte in 16.25 Hours and a Terabyte in 62 Seconds and has its green cred questioned because it took 40 times the number of machines Greenplum used to do the same work.

Update 4: Introduction to Pig. Pig allows you to skip programming Hadoop at the low map-reduce level. You don't have to know Java. Using the Pig Latin language, which is a scripting data flow language, you can think about your problem as a data flow program. 10 lines of Pig Latin = 200 lines of Java.

Update 3: Scaling Hadoop to 4000 nodes at Yahoo!. 30,000 cores with nearly 16PB of raw disk; a sort of 6TB of data completed in 37 minutes; 14,000 map tasks write (read) 360 MB (about 3 blocks) of data into a single file with a total of 5.04 TB for the whole job.

Update 2: Hadoop Summit and Data-Intensive Computing Symposium Videos and Slides. Topics include: Pig, JAQL, Hbase, Hive, Data-Intensive Scalable Computing, Clouds and ManyCore: The Revolution, Simplicity and Complexity in Data Systems at Scale, Handling Large Datasets at Google: Current Systems and Future Directions, Mining the Web Graph, and Sherpa: Hosted Data Serving.

Update: Kevin Burton points out Hadoop now has a blog and an introductory video starring Beyonce. Well, the Beyonce part isn't quite true.

Hadoop is a framework for running applications on large clusters of commodity hardware using a computational paradigm named map/reduce, in which the application is divided into many small fragments of work, each of which may be executed on any node in the cluster. It replicates much of Google's stack, but it's for the rest of us. Jeremy Zawodny has a wonderful overview of why Hadoop is important for large website builders:

For the last several years, every company involved in building large web-scale systems has faced some of the same fundamental challenges. While nearly everyone agrees that the "divide-and-conquer using lots of cheap hardware" approach to breaking down large problems is the only way to scale, doing so is not easy. The underlying infrastructure has always been a challenge. You have to buy, power, install, and manage a lot of servers. Even if you use somebody else's commodity hardware, you still have to develop the software that'll do the divide-and-conquer work to keep them all busy. It's hard work. And it needs to be commoditized, just like the hardware has been...

Hadoop also provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both map/reduce and the distributed file system are designed so that node failures are automatically handled by the framework. Hadoop has been demonstrated on clusters with 2000 nodes. The current design target is 10,000 node clusters.

The obvious question of the day is: should you build your website around Hadoop? I have no idea. There seem to be a few types of things you do with lots of data: process, transform, and serve. Yahoo literally has petabytes of log files, web pages, and other data they process. Process means to calculate on, that is: figure out affinity, categorization, popularity, click-throughs, trends, search terms, and so on. Hadoop makes great sense for them for the same reasons it does for Google. But does it make sense for your website?

If you are YouTube and you have petabytes of media to serve, do you really need map/reduce? Maybe not, but the clustered file system is great. You get high bandwidth with the ability to transparently extend storage resources. Perfect for when you have lots of stuff to store. YouTube would seem like it could use a distributed job mechanism, like you can build with Amazon's services. With that you could create thumbnails, previews, transcode media files, and so on. When they have HBase up and running that could really spike adoption. Everyone needs to store structured data in a scalable, reliable, highly performing data store. That's an exciting prospect for me. I can't wait for experience reports about "normal" people, familiar with a completely different paradigm, adopting this infrastructure. I wonder what animal O'Reilly will use on their Hadoop cover?
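To make the map/reduce paradigm concrete, here's the canonical word-count example written as a Hadoop Streaming-style mapper and reducer in Python. This is a sketch only; in a real cluster you'd submit the mapper and reducer scripts through the Hadoop Streaming jar rather than running them locally as shown here.

    # wordcount_streaming.py -- canonical map/reduce example, Hadoop Streaming style.
    # Mappers emit (word, 1) pairs; the framework shuffles and sorts them so each
    # reducer sees a word's counts grouped together and can sum them in one pass.
    import sys
    from itertools import groupby

    def mapper(lines):
        # map: each mapper works on one small fragment of the input
        for line in lines:
            for word in line.split():
                yield word.lower(), 1

    def reducer(pairs):
        # reduce: pairs arrive sorted by key, so one pass per word is enough
        for word, group in groupby(pairs, key=lambda kv: kv[0]):
            yield word, sum(count for _, count in group)

    if __name__ == "__main__":
        # Local simulation of the map -> shuffle/sort -> reduce pipeline:
        #   cat some_text.txt | python wordcount_streaming.py
        mapped = sorted(mapper(sys.stdin))
        for word, total in reducer(mapped):
            print(f"{word}\t{total}")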

    See Also

  • Open Source Distributed Computing: Yahoo's Hadoop Support by Jeremy Zawodny
  • Yahoo!'s bet on Hadoop by Tim O'Reilly
  • Hadoop Presentations
  • Running Hadoop MapReduce on Amazon EC2 and Amazon S3

