Building Scalable Systems Using Data as a Composite Material
Think of building websites as engineering composite materials. A composite material is created when two or more materials are combined to form a third material that does something useful the components couldn't do on their own. Composites like reinforced concrete have revolutionized design and construction. When building websites we usually bring different component materials together, as in a composite, to get the features we need, rather than building from scratch a completely new thing that does everything we want.
This approach has been seen as a hack because it leads to inelegancies like data duplication, great gobs of component glue, consistency issues, and messy operations. But what if the composite approach is really a strength rather than a hack, a messy part of the world that needs to be embraced rather than belittled?
The key is to see data as a material. Right now we are arguing over which is the best single material to build with. Is it NoSQL, relational, massively parallel, graph, in-memory, or something else entirely? It all seems a bit crazy. Each material has both limits and capabilities. What we need to build is a composite material that combines the best characteristics of the available materials into something better.
Composite Materials in the Meat World
In materials science, materials have a natural capacity, depending on their structure, to resist failure under forces like shear, torsion, and compression. These capabilities limit how high you can build a building or how much weight something can carry. To get around these limits, composite materials are engineered that combine different materials to create something new with the best characteristics of its components.
To get a feel for the difference materials and engineering can make, compare Medieval Romanesque churches with Gothic churches. Romanesque churches are short, thick, small-windowed, and dark. Gothic churches, in contrast, are tall, open, and light, marking a new, more confident and prosperous era. Gothic churches could be so much taller because of the invention of the pointed arch, ribbed vaults, and the flying buttress, all of which handle loads better, making it possible to build taller, more open buildings. Better engineering and materials led to better buildings.
Skip ahead a few centuries. Tall, majestic skyscrapers, one of the most muscular symbols of modern civilization, were made possible in part by the invention of new composite materials. As we've seen, buildings used to be height-challenged because they needed walls thick enough to support their own weight. In 1891, for example, a 16-story building constructed in Chicago had walls 6 ft thick at the base.
Skyscrapers became possible with the invention of steel-frame construction, where a rigid steel skeleton supports the building's weight. Walls were demoted to decoration rather than the meatier job of support.
Another invention that made skyscrapers possible is reinforced concrete: the magical mixing of concrete + steel rebar. In a foreshadowing of how we keep creating larger components out of parts that themselves go into building more complex components, these materials are themselves composites: concrete is a mixture of cement and stone aggregate, steel is an alloy consisting mostly of iron and carbon, and cement is made from clay and limestone.
What concrete brings to the mix, because of its chemical makeup, is rigidity. Rigidity makes concrete able to resist compressive forces, the forces that flatten or squeeze a material. A very good quality for tall, heavy buildings. Unfortunately that same rigidity makes concrete brittle when bent. Twisting concrete breaks it. Not a good thing for buildings subject to the natural twists of wind and earthquakes. So concrete alone won't work. This is where rebar comes in. Steel shines because its structure gives it the ability to spring back to its original shape once a load, like a twisting wind, is removed. A very good property for a building.
By combining concrete and steel, a new material was created that has the best qualities of each: it resists both compression and bending, which allows buildings of almost any shape to be built. This new material brought on an entirely new wave of architecture. The modern cityscape would be unrecognizable without it.
Composites in the Web World
Let's look at our material, data, in a little more detail. Data doesn't exist separate from purpose, and no two building techniques treat data the same way. We have normalization, denormalization, unstructured, semi-structured, with relationships, without relationships, documents, key-value, and so on.
Which view is right? None of them. They are all views built on a purpose. It's the purpose that shapes the material. Every view has a point because every view is meant to carry out a different function. A composite material will need to take data and allow it to exist in many forms simultaneously.
The key concept here is that there is never one true representation of data. Data is represented however it needs to be represented to carry out a particular function. The job of a website is to make sure data flows correctly between all these different functions.
There are a number of functions common to sites: time series, geolocation, reading, writing, streaming, counters, aggregated data, graph navigation, ad-hoc query, monetization, job queues, complex analysis, tagging, logging, searching, events, social graphs, operations, and so on. There are also a number of cross-cutting concerns common to each function and to the interactions between functions: transaction model, programming model, cache coherence, data model, caching, real-time, batch, data size, input rates, output rates, bandwidth, storage, memory, CPU, availability, testing, and so on. If you look, you'll find some sort of product in every one of these spaces.
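To make the idea of data flowing between functions concrete, here is a minimal sketch in Python of a single logical write fanning out to the purpose-shaped representations different functions need. The store classes are hypothetical in-memory stand-ins, not any real product's API; a real site would plug in its relational database, cache, search index, and counter service.

```python
# Minimal sketch: one logical write fans out to the representation each
# function needs. The stores are in-memory stand-ins for the specialized
# components (relational DB, cache, search index, counters) a real site
# would use here.

class RelationalStore:                    # system of record (normalized rows)
    def __init__(self):
        self.rows = {}

    def upsert(self, key, row):
        self.rows[key] = row


class Cache:                              # denormalized view for fast keyed reads
    def __init__(self):
        self.entries = {}

    def put(self, key, value):
        self.entries[key] = value


class SearchIndex:                        # inverted index for free-text search
    def __init__(self):
        self.postings = {}

    def index(self, key, text):
        for word in text.lower().split():
            self.postings.setdefault(word, set()).add(key)


class CounterStore:                       # aggregated counters for analytics
    def __init__(self):
        self.counts = {}

    def incr(self, name):
        self.counts[name] = self.counts.get(name, 0) + 1


def save_article(article, db, cache, search, counters):
    """One logical operation, several purpose-shaped representations."""
    db.upsert(article["id"], article)                         # durable record
    cache.put(article["id"], article)                         # fast keyed lookup
    search.index(article["id"], article["title"] + " " + article["body"])
    counters.incr("articles_created")                         # aggregated stats


db, cache, search, counters = RelationalStore(), Cache(), SearchIndex(), CounterStore()
save_article({"id": 1, "title": "Composite data", "body": "data as a material"},
             db, cache, search, counters)
print(search.postings["data"])   # {1} -- the same write now serves search too
```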
Often the proposed solution to do away with this confusing world is to create a one-ring to rule them all, as if that will finally bring order to the chaos. In the database world the one-ring is a product that does everything using just one external view of the data. In some towers, if you even suggest that the one-ring has faults, you are shouted down by the Ringwraiths, who say: just wait, it can all be done by one system, have a little faith, or else it's Mount Doom for you.
Composing components is, after all, how real websites like Amazon, Flickr, Digg, and Facebook are put together. Look at the Real Life Architecture section on HighScalability.com and you'll see company after company building their website from a long list of components, most of them open source and many of them custom built for their situation. The secret sauce of a site is often knowing how to forge all these components into a single working system.
Ravelry is a good example. They use: Ruby on Rails, MySQL, Gentoo Linux, Capistrano, Nginx, Xen, HAproxy, Munin, Nagios, Tokyo Cabinet, HopToad, NewRelic, Syslog-ng, S3, Sphinx, Memcached, a bandwidth provider, a hosting provider, and a server provider. Across most websites this theme is the same, even if the components change.
Building specialized software usually makes sense when it involves a site's core competency. Digg and LinkedIn, for example, build highly custom in-memory social networking databases. That's their biz, so it makes sense. Facebook builds specialized software for image serving, chat, caching, replication, logging, and everything else that really needs to scale. Again, that's their biz.
Custom software must always be fitted into the rest of the system. You may have your own social networking database to make social networking operations fast, but you may also use a transactional database for financial data; a search platform like Sphinx to handle site searches; a cache for fast keyed lookups; and so on.
The alternative is to force functionality into components where it is implemented poorly. Is search really best handled in your RDBMS just because you are worried about keeping data consistent between the database and the search engine? Is it really better to go without a denormalized cache for fast access because it's hard to keep consistent with the normalized, transactional part of your system?
You could force one database to handle all these requirements. One mega-system could handle search, key-value access, and transactional functionality, but really, how well does that work? It hasn't worked well. Maybe it's better to build a best-of-breed system for each function and then learn how to keep the components in sync, handle transactions that span components, and maintain cache coherence between them.
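As a sketch of what keeping the components in sync can look like, here is one common pattern expressed in Python: the transactional database acts as the source of truth, the cache is invalidated on writes (cache-aside), and search-index updates flow through a queue so they can be applied asynchronously. Every class and function name below is a hypothetical stand-in, not any particular product's API.

```python
import queue

# Sketch of one sync strategy: the database is the source of truth,
# the cache is invalidated on write (cache-aside), and search updates
# flow through a queue so a slow or failed indexer can catch up later.

class Database:
    def __init__(self):
        self.rows = {}

    def write(self, key, row):
        self.rows[key] = row

    def read(self, key):
        return self.rows.get(key)


class Cache:
    def __init__(self):
        self.entries = {}

    def get(self, key):
        return self.entries.get(key)

    def set(self, key, value):
        self.entries[key] = value

    def delete(self, key):
        self.entries.pop(key, None)


class SearchIndexer:
    def __init__(self):
        self.indexed = {}

    def apply(self, event):
        self.indexed[event["key"]] = event["row"]


db, cache, index_queue, indexer = Database(), Cache(), queue.Queue(), SearchIndexer()

def update_item(key, row):
    db.write(key, row)                            # 1. source of truth first
    cache.delete(key)                             # 2. invalidate, don't update, the cache
    index_queue.put({"key": key, "row": row})     # 3. async propagation to search

def read_item(key):
    value = cache.get(key)                        # cache-aside read path
    if value is None:
        value = db.read(key)
        cache.set(key, value)
    return value

def drain_index_queue():                          # would run in a background worker
    while not index_queue.empty():
        indexer.apply(index_queue.get())

update_item("user:1", {"name": "Ada"})
print(read_item("user:1"))                        # repopulates the cache from the DB
drain_index_queue()
print(indexer.indexed["user:1"])                  # search eventually catches up
```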
The advantages of a composite architecture follow from everything above: each function is handled by a component built for that job, custom software is reserved for a site's core competency, and functionality is never forced into a component where it is implemented poorly.
Biology uses a similar strategy of building larger systems from components providing very specific functionality. We often think of the cells in our body as just one thing, but in reality cells are highly specialized. There are hundreds of different types of cells in the human body. Red blood cells, for example, are specialized to carry oxygen. We talk of neurons as if there's only one kind of neuron; there are hundreds. In fact, regions of the brain were first categorized by noticing where cell types changed. Cells are collected together into larger structures called organs. Organs are collected together to make a body. Parts of the body talk to each other using the bloodstream to carry messages and through direct connections in the nervous system. It's a little messy and chaotic, but it seems to work.
The point of all this is to counter seeing any one technology as the saviour. Right now there's a lot of excitement over NoSQL-style solutions, and a rush to toss out the relational database and replace it with one of the many NoSQL options. Replacement, however, may not suffice. Each data model has its weaknesses. The idea behind composite materials is that materials can be combined to make something better, and that this approach to building systems is not a hack; it may in fact be superior.
Reader Comments (4)
SQL or NoSQL... as long as it's fast and works like I need it to, who cares... We have a mixture of key/value (NoSQL) and traditional databases. Each tool for a job. And yes, there is no "magic" bullet that fixes things. Obviously you wouldn't use the same tool to search members of a site based on their attributes as you would to store your web server access logs for offline analysis. :)
Although watching people argue about NoSQL vs SQL is kind of amusing.
----------------------------------------------------------------------------
http://blog.pe-ell.net/1_jons_ramblings/archive/5_what_i_do.html
Great post! This is exactly what I'm working on now (here we call it a System of Systems). I'm even using the same cells metaphor! :D
It imposes some challenges:
- Process: does a team change every system to deliver a piece of functionality, or does each team change a single system? If the latter, who keeps the conceptual idea intact and delivers the functionality at the end? Architects?
- Communication between systems: a good approach is the "Canonical Protocol" pattern
- Data duplication: if there is duplicated data, is there a single place where I can find the most recent version of it?
- And many other complexities of a distributed system...
I like to think of it this way,
the only things you can do with data are:
* make more copies of it (e.g. master-slave replication)
* process data to make more data / summaries (e.g. monthly or weekly rollup tables)
* create different indexes that allow rapid access (e.g. an index on columns a, b, c, or a reverse index on all the words in a document for free-text search)
Code is data.
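To make those three operations concrete, here is a minimal Python sketch over a few in-memory records (the field names and the monthly rollup granularity are invented purely for illustration): copy the data, derive a summary from it, and build indexes over it.

```python
from collections import defaultdict
import copy

# Toy records standing in for rows in a table (fields are illustrative).
events = [
    {"id": 1, "month": "2010-01", "user": "alice", "text": "knit a red hat"},
    {"id": 2, "month": "2010-01", "user": "bob",   "text": "spun red yarn"},
    {"id": 3, "month": "2010-02", "user": "alice", "text": "blocked a shawl"},
]

# 1. Make more copies of it (a stand-in for master-slave replication).
replica = copy.deepcopy(events)
print(replica == events)     # True -- the copy could serve reads independently

# 2. Process data to make more data (a monthly rollup / summary table).
rollup = defaultdict(int)
for e in events:
    rollup[e["month"]] += 1                      # events per month

# 3. Create different indexes that allow rapid access.
by_user = defaultdict(list)                      # "index on a column"
words = defaultdict(set)                         # reverse index for free-text search
for e in events:
    by_user[e["user"]].append(e["id"])
    for w in e["text"].split():
        words[w].add(e["id"])

print(dict(rollup))          # {'2010-01': 2, '2010-02': 1}
print(by_user["alice"])      # [1, 3]
print(sorted(words["red"]))  # [1, 2]
```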