Thursday, August 14, 2008
Product: Terracotta - Open Source Network-Attached Memory

Update: Evaluating Terracotta by Piotr Woloszyn. A nice writeup that covers resilience, failover, DB persistence, distributed caching implementation, OS/platform restrictions, ease of implementation, hardware requirements, performance, support package, code stability, partitioning, transactions, replication, and consistency.
Terracotta is Network Attached Memory (NAM) for Java VMs. It provides up to a terabyte of virtual heap for Java applications that spans hundreds of connected JVMs.
NAM is best suited for storing what they call scratch data. Scratch data is defined as object oriented data that is critical to the execution of a series of Java operations inside the JVM, but may not be critical once a business transaction is complete.
The Terracotta Architecture has three components:
- Client Nodes - each client node corresponds to a client JVM in the cluster and runs on a standard JVM
- Server Cluster - a Java process that provides the clustering intelligence. The current Terracotta implementation operates in an Active/Passive mode
- Storage, used as:
  - Virtual Heap storage - as objects are paged out of the client nodes into the server, and if the server heap fills up, objects are paged onto disk
  - Lock Arbiter - to ensure that there is no possibility of the classic "split-brain" problem, Terracotta relies on the disk infrastructure to provide a lock
  - Shared Storage - to transmit object state from the active server to the passive, objects are persisted to disk, which then shares the state with the passive server(s)
JVM-level clustering can turn single-node, multi-threaded apps into distributed, multi-node apps, often with no code changes. This is possible by plugging into the Java Memory Model in order to maintain key Java semantics of pass-by-reference, thread coordination, and garbage collection across the cluster. Terracotta enables this using only declarative configuration, with minimal impact to existing code, and provides fine-grained, field-level replication, which means your objects no longer need to implement Java serialization.
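To make the "declarative configuration, no serialization" point concrete, here is a minimal, hypothetical sketch of the kind of plain-Java code this model targets. The class, field, and method names are invented for illustration; in Terracotta the shared field would be declared a clustered root in the configuration file (not shown here), and the ordinary synchronized blocks would act as cluster-wide locks.

import java.util.HashMap;
import java.util.Map;

// Hypothetical example: a plain POJO cache that ordinary Java threads share.
// Under Terracotta the 'cache' field would be declared a root in the declarative
// configuration, and the synchronized blocks would be promoted to cluster-wide
// locks. The class needs no Serializable, no special API calls, and no code changes.
public class SharedCatalog {

    // Shared state: a plain object graph, passed by reference between threads.
    private final Map<String, Product> cache = new HashMap<>();

    public void put(Product p) {
        synchronized (cache) {          // would become a clustered lock
            cache.put(p.getSku(), p);   // the field-level change is what gets replicated
        }
    }

    public Product get(String sku) {
        synchronized (cache) {
            return cache.get(sku);
        }
    }

    public static class Product {
        private final String sku;
        private final String name;

        public Product(String sku, String name) {
            this.sku = sku;
            this.name = name;
        }

        public String getSku() { return sku; }
        public String getName() { return name; }
    }
}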
Ari Zilka, the founder and CTO of Terracotta, gave a video session organized by Skills Matter in which he shows how it works and how you can start clustering your POJO-based web applications (based on Spring, Struts, Wicket, RIFE, EHCache, Quartz, Lucene, DWR, Tomcat, JBoss, Jetty, Geronimo, etc.).
Reader Comments (10)
This looks pretty impressive. Has anyone actually tried implementing this in their own existing system? Definitely seems like this could be very powerful for large scale java apps (is anyone doing that? :o) )
http://www.samalamadingdong.com
"Definitely seems like this could be very powerful for large scale java apps (is anyone doing that? :o) )"
LinkedIn is a big Java shop.
Cheers
I am particularly unimpressed by Terracotta. Take for instance their justifications for avoiding a peer system:
http://www.ddj.com/java/199703478
"The peer-to-peer approach bottlenecks on the network, since every node needs to know everything that the other nodes know."
Also, http://www.theserverside.com/tt/articles/article.tss?l=TerracottaScalabilityStory :
"If an application is distributed across four nodes as opposed to two, should that cluster not take twice as many operations to update objects on all four nodes? If a clustering architecture were to send updates one-by-one, then four nodes would achieve half the clustered throughput of two. And, if we were to use a multicast approach, then we would lower the elapsed time for the update by going parallel in our cluster updates. But, if we confirm (ACK) those updates in all clustered JVMs—and for the case of correctness we really ought to acknowledge all updates on all nodes where the object lies resident—we still have to wait for 3 acknowledgements to each update in a four-node cluster and our design is, thus, O(n)."
The idea that a distributed (peer) system requires an O(n) routing algorithm is patently false. Take any modern research on DHTs, which provide the same basic functionality as Terracotta; these systems have, at worst, an O(log(n)) routing algorithm for object lookup. Or take a look at memcached, which has an O(1) lookup algorithm. My point is, consistent hashing solves this problem.
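For readers who have not seen the technique, here is a minimal, illustrative sketch of consistent hashing, the approach the comment refers to; the server names and virtual-node count in the usage note below are invented, and the hashing details are just one common choice.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative consistent-hash ring: each node is placed at several points on a
// ring of hash values, and a key is owned by the first node clockwise from its
// hash. Lookup is O(log n) in the ring size, and adding or removing a node only
// remaps the keys adjacent to it, which is why peer systems avoid O(n) update costs.
public class ConsistentHashRing {

    private final SortedMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodesPerServer;

    public ConsistentHashRing(List<String> servers, int virtualNodesPerServer) {
        this.virtualNodesPerServer = virtualNodesPerServer;
        for (String server : servers) {
            addServer(server);
        }
    }

    public void addServer(String server) {
        for (int i = 0; i < virtualNodesPerServer; i++) {
            ring.put(hash(server + "#" + i), server);
        }
    }

    public void removeServer(String server) {
        for (int i = 0; i < virtualNodesPerServer; i++) {
            ring.remove(hash(server + "#" + i));
        }
    }

    // Returns the server responsible for the given key.
    public String serverFor(String key) {
        long h = hash(key);
        // First ring position at or after the key's hash, wrapping around to the start.
        SortedMap<Long, String> tail = ring.tailMap(h);
        Long position = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
        return ring.get(position);
    }

    private static long hash(String value) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
            // Use the first 8 bytes of the MD5 digest as a long.
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}

For example, new ConsistentHashRing(Arrays.asList("cacheA", "cacheB", "cacheC"), 100).serverFor("user:42") picks an owning node without consulting any central coordinator, and adding a fourth node only remaps the keys adjacent to its ring positions.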
Due to this erroneous reasoning on the part of the Terracotta developers, they have chosen to create an inherently unscalable system. Look closely at how Terracotta actually scales -- they essentially use a master/slave architecture in order to distribute reads. They scale fine for a read-heavy application, but there is a single-node bottleneck for writes. That's right, Terracotta is a hierarchical system with a single node as the bottleneck.
Yes, it can be made faster with better hardware. And yes, you can use your own manual sharding strategies. But no, this is not a truly (horizontally) scalable product. When I read High Scalability, I'm looking for technologies that avoid the problem of having a single bottleneck.
Disclosure: My only interaction with their product has been to read a couple of hyped up articles about it and look closely at their "cutting edge" architecture diagrams. I have never (and currently don't) work on any products/projects in the same realm.
I'm also skeptical of the "no code change" benefits. Most applications likely assume inter-thread latency is lower than inter-host latency. Likewise, failures in clusters are a lot more complicated, with things like asymmetric partitions. Using spindles for locks sounds slow as well.
I want to address the concerns raised here:
1. "No code changes' is more accurately characterized as pure Java and direct JDK support. Our users say that they end up with cleaner code, fewer bugs, and faster apps once Terracotta has been introduced. So yes, apps are not always prepared to cluster, but a clustered app runs w/o Terracotta present with zero changes. As a specific example, I just spoke to a user who wanted to build a data structure made up of concurrenthashmaps, treemaps inside that, and linkedlists inside that. Worked fine with us. He wanted to benchmark against a distributed cache and this proved quite hard. His linkedlist has a fixed 180 time-based elements. Every so often he pops the oldest item off one end and adds the newest item to the other. I wouldn't want to model this use case only having a map. I would have to do some sort of indexing trick like writing an entry keyed "head of list 7" to the map which would contain as a value, the key of the entry which currently represented the head of that list. And each entry would have to maintain the ID of the next value in the chain. Why? Because linked structures cannot be clustered w/o Terracotta. They get serialized with other frameworks.
2. TC is O(1) like memcached. And it has 10X higher throughput than anything else out there because of the runtime data it can leverage to route and batch -- much like GC or HotSpot can improve code, so can we. Our customers find that a typical TC installation of 2-4 JVMs (plus our server) would be replaced by 20-50 servers of the nearest competitor's solution. While they scale linearly to hundreds of nodes, they seem to need lots of nodes. And our server has consistent hashing built in to stripe data across Terracotta servers for linear scale. Why quote our blog entries and posts from 2-4 years ago and ignore the ones from this year where we announced the existence of active/active server striping with Terracotta?
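To make the linked-list example from point 1 concrete, here is a hypothetical sketch of roughly the structure described: ConcurrentHashMaps holding TreeMaps holding bounded LinkedLists of time-based elements, written as plain JDK collections. The names are invented; only the nesting and the fixed 180-element window come from the comment above. Each update touches only a couple of fields of the linked structure, which is hard to express efficiently over a map-only, serialization-based cache.

import java.util.LinkedList;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the nested structure described above:
// ConcurrentHashMap -> TreeMap -> bounded LinkedList of time-based samples.
public class RollingSeriesStore {

    private static final int WINDOW_SIZE = 180; // fixed number of time-based elements

    // e.g. metricName -> (seriesId -> rolling window of samples)
    private final Map<String, TreeMap<Long, LinkedList<Double>>> store =
            new ConcurrentHashMap<>();

    public void record(String metric, long seriesId, double sample) {
        TreeMap<Long, LinkedList<Double>> series =
                store.computeIfAbsent(metric, m -> new TreeMap<>());
        synchronized (series) {
            LinkedList<Double> window =
                    series.computeIfAbsent(seriesId, id -> new LinkedList<>());
            window.addLast(sample);          // newest item added to one end
            if (window.size() > WINDOW_SIZE) {
                window.removeFirst();        // oldest item popped off the other end
            }
        }
    }

    public LinkedList<Double> window(String metric, long seriesId) {
        TreeMap<Long, LinkedList<Double>> series = store.get(metric);
        if (series == null) {
            return new LinkedList<>();
        }
        synchronized (series) {
            LinkedList<Double> window = series.get(seriesId);
            return window == null ? new LinkedList<>() : new LinkedList<>(window);
        }
    }
}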
Anyways, try it. You will definitely like the programming model. You may have to tune a bit to get it to go fast, but we have all these visualization tools to help with that so that you aren't blind to what's going on like you would be with many other frameworks. (Our cluster profiling tools show lock hopping, object locality and load balancing visualization, etc.)
Cheers,
--Ari
Ari,
Sorry for referencing material that is out of date. I only read material from the front page of a Google search - that aside, could you explain how your recent improvements to Terracotta keep the system from becoming write-constrained? As I understand it, there is a master node that all writes need to pass through - or is this no longer the case?
-Michael Carter
Michael,
The Terracotta Server Array supports both mirroring and striping. Mirroring is the act of providing a physical replica for failover capabilities. Striping partitions the workload. The combination gives you replicas for failover and stripes for scale.
So to answer your question - when a node is "writing" it can utilize one or more servers during the write. Any one particular object can have only one home - one server stripe - but a typical application writes to many objects (usually concurrently), and therefore write throughput increases linearly as stripes are added.
So you do have the architecture in your head right - except the piece I think you are misunderstanding is that the Terracotta Server Array scales out to many nodes to distribute the write load (and also the read load; however, as you have already identified, the read load is normally distributed out to the client nodes, so most of it is already deflected from the Server Array).
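As a rough illustration of the "one home per object, many stripes overall" idea described above, here is a simplified, hypothetical sketch of stripe routing; the stripe/mirror layout and the object-ID scheme are invented for the example and are not Terracotta's actual internals.

import java.util.List;

// Simplified, hypothetical sketch: every object ID maps to exactly one stripe (its
// home), so a single object is never written in two places, but a workload that
// touches many objects spreads its writes across all stripes. Each stripe can also
// have a passive mirror that is used only on failover.
public class StripeRouter {

    public static class Stripe {
        final String activeServer;   // handles reads/writes for objects homed here
        final String passiveMirror;  // replica used only when the active fails

        Stripe(String activeServer, String passiveMirror) {
            this.activeServer = activeServer;
            this.passiveMirror = passiveMirror;
        }
    }

    private final List<Stripe> stripes;

    public StripeRouter(List<Stripe> stripes) {
        this.stripes = stripes;
    }

    // Every object has exactly one home stripe, chosen deterministically from its ID.
    public Stripe homeOf(long objectId) {
        int index = (int) Math.floorMod(objectId, (long) stripes.size());
        return stripes.get(index);
    }
}

With, say, four stripes, four concurrent transactions touching different objects can commit against four different active servers, which is the linear write scaling being described.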
The biggest problem with P2P architectures is that of latency. You have to get a quorum on a lock (or use stale data). The cost of doing so grows as the size of the cluster grows - so the quorum approach quickly fails for reasonable cluster sizes.
The P2P approach is left with two options - asynchronous data updates (meaning the data is not guaranteed up to date) or partitioning.
Partitioning is somewhat uninteresting - I will admit that a partitioning framework can provide useful primitives, but partitioning is something that can be managed by any system - it's the foundation of "database sharding" - and it doesn't really fall into the category of a shared memory system, even if certain distributed cache vendors would like it to.
Asynchronous replication is a reasonable approach - but Terracotta has opted to give the developer a coherent model. That is why we call our solution "Network Attached Memory": its behavior models memory - writes that are made to the system are persistent, reliable, and repeatable. They work like a developer expects. An asynchronous approach trades off this simple development model for a model that is much harder to reason about. In certain cases it's just what the doctor ordered, although quite frankly, I am getting sick of sites like Facebook that can't manage to show me the data I posted on a previous web page - sure, social data is unimportant and it's not the end of the world if the system is only "eventually" correct, but it's not what I expect to happen as a user.
A final approach taken by many "P2P" systems is a hybrid replication approach - whereby the system dynamically selects some n number of backups where n < cluster size (often n is configurable at deployment time). This approach eliminates the problem of making too many replicas as the cluster scales up (and cannot rightly be called P2P), but really it's just the master/replica scenario - only in this approach the system picks the replica. (Note that WebLogic calls this implementation "buddies".)
What's the problem with "buddy" replication? The problem with the buddy system is that it is a nightmare operationally. If nodes are not designated replicas, then they are elected dynamically. This has disastrous effects in production - you don't know where your data is replicated, which means an operator has no idea which nodes are critical to the system's operation and which are not. In effect, an operator is held hostage to the "clustering" system and can never take more than one node down at a time, because a two-node failure could wipe out a buddy and its replica simultaneously, resulting in total data loss (assuming a replica level of 2; of course, a replica level of 3 mitigates this somewhat at the expense of performance, but then the operator can never take down 3 nodes at a time, etc.).
Worse than planned downtime is the problem of cascading failures that are a direct result of the buddy style of replication. Cascading failures start with one or more node failures. Upon death detection a buddy will elect a new replica and begin replicating to it. The new buddy is of course a member of the cluster, and it is often the case that the original failure was caused by high cluster load - meaning the new buddy is likely also experiencing high load. The addition of replication load on top of normal load results in a failure of the new buddy, and the process repeats in an exponential domino effect (since the new buddy was already buddies with someone else, and that someone else also tries to pick a new buddy, and so on). Of course no one notices this problem until Black Friday, and then everyone gets to read about it the next day in the NYTimes.
The Terracotta architecture is immune to these effects. It provides a well known location for data replication. Knowing where the data is replicated is key to ensuring operational simplicity and eliminating the risk of cascading failures.
Regards,
Taylor Gautier - Product Manager - Terracotta
>I am getting sick of sites like Facebook that can't manage to show me the data I posted in a previous web page
Funny you mention this, as I just read an article about how they solved this with an adjustment to the MySQL code. The problem was the two servers they used: every time the MySQL database did a write, memcached was not updated, and so the wrong data was shown. They solved it by changing the MySQL code so that after a write it destroys the memcached entry if it exists, so that on the next lookup it gets the changed data from MySQL and writes it back to memcached.
Nice article and discussion btw ;-)
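For readers curious about the pattern, here is a minimal, hypothetical sketch of that invalidate-on-write idea in Java; the CacheClient interface, table, and key scheme are placeholders, and the actual fix described above was made inside the MySQL code itself rather than in application code.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical sketch of invalidate-on-write: after a successful database write,
// delete the corresponding cache entry so the next read repopulates it with
// fresh data instead of serving the stale value.
public class ProfileDao {

    // Placeholder for a memcached-style client; the real API depends on the library used.
    public interface CacheClient {
        void delete(String key);
    }

    private final Connection db;
    private final CacheClient cache;

    public ProfileDao(Connection db, CacheClient cache) {
        this.db = db;
        this.cache = cache;
    }

    public void updateStatus(long userId, String status) throws SQLException {
        try (PreparedStatement stmt =
                     db.prepareStatement("UPDATE profiles SET status = ? WHERE user_id = ?")) {
            stmt.setString(1, status);
            stmt.setLong(2, userId);
            stmt.executeUpdate();
        }
        // Invalidate rather than update the cache: the next lookup misses,
        // reads the new row from the database, and writes it back to the cache.
        cache.delete("profile:" + userId);
    }
}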
After reading this article I have become a bit confused. For quite some time I have been using Terracotta, but there does seem to be a problem with a *single node bottleneck for writes*. I'm planning to shift to some system which provides peer-to-peer resilience. What do you suggest?
Umesh, there should be no reason you cannot achieve that with Terracotta. Give them a call and get them to explain how you can obtain peer-to-peer type resilience.
What is it that you are working on?
Lee