Entries in Java (39)

Thursday, Aug 14, 2008

Product: Terracotta - Open Source Network-Attached Memory

Update: Evaluating Terracotta by Piotr Woloszyn. Nice writeup that covers resilience, failover, DB persistence, distributed caching implementation, OS/platform restrictions, ease of implementation, hardware requirements, performance, support package, code stability, partitioning, transactions, replication, and consistency. Terracotta is Network Attached Memory (NAM) for Java VMs. It provides up to a terabyte of virtual heap for Java applications that spans hundreds of connected JVMs. NAM is best suited for storing what they call scratch data. Scratch data is defined as object-oriented data that is critical to the execution of a series of Java operations inside the JVM, but may not be critical once a business transaction is complete. The Terracotta Architecture has three components:

  1. Client Nodes - each client node corresponds to a node in the cluster and runs on a standard JVM
  2. Server Cluster - a Java process that provides the clustering intelligence. The current Terracotta implementation operates in an Active/Passive mode
  3. Storage, used as:
    • Virtual Heap storage - as objects are paged out of the client nodes into the server; if the server heap fills up, objects are paged onto disk.
    • Lock Arbiter - to rule out the classic "split-brain" problem, Terracotta relies on the disk infrastructure to provide a lock.
    • Shared Storage - to transmit object state from the active to the passive server(s), objects are persisted to disk, which then shares the state with the passive server(s).
JVM-level clustering can turn single-node, multi-threaded apps into distributed, multi-node apps, often with no code changes. This is possible by plugging into the Java Memory Model in order to maintain key Java semantics of pass-by-reference, thread coordination, and garbage collection across the cluster. Terracotta enables this using only declarative configuration with minimal impact to existing code, and provides fine-grained, field-level replication, which means your objects no longer need to implement Java serialization. Ari Zilka, the founder and CTO of Terracotta, had a video session organized by Skills Matter. In it he shows how it works and how you can start clustering your POJO-based web applications (based on Spring, Struts, Wicket, RIFE, EHCache, Quartz, Lucene, DWR, Tomcat, JBoss, Jetty, Geronimo, etc.).
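
To make the "no code changes" claim concrete, here is a hedged sketch (all class, field, and config names are invented for illustration, not taken from the article): a plain thread-safe POJO that Terracotta could cluster by declaring its map field as a root in tc-config.xml. Nothing Terracotta-specific appears in the code itself.

```java
// Hypothetical sketch: a plain multi-threaded store that Terracotta could
// cluster declaratively. The "sessions" field would be declared as a
// clustered root in tc-config.xml; the code contains no Terracotta API calls.
import java.util.HashMap;
import java.util.Map;

public class SharedSessionStore {
    // If declared as a Terracotta root, every JVM in the cluster would see
    // the same logical map instance (field-level replication, no Serializable).
    private final Map<String, String> sessions = new HashMap<String, String>();

    public void put(String sessionId, String data) {
        // Ordinary JVM synchronization; Terracotta extends these locks
        // across the cluster, preserving Java Memory Model semantics.
        synchronized (sessions) {
            sessions.put(sessionId, data);
        }
    }

    public String get(String sessionId) {
        synchronized (sessions) {
            return sessions.get(sessionId);
        }
    }
}
```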


Tuesday, Jul 29, 2008

Ehcache - A Java Distributed Cache 

Ehcache is a pure Java cache with the following features: fast, simple, small footprint, minimal dependencies, provides memory and disk stores for scalability into gigabytes, scales to hundreds of caches, is a pluggable cache for Hibernate, is tuned for high concurrent load on large multi-CPU servers, provides LRU, LFU, and FIFO cache eviction policies, and is production tested. Ehcache is used by LinkedIn to cache member profiles. The user guide says it's possible to get a 2.5 times system speedup for persistent Object Relational Caching, a 1000 times system speedup for Web Page Caching, and a 1.6 times system speedup for Web Page Fragment Caching.

From the website:

Introduction. Ehcache is a cache library. Before getting into Ehcache, it is worth stepping back and thinking about caching generally.

About Caches. Wiktionary defines a cache as "a store of things that will be required in future, and can be retrieved rapidly". That is the nub of it. In computer science terms, a cache is a collection of temporary data which either duplicates data located elsewhere or is the result of a computation. Once in the cache, the data can be repeatedly accessed inexpensively.

Why caching works: Locality of Reference. While Ehcache concerns itself with Java objects, caching is used throughout computing, from CPU caches to the DNS system. Why? Because many computer systems exhibit locality of reference: data that is near other data or has just been used is more likely to be used again.

The Long Tail. Chris Anderson of Wired Magazine coined the term "The Long Tail" to refer to e-commerce systems: the idea that a small number of items may make up the bulk of sales, a small number of blogs might get the most hits, and so on. While there is a small list of popular items, there is a long tail of less popular ones. The Long Tail is itself a vernacular term for a Power Law probability distribution. These don't just appear in e-commerce, but throughout nature. One form of a Power Law distribution is the Pareto distribution, commonly known as the 80:20 rule. This phenomenon is useful for caching: if 20% of objects are used 80% of the time and a way can be found to reduce the cost of obtaining that 20%, then system performance will improve.

Will an Application Benefit from Caching? The short answer is that it often does, due to the effects noted above. The medium answer is that it often depends on whether the application is CPU bound or I/O bound. If an application is I/O bound, then the time taken to complete a computation depends principally on the rate at which data can be obtained. If it is CPU bound, then the time taken principally depends on the speed of the CPU and main memory. While the focus for caching is on improving performance, it is also worth realizing that it reduces load. The time it takes something to complete is usually related to the expense of it, so caching often reduces load on scarce resources.

Speeding up CPU bound Applications. CPU bound applications are often sped up by: improving algorithm performance, parallelizing the computations across multiple CPUs (SMP) or multiple machines (clusters), or upgrading the CPU speed. The role of caching, if there is one, is to temporarily store computations that may be reused again. An example from Ehcache would be large web pages that have a high rendering cost. Another is caching of authentication status, where authentication requires cryptographic transforms.
Speeding up I/O bound Applications. Many applications are I/O bound, either by disk or network operations. In the case of databases they can be limited by both. There is no Moore's Law for hard disks: a 10,000 RPM disk was fast 10 years ago and is still fast. Hard disks are speeding up by using their own caching of blocks into memory. Network operations can be bound by a number of factors: time to set up and tear down connections; latency, or the minimum round trip time; throughput limits; and marshalling and unmarshalling overhead. The caching of data can often help a lot with I/O bound applications. Some examples of Ehcache uses are Data Access Object caching for Hibernate, and web page caching for pages generated from databases.

Increased Application Scalability. The flip side of increased performance is increased scalability. Say you have a database which can do 100 expensive queries per second; beyond that it backs up, and if more connections are added to it, it slowly dies. In this case, caching may be able to reduce the workload required. If caching can turn 90 of those 100 queries into cache hits that never even reach the database, then the database can scale 10 times higher than otherwise.

How much will an application speed up with Caching? The short answer is that it depends on a multitude of factors: how many times a cached piece of data can be and is reused by the application, and the proportion of the response time that is alleviated by caching. In applications that are I/O bound, which is most business applications, most of the response time is spent getting data from a database. Therefore the speedup mostly depends on how much reuse a piece of data gets. In a system where each piece of data is used just once, it is zero. In a system where data is reused a lot, the speedup is large. The long answer, unfortunately, is complicated and mathematical, and is considered next.
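
As a concrete taste of the library, here is a minimal sketch using the classic net.sf.ehcache API. The cache name and key are invented (echoing the LinkedIn member-profile use case above), and a matching cache definition is assumed to exist in ehcache.xml.

```java
import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;

public class ProfileCacheExample {
    public static void main(String[] args) {
        // Creates a manager from the default ehcache.xml on the classpath.
        CacheManager manager = CacheManager.create();

        // Assumes a cache named "memberProfiles" is configured in ehcache.xml
        // (name is illustrative, not from the article).
        Cache cache = manager.getCache("memberProfiles");

        // Put and get are the whole story for basic use.
        cache.put(new Element("member:42", "serialized profile data"));

        Element hit = cache.get("member:42");
        if (hit != null) {
            System.out.println(hit.getObjectValue()); // served from memory
        }
        manager.shutdown();
    }
}
```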

Related Articles

  • Caching Category on High Scalability
  • Product: Memcached
  • Manage a Cache System with EHCache


Tuesday, May 27, 2008

    eBay Architecture

Update 2: eBay's Randy Shoup spills the secrets of how to service hundreds of millions of users and over two billion page views a day in Scalability Best Practices: Lessons from eBay on InfoQ. The practices: Partition by Function, Split Horizontally, Avoid Distributed Transactions, Decouple Functions Asynchronously, Move Processing to Asynchronous Flows, Virtualize at All Levels, Cache Appropriately. Update: eBay Serves 5 Billion API Calls Each Month. Aren't we seeing more and more traffic driven by mashups composed on top of open APIs? APIs are no longer a bolt-on; they are your application. Architecturally, that argues for implementing your own application around the same APIs developers and users employ. Who hasn't wondered how eBay does its business? As one of the largest, most loaded websites in the world, it can't be easy. The subtitle of the presentation hints at how creating such a monster system requires true engineering: striking a balance between site stability, feature velocity, performance, and cost. You may not be able to emulate how eBay scales their system, but the issues and possible solutions are worth learning from. Site: http://ebay.com

    Information Sources

  • The eBay Architecture - Striking a balance between site stability, feature velocity, performance, and cost.
  • Podcast: eBay’s Transactions on a Massive Scale
  • Dan Pritchett on Architecture at eBay interview by InfoQ

    Platform

  • Java
  • Oracle
  • WebSphere, servlets
  • Horizontal Scaling
  • Sharding
  • Mix of Windows and Unix

    What's Inside?

    This information was adapted from Johannes Ernst's Blog

    The Stats

  • On an average day, it runs through 26 billion SQL queries and keeps tabs on 100 million items available for purchase.
  • 212 million registered users, 1 billion photos
  • 1 billion page views a day, 105 million listings, 2 petabytes of data, 3 billion API calls a month
  • Something like a factor of 35 in page views, e-mails sent, bandwidth from June 1999 to Q3/2006.
  • 99.94% availability, measured as "all parts of site functional to everybody" vs. at least one part of a site not functional to some users somewhere
  • The database is virtualized and spans 600 production instances residing in more than 100 server clusters.
  • 15,000 application servers, all J2EE. About 100 groups of functionality aka "apps". Notion of a "pool": "all the machines that deal with selling"...

    The Architecture

  • Everything is planned with the question "what if load increases by 10x". Scaling only horizontal, not vertical: many parallel boxes.
  • The architecture is strictly divided into layers: data tier, application tier, search, operations.
  • Leverages MSXML framework for presentation layer (even in Java)
  • Oracle databases, WebSphere Java (still 1.3.1)
  • Split databases by primary access path, modulo on a key (a minimal routing sketch appears after this list).
  • Every database has at least 3 on-line copies, distributed over 8 data centers.
  • Some database copies run 15 minutes behind, others 4 hours behind.
  • Databases are segmented by function: user, item, account, feedback, transaction; over 70 in all.
  • No stored procedures are used. There are some very simple triggers.
  • CPU-intensive work is moved out of the database layer to the application layer: referential integrity, joins, and sorting are done in the application layer! Reasoning: app servers are cheap, databases are the bottleneck.
  • No client-side transactions, no distributed transactions.
  • J2EE: use servlets, JDBC, connection pools (with rewrite). Not much else.
  • No state information in application tier. Transient state maintained in cookie or scratch database.
  • App servers do not talk to each other -- strict layering of architecture
  • Search, in 2002: 9 hours to update the index running on largest Sun box available -- not keeping up.
  • Average item on site changes its search data 5 times before it is sold (e.g. price), so real-time search results are extremely important.
  • "Voyager": real-time feeder infrastructure built by eBay.. Uses reliable multicast from primary database to search nodes, in-memory search index, horizontal segmentation, N slices, load-balances over M instances, cache queries.

    Lessons Learned

  • Scale Out, Not Up – Horizontal scaling at every tier. – Functional decomposition.
  • Prefer Asynchronous Integration – Minimize availability coupling. – Improve scaling options.
  • Virtualize Components – Reduce physical dependencies. – Improve deployment flexibility.
  • Design for Failure – Automated failure detection and notification. – “Limp mode” operation of business features.
  • Move work out of the database into the applications because the database is the bottleneck. eBay does this in the extreme. We see it in other architectures using caching and the file system, but eBay even does a lot of traditional database operations in the application, like joins (a sketch of an application-level join follows this list).
  • Use what you like and toss what you don't need. eBay didn't feel compelled to use the full-blown J2EE stack. They liked Java and servlets, so that's all they used. You don't have to buy into any framework completely. Just use what works for you.
  • Don't be afraid to build solutions that meet and evolve with your needs. Every off-the-shelf solution will fail you at some point. You have to go the rest of the way on your own.
  • Operational controls become a larger and larger part of scalability as you grow. How do you upgrade, configure, and monitor thousands of machines while running a live system?
  • Architectures evolve. You need to be able to change, refine, and develop your new system while keeping your existing site running. That's the primary challenge of any growing website.
  • It's a mistake to worry too much about scalability from the start. Don't suffer from paralysis by analysis and worrying about traffic that may never come.
  • It's also a mistake not to worry about scalability at all. You need to develop an organization capable of dealing with architecture evolution. Understand you are never done. Your system will always evolve and change. Build those expectations and capabilities into your business from the start. Don't let people and organizations be the reason your site fails. Many people will think the system should be perfect from the start. It doesn't work that way. A good system is developed over time in response to real issues and concerns. Expect change and adapt to change.
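
To illustrate the "joins in the application layer" point above, here is a hedged sketch. The table, column, and method names are invented; the point is the shape: two simple single-table queries against separate databases, stitched together in memory instead of with a SQL join.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AppLayerJoin {
    /** Joins items to seller names in application code, not in the database. */
    public static List<String> itemsWithSellers(Connection itemDb, Connection userDb)
            throws SQLException {
        // Pass 1: plain single-table scan of the item database.
        List<Object[]> items = new ArrayList<Object[]>();
        PreparedStatement ps = itemDb.prepareStatement("SELECT seller_id, title FROM items");
        ResultSet rs = ps.executeQuery();
        while (rs.next()) {
            items.add(new Object[] { rs.getLong(1), rs.getString(2) });
        }
        rs.close();
        ps.close();

        // Pass 2: look up each distinct seller in the (separate) user database.
        Map<Long, String> sellerNames = new HashMap<Long, String>();
        PreparedStatement lookup = userDb.prepareStatement("SELECT name FROM users WHERE id = ?");
        for (Object[] row : items) {
            Long sellerId = (Long) row[0];
            if (!sellerNames.containsKey(sellerId)) {
                lookup.setLong(1, sellerId);
                ResultSet r = lookup.executeQuery();
                sellerNames.put(sellerId, r.next() ? r.getString(1) : "unknown");
                r.close();
            }
        }
        lookup.close();

        // Pass 3: the "join" is a cheap in-memory merge on the app server.
        List<String> result = new ArrayList<String>();
        for (Object[] row : items) {
            result.add(row[1] + " sold by " + sellerNames.get(row[0]));
        }
        return result;
    }
}
```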


Sunday, Feb 24, 2008

    Yandex Architecture

    Update: Anatomy of a crash in a new part of Yandex written in Django. Writing to a magic session variable caused an unexpected write into an InnoDB database on every request. Writes took 6-7 seconds because of index rebuilding. Lots of useful details on the sizing of their system, what went wrong, and how they fixed it. Yandex is a Russian search engine with 3.5 billion pages in their search index. We only know a few fun facts about how they do things, nothing at a detailed architecture level. Hopefully we'll learn more later, but I thought it would still be interesting. From Allen Stern's interview with Yandex's CTO Ilya Segalovich, we learn:

  • 3.5 billion pages in the search index.
  • Over several thousand servers.
  • 35 million searches a day.
  • Several data centers around Russia.
  • Two-layer architecture.
  • The database is split into pieces; when a search is requested, it pulls the bits from the different database servers and brings them together for the user (a scatter-gather sketch follows this list).
  • Languages used: C++, Perl, some Java.
  • FreeBSD is used as their server OS.
  • $72 million in revenue in 2006.
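
The split-database search described above is a classic scatter-gather pattern. Here is a hypothetical sketch; the SearchShard interface and all names are assumptions, since Yandex has published no code.

```java
// Fan a query out to the database shards in parallel, then merge the
// partial results into one answer for the user.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ScatterGatherSearch {
    public interface SearchShard {
        List<String> search(String query); // partial hits from one shard
    }

    private final List<SearchShard> shards;
    private final ExecutorService pool = Executors.newFixedThreadPool(16);

    public ScatterGatherSearch(List<SearchShard> shards) {
        this.shards = shards;
    }

    public List<String> search(final String query) throws Exception {
        // Scatter: one task per shard, executed concurrently.
        List<Future<List<String>>> futures = new ArrayList<Future<List<String>>>();
        for (final SearchShard shard : shards) {
            futures.add(pool.submit(new Callable<List<String>>() {
                public List<String> call() {
                    return shard.search(query);
                }
            }));
        }
        // Gather: merge the partial results for the user.
        List<String> merged = new ArrayList<String>();
        for (Future<List<String>> f : futures) {
            merged.addAll(f.get());
        }
        return merged;
    }
}
```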


Thursday, Jan 24, 2008

    Mailinator Architecture

Update: A fun exploration of applied searching in How to search for the word "pen1s" in 185 emails every second. When indexOf doesn't cut it you just trie harder. Has a drunken friend ever inspired you to create a first-of-its-kind internet service that is loved by millions, deemed subversive by thousands, all while handling over 1.2 billion emails a year on one rickety old server? That's how Paul Tyma came to build Mailinator. Mailinator is a free no-setup web service for thwarting evil spammers by creating throw-away registration email addresses. If you don't give web sites your real email address, they can't spam you. They spam Mailinator instead :-) I love design with a point of view, and Mailinator has a big giant hairy one: performance first, second, and last. Why? Because Mailinator is free, and that allows Paul to showcase his different perspective on design. While competitors buy big iron to handle load, Paul uses a big idea instead: pick the right problem and create a design to fit the problem. No more. No less. The result is a perfect system architecture sonnet, beauty within the constraints of form. How does Mailinator carry out its work as a spam-busting superhero? Site: http://mailinator.com/

    Information Sources

  • The Architecture of Mailinator
  • Mailinator's 2006 Stats

    The Platform

  • Linux
  • Tomcat
  • Java

    The Stats

  • Will process an estimated 1.29 BILLION emails for 2007. 450.74 million in 2006. 280.68 million in 2005.
  • Peak rate of 6.5 million emails/day or 4513/min or 75/sec.
  • Mailinator runs on a very modest machine with an AMD 2Ghz Athlon processor, 1GB of RAM (much less is used), and a low-performance 80G IDE hard drive. And the machine is not very busy at all.
  • Mailinator runs for months unattended and very few emails are lost, even under constant spam attacks and high peak loads.

    The Architecture

  • Having a free system means the system doesn't have to be perfect, so the design goals are:
    - Design a system that values survival above all else, even users. Survival is key because Mailinator must fight off attacks on a daily basis.
    - Provide 99.99% uptime and accuracy for users. Higher uptime goals would be impractical and costly, and since the service is free, this is just part of the rules of the game for users.
    - Support the following service model: a user signs up for something, goes to Mailinator, clicks on the subscription link, and forgets about it. This means email doesn't have to be stored persistently on disk. Email can reside in RAM because it is temporary (3-4 hours). If you want a real mailbox, use another service.
  • The original flow of email handling was:
    - Sendmail received email into a single on-disk mailbox.
    - The Java-based Mailinator grabbed emails using IMAP and/or POP (it changed over time) and deleted them.
    - The system then loaded all emails into memory and let them sit there.
    - The oldest email was pushed out once the 20,000 in-memory limit was reached.
  • The original architecture worked well:
    - It was stable and stayed up for months at a time.
    - It used almost all of the 1GB of RAM.
    - Problems started when the incoming email rate surpassed 800,000 a day. The system broke down because of disk contention between Mailinator and the email subsystem.
  • The New Architecture:
    - The idea was to remove the path through the disk, which was accomplished with a complete system rewrite.
    - The web application, the email server, and all email storage run in one JVM.
    - Sendmail was replaced with a custom-built SMTP server. Because of the nature of Mailinator, a full SMTP server was not necessary: Mailinator does not need to send email, and its primary duty is to accept or reject email as fast as possible. This is the downside of layering. Layering is very often given as a key strategy in scaling, but it can kill performance because crucial decisions are best handled at the highest levels of the stack, so work flows through the system only to be dumped at the lower layers after many of the RAM and cycle-stealing operations have already been performed. So the decision to go with a custom SMTP server is an interesting and brave one. Most people at this point would just add more hardware, and they wouldn't be wrong, but it's interesting to see this path taken as well. Maybe with more DOM and AOP-like architectures we can flatten the stack and get better performance when needed.
    - Now Mailinator receives an email directly, parses it, and stores it into memory. The disk is bypassed completely and remains fairly idle.
    - Emails are written to disk only when the system is coming down, so they can be reloaded on startup.
    - Logging was shut off to remove the risk of subpoenas. When logging was performed, log data was written in batches so several thousand log lines would be written in one disk write. This minimized disk contention at the risk of losing helpful diagnostic information.
    - The system uses under 300 threads. More aren't needed.
    - On arrival each email passes through a filter system and is stored in RAM if all filters are passed.
    - Every inbox is limited to only 10 emails so popular inboxes, like joe@mailinator.com, can't blow up the system.
    - No incoming email can be over 100k, and all attachments are immediately discarded. This saves RAM.
  • Emails are compressed in RAM:
    - Since 99% of emails are never looked at, compressed email saves RAM. Emails are only ever decompressed when someone looks at them.
    - Mailinator can store about 80,000 emails in RAM using under 300MB, compared to the 20,000 emails stored in 1GB of RAM in the original design.
    - With this pool the average email lifespan is about 3-4 hours.
    - It's likely 200,000 emails could fit in memory, but there hasn't been a real need.
    - This is one of the design details I love because it's based on real application usage patterns. RAM is precious and CPU is not, so use compression to save RAM at the expense of CPU, knowing you won't have to take the CPU hit twice most of the time.
  • Mailinator does not guarantee anonymity and privacy:
    - There is no privacy. Anyone can read any inbox at any time.
    - Relaxing these constraints, while shocking, makes the design much simpler.
    - For the user it is simple because there is no sign-up needed. When a web site asks you for an email address, you can just enter a Mailinator address. You don't need to create a separate account; typing in the email address effectively creates the Mailinator account. Simple.
    - In practice users still get a high level of privacy.
  • The goal of survivability leads to aggressive SPAM filtering:
    - Mailinator doesn't have anything against SPAM, but because it gets so much of it, SPAM must be filtered out when it threatens the uptime of the system.
    - Which leads to this rule: if you do anything (spammer or not) that starts affecting the system, your emails will be refused and you may be locked out.
  • To be accepted, an email must pass the following filter chain:
    - Bounce: all bounced emails are dropped.
    - IP: too much email from a single IP is dropped.
    - Subject: too much email on the same subject is dropped.
    - Potty: subjects containing words that indicate hate or crimes or just downright nastiness are dropped.
  • Surviving email floods from a single IP address:
    - An AgingHashmap is used to filter out spammers from a particular IP address. When an email arrives from an IP address, the IP is put in the map and a counter is increased for all subsequent emails.
    - After a certain period of time with no emails, the counter is cleared.
    - When a sender reaches a threshold email count, the sender is blocked. This prevents a sender from flooding the system.
    - Many systems use this sort of logic to protect all sorts of resources, like comments. You can use memcached for the same purpose in a distributed system. (A minimal sketch of the aging-map idea appears after this list.)
  • Protecting against zombie attacks:
    - Spam can be sent from large coordinated sets of different IP addresses, called zombie networks. The same message is sent from thousands of different IP addresses, so the techniques for stopping email from a single IP address are not sufficient.
    - This filtering is a little more complex than IP blocking because you have to parse enough of the email to get the subject line, and matching subject strings is a little more resource intensive.
    - When something like 20 emails with the same subject arrive within 2 minutes, all emails with that subject are banned for 1 hour.
    - Interestingly, subjects are not banned forever because that would mean Mailinator would have to track subjects forever, and the system design is inherently transient. This is pretty clever, I think. At the cost of a few "bad" emails getting through, the system is much simpler because no persistent list must be managed, and that list surely would become a bottleneck. A system with more stringent SPAM filtering goals would have to create a much more complex and less robust architecture.
    - Nearly 9% of emails are blocked with this filter.
    - From my reading, Mailinator filters only on IP and subject, so it doesn't have to read the email body to accept or reject the email. This minimizes resource usage when most email will be rejected.
  • To lessen the danger from DOS attacks:
    - All connections that are silent for a specific period of time are dropped.
    - Mailinator sends replies to email senders very slowly, like 10 or 20 or 30 seconds, even for a very small amount of data. This slows down spammers who are trying to send out spam as fast as possible and may make them rethink sending email to that address again. The wait period is reduced during busy periods so email isn't dropped.
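
Here is a minimal sketch of what an AgingHashmap-style filter could look like. The thresholds, names, and synchronization policy are assumptions; the article doesn't show Mailinator's code, only the idea: count per IP, forget quiet senders, block flooders.

```java
import java.util.HashMap;
import java.util.Map;

public class AgingIpFilter {
    private static final int BLOCK_THRESHOLD = 100;      // emails per window (assumed)
    private static final long QUIET_PERIOD_MS = 60000L;  // reset after a quiet minute (assumed)

    private static class Entry { int count; long lastSeen; }

    private final Map<String, Entry> counts = new HashMap<String, Entry>();

    /** Returns true if mail from this IP should be accepted. */
    public synchronized boolean accept(String ip, long now) {
        Entry e = counts.get(ip);
        if (e == null || now - e.lastSeen > QUIET_PERIOD_MS) {
            e = new Entry();            // aging: quiet senders are simply forgotten
            counts.put(ip, e);
        }
        e.count++;
        e.lastSeen = now;
        // Flooders are refused but never remembered forever; the map stays small
        // and no persistent block list has to be managed.
        return e.count <= BLOCK_THRESHOLD;
    }
}
```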

    Lessons Learned

  • Perfection is a trap. How many systems are made much more complicated by the drive to be 100% everything? If you've been in those meetings, you know what they are like: oh, we can't do it this way or that way because there's a .01% chance of something going wrong. Instead ask: how imperfect can you be and still be good enough?
  • What you throw out is as important as what you keep in. We have many preconceptions of how to design systems. We may take for granted that you need to scale out, that you need to have email accessible days later, and that you must provide private accounts for everyone. But do you really need these things? What can you toss?
  • Know the purpose of your system and design accordingly. Being everything to everyone means you are nothing to nobody. Keeping emails for a short period of time, allowing some SPAM to get through, and accepting less than 100% uptime create a strong vision for the system that helps drive the design in all areas. You would only build your own SMTP server if you had a very strong idea of what your system was about and what you needed. I know this would never have occurred to me as an idea. I would have added more hardware.
  • Fail fast for the common case before committing resources. A high percentage of email is rejected, so it makes sense to reject it as early as possible in the stack to minimize the resources needed to accomplish the task. Figure out how to short-circuit frequently failed items as fast as possible. This is an important and often overlooked scaling strategy.
  • Efficiency often means build it yourself. Off the shelf tools tend to do the whole job. If you only need part of the job done you may be able to write a custom component that runs much faster.
  • Adaptively forget. A little failure is OK. All the blocked IP addresses don't need to be remembered forever. Let the block decisions build up from local data rather than global state. This is powerfully simple and robust architecture.
  • Java doesn't have to be slow. Enough said.
  • Avoid the disk. Many applications need to hit the disk, but the disk is always a bottleneck. Can you design around the disk using other creative strategies?
  • Constrain resource usage. Put in constraints, like inbox size, that will keep your system from spiking uncontrollably. Unconstrained resource usage must be avoided with limited resources.
  • Compress data. Compression can be a major win when trying to conserve RAM. I've seen memory usage drop by more than half when using compression with very little overhead. If you are communicating locally, just have the client encode the data and keep it encoded. Build APIs to access the data without having to decode the full message. (A small gzip sketch follows this list.)
  • Use fixed size resource pools to handle load. Many applications don't control resource usage, like memory, and they crash when too much is used. To create a really robust system fix your resources and drop work when those resources are full. You can age resources, give priority access, give fair access, or use any other logic to arbitrate resource access, but because the resource will be limited, you will stay up under load.
  • If you don't keep data it can't be subpoenaed. Because Mailinator doesn't store email or logs on disk, nothing can be subpoenaed.
  • Use what you know. We've seen this lesson a few times. Paul knew Java better than anything else, so he used it, made it work, and got the job done.
  • Find your own Mailinators. Sure, Mailinator is a small system. In a large system it would just be a small feature, but your system is composed of many Mailinator sized projects. What if you developed some of those like Mailinator?
  • KISS exists, though it's rare. Keeping it simple is always talked about, but rarely are we shown real examples. It's mostly just "your way is complex and my way is simple because it's my way." Mailinator is a good example of simple design.
  • Robustness is a function of architecture. To create a design that efficiently uses memory and survives massive spam attacks required an architectural approach that looked at the entire stack.
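
As a small illustration of the compress-and-keep-encoded lesson above, here is a sketch using java.util.zip. Storing the returned byte[] and decompressing only when someone reads the message mirrors the approach described in the architecture section; class and method names are invented.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CompressedEmailStore {
    /** Compress once on arrival; keep only the compressed bytes in RAM. */
    public static byte[] compress(String body) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        GZIPOutputStream gzip = new GZIPOutputStream(bytes);
        gzip.write(body.getBytes("UTF-8"));
        gzip.close(); // finishes the gzip stream
        return bytes.toByteArray();
    }

    /** Decompress only when a reader actually opens the mail (~1% of the time). */
    public static String decompress(byte[] stored) throws IOException {
        GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(stored));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = gzip.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toString("UTF-8");
    }
}
```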

    Related Articles

  • PlentyOfFish champions straightforward, bare-bones simplicity.
  • Varnish smartly uses OS features to find incredible performance.
  • ThemBid gracefully pieces together open source components.


Monday, Dec 10, 2007

    Future of EJB3 !! ??

What is the future of EJB3 in the industry, given the current trends? There are a lot of arguments about EJB3 being heavyweight... Also, what are the alternatives to EJB3? How about scalability, persistence, performance, and other factors?


Sunday, Dec 2, 2007

    Database-Clustering: a8cjdbc - update: version 1.3

The new version of a8cjdbc removes some limitations: Clobs and Blobs are now supported, and there are some fixes for handling binary data. The version was also fully tested with Postgres and MySQL. Since version 1.3 there is also a free trial version available for download. Check it out and test it yourself... Take a look at: http://www.activ8.at/homepage/en/a8cjdbc.php I've downloaded the latest version and set up an environment with one virtual database and two database backends. I tried to make a "non-real-life scenario": the first backend was a Postgres node, the second was a MySQL node. Everything works fine - failover, recovery log, etc. - with two different backend database types. So check out the trial version, test the clustered driver yourself, and give me some results about your experience with a8cjdbc. As I only tested MySQL and Postgres (and the non-real-life scenario with two different backend types) - maybe someone else has experience with other databases? greetings, Wolfgang



Monday, Nov 19, 2007

    Tailrank Architecture - Learn How to Track Memes Across the Entire Blogosphere

Ever feel like the blogosphere is 500 million channels with nothing on? Tailrank finds the internet's hottest channels by indexing over 24M weblogs and feeds per hour. That's 52TB of raw blog content (no, not sewage) a month, and requires continuously processing 160Mbits/second of IO. How do they do that? This is an email interview with Kevin Burton, founder and CEO of Tailrank.com. Kevin was kind enough to take the time to explain how they scale to index the entire blogosphere.

    Sites

  • Tailrank - We track the hottest news in the blogosphere!
  • Spinn3r - A blog spider you can customize with your own behavior instead of writing your own crawler.
  • Kevin Burton's Blog - his blog is an interesting mix of politics and technical talk; both are always worth reading.

    Platform

  • MySQL
  • Java
  • Linux (Debian)
  • Apache
  • Squid
  • PowerDNS
  • DAS storage.
  • Federated database.
  • ServerBeach hosting.
  • Job scheduling system for work distribution.

    Interview

  • What is your system for? Tailrank was originally a memetracker built to track the hottest news being discussed within the blogosphere. We started having a lot of requests to license our crawler and we shipped that in the form of Spinn3r about 8 months ago. Spinn3r is a self-contained crawler for companies that want to index the full blogosphere and consumer-generated media. Tailrank is still a very important product alongside Spinn3r, and we're working on Tailrank 3.0, which should be available in the future. No ETA at the moment, but it's actively being worked on.
  • What particular design/architecture/implementation challenges does your system have? The biggest challenge we have is the sheer amount of data we have to process and keeping that data consistent within a distributed system. For example, we process 52TB of content per month. This has to be indexed in a highly available storage architecture, so the normal distributed database problems arise.
  • What did you do to meet these challenges? We've spent a lot of time building out a distributed system that can scale and handle failure. For example, we've built a tool called Task/Queue that is analogous to Google's MapReduce. It has a centralized queue server which hands out units of work to robots which make requests. It works VERY well for crawlers in that slower machines just fetch work at a slower rate while more modern (or better tuned) machines request work at a higher rate. This neatly sidesteps one of the main distributed computing fallacies: that the network is homogeneous. Task/Queue is generic enough that we could actually use it to implement MapReduce on top of the system. We'll probably open source it at some point. Right now it has too many tentacles wrapped into other parts of our system. (A toy sketch of the pull-based queue idea appears after this interview.)
  • How big is your system? We index 24M weblogs and feeds per hour and process content at about 160-200Mbps. At the raw level we're writing to our disks at about 10-15MBps continuously.
  • How many documents do you serve? How many images? How much data? Right now the database is about 500G. We're expecting it to grow well beyond this in 2008 as we expand our product offering.
  • What is your rate of growth? It's mostly a function of customer feature requests. If our customers want more data we sell it to them. In 2008 we're planning on expanding our cluster to index larger portions of the web and consumer generated media.
  • What is the architecture of your system? We use Java, MySQL and Linux for our cluster. Java is a great language for writing crawlers. The library support is pretty solid (though it seems like Java 7 is going to be killer when they add closures). We use MySQL with InnoDB. We're mostly happy with it, though it seems I end up spending about 20% of my time fixing MySQL bugs and limitations. Of course nothing is perfect. MySQL, for example, was really designed to be used on single-core systems. The MySQL 5.1 release goes a bit farther to fix multi-core scalability locks. I recently blogged about how the new multi-core machines should really be considered N machines instead of one logical unit: Distributed Computing Fallacy #9.
  • How is your system architected to scale? We use a federated database system so that we can split the write load as we see more IO. A lot of our infrastructure has been released as open source, and more will probably follow. We've already opened up a lot of our infrastructure code:
  • http://code.tailrank.com/lbpool - load balancing JDBC driver for use with DB connection pools.
  • http://code.tailrank.com/feedparser - Java RSS/Atom parser designed to elegantly support all versions of RSS
  • http://code.google.com/p/benchmark4j/ - Java (and UNIX) equivalent of Windows' perfmon
  • http://code.google.com/p/spinn3r-client/ - Client bindings to access the Spinn3r web service
  • http://code.google.com/p/mysqlslavesync/ - Clone a MySQL installation and setup replication.
  • http://code.google.com/p/log5j/ - Logger facade that supports printf style message format for both performance and ease of use.
  • How many servers do you have? About 15 machines so far. We've spent a lot of time tuning our infrastructure so it's pretty efficient. That said, building a scalable crawler is not an easy task so it does take a lot of hardware. We're going to be expanding FAR past this in 2008 and will probably hit about 2-3 racks of machines (~120 boxes).
  • What operating systems do you use? Linux via Debian Etch on 64 bit Opterons. I'm a big Debian fan. I don't know why more hardware vendors don't support Debian. Debian is the big secret in the valley that no one talks about. Most of the big web 2.0 shops like Technorati, Digg, etc use Debian.
  • Which web server do you use? Apache 2.0. Lighttpd is looking interesting as well.
  • Which reverse proxy do you use? About 95% of the pages of Tailrank are served from Squid.
  • How is your system deployed in data centers? We use ServerBeach for hosting. It's a great model for small to medium sized startups: they rack the boxes, maintain inventory, handle the network, etc., and we just buy new machines and pay a flat markup. I wish Dell, Sun, and HP would sell directly to clients in this manner. We use one data center right now and are looking to expand into two for redundancy.
  • What is your storage strategy? Directly attached storage. We buy two SATA drives per box and set them up in RAID 0. We use the redundant array of inexpensive databases solution so if an individual machine fails there's another copy of the data on another box. Cheap SATA disks rule for what we do. They're cheap, commodity, and fast.
  • Do you have a standard API to your website? Tailrank has RSS feeds for every page. The Spinn3r service is itself an API and we have extensive documentation on the protocol. It's also free to use for researchers, so if any of your readers are pursuing a Ph.D. and generally doing research work and need access to blog data, we'd love to help them out. We already have Ph.D. students at the University of Washington and the University of Maryland (my alma mater) using Spinn3r.
  • Which DNS service do you use? PowerDNS. It's a great product. We only use the recursor daemon, but it's FAST. It uses async IO, though, so it doesn't really scale across processors on multicore boxes. Apparently there's a hack to get it to run across cores, but it isn't very reliable. AAAA caching might be broken though. I still need to look into this.
  • Who do you admire? Donald Knuth is the man!
  • How are you thinking of changing your architecture in the future? We're still working on finishing up a fully sharded database. MySQL fault tolerance and autopromotion is also an issue.
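
To make the pull-based Task/Queue idea from the interview concrete, here is a toy sketch modeling it with an in-process BlockingQueue. The real Task/Queue is a networked service and none of these names come from Tailrank; the point is only the pull semantics, where a faster robot naturally takes more work.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class TaskQueueDemo {
    public static void main(String[] args) throws InterruptedException {
        // The "centralized queue server", modeled as an in-process queue.
        final BlockingQueue<String> queue = new LinkedBlockingQueue<String>();
        for (int i = 0; i < 100; i++) {
            queue.put("http://example.com/feed/" + i); // units of crawl work
        }

        // Robots *pull* work; a slow robot simply comes back less often,
        // so heterogeneous machines balance themselves without coordination.
        for (int robot = 0; robot < 4; robot++) {
            final int id = robot;
            new Thread(new Runnable() {
                public void run() {
                    String task;
                    while ((task = queue.poll()) != null) {
                        System.out.println("robot " + id + " crawls " + task);
                    }
                }
            }).start();
        }
    }
}
```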


Tuesday, Nov 13, 2007

    Flickr Architecture

    Update: Flickr hits 2 Billion photos served. That's a lot of hamburgers.

Flickr is both my favorite bird and the web's leading photo sharing site. Flickr has an amazing challenge: they must handle a vast sea of ever-expanding new content, ever-increasing legions of users, and a constant stream of new features, all while providing excellent performance. How do they do it?

    Site: http://www.flickr.com

    Information Sources

  • Flickr and PHP (an early document)
  • Capacity Planning for LAMP
  • Federation at Flickr: Doing Billions of Queries a Day by Dathan Pattishall.
  • Building Scalable Web Sites by Cal Henderson from Flickr.
  • Database War Stories #3: Flickr by Tim O'Reilly
  • Cal Henderson's Talks. A lot of useful PowerPoint presentations.

    Platform

    Click to read more ...