Entries in Product (120)

Friday
Nov 16, 2007

Product: lbpool - Load Balancing JDBC Pool

From the website: The lbpool project provides a load balancing JDBC driver for use with DB connection pools. It wraps a normal JDBC driver, providing reconnect semantics in the event of additional hardware availability, partial system failure, or uneven load distribution. It also evenly distributes all new connections among slave DB servers in a given pool. Each time connect() is called it will attempt to use the best server with the least system load.

The biggest scalability issue with large applications that are mostly READ bound is the number of transactions per second that the disks in your cluster can handle. You can generally solve this in two ways:

1. Buy bigger and faster disks with expensive RAID controllers.
2. Buy CHEAP hardware with CHEAP disks, but lots of machines.

We prefer the cheap hardware approach, and lbpool allows you to do this. Even if you *did* manage to use cheap hardware, most load balancing hardware is expensive, requires a redundant balancer (in case it fails), and seldom has native support for MySQL. The lbpool driver addresses all of these needs.

The original solution was designed for use within MySQL replication clusters. This generally involves a master server handling all writes with a series of slaves which handle all reads. In this situation we could have hundreds of slaves and lbpool would load balance queries among the boxes. If you need more read performance, just buy more boxes. If any of them fail it won't hurt your application because lbpool will simply block for a few seconds and move your queries over to a new production server.

In this post Kevin Burton of Spinn3r mentions they've been using this product to good effect for handling MySQL replication faults, balancing, and crashed servers.
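Since lbpool presents itself as just another JDBC driver, application code stays driver-agnostic. Below is a rough sketch of what using such a wrapper looks like; the driver class name and URL format here are illustrative assumptions, not lbpool's documented API:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class LbPoolExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical driver class; lbpool wraps the real MySQL driver underneath.
            Class.forName("com.example.lbpool.Driver");

            // Hypothetical URL listing the slave pool; each connect() is expected
            // to pick the least-loaded live slave.
            String url = "jdbc:lbpool:mysql://slave1,slave2,slave3/mydb";

            try (Connection conn = DriverManager.getConnection(url, "reader", "secret");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM articles")) {
                if (rs.next()) {
                    System.out.println("articles: " + rs.getLong(1));
                }
            }
        }
    }

If a slave dies mid-read the driver blocks briefly and reconnects elsewhere, which is why the application needs no failover logic of its own.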


Monday
Nov 12, 2007

a8cjdbc - Database Clustering via JDBC

Practically no software project nowadays could survive without a database (DBMS) backend storing all the business data that is vital to you and/or your customers. When projects grow larger, the amount of data usually grows exponentially. So you start moving the DBMS to a separate server to gain more speed and capacity. Which is all good and healthy, but you do not gain any extra safety for this business data. You might be backing up your database once a day so in case the database server crashes you don't lose EVERYTHING, but how much can you really afford to lose?

Well, clearly this depends on what kind of data you are storing. In our case the users of our solutions use our software products to do their everyday (all day) work. They have "everything" they need for their business stored in the database we are providing. So is 24 hours of data loss acceptable? No, not really. One hour? Maybe. But what we really want is a second database running with the EXACT same data.

We mostly use PostgreSQL, which does not have built-in database replication. There are trigger-based solutions for replicating the data from one database to another. We have learned that setting all this up on an existing database with plenty of tables is rather complicated, and changing the database structure afterwards cannot be done with simple create/alter statements anymore. And since we ARE running solutions that constantly change and improve, we need to be able to deploy updates, including database structure changes, quickly and easily.

So what we really wanted was a transparent JDBC layer that does the replication for us (a sketch of the idea follows the list below). We tested a great solution called "Sequoia", but it is also a rather heavy-weight product with a lot of features that did not really help in the performance department and that we didn't need anyway. What we needed was:

  • a JDBC driver so the application does not know anything about the replication
  • of course: transactional safety for write operations
  • load-balanced reads (we are running 2 database servers, so why waste the ability to do parallel reads from 2 servers and almost multiply the performance by 2?)
  • for backups: the ability to detach one server, do the backup on that machine and then reattach the server
  • automatic and transparent failover / failsafe
  • Fast In-VM-Replication - no serialisation
  • Easy integration
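The heart of such a layer is conceptually simple: fan writes out to every node, round-robin the reads. This is not a8cjdbc's actual API, just a minimal Java sketch of the pattern a transparent replicating driver implements under the covers:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.atomic.AtomicInteger;

    public class ReplicatedConnection {
        private final List<Connection> nodes = new ArrayList<>();
        private final AtomicInteger next = new AtomicInteger();

        public ReplicatedConnection(List<String> urls, String user, String pw) throws SQLException {
            for (String url : urls) {
                nodes.add(DriverManager.getConnection(url, user, pw));
            }
        }

        // Writes must reach every node so the replicas stay identical.
        public void write(String sql) throws SQLException {
            for (Connection c : nodes) {
                try (Statement s = c.createStatement()) {
                    s.executeUpdate(sql);
                }
            }
        }

        // Reads are balanced round-robin - with 2 servers, nearly double the read throughput.
        public ResultSet read(String sql) throws SQLException {
            Connection c = nodes.get(Math.floorMod(next.getAndIncrement(), nodes.size()));
            return c.createStatement().executeQuery(sql);
        }
    }

A real driver additionally has to make the write fan-out transactional (all nodes commit or none do) and support detaching and reattaching a node, which is exactly what the backup and failover bullets above demand.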


Tuesday
Nov 6, 2007

Product: ChironFS

If you are trying to create highly available file systems, especially across data centers, then ChironFS is one potential solution. It's relatively new, so there aren't lots of experience reports, but it looks worth considering. What is ChironFS and how does it work?

Adapted from the ChironFS website: The Chiron Filesystem is a FUSE-based filesystem that frees you from single points of failure. Its main purpose is to guarantee filesystem availability using replication. But it isn't a RAID implementation. RAID replicates DEVICES, not FILESYSTEMS. Why not just use RAID over some network block device? Because it is a block device, and if one server mounts that device in RW mode, no other server will be able to mount it in RW mode. Any real network may have many servers and offer a variety of services. Keeping everything running can become a real nightmare!
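ChironFS itself is a FUSE filesystem written in C, but the replicate-the-filesystem (not the device) idea is easy to picture. Here is a conceptual Java sketch, not ChironFS code: every write lands on all replica paths, and reads fall through to the next replica when one is unavailable:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;

    public class ReplicatedStore {
        private final List<Path> replicas;  // e.g. a local disk and an NFS mount

        public ReplicatedStore(List<Path> replicas) {
            this.replicas = replicas;
        }

        // Replicate the FILE on every backing store, not the block device.
        public void write(String name, byte[] data) throws IOException {
            for (Path root : replicas) {
                Files.write(root.resolve(name), data);
            }
        }

        // Availability comes from falling back to whichever replicas survive.
        public byte[] read(String name) throws IOException {
            for (Path root : replicas) {
                try {
                    return Files.readAllBytes(root.resolve(name));
                } catch (IOException unavailable) {
                    // this replica is down or missing the file; try the next one
                }
            }
            throw new IOException("all replicas failed for " + name);
        }
    }

Working at the filesystem level sidesteps the one-RW-mount-at-a-time restriction that network block devices impose.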


Saturday
Oct 20, 2007

Should you build your next website using 3tera's grid OS?

Update 2: 3tera has added Dynamic Appliances, which are "packaged data center operations like backup, migration or SLAs that users can add to their applications to provide functionality."

Update: In an effort to help cross the chasm of how to start building a website using their grid OS, 3tera is offering their Assured Success Plan. The idea is to provide training, consulting, and support so you can get started with some confidence you'll end up succeeding.

If you are starting or extending a website you have a problem: what technologies should you use? Now there are more answers to that question than ever. One new and refreshingly innovative answer is 3tera's grid OS. In this podcast interview with Bert Armijo from 3tera, we'll learn how 3tera wants to change how you build websites. How? By transforming the physical into the virtual and then allowing the virtual to be manipulated as if it were real. Could I possibly be more abstract? Not really. But when I think of what they are doing, that's the mental model I see whirling around in my mind. Don't worry, I promise we'll drill down to how it can help you in the real world. Let's see how.

I think of 3tera's product as like staying at a nice hotel. At home you are in charge. If something needs doing you must do it. If something breaks you must fix it. But at a nice hotel everything just happens for you. Your room is cleaned, beds are made, outrageously expensive candy bars are replaced in the mini-bar, food arrives when you order it and plates disappear when you are done, and the courtesy mint is placed just so on your pillow. You are free to simply enjoy your stay. All the other details of living just happen.

That's the same sort of experience 3tera is trying to provide for your website. You can concentrate on your application and 3tera, through their GUI on the front-end and their AppLogic grid operating system on the back-end, worries about all the housekeeping. I think Bert summed up their goal wonderfully when he said their aim is to:

Get people's hands off physical boxes and give them a way to define complex infrastructures in a reusable way that they can then instantiate, trade, sell, replicate, back up and manage as individual units. This is what AppLogic does incredibly well.
What they are doing is taking hard physical resources like CPU and storage and decoupling them from their physical sources so you can just order and use them on demand, without worrying how it's done under the covers. This is a trend that has been happening for a while, but their grid OS takes that process to the next level. Your physical co-lo cage is now a private virtual data center. Physical boxes, once lovingly spec'ed, bought, and installed, are now allocated on demand from a phalanx of preconfigured and separately maintained servers. Physical storage, once lovingly pieced together from disks, controllers, and networks, is now allocated from a vast unending sea of virtual storage. Physical load balancers are now programs you can create.

What this means for you is that you can take a website architecture you've drawn up on your white board and simply and quickly create it in a data center. It's all configurable from a GUI. You can bring on 10 new web servers with a simple drag and drop operation. It's basically your white board diagram come to life, only you get to skip all the nasty implementation bits. In the virtual world the nasty non-application-related implementation bits are someone else's problem. 3tera's value proposition is pretty easy to understand:
  • Simplify the data center. You no longer need to locate, outfit, staff, maintain, and support a co-lo space.
  • Simplify operations. A few people can manage a lot of machines.
  • Simplify disaster recovery. Failover is complicated and often doesn't work as planned. With AppLogic your redundant data center is always the same because the virtual data center is copied as a unit. You can pick it up and move it anywhere you want.
  • Simplify the cost model for growth. If you grow how are you going to fund your hardware? Growing on a grid is more agile, incremental, and requires less upfront investment.
  • Simplify your architecture. The grid OS provides a powerful implementation model for how you should structure, grow, and maintain your system. You don't need to code it from scratch or think it up yourself.

In short: customers don't care about your servers. Hardware and the data center do not add value. Your core competency is in your application and running your business, not playing with servers. Well, that's it for the overview. Please listen to this podcast for all the nitty-gritty details. Download audio file (1 hour 16 minutes, mp3).

    Podcast Notes

    I know what you are probably saying. You are saying: "But Todd, the podcast is over an hour long, couldn't you have please made it longer? I have nothing else to do today and I need to waste more time!" What can I say, Bert was very knowledgeable and helpful, and this is a new model for building scalable websites, so I was trying to figure out how I could actually build a website using their product. That takes a lot of questions. I am happy with the result though. I think I have a good picture of how their system works and I think it's well worth investigating if you are in the market for creating or expanding a website. Here are some notes taken from the podcast.
  • They started 3 years ago. At that time nobody could understand what they were trying to build. They have just now been able to build the higher level features, like Smart Appliances, that they wanted to build originally. They've been concentrating on making all the plumbing work.
  • The AppLogic grid operating system lets you take hard infrastructure - servers, load balancers, firewalls, VPNs, all the boxes you need to make a website - and deploy it in a virtual data center.
  • A virtual data center (VDC) is like a cage you would buy from a co-location service except you operate and manage it through a browser. You can be anywhere in the world and you can use hosting services anywhere in the world.
  • An entry level package ranges from $500 to a few thousand a month. The starting point is 4 - 32 CPUs, some amount of storage, and some amount of bandwidth. You add resources as you need to. Overage charges are passed through to you from the data center provider. They don't mark them up.
  • They don't own any servers. They contract with hosting providers' data centers, like Softlayer and Layeredtech, for a uniform set of resources.
  • They offer templates for a scalable virtualized LAMP infrastructure as a starting point for building your own applications.
  • Their GUI shows you the architecture. You don't have to think of physical boxes.
  • There's a controller for the VDC through which you can provision your system.
  • You can still log in to any physical or logical service. You have root access. You can install anything and manage the system, but you don't have to worry about where it physically resides.
  • To create an application: - You use the controller to provision a LAMP cluster. - Then you log into the Apache server and configure it how you wish. - Then restart and it begins to serve. - Say you want 10 front-end web servers. - The load balancer is a virtual load balancer you program. - You use virtual NAS. - Upload code to the NAS. - Then have all Apache servers run off the NAS, so you don't have to log into each one and upload code.
  • Shared storage is part of the virtual data center by definition. You can create as many volumes as you wish. All are mirrored for high availability. If a virtual server goes down AppLogic will simply restart it on another available resource in the data center.
  • Partners build the grid backbone to which nodes and other resources are attached. AppLogic runs that grid backbone. When you sign up for a virtual data center, nodes on the backbone are assigned to your VDC. A controller allows you to provision your VDC. Anything you can do in a co-lo cage you can do here, but there's nothing physical. AppLogic carries out your commands on the grid.
  • They provide standards for the hosting service. A variety of machine classifications are available. They have customers with 50TB of storage. The largest number of CPUs in a single VDC is over 450.
  • To see if the VDC meets your requirements you run a test on the VDC. Once you have resources in your VDC they are not shared with anyone else so you can be confident the performance will be as tested. It's not a VPS. Their customers run production systems. They are all running a business of some sort.
  • Pricing is designed to be attractive for startups, but not artificially low to over-subscribe.
  • Currently there's no data center API. It's scriptable from the CLI. Smart Appliances can package up data center operations into a drag and drop package. You can drag them into any application. Their first Smart Appliance is "follow me", which can move your application to a data center that is close to you. If you are in Asia you can move your data center to Asia. So your data center can follow you around. No coding is needed on your part. Just drag it into your VDC.
  • With AppLogic instead of managing a bunch of different things you manage your application. You do it once. AppLogic maintains the infrastructure for you.
  • In an upgrade of 10 Apache servers you don't upgrade standing infrastructure. You take a copy of your application and upgrade the copy.
  • Let's say you have an Apache server you want to patch. You create one prototype, which they call a class volume. Then when the application restarts all the new changes will be picked up everywhere.
  • The power of what it means to be virtual can be seen in their rollback model. You don't upgrade in-place. You upgrade a copy. Because everything is virtual it's easy to make copies of your entire data center. So you can copy your data center, keep the original running, and switch to the upgraded version. If the upgraded version doesn't work you can roll back to the original version of your VDC. This would be almost impossible using traditional methods. An application is the full state of the application with all its data. So you are operating a full complete copy of the application with all of its data. You can roll back to a complete running instance of the application. You just restart the old version.
  • For upgrades that require transformations, like database upgrades, you can write a script to run a database transformation.
  • They don't over-automate. They don't want to impose only their way of doing things.
  • They model an application as having two parts: the appliance and the content. For a web server this means: - the web server itself - the content it's serving.
  • You first create a prototype of what you want your system to look like. This becomes a class from which you later can create instances. There are templates, like the Linux appliance, to build from. Through their on-line system you configure your system, install packages, etc. When it works the way you want you can drag it into your catalog as a template for building new instances. You can create hundreds of copies if you choose.
  • Content would be served off a mount location from inside the VDC.
  • You can upgrade the catalog element and restart the appliance and it will automatically upgrade for you. It's not transactional; upgrades happen on an individual appliance basis.
  • You can pin machines. You can get the environment to make machine-specific configurations. You can put appliances into standby so you can quickly add additional resources on demand.
  • Their load balancer is Pound. No spam detection, but it is session aware. You can use others if you want.
  • They specialize in the code that runs the grid. They aren't specialists in load balancers and routers, etc.
  • In the VDC you can share infrastructure. You can email each other a clustered database, for example. You can save and package up an integration effort as an assembly. Save it. Sell it. Share it.
  • You can create an active-active redundancy scheme and pay for only resources you need because you can bring on resources like the front-end when you need them.
  • Many companies periodically make a local copy of their VDC and move it to their disaster center. - Remember, with a VDC it's easy to pick up your whole data center and move it somewhere else. The catalog doesn't have to be copied each time. Just the data for applications can be copied over. Not so bad with a fast backbone. - Disaster recovery can be triggered by a 3rd party or scripts. - This model is sufficient for companies that can accept some downtime. - If no data loss can be tolerated you need to replicate in an active-active architecture.
  • Some companies maintain fungible data centers. They constantly copy their data center over to backup locations. If an app goes down they can fire up a replacement.
  • With AppLogic you can create a stub that can start an application on demand if it's not already running. This allows you to share resources. You can shut it down at night and save those resources for other applications.
  • Here's how they would handle being TechCrunched: - Let's say you have an 8 grid data center. Let's say your normal load takes 20-30% of that. - First thing you'll do is use more resources from within the grid. - Then reconfigure appliances with more resources and restart them. - Then call your provider to add more resources. - Softlayer, for example, has a 500-1000 server inventory. So you can add servers to your grid within an hour or two. Currently this process requires human intervention.
  • Finding good ops people is difficult. So with the VDC you can automate most of it and you don't need a big ops team.
  • In your VDC your data center configuration is in the metadata, so it's not kept as tacit knowledge. One or two people can run a thousand servers because you aren't really running servers, you are running applications.
  • Monitoring - AppLogic is in control of all resources, so you can build dashboards right off the bat. - You can plug your monitored variables into their monitoring system. - The data are available over the web. - Widgets are available for the display of live stats.
  • Different Way of Thinking about Your System - Typically you put the database on the fastest server. Instead, they recommend allocating high-end machines to everything so your database can run anywhere. A different way of thinking about your system. - Same with a SAN. You don't need a SAN with the storage in the VDC. You are locking yourself into certain ways of thinking that don't apply in the VDC. The concept of using a SAN is just another lock-in.

    Some Observations and Conclusions

  • I think the grid/virtualization approach, in one form or another, is the wave of the future. It simply makes it easier for companies to scale applications. And as applications themselves are structured to run natively on a grid, it will become even easier.
  • Reaching the full potential of the virtual data center depends on having a more granular billing strategy and more fine-grained control over resource management. For example, say I have a 6 CPU grid and I want to upgrade. I don't want to pay for a 12 CPU data center just so I can upgrade a copy; I don't need 12 normally, I just transiently need 12. So during my upgrade I want my script to trigger allocating a copy of my VDC, do the upgrade, switch to it, and then decommission the old VDC. I want the 6 extra servers only for the time it takes to upgrade; then the old VDC should go away, and I should only be billed for the resources I am using while I am using them. This would also give a more satisfactory solution to the TechCrunch scenario.
  • You need to architect your system to take advantage of the grid. To me this means a shared nothing architecture that can be grown horizontally by adding more machines on demand. Applications should read their configuration off shared storage so the configuration doesn't need to be set up on each machine, and you can bring up new machines from a template. If you need to scale, a new machine should come up and automatically start handling load. Queuing architectures, for example, have this attribute.
  • They need a data center API so you can treat the data center like an object. This would allow you to orchestrate various data centers around the world as a single cooperating unit.
  • Operations within a grid would benefit from standardization. I know this enters the application realm, but operations like upgrade and failover are common and hard, so it would be useful if common processes could be developed and easily deployed.
  • They need turnkey options for those new to the game. As it stands, the path from signing up for their service to deploying a web service is a little scary. They are very honest in saying they do only one part of the overall picture. But many people need a painting, not a brush and paint. It would be helpful to have out-of-the-box plans for solving the most common problems people face.

I would like to thank Bert again for taking the time for this interview! May the grid be with you, always.

    Related Sites and Articles

  • http://www.3tera.com/
  • On-Demand Infinitely Scalable Database Seed the Amazon EC2 Cloud


Sunday
Oct 14, 2007

    Product: The Spread Toolkit

    Complex applications coordinating work across many machines often need a high-performance, fault tolerant message layer. Though a blast to write, it's probably a better use of your time to use an off-the-shelf solution. And that's where Spread comes in. Flickr, for example, uses Spread to create real-time event feeds from their web server logs. What exactly is Spread? From the Spread website:

    Spread is an open source toolkit that provides a high performance messaging service that is resilient to faults across local and wide area networks. Spread functions as a unified message bus for distributed applications, and provides highly tuned application-level multicast, group communication, and point to point support. Spread services range from reliable messaging to fully ordered messages with delivery guarantees. Spread can be used in many distributed applications that require high reliability, high performance, and robust communication among various subsets of members. The toolkit is designed to encapsulate the challenging aspects of asynchronous networks and enable the construction of reliable and scalable distributed applications. Some of the services and benefits provided by Spread:
  • Reliable and scalable messaging and group communication.
  • A very powerful but simple API simplifies the construction of distributed architectures.
  • Easy to use, deploy and maintain.
  • Highly scalable from one local area network to complex wide area networks.
  • Supports thousands of groups with different sets of members.
  • Enables message reliability in the presence of machine failures, process crashes and recoveries, and network partitions and merges.
  • Provides a range of reliability, ordering and stability guarantees for messages.
  • Emphasis on robustness and high performance.
  • Completely distributed algorithms with no central point of failure.
    In Building Scalable Web Sites, Cal Henderson describes how Flickr uses Spread to create a log of real-time events, like photos uploaded and discussions started, as they happen. Spread is connected to their web servers. As photos are uploaded, these web server events are messaged in real-time to agents consuming the feed. The advantage of this architecture is that it sheds load away from the database. Otherwise the database would have to be continuously polled for new events by each agent.
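    Here's roughly what that looks like from Java, using the client library that ships with the Spread toolkit (package spread). A hedged sketch: it assumes a Spread daemon on localhost at the default port, and the group name and payload are made up:

        import spread.SpreadConnection;
        import spread.SpreadGroup;
        import spread.SpreadMessage;

        public class EventFeed {
            public static void main(String[] args) throws Exception {
                SpreadConnection conn = new SpreadConnection();
                // null address = localhost, port 0 = the default Spread port (4803)
                conn.connect(null, 0, "web1", false, false);

                SpreadGroup group = new SpreadGroup();
                group.join(conn, "photo-events");   // feed consumers join this group

                SpreadMessage msg = new SpreadMessage();
                msg.setReliable();                  // one of Spread's delivery guarantee levels
                msg.addGroup("photo-events");
                msg.setData("photo_uploaded id=42".getBytes());
                conn.multicast(msg);                // fan out to every member - no DB polling

                SpreadMessage in = conn.receive();  // consumers block here for the next event
                System.out.println(new String(in.getData()));
                conn.disconnect();
            }
        }

    Note how the web server just multicasts and forgets; adding another consumer means joining the group, not another round of database polling.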

    Related Articles

  • LAMP and the Spread Toolkit
  • The Spread Toolkit: Architecture and Performance


Wednesday
Oct 10, 2007

    WAN Accelerate Your Way to Lightning Fast Transfers Between Data Centers

    How do you keep in sync a crescendo of data between data centers over a slow WAN? That's the question Alberto posted a few weeks ago. Normally I'm not into all boy bands, but I was frustrated there wasn't a really good answer for his problem. It occurred to me later a WAN accelerator might help turn his slow WAN link into more of a LAN, so the overhead of copying files across the WAN wouldn't be so limiting. Many might not consider a WAN accelerator in this situation, but since my friend Damon Ennis works at the WAN accelerator vendor Silver Peak, I thought I would ask him if their product would help. Not surprisingly his answer is yes! Potentially a lot, depending on the nature of your data. Here's a no BS overview of their product:

  • What is it? - Scalable WAN Accelerator from Silver Peak (http://www.silver-peak.com)
  • What does it do? - You can send 5x-100x more data across your expensive, low-bandwidth WAN link.
  • Why should you care? - Your data centers become more like co-located real-time peers. - You can sync a lot more media and other large files across data centers: 50x improvement in data replication performance over a WAN. - You may be able to operate on a remote database more like a local database: 5x-20x improvement in SQL data manipulation and unique query performance. - A 2 hour database backup could take 4 minutes: 10x-30x improvement in transferring large data sets over SQL. A good disaster planning feature.
  • How does it work? - You buy an accelerator appliance for both sides of your link. All your WAN traffic flows through these boxes. - The appliances then use various techniques to effectively decrease latency and increase bandwidth across the link: -- Traffic reduction. Accelerators look for patterns in data across a link, caching the data on either side of the link, and then not sending the data when similar patterns are seen again. This can lead to a 90% reduction in traffic. -- Compression. Data are compressed across the link. Compression ratios from none to 2x-5x are seen, depending on the content type. -- TCP manipulation. The TCP/IP protocol is gamed to yield better performance. For example, a proxy on both sides is used to get a bigger window size. -- Application manipulation. Various application protocols, like CIFS, NFS, and Outlook, can be gamed to improve performance.
  • How much does it cost? - $10k to $130k per box. $10k for the 2Mbps appliance and $130k for the 500Mbps. - They are the scale leaders and are specifically good at "high-end" (> 50Mbps) replication.
  • Who uses it? - Fidelity Bank, Ernst & Young, Panasonic.
  • Is it for real? - Yes. It works and is installed and running in many data centers.
  • How do you get it? - Contact sales at http://www.silver-peak.com/Contact/contact.asp.
  • Where do you go for more information? - White paper Directory - http://www.silver-peak.com/InfoCenter/index.htm#whitepapers - Understanding WAN Acceleration Techniques - http://www.silver-peak.com/assets/download/pdf/technologydescriptions.pdf
  • Is there anything else interesting you should know? - The appliance performs encryption and compression so you don't need to perform those functions on your own CPUs. - The appliances fail-to-wire, so if a box fails traffic passes through unaccelerated. If you can't live with that you need to buy 2 boxes per end of the link (4 boxes total).
  • How much will you benefit? - The more duplication in your data the better job they can do. There's tons of duplicated data in a database feed, for example, so they can really help supercharge database performance. - Latency/time improvements depend on the link. The higher the link's latency, the less bandwidth you can use. For example, a 100ms link is limited to 5Mbps throughput per flow due to the TCP window size (64KB/100ms ~ 5Mbps; this arithmetic is worked out at the end of this entry). They can take this to several hundred Mbps per flow. - Image files are often pre-compressed. As compression removes duplicate information they can't be as efficient at de-duplication as in other scenarios, though they can still improve throughput.

    An interesting side-effect of speeding up the WAN link is that it often reveals bottlenecks in other parts of the system. A slow WAN might be hiding:
  • Underpowered servers. Servers that could process a trickle of data may be overwhelmed by a flood of data.
  • Slow applications. Apps that could pump data at slow WAN speeds may not be able to drive a faster WAN. You may need to take a look at your software architecture or storage network.
  • Underpowered server links. Accelerate a 2Mbps link to a 20Mbps link and your network infrastructure on the data center side may not be able to handle the truth.

    Obviously the cost of the solution means it's targeted more at moderate sized companies, or at service providers offering their customers a quality upsell. But if you are stuck wondering how the heck you are going to squeeze more bits between your data centers, it may be just the magic bullet you need.
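    As a footnote, the TCP window arithmetic quoted in the benefits list is worth seeing worked out. Un-tuned TCP throughput per flow is capped at roughly window size divided by round-trip time:

        public class TcpWindowMath {
            public static void main(String[] args) {
                double windowBytes = 64 * 1024;  // the default 64KB TCP window
                for (double rttMs : new double[] {10, 50, 100}) {
                    double mbps = windowBytes * 8 / (rttMs / 1000.0) / 1_000_000.0;
                    System.out.printf("RTT %3.0f ms -> max %.1f Mbps per flow%n", rttMs, mbps);
                }
                // At 100ms: 65536 bytes * 8 bits / 0.1s = ~5.2 Mbps, the ~5Mbps
                // ceiling cited above. Accelerators proxy the connection on both
                // sides and use much larger windows to lift this cap.
            }
        }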


Sunday
Oct 7, 2007

    Product: Wackamole

    Wackamole is an application that helps with making a cluster highly available. It manages a bunch of virtual IPs that should be available to the outside world at all times. Wackamole ensures that a single machine within a cluster is listening on each virtual IP address that Wackamole manages. If it discovers that particular machines within the cluster are not alive, it will almost immediately ensure that other machines acquire these public IPs. At no time will more than one machine listen on any virtual IP. Wackamole also works toward achieving a balanced distribution of the virtual IPs across the machines within the cluster it manages.

    Wackamole is quite unique in that it operates in a completely peer-to-peer mode within the cluster; other products that provide the same high-availability guarantees use a "VIP" method. It runs as root in the cluster and uses the membership notifications provided by the Spread toolkit to generate a consistent state that is agreed upon among all of the connected Wackamole instances. Wackamole is released under the CNDS Open Source License. Note: This post has been adapted from the linked-to web site.
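    The peer-to-peer part is the interesting bit: there is no coordinator, so every node must independently arrive at the same IP-to-machine mapping. This is not Wackamole code, just a sketch of the idea: run the same deterministic function over the same agreed-upon membership view (which Wackamole gets from Spread) and every node computes the same balanced assignment:

        import java.util.List;
        import java.util.Map;
        import java.util.TreeMap;

        public class VipAssignment {
            // A pure function of the membership view: identical input on every
            // node yields identical output, with no coordinator needed.
            public static Map<String, String> assign(List<String> aliveNodes, List<String> vips) {
                List<String> nodes = aliveNodes.stream().sorted().toList();
                Map<String, String> owner = new TreeMap<>();
                for (int i = 0; i < vips.size(); i++) {
                    owner.put(vips.get(i), nodes.get(i % nodes.size()));  // spread VIPs evenly
                }
                return owner;
            }

            public static void main(String[] args) {
                // If web2 dies, Spread delivers a new membership view and the
                // survivors recompute: web2's VIPs move almost immediately.
                System.out.println(assign(List.of("web1", "web3"),
                        List.of("10.0.0.10", "10.0.0.11", "10.0.0.12")));
            }
        }

    The real daemon then has to actually acquire its assigned addresses as root (bringing up interface aliases and announcing them on the network), which is the part Wackamole handles for you.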

    Related Articles

  • White paper on building HA/LB Clusters by Theo Schlossnagle.


Thursday
Oct 4, 2007

    You Can Now Store All Your Stuff on Your Own Google Like File System

    New update: Parascale’s CTO on what’s different about Parascale.

    Let's say you have gigglebytes of data to store and you aren't sure you want to use a CDN. Amazon's S3 doesn't excite you. And you aren't quite ready to join the grid nation. You want to keep it all in house. Wouldn't it be nice to have something like the Google File System you could use to create a unified file system out of all the disks sitting on all your nodes? According to Robin Harris, a.k.a. StorageMojo (a great blog BTW), you can now have your own GFS: Parascale launches Google-like storage software.

    Parascale calls their software a Virtual Storage Network (VSN). It "aggregates disks across commodity Linux x86 servers to deliver petabyte-scale file storage. With features such as automated, transparent file replication and file migration, Parascale eliminates storage hotspots and delivers massive read/write bandwidth."

    Why should you care? I don't know about you, but the "storage problem" is one of the most frustrating parts of building websites. There's never a good answer that is affordable. Should you build a SAN or a NAS? How do you make it redundant? How do you make it perform? How do you back it up? How do you grow it without a defense-appropriations-sized budget? Should you use RAID? Which level, and where, and for what reason? Should you use SCSI, iSCSI, SAS, SATA, or alpha beta? Which vendor should you use? There are so many conflicting opinions about everything. It's all a confusing mess to me.

    So I like the simplicity of buying commodity nodes with just a bunch of disks attached. But the question has always been how do you turn all those disks into a unified storage system without writing a ton of software on top? Harris says this is what Parascale has done for you:

    VSN, like GFS, builds availability and scalability around low-cost servers and disks. NAS appliances rely on costly low-volume boxes that are closed and don't scale. GFS has been deployed in production clusters of over 5,000 servers, proving the scalability of the architecture. Fast, reliable, low-cost and massively scalable storage powers the growth of new applications like Web 2.0, video-on-demand, and hi-resolution image archiving. Parascale is the first of a new generation of software-only storage solutions.
    They make a big deal out of it being a software-only system. Harris says why this is a good thing:
    I like software-based systems because hardware is a commodity. When you create custom hardware you also create low-volume, high-cost components whose economics go from bad to worse. If you *need* to do it, then go for it. But data is getting cooler and the requirement for specialized high-performance hardware is shrinking relative to the market.
    Other systems use an appliance model. Appliances can add a lot of value, but they are also a way of monetizing you. A software system on commodity hardware has the potential to give good value. Will it? I didn't see pricing so it's hard to tell. Even odder is their pricing model: you lease the software per year, per disk spindle. Do you have any idea how much this will cost? Neither do I. It sounds like it could be horribly expensive or really reasonable. We'll have to see.

    Another thing that bothers me is that you can't run a database on top of their file system. This means I need an entire separate storage system for my database. You can run a database on a NAS or SAN, so this is a definite disadvantage. Anyway, it's just another interesting option to consider when architecting your website.

    Related Articles

  • LiveJournal created an open source distributed file system called MogileFS that builders may find useful.
  • Parascale Announces Industry's First Software-Only Storage Solution for Digital Content


Friday
Sep 28, 2007

    Kosmos File System (KFS) is a New High End Google File System Option

    There's a new clustered file system on the spindle: Kosmos File System (KFS). Thanks to Rich Skrenta for turning me on to KFS and I think his blog post says it all. KFS is an open source project written in C++ by search startup Kosmix. The team members have a good pedigree so there's a better than average chance this software will be worth considering. After you stop trying to turn KFS into "Kentucky Fried File System" in your mind, take a look at KFS' intriguing feature set:

  • Incremental scalability: New chunkserver nodes can be added as storage needs increase; the system automatically adapts to the new nodes.
  • Availability: Replication is used to provide availability due to chunk server failures. Typically, files are replicated 3-way.
  • Per file degree of replication: The degree of replication is configurable on a per file basis, with a max. limit of 64.
  • Re-replication: Whenever the degree of replication for a file drops below the configured amount (due to an extended chunkserver outage, for example), the metaserver forces the block to be re-replicated on the remaining chunkservers. Re-replication is done in the background without overwhelming the system.
  • Re-balancing: Periodically, the meta-server may rebalance the chunks amongst chunkservers. This is done to help with balancing disk space utilization amongst nodes.
  • Data integrity: To handle disk corruptions to data blocks, data blocks are checksummed. Checksum verification is done on each read; whenever there is a checksum mismatch, re-replication is used to recover the corrupted chunk.
  • File writes: The system follows the standard model. When an application creates a file, the filename becomes part of the filesystem namespace. For performance, writes are cached at the KFS client library. Periodically, the cache is flushed and data is pushed out to the chunkservers. Also, applications can force data to be flushed to the chunkservers. In either case, once data is flushed to the server, it is available for reading.
  • Leases: KFS client library uses caching to improve performance. Leases are used to support cache consistency.
  • Chunk versioning: Versioning is used to detect stale chunks.
  • Client side fail-over: The client library is resilient to chunkserver failures. During reads, if the client library determines that the chunkserver it is communicating with is unreachable, the client library will fail over to another chunkserver and continue the read. This fail-over is transparent to the application.
  • Language support: KFS client library can be accessed from C++, Java, and Python.
  • FUSE support on Linux: By mounting KFS via FUSE, existing Linux utilities (such as ls) can interface with KFS (see the sketch after this list).
  • Tools: A shell binary is included in the set of tools. This allows users to navigate the filesystem tree using utilities such as cp, ls, mkdir, rmdir, rm, and mv. Tools to monitor the chunk/meta servers are also provided.
  • Deploy scripts: To simplify launching KFS servers, a set of scripts to (1) install KFS binaries on a set of nodes, and (2) start/stop KFS servers on a set of nodes are also provided.

    This seems to compare very favorably to GFS and is targeted at:
  • Primarily write-once/read-many workloads
  • A few million large files, where each file is on the order of a few tens of MB to a few tens of GB in size
  • Mostly sequential access

    As Rich says, everyone needs to solve the "storage problem" and this looks like an exciting option to add to your bag of tricks. What we are still missing though is a Bigtable-like database on top of the file system for scaling structured data. If anyone is using KFS please consider sharing your experiences.
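    Meanwhile, the FUSE support makes kicking the tires easy: plain file I/O is all you need for simple cases, with no KFS-specific client code. A sketch, assuming the volume is mounted at /mnt/kfs (the mount point and paths are made up):

        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.util.List;

        public class KfsViaFuse {
            public static void main(String[] args) throws Exception {
                Path root = Path.of("/mnt/kfs");            // assumed FUSE mount point
                Path log = root.resolve("crawl/part-00001");

                Files.createDirectories(log.getParent());
                // Underneath, KFS chunks, checksums, and 3-way replicates this
                // write; the application just sees an ordinary file.
                Files.writeString(log, "url=http://example.com status=200\n");

                // Reads are checksum-verified and fail over between chunkservers
                // transparently, per the feature list above.
                List<String> lines = Files.readAllLines(log);
                lines.forEach(System.out::println);
            }
        }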

    Related Articles

  • Hadoop
  • Google Architecture
  • You Can Now Store All Your Stuff on Your Own Google Like File System.


Thursday
Sep 27, 2007

    Product: Sequoia Database Clustering Technology

    Sequoia is a transparent middleware solution offering clustering, load balancing and failover services for any database. Sequoia is the continuation of the C-JDBC project. The database is distributed and replicated among several nodes and Sequoia balances the queries among these nodes. Sequoia handles node and network failures with transparent failover. It also provides support for hot recovery, online maintenance operations and online upgrades.

    Features in a nutshell

  • No modification of existing applications or databases.
  • Operational with any database providing a JDBC driver.
  • High availability provided by advanced RAIDb technology.
  • Transparent failover and recovery capabilities.
  • Performance scalability with unique load balancing and query result caching features.
  • Integrated JMX-based administration and monitoring.
  • 100% Java implementation allowing portability across platforms with a JRE 1.4 or greater.
  • Open source licensed under Apache v2 license.
  • Professional support, training and consulting provided by Continuent Inc.

    Sequoia is the core technology providing database clustering capabilities. It is composed of a controller implementing the RAIDb (Redundant Array of Inexpensive Databases) technology. Sequoia controllers are replicated for HA and scalability purposes. Controllers use group communication to synchronize the cluster. Hedera is a group communication wrapper that can be plugged in to work with multiple group communication implementations such as Appia, JGroups or Spread. Sequoia comes with a JDBC driver for Java applications. Additional drivers for PHP, Perl, ODBC, MySQL native API and C/C++ applications, with transparent failover capabilities, are also provided through the Carob project.
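    From the application's point of view Sequoia is just another JDBC driver, which is what "no modification of existing applications" means in practice. A sketch of typical usage follows; the driver class name and URL format are from the Sequoia documentation as best I recall, so verify them against your release:

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.Statement;

        public class SequoiaClient {
            public static void main(String[] args) throws Exception {
                // Assumed driver class for Sequoia (C-JDBC's successor).
                Class.forName("org.continuent.sequoia.driver.Driver");

                // Listing both controllers lets the driver fail over transparently
                // if one dies; "mydb" is the virtual database name.
                String url = "jdbc:sequoia://controller1,controller2/mydb";

                try (Connection conn = DriverManager.getConnection(url, "user", "secret");
                     Statement stmt = conn.createStatement()) {
                    // Sequoia replicates this write across the backends and will
                    // load balance subsequent reads among them.
                    stmt.executeUpdate("INSERT INTO hits (page) VALUES ('/home')");
                }
            }
        }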

