Entries in Clustering (13)

Sunday
Dec 02, 2007

a8cjdbc - update version 1.3

The new version of a8cjdbc removes some earlier limitations: Clobs and Blobs are now supported, and there are several fixes for handling binary data. The release has also been fully tested with Postgres and MySQL, and as of version 1.3 a free trial version is available for download. Check it out and test it yourself... Take a look at: http://www.activ8.at/homepage/en/a8cjdbc.php

I downloaded the latest version and set up an environment with one virtual database and two database backends. To try a "non real life" scenario, the first backend was a Postgres node and the second a MySQL node. Everything worked fine (failover, recovery log, etc.) even with two different backend database types.

So check out the trial version, test the clustered driver yourself, and send me some results about your experience with a8cjdbc. Since I only tested MySQL and Postgres (plus the non-real-life scenario with two different backend types), maybe someone else has experience with other databases?

greetings, Wolfgang
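
Since Blob and Clob handling goes through the standard JDBC interfaces, a clustered driver can support them without any application changes. Here is a rough illustration in plain JDBC (not a8cjdbc-specific API; the URL, table, and column names are made up):

```java
import java.sql.*;

public class BlobRoundTrip {
    public static void main(String[] args) throws Exception {
        // The URL would point at the virtual database of the clustered
        // driver; a plain Postgres URL is shown here only as a stand-in.
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost/appdb", "app", "secret")) {

            // Write binary data through the standard JDBC API.
            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO documents (name, payload) VALUES (?, ?)")) {
                ps.setString(1, "report.pdf");
                ps.setBytes(2, new byte[] {1, 2, 3, 4});
                ps.executeUpdate();
            }

            // Read it back; a clustered driver would have replicated the
            // write to all attached backends before this point.
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT payload FROM documents WHERE name = 'report.pdf'")) {
                while (rs.next()) {
                    byte[] data = rs.getBytes(1);
                    System.out.println("read " + data.length + " bytes");
                }
            }
        }
    }
}
```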


Thursday
Nov 15, 2007

Video: Dryad: A general-purpose distributed execution platform

Dryad is Microsoft's answer to Google's map-reduce. What's the question? How do you process really large amounts of data? My initial impression of Dryad is that it's like a giant Unix command-line filter on steroids. There are lots of inputs, outputs, tees, queues, and merge sorts, all connected together by a master exec program. What else does Dryad have to offer the scalable infrastructure wars?

Dryad models programs as the execution of a directed acyclic graph. Each vertex is a program and edges are typed communication channels (files, TCP pipes, and shared-memory channels within a process). Map-reduce uses a different model. It's more like a large distributed sort where the programmer defines functions for mapping, partitioning, and reducing. Each approach seems to borrow from the spirit of its creating organization. The graph approach seems a bit too complicated and map-reduce seems a bit too simple. How ironic, in the Alanis Morissette sense.

Dryad is a middleware layer that executes graphs for you, automatically taking care of scheduling, distribution, and fault tolerance. It's written in C++, but apparently few people write directly to this layer; most use higher-level interfaces. A Job Manager runs the program: it's a library you link in, and it loads and executes the graph. A daemon runs on each machine to run jobs, and a name server provides access to cluster resources.

The DAG is a multigraph, so you can have multiple edges between vertices. A DAG was chosen because it's not too cold and not too hot; the porridge is just right. Cycles are too hard, and anything simpler isn't as useful: DAGs support relational algebra and can split multiple inputs and outputs nicely. One interesting aspect is that a channel is a sequence of structured items that are C++ objects. This means pointers can be passed directly, so you don't have to worry about serialization overhead. No restrictions are put on the data model. Graphs are dynamically changeable at runtime, which allows for a lot of optimizations.

Several case studies were provided. It's probably just me, but I didn't really understand what was going on. Google's example is much better; everyone can relate to counting words in a document. My thought while watching was that the graph stuff sounds cool and general, but it's hard to map it efficiently to solutions when the problems have large numbers of inputs. You have to manually optimize for available RAM and CPUs; the system should do all this work for you. But the graph approach is powerful. The programmer provides the bits of atomic behaviour, and the system can then try various optimizations. The code doesn't have to change because the graph can be manipulated abstractly on its own. So you can write something like a SQL query, and then something like a query planner figures out how to execute the query on Dryad.
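
To make the DAG model concrete, here is a toy sketch of the core idea: vertices are programs, edges carry data between them, and a scheduler fires a vertex once all of its inputs have arrived. This is written in Java with made-up names purely for illustration; real Dryad is C++, its vertices are separate processes, and none of its actual API appears here:

```java
import java.util.*;
import java.util.function.Function;

// Toy model of DAG execution: each vertex consumes the items delivered on
// its input channels and produces items for its downstream vertices.
public class TinyDag {
    static class Vertex {
        final String name;
        final List<Vertex> downstream = new ArrayList<>();
        final List<List<String>> inputs = new ArrayList<>();
        int pendingInputs; // upstream channels that have not delivered yet
        final Function<List<String>, List<String>> program;

        Vertex(String name, Function<List<String>, List<String>> program) {
            this.name = name;
            this.program = program;
        }
    }

    static void connect(Vertex from, Vertex to) {
        from.downstream.add(to);
        to.pendingInputs++;
    }

    // Run vertices in topological order: a vertex fires when every input
    // channel has delivered (no cycles, so this always terminates).
    static void run(List<Vertex> sources) {
        Deque<Vertex> ready = new ArrayDeque<>(sources);
        while (!ready.isEmpty()) {
            Vertex v = ready.poll();
            List<String> merged = new ArrayList<>();
            v.inputs.forEach(merged::addAll);
            List<String> out = v.program.apply(merged);
            for (Vertex d : v.downstream) {
                d.inputs.add(out);
                if (--d.pendingInputs == 0) ready.add(d);
            }
            System.out.println(v.name + " -> " + out);
        }
    }

    public static void main(String[] args) {
        // Two readers feed a merge-sort vertex: a tiny fan-in graph.
        Vertex readA = new Vertex("readA", in -> List.of("b", "a"));
        Vertex readB = new Vertex("readB", in -> List.of("c", "a"));
        Vertex mergeSort = new Vertex("mergeSort", in -> {
            List<String> s = new ArrayList<>(in);
            Collections.sort(s);
            return s;
        });
        connect(readA, mergeSort);
        connect(readB, mergeSort);
        run(List.of(readA, readB)); // sources have no pending inputs
    }
}
```

The interesting property, as the talk points out, is that the graph is data the system can manipulate: a planner could rewrite it (add intermediate aggregation, change fan-in) without touching the vertex programs.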


Monday
Nov 12, 2007

a8cjdbc - Database Clustering via JDBC

Practically no software project nowadays can survive without a database (DBMS) backend storing all the business data that is vital to you and/or your customers. When projects grow larger, the amount of data usually grows exponentially. So you start moving the DBMS to a separate server to gain more speed and capacity. Which is all good and healthy, but you do not gain any extra safety for this business data. You might be backing up your database once a day, so in case the database server crashes you don't lose EVERYTHING, but how much can you really afford to lose?

Well, clearly this depends on what kind of data you are storing. In our case the users of our solutions use our software products to do their everyday (all day) work. They have "everything" they need for their business stored in the database we are providing. So is 24 hours of data loss acceptable? No, not really. One hour? Maybe. But what we really want is a second database running with the EXACT same data.

We mostly use PostgreSQL, which does not have built-in database replication. There are trigger-based solutions for replicating the data from one database to another. We have learned that setting all this up on an existing database with plenty of tables is rather complicated, and changing the database structure afterwards can no longer be done with simple create/alter statements. And since we ARE running solutions that constantly change and improve, we need to be able to deploy updates, including database structure changes, quickly and easily.

So what we really wanted was a transparent JDBC layer that does the replication for us. We tested a great solution called "Sequoia", but it is a rather heavy-weight product with a lot of features that did not really help in the performance department and that we didn't need anyway. What we needed was:

  • a JDBC driver, so the application does not know anything about the replication (see the sketch after this list)
  • of course: transactional safety for write operations
  • load-balanced reads (we are running 2 database servers, so why waste the ability to do parallel reads from 2 servers and almost multiply the performance by 2?)
  • for backups: the ability to detach one server, do the backup on that machine and then reattach the server
  • automatic and transparent failover / failsafe
  • Fast In-VM-Replication - no serialisation
  • Easy integration
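
The point of the first requirement is that application code stays plain JDBC; only the driver and connection URL change. A minimal sketch of what that transparency looks like (the a8cjdbc URL scheme below is assumed for illustration, not taken from its documentation):

```java
import java.sql.*;

public class TransparentClusterRead {
    public static void main(String[] args) throws Exception {
        // Hypothetical clustered URL: the virtual database hides the two
        // real PostgreSQL backends behind it. Everything below is vanilla JDBC.
        String url = "jdbc:a8cjdbc:virtualdb"; // assumed scheme, for illustration

        try (Connection con = DriverManager.getConnection(url, "app", "secret");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, name FROM customers")) {
            // Reads can be load-balanced across both backends by the driver;
            // writes would run transactionally against all attached backends.
            while (rs.next()) {
                System.out.println(rs.getInt("id") + " " + rs.getString("name"));
            }
        }
    }
}
```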

