Entries in management (6)

Thursday
Jun 27, 2019

2019 Open Source Database Report: Top Databases, Public Cloud vs. On-Premise, Polyglot Persistence

Ready to transition from a commercial database to open source, and want to know which databases are most popular in 2019? Wondering whether an on-premise vs. public cloud vs. hybrid cloud infrastructure is best for your database strategy? Or, considering adding a new database to your application and want to see which combinations are most popular? We found all the answers you need at the Percona Live event last month, and broke down the insights into the following free trends reports:

Click to read more ...

Tuesday
Apr 16, 2019

MySQL High Availability Framework Explained – Part III: Failover Scenarios

In this three-part blog series, we introduced a High Availability (HA) Framework for MySQL hosting in Part I, and discussed the details of MySQL semisynchronous replication in Part II. Now in Part III, we review how the framework handles some of the important MySQL failure scenarios and recovers to ensure high availability.

MySQL Failover Scenarios

Scenario 1 – Master MySQL Goes Down

  • The Corosync and Pacemaker framework detects that the master MySQL is no longer available. Pacemaker demotes the master resource and tries to recover with a restart of the MySQL service, if possible.
  • At this point, due to the semisynchronous nature of the replication, all transactions committed on the master have been received by at least one of the slaves.
  • Pacemaker waits until all the received transactions are applied on the slaves and lets the slaves report their promotion scores. The score calculation is done in such a way that the score is ‘0’ if a slave is completely in sync with the master, and is a negative number otherwise.
  • Pacemaker picks a slave that has reported a score of 0 and promotes it; the promoted slave assumes the role of the master MySQL, on which writes are allowed.
  • After slave promotion, the Resource Agent triggers a DNS rerouting module. The module updates the proxy DNS entry with the IP address of the new master, so that all application writes are redirected to the new master.
  • Pacemaker also sets up the available slaves to start replicating from this new master.

Thus, whenever a master MySQL goes down (whether due to a MySQL crash, OS crash, system reboot, etc.), our HA framework detects it and promotes a suitable slave to take over the role of the master. This ensures that the system continues to be available to the applications.
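
To make the scoring step above concrete, here is a minimal Python sketch of the promotion-score idea (illustrative only, not the actual Resource Agent code): a slave reports 0 only when it has applied every transaction it received, and a negative number proportional to its backlog otherwise, so Pacemaker can safely promote any slave that reports 0.

    # Illustrative sketch of the promotion-score calculation described above.
    # The real logic lives in the Pacemaker Resource Agent; transaction counts
    # here stand in for GTID/relay-log positions.

    def promotion_score(received_txns, applied_txns):
        """0 means the slave is fully in sync; negative means it is lagging."""
        backlog = received_txns - applied_txns
        return 0 if backlog == 0 else -backlog

    def pick_new_master(scores):
        """Return a slave that reported 0, i.e. one that is safe to promote."""
        in_sync = [name for name, score in scores.items() if score == 0]
        return in_sync[0] if in_sync else None

    # S1 has applied everything it received; S2 is still 3 transactions behind.
    scores = {"S1": promotion_score(120, 120), "S2": promotion_score(120, 117)}
    print(scores)                   # {'S1': 0, 'S2': -3}
    print(pick_new_master(scores))  # S1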

Scenario 2 – Slave MySQL Goes Down

  • The Corosync and Pacemaker framework detects that the slave MySQL is no longer available.
  • Pacemaker tries to recover the resource by restarting MySQL on the node. If the restart succeeds, the node is added back as a slave of the current master and replication continues.
  • If recovery fails, Pacemaker reports the resource as down, and alerts or notifications can be generated from that state. If necessary, the ScaleGrid support team handles the recovery of the node.
  • In this case, there is no impact on the availability of MySQL services.

Scenario 3 – Network Partition – Network Connectivity Breaks Down Between Master and Slave Nodes

This is a classic problem in any distributed system, where each node thinks the other nodes are down, while in reality only the network communication between the nodes is broken. This scenario is commonly known as a split-brain scenario, and if not handled properly, it can lead to more than one node claiming to be the master MySQL, which in turn leads to data inconsistency and corruption.

Let’s use an example to review how our framework deals with split-brain scenarios in the cluster. We assume that due to network issues, the cluster has partitioned into two groups – master in one group and 2 slaves in the other group, and we will denote this as [(M), (S1,S2)].

  • Corosync detects that the master node is not able to communicate with the slave nodes, and the slave nodes can communicate with each other, but not with the master.
  • The master node will not be able to commit any transactions, as semisynchronous replication expects acknowledgement from at least one of the slaves before the master can commit. At the same time, Pacemaker shuts down MySQL on the master node due to lack of quorum, based on the Pacemaker setting ‘no-quorum-policy = stop’. Quorum here means a majority of the nodes, or two out of three in a 3-node cluster setup. Since the master’s partition contains only one of the three nodes, it loses quorum, the no-quorum-policy setting is triggered, and MySQL on the master is shut down.
  • Now, Pacemaker on the partition (S1, S2) detects that there is no master available in the cluster and initiates a promotion process. Assuming that S1 is up to date with the master (as guaranteed by semisynchronous replication), it is promoted as the new master.
  • Application traffic will be redirected to this new master MySQL node and the slave S2 will start replicating from the new master.

Thus, we see that the MySQL HA framework handles split-brain scenarios effectively, ensuring both data consistency and availability in the event the network connectivity breaks between master and slave nodes.
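
The behaviour above follows directly from the quorum rule. As a rough illustration (my sketch, not Pacemaker source), a partition may keep its resources running only if it holds a strict majority of the cluster's nodes, which is why the lone master stops while the two-slave partition elects a new master:

    # Rough illustration of 'no-quorum-policy = stop' in a 3-node cluster.
    CLUSTER_SIZE = 3  # M, S1, S2

    def has_quorum(partition_size, cluster_size=CLUSTER_SIZE):
        """A partition has quorum only with a strict majority of all nodes."""
        return partition_size > cluster_size // 2

    # Network partition [(M), (S1, S2)]:
    print(has_quorum(1))  # False -> Pacemaker stops MySQL on the master M
    print(has_quorum(2))  # True  -> S1 or S2 can be promoted in this partition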

This concludes our 3-part blog series on the MySQL High Availability (HA) framework using semisynchronous replication and the Corosync plus Pacemaker stack. At ScaleGrid, we offer highly available hosting for MySQL on AWS and MySQL on Azure that is implemented based on the concepts explained in this blog series. Please visit the ScaleGrid Console for a free trial of our solutions.

Wednesday
Apr 3, 2019

2019 PostgreSQL Trends Report: Private vs. Public Cloud, Migrations, Database Combinations & Top Reasons Used

PostgreSQL is an open source object-relational database system that has soared in popularity over the past 30 years thanks to its active, loyal, and growing community. For the second year in a row, PostgreSQL has kept the title of #1 fastest growing database in the world, according to the DBMS of the Year report by the experts at DB-Engines. So what makes PostgreSQL so special, and how is it being used today? We found the answers at the Postgres Conference in March, where we surveyed PostgreSQL users, contributors, and SQL and NoSQL database administrators alike. In this free PostgreSQL Trends Report, we break down PostgreSQL hosting use across public cloud vs. private cloud vs. hybrid cloud, the most popular cloud providers, migration trends, database combinations with Postgres, and why PostgreSQL is preferred over popular RDBMS alternatives.

Private Cloud vs. Public Cloud vs. Hybrid Cloud

Click to read more ...

Tuesday
Feb 19, 2019

Intro to Redis Cluster Sharding – Advantages, Limitations, Deploying & Client Connections

Redis Cluster is the native sharding implementation available within Redis that lets you automatically distribute your data across multiple nodes without having to rely on external tools and utilities. At ScaleGrid, we recently added support for Redis Clusters on our platform through our fully managed Redis hosting plans. In this post, we introduce Redis Cluster sharding, discuss its advantages and limitations, explain when you should deploy it, and show how to connect to your Redis Cluster.
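
As a small preview of the connection topic covered in the full post, here is a minimal sketch using the Python redis-py client's cluster support (assuming redis-py 4.1 or newer; the host, port, and key names below are placeholders for your own deployment):

    # Hypothetical example: connect to a Redis Cluster with redis-py 4.1+.
    # The client only needs one reachable node; it discovers the rest of the
    # cluster and routes each key to the shard that owns its hash slot.
    from redis.cluster import RedisCluster

    rc = RedisCluster(host="localhost", port=7000, decode_responses=True)

    rc.set("user:42:name", "alice")   # hashed to a slot on one of the shards
    print(rc.get("user:42:name"))     # "alice"
    print(rc.get_nodes())             # nodes the client discovered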

Sharding with Redis Cluster

Click to read more ...

Tuesday
Mar 25, 2008

Paper: On Designing and Deploying Internet-Scale Services

Greg Linden links to a lesson-laden LISA 2007 paper titled On Designing and Deploying Internet-Scale Services by James Hamilton of the Windows Live Services Platform group. I know people crave nitty-gritty details, but this isn't a how-to-configure-a-web-server article. It hitches you to a rocket and zooms you up to 50,000 feet so you can look at best web operations practices from a broad yet practical perspective. The author and his team of contributors obviously have a lot of in-the-trenches experience. Many non-obvious topics are covered, and there's a lot to learn.

The paper has too many details to cover here, but the big sections are:

  • Recommendations
  • Automatic Management and Provisioning
  • Dependency Management
  • Release Cycle and Testing
  • Operations and Capacity Planning
  • Graceful Degradation and Admission Control
  • Customer Self-Provisioning and Self-Help
  • Customer and Press Communication Plan

In the recommendations we see some of our old favorites:
  • Expect failure and design for failure.
  • Implement redundancy and fault recovery.
  • Depend upon a commodity hardware slice.
  • Keep things simple and robust.
  • Automate everything.

Personally, I'm still trying to figure out how to make something simple.

Next are some good thoughts on how to design operations-friendly software:
  • Quick service health check. This is the services version of a build verification test (a minimal sketch follows this list).
  • Develop in the full environment.
  • Zero trust of underlying components.
  • Do not build the same functionality in multiple components.
  • One pod or cluster should not affect another pod or cluster.
  • Allow (rare) emergency human intervention.
  • Enforce admission control at all levels.
  • Partition services.
  • Understand the network design.
  • Analyze throughput and latency.
  • Treat operations utilities as part of the service.
  • Understand access patterns.
  • Version everything.
  • Keep the unit/functional tests from the last release.
  • Avoid single points of failure.
  • Support single-version software. Have all your customers run the same version.
  • Implement multi-tenancy. Apparently a lot of software requires cloning hardware installations to support multiple customers. Don't do that. Have your software work for multiple customers all on the same hardware.
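
To ground the first item on that list, here is a minimal, illustrative Python health-check endpoint (my sketch, not code from the paper): a lightweight handler that runs cheap checks against the service's critical dependencies and returns 200 only when all of them pass, so monitors and load balancers can verify the service end to end.

    # Illustrative quick service health check (not from the paper): a tiny
    # HTTP endpoint that returns 200 only if every dependency check passes.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    def check_database():
        return True  # placeholder: run a trivial query in a real service

    def check_cache():
        return True  # placeholder: ping the cache tier in a real service

    CHECKS = {"database": check_database, "cache": check_cache}

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/healthz":
                self.send_error(404)
                return
            results = {name: check() for name, check in CHECKS.items()}
            status = 200 if all(results.values()) else 503
            body = "\n".join(
                f"{name}: {'ok' if ok else 'fail'}" for name, ok in results.items()
            )
            self.send_response(status)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(body.encode())

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()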

And the paper continues along the same lines in each section. Good detailed advice on lots of different topics.

You'll undoubtedly agree with some of the advice and disagree with some. Greg wants faster release cycles, thinks having server affinity for some things is OK, and thinks the advice on allowing humans to throttle load won't work in a crisis. Perfectly valid points, but what's fun is to consider them. Some companies, for example, have a dead-man's switch that must be thrown before one master can fail over to another in a multi-datacenter situation. Is that wrong or right? Only the shadow knows.

The advice to "document all conceivable component failures and modes and combinations" sounds good but is truly difficult to do in practice. I went through this process once on a telco project and it took months just to cover all the failure scenarios on a few cards. But the spirit is right I think.

My favorite part of the whole paper is:

"We have long believed that 80% of operations issues originate in design and development, so this section on overall service design is the largest and most important. When systems fail, there is a natural tendency to look first to operations since that is where the problem actually took place. Most operations issues, however, either have their genesis in design and development or are best solved there."

Understand this and I think much of the rest follows naturally.
Monday
Jan 28, 2008

Product: ISPMan Centralized ISP Management System

From FreshPorts and their website: ISPMan is ISP management software written in Perl, using an LDAP backend to manage virtual hosts for an ISP. It can be used to manage DNS, virtual hosts for the Apache config, Postfix configuration, Cyrus mailboxes, ProFTPD, etc. ISPMan was written as a management tool for the network at 4unet, where between 30 and 50 domains are hosted and the number is growing crazily. Managing these domains and their users was a little time consuming, and needed an administrator who knows Linux and these daemons fluently. Now the help desk can easily manage the domains and users. LDAP data can be easily replicated site-wide, and mailbox servers can be scaled from 1 to n as required. An LDAP entry called maildrop tells the SMTP server (Postfix) where to deliver the mail. The SMTP servers can be load-balanced with one of many load balancing techniques. The program is written with scalability and high availability in mind. This may not be the right software for you if you want to run a small ISP on a single box, or if you want to use this software as an LDAP editor or a DNS management tool by itself. ISPMan is written mostly in Perl and is based on four major components. All these components are based on open standards and are easily customizable.

  • LDAP-directory works as a central registry of information about users, hosts, DNS, processes, etc. All information related to resources is kept in this directory. The LDAP directory can be replicated to multiple machines to balance the load.
  • Ispman-webinterface is an intuitive interface to manage information about your ISP infrastructure. This interface allows you to edit your LDAP registry to change different information about your resources, such as adding a new domain or deleting a user. The interface can run on HTTP or HTTPS and is only available after successful authentication as an ISPMan admin. Access control to this interface can also be limited to designated IP addresses, either via Apache access control functions or via ISPMan ACLs.
  • Ispman-agent is a component of ISPMan that runs on hosts taking part in the ISP. These agents read the LDAP directory for processes assigned to them and take appropriate actions, for example creating directories for new domains or mailboxes for new users. These agents are a very important part of the system and should be run continuously. The agents are run via a fault-tolerant services manager called daemontools, which makes sure that the agents recover immediately in case of any failure. (A rough sketch of this agent pattern follows the list.)
  • ISPman-customer-control-panel is an interface targeted towards customers (domain owners). Using this interface, domain owners can manage their own DNS, web server settings, users, mailing lists, access control, etc.
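
To illustrate the agent pattern mentioned above, here is a rough Python sketch. ISPMan itself is written in Perl, and the base DN, filter, and attribute names below are invented for illustration rather than taken from ISPMan's actual LDAP schema: the agent simply polls the directory for tasks assigned to its host and acts on each one, with daemontools restarting the loop if it ever dies.

    # Rough, hypothetical illustration of the agent pattern: poll the LDAP
    # directory for pending tasks assigned to this host and process them.
    # DNs, filters, and attribute names are invented, not ISPMan's real schema.
    import time
    from ldap3 import ALL, Connection, Server

    HOSTNAME = "mail01.example.net"
    BASE_DN = "ou=tasks,o=ispman"                                      # invented
    TASK_FILTER = f"(&(assignedHost={HOSTNAME})(taskStatus=pending))"  # invented

    def handle(task):
        print("processing", task.entry_dn)  # e.g. create a mailbox or vhost

    def poll_once(conn):
        conn.search(BASE_DN, TASK_FILTER, attributes=["taskType", "taskStatus"])
        for task in conn.entries:
            handle(task)

    if __name__ == "__main__":
        server = Server("ldap.example.net", get_info=ALL)
        conn = Connection(server, "cn=agent,o=ispman", "secret", auto_bind=True)
        while True:                 # daemontools restarts the agent if it dies
            poll_once(conn)
            time.sleep(30)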

Click to read more ...