Entries in network (5)

Tuesday, April 16, 2019

MySQL High Availability Framework Explained – Part III: Failover Scenarios

In this three-part blog series, we introduced a High Availability (HA) Framework for MySQL hosting in Part I, and discussed the details of MySQL semisynchronous replication in Part II. Now in Part III, we review how the framework handles some of the important MySQL failure scenarios and recovers to ensure high availability.

MySQL Failover Scenarios

Scenario 1 – Master MySQL Goes Down

  • The Corosync and Pacemaker framework detects that the master MySQL is no longer available. Pacemaker demotes the master resource and tries to recover with a restart of the MySQL service, if possible.
  • At this point, due to the semisynchronous nature of the replication, all transactions committed on the master have been received by at least one of the slaves.
  • Pacemaker waits until all the received transactions are applied on the slaves and lets the slaves report their promotion scores. The score is calculated such that it is ‘0’ if a slave is completely in sync with the master, and a negative number otherwise (a minimal sketch of this scoring appears after this list).
  • Pacemaker picks the slave that reported a score of ‘0’ and promotes it; the promoted slave now assumes the role of the master MySQL, on which writes are allowed.
  • After slave promotion, the Resource Agent triggers a DNS rerouting module. The module updates the proxy DNS entry with the IP address of the new master, thus facilitating the redirection of all application writes to the new master.
  • Pacemaker also sets up the available slaves to start replicating from this new master.
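
The exact scoring logic lives inside the framework’s resource agent, which we don’t have the source for, but a minimal sketch of the idea – score 0 for a fully caught-up slave and a negative number otherwise, based on the transactions each slave has applied – might look like this (all names here are hypothetical):

```python
# Hypothetical sketch of the promotion scoring described above:
# 0 if a slave has applied everything the master committed,
# a negative number proportional to the gap otherwise.

def promotion_score(slave_applied: set, master_committed: set) -> int:
    missing = master_committed - slave_applied
    return -len(missing)  # 0 when fully in sync

def pick_new_master(scores: dict) -> str:
    # scores maps node name -> reported promotion score; only a fully
    # caught-up slave (score 0) is eligible for promotion.
    best = max(scores, key=scores.get)
    if scores[best] != 0:
        raise RuntimeError("no fully caught-up slave available")
    return best

# Example: S1 is in sync, S2 is missing two transactions.
print(pick_new_master({"S1": 0, "S2": -2}))  # -> S1
```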

Thus, whenever a master MySQL goes down (whether due to a MySQL crash, OS crash, system reboot, etc.), our HA framework detects it and promotes a suitable slave to take over the role of the master. This ensures that the system continues to be available to the applications.

Scenario 2 – Slave MySQL Goes Down

  • The Corosync and Pacemaker framework detects that the slave MySQL is no longer available.
  • Pacemaker attempts to recover the resource by restarting MySQL on the node. If the restart succeeds, the node is added back as a slave of the current master and replication continues.
  • If recovery fails, Pacemaker reports that resource as down – based on which alerts or notifications can be generated. If necessary, the ScaleGrid support team will handle the recovery of this node.
  • In this case, there is no impact on the availability of MySQL services.

Scenario 3 – Network Partition – Network Connectivity Breaks Down Between Master and Slave Nodes

This is a classical problem in any distributed system where each node thinks the other nodes are down, while in reality only the network communication between the nodes is broken. This scenario is more commonly known as a split-brain scenario and, if not handled properly, can lead to more than one node claiming to be the master MySQL, which in turn leads to data inconsistencies and corruption.

Let’s use an example to review how our framework deals with split-brain scenarios in the cluster. We assume that due to network issues, the cluster has partitioned into two groups – the master in one group and the two slaves in the other – which we will denote as [(M), (S1, S2)].

  • Corosync detects that the master node is not able to communicate with the slave nodes, and the slave nodes can communicate with each other, but not with the master.
  • The master node will not be able to commit any transactions, as semisynchronous replication expects acknowledgement from at least one of the slaves before the master can commit. At the same time, Pacemaker shuts down MySQL on the master node due to lack of quorum, based on the Pacemaker setting ‘no-quorum-policy = stop’. Quorum here means a majority of the nodes – two out of three in a 3-node cluster setup. Since the master is the only node running in its partition of the cluster, the no-quorum-policy setting is triggered, leading to the shutdown of the MySQL master (a rough sketch of this quorum decision appears after this list).
  • Now, Pacemaker on the partition (S1, S2) detects that there is no master available in the cluster and initiates a promotion process. Assuming that S1 is up to date with the master (as guaranteed by semisynchronous replication), it is promoted as the new master.
  • Application traffic will be redirected to this new master MySQL node and the slave S2 will start replicating from the new master.
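
The quorum rule at work above is plain majority voting. A rough illustration of the decision each partition makes (not Pacemaker’s actual implementation) is shown below:

```python
# Illustrative majority-quorum check; not Pacemaker's actual code.
# With no-quorum-policy = stop, a partition without quorum stops its
# resources, while a partition holding quorum may promote a new master.

def has_quorum(partition_size: int, cluster_size: int) -> bool:
    return partition_size > cluster_size // 2

CLUSTER_SIZE = 3
for name, size in [("(M)", 1), ("(S1, S2)", 2)]:
    action = ("may promote a new master" if has_quorum(size, CLUSTER_SIZE)
              else "stops MySQL (no-quorum-policy = stop)")
    print(f"partition {name}: {action}")
# partition (M): stops MySQL (no-quorum-policy = stop)
# partition (S1, S2): may promote a new master
```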

Thus, we see that the MySQL HA framework handles split-brain scenarios effectively, ensuring both data consistency and availability in the event the network connectivity breaks between master and slave nodes.

This concludes our 3-part blog series on the MySQL High Availability (HA) framework using semisynchronous replication and the Corosync plus Pacemaker stack. At ScaleGrid, we offer highly available hosting for MySQL on AWS and MySQL on Azure that is implemented based on the concepts explained in this blog series. Please visit the ScaleGrid Console for a free trial of our solutions.

Monday, August 10, 2015

How Google Invented an Amazing Datacenter Network Only They Could Create


Google, with justly earned pride, recently announced:

Today at the 2015 Open Network Summit, we are revealing for the first time the details of five generations of our in-house network technology. From Firehose, our first in-house datacenter network, ten years ago to our latest-generation Jupiter network, we’ve increased the capacity of a single datacenter network more than 100x. Our current generation — Jupiter fabrics — can deliver more than 1 Petabit/sec of total bisection bandwidth. To put this in perspective, such capacity would be enough for 100,000 servers to exchange information at 10Gb/s each, enough to read the entire scanned contents of the Library of Congress in less than 1/10th of a second.

Google’s datacenter network is the magic behind what makes much of Google really work. But what is “bisectional bandwidth” and why does it matter? We talked about bisection bandwidth a while back in Changing Architectures: New Datacenter Networks Will Set Your Code And Data Free. In short, bisection bandwidth is the capacity across the worst-case cut that divides a network’s nodes into two halves – in practice, a measure of how much bandwidth Google’s servers have for talking to each other.
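
Google’s headline numbers are easy to sanity-check: 100,000 servers each exchanging data at 10 Gb/s works out to exactly the quoted 1 Petabit/sec:

```python
# Sanity check of the quoted Jupiter figure.
servers = 100_000
per_server_bps = 10e9                  # 10 Gb/s per server
total_pbps = servers * per_server_bps / 1e15
print(total_pbps)                      # 1.0 -> 1 Petabit/sec of bisection bandwidth
```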

Historically datacenter networks were oriented around talking to users. Let’s say a request for a web page came in from a browser. The request would go to a server and a reply was crafted by talking to just a few other servers, or perhaps even none at all, and the reply would be sent back to the client. This style of network is called a North/South oriented network. Very little internal communication was needed to implement a request.

That all changed as website and API services grew richer over time. Now literally thousands of backend requests can be made to create a single web page. Mind blowing. This meant communication shifted from being dominated by talking to users to talking to other machines within a datacenter. So these are called East/West oriented networks.

The shift to East/West-dominated communication patterns meant a different topology was needed for datacenter networks. The old traditional tree network designs, with progressively more expensive gear toward the top of the hierarchy, were out, and something new needed to take their place.

Google has been at the forefront of developing new network designs that support rich services, largely because of their guiding vision of The Datacenter as a Computer. Once your datacenter is the computer, your network is equivalent to the backplane of a single computer, so it must be as fast and reliable as possible so that remote disk and remote storage can be accessed as if they were local.

Google’s efforts revolve around a three-pronged plan of attack: use a Clos topology, use SDN (Software Defined Networking), and build their own custom gear in their own Googlish way.

Until now we’ve had a limited exposure to Google’s network designs. While we don’t exactly have an all access pass, Amin Vahdat, Google Fellow and Technical Lead for networking at Google, shared a lot of juicy details in a great talk: ONS [Open Networking Summit] 2015: Wednesday Keynote. There’s also a paper: Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network.

Why release details earlier than they usually do? Google has some real competition with Amazon and they need to find compelling points of differentiation. Google hopes their datacenter network is one such point.

So what makes Google different? The overall message:

  • The end of Moore’s Law means how programs are built is changing.

  • Google has figured it out. Google knows how to build great networks and achieve proper datacenter balance.

  • You can prosper by taking advantage of the innovations and capabilities of Google’s Cloud Platform, the very same platform that powers Google Search.

  • So climb on board, the network is fine! 

Is that enough? Perhaps it's not a message with mass appeal, but it may find a home with the discriminating buyer. 

Some key points from the talk for me:

  • We don’t know how to build big networks that deliver lots of bandwidth. Google says their network provides 1 Pb/sec of total bisection bandwidth, but it turns out that’s not nearly enough. To support a datacenter’s worth of large compute servers you’ll need 5 Pb/sec networks. Keep in mind the entire internet today is probably near 200Tb/s.

  • It’s more efficient to schedule jobs over huge clusters. Otherwise you have leftover CPU in one place and leftover memory in another. So if you can build your system correctly, a datacenter scale computer gives you a decided economy of scale.

  • Google built their datacenter network system using lessons they learned from the server and storage world: scale out, logically centralize, use commodity components, and never ever manage singlets of anything. Manage all your servers, storage, and networks as a unified whole.

  • The I/O gap is huge. Amin says it has to get solved; if it doesn’t, then we’ll stop innovating. Storage capacity has increased through disaggregation. The opportunity is to access global datacenter storage as if it were local. This will get harder and harder with flash and NVM. A new tier of flash and NVM will completely change programming models. Note: unfortunately he didn’t expand on this notion; I dearly wish he had. Amin, can we talk?

What you look for in a good story are characters that act from a core identity. Here we see Google operating from a unique vision that grew organically from their deep experience building scalable software systems. Probably only Google would have had the guts to follow their vision through and build a datacenter network so completely different from accepted wisdom. That takes huge huevos. And it makes for a good story.


Tuesday, December 9, 2008

Rules of Thumb in Data Engineering

This is an interesting and still relevant research paper by Jim Gray and Prashant Shenoy at Microsoft Research that examines rules of thumb for the design of data storage systems. It looks at storage, processing, and networking costs, ratios, and trends, with a particular focus on performance and price/performance. Jim Gray has an updated presentation on this topic: Long Term Storage Trends and You. Robin Harris has a great post reflecting on the Rules of Thumb whitepaper on his StorageMojo blog: Architecting the Internet Data Center – Parts I-IV.
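
One of the paper’s best-known results is the updated break-even interval for caching a disk page in RAM (the lineage of Gray’s “five-minute rule”). A small sketch of that calculation, using illustrative placeholder prices rather than figures from the paper:

```python
# Break-even caching interval: keep a page in RAM if it will be
# re-referenced within this many seconds. The prices below are
# illustrative placeholders, not numbers from the paper.

def break_even_seconds(pages_per_mb_ram: float,
                       disk_accesses_per_sec: float,
                       price_per_disk: float,
                       price_per_mb_ram: float) -> float:
    return ((pages_per_mb_ram / disk_accesses_per_sec) *
            (price_per_disk / price_per_mb_ram))

# Example: 8 KB pages (128 per MB), 100 random IOPS per disk,
# a $200 disk drive, RAM at $0.05 per MB.
print(break_even_seconds(128, 100, 200.0, 0.05))  # 5120 s, roughly 85 minutes
```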


Tuesday, November 18, 2008

Scalability Perspectives #2: Van Jacobson – Content-Centric Networking

Scalability Perspectives is a series of posts that highlights the ideas that will shape the next decade of IT architecture. Each post is dedicated to a thought leader of the information age and his vision of the future. Be warned though – the journey into the minds and perspectives of these people requires an open mind.

Van Jacobson

Van Jacobson is a Research Fellow at PARC. Prior to that he was Chief Scientist and co-founder of Packet Design. Prior to that he was Chief Scientist at Cisco. Prior to that he was head of the Network Research group at Lawrence Berkeley National Laboratory. He's been studying networking since 1969. He still hopes that someday something will start to make sense.

Scaling the Internet – Does the Net Need an Upgrade?

As the Internet is being overrun with video traffic, many wonder if it can survive. With challenges being thrown down over the imbalances that have been created and their impact on the viability of monopolistic business models, the Internet is under constant scrutiny. Will it survive? Or will it succumb to the burden of the billion-plus community that is constantly demanding more and more? Does the Net need an upgrade? To answer this question, a distinguished panel of Van Jacobson, Rick Hutley, Norman Lewis, and David S. Isenberg discussed the issue at the Supernova conference. In this compelling debate, available on IT Conversations, the panel addresses the question and provides some differing perspectives. One of those perspectives is content-centric networking, described by Van Jacobson.

A New Way to Look at Networking

Today's research community congratulates itself for the success of the internet and passionately argues whether circuits or datagrams are the One True Way. Meanwhile the list of unsolved problems grows. Security, mobility, ubiquitous computing, wireless, autonomous sensors, content distribution, digital divide, third world infrastructure, etc., are all poorly served by what's available from either the research community or the marketplace. In this amazing Google Tech Talk, Van Jacobson uses various strained analogies and contrived examples to argue that network research is moribund because the only thing it knows how to do is fill in the details of a conversation between two applications. Today, as in the 60s, problems go unsolved due to our tunnel vision and not because of their intrinsic difficulty. And now, like then, simply changing our point of view may make many hard things easy.

Content-centric networking

The founding principle of content-centric networking is that a communication network should allow a user to focus on the data he or she needs, rather than having to reference a specific, physical location from which that data is to be retrieved. This stems from the fact that the vast majority of current Internet usage (a "high 90% level of traffic") consists of data being disseminated from a source to a number of users. The current architecture of the Internet revolves around a conversation model, created in the 1970s to allow geographically distributed users to use a few big, immobile computers. The content-centric approach seeks to adapt the basic architecture of the network to current usage patterns. The new approach comes with a wide range of benefits, one of which is building security (both authentication and ciphering) into the network at the data level. Despite all its advantages, this idea doesn't seem to map very well to some of the current uses of the Web (like web applications, where data is generated on the fly according to user actions) or to real-time applications like VoIP and instant messaging. But one can envision an Internet where content-centric protocols take care of the diffusion-based uses of the network, forming an overlay network, while genuine conversation-centric protocols stay on the current infrastructure.
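
To make the contrast with location-based retrieval concrete, here is a toy illustration (my own sketch, not actual CCN/NDN code) of the core idea: a consumer asks for a content name, and any node along the path that holds the named data – origin or cache – can answer:

```python
# Toy sketch of content-centric retrieval: requests name *what* is
# wanted rather than *where* it lives; caches along the path can answer.

class Node:
    def __init__(self, name):
        self.name = name
        self.content_store = {}  # content name -> data

def fetch(interest_name, path):
    """Walk toward the origin; the first node holding the name replies,
    and the data is cached along the way for future consumers."""
    for i, node in enumerate(path):
        if interest_name in node.content_store:
            data = node.content_store[interest_name]
            for upstream in path[:i]:  # populate caches en route back
                upstream.content_store[interest_name] = data
            return data
    return None  # no node on the path holds the named content

router, origin = Node("router"), Node("origin")
origin.content_store["/videos/talk.mp4"] = b"..."
fetch("/videos/talk.mp4", [router, origin])        # served by the origin
print("/videos/talk.mp4" in router.content_store)  # True: cached for next time
```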

Solutions or workarounds?

There are many solutions or workarounds for the problems posed by traditional conversation-based networking, such as Content Delivery Networks, caching, distributed filesystems, P2P, and PKI. By taking Van Jacobson's perspective, we can investigate new dimensions of these problems. What could be the impact of this perspective on the future of the Internet architecture? What do you think? I recommend the A New Way to Look at Networking video by Van Jacobson. He tells the brief history of networking, from the phone system to the Internet, and shares his vision for dissemination networking.


Sunday, August 24, 2008

A Scalable, Commodity Data Center Network Architecture

Looks interesting... Abstract: Today’s data centers may contain tens of thousands of computers with significant aggregate bandwidth requirements. The network architecture typically consists of a tree of routing and switching elements with progressively more specialized and expensive equipment moving up the network hierarchy. Unfortunately, even when deploying the highest-end IP switches/routers, resulting topologies may only support 50% of the aggregate bandwidth available at the edge of the network, while still incurring tremendous cost. Nonuniform bandwidth among data center nodes complicates application design and limits overall system performance. In this paper, we show how to leverage largely commodity Ethernet switches to support the full aggregate bandwidth of clusters consisting of tens of thousands of elements. Similar to how clusters of commodity computers have largely replaced more specialized SMPs and MPPs, we argue that appropriately architected and interconnected commodity switches may deliver more performance at less cost than available from today’s higher-end solutions. Our approach requires no modifications to the end host network interface, operating system, or applications; critically, it is fully backward compatible with Ethernet, IP, and TCP.
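
The paper’s central construction is the k-ary fat tree: built from k-port commodity switches, it supports k³/4 hosts at full bisection bandwidth. A quick calculation (the port counts chosen here are just illustrative) shows how “tens of thousands of elements” falls out of ordinary switches:

```python
# Hosts supported at full bisection bandwidth by a k-ary fat tree
# built from k-port commodity Ethernet switches.
def fat_tree_hosts(k: int) -> int:
    return k ** 3 // 4

for k in (24, 48):  # illustrative commodity switch port counts
    print(f"k={k}: {fat_tree_hosts(k):,} hosts")
# k=24: 3,456 hosts
# k=48: 27,648 hosts
```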
