15 Mar 2013

What's a Cluster OS?

I've often heard people talk of a cluster OS, but never understood exactly what they meant. Now I do, so I thought it would be helpful to describe its functions. This is guided by (and biased towards) Amazon AWS, Google App Engine, and Google's internal cluster OS. A cluster OS provides:

- Storage systems of various flavors: strongly or eventually consistent, backed by hard disks or SSDs, replicated between datacenters (either synchronously or asynchronously) or not replicated at all. One can have a blob store for unstructured data, and many abstractions for structured data: tables with rows and columns, key-value pairs, JSON databases, hierarchical filesystems, POSIX or otherwise, block devices, SQL databases, an archival system, a tape backup system, etc.

- Schedulers that run distributed programs in one or more clusters, choosing both the cluster and the machines in those clusters to run the programs on, subject to various goals and constraints.

We can minimize latency as experienced by users, which means picking the cluster closest to the user. If an app uses separate frontend and backend servers, one may want to locate the frontend servers close to the underlying storage, or close to the users. Or one might want to pick the cheapest cluster for a task. Which cluster is cheapest also varies over time as other clients compete for it. There may also be legal factors calling for data to be stored in, or kept out of, a particular country.

Similarly, one has to choose machines within a cluster to run a given distributed program. This depends on the requirements of each instance of the program (CPU, memory, etc.) that applications declare ahead of time, and which the scheduler then bin-packs onto physical servers, as sketched below.
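To make the bin-packing idea concrete, here is a minimal first-fit sketch in Python. The task and machine shapes and the first-fit heuristic are illustrative assumptions, not how any particular scheduler actually works:

```python
# First-fit bin-packing of declared task requirements onto machines.
# Task/Machine shapes and the first-fit heuristic are illustrative only.
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    cpu: float   # cores
    ram: float   # GiB

@dataclass
class Machine:
    name: str
    cpu: float
    ram: float
    tasks: list

def schedule(tasks, machines):
    """Place each task on the first machine with enough spare CPU and RAM."""
    unplaced = []
    for t in sorted(tasks, key=lambda t: (t.cpu, t.ram), reverse=True):
        for m in machines:
            used_cpu = sum(x.cpu for x in m.tasks)
            used_ram = sum(x.ram for x in m.tasks)
            if used_cpu + t.cpu <= m.cpu and used_ram + t.ram <= m.ram:
                m.tasks.append(t)
                break
        else:
            unplaced.append(t)  # nothing fits; caller must wait or add capacity
    return unplaced

machines = [Machine("m1", cpu=16, ram=64, tasks=[]), Machine("m2", cpu=16, ram=64, tasks=[])]
tasks = [Task("frontend", 4, 8), Task("backend", 8, 32), Task("cache", 2, 16)]
print(schedule(tasks, machines))  # [] -- everything fits
```

Real schedulers add many constraints on top of this (spreading replicas across failure domains, preemption, priorities), but the core is still packing declared shapes into machines.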

- A system for running batch processes, distinct from interactive ones, which receive user traffic.

- A rollout and rollback system for deployment: This must support features like deploying to a canary datacenter, waiting for an application-defined period, and confirming everything is okay before continuing. It should also support upgrading user by user (so that a given user does not flip back and forth between the old and new systems).

It must support a limited rollout rate (e.g., take down only two machines in a cluster at a time). It must be able to drain traffic from a particular instance before upgrading it. It must support rollback as well, and ideally support running multiple versions at once, as App Engine does. A rough sketch of such a rolling upgrade follows.
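Here is that sketch, assuming hypothetical drain/upgrade/health-check hooks supplied by the deployment system; it is meant to show the shape of the loop, not any real tool's API:

```python
import time

BATCH_SIZE = 2          # take down at most two machines at once
HEALTH_WAIT_SECS = 300  # application-defined soak period

def rolling_upgrade(instances, new_version, drain, upgrade, healthy):
    """Upgrade instances a few at a time; roll the batch back if health checks fail.
    `drain`, `upgrade` and `healthy` are caller-supplied hooks (hypothetical here)."""
    for i in range(0, len(instances), BATCH_SIZE):
        batch = instances[i:i + BATCH_SIZE]
        for inst in batch:
            drain(inst)                    # stop sending traffic to it first
            upgrade(inst, new_version)
        time.sleep(HEALTH_WAIT_SECS)       # wait before judging the batch
        if not all(healthy(inst) for inst in batch):
            for inst in batch:
                upgrade(inst, "previous")  # placeholder for a real rollback step
            raise RuntimeError("rollout aborted: unhealthy batch")
```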

- Network systems like DNS, HTTP reverse proxies, mail gateways, CDNs, VPN gateways, firewalls, virtualized networks, etc. A reverse proxy may provide compression, encryption, caching, spoon-feeding of slow clients, blocking DoS attacks, etc.

- Rate limiting, task resizing and routing systems: rate limiting can be useful to prevent your app from crashing under excessive load (a token-bucket sketch follows this item), task resizing can increase or decrease the number of tasks in your job in response to load, and routing, well, routes HTTP (or other protocol) requests to your tasks. Ideally requests should be routed to the nearest cluster, but your instances in that cluster may not have the necessary data to process the request, or may be overloaded, etc. Similarly, if load increases, we ideally want to spin up more instances in the closest datacenter. Or, alternatively, the cheapest. Load balancers should support retry mechanisms.
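Rate limiting is often implemented with something like a token bucket; this sketch is my assumption about the mechanism, not a description of any particular provider's limiter:

```python
import time

class TokenBucket:
    """Allow up to `rate` requests per second, with bursts up to `burst`."""
    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should shed or queue the request

limiter = TokenBucket(rate=100, burst=20)
if not limiter.allow():
    print("503: shedding load instead of crashing")
```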

- Health-checking for the load balancers to detect which of our instances are healthy, so that traffic is not sent to an unhealthy one.
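A toy version of the kind of probe a load balancer might run; the /healthz path, port, and timeout are assumptions for illustration:

```python
import urllib.request

def is_healthy(instance, timeout=2.0):
    """Probe a (hypothetical) /healthz endpoint; anything but HTTP 200 counts as unhealthy."""
    try:
        with urllib.request.urlopen(f"http://{instance}/healthz", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

backends = ["10.0.0.1:8080", "10.0.0.2:8080"]
live = [b for b in backends if is_healthy(b)]  # only these receive traffic
```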

- Various other general-purpose abstractions like a search system, perhaps structured search, caching (like memcached), a lock service, task queues, a notification/pub-sub system, perhaps a framework for running A/B tests.

- Various special-purpose services like image and video transcoding pipelines, a payment system, a system to send push messages to mobile devices, and OAuth for users to log in to your app using the cloud provider's login system (Google or Amazon).

- Isolation from other users of the cloud, or from independent apps run by the same user; and handling of hardware failures, datacenter maintenance, etc., in a manner as close to invisible to users as possible.

- A logging system, a monitoring system that tracks metrics that are application-independent (latency) as well as application-specific (the number of notes created, if you're building a notes app), an alerting system that can page developers based on criteria defined in terms of these metrics, a dashboard with graphs for all these metrics, etc.
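A minimal sketch of threshold-based alerting over such metrics; the metric names, thresholds, and the `page` hook are made up for illustration:

```python
# Evaluate alerting rules over the latest metric samples and page if a threshold is breached.
RULES = [
    # (metric name, threshold, direction)
    ("latency_ms_p99", 500, "above"),
    ("notes_created_per_min", 1, "below"),   # app-specific: the notes app has gone quiet
]

def evaluate(samples, page):
    """`samples` maps metric name -> latest value; `page` notifies the on-call developer."""
    for metric, threshold, direction in RULES:
        value = samples.get(metric)
        if value is None:
            continue
        if (direction == "above" and value > threshold) or \
           (direction == "below" and value < threshold):
            page(f"{metric} is {value}, threshold {threshold} ({direction})")

evaluate({"latency_ms_p99": 850}, page=print)
```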

- An access control system for developers, perhaps role-based, with groups and permissions. Maybe developers can see the dashboard, only certain release engineers can deploy new versions, SREs can see logs (which may contain sensitive user data), etc.
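A tiny role-based check of the sort described above; the role names and permissions are hypothetical:

```python
# Hypothetical role -> permission mapping for developer access control.
ROLE_PERMISSIONS = {
    "developer":        {"view_dashboard"},
    "release_engineer": {"view_dashboard", "deploy"},
    "sre":              {"view_dashboard", "deploy", "read_logs"},
}

def allowed(user_roles, permission):
    """True if any of the user's roles grants the permission."""
    return any(permission in ROLE_PERMISSIONS.get(role, set()) for role in user_roles)

print(allowed({"developer"}, "deploy"))  # False
print(allowed({"sre"}, "read_logs"))     # True
```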

- Testing, like load testing or integration testing. This will involve a QA setup that you deploy to first, confirm everything is okay, and only then deploy to production.

- Other aspects like quota enforcement, resource accounting, billing, etc.

- VMs may be a part of the system, and they have their advantages, but they don't provide any of the above facilities that a cluster OS would. A VM just provides a machine running a traditional OS, while we need a cluster OS.
