Welcome to Operator framework

Level Ⅰ

Basic install

Automated application provisioning and configuration management
Level Ⅱ

Seamless Upgrades

Patch and minor version upgrades supported
Level Ⅲ

Full lifecycle

App lifecycle, storage lifecycle (backup, failure recovery)
Level Ⅳ

Deep Insights

Metrics, alerts, log processing and workload analysis
Level Ⅴ

Auto Pilot

Horizontal/vertical scaling, auto config tuning, abnormality detection, scheduling tuning

Terminology

Operator - the custom controller installed on a Kubernetes cluster
Operand - the managed workload provided by the Operator as a service
Custom Resource (CR) - an instance of the CustomResourceDefinition the Operator ships that represents the Operand or an Operation on an Operand (also known as primary resources)
Managed resources - the Kubernetes objects or off-cluster services the Operator uses to constitute an Operand (also known as secondary resources)
Custom Resource Definition (CRD) - an API of the Operator, providing the blueprint and validation rules for Custom Resources

Level Ⅰ: Basic install

The Operator offers the following basic features:

Feature

Example

Installation of the workload

Operator deploys an Operand or configures off-cluster resources
Operator waits for managed resources to reach a healthy state
Operator conveys readiness of application or managed resources to the user leveraging the status block of the Custom Resource

An Operator deploys a database by creating Deployment, ServiceAccount, RoleBinding, ConfigMap, PersistentVolumeClaim and Secret objects, initializes an empty database schema and signals that the database is ready to accept queries.

Configuration of the workload

Operator provides configuration via the spec section of the Custom Resource
Operator reconciles the configuration, and any updates to it, with the status of the managed resources

An Operator managing a database can increase the capacity of the database by resizing the underlying PersistentVolumeClaim based on changes to the database's Custom Resource instance.
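
The install-and-configure behaviour listed above can be sketched as a single reconcile pass. This is a minimal in-memory illustration, not the API of any Operator SDK: the `reconcile` function, the dictionary layout standing in for the Custom Resource and the cluster, and the field names `storageGi`, `phase`, `ready` and `observedStorageGi` are all hypothetical.

```python
from copy import deepcopy


def reconcile(cr, cluster):
    """One reconcile pass: drive managed resources toward cr["spec"],
    then report the observed state in cr["status"].

    `cr` and `cluster` are plain dicts standing in for the Custom Resource
    and the cluster's object store (hypothetical layouts).
    """
    cr = deepcopy(cr)  # never mutate the caller's copy of the CR
    name = cr["metadata"]["name"]
    wanted_gi = cr["spec"]["storageGi"]

    # Installation: create the managed (secondary) resources if missing.
    pvc = cluster.setdefault(f"pvc/{name}", {"sizeGi": wanted_gi})
    deploy = cluster.setdefault(f"deployment/{name}", {"readyReplicas": 0})

    # Configuration: grow the PVC to match the spec (this sketch never shrinks).
    if pvc["sizeGi"] < wanted_gi:
        pvc["sizeGi"] = wanted_gi

    # Readiness is conveyed through the status block of the CR.
    ready = deploy["readyReplicas"] > 0 and pvc["sizeGi"] >= wanted_gi
    cr["status"] = {
        "phase": "Ready" if ready else "Pending",
        "ready": ready,
        "observedStorageGi": pvc["sizeGi"],
    }
    return cr
```

Calling `reconcile` repeatedly with an unchanged spec leaves the managed resources untouched, which is the property a real controller relies on when it re-runs the loop on every event.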

GUIDING QUESTIONS TO DETERMINE OPERATOR REACHING LEVEL Ⅰ

What installation configuration can be set in the CR?
What additional installation configuration could still be added?
Can you set Operand configuration in the CR? If so, what configuration is supported for each Operand?
Does the managed application / workload get updated in a non-disruptive fashion when the configuration of the CR is changed?
Does the status of the CR reflect that configuration changes are currently applied?
What additional Operand configuration could still be added?
Do all of the instantiated CRs include a status block? If so, does it provide enough insight to the user about the application state?
Do all of your CRs have documentation listing valid values and mandatory fields?

Level Ⅱ: Seamless Upgrades

The Operator offers the following features related to upgrades:

Feature

Example

Upgrade of the managed workload

Operand can be upgraded in the process of upgrading the Operator, or
Operand can be upgraded as part of changing the CR
Operator understands how to upgrade older versions of the Operand, managed previously by an older version of the Operator

Upgrade of the Operator

Operator can be upgraded seamlessly and can either still manage older versions of the Operand or update them
Operator conveys inability to manage an unsupported version of the Operand in the status section of the CR

An Operator managing a database can update an existing database from a previous to a newer version without data loss. The Operator might do so as part of a configuration change or as part of an update of the Operator itself.
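
One way to picture the upgrade features above is a small planning function that decides whether an Operand can be upgraded and what to report in the CR status. This is only a sketch under stated assumptions: the support window (`OLDEST_MANAGEABLE`, `NEWEST_SUPPORTED`), the condition names and the refusal of downgrades are hypothetical choices made for illustration, not behaviour prescribed by the capability model.

```python
def parse_version(text):
    """Turn "13.4" into (13, 4) so versions compare numerically.
    Assumes every version has the same number of dot-separated parts."""
    return tuple(int(part) for part in text.split("."))


# Hypothetical support window of this Operator release.
OLDEST_MANAGEABLE = "12.0"
NEWEST_SUPPORTED = "14.2"


def plan_upgrade(current, desired):
    """Decide what to do with an Operand and what to put in the CR status."""
    cur, want = parse_version(current), parse_version(desired)
    if cur < parse_version(OLDEST_MANAGEABLE):
        # The Operator conveys that it cannot manage this Operand version.
        return {"action": "none", "condition": "UnsupportedVersion",
                "message": f"Operand {current} is older than {OLDEST_MANAGEABLE}"}
    if want > parse_version(NEWEST_SUPPORTED):
        return {"action": "none", "condition": "UnsupportedVersion",
                "message": f"Operand {desired} is newer than {NEWEST_SUPPORTED}"}
    if want < cur:
        # Assumption of this sketch: downgrades are refused.
        return {"action": "none", "condition": "DowngradeNotSupported",
                "message": f"cannot go from {current} back to {desired}"}
    if want == cur:
        return {"action": "none", "condition": "UpToDate",
                "message": f"Operand already at {current}"}
    return {"action": "upgrade", "condition": "Upgrading",
            "message": f"upgrading Operand from {current} to {desired}"}
```

The same function can be called both when the CR changes and when a new Operator version starts up, which covers the two upgrade triggers named in the feature list.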

GUIDING QUESTIONS TO DETERMINE OPERATOR REACHING LEVEL Ⅱ

Can your Operator upgrade your Operand?
Does your Operator upgrade your Operand during updates of the Operator?
Can your Operator manage older Operand versions?
Is the Operand upgrade potentially disruptive?
If there is downtime during an upgrade, does the Operator convey this in the status of the CR?

Level Ⅲ: Full lifecycle

The Operator offers one or more of the following lifecycle management features:

Feature

Example

Ability to create backups of the Operand

Ability to restore a backup of an Operand

Orchestration of complex re-configuration flows on the Operand

Implementation of fail-over and fail-back of clustered Operands

Support for adding/removing members to a clustered Operand

Enabling application-aware scaling of the Operand

An Operator managing a database provides the ability to create an application-consistent backup of the data by flushing the database log and quiescing the write activity to the database files.
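
The application-consistent backup described in the example can be sketched as a fixed sequence: quiesce writes, flush the log, copy the files, resume writes. The `FakeDatabase` class and the `backup` function below are hypothetical in-memory stand-ins; a real Operator would issue the equivalent commands to the database it manages.

```python
class FakeDatabase:
    """In-memory stand-in for a database Operand (hypothetical)."""

    def __init__(self):
        self.files = {}            # data already persisted to the "database files"
        self.log = []              # writes still sitting in the database log
        self.writes_paused = False

    def write(self, key, value):
        if self.writes_paused:
            raise RuntimeError("writes are quiesced")
        self.log.append((key, value))

    def flush_log(self):
        # Apply every logged write to the files, then empty the log.
        for key, value in self.log:
            self.files[key] = value
        self.log.clear()


def backup(db):
    """Quiesce writes, flush the log, copy the files, then resume writes."""
    db.writes_paused = True
    try:
        db.flush_log()
        return dict(db.files)      # snapshot taken while nothing can change
    finally:
        db.writes_paused = False   # always resume, even if the copy fails
```

The `try`/`finally` reflects a design choice worth copying: whatever happens during the backup, the Operand must not be left with writes suspended.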

GUIDING QUESTIONS TO DETERMINE OPERATOR REACHING LEVEL Ⅲ

Does your Operator support backing up the Operand?
Does your Operator support restoring an Operand from a backup and bringing it under management again?
Does your Operator wait for reconfiguration work to finish, and does it perform that work in the expected sequence?
Is your Operator taking cluster quorum into account, if present?
Does your Operator allow adding/removing read-only slave instances of your Operand?
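
To illustrate what taking cluster quorum into account can mean when removing a member, here is a minimal sketch. The function names and the rule itself are assumptions made for illustration: a removal is allowed only if the healthy members left behind still form a majority of the current membership, which is a deliberately conservative check.

```python
def quorum_size(voting_members):
    """Smallest number of members that forms a strict majority."""
    return voting_members // 2 + 1


def can_remove_member(voting_members, healthy_members):
    """Allow removing one healthy voting member only if the remaining
    healthy members still reach quorum of the current membership
    (a conservative rule assumed for this sketch)."""
    remaining_healthy = healthy_members - 1
    return remaining_healthy >= quorum_size(voting_members)
```

With three members, all healthy, one can be removed; with one of the three already unhealthy, the removal is refused because only a single healthy member would remain.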

Level Ⅳ: Deep insights

The Operator offers one or more of the following deep insights features:

Feature

Example

Monitoring

Operator exposes metrics about its own health
Operator exposes health and performance metrics about the Operand

Alerting and Events

Operand sends useful alerts
Custom Resources emit custom events

Metering

Operator leverages Operator Metering

A database Operator continuously parses the logging output of the database software, recognizes noteworthy log events (e.g. running out of space for database files) and produces alerts. The Operator also instruments the database and exposes application-level metrics, e.g. database queries per second.
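
The log-to-alert and metrics behaviour in the example can be sketched as two small functions. The log patterns, the alert names and both function names are hypothetical; real database software emits its own messages, and a real Operator would publish these values through its monitoring stack rather than return them.

```python
import re

# Hypothetical log patterns mapped to hypothetical alert names.
ALERT_PATTERNS = [
    (re.compile(r"no space left on device", re.IGNORECASE), "DatabaseDiskFull"),
    (re.compile(r"too many connections", re.IGNORECASE), "DatabaseConnectionsExhausted"),
]


def alerts_from_logs(lines):
    """Return the alert names triggered by noteworthy log lines,
    in order of first appearance and without duplicates."""
    seen = []
    for line in lines:
        for pattern, alert in ALERT_PATTERNS:
            if pattern.search(line) and alert not in seen:
                seen.append(alert)
    return seen


def queries_per_second(count_then, count_now, seconds):
    """Rate derived from two readings of an ever-increasing query counter."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (count_now - count_then) / seconds
```

Deduplicating alert names keeps a burst of identical log lines from producing a burst of identical alerts.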

GUIDING QUESTIONS TO DETERMINE OPERATOR REACHING LEVEL Ⅳ

Does your Operator expose a health metrics endpoint?
Does your Operator expose Operand alerts?
Does your Operator watch the Operand to create alerts?
Does your Operator emit custom Kubernetes events?
Does your Operator expose Operand performance metrics?

Level Ⅴ: Auto Pilot

Feature

Example

Auto Scaling

Operator scales the Operand up under increased load based on Operand metric
Operator scales the Operand down below a certain load based on Operand metric

Auto-healing

Operator can automatically heal unhealthy Operands based on Operand metrics/alerts/logs
Operator can prevent the Operand from transitioning into an unhealthy state based on Operand metrics

Auto-tuning

Operator is able to automatically tune the Operand to a certain workload pattern
Operator dynamically shifts workloads onto best suited nodes

Abnormality detection

Operator determines deviations from a standard performance profile

A database Operator monitors the query load of the database and automatically scales the number of read-only slave replicas up and down. The Operator also detects subpar index performance and automatically rebuilds the index in times of reduced load. Further, the Operator understands the normal performance profile of the database and creates alerts on an excessive number of slow queries. In the event of slow queries and high disk latency, the Operator automatically transitions the database files to another PersistentVolume of a higher performance class.
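
The auto-scaling part of this example can be sketched as a proportional rule: choose the replica count that brings the average load per replica close to a target. The function name, its parameters and the default bounds are hypothetical; the rule is similar in spirit to the proportional formula used by the Kubernetes Horizontal Pod Autoscaler, but it is not a reproduction of any Operator's actual logic.

```python
import math


def desired_replicas(current, load_per_replica, target_per_replica,
                     min_replicas=1, max_replicas=10):
    """Pick a replica count so that the total observed load, spread evenly,
    stays at or below the target load per replica."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    total_load = current * load_per_replica
    wanted = math.ceil(total_load / target_per_replica)
    # Clamp to the allowed range so the Operand never scales to zero
    # or beyond what the cluster is expected to afford.
    return max(min_replicas, min(max_replicas, wanted))
```

For instance, three replicas each serving 900 queries per second against a target of 500 would be scaled up, while four replicas each serving 100 would be scaled down to the minimum.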

GUIDING QUESTIONS TO DETERMINE OPERATOR REACHING LEVEL Ⅴ

Can your Operator read metrics such as requests per second or other relevant metrics and auto-scale horizontally or vertically, i.e., increasing the number of pods or resources used by pods?
Following on from the first question, can it also scale down, i.e., decrease the number of pods or the total amount of resources used by pods?
Based on the deep insights provided by Level Ⅳ capabilities, can your Operator determine when an Operand has become unhealthy and take action such as redeploying, changing configuration or restoring backups?
Given that Level Ⅳ deep insights let the Operator learn the performance baseline dynamically and learn the best configuration for peak performance, can it adjust the configuration accordingly?
Can it move the workloads to better nodes, storage or networks to do so?
Can it detect and alert when something is performing below the learned performance baseline and can’t be corrected automatically?