Middle's infrastructure

Understanding the infrastructure of Middle

Middle is built on Amazon Web Services (AWS) cloud infrastructure. Nearly every component of a Middle installation uses an AWS-managed service, including the primary database, queue service, DNS, load balancer, compute instances, container orchestration, infrastructure configuration, and more. This is a deliberate design choice, made to gain AWS's world-class business continuity, security, and compliance features.

ENTERPRISE versus INTEGRATE

ENTERPRISE

ENTERPRISE customers get their own "stack" in AWS. Each ENTERPRISE stack has its own database, queue, and compute instances. For ENTERPRISE customers, this reduces costs at scale, minimizes downtime, minimizes the risk of breaches, and improves overall data governance.

Data is physically separate for ENTERPRISE customers.

INTEGRATE

INTEGRATE customer accounts live on shared cloud infrastructure. Tenant-aware logic in the core application keeps each INTEGRATE customer account's data separate from every other account's.
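As a rough illustration of what tenant-scoped application logic can look like, the sketch below filters every read by an account identifier. The table, the column names, and the `records_for_account` helper are hypothetical, and the in-memory SQLite database is used purely for demonstration; this is not Middle's actual schema or code.

```python
import sqlite3

# Hypothetical schema: every row carries the account (tenant) it belongs to.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    "id INTEGER PRIMARY KEY, account_id TEXT NOT NULL, payload TEXT)"
)
conn.executemany(
    "INSERT INTO records (account_id, payload) VALUES (?, ?)",
    [("acct-a", "invoice 1"), ("acct-b", "invoice 2"), ("acct-a", "invoice 3")],
)


def records_for_account(conn, account_id):
    """Return payloads for one account only; the tenant filter is always applied."""
    rows = conn.execute(
        "SELECT payload FROM records WHERE account_id = ? ORDER BY id",
        (account_id,),
    )
    return [payload for (payload,) in rows]


print(records_for_account(conn, "acct-a"))
```

Routing every query through a helper that requires an account identifier is one common way to make cross-account reads hard to write by accident.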

Infrastructure components

Aurora database

A Middle "stack" is composed of a number of components, the most important of which is the primary database. We choose to rely on AWS managed services for their security and reliability. Each stack uses AWS Aurora as its primary database. AWS Aurora is AWS's flagship relational database management product: it's expensive, powerful, secure, and fault-tolerant.

Data backups

Aurora offers a high degree of redundancy in case of a disaster at the AWS datacenter in which the instance lives (an "availability zone" in AWS parlance). AWS Aurora stores duplicates of the disk in other availability zones within the same region. This is a "by-default" feature that cannot be turned off and is a core part of Aurora's design. It means that if an AWS datacenter suffers a fire or other disaster, your data persists in the remaining availability zones in that region and can be recovered. As a side note, to our knowledge no AWS availability zone has ever been destroyed by a fire or other disaster. AWS datacenters are known to have excellent security and operational control.

Automatic recovery

In the case of a physical disk failure or any other situation where an Aurora instance cannot continue, Aurora recovers automatically. This takes about 5-10 minutes. Aurora can also run in a "High Availability" mode, in which another instance on "hot standby" can take over immediately. This feature is very expensive: it doubles the cost of the primary database, is almost never actually used, and shaves off only 5-10 minutes of downtime. We judged this not to be a reasonable use of customers' money.

Point-in-time recovery

In the case of a developer error that causes data loss, or any other situation requiring recovery, Aurora supports "point-in-time" recovery. In an extraordinary situation, point-in-time recovery lets us restore an Aurora database to any point in time within the rollback window, which is set to 7 days for all Middle customers.
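To make the rollback window concrete, here is a minimal sketch that checks a requested restore time against the 7-day window and assembles the arguments for a restore request. The `build_restore_request` helper and the cluster identifiers are hypothetical. The dictionary keys mirror the parameter names of the RDS RestoreDBClusterToPointInTime API, which restores into a new cluster rather than overwriting the existing one. The sketch only builds the request and makes no AWS call.

```python
from datetime import datetime, timedelta, timezone

# Rollback window stated above for all Middle customers.
ROLLBACK_WINDOW = timedelta(days=7)


def build_restore_request(source_cluster, new_cluster, restore_to, now=None):
    """Validate the target time and return keyword arguments for a restore call."""
    now = now or datetime.now(timezone.utc)
    if restore_to > now:
        raise ValueError("restore target is in the future")
    if now - restore_to > ROLLBACK_WINDOW:
        raise ValueError("restore target is outside the 7-day rollback window")
    return {
        "SourceDBClusterIdentifier": source_cluster,  # hypothetical identifier
        "DBClusterIdentifier": new_cluster,           # hypothetical identifier
        "RestoreToTime": restore_to,
    }


# Example: restore to three days before a fixed reference time.
reference = datetime(2024, 1, 10, tzinfo=timezone.utc)
request = build_restore_request(
    "example-primary", "example-restored", reference - timedelta(days=3), now=reference
)
print(request["RestoreToTime"].isoformat())
```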

AWS SQS Queue management

Middle uses AWS SQS, a managed queue service with excellent resilience.

AWS Lambda

Middle uses AWS Lambda to run integration code. Lambda is a managed compute environment with excellent sandboxing, security, reliability, and scaling capability. App code runs only in AWS Lambda and is isolated at the network level from all other components. The only way our Lambda functions communicate with Middle is by dropping messages into AWS SQS. There is no network access to other systems, and each app's functions are authorized to drop messages into a single queue only. Queue access is managed by AWS IAM.
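The single-queue restriction described above can be pictured as an IAM policy that allows one action on one resource. The following is a minimal sketch, not Middle's actual policy: the `single_queue_send_policy` helper, the account number, the region, and the queue name are all placeholders chosen for illustration.

```python
import json


def single_queue_send_policy(queue_arn):
    """Build an IAM policy document that only allows sending to one SQS queue."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sqs:SendMessage",
                "Resource": queue_arn,  # exactly one queue, no wildcard
            }
        ],
    }


# Placeholder ARN; a real one would name the app's own queue.
queue_arn = "arn:aws:sqs:us-east-1:123456789012:example-app-queue"
policy = single_queue_send_policy(queue_arn)
print(json.dumps(policy, indent=2))
```

Because IAM denies anything not explicitly allowed, a function holding only a policy like this cannot read from the queue or send to any other queue.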

AWS CloudFormation

AWS CloudFormation is a managed "infrastructure-as-code" provisioning tool. Middle stacks are NOT "managed by hand." Instead, we have programmed a reusable YAML template that represents all services needed to power a Middle installation. This means we can (and do) program infrastructure changes, test them, keep them in version control, subject them to code review, and deploy them just like code. In this way, we greatly reduce the chance of operational mistakes and increase the overall reliability of the system. Furthermore, if a customer chooses to cancel Middle, deleting their data is easy: we just delete the stack. Finally, should an availability zone be permanently destroyed, and should we need to re-create a customer's stack in a different availability zone, it is trivial to deploy a new stack for them in a new availability zone and point it at the existing database.
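To give a feel for what a reusable, per-customer template involves, here is a small sketch that generates a CloudFormation template fragment for a given customer. It is not Middle's template: the `build_stack_template` helper, the resource names, and the naming scheme are hypothetical, only two resource types are shown, and the output is JSON (which CloudFormation accepts alongside YAML) to keep the example dependency-free. The 30-day default matches the log retention described for CloudWatch Logs.

```python
import json


def build_stack_template(customer_slug, log_retention_days=30):
    """Return a minimal CloudFormation template fragment for one customer."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": f"Illustrative stack fragment for {customer_slug}",
        "Resources": {
            # Hypothetical logical names; a full stack would declare many more resources.
            "IntegrationQueue": {
                "Type": "AWS::SQS::Queue",
                "Properties": {"QueueName": f"{customer_slug}-integration"},
            },
            "IntegrationLogs": {
                "Type": "AWS::Logs::LogGroup",
                "Properties": {
                    "LogGroupName": f"/{customer_slug}/integration",
                    "RetentionInDays": log_retention_days,
                },
            },
        },
    }


template = build_stack_template("example-customer")
print(json.dumps(template, indent=2))
```

Because the same function produces every customer's template, a change made once and reviewed once applies uniformly to every stack that is subsequently deployed.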

AWS CloudWatch Logs

AWS CloudWatch Logs is a managed logging service that is resilient, scalable, fault-tolerant, and well suited to compliance requirements. We use CloudWatch Logs to store the "standard output" of integration code, that is, whatever the programmer writes with "print" statements, along with any unhandled exceptions. CloudWatch Logs supports expiration timers, which we've set to 30 days for privacy compliance. Like every other part of a Middle stack, each customer gets their own set of Log Groups.

AWS Elastic Load Balancer

AWS Elastic Load Balancer is a managed load balancer service that we use to serve requests to a customer installation's web portal. This is what powers the internal API behind the web pages that Middle users interact with. AWS Elastic Load Balancer is a fantastic product with, like every other AWS product mentioned so far, excellent reliability. It scales automatically as load increases.

AWS EC2 Instances

AWS EC2 Instances is a managed virtual machine service, and it is where we host all of Middle's offline processes. This is the most compute-intensive part of our stack: it is where records are processed, validated, and stored, and where workflows are evaluated and executed. EC2 instances are not the primary data store; a copy of some primary data is kept on EC2 instances running Elasticsearch to power the search UI. EC2 instances are fault-tolerant and reliable, and they are automatically replaced if a failure occurs. They are less reliable than Aurora, which is why they're not used as a primary data store.

Use of best practices

Middle's adoption of AWS services allows us to easily adopt a number of best practices for system security and reliability including:

  • Blue/green rolling deployments, at an infrastructure and application level

  • Encryption in Aurora and EC2, with AWS-managed keys

  • Encryption for website traffic, with AWS ELB

  • Better-than-backups point-in-time recovery, in Aurora

  • Network isolation, with AWS VPC
