Understanding the infrastructure advantages of a Middle installation
Middle's infrastructure lives entirely within Amazon Web Services (AWS). Nearly every component of a Middle installation uses an AWS-managed service, including the primary database, queue service, DNS, load balancer, compute instances, container orchestration, infrastructure configuration, and more. This is a deliberate design choice that lets Middle inherit AWS's world-class business continuity, security, and compliance features. In short, Middle's judicious use of AWS services, combined with a single-tenant architecture and an "infrastructure-as-code" approach to managing systems, allows Middle to gain the benefits of AWS at a reasonable cost to the customer.
Each of Middle's customers gets their own "stack" in AWS. Each stack has its own database, queue and compute instances. This was a design choice made to reduce costs at scale, minimize downtime, minimize risk of breaches and improve overall data governance.
Whereas other systems rely solely on application logic to keep one customer's data apart from another's, Middle's data is physically separated. A programming error in the Middle application cannot cause one customer's data to become visible to another.
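To make the difference concrete, here is a minimal Python sketch contrasting the two approaches. The `TenantStack` class, the `STACKS` registry, the endpoints, and the `records` table are hypothetical names for illustration, not Middle's actual code. In the shared model, isolation depends on every query carrying a tenant filter; in the single-tenant model, the customer determines which database is reached at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantStack:
    """Hypothetical per-customer resources: each customer has its own stack."""
    customer_id: str
    db_endpoint: str
    queue_name: str


# Illustrative registry; endpoints and queue names are made up.
STACKS = {
    "acme": TenantStack("acme", "acme-db.cluster.example.internal", "acme-events"),
    "globex": TenantStack("globex", "globex-db.cluster.example.internal", "globex-events"),
}


def shared_tenant_query(tenant_id: str) -> tuple[str, tuple[str, ...]]:
    # Shared multi-tenant style: one database holds everyone's rows, so
    # isolation depends on every query remembering this WHERE clause.
    return ("SELECT * FROM records WHERE tenant_id = %s", (tenant_id,))


def single_tenant_target(customer_id: str) -> tuple[str, str]:
    # Single-tenant style: the customer selects the database itself, and the
    # query needs no tenant filter because that database holds one customer.
    stack = STACKS[customer_id]
    return (stack.db_endpoint, "SELECT * FROM records")
```

If a developer forgets the `WHERE tenant_id` clause in the shared version, the query returns every tenant's rows. In the single-tenant version, the same query can only read the one customer database it is connected to.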
Where other systems may struggle to deploy platform updates, Middle can roll out deployments to some customers, but not others. In this way, Middle can test platform updates on a small number of deployed production environments, and only if the update proves to be successful do deployments move ahead to other customers. Furthermore, we can respect customer wishes to not receive updates at a particular time, such as a time when their system is under significant load with an important process running.
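The staged rollout described above can be sketched as follows, assuming a simple two-wave plan: a small canary wave, then everyone else. The `Customer` class, the `frozen` flag (standing in for a customer's request to defer updates), and the `deploy` and `healthy` callables are hypothetical; this is not Middle's real deployment tooling.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Customer:
    """Hypothetical customer record for this sketch."""
    name: str
    frozen: bool = False  # customer asked not to receive updates right now


def plan_waves(customers: list[Customer], canary_count: int) -> list[list[str]]:
    """Split customers into a small canary wave followed by everyone else.

    Frozen customers are left out and picked up by a later rollout.
    """
    eligible = [c.name for c in customers if not c.frozen]
    canary, rest = eligible[:canary_count], eligible[canary_count:]
    return [wave for wave in (canary, rest) if wave]


def roll_out(
    waves: list[list[str]],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> list[str]:
    """Deploy wave by wave, halting if any stack in a wave is unhealthy."""
    deployed: list[str] = []
    for wave in waves:
        for name in wave:
            deploy(name)
            deployed.append(name)
        if not all(healthy(name) for name in wave):
            break  # later waves keep running the previous version
    return deployed
```

Because each customer has a separate stack, a failed canary leaves every other customer on the previous version, and a frozen customer is simply left out of the plan until a later rollout.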
Platforms with a shared multi-tenant architecture often struggle with the noisy neighbor problem, in which one large tenant hogs resources that should be available to others. Middle's single-tenant architecture ensures that every resource in a customer's stack is dedicated to that customer's own use.
Middle customers can be confident that if they ever discontinue using the product, all of their data can be easily and irrevocably deleted. Middle simply deletes the stack, and all associated resources, such as the database, go away forever.
A Middle "stack" is composed of a number of components, the most important of which is the primary database. We choose to rely on AWS managed services for their security and reliability. Each stack uses AWS Aurora as its primary database. AWS Aurora is AWS's flagship relational database management product: it's expensive, powerful, secure, and fault-tolerant.
Aurora offers a high degree of redundancy in case of a disaster at the AWS datacenter in which the instance lives, known in AWS parlance as an "availability zone." Aurora stores copies of the database's storage volume in other availability zones within the same region. This is a default feature that cannot be turned off and is a core part of Aurora's design. It means that if an AWS datacenter suffers a fire or other disaster, your data persists in the remaining availability zones in that region and can be recovered. As a side note, to our knowledge no AWS availability zone has ever been destroyed by a fire or other disaster; AWS datacenters are known for excellent physical security and operational controls.
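A small model shows why losing a zone does not lose data. AWS's Aurora FAQ states that Aurora keeps six copies of the data across three availability zones and tolerates the loss of two copies without affecting write availability and three without affecting read availability; that corresponds to a 4-of-6 write quorum and a 3-of-6 read quorum, as AWS's published Aurora design paper describes. The sketch below is a back-of-the-envelope model of those numbers, not Aurora's implementation.

```python
# Back-of-the-envelope model of Aurora's storage quorum (not Aurora's code).
# Per AWS's Aurora FAQ: six copies of the data, two in each of three zones.
COPIES_PER_AZ = 2
AZ_COUNT = 3
TOTAL_COPIES = COPIES_PER_AZ * AZ_COUNT  # 6
WRITE_QUORUM = 4  # a write must reach 4 of 6 copies
READ_QUORUM = 3   # a read must reach 3 of 6 copies

# Quorum rules: every read overlaps the latest write, and any two writes overlap.
assert READ_QUORUM + WRITE_QUORUM > TOTAL_COPIES
assert 2 * WRITE_QUORUM > TOTAL_COPIES


def surviving_copies(lost_azs: int, extra_lost_copies: int = 0) -> int:
    """Copies left after losing whole zones plus some individual copies."""
    return TOTAL_COPIES - lost_azs * COPIES_PER_AZ - extra_lost_copies


def can_write(copies: int) -> bool:
    return copies >= WRITE_QUORUM


def can_read(copies: int) -> bool:
    return copies >= READ_QUORUM


# Losing an entire availability zone still leaves a write quorum.
print(surviving_copies(1), can_write(surviving_copies(1)))  # 4 True
```

The model also shows the limit: losing a zone plus one more copy (`surviving_copies(1, 1)`) leaves 3 copies, enough to read but not to write until the missing copies are repaired.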
If a physical disk fails, or in any other situation where an Aurora instance cannot continue, Aurora recovers automatically; this takes about 5-10 minutes. Aurora can also run in a "High Availability" mode, in which another instance on "hot standby" takes over immediately. This feature is very expensive: it doubles the cost of the primary database, is almost never actually used, and shaves off only 5-10 minutes of downtime. We judged this not to be a reasonable use of our customers' money.
If a developer error causes data loss, or in any other situation requiring recovery, Aurora supports point-in-time recovery. In an extraordinary situation, we could use it to restore an Aurora database to any point within the backup retention window, which is set to 7 days for all Middle customers.
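A restore request can only target a moment inside the retention window. This minimal sketch checks that, using the 7-day window from the text; the function name is hypothetical. Aurora itself reports an earliest and latest restorable time for each cluster, and the latest usually trails the current time by a few minutes, so treating `now` as the upper bound is a simplification.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=7)  # Middle's backup retention window


def restore_time_is_valid(
    target: datetime, now: datetime, retention: timedelta = RETENTION
) -> bool:
    """Return True if `target` falls inside the point-in-time recovery window."""
    if target.tzinfo is None or now.tzinfo is None:
        # Mixing naive and aware datetimes is ambiguous; require UTC-aware values.
        raise ValueError("use timezone-aware datetimes")
    return now - retention <= target <= now
```

In Aurora, a point-in-time restore creates a new DB cluster rather than rewinding the existing one, so the damaged cluster remains available for inspection.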
AWS Lambda is used by Middle to run user-authored integration code. It is a managed compute environment with excellent sandboxing, security, reliability, and scaling capability. User-authored code runs only in AWS Lambda and is isolated at the network level from all other components. The only way our Lambda functions communicate with Middle is by sending messages to AWS SQS. They have no network access to other systems, and each App's functions are authorized to send messages to only a single queue. Queue access is managed by AWS IAM.
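The single-queue restriction is the kind of rule IAM expresses directly. The sketch below builds an IAM policy document that allows only `sqs:SendMessage` on one queue ARN. The region, account ID, and queue name are placeholders, and this illustrates the technique rather than reproducing Middle's actual policy.

```python
import json


def lambda_queue_policy(region: str, account_id: str, queue_name: str) -> dict:
    """Build an IAM policy allowing sqs:SendMessage on exactly one queue."""
    queue_arn = f"arn:aws:sqs:{region}:{account_id}:{queue_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Single statement: send-only access, scoped to one queue ARN.
                "Effect": "Allow",
                "Action": "sqs:SendMessage",
                "Resource": queue_arn,
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    policy = lambda_queue_policy("us-east-1", "123456789012", "acme-app-events")
    print(json.dumps(policy, indent=2))
```

A real Lambda execution role also needs permission to write its own logs (for example `logs:CreateLogStream` and `logs:PutLogEvents`). Network isolation is a separate control configured outside IAM.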
AWS CloudFormation is a managed "infrastructure-as-code" provisioning tool. Middle stacks are NOT "managed by hand." Instead, we have written a reusable YAML template that describes all the services needed to power a Middle installation. This means we can (and do) write infrastructure changes as code, test them, keep them in version control, subject them to code review, and deploy them just like application code. In this way, we greatly reduce the chance of operational mistakes and increase the overall reliability of the system. Furthermore, if a customer chooses to cancel Middle, deleting their data is easy: we just delete the stack. Finally, should an availability zone be permanently destroyed and we need to re-create a customer's stack in a different availability zone, we can simply deploy a new stack for them in a new availability zone and point it at the existing database.
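Middle's template is written in YAML; CloudFormation also accepts the same structure as JSON, so the sketch below builds a tiny, hypothetical subset of a per-customer template in Python and serializes it. The resource names, the `aurora-postgresql` engine choice, and the queue naming scheme are assumptions. The template is deliberately incomplete (no credentials, DB instances, or networking), so it is not deployable as-is.

```python
import json


def customer_stack_template(customer: str) -> dict:
    """Hypothetical, incomplete per-customer CloudFormation template."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": f"Illustrative subset of a Middle stack for {customer}",
        "Resources": {
            "Database": {
                "Type": "AWS::RDS::DBCluster",
                # Explicit Delete: CloudFormation's default for DB clusters is
                # Snapshot, which would keep a final copy after stack deletion.
                "DeletionPolicy": "Delete",
                "Properties": {
                    "Engine": "aurora-postgresql",  # assumption; engine not stated
                    "BackupRetentionPeriod": 7,  # days of point-in-time recovery
                    "StorageEncrypted": True,
                },
            },
            "EventQueue": {
                "Type": "AWS::SQS::Queue",
                "Properties": {"QueueName": f"{customer}-events"},
            },
        },
    }


def render(customer: str) -> str:
    """Serialize the template as JSON, which CloudFormation accepts alongside YAML."""
    return json.dumps(customer_stack_template(customer), indent=2)
```

Deploying `render("acme")` and `render("globex")` as two separate stacks gives each customer its own database and queue, and deleting one stack removes only that customer's resources.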
AWS CloudWatch Logs is a managed logging service that is resilient, scalable, fault-tolerant, and well suited to compliance requirements. We use CloudWatch Logs to store the "standard output" of integration code, such as the output of "print" statements and the details of unhandled exceptions. CloudWatch Logs supports expiration timers, which we have set to 30 days for privacy compliance. Like every other part of a Middle stack, each customer gets their own set of Log Groups.
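Two small pieces illustrate this. The first is a hypothetical integration handler: when it runs in Lambda, whatever it prints, and the traceback of any unhandled exception, is captured in the function's log group. The second builds a CloudFormation `AWS::Logs::LogGroup` resource with the 30-day retention from the text. It assumes Lambda's default log group naming (`/aws/lambda/<function name>`); the function names used here are made up.

```python
def example_handler(event: dict, context: object) -> dict:
    """Hypothetical user-authored integration code."""
    records = event.get("records")
    if not isinstance(records, list):
        # An unhandled exception like this one is logged with its traceback.
        raise TypeError("records must be a list")
    # In Lambda, stdout from print() is sent to the function's CloudWatch log group.
    print(f"processing {len(records)} records")
    return {"processed": len(records)}


def log_group_resource(function_name: str, retention_days: int = 30) -> dict:
    """CloudFormation resource for one function's log group with an expiry."""
    return {
        "Type": "AWS::Logs::LogGroup",
        "Properties": {
            # Lambda writes to /aws/lambda/<function name> by default.
            "LogGroupName": f"/aws/lambda/{function_name}",
            # CloudWatch Logs deletes events older than this many days.
            "RetentionInDays": retention_days,
        },
    }
```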
AWS Elastic Load Balancer (ELB) is a managed load-balancing service that we use to serve requests to a customer installation's web portal. It fronts the internal API behind the web pages that Middle users interact with. Like every other AWS product mentioned so far, ELB offers excellent reliability, and it scales automatically as load increases.
AWS EC2 is a managed virtual machine service, and it is where we host all of Middle's offline (background) processes. This is the most compute-intensive part of our stack: it is where records are processed, validated, and stored, and where workflows are evaluated and executed. Primary data is not stored on EC2 instances; some primary data is copied to EC2 instances running Elasticsearch to power the search UI. EC2 instances are fault-tolerant and reliable, and they are automatically replaced if a failure occurs. They are, however, less reliable than Aurora, which is why they are not used as a primary data store.
Middle's adoption of AWS services allows us to easily adopt a number of best practices for system security and reliability including:
- Blue/green rolling deployments, at both the infrastructure and application levels, as discussed above
- Encryption at rest in Aurora and EC2, with AWS-managed keys
- Encryption of website traffic in transit, with AWS ELB
- Point-in-time recovery in Aurora, which goes beyond traditional periodic backups
- Network isolation, with AWS VPC