Introduction
The SwipeSense platform is made up primarily of evented microservices that process data in real time as it comes in from the hardware installed at our client hospitals. It is hosted primarily in the AWS us-east-1 region and relies on only a couple of services outside the AWS ecosystem. The infrastructure is designed to exemplify the following attributes:
Scalable - The infrastructure needs to seamlessly accommodate the growth of the business.
Stable - Customers experiencing bugs should be a rare event.
Available - Hospitals never stop and neither do we. Our data is available to our customers 24/7 without interruption.
Performant - Customers should not have to wait on technology to find the insights they need.
Malleable - We are continuously updating our products and data; the architecture should support easy modification without interruption to existing products and services.
Conventions
To accomplish the above attributes, we practice and implement the following conventions.
Auto-scaling - To support scalability, each service and datastore should be auto-scaling where possible, i.e. it should react automatically to accommodate its inputs. This is difficult or impractical for some SQL databases, but for everything else, the capacity of the system should react to demand and scale gracefully without intervention.
Evented - To be performant, our services should be designed to react to data as it flows into the platform. If something can be done in “real time”, we prefer to do it that way to give ourselves more flexibility down the line. Scheduled ETL aggregation tasks are the exception, but for the most part, services are triggered by events and operate asynchronously.
Highly-available - To support availability, services should run across multiple availability zones so that an AWS availability-zone outage does not take them down.
Monitored - To support stability, each service should always implement two types of reporting:
Exceptions - These are logic exceptions in code and are reported to sentry.io (a minimal setup sketch follows this list of conventions).
Infrastructure - These are alerts such as disk capacity, CPU utilization, dead-letter queue depth, etc. Essentially these are alerts about the surrounding infrastructure, not the code, and they are available in AWS CloudWatch.
Serverless Microservices - In order to be malleable, our platform consists of many smaller services that depend on each other but deploy independently and without interruption, so we can modify parts of the architecture without interrupting functionality. Say “no” to monolithic code-bases and services. If possible, these services should also be “serverless” to abstract away infrastructure and capacity concerns.
Single service data ownership/abstracted data access - In general, each datastore should be owned, i.e. writable, by one and only one service. If a store needs to support outside modification of its data, the owner should implement APIs that provide a layer of abstraction over those modifications. Data in unowned datastores should be read only via the centralized data broker of the platform - the GraphQL API. This abstracted data layer offers us one huge benefit: if we ever need to refactor how data is stored, whether for performance or to accommodate a new platform use-case, the only two services we need to modify are the owning microservice and the GraphQL API. This supports malleability by allowing us to continuously reevaluate and redesign storage mechanisms without interrupting the consumers of that data.
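Returning to the Monitored convention above, the sketch below shows the general shape of exception reporting in a Python Lambda. The environment variable name and the handler body are placeholders, not taken from a real service.

```python
# Minimal sketch of the exception-reporting convention for a Python Lambda.
# The SENTRY_DSN variable and process() body are placeholders.
import os

import sentry_sdk
from sentry_sdk.integrations.aws_lambda import AwsLambdaIntegration

sentry_sdk.init(
    dsn=os.environ["SENTRY_DSN"],           # configured per service
    integrations=[AwsLambdaIntegration()],  # captures unhandled Lambda exceptions
)


def handler(event, context):
    # Any unhandled exception here is reported to sentry.io; infrastructure
    # alerts (DLQ depth, CPU, disk) remain the domain of AWS CloudWatch.
    process(event)


def process(event):
    ...  # service-specific logic
```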
Current Design
[Infrastructure and data-flow diagram; editable at https://drive.google.com/file/d/1x-gHIPdjV7s2pSZ9CnU0jf6oKM_qAxNO/view?usp=sharing]
The current infrastructure and its data flow are modeled in the diagram above. It can be broken down into several subsections:
Streaming Data
Streaming data is handled by AWS Kinesis, which is a serverless, auto-scaling streaming service. We use it to implement a form of micro-batching: Kinesis Firehose batches data into AWS S3, and S3 events then post notifications of each new batch to an AWS SNS topic. Our evented microservices subscribe to these SNS topics, allowing them to process incoming data as it is batched (a sketch of a typical subscribing handler follows the list of benefits below). Although this micro-batching introduces a small amount of latency (60 seconds, or the time to fill a 1 MB batch, whichever comes first), it offers a few extra benefits:
All of our streamed data is archived in S3 in perpetuity; we never throw out any data that flows through the platform.
AWS Athena is a serverless query service able to run SQL directly against JSON data in S3. Storing our batches in S3 via Firehose allows us to query any input stream with SQL for lightweight data warehousing and ad-hoc analysis.
SNS events triggering a Lambda function impose no per-stream concurrency limits, allowing essentially unlimited parallelism. Non-Firehose Kinesis streams must provision “shards” for each concurrent Lambda process, which is a bummer.
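For illustration, a subscribing microservice handler generally looks like the sketch below: it unwraps the S3 event notification from the SNS envelope, fetches the new batch object, and processes its records. The assumption that batches are newline-delimited JSON, and the record shapes, are illustrative rather than taken from a specific service.

```python
# Sketch of an evented microservice handler. Assumes the Firehose batch is
# newline-delimited JSON; bucket names and record shapes are illustrative.
import json

import boto3

s3 = boto3.client("s3")


def handler(event, context):
    for sns_record in event["Records"]:
        # The SNS message body is the original S3 event notification.
        s3_event = json.loads(sns_record["Sns"]["Message"])
        for s3_record in s3_event.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            key = s3_record["s3"]["object"]["key"]
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            for line in body.decode("utf-8").splitlines():
                if line.strip():
                    process_event(json.loads(line))


def process_event(sensor_event):
    ...  # service-specific logic (write to dynamodb, enqueue a job, etc.)
```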
Serverless Microservices
Our microservices are implemented exclusively with the Serverless Framework, which utilizes AWS Lambda for serverless functions. They are either evented, reacting to incoming data, or scheduled. Each service doesn't necessarily map to a single function, but rather to a set of related functionality. As an example, the Asset Service both derives the location of assets and aggregates this location information on a scheduled cadence. These are two discrete functions, but they are housed in the same repository and deployed as one Serverless project. A service also defines the supporting resources (such as databases) that it needs to function, which Serverless provisions via AWS CloudFormation. Our current services:
Mesh Events Dynamo - Subscribes to incoming raw sensor data and inserts proximity_events and dispense_events into dynamodb so they are indexed and available for the hygiene location algorithm.
Hygiene Algorithm Scheduler - Subscribes to incoming raw sensor data and submits hygiene algorithm Sidekiq jobs to a redis queue (a sketch of this handoff appears after the service list). Each job submission starts a recursive chain of location algorithm “sessions”, where each session calculates one badged user's visits over a 10-minute period of time. Each recursive chain progresses forward in time in these 10-minute increments until the user/badge in question leaves the facility. These jobs are executed by the legacy “Admin App” rails application, described later.
Contact Tracing Scheduler - Subscribes to incoming raw sensor data and submits hygiene algorithm jobs to a redis queue with the contact_tracing = true option. These jobs are executed by the legacy “Admin App” rails application, described later. This is a minor modification of the above service that schedules badges in facilities that have contact tracing enabled.
Contact Tracing Persistence - This subscribes to the feed of “visits” produced by the hygiene algorithm and inserts them into the contact tracing database. This and the above scheduler are the two lambdas that make up the Contact Tracing Service.
Hardware Status - Subscribes to incoming raw sensor data and maintains a dynamodb store describing the current status (last ping, battery voltage, online status) of every hardware component.
Room Occupancy - Provides a JSON endpoint for ingesting EMR data from Redox - specifically the ADT (admission, discharge, transfer) feed - in order to derive which hospital rooms are occupied in real time.
Visit Aggregation - This subscribes to the feed of “visits” produced by the hygiene algorithm, aggregates them in real-time, and inserts them into the hygiene compliance database. Typically an aggregation job would be colocated with the service that creates the data; however, that data is produced by our legacy “Admin App” rails application.
Asset Algorithm Scheduler - Subscribes to incoming raw sensor data and triggers the Asset Service for active asset tags. In retrospect, this should be part of the Asset Service.
Asset Service - Derives the location of assets and also produces utilization statistics and aggregations.
Asset Alerts - Could conceivably fit into the Asset Service, but was split out because it is written in a different runtime (golang vs. python). This subscribes to a feed of assets that have moved to new locations and checks whether any configured alerts match the movement.
PDF Reports - In a nutshell, this service masquerades as users and generates PDF reports by converting pages in our web_reports app. It is currently triggered by a job in the legacy “Admin App”.
Rounding Compliance - This is a scheduled service. It runs on every half-hour mark and calculates rounding compliance for shifts that ended at the end of the previous hour. Its only input is data that can be fetched from GraphQL.
Utilization Aggregation - This is actually just a function in the Asset Service, but it is a batch ETL job that runs once a day, in contrast to the evented functions that make up the rest of that service.
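Because the hygiene/contact-tracing algorithm itself still runs as Sidekiq jobs inside the legacy “Admin App” (see the Kubernetes section below), the scheduler services above hand work off by writing Sidekiq-formatted payloads into the Redis Job Queue. The sketch below shows roughly what that handoff looks like; the job class, queue name, and argument shape are placeholders, since the real ones are defined in the Admin App.

```python
# Rough sketch of scheduling a Sidekiq job from a Lambda. Sidekiq jobs are
# JSON payloads pushed onto a Redis list named "queue:<name>". The class,
# queue, and argument names here are placeholders.
import json
import os
import time
import uuid

import redis

r = redis.Redis.from_url(os.environ["REDIS_URL"])


def enqueue_hygiene_job(badge_id, window_start, contact_tracing=False):
    job = {
        "class": "HygieneAlgorithmJob",  # placeholder; the real worker class lives in the Admin App
        "queue": "hygiene",              # placeholder queue name
        "args": [badge_id, window_start, {"contact_tracing": contact_tracing}],
        "jid": uuid.uuid4().hex,
        "retry": True,
        "created_at": time.time(),
        "enqueued_at": time.time(),
    }
    r.sadd("queues", job["queue"])  # ensure Sidekiq polls this queue
    r.lpush(f"queue:{job['queue']}", json.dumps(job))
```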
Data Plane
SwipeSense has a lot of data, and we choose storage mechanisms based on the access patterns of that data. This also conveniently provides silos for data access, so we can grant granular access privileges to individual stores if we need to. Each store has one owner. The owner can directly write to and read from the store. If the store needs to be modified by other workflows, the owner must provide an abstracted means of doing so - e.g. a Lambda to invoke or an API Gateway REST endpoint (an illustrative sketch of this pattern follows the store summary below). All read access to datastores is abstracted via GraphQL (more on this later). A summary of the current stores:
OLTP (mysql) - This is our oldest datastore and was created directly in the AWS console years ago. Because it has no single natural owner, it is the only store with shared ownership: it is modifiable by both of our rails applications, “Admin App” and graphql_api. This is probably the most important of all the stores, containing the hierarchical configuration and relationships for each client hospital. All networks, facilities, departments, units, users, locations, hardware, and their associations are held in this store.
Hygiene Compliance (mysql) - A legacy store created directly in the AWS console. Its owner is the “Visit Aggregation” microservice. It holds hand hygiene compliance data aggregated across the many dimensions available in the Hand Hygiene app.
Redis Job Queue (redis) - The final legacy store created directly in the AWS console. This breaks the one-owner rule: it is writable by both the “Hygiene Algorithm Scheduler” service, which schedules new jobs, and the legacy “Admin App”, which processes them.
Proximity Events (dynamodb) - Owned and defined by the “Mesh Events Dynamo” service. This holds the last week of “proximity events”, which are the raw data for deriving location.
Dispense Events (dynamodb) - Owned and defined by the “Mesh Events Dynamo” service. This holds the last week of “dispense events”, which are the raw data for deriving dispenser usage.
Hardware Status (dynamodb) - Owned by the “Hardware Status” service. This holds the current state of all sensor hardware.
Room Occupancy (dynamodb) - Owned by the “Room Occupancy” service. Holds a historical log of occupancy status for each room in an EMR-integrated hospital.
Asset Location (dynamodb) - Owned by the “Asset Service”. This contains an indexed history of every location an asset has been in.
Asset Utilization (postgresql) - Owned by the “Asset Service”. This contains aggregate information about where an asset has been for reporting purposes in the Asset Tracking app.
Rounding Compliance (postgresql) - Owned by the “Rounding Compliance” service. This contains fact data and aggregations available in the Nursing Insights app.
Contact Tracing (postgresql) - Owned by the Contact Tracing Service. This contains all the visits generated by the contact tracing algorithm and is the primary datasource for the Contact Tracing frontend app.
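As an illustration of the one-owner rule, a workflow outside the owning service modifies a store through an abstraction the owner exposes rather than writing to it directly. The function name and payload below are hypothetical, not an actual Lambda in the platform.

```python
# Hypothetical example of the abstracted-write convention: a non-owner invokes
# a Lambda exposed by the owning service instead of writing to the store itself.
import json

import boto3

lambda_client = boto3.client("lambda")


def mark_hardware_offline(hardware_id):
    # "hardware-status-prod-updateStatus" is a hypothetical owner-provided function.
    response = lambda_client.invoke(
        FunctionName="hardware-status-prod-updateStatus",
        InvocationType="RequestResponse",
        Payload=json.dumps({"hardware_id": hardware_id, "online": False}),
    )
    return json.loads(response["Payload"].read())
```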
Kubernetes
In the beginning, SwipeSense was a monolithic Ruby on Rails application (hey, it was the cool thing to do in 2012). Over time, this application has been largely dismantled and its functionality moved to microservices, but it still holds a subset of critical functionality that hasn’t been practical to rewrite. Although the application is old, its hosted infrastructure is cutting edge. It is containerized and deployed on an auto-scaling Kubernetes cluster. The cluster currently runs two apps:
Admin App - The legacy rails application dating back to 2012. At this point it holds a few critical pieces of functionality:
SwipeSense Admin (admin.swipesense.com) - The platform for managing data in the OLTP database. This is basically a highly configured rails_admin. It is used by both our customers and CS team to manage and configure all of our installations.
Login Portal (portal.swipesense.com) - The login page for the whole platform. It provides a session cookie that authenticates user requests across all apps in the swipesense.com domain.
The Hygiene/Contact Tracing Algorithm - This is super important, as it is responsible for deriving location and dispenser usage for hand hygiene compliance. More recently (mid 2020), it was augmented with a contact tracing mode to produce visits for all areas of a hospital, not just patient rooms. There are entire articles on this in the Product & Engineering space on Confluence. The algorithm runs as Sidekiq jobs, one job per badged user, per unit of time (10 minutes by default). These jobs are queued in the “Redis Job Queue” and are executed in parallel by auto-scaling worker pods within the cluster.
Graphql API - When we decided to create a centralized data abstraction for the platform, we turned to GraphQL. We wanted to reuse the data modeling logic held in the Admin App, but wanted to avoid adding a significant new piece of functionality to that monolith. We split the difference and created a new Rails app that we could develop on its own, without some of the legacy conventions in place in the Admin App. In retrospect, it might have been nice to do this completely serverless in golang, but this approach got us a working GraphQL implementation quickly and it continues to scale well.
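Both the microservices and the frontend apps read platform data by posting queries to this API. The sketch below shows the general shape of such a read from a Python service; the endpoint configuration, auth scheme, and field names are placeholders rather than the real schema.

```python
# Hypothetical GraphQL read from a consuming service. The endpoint, auth
# scheme, and field names are placeholders; the real schema lives in graphql_api.
import os

import requests

GRAPHQL_URL = os.environ["GRAPHQL_URL"]  # configured per environment

QUERY = """
query FacilityUnits($facilityId: ID!) {
  facility(id: $facilityId) {
    name
    units { id name }
  }
}
"""


def fetch_units(facility_id, token):
    resp = requests.post(
        GRAPHQL_URL,
        json={"query": QUERY, "variables": {"facilityId": facility_id}},
        headers={"Authorization": f"Bearer {token}"},  # placeholder auth scheme
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["data"]["facility"]["units"]
```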
Single Page Applications (S3 SPAs)
Our frontend applications are built using React and, similar to the microservices, consume data directly from GraphQL. They are deployed in S3 buckets fronted by an AWS CloudFront distribution. The current apps are the product frontends referenced elsewhere in this document: Hand Hygiene, Asset Tracking, Nursing Insights, Contact Tracing, and web_reports.
Comm Hub Ingest Architecture
There is one part of the SwipeSense architecture which is not pictured in the cloud diagram, as it is hybrid: half in the cloud and half on-premise (in the hospital). This is the comm hub architecture, which is the very top of our sensor data funnel. The comm hub is an on-premise SwipeSense device that acts as our secured gateway into the cloud. It is a custom ARM-based Linux box based on the open-source Raspberry Pi reference design, running a customized variant of Raspbian (which in turn is a custom variant of Debian made specifically for the Raspberry Pi). This is one of the most legacy parts of the Gen2 system and has its own devops procedures handled by the support/operations department. However, it was originally built by engineering, and a lot of the original design logic is documented (quite poorly in many cases) here. For the purposes of understanding the overall infrastructure, this part of the architecture looks like this:
[Comm hub ingest diagram; editable at https://drive.google.com/file/d/1S7Ar_d5QTJrLSk2znC8BD6f5nKAIIh06/view?usp=sharing]
The Kinesis Firehose stream shown here is the same one labeled “Mesh Events Ingest” in the diagram at the top of this page.
Enterprise
Each installed facility gets a unique AWS instance code-named “Enterprise”. The Enterprise provides an OpenVPN endpoint for all comm hubs in its corresponding facility, and the comm hubs use this VPN connection to tunnel traffic to and from the cloud. The tunnel is bidirectional and is also used by the Enterprise to manage comm hubs via Ansible playbooks (find ours here). Each of these instances also has a unique security group that whitelists access to only the hospital's public IP range. The instances are provisioned and managed with AWS OpsWorks, which provides an easy-to-use UI to both provision new instances and run Chef devops recipes (find our recipes here).
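For illustration, the per-facility whitelist amounts to something like the snippet below. In practice this is handled by the OpsWorks/Chef tooling rather than ad-hoc scripts, and the port (OpenVPN's default 1194/udp) and CIDR are assumptions, not values from a real facility.

```python
# Illustrative only: restrict an Enterprise's VPN port to its hospital's public
# IP range. OpsWorks/Chef owns this in practice; port and CIDR are assumptions.
import boto3

ec2 = boto3.client("ec2")


def whitelist_hospital(security_group_id, hospital_cidr):
    ec2.authorize_security_group_ingress(
        GroupId=security_group_id,
        IpPermissions=[{
            "IpProtocol": "udp",
            "FromPort": 1194,   # assumed OpenVPN default port
            "ToPort": 1194,
            "IpRanges": [{"CidrIp": hospital_cidr,
                          "Description": "Hospital public IP range"}],
        }],
    )


# e.g. whitelist_hospital("sg-0123456789abcdef0", "198.51.100.0/24")
```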
3rd Party Services
Besides AWS, we rely on only a couple of other 3rd parties for platform functionality:
Twilio/Sendgrid - Used for sending automated SMS and emails.
Redox - Our EMR integration partner. Throughout SwipeSense's history, we've prioritized limiting the sensitive information we ingest into the platform. In particular, ingesting PHI (personal health information) requires a great deal of oversight and thought from a security perspective, and is best avoided if possible. To that end, Redox offers us two primary services:
Work with hospital EMR teams to provide endpoints for forwarding EMR HL7 feeds.
Filter the above stream to strip out any PHI before forwarding the data to our platform. This alleviates HIPAA concerns around PHI (see the ingestion sketch below).
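Tying this back to the Room Occupancy service above: the de-identified ADT messages arrive as JSON posted to our endpoint, and a handler roughly like the following records the occupancy-relevant pieces. The field names, table name, and derivation logic are illustrative and do not reflect the actual Redox data model or our schema.

```python
# Rough sketch of the Room Occupancy ingestion endpoint: an API Gateway-backed
# Lambda receiving de-identified ADT JSON from Redox. Field and table names
# are illustrative, not the actual Redox data model or platform schema.
import json
import time

import boto3

table = boto3.resource("dynamodb").Table("room-occupancy")  # illustrative table name


def handler(event, context):
    message = json.loads(event["body"])
    event_type = message.get("Meta", {}).get("EventType")           # e.g. admission/discharge/transfer
    room = message.get("Visit", {}).get("Location", {}).get("Room")
    if room and event_type:
        # Log the ADT event; occupancy status is derived from this history.
        table.put_item(Item={
            "room": room,
            "timestamp": int(time.time()),
            "event_type": event_type,
        })
    return {"statusCode": 200, "body": json.dumps({"ok": True})}
```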