Introduction
Defined in a broad (and somewhat aspirational) way, Data Engineering at Swipesense is the supply of (transformed) production data to non-production services in a sustainable, resilient, reproducible manner. In general, data engineering efforts should seek to embody the following qualities:
Security - Maintaining security around our data is of utmost importance. We should restrict access to the data to only the relevant individuals throughout the data engineering lifecycle and strive to ensure that security is maintained through regular audits and testing. Data should be personally identifiable only when absolutely necessary. Data products (analyses, etc.) should be shared externally deliberately and thoughtfully.
Resiliency - ETL processes and data storage should be fault-tolerant and resilient to errors. Further, if errors do occur, they should not affect downstream services or data beyond failing gracefully (i.e., it is far preferable for new data not to appear due to an error than for the error to crash a dashboard or to affect the accuracy or availability of existing data).
Sustainability - Providing production-level data to non-production services is a secondary concern to ensuring that production remains performant and available for customers. As a guiding principle, any data engineering effort should be invisible to production users in terms of performance, uptime, and functionality.
Reproducibility - Data engineering infrastructure should be able to be stood up and torn down in an automated fashion and all data engineering products (tables, databases, ETL jobs, etc.) should be able to be rebuilt or regenerated without need for manual intervention.
Accessibility - Data should be available and easily accessible to those with satisfactory permissions. It should be clear what a given metric or value is, how it was defined, and what its source data is. We will make a particular effort to ensure that derived statistics are clearly defined and will avoid magic numbers. Ideally, there should be one (and only one) way to derive or calculate a given measure. Accuracy (in terms of data quality) and transparency (in terms of data generation) are key parts of this quality.
Data as a First-Class Citizen - There’s no point to collecting and distributing data if it goes unused (or worse, is not trusted). We strive to ensure that timely and accurate data is both available to decision-makers and is a key part of the decisions they make. Further, we believe that timely and accurate data is a key part of assessing the successes and opportunities for improvement for Swipesense’s suite of products.
Note: These values are aspirational and may not be (yet) borne out in how data engineering operates at Swipesense.
Common Terms
ETL (Extract, Transform, and Load) - The process of taking data from production servers (Extract), applying any necessary changes (Transform), and loading it into non-production servers (Load). Depending on the steps required, there may be a lot of transformation or very little.
Business Intelligence (BI) - Software that allows for further editing and transformation of data plus visualization and presentation. At Swipesense, this typically refers to Tableau. If it has dashboards, chances are good that it is a BI tool.
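To make the ETL term concrete, here is a minimal sketch of the three stages using in-memory SQLite databases. The table and column names (`room_events`, `warehouse_room_events`) are hypothetical placeholders, not actual Swipesense schema:

```python
# Minimal ETL sketch. Source and target are in-memory SQLite connections;
# in practice they would be production and warehouse databases.
import sqlite3

def extract(source):
    """Extract: pull raw rows from the production source."""
    return source.execute("SELECT room_id, entered_at FROM room_events").fetchall()

def transform(rows):
    """Transform: apply any necessary changes -- here, drop rows missing a room_id."""
    return [row for row in rows if row[0] is not None]

def load(target, rows):
    """Load: write the transformed rows into the non-production warehouse."""
    target.executemany("INSERT INTO warehouse_room_events VALUES (?, ?)", rows)
    target.commit()
```

Real pipelines may do far more in the Transform step (joins, aggregation, de-identification), but the extract/transform/load boundary stays the same.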
How to Include Data Engineering in New Projects
Note: this section may be subject to change as we mature our data engineering stack.
When starting a new software project at Swipesense, it can be helpful to keep the following in mind from a data engineering perspective:
Definitions
Keeping a strong sense of what data the project will generate can help in making sure data is readily available down the line. It can be useful to have the following recorded somewhere:
What data will the new project generate that might be protected or personally identifiable? What steps can be taken to make sure that data is not sent to ETL pipelines or, if it is, that it is adequately protected?
What are the key concepts and derived statistics? How do you define or generate them?
In other words, if (for example) a hospital room is considered to be occupied after 8 hours within your application but after 12 hours on BI dashboards, there will be some clear discrepancies.
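One way to avoid that kind of discrepancy is to define the derived statistic in exactly one place and have every consumer (application code, ETL, BI) reference it. A minimal sketch, using the 8-hour occupancy threshold from the example above (the constant and function names are illustrative, not a real Swipesense definition):

```python
# Single canonical definition of a derived statistic -- no magic numbers
# scattered across services. The 8-hour threshold is taken from the
# hypothetical example in the text.
from datetime import timedelta

ROOM_OCCUPIED_AFTER = timedelta(hours=8)

def is_room_occupied(time_in_room: timedelta) -> bool:
    """A room counts as occupied once time in the room reaches the threshold."""
    return time_in_room >= ROOM_OCCUPIED_AFTER
```

If the dashboard and the application both import this definition, changing the threshold later is a one-line change rather than a hunt across codebases.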
ETL
How can the data your project generates be collected for use in other cases?
Spend some time familiarizing yourself with the data warehouse and the data GitHub page (yet to be created). Are there similarities with preexisting work?
Are there data dependencies on other applications? Ensure that crosswalks, primary keys, and variable names are consistent across applications when there are external data dependencies.