The goal in defining and assigning a DCS is to have a single point of initial contact for issues that come up during a sprint that were not part of the pre-planned work. The DCS is responsible for handling these issues, but this does not mean they must be the person who actually fixes the issue. It may mean grabbing someone to pair with on a fix, or identifying another engineer to hand the work off to if the DCS is unable to handle it at that time. Potential unplanned issues include:
Slack Alerts:
#alerts_app notifications
These do not need to be checked immediately, but should be routinely checked 2-3 times a day. Depending on the alert, a ticket may need to be created to address it.
#alerts_infrastructure notifications
These should be checked immediately when an alert comes in during normal business hours (9am - 5pm Chicago), or right away when you get online in the morning if they come in overnight. Often these alerts are caused by intermittent AWS issues that resolve themselves; however, they can also be caused by deploying buggy code, so watch the infra alerts especially closely after deploys. The DCS should be the first set of eyes on these, but they should pull in help from other engineers if they are unable to address an alert or do not have the requisite knowledge to know whether it needs addressing. After the system has returned to normal, the DCS should verify, if needed, that no data has been lost or corrupted.
#alerts_security notifications
These should be checked immediately when an alert comes in. Security alerts can be serious: the DCS should notify the VP of Engineering immediately, and these absolutely require resolution as quickly as possible.
#alerts_anomaly
We use a third-party service called Lacework, which monitors all of our AWS CloudTrail logs. CloudTrail logs record access to services via permissions: anytime someone, or a service, accesses another service by virtue of IAM policies/permissions, that is logged. Lacework notifies us of events that are unusual relative to our historical pattern of behavior. As of this writing, it is unclear exactly how high a priority these alerts should be given. This document should be updated as we learn more about our standard operating procedure for these alerts. A small sketch for pulling the underlying CloudTrail events during an investigation follows below.
See this document for further information on how we handle alerts as a team. To help identify priority/severity of alerts, see Appendix B and Appendix C on the Miro board.
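If an anomaly alert warrants a closer look, the underlying CloudTrail events can be pulled directly rather than relying only on the Lacework summary. This is a minimal sketch, assuming boto3 and our default AWS credentials; the username is a placeholder for whatever principal the alert flagged:

    import boto3
    from datetime import datetime, timedelta, timezone

    # Minimal sketch: pull the last hour of CloudTrail events for a flagged principal
    # so they can be compared against what Lacework reported. "some-service-role" is
    # a placeholder, not a real role in our account.
    cloudtrail = boto3.client("cloudtrail")

    response = cloudtrail.lookup_events(
        LookupAttributes=[
            {"AttributeKey": "Username", "AttributeValue": "some-service-role"},
        ],
        StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
        EndTime=datetime.now(timezone.utc),
        MaxResults=50,
    )

    for event in response["Events"]:
        # EventName is the API call made (e.g. AssumeRole); EventSource is the service it hit.
        print(event["EventTime"], event.get("EventName"), event.get("EventSource"))

Each returned event also carries a CloudTrailEvent field containing the full JSON record if more detail is needed.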
Other Alerts:
Sentry alerts
Check at the beginning of the week, looking for any anomalies in errors over the past week that may require further investigation.
AWS notifications
Check at the beginning of the week: https://phd.aws.amazon.com/phd/home#/account/dashboard/open-issues
Billing alerts
Check the metrics and see whether it is a one-off (just one data point above the threshold) or the cost stays high each subsequent day. If it is a one-off, it is probably nothing to worry about. If it is not, dig in deeper by going to the billing dashboard and comparing the current month's cost to previous months for the various services.
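One way to make that month-over-month comparison outside the console is the Cost Explorer API. This is a minimal sketch, assuming boto3 and credentials that are allowed to call Cost Explorer; the $1 cutoff is arbitrary, just to keep the output readable:

    import boto3
    from datetime import date, timedelta

    # Minimal sketch: compare this month's per-service cost to last month's to see
    # whether a billing alert reflects a sustained increase or a one-off spike.
    ce = boto3.client("ce")  # Cost Explorer

    first_of_this_month = date.today().replace(day=1)
    first_of_last_month = (first_of_this_month - timedelta(days=1)).replace(day=1)

    response = ce.get_cost_and_usage(
        TimePeriod={"Start": first_of_last_month.isoformat(), "End": date.today().isoformat()},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )

    for period in response["ResultsByTime"]:
        print(period["TimePeriod"]["Start"])
        for group in period["Groups"]:
            service = group["Keys"][0]
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            if amount > 1.0:  # skip tiny line items to keep the output readable
                print(f"  {service}: ${amount:,.2f}")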
Configuring your phone for notifications:
When you are the DCS, it's important that you're able to receive Slack notifications for #alerts_infrastructure and #alerts_security while you're sleeping. At all times, even when you are not currently the DCS, it's important that other team members can reach you via a phone call while you're sleeping, in the event they are DCS and need help. If you have an iPhone, you can set this up as follows:
Paul Lorsbach: 708.466.6741
Jeff Wagner: 630.392.0611
Ryan Schaul: 312.259.0121
Emily Spieler: 636.675.0501
Brian Jones: 803.917.9481
Aaron Kennedy:
Add all your engineering team members as contacts in your phone. Everyone’s phone number should be in their Slack profile so you can grab them from there.
Go to Settings -> Focus -> Do Not Disturb -> People, and then add the relevant contacts. You should also do this for Sleep mode, which is different from Do Not Disturb mode. This will allow those contacts to break through even when you're sleeping with Sleep mode or Do Not Disturb turned on. It also implies that when you are sleeping, you should be using Sleep or Do Not Disturb mode rather than Airplane mode, so that you can make these customizations; Airplane mode blocks everything and there is nothing to customize.
When it is your turn to be the DCS, and only during that time, also allow Slack as an app to break through, in the same place on your iPhone where you added the contacts. This will let a serious alert wake you while you're sleeping. However, you only want to be bothered in the rare event something happens in one of those two channels, not in any other Slack channel. So if you currently have other general Slack channels set to notify you for “all messages,” change that while you're DCS; since Slack can break through Do Not Disturb, it would be incredibly annoying to be woken for those messages: https://www.loom.com/share/b0e4525be1c6404d97b8dda4950449be.
Be sure to place a checkmark on any alerts that you have looked at. The purpose of the checkmark is to assure other team members that someone has looked at the alert; because the issue has not been escalated, they can rest easy and don't have to look at it themselves. This reduces cognitive load on the rest of the team, which is the purpose of the DCS. The checkmark simply means that you have looked at the alert and determined that there is no need to get other team members involved. You may not know exactly what the issue is, but you know enough to know it's not a serious threat and can be handled later.
When reviewing alerts via a search history in Slack, be sure not to checkmark a current alert simply because it has been checked in the past. If it has been seen many times before, that history is a good indication the alert is not serious. However, don't blindly rely on a small handful of recent checkmarks; it could easily be that your past self, or the last DCS, mistakenly checked a new alert. Making comments in-thread for an alert can help add context for a future DCS.
One-off requests:
These can include data pulls, debugging potential production issues, etc.
Before working on any of these, be sure the VP of Engineering is aware of the request and has ok’d doing the work during the current sprint.
The DCS’s first priority should be to address the issues outlined above. The DCS will work on tickets in the normal flow of the sprint if there are no issues needing their attention; however, these should either be low-priority tickets or, if they are pairing on a ticket, they should not be the anchor on that ticket.
Code Reviews:
It is the DCS's responsibility to review that week's pull requests. If they do not feel comfortable providing feedback on the material (or approving it), it is their responsibility to seek out someone for assistance. They can ask the original developer to explain the codebase/project, or delegate to another developer (or both).
Timeframe: Our goal is to improve our code review turnaround time. When you are notified of a review, review it or delegate it immediately.
Code review process overview:
Developer creates a pull request
Developer tags the full team for review (others are still encouraged to look at it for learning purposes, although not required)
Slack-bot notifies of review request
DCS adds an emoji to the Slack alert to indicate they are looking at the PR
DCS approves and/or requests changes
DCS requests backup help if need be
If change requests are made, developer prioritizes requests immediately so DCS can finalize approval
Once approved: developer handles next steps (merges, deploys, etc.)
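To keep turnaround tight, it can help to periodically list open pull requests and who has been asked to review them. This is a minimal sketch against the GitHub REST API; the org/repo names and the token are placeholders, not our real values:

    import requests

    # Minimal sketch: list open pull requests and their requested reviewers so the
    # DCS can spot anything still awaiting review. OWNER, REPO, and GITHUB_TOKEN
    # are placeholders.
    OWNER, REPO = "example-org", "example-repo"
    GITHUB_TOKEN = "ghp_..."  # a token with read access to the repo

    response = requests.get(
        f"https://api.github.com/repos/{OWNER}/{REPO}/pulls",
        params={"state": "open", "per_page": 50},
        headers={
            "Authorization": f"Bearer {GITHUB_TOKEN}",
            "Accept": "application/vnd.github+json",
        },
        timeout=10,
    )
    response.raise_for_status()

    for pr in response.json():
        reviewers = [r["login"] for r in pr.get("requested_reviewers", [])]
        print(f"#{pr['number']} {pr['title']} - awaiting: {', '.join(reviewers) or 'no reviewer requested'}")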