How to Prepare to Be On-Call

Prepare to Be On-Call: A Comprehensive Guide

Being on-call can be a daunting task, especially if you’re new to the role. The idea of being responsible for resolving issues at any time, day or night, can be stressful. However, with the right preparation and tools like Pager Hero, you can handle on-call duties efficiently and with confidence. Here’s a comprehensive guide on how to prepare for your on-call rotation.

1. Understand Your Responsibilities

Before you start your on-call shift, it's crucial to have a clear understanding of your responsibilities. This includes knowing:

  • What systems and services you are responsible for: Familiarize yourself with the architecture, dependencies, and critical components of the systems you might need to support. Ask yourself as many questions as possible about how the system works. Think about what can go wrong, and why. This mindset helps not only during on-call shifts but also in your daily work, making you a better engineer.

  • The severity levels and their corresponding actions: Understand the difference between a P1 (critical) and P3 (low-priority) incident and the expected response times for each. A full outage calls for a different response than a partial degradation.

  • Escalation procedures: Know when and how to escalate an issue if you can’t resolve it yourself. Escalating shouldn't be the first thing you reach for, but it's always good to know how to proceed in the worst-case scenario.
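One way to make severity levels and escalation rules concrete is to write them down as data. The sketch below is only an illustration: the P2 tier, the `Policy` fields, and every time value are assumptions made up for the example, not recommended targets, so substitute your team's real numbers.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum


class Severity(Enum):
    P1 = "critical"      # e.g. the whole system is down
    P2 = "major"         # hypothetical middle tier, e.g. partially affected
    P3 = "low-priority"


@dataclass(frozen=True)
class Policy:
    acknowledge_within: timedelta  # how quickly someone must respond
    escalate_after: timedelta      # still unresolved after this long: escalate


# Placeholder numbers for illustration only; use your team's agreed targets.
POLICIES = {
    Severity.P1: Policy(timedelta(minutes=5), timedelta(minutes=30)),
    Severity.P2: Policy(timedelta(minutes=30), timedelta(hours=2)),
    Severity.P3: Policy(timedelta(hours=4), timedelta(hours=24)),
}


def should_escalate(severity: Severity, time_open: timedelta) -> bool:
    """Return True once an incident has been open as long as its escalation window."""
    return time_open >= POLICIES[severity].escalate_after
```

With these placeholder values, a P1 that has been open for 45 minutes should be escalated, while a P3 that has been open for an hour should not.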

2. Set Up Your Environment

Tools and Access

Ensure you have all the necessary tools installed and that you can access them from home or on the go. When an incident happens, tools should be correctly set up, so that no time is wasted on configuring them.

Common tools include:

  • Monitoring and alerting systems: Make sure you’re familiar with the dashboards and alert configurations. This includes tools such as New Relic, Grafana, and Datadog.

  • Communication tools: Ensure you have access to Slack, email, or any other communication tools your team uses. This will allow you to access Pager Hero after you receive the incident call.

  • Remote access: Test your VPN, SSH keys, and any other remote access tools to make sure you can log into the necessary systems. Confirm that you can access your hosting provider (AWS, GCP, Azure, etc.) and every part of the infrastructure you may need to touch.
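A pre-shift access check can be partly automated. The following is a minimal sketch, assuming that "reachable" simply means a TCP connection can be opened; it does not verify credentials or VPN state. The function names and the host names in the usage note are hypothetical.

```python
import socket
from typing import Callable, Dict, Tuple

Endpoint = Tuple[str, int]  # (host, port)


def tcp_probe(endpoint: Endpoint, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection(endpoint, timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def check_access(
    endpoints: Dict[str, Endpoint],
    probe: Callable[[Endpoint], bool] = tcp_probe,
) -> Dict[str, bool]:
    """Probe every named endpoint and report which ones are reachable."""
    return {name: probe(endpoint) for name, endpoint in endpoints.items()}
```

Before a shift you might call `check_access({"bastion (SSH)": ("bastion.example.internal", 22)})` with your own hosts and look into anything that comes back `False`. The `probe` parameter exists so the check can be swapped out or stubbed in tests.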

Notifications

  • Set up notifications: Make sure your phone, pager, or email notifications are set up correctly so you don’t miss any alerts. By using Pager Hero, this process is simple. Go to the Pager Hero app and navigate to the home page. There, you'll find an option to update your Personal Info. Click the edit button and enter your phone number. Pager Hero will call you immediately if an alert is triggered on any of the monitored channels.

Figure: Pager Hero home page, where you can find your personal info.

Figure: Enter your phone number, and Pager Hero will call you!

  • Backup alerts: Configure a secondary alert method (e.g., SMS) in case the primary method (e.g., phone call) fails. Pager Hero covers this automatically by trying both SMS notifications and phone calls. Additionally, Pager Hero will tag you on the incident, ensuring you receive a Slack notification via your phone or email.
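The general idea of a backup alert method can be sketched as a fallback chain: try the primary channel, and move to the next one if delivery fails. This is an illustration of the pattern only, not Pager Hero's implementation; the function names and the stub senders are hypothetical.

```python
from typing import Callable, Iterable, Optional, Tuple

# A channel is a name plus a function that returns True when delivery succeeded.
Channel = Tuple[str, Callable[[str], bool]]


def notify_with_fallback(message: str, channels: Iterable[Channel]) -> Optional[str]:
    """Try each channel in order; return the name of the first one that delivers."""
    for name, send in channels:
        try:
            if send(message):
                return name
        except Exception:
            # Treat a crashing channel as a failed delivery and try the next one.
            continue
    return None  # nothing got through


# Stub senders standing in for real phone and SMS integrations.
def failing_call(message: str) -> bool:
    raise ConnectionError("no signal")


def working_sms(message: str) -> bool:
    return True


delivered_via = notify_with_fallback(
    "Database is down",
    [("phone call", failing_call), ("sms", working_sms)],
)
print(delivered_via)  # prints "sms"
```

Returning `None` when every channel fails makes the "nobody was reached" case explicit, so the caller can decide what to do next.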

3. Prepare Your Mindset

Rest and Recovery

  • Sleep well: Get a good night’s sleep before your shift starts. Being well-rested will help you stay alert and make better decisions.

  • Plan for rest breaks: During your shift, take short breaks to rest and recharge. This is especially important if your on-call period spans several days. Choose activities close to home so you can get back to your computer quickly if anything goes wrong. For instance, I often go to a pizzeria near my home during my on-call shift.

Stress Management

  • Stay calm: Remember, it’s normal for things to go wrong. Stay calm and approach each problem methodically. When an incident occurs, it's crucial to keep a clear mind and accurately define the scenario you are facing. Staying calm will help you distinguish between confirmed facts and assumptions that still need evidence.

4. Familiarize Yourself with Documentation

  • Runbooks: Review runbooks and troubleshooting guides. These documents are invaluable during an incident, providing step-by-step instructions to resolve common issues. It's important to create runbooks that can adapt to different scenarios. We will write a blog post specifically about runbooks soon.

  • Knowledge base: Familiarize yourself with the internal wiki or knowledge base where past incidents and their resolutions are documented. Maintaining a comprehensive knowledge base of previous incidents is crucial, as it allows incident responders to benefit from past experiences.

5. Communicate Clearly

Status Updates

  • Regular updates: Provide regular updates on the incident status to stakeholders. Clear communication helps manage expectations and reduces anxiety among your team. It's also important to share updates during the investigation so Pager Hero can create a summary when the incident is resolved and add it to your knowledge base.

  • Post-incident reports: After resolving an incident, document what happened, how it was resolved, and any follow-up actions required. Pager Hero can assist with this by using AI to generate a comprehensive summary of the issue once the incident is resolved.
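If you write post-incident reports by hand, a fixed structure helps make sure nothing is left out. The sketch below is an assumed layout covering the three items named above plus the status updates posted along the way; the class, its fields, and the output format are hypothetical and unrelated to the summaries Pager Hero generates.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IncidentReport:
    title: str
    what_happened: str
    resolution: str
    updates: List[str] = field(default_factory=list)     # status updates posted during the incident
    follow_ups: List[str] = field(default_factory=list)  # actions still to be done

    def render(self) -> str:
        """Format the report as plain text for a wiki or knowledge base."""
        lines = [
            f"Incident: {self.title}",
            f"What happened: {self.what_happened}",
            f"How it was resolved: {self.resolution}",
            "Timeline:",
            *[f"  - {update}" for update in self.updates],
            "Follow-up actions:",
            *[f"  - {action}" for action in self.follow_ups],
        ]
        return "\n".join(lines)
```

Calling `render()` on a filled-in report produces a plain-text block you can paste straight into your knowledge base.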

6. Practice and Continuous Improvement

Drills and Simulations

  • Incident simulations: Participate in or conduct incident response drills. These simulations help you practice your response in a controlled environment. Additionally, get involved in handling real incidents during working hours whenever possible. Working alongside your teammates will help you become more comfortable and confident.

  • Feedback loop: After an on-call shift, review what went well and what didn’t. Use this feedback to improve your processes and skills. Seek input from your colleagues on how you can perform better next time.

To sum up

By understanding your responsibilities, setting up your environment, preparing your mindset, familiarizing yourself with documentation, communicating clearly, and continuously improving your skills, you can handle on-call duties effectively and with confidence, especially with Pager Hero's capabilities behind you. Approach on-call responsibilities methodically to transform what can be a stressful task into a manageable and even rewarding experience. Remember, preparation and continuous learning are key to success in this role. Good luck!

Contact us at any time!

Or drop us an email at hello@pagerhero.io