How to end the incident management game of telephone
The incident management game of telephone is the communication breakdown that can occur between the stakeholders involved in product development and management.
In this game, information about what’s breaking in a product is passed from one person to another, like in the children's game of Telephone. It typically starts when a customer reports an issue, and the customer support team passes it on to the product manager responsible for the affected product.
The product manager then informs the development team, who may pass it on to other teams, such as QA or operations, as needed. Each person unintentionally adds their own interpretation or bias, or passes along incomplete information, so the issue the customer originally reported may be distorted or lost in translation by the time it reaches its final destination. The result is an incomplete picture of the severity or breadth of the underlying issue(s) behind the incident and, in turn, a product feature that stays unusable for end users.
This breakdown in communication can lead to misunderstandings, misaligned expectations, and significant delays in delivering on customer specifications. It almost always results in a product that doesn’t meet the needs or expectations of its intended users, leading to customer dissatisfaction and decreased revenue.
While communication is one of the key tools in the product management skill set, it can be incredibly difficult to be the glue holding multiple departments together, especially when things are breaking in your product and hurting users. To solve this problem, we’ve devised a playbook that covers everything you need to tame the chaos and take charge of every incident without hesitation.
But first, let’s talk about how incident response usually works at companies today:
You’ve done your best to avoid it, but the inevitable has happened – a user has run into a serious bug, they’re unhappy, and now you have an incident on your hands. Whatever purpose brought your user to you has suddenly been blocked, and they’re at their highest statistical risk of churning and never looking back. While eliminating bugs may be low on your priority list (“We eliminated X bugs in the last quarter” isn’t the most compelling sales pitch), for your users, bugs can render your product useless, nullifying the value of that special new feature you just put out. Whatever you’ve built might as well not exist.
Of the users who encounter a bug, only 1 out of 26 will complain to you about it. And 88% of users will simply leave. The vast majority of the time, you lose a user and never even learn why. You’re already losing customers, and you’ll keep losing them until you eliminate the bug. In a world of continuously increasing user expectations, you have to shorten the time between a bug arising and a bug being fixed to minimize losses.
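To make that reporting gap concrete, here is a rough back-of-the-envelope sketch. It takes the 1-in-26 figure above at face value and assumes it applies evenly to your user base, which may not hold for your product; the function name is purely illustrative.

```python
# Assumption: roughly 1 in 26 affected users reports a bug (the figure quoted above).
USERS_PER_REPORT = 26


def estimate_affected_users(reports_received: int,
                            users_per_report: int = USERS_PER_REPORT) -> int:
    """Rough estimate of how many users hit a bug, given how many reported it."""
    if reports_received < 0:
        raise ValueError("reports_received cannot be negative")
    if users_per_report < 1:
        raise ValueError("users_per_report must be at least 1")
    return reports_received * users_per_report
```

Under that assumption, three tickets about the same bug would suggest something like 78 affected users, most of whom never said a word.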
Let’s be optimistic and assume that a user reports the bug through whatever bug reporting system you have. Even now, you may be a long way from solving the problem. You still need to find out what's causing the issue and figure out how to fix it. But going from bug report to bug fix requires communication among people with different knowledge bases and incentives, and, as anyone in engineering will tell you, the communication workflow for giving accurate and valuable accounts of bugs is far from excellent. Diagnosing and fixing reported bugs is an inherently slow and noisy process, easily lasting weeks or months.
The slow process of reporting bugs
Let’s quickly walk through what a normal debugging process looks like for software companies.
A user runs into a bug and decides to report it.
The game of telephone begins when a user reports a bug. Unfortunately, users often don’t have a background that prepares them to write good bug reports and can easily give an account of what happened that misses necessary information or is inaccurate about what led to the issue. The user who runs into a bug likely doesn't know what seemingly irrelevant details might actually be crucial for causing the bug or the right terms to describe what they did.
Support cares, but can’t solve anything.
The first point of contact is unlikely to be a developer who would best know what questions to ask to identify what might have caused the bug. The customer-facing support folks who are the user’s first point of contact are heavily incentivized to care but also aren’t capable of resolving the underlying issue, and likely don’t know what an engineer needs to know in order to diagnose and fix the issue.
The engineering team tries to interpret.
The support team sends the issue ticket to a product leader, who interprets support’s description of the bug and decides how highly the fix should be prioritized. But the engineering team likely hasn’t gotten all the information they need, so they have to either fill in the gaps themselves – and perhaps miss important details – or guess that some recent change they made must be responsible for the bug they’re just now hearing about.
Engineers (try to) tackle the right problem.
Armed with incomplete and potentially misleading information about the bug, devs are assigned the task of fixing the user’s issue. They likely won’t be able to replicate the user’s problem, or they may be able to fix something but not the thing that was giving the user problems. Maybe they test if a recent change to a checkout button is causing problems for a user who can’t check out, only to find everything working as intended.
The developer reports to the product lead that all is fine, but all isn’t fine. Perhaps the integration with PayPal has broken, but the user didn’t mention that they were trying to use PayPal. While figuring out what to do about the bug, product leads may need several back-and-forth discussions to help the development team understand what has gone wrong and how significant the problem is.
Now everyone is annoyed.
At each step of the communication process, we’ve gone further from people who are most directly emotionally affected but can do the least (users and the support members they communicate with) to people who can do more but are less closely connected to the end user’s experience. Support wants to fix users' issues as soon as possible, but the engineering team who can fix bugs has its own goals. Fixing things that are supposed to work feels like a step backward for developers, so receiving a message saying “Hey, this is broken” can feel like a nuisance, especially when you know you might be in for a wild bug-diagnosing adventure.
This process is doomed to waste time. Even when everyone prioritizes fixing the bug, the game of telephone between user and dev leaves the dev with an incomplete or inaccurate description of it, which makes the dev’s job harder. The process is slow, and it depends on developers understanding how users see the application even though they have the least customer-facing experience. Users get impatient, support feels helpless, and developers feel like they’re being sent distracting and impossible tasks.
So... how can you escape the trap of the incident management game of telephone?
Steps to escape the incident management game of telephone:
- Establish clear communication channels
- Document incident details
- Standardize incident response processes
- Conduct regular training
- Conduct regular reviews
Establish clear communication channels
Establishing clear communication channels for incident management involves setting up a clear and structured process for reporting and escalating incidents. The following steps can help in establishing clear communication channels:
- Designate a single point of contact - designate a single point of contact, such as a product manager or incident manager, who is responsible for receiving and triaging incident reports.
- Set up clear incident reporting protocols - define clear incident reporting protocols, such as the information required in an incident report, who to notify, and how to report the incident.
- Establish escalation paths - define escalation paths for incidents based on severity levels, and ensure that all stakeholders are aware of the process.
- Use a common incident tracking tool - use a common incident tracking tool, such as Jira or Trello, to ensure that all teams have access to the same incident information.
- Establish communication guidelines - establish communication guidelines for all stakeholders, including how and when to communicate, who to contact, and what information to provide.
By establishing clear communication channels for incident management, all stakeholders can work together more efficiently and effectively to resolve incidents. This helps to minimize the impact of incidents on customers and ensures that software products are of the highest quality.
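As one way to picture a reporting protocol and escalation paths working together, here is a minimal sketch. The required fields, severity names, and role names are all hypothetical placeholders rather than a prescribed standard; substitute whatever your own protocol defines.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical escalation paths: who is notified, in order, at each severity.
ESCALATION_PATHS = {
    "low": ["incident_manager"],
    "medium": ["incident_manager", "technical_lead"],
    "high": ["incident_manager", "technical_lead", "engineering_manager"],
    "critical": ["incident_manager", "technical_lead",
                 "engineering_manager", "executive_on_call"],
}

# Hypothetical fields a report must carry before it can be triaged.
REQUIRED_FIELDS = ("summary", "steps_to_reproduce", "affected_feature", "reported_by")


@dataclass
class IncidentReport:
    summary: str = ""
    steps_to_reproduce: str = ""
    affected_feature: str = ""
    reported_by: str = ""
    severity: str = "low"

    def missing_fields(self) -> list[str]:
        """Names of required fields that are empty or whitespace only."""
        return [name for name in REQUIRED_FIELDS if not getattr(self, name).strip()]


def triage(report: IncidentReport) -> list[str]:
    """Return who to notify, or reject a report that is not yet usable."""
    missing = report.missing_fields()
    if missing:
        raise ValueError(f"Report is missing required fields: {', '.join(missing)}")
    if report.severity not in ESCALATION_PATHS:
        raise ValueError(f"Unknown severity: {report.severity!r}")
    return ESCALATION_PATHS[report.severity]
```

The point of rejecting incomplete reports at intake is that the gap gets sent back to whoever is closest to the user, instead of being guessed at further down the chain.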
Document incident details
Accurately documenting incident details is crucial for effective incident management for several reasons:
- Provides a clear understanding of the incident - accurate documentation provides a clear understanding of the incident, including the symptoms, root cause, and resolution steps. This information can help teams to identify the underlying issues and prevent similar incidents from happening in the future.
- Facilitates communication and collaboration - accurate documentation ensures that all stakeholders have access to the same information, which helps to facilitate communication and collaboration between teams. This, in turn, helps to resolve the incident more quickly and effectively.
- Ensures accountability - accurate documentation records which team member or group owns the incident. This makes accountability clear and enables teams to take appropriate action to resolve the incident.
- Supports continuous improvement - regular reviews of incident details help to identify trends and recurring issues, enabling teams to make improvements to incident management processes and prevent future incidents.
So... what’s the best way to accurately document incident details every time? PlayerZero. For every incident encountered by your users, we gather a comprehensive report so that your team doesn't have to speculate about reproducing it. This report includes steps to reproduce, metadata reports, network logs, console logs, and storage states, all compiled into a single, shareable package. Our devtools automatically capture the incident, allowing your engineering team to reproduce it quickly. If desired, you can also connect to your preferred engineering monitoring provider for even more in-depth insights.
Plus, with our Slack and Jira integrations, you’re just one click away from aligning the context with your team’s tracking system of choice. Click here to learn more about how we can automate your team’s incident documentation process.
Standardize incident response processes
Standardizing your incident response process involves creating a structured and consistent approach to incident management. Here are some steps to standardize your incident response process:
- Define incident severity levels - define severity levels based on the impact the incident has on customers, the business, and operations. This helps teams to prioritize and triage incidents effectively.
- Develop incident response procedures - develop procedures for each severity level, outlining steps to follow when an incident is reported, who to notify, and what actions to take.
- Assign roles and responsibilities - assign specific roles and responsibilities to each team member, such as incident manager, technical lead, and communications lead, to ensure clear communication and accountability.
- Establish communication channels - establish clear communication channels, such as incident management tools and communication platforms, to ensure that all stakeholders can stay informed about the incident.
- Review and update - regularly review and update the incident response process based on feedback and lessons learned from previous incidents.
By standardizing your incident response process, you can improve incident management, reduce incident resolution time, and minimize the impact on customers.
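A standardized process can be written down as data rather than tribal knowledge. The sketch below is one possible shape, not a recommended policy: the impact inputs, the thresholds (5% and 25% of users), the acknowledgement targets, and the actions are all assumptions chosen for illustration. The role names come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Impact:
    users_affected_pct: float    # share of active users affected, 0 to 100
    core_workflow_blocked: bool  # e.g. users cannot check out
    workaround_available: bool


def classify_severity(impact: Impact) -> str:
    """Map impact to a severity level using illustrative thresholds."""
    if not 0 <= impact.users_affected_pct <= 100:
        raise ValueError("users_affected_pct must be between 0 and 100")
    if impact.core_workflow_blocked and not impact.workaround_available:
        return "critical"
    if impact.core_workflow_blocked or impact.users_affected_pct >= 25:
        return "high"
    if impact.users_affected_pct >= 5:
        return "medium"
    return "low"


# Hypothetical procedure per severity level: response target, roles, first actions.
PROCEDURES = {
    "critical": {"acknowledge_within_minutes": 15,
                 "notify": ["incident_manager", "technical_lead", "communications_lead"],
                 "actions": ["open incident channel", "post customer status update"]},
    "high": {"acknowledge_within_minutes": 60,
             "notify": ["incident_manager", "technical_lead"],
             "actions": ["open incident channel"]},
    "medium": {"acknowledge_within_minutes": 240,
               "notify": ["incident_manager"],
               "actions": ["add to current sprint"]},
    "low": {"acknowledge_within_minutes": 1440,
            "notify": ["incident_manager"],
            "actions": ["add to backlog"]},
}


def response_plan(impact: Impact) -> dict:
    """Combine the classified severity with its documented procedure."""
    severity = classify_severity(impact)
    return {"severity": severity, **PROCEDURES[severity]}
```

Because the thresholds live in one place, reviewing and updating the process becomes a matter of changing a few values that everyone can see.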
Conduct regular training
Conducting regular training on incident response is important to ensure that all stakeholders are familiar with your incident response protocols and can effectively respond to incidents. Here are some steps to conduct regular training on incident response:
- Develop training materials - develop training materials, such as incident response procedures, best practices, and case studies, to educate stakeholders on incident response.
- Schedule regular training sessions - schedule regular training sessions, such as workshops, webinars, and tabletop exercises, to provide hands-on training and simulate incident scenarios.
- Assign roles and responsibilities - assign roles and responsibilities to each team member, such as incident manager, technical lead, and communications lead, during the training sessions to simulate real incident scenarios.
- Provide feedback and guidance - provide feedback and guidance to stakeholders on how to improve their incident response skills, such as communication, collaboration, and problem-solving.
- Review and update training materials - regularly review and update the training materials based on feedback and lessons learned from previous training sessions and incidents.
- Incorporate incident response training into onboarding - incorporate incident response training into the onboarding process for new employees to ensure that everyone is familiar with the incident response process.
By conducting regular training on incident response, you can ensure that all stakeholders are prepared to respond effectively to incidents, minimize the impact on customers, and continuously improve incident management processes.
Conduct regular reviews
Conducting regular reviews of incident response processes is crucial to continuously improving incident management processes and preventing future incidents. Here are some steps to conduct regular reviews for incident response:
- Define review criteria - define review criteria, such as incident severity, impact on customers, and resolution time, to evaluate incident response processes effectively.
- Schedule regular review meetings - schedule regular review meetings, such as weekly, monthly, or quarterly, to review incident response processes and discuss areas for improvement.
- Analyze incident data - analyze incident data, such as incident reports, root cause analysis, and post-incident reviews, to identify trends, recurring issues, and areas for improvement.
- Gather feedback from stakeholders - gather feedback from stakeholders, such as customers, employees, and management, to understand their experiences and identify areas for improvement.
- Identify areas for improvement - based on the analysis of incident data and feedback from stakeholders, identify areas for improvement in incident response processes, such as communication, collaboration, and incident resolution.
- Implement improvements - implement improvements to incident response processes based on the identified areas for improvement, such as updating incident response procedures, providing additional training, and improving communication channels.
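The "analyze incident data" step can start very small. The following sketch assumes each closed incident records a severity, a root-cause label, and open and resolve timestamps; those field names are hypothetical, and real data would come from your tracking tool rather than hand-built objects.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass(frozen=True)
class ClosedIncident:
    severity: str
    root_cause: str
    opened_at: datetime
    resolved_at: datetime

    @property
    def hours_to_resolve(self) -> float:
        return (self.resolved_at - self.opened_at).total_seconds() / 3600


def mean_resolution_hours(incidents):
    """Average time to resolve, grouped by severity."""
    hours_by_severity = defaultdict(list)
    for incident in incidents:
        hours_by_severity[incident.severity].append(incident.hours_to_resolve)
    return {severity: mean(hours) for severity, hours in hours_by_severity.items()}


def recurring_root_causes(incidents, min_count=2):
    """Root causes that appear at least min_count times in the review window."""
    counts = Counter(incident.root_cause for incident in incidents)
    return {cause: n for cause, n in counts.items() if n >= min_count}
```

Two numbers like these, resolution time per severity and a short list of repeat offenders, are usually enough to give a review meeting something specific to discuss.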
PlayerZero eliminates the game of telephone.
By eliminating the need to write incident reports, PlayerZero cuts the time lost and communication problems that come from the game of telephone between the users who experience errors, the PMs & CS teams who manage them and the developers who debug them. No more translating what Support says happened; PlayerZero facilitates immediate communication of exactly what happened leading up to a user experiencing an issue. It gives product managers & developers better information about particular issues faster and allows them to more accurately understand where their priorities should be.
Everyone wants to do bigger and more exciting things than spend their time patching issues, so fixing issues should be as fast and easy as possible. PlayerZero allows you to automate the part of fixing bugs where the most time is lost – getting the bug a user experienced in front of a developer who can fix it – all in time for you to save the relationship and prevent the issue from infecting your other customers. PlayerZero allows you to spend less time fixing each bug and focus only on the incidents that actually affect your users. This means less time spent losing users and more time building the things that really move the needle. See for yourself how PlayerZero can save you time and improve your workflow.