Like everyone, I’ve spent the past few weeks in despair and disbelief over world events. While newscasters continue to question the failure of intelligence technology in the Middle East, it’s my job to question technology every day.
While previously unimaginable advances in AI have led to a new technology world order thanks to ChatGPT and its counterparts, the technology industry is growing more focused on the future of technology and the new innovations it will bring, as it should be. However, I would argue that the industry should be equally focused on something that has the potential to upend its systems entirely, making the use of AI, or any other technology, moot.
As technology teams churn out more code, processes, and platforms at alarmingly fast rates to keep up with the pace of technology, software gets placed into production without the proper safeguards, protection, and QA in the name of more, faster, better.
Production teams are responsible for keeping their systems up and running in an environment where data is growing at an alarming 23% compound annual growth rate (CAGR). The challenge is that as data grows and systems get bigger and more complex, with more features added every year, that rapid growth creates more issues, not the least of which is how to store and analyze all this data.
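To put that growth rate in perspective, here is a minimal sketch of the compound-growth arithmetic. The function names and the 100 TB starting volume are illustrative assumptions, not figures from any particular system.

```python
import math


def projected_volume(current: float, annual_rate: float, years: float) -> float:
    """Project a data volume forward under compound annual growth."""
    return current * (1 + annual_rate) ** years


def doubling_time_years(annual_rate: float) -> float:
    """Years needed for a quantity to double at a fixed compound annual rate."""
    return math.log(2) / math.log(1 + annual_rate)


if __name__ == "__main__":
    rate = 0.23        # the 23% CAGR cited above
    start_tb = 100.0   # hypothetical starting volume in terabytes
    print(f"Doubling time: {doubling_time_years(rate):.2f} years")
    print(f"Volume after 5 years: {projected_volume(start_tb, rate, 5):.1f} TB")
```

At 23% a year, data roughly doubles every three and a half years, so a team that is barely keeping up today is supporting about twice the volume well before most multi-year roadmaps finish.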
When those issues happen, support people have more to parse through to identify the root causes. But more concerning than this is that they simply are not prepared or trained for disasters. Why? Because they’re not trained as troubleshooters. The real way to control disasters is to prevent them from occurring in the first place.
A lesson from firefighting
Before my career in technology, I was a volunteer firefighter for my local community and it was both fun and rewarding. Each Sunday we met as a team to test all the equipment to make sure it worked. Every Tuesday we did training on a different aspect of firefighting. We picked one thing to train on so that when there was a fire, we were comfortable with each tool and how to use it. In enterprises, the people who end up “putting out the fires” are mostly focused on managing processes, releasing code, patching, and maintaining. They’re not trained to diagnose complex issues, i.e., fight the technology fires.
We need to set our technology teams up better to deal with the potential pitfalls of all these advances in technology, some of which we probably can’t even fathom at this relatively early juncture in generative AI’s development. Can you honestly look at your organizational structure and say it’s equipped to support the future as you continue to maintain your legacy data? Are your teams trained to be able to support your future needs, much less properly maintain your existing system?
Owning up to mistakes
Suppose we did establish adequate training measures for future catastrophes. Mistakes would still happen. We have to be able to own up to our mistakes and eliminate blame. We must learn from each event, train for issues and disasters, and prioritize the elimination of tech debt over new features. Because if we’re always adding on to the house, but we never fix the foundation, well…you can see where I’m going with this.
We need to focus on maintaining the big complex systems as they grow larger. But first, now more than ever, we need to ensure the foundation is intact to withstand all the growth in the best case and a catastrophic event in the worst case. Even the best systems in the world that have “five nines” (meaning 99.999% uptime, or about 5.26 minutes of downtime in a year) identify issues in non-production. Even with the best-trained team in the world supporting production, there are still issues that need to be solved. It can’t be stressed enough that, with the pace of innovation only increasing, if we do not become better at solving issues, we will only create more issues.
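The downtime budget behind an availability target is simple arithmetic. The sketch below assumes a 365-day year; the function name is illustrative. Under that assumption, 99.999% uptime works out to roughly 5.26 minutes of allowed downtime per year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Allowed downtime per year, in minutes, for a given availability target."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(target)
        print(f"{target}% uptime allows {minutes:.2f} minutes of downtime per year")
```

Each additional nine cuts the budget by a factor of ten, which is why a team that takes an hour to diagnose a single incident cannot meet a five-nines target.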
Using AI and ML to solve and support complex system problems
AI and ML can play a significant role in identifying and preventing future system issues. Incorporating AI and ML into the training regime can help teams understand the technology’s potential and develop the necessary skills to harness it effectively. Regular workshops and hands-on sessions can offer practical insights into dealing with real-world AI and ML scenarios in these three main areas:
- Predictive maintenance analyzes historical data and identifies patterns that signify a potential problem. By spotting these issues early, teams can address them proactively, preventing system damage and downtime.
- Anomaly detection involves ML algorithms combing through data to spot irregularities that deviate from normal system behavior. These could be signs of a cyberattack, a software bug, or a hardware failure. Once these anomalies are detected, alerts can be generated to allow for immediate action.
- Automated remediation uses ML to identify problems but then leverages AI to identify actions to fix the issues. This could include rerouting network traffic in response to a detected intrusion or restarting a failed server automatically. By leveraging AI in this way, companies can minimize system damage and maintain operational efficiency.
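As a rough illustration of the last two areas, the sketch below flags anomalies with a rolling z-score and maps each kind of anomaly to a remediation action. This is a plain statistical stand-in for a trained ML model, and the latency values, anomaly kinds, and action names are all hypothetical.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indices of values that deviate from the preceding window's mean
    by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        # Floor the spread so a perfectly flat baseline does not divide by zero.
        sigma = max(stdev(baseline), 1e-9)
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical mapping from a classified anomaly to an automated response.
REMEDIATIONS = {
    "intrusion": "reroute_traffic",
    "server_down": "restart_server",
}


def plan_remediation(kind):
    """Pick an automated action, falling back to a human when the kind is unknown."""
    return REMEDIATIONS.get(kind, "page_on_call")


if __name__ == "__main__":
    # Hypothetical response times in milliseconds, with one obvious spike.
    latencies_ms = [100, 102, 98, 101, 99, 100, 103, 97, 250, 101]
    print("Anomalous sample indices:", find_anomalies(latencies_ms))
    print("Action for server_down:", plan_remediation("server_down"))
    print("Action for unknown kind:", plan_remediation("disk_full"))
```

The fallback to a human responder is deliberate: automation handles the cases the team has already rehearsed, and anything unrecognized still lands with a trained troubleshooter.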
Preparing teams for troubleshooting tomorrow’s technology environments
Leaders have a responsibility to educate and train their teams before technology innovation gets the best of us. The strategies below should be status quo across all companies because all companies are technology companies. They should not be considered “best-in-class” practices; we need to shift our priorities to make some or all of them the norm for all.
- Introduce regular knowledge-sharing sessions and workshops focused on problem-solving techniques and strategies. These sessions can facilitate open communication and foster a collaborative environment, encouraging team members to share their experiences and learn from each other.
- Use simulated training environments to give teams a practical, hands-on experience of troubleshooting real-world scenarios. This will help them gain confidence, improve their speed at problem-solving, and ensure they are well-prepared for actual situations.
- Recognize and reward proactive behavior to motivate team members. When employees see that their efforts to prevent problems are valued, they are more likely to take the initiative in the future.
- Encourage continuous learning and development through online courses and certifications. This will not only keep the team members updated with the latest technologies and troubleshooting methods but also enhance their skills and competency.
- Consider the use of Agile methodology to help quickly identify and address issues and promote frequent testing and continuous improvement. Agile emphasizes regular feedback and iterative development, which can significantly aid in proactive problem-solving and efficient troubleshooting.
- Cultivate a mindset of swift action and resolution among teams, one that emphasizes the importance of rapidly responding to system alerts and anomalies, thus minimizing the potential for system damage and downtime.
AI is here to stay. To paraphrase a quote from Tristan Harris, executive director and co-founder of the Center for Humane Technology, “The cat (AI) may be out of the bag but the lion, tiger, etc. is still ahead of us.” The velocity of technology is increasing at a rate never seen before. Our investment in troubleshooting and training cannot remain stagnant while it does.