The Department of Veterans Affairs is leveraging data and AIOps to better respond to and address unanticipated abnormalities in the agency’s IT network.
“The Operations Triage Group ... [was] challenged with how we could better coordinate,” Jay Paluch, the group's director, said during a Business of VA interview Friday. “We had a couple [system problems] where they were large reaching across our enterprise, and it affected many systems. ... How might we better be able to not only recognize those, but alert ourselves that there's a problem, and start putting tools in place that let us know when there was a system problem and what it might be? All of those together then allows us to be able to find problems quicker or find them before they start to impact our users.”
By coordinating triage this way and applying lessons learned from previous system problems to head off future ones, VA has been able to detect abnormalities faster.
VA’s site reliability engineers, or SREs, make up the bulk of the agency’s triage group. They help VA understand where it can improve telemetry and coordinate with VA’s application performance monitoring (APM). The group also has an analytics team composed of data analysts. This team is helping VA use incident data to better predict IT system problems and avoid them.
“[Our] analytics team pores through the data we collect from every single one of the incidents that we have, over the course of years, and starts to use that information to help us predict better, as well as now starting to pull in a lot of that performance data so we can better understand what normal behavior is. ... It helps us start to pinpoint when we see abnormal behavior,” Paluch said.
The Operations Triage Group is also using synthetic monitoring to better track system status and detect abnormalities. Paluch noted that synthetic monitoring lets VA create synthetic transactions that act as an end user would and measure every single step along the way.
“Now, I have an opportunity to see how well that [system] performs — say I've run that transaction every 15 minutes — now start to get a baseline for what does normal look like every single day and even every single night. I don't need to have someone there to run it. It runs automatically every 15 minutes,” Paluch said.
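As a rough illustration of the idea Paluch describes, the sketch below runs a scripted transaction step by step and times each step. It is not VA's tooling. The function `run_synthetic_transaction`, the three placeholder steps and the `INTERVAL_SECONDS` constant are all hypothetical names chosen for this example, and the steps simply sleep instead of calling a real system.

```python
import time

# Assumption: the 15-minute cadence from the quote, to be enforced by an
# external scheduler (for example cron); this sketch runs the check once.
INTERVAL_SECONDS = 15 * 60


def run_synthetic_transaction(steps, clock=time.perf_counter):
    """Run each (name, action) step in order and time every step."""
    results = []
    for name, action in steps:
        start = clock()
        try:
            action()
            ok = True
        except Exception:
            ok = False
        results.append({"step": name, "ok": ok, "seconds": clock() - start})
        if not ok:
            break  # later steps depend on earlier ones, so stop here
    return results


# Hypothetical placeholder steps; a real check would drive the actual system.
def open_login_page():
    time.sleep(0.01)


def sign_in():
    time.sleep(0.01)


def load_dashboard():
    time.sleep(0.01)


STEPS = [
    ("open_login_page", open_login_page),
    ("sign_in", sign_in),
    ("load_dashboard", load_dashboard),
]

if __name__ == "__main__":
    for row in run_synthetic_transaction(STEPS):
        print(row)
```

Storing each run's per-step timings is what makes a baseline possible: after enough runs, "normal" for a given step at a given time of day is simply what the stored history shows.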
Paluch noted that 20,000 minutes of productivity are lost for every minute of downtime on a single system. These enhanced monitoring techniques are supporting and improving many of VA’s programs and services. SREs work closely with system owners to build telemetry from performance logs and then determine where abnormalities exist.
“Our SREs have been involved in many of VA's most critical systems. That would include those supporting veteran benefits, emergency room management, user authentication, to name a few. In many cases we transcend the [Office of Information and Technology] organization to support where we can as best we can,” Paluch told GovCIO Media & Research.
One of the systems this group has supported is VA’s telehealth service. Before the pandemic, VA received 1,400 calls per day. Now, VA sees more than 30,000 calls per day. The Operations Triage Group partnered with the system owner to improve call quality and better support the influx of calls.
“Other cases ... we came in and helped them quickly figure out as a triage process, what the problem was, what they needed to do to get their system back up and running to the performance that it needed to be. Once that was done, then it was the steps ... to improve the layers of technology that you have and modernize some of those pieces to be able to improve it. We were able to do that and help them very successfully,” Paluch said.
VA plans to leverage AIOps to improve and automate monitoring. By developing detailed triage processes, VA can document the steps it took to resolve abnormalities, then automate them. AIOps gives VA the ability to see the “events that happen, how we know we can fix them and where we're going ahead and we're going forward and fixing them,” Paluch said.
“In the next five years, I would love to see the VA in a situation where the majority of the incidents that we have are actually resolved before they become incidents,” Paluch said. “One of the big ways that we do that is recognizing them through the data that we collect and using the systems that we have to recognize those abnormal behaviors.”
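One simple way to recognize abnormal behavior from collected data is to compare new measurements against a baseline of past ones. The sketch below is an assumption about how that could look, not a description of VA's analytics. The function `find_abnormal`, the three-standard-deviation threshold and the sample response times are all invented for illustration.

```python
from statistics import mean, stdev


def find_abnormal(baseline, recent, threshold=3.0):
    """Return (index, value) pairs from `recent` that sit more than
    `threshold` sample standard deviations from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any different value is abnormal.
        return [(i, v) for i, v in enumerate(recent) if v != mu]
    return [(i, v) for i, v in enumerate(recent)
            if abs(v - mu) / sigma > threshold]


# Invented response times in milliseconds, for illustration only.
baseline_ms = [210, 198, 205, 220, 201, 215, 208, 199]
recent_ms = [207, 212, 480, 203]

print(find_abnormal(baseline_ms, recent_ms))  # → [(2, 480)]
```

A fixed threshold like this is only a starting point. A baseline that varies by time of day, as the day and night comparison in the interview suggests, would need separate statistics for each period.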