How AIOps enhances operational efficiency

Digital data is everywhere, and its sheer volume and ambiguity often make it challenging for us humans to analyze. That’s why we use a special branch of AI called artificial intelligence for IT operations (AIOps) to reveal the deeper structure of copious data.

AIOps sits at the intersection of big data and machine learning to improve the efficiency of IT operations. It enables enterprises to consolidate information — such as logs, metrics, events, and alerts — from various sources and leverage big data and machine learning techniques to unlock deeper insights.

Gartner, the technological research and consulting firm that first introduced the term AIOps, describes it as follows:

… [combining] big data and machine learning to automate IT operations processes, including event correlation, anomaly detection and causality determination.

In this article, we’ll demonstrate the capabilities of AIOps for managing IT operations by walking you through a tutorial for using AIOps to analyze logs from a demo environment. We’ll also learn how AIOps can be applied to minimize the burden of sorting logs to find important data. But first, let’s explore the ins and outs of log analysis.

Before we get started, keep in mind that this tutorial uses Moogsoft — an AIOps platform — to analyze logs. You can create an account here to follow along.

What is log analysis?

In a digital environment, various servers and applications generate massive volumes of logs: files containing records of the different activities happening within the system. Log analysis is the task of interpreting these records to understand what is happening in the environment.

If we analyze the contents and structure of these logs, we’ll find very little uniformity. It’s hard to discern the structure of a log file even if it comes from a single source. Now, imagine if we try to discern critical events from logs that originate from a multitude of sources. It gets complex very quickly. That’s why we have AIOps to enable automation and enhance IT operations to produce better business outcomes.
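To make the structure problem concrete, here’s a minimal Python sketch (not part of the tutorial’s toolchain) that parses one line of the common Apache access-log format with a regular expression. It only works because we know this one format in advance; logs from other sources need different patterns, which is exactly why manual analysis breaks down at scale.

```python
import re

# A simplified pattern for Apache access logs: client IP, timestamp,
# request method and path, HTTP status, and response size.
APACHE_LOG = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

line = '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
record = APACHE_LOG.match(line).groupdict()
# record now holds structured fields, e.g. record["status"] is "200"
```

A single regex like this covers one well-known format; a real environment mixing application, database, and system logs would need a parser per source, which is the complexity AIOps platforms absorb for us.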

Using AIOps for analyzing logs

Humans are exceptional at picking out information from reasonably sized sources. But with the sheer volume of data in log files, relying only on human labor can result in poor visibility and degraded system performance.

AIOps can help manage these intricacies by performing log analysis. Intelligent automation can unlock insights from log data in real time. This way, businesses can resolve problems much faster by moving from a reactive approach to a proactive one.

Now that we have the basics down, let’s explore some of the ways AIOps can help with log analysis.

Performing noise reduction

In a digital environment, different sources run services that generate new events. An event is a data object that describes the occurrence of something that might be of interest, e.g., “a scheduler terminated 45 seconds ago.” AI can automatically detect these events. When an event of interest is detected, it is ingested into an AIOps platform. 

A busy server can flood our system with events, and DevOps engineers may find it difficult to troubleshoot errors when problems occur. One event can trigger more events, cascading duplicates through the system. This makes it nearly impossible to search through the logs for important information. AIOps isolates the important details and lets us focus on those.

AIOps can identify whether an event is a duplicate, i.e., whether an identical event previously occurred on the same node within a certain interval. If there is a matching event, the new event is marked as a duplicate. This reduces the number of events that we see and, with it, the noise. By automatically consolidating repeatedly occurring events, we reduce the operational noise coming from various servers.
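The deduplication idea can be sketched in a few lines of Python. The field names and the 60-second window below are illustrative assumptions, not Moogsoft’s actual implementation:

```python
from collections import defaultdict

DEDUP_WINDOW_SECONDS = 60  # assumed interval; real platforms make this configurable

def deduplicate(events):
    """Collapse repeats of the same (node, description) within the window into one alert."""
    last_seen = {}             # (node, description) -> timestamp of last occurrence
    counts = defaultdict(int)  # (node, description) -> number of merged events
    alerts = []
    for event in sorted(events, key=lambda e: e["timestamp"]):
        key = (event["node"], event["description"])
        previous = last_seen.get(key)
        last_seen[key] = event["timestamp"]
        if previous is not None and event["timestamp"] - previous <= DEDUP_WINDOW_SECONDS:
            counts[key] += 1            # duplicate: fold into the existing alert
        else:
            counts[key] = 1
            alerts.append(event)        # first occurrence becomes a new alert
    return alerts, dict(counts)

events = [
    {"node": "web-1", "description": "disk full", "timestamp": 0},
    {"node": "web-1", "description": "disk full", "timestamp": 10},
    {"node": "web-1", "description": "disk full", "timestamp": 20},
    {"node": "db-1", "description": "slow query", "timestamp": 15},
]
alerts, counts = deduplicate(events)
# Four raw events collapse into two alerts; three "disk full" events merge into one.
```

This is the essence of noise reduction: the operator sees two alerts with event counts, not four raw events.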

Performing event correlation

Correlation is the process of grouping alerts into incidents. Alerts occur when a unique event of operational interest is detected. When different alerts are similar to each other, an incident is created. The similarity is measured based on data fields that might be of interest. 

Correlation clusters alerts (by node, service, location, and so on) that all relate to the same underlying problem. It helps us make sense of events occurring in our system and monitor the entire infrastructure. By getting a clearer picture of the whole infrastructure, we can resolve high-impact problems faster. Data across the entire organization is aggregated in a single platform. This way, we can quickly notice important issues and take action, saving time and resources.

Consider the example of an application that operates globally. If a service at some source in the United States is affected, we want alerts to be correlated by location so that teams handling the U.S.-based service come together and investigate. While this happens, operations in other regions should execute smoothly. If AIOps can do this work, DevOps won’t need to search for the location of the affected services manually.
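Stripped to its core, location-based correlation is a grouping operation. Here is a minimal sketch (field names are illustrative assumptions, not Moogsoft’s schema):

```python
from collections import defaultdict

def correlate(alerts, field):
    """Group alerts into incidents by a shared field (e.g., location, node, or service)."""
    incidents = defaultdict(list)
    for alert in alerts:
        incidents[alert[field]].append(alert)
    return dict(incidents)

alerts = [
    {"id": 1, "location": "us-east", "service": "checkout"},
    {"id": 2, "location": "us-east", "service": "payments"},
    {"id": 3, "location": "eu-west", "service": "checkout"},
]
incidents = correlate(alerts, "location")
# The two U.S. alerts land in one incident for the U.S. team;
# the EU alert stays in a separate incident.
```

A real platform weighs several fields and similarity scores rather than one exact key, but the outcome is the same: related alerts arrive as a single incident routed to the right team.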

Detecting errors and anomalies

When data from the entire organization is federated into one environment — in this case, an AIOps platform — we can detect performance anomalies on different events and metrics. AIOps filters data, performs correlation, and triggers notifications when something goes wrong. 

Anomalous events are identified based on statistical calculations. In most cases, the performance of a system does not deviate widely under normal circumstances. However, when there is a deviation from the normal flow of events, AIOps can catch those irregularities. 

For instance, a server might show sudden spikes or drops in CPU utilization. This could indicate problems with the underlying operating system (OS). If mission-critical applications run on that server, the consequences could be catastrophic.

Anomalous events can also be detected based on fixed thresholds. This is especially useful when we have an idea of the normal ranges when systems work optimally. We can define a fixed upper and lower bound and let AIOps monitor the behavior of the system. For instance, if we want to monitor the memory usage on a server, we can define a fixed upper bound of 90% and a lower bound of 10%. When the memory usage hits either of these thresholds, AIOps can signal a possible error — e.g., “The server might run out of memory” or “The server is not utilizing the physical memory.”
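The fixed-threshold check described above amounts to a simple comparison. This sketch uses the 90%/10% memory bounds from the example; the function name and messages are illustrative:

```python
UPPER, LOWER = 90.0, 10.0  # fixed bounds from the example above (% memory usage)

def check_memory(usage_percent):
    """Return a warning message if usage breaches a fixed bound, else None."""
    if usage_percent >= UPPER:
        return "The server might run out of memory"
    if usage_percent <= LOWER:
        return "The server is not utilizing the physical memory"
    return None  # within normal range, no alert

check_memory(95.0)  # breaches the upper bound
check_memory(50.0)  # normal, no alert
```

Fixed thresholds are easy to reason about but brittle; the dynamic thresholding discussed later adapts the bounds to the data instead.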

How to leverage AIOps to tame logs

Now that we know how AIOps can manage an organization’s IT operations, let’s see it in action. To get started, we’ll need to create an account on an AIOps platform; we’ll follow this tutorial using Moogsoft.

We’ll also need to set up a dummy environment that continuously generates streams of logs. These logs will be ingested into Moogsoft using a collector. 

Setting up our work environment

We can generate synthetic logs using any tool. For our purposes, we’ll use this repository to generate a large volume of fake Apache logs. It’s best to run the script with Docker so that all of its dependencies are installed and managed automatically. In case you’re unfamiliar, Docker is software for building, running, and managing containers on servers. If you don’t have Docker installed on your machine, follow this tutorial to set it up.

Once you have Docker set up, clone the above repository, open your terminal, and run this command to build an image of the script. (Make sure you have Docker Desktop running.)

docker build -t apache-fake-log-gen .

Generating synthetic logs

Once the image is built, run the following command to install all dependencies and get your server running: 

docker run -d -p 8888:80 apache-fake-log-gen -n 0 -s 1

The -n 0 flag tells the script to generate logs indefinitely. This lets you continuously ingest streams of logs into Moogsoft.

Finally, the -s 1 flag tells the script to generate one log line per second.
You can check whether your container is running by clicking the Containers tab inside Docker Desktop and looking at the status of your container.

container status check

To see the synthetic logs generated by the script, click on the name of your container.

view synthetic logs

You’ll see Apache logs being generated continuously every second.

Connecting your local server with Moogsoft

Once we have our environment set up and generating logs, we can bring this data into Moogsoft. The easiest way is to install a collector that gathers metrics data. Once you’re logged in, you can install a collector on any platform; the installer you choose depends on the OS.

I’m working on a macOS machine, so I’ve selected that option. You should proceed by selecting the OS for your device.
You’ll also need an API key to connect your server with Moogsoft. To obtain it, open the Ingestion window and click the Collectors tab.

obtain API key

Use the following command to open a terminal inside your container:

docker exec -ti <YOUR_CONTAINER_NAME> /bin/bash  

Copy and paste the command with your own API key inside the Docker container:

export API_KEY='YOUR_API_KEY';
export CONTROLLER='https://api.moogsoft.ai';
bash -c "$(curl ${CONTROLLER}/v2/collector-installer/script\?platform=LINUX -kLH apikey:${API_KEY})"

Notice here that Moogsoft automatically configures the command for a Linux OS. This is because we’re using a Docker container running a Linux OS under the hood.

docker container running Linux OS

Once the collector is installed, you’ll see a message in your terminal:

[ Running collector... ]

Go back into your Moogsoft account and click on the Monitor tab. Here, you’ll see two options: Alerts and Incidents. Click on the Alerts tab.

alerts tab

Performing noise reduction

When raw events are ingested from the server into Moogsoft, it deduplicates new events that match previously ingested ones. In this case, our events are network logs. Inside the Alerts tab, you’ll notice that similar events are aggregated to isolate the essential details from the logs.

performing noise reduction

The Event Count column shows the number of events deduplicated into one alert. The Severity column indicates how urgently the alert requires corrective action. You can click on any alert to see which events have been deduplicated, as well as the source, time, and description of the events.

Expanded alert details

When you expand an alert, you instantly get a view of the source and severity of the logs. You can see that two events have been merged to create a single alert. Some logs describe events of no operational interest; these simply add noise to the workflow. Manually searching for information and extracting important data in such a complex environment would take a lot of time. AIOps automatically filters out the noise and lets the useful information come through as alerts.

Performing event correlation

The events have been deduplicated into alerts to remove noise from the data. This gives us a clearer picture of what’s going on in the application. But it doesn’t stop here. These alerts are further processed to let us focus on what actually matters.

The alerts are clustered into incidents. Moogsoft performs correlation based on the relatedness of each alert. This relatedness could be determined on the basis of a node, service, location, or other related fields in the data.
To see which alerts are clustered into incidents, click on the Monitor tab and then click Incidents.

monitor incidents

Here, we see that 20 alerts are correlated and clustered into a new incident. We can click on the incident to see which alerts are correlated and on what basis. Moogsoft automatically performs correlation with no configuration required. However, we can also create our own rules to define a custom correlation definition. In this case, Moogsoft performed a correlation based on a similar source.

We can see from the description that a source has affected the network. Since we’re dealing with dummy logs from a single source, it’s easy to pinpoint where the problem originated. 

However, when multiple sources generate alerts, the environment gets complex. Determining the correlation between data fields in such an environment requires a lot of manual work. If AIOps does this work, DevOps engineers won’t need to manually rule out the potential causes of a system failure.

Scanning for errors and anomalies

Once our data is ingested into Moogsoft, it starts to detect anomalies; no additional configuration is required. 

During the initial stage of data ingestion, Moogsoft does not assign a severity score since it needs some time to learn the data patterns. Once it learns from the data, it starts to assign a severity label to classify events, alerts, and incidents as anomalous. We can also tweak the configuration and override the default detection model.

scan for errors

To see the severity status, click on the Monitor tab and go to Alerts. You’ll find a column labeled Severity where one of the following six values will be assigned to an alert: unknown, clear, warning, minor, major, and critical. 

The six severity levels are described as follows:

  1. Unknown — The alerts in the incident have an unknown severity. 
  2. Clear — An event was reported but has since been cleared, either manually or automatically.
  3. Warning — Some events have been detected with the potential to affect the services.
  4. Minor — A fault has been detected but is not affecting the services. An action may be required to prevent the fault from becoming a serious issue.
  5. Major — A fault has been detected that is affecting services. Corrective action is required.
  6. Critical — A serious fault is affecting the services, and immediate corrective action is required.

These values are assigned using a dynamic thresholding mechanism. The algorithm performs statistical analysis of the data and automatically determines high and low thresholds. It adjusts to the changing data distribution rather than following a fixed pattern. We don’t need to read the values manually; an anomaly is recognized whenever the data goes out of bounds.
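One common way to implement dynamic thresholds is to keep a sliding window of recent samples and flag values that fall outside the window’s mean plus or minus a few standard deviations. The sketch below illustrates that idea; the window size, multiplier, and class name are assumptions for illustration, not Moogsoft’s actual algorithm:

```python
import statistics
from collections import deque

class DynamicThreshold:
    """Flag values outside mean +/- k*stdev of a sliding window of recent samples."""

    def __init__(self, window=20, k=3.0):
        self.samples = deque(maxlen=window)
        self.k = k

    def is_anomalous(self, value):
        if len(self.samples) >= 5:  # need a few samples before judging
            mean = statistics.mean(self.samples)
            stdev = statistics.pstdev(self.samples)
            anomalous = abs(value - mean) > self.k * max(stdev, 1e-9)
        else:
            anomalous = False       # still learning the baseline
        self.samples.append(value)  # thresholds adapt as the distribution shifts
        return anomalous

detector = DynamicThreshold()
normal = [detector.is_anomalous(v) for v in [50, 51, 49, 50, 52, 50, 51]]
spike = detector.is_anomalous(500)  # far outside the learned band
```

Because the window slides forward, a gradual shift in the data moves the thresholds with it, while a sudden spike is still caught, matching the adaptive behavior described above.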

visual view of events and alerts

We can also find a visual view of the events, alerts, and incidents over time by navigating to the Metrics tab. Here, we’ll find red and orange dots that indicate critical and major incidents, respectively.

Notice that these incidents reside on the peaks of the graph. This suggests that the metrics exceed the upper and lower thresholds that an AI model had learned automatically. The data follows a certain distribution, which can change over time. The anomaly detection algorithm adapts to changing data distribution so that it does not flag normal events as anomalous. 

The future of IT operations

AIOps is the future of IT operations. It enables developers, system engineers, security analysts, and DevOps engineers to instantly see everything in the system, know what’s wrong, and take proactive measures. Organizations use AIOps for IT operations, monitoring, and automation. 

Logs are an essential part of any system. They are a source of information for monitoring events in a digital environment. But the enormous volume of logs makes it difficult for engineers to analyze and extract critical insights. 

AIOps delivers intelligent IT operations and realizes the promise of automated log analysis. It analyzes the deep structure of logs, pulls information out of copious data, reduces noise, performs correlations, and detects anomalies. By leveraging AIOps to tame logs, organizations can take instant action on critical incidents in the environment, resulting in a more secure and performant network.

To learn more about the technology and tools developers can use to improve efficiency and productivity, check out our free guide: Unblocking Workflows: The Guide to Developer Productivity in 2022.

This blog post was created as part of the Mattermost Community Writing Program and is published under the CC BY-NC-SA 4.0 license. To learn more about the Mattermost Community Writing Program, check this out.

Najia Gul is a writer and a software engineer. She loves telling stories that are backed by data. Over the past years, she has helped tech companies across the globe to craft and create content on AI and machine learning in their distinctive brand voice.