Maintaining Critical IT NMS Infrastructure Tools 

Jul 28, 2020 | ZAG Standards

The importance of maintaining critical IT infrastructure tools like Network Management Systems (NMS) cannot be overstated. These tools allow you to monitor equipment, applications, and service utilization, and to observe peak usage trends. The collected data supports budget justifications for acquiring new equipment, upgrading existing equipment, or adding services that keep customers productive. Simply put, maintaining NMS tools helps you maintain your business IT infrastructure.

Today’s IT environments are a blend of onsite systems, cloud-based services, and applications that customers depend upon to conduct their business. These systems need to maintain high availability and performance for the business to remain functional.

An NMS is a combined set of monitoring and alerting applications residing on one or more servers. Typically an NMS is used to manage systems and services daily. Customers rely on tools offered by the NMS to isolate problems, generate tickets, and escalate problems within their systems to an appropriate resource for further action.

Monitoring Settings

As with any application or product, it is imperative to maintain a support contract that allows the customer to reach out to the vendor for technical assistance, bug fixes, and upgrades. This assistance allows the customer to take advantage of new features that make managing their infrastructure easier and more efficient, thus making them more productive and profitable.

Out of the box, most NMS applications can produce basic output such as an equipment inventory built from discovery scans, and alerts based on device or service classification. Custom SLA labels should be created and assigned to these monitored items, but the actual alerting actions must be defined in the alert scripts themselves and based upon the customer’s needs. Therefore, an understanding of the customer’s business, key stakeholders, and communication requirements is needed to make the NMS a truly useful tool.

Notification Settings

The IT environment is under constant change. As a result, NMS alerting must be configured to notify key contacts by email, text message, and pager, based on how critical the outage is and at configured, time-dependent escalation levels.
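Time-dependent escalation of this kind can be sketched as a small lookup table. The example below is only an illustration in Python, not the configuration format of any particular NMS: the severity names, delays, channels, and recipient labels are all hypothetical placeholders that would come from the customer's own escalation procedures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes the outage has gone unacknowledged
    channel: str        # "email", "sms", or "pager"
    recipient: str      # hypothetical contact label, not a real address


# Hypothetical policy: every severity, delay, and recipient is a placeholder.
ESCALATION_POLICY = {
    "critical": [
        EscalationStep(0, "pager", "on-call-operations"),
        EscalationStep(15, "sms", "operations-lead"),
        EscalationStep(30, "email", "engineering-manager"),
    ],
    "minor": [
        EscalationStep(0, "email", "operations-queue"),
        EscalationStep(60, "sms", "on-call-operations"),
    ],
}


def due_notifications(severity, minutes_unacknowledged):
    """Return every escalation step that should have fired by now."""
    steps = ESCALATION_POLICY.get(severity, [])
    return [s for s in steps if s.after_minutes <= minutes_unacknowledged]


# Example: a critical outage left unacknowledged for 20 minutes.
for step in due_notifications("critical", 20):
    print(step.channel, "->", step.recipient)
```

Keeping the policy as data rather than logic means that a change in contacts or delays only touches the table, which suits an environment that is under constant change.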

Allowed Downtime based on SLA Level (24/7 Continuous Uptime Requirement)

SLA Downtime Allowed   Three 9s (99.9%)   Four 9s (99.99%)   Five 9s (99.999%)
Daily                  1m 26s             8s                 0s
Weekly                 10m 4s             1m 0s              6s
Monthly                43m 49s            4m 22s             26s
Yearly                 8h 45m 56s         52m 35s            5m 15s
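The figures in the table follow from simple arithmetic: multiply the length of the period by the fraction of time the SLA permits the service to be unavailable. The Python sketch below shows the calculation. It assumes a mean year of 365.2425 days, a month of one twelfth of that, and truncation to whole seconds. None of these is stated in the article, but together they match the values shown in the table. The function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: mean Gregorian year of 365.2425 days; a month is 1/12 of a year.
PERIOD_DAYS = {
    "Daily": 1,
    "Weekly": 7,
    "Monthly": 365.2425 / 12,
    "Yearly": 365.2425,
}


def allowed_downtime_seconds(availability_percent, period):
    """Return the downtime budget for one period, truncated to whole seconds."""
    unavailable_fraction = (100 - availability_percent) / 100
    return int(PERIOD_DAYS[period] * SECONDS_PER_DAY * unavailable_fraction)


def format_duration(total_seconds):
    """Format whole seconds in the table's style, e.g. '8h 45m 56s'."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    parts = []
    if hours:
        parts.append(f"{hours}h")
    if hours or minutes:
        parts.append(f"{minutes}m")
    parts.append(f"{seconds}s")
    return " ".join(parts)


# Example: yearly downtime budget at three nines.
print(format_duration(allowed_downtime_seconds(99.9, "Yearly")))  # 8h 45m 56s
```

Because the results are truncated, the daily budget at five nines appears as 0s even though it is really a little under one second.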

In most IT organizations, the use and maintenance of an NMS application reside with the Operations group, while an Engineering group typically uses different tools to manage and configure the network devices.

It is critical that both groups are familiar with the purpose and usage of these tools. When confronted with an outage, that shared understanding avoids confusion over whether a network device or service is up, down, or suffering degraded performance. Configuring notification settings for both groups helps reduce confusion and false alerts.

Equipment Maintenance

Network equipment suppliers and NMS vendors work to stay abreast of industry changes and customer demands. Although network equipment may be stable and operating correctly today, over time it will become unsupported by the vendor and unmanageable by the NMS. Therefore, it is important to review equipment performance regularly; outdated equipment can cause network performance and support issues.

In addition to normal NMS equipment maintenance, other factors are needed to maintain the tool: staff training, environment knowledge, monitored device inventory, site contacts, and escalation procedures are all vital to ensure that the customer maintains a high level of productivity. Combined with the NMS equipment, these factors give a clear picture of a customer’s infrastructure and the impact it has on their business. Putting in the work now to maintain NMS systems and equipment will help speed up disaster recovery and keep a customer’s business alerted to any major issues with their IT infrastructure, so they can stay operational.
