Requirements for a Monitoring System

Alerts

Send (Email/SMS/etc)
Acknowledge (display who is working on the issue)
Delay
Send to certain groups/individuals
Escalation Path
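
As a rough illustration of how acknowledgement, delay and an escalation path could fit together, the sketch below decides which groups should have been notified for an unacknowledged alert. The `Alert` class, the `ESCALATION_PATH` table and its delays and group names are all hypothetical; a real system would make them configurable.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Alert:
    raised_at: float                       # seconds since epoch when the alert fired
    acknowledged_by: Optional[str] = None  # who is working on the issue, if anyone


# Hypothetical escalation path: (minutes unacknowledged, group to notify).
ESCALATION_PATH = [(0, "Help Desk"), (15, "IT team"), (60, "IT manager")]


def groups_to_notify(alert: Alert, now: float) -> List[str]:
    """Return every group that should have been notified by `now`."""
    if alert.acknowledged_by is not None:
        return []  # an acknowledged alert stops escalating
    minutes = (now - alert.raised_at) / 60
    return [group for delay, group in ESCALATION_PATH if minutes >= delay]
```

With these example values, an alert left unacknowledged for 20 minutes would have reached the Help Desk and the IT team, but not yet the IT manager.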

Ability to set severity levels for each service test (e.g. disk on a production server vs disk on a development server)
- Different actions for different levels, e.g.
  - Level 1 (disk 95% full) alert Help Desk
  - Level 2 (disk 98% full) alert IT team
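
The two levels above could be expressed as a small lookup table. This is only a sketch: the `DISK_LEVELS` name and the `classify` function are made up for illustration, and the thresholds are the example figures from the list.

```python
from typing import List, Optional, Tuple

# Hypothetical per-test levels: (minimum percent full, level, group to alert),
# ordered from most severe to least severe.
DISK_LEVELS = [(98.0, 2, "IT team"), (95.0, 1, "Help Desk")]


def classify(percent_full: float,
             levels: List[Tuple[float, int, str]] = DISK_LEVELS) -> Optional[Tuple[int, str]]:
    """Return (level, group) for the most severe level reached, or None if healthy."""
    for threshold, level, group in levels:
        if percent_full >= threshold:
            return level, group
    return None
```

A production server and a development server could then simply be given different tables for the same disk test.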

Display

Include or integrate with a real-time display system (with colours: Red, Yellow, Green, Purple, White and Blue)
- Red:
- Yellow:
- Green:
- White:
- Purple:

Display a time of last check

Show a high-level "summary" of status, e.g. group Unix boxes together and show if any have issues

Ability to customise the display, e.g. a summary page for the IT helpdesk, a Unix page for Unix admins, a Network page for the Networking Team.

Ability to restrict access to the monitoring system (we do not want the general community to see everything monitored)

Ability to search for a host

Monitor

Microsoft Windows: Windows NT, Windows XP, Windows Vista.
- Be able to process Windows event logs and performance monitoring
UNIX: Solaris, AIX, HP-UX, IRIX, Linux, MacOS X, Tru64.
Services (DNS/FTP/SMTP/LDAP/etc)
Applications (Outlook, Calendar, Exchange, Certificate Services, Apache, Tomcat, etc)
- HTTP Application Monitoring
  - Expected Content returned
  - Acceptable response time (10 seconds to load a web page is not okay)
- Simulate a Windows client application, e.g. click on an icon to launch Word, enter some text, save the document to a drive, close Word, and ensure the whole process worked.
Service level testing
- e.g. a web application requires a web server, DNS, LDAP, etc. If the DNS server fails, then so will the web application.
Allow for cluster testing (e.g. 1 web server out of a cluster of 5 fails, notify about the web server outage, but not the web service outage)
Network File shares
SAN Monitoring
Citrix Servers and Services
Printers
- Printer errors e.g. low toner
- Print Queues
SNMP Devices
Hardware (e.g. Dell DRAC, Sun Solaris), both via hardware card and OS software.
UPS
Other environmental inputs (temperature, humidity, etc)
Nightly backup
- Warn if backups take longer than expected
- Alarm if some backups fail
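
The cluster-testing item in the list above separates a node outage from a service outage. One way to sketch that distinction, assuming the monitoring system already knows the up/down state of each node (the function name and the state labels are hypothetical):

```python
from typing import Dict, List, Tuple


def cluster_status(node_up: Dict[str, bool]) -> Tuple[str, List[str]]:
    """Classify a cluster from per-node health.

    Returns (service_state, failed_nodes) where service_state is
    "ok", "degraded" (some nodes down, service still up) or "down".
    """
    failed = sorted(name for name, up in node_up.items() if not up)
    if not failed:
        return "ok", failed
    if len(failed) < len(node_up):
        return "degraded", failed  # notify about the nodes, not the service
    return "down", failed          # every node failed: the service itself is out
```

With one web server out of five failing, this reports the service as degraded and names the failed node, so the alert can describe a web server outage rather than a web service outage.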

Networking

Provide integration with CiscoWorks, or have similar functionality
WAN links, LAN links, VLANs, etc
- Verify link is up
- Verify Bandwidth is not saturated
Cisco/Networking hardware
- CPU load
- Environmental e.g. Power supplies, temperature alarms, etc
Ability to interact with probes (break down traffic to type and size)
Capture and track changes to hardware configurations
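
Checking that bandwidth is not saturated usually comes down to comparing two readings of an interface octet counter. The sketch below assumes the counters were collected some other way (for instance over SNMP) and that at most one counter wrap occurred between samples; the function and parameter names are illustrative only.

```python
def link_utilisation(octets_start: int, octets_end: int, seconds: float,
                     link_bps: float, counter_bits: int = 64) -> float:
    """Percentage of link capacity used between two octet-counter samples."""
    if seconds <= 0 or link_bps <= 0:
        raise ValueError("seconds and link_bps must be positive")
    # Counters wrap at 2**counter_bits; the modulo copes with a single wrap.
    delta_octets = (octets_end - octets_start) % (2 ** counter_bits)
    bits_per_second = delta_octets * 8 / seconds
    return bits_per_second / link_bps * 100
```

A test could then warn when the result stays above a chosen percentage for several consecutive polls.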

OS Monitoring

Disk
Memory
Processes
Response time
CPU Load
Hardware failures
OS Alerts (system event logs and syslog)

Database monitoring

Oracle
MySQL
MSSQL
Ingres

File Monitoring

File growth, whether a file exists, etc.
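
A minimal sketch of such a file test, using only the Python standard library. The `check_file` name and the `max_bytes` limit are assumptions made for the example.

```python
import os
from typing import List, Optional


def check_file(path: str, max_bytes: Optional[int] = None) -> List[str]:
    """Return a list of problems found for `path` (an empty list means healthy)."""
    if not os.path.exists(path):
        return [f"{path} does not exist"]
    problems = []
    size = os.path.getsize(path)
    if max_bytes is not None and size > max_bytes:
        problems.append(f"{path} is {size} bytes, limit is {max_bytes}")
    return problems
```

Tracking growth over time would additionally need the size from the previous poll to be stored somewhere.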

Customise

Easy to extend/Customise your own tests (API to integrate with)

Trending

Alert on trends, e.g. 10% growth over 1 month might be OK, but 10% growth over 2 hours is not.
Provide trending for network bandwidth usage or any data collected
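
The growth example above is really a limit on the rate of change rather than on the value itself. A minimal sketch, assuming two samples of the same metric and a hypothetical limit expressed in percent per hour:

```python
def growth_rate_exceeded(old_value: float, new_value: float,
                         elapsed_hours: float, max_percent_per_hour: float) -> bool:
    """True if the value grew faster than the allowed percentage per hour."""
    if old_value <= 0 or elapsed_hours <= 0:
        raise ValueError("old_value and elapsed_hours must be positive")
    percent_growth = (new_value - old_value) / old_value * 100
    return percent_growth / elapsed_hours > max_percent_per_hour
```

With a limit of 1% per hour, 10% growth spread over a 30-day month (720 hours) passes, while the same 10% growth within 2 hours raises an alert.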

Integration

Integrate with a helpdesk/Trouble Ticket system
- Automatically Submit Tickets
- Automatically Update existing Tickets

Integrate with (or include) an Asset management system
- Display serial number, manufacturer, warranty periods, history of repairs/replacement, etc

Integrate with other monitoring systems e.g. CiscoWorks, Oracle Enterprise Manager, HP, Compaq Insight Manager, etc

Integrate with Microsoft Operations Manager (MOM) or offer functionality similar to that available in MOM

Agents

Locally installed agent to collect data (and temporarily store data locally)
Ability of central polling server to contact agent to get gathered data
Local agent has ability to send data to polling server
Ability to remotely update agents

Misc

History retention

Provide reports

Must be able to assign multiple IP addresses to each device and test each IP address individually if needed.

Minimal impact on service being monitored
Minimal effort to monitor (and manage) clients (remote devices)
- Do not require upgrades to existing infrastructure (e.g. a device should not have to run the latest version of its software before it can be monitored)

Ability for remote monitoring servers to report to a central server

Dependency aware (if a core router fails, do not send 100 alarms for devices behind it)
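
One possible way to implement this suppression is to record each device's upstream parent and only alert on failed devices that have no failed device upstream of them. The `parent_of` mapping and the device names below are hypothetical.

```python
from typing import Dict, Iterable, List


def alerts_to_send(failed: Iterable[str], parent_of: Dict[str, str]) -> List[str]:
    """Return the failed devices whose upstream path contains no failed device."""
    failed_set = set(failed)
    root_causes = []
    for device in sorted(failed_set):
        seen = {device}  # guards against accidental loops in the mapping
        parent = parent_of.get(device)
        suppressed = False
        while parent is not None and parent not in seen:
            if parent in failed_set:
                suppressed = True  # an upstream device already explains this failure
                break
            seen.add(parent)
            parent = parent_of.get(parent)
        if not suppressed:
            root_causes.append(device)
    return root_causes
```

If the core router and everything behind it are reported as failed, only the core router produces an alarm.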

Allow for scheduled downtime (disable a test in the future)
- Require authorisation
- Require a reason to be displayed

Allow for regular maintenance windows (e.g. an application is restarted every Sunday night, so do not send out alarms)
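
A recurring window like the Sunday night restart could be checked before any alarm is sent. The sketch below assumes a single weekly window that does not cross midnight; the default times are invented for the example.

```python
from datetime import datetime, time


def in_maintenance_window(now: datetime, weekday: int = 6,
                          start: time = time(22, 0),
                          end: time = time(23, 59)) -> bool:
    """True if `now` falls inside the weekly window.

    `weekday` follows datetime.weekday(): Monday is 0 and Sunday is 6.
    """
    return now.weekday() == weekday and start <= now.time() <= end
```

The alerting code would skip notification whenever this returns true, while still recording the test result for history.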

Ability to delegate testing to other devices (e.g. a tiered management structure)

Audit history in the monitoring system (date a server was added, when monitoring was disabled and why, etc.)

The system must be able to self-monitor

Be able to monitor 1000+ devices

Allow variable polling (some tests every 5 mins, some tests every 1 min)
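
Per-test polling intervals can be handled with a priority queue keyed on the next due time. This is a simulation sketch rather than a real scheduler: the test names are hypothetical, and it only lists when each test would fall due.

```python
import heapq
from typing import Dict, List, Tuple


def polling_order(intervals: Dict[str, int], until: int) -> List[Tuple[int, str]]:
    """Return (time, test) pairs in the order tests fall due, up to `until` seconds."""
    if any(interval <= 0 for interval in intervals.values()):
        raise ValueError("intervals must be positive")
    queue = [(0, name) for name in intervals]  # every test is due at time 0
    heapq.heapify(queue)
    due = []
    while queue and queue[0][0] <= until:
        when, name = heapq.heappop(queue)
        due.append((when, name))
        heapq.heappush(queue, (when + intervals[name], name))  # schedule next run
    return due
```

For example, `polling_order({"ping": 60, "disk": 300}, until=300)` schedules the ping test every minute and the disk test every five minutes.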

Highly Reliable

Redundancy (if your main monitoring server fails, have a second server on standby)

Apply default thresholds to groups of devices. Allow "one off" exceptions to these thresholds. e.g. all file systems must be less than 90% full. For serverX /opt must be less than 94% full since it currently is at 93% and should not change.
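
The serverX example could be modelled as a group default plus a table of per-host exceptions. The group name `unix-servers` and the dictionary layout are assumptions made for this sketch.

```python
from typing import Dict, Tuple

# Hypothetical group default: file systems must be less than 90% full.
DEFAULT_MAX_PERCENT: Dict[str, int] = {"unix-servers": 90}

# One-off exceptions keyed by (host, file system).
EXCEPTIONS: Dict[Tuple[str, str], int] = {("serverX", "/opt"): 94}


def max_percent_full(group: str, host: str, filesystem: str) -> int:
    """Return the exception for this host and file system, else the group default."""
    return EXCEPTIONS.get((host, filesystem), DEFAULT_MAX_PERCENT[group])


def is_over_threshold(group: str, host: str, filesystem: str, percent_full: float) -> bool:
    """True if the file system has reached or passed its limit."""
    return percent_full >= max_percent_full(group, host, filesystem)
```

Here /opt on serverX at 93% stays quiet, while any other file system at 93% would alarm.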
