System Monitoring with Xymon/Other Docs/FAQ/Generic Monitoring System Features
Requirements for a Monitoring System
Alerts
- Send (Email/SMS/etc)
- Acknowledge (display who is working on the issue)
- Delay
- Send to certain groups/individuals
- Escalation Path
- Ability to set severity levels for each service test (e.g. disk on a production server vs. disk on a development server)
- Different actions for different levels, e.g.
- Level 1 (disk 95% full) alert Help Desk
- Level 2 (disk 98% full) alert IT team
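The level-based routing described above could be sketched roughly as follows. This is only an illustration of the idea, not the configuration syntax of any particular monitoring product; the `AlertRule` class, the `DISK_RULES` list and the `route_disk_alert` function are hypothetical names, and the thresholds are taken from the two example levels in the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    threshold_pct: float  # rule matches when disk usage is at or above this value
    level: int            # severity level from the requirements list
    recipient: str        # group to notify (hypothetical label)


# Hypothetical rules mirroring the example: Level 1 at 95%, Level 2 at 98%.
DISK_RULES = [
    AlertRule(threshold_pct=95.0, level=1, recipient="Help Desk"),
    AlertRule(threshold_pct=98.0, level=2, recipient="IT team"),
]


def route_disk_alert(used_pct, rules=DISK_RULES):
    """Return the most severe rule matched by used_pct, or None if no rule matches."""
    # Check the highest threshold first so the most severe match wins.
    for rule in sorted(rules, key=lambda r: r.threshold_pct, reverse=True):
        if used_pct >= rule.threshold_pct:
            return rule
    return None
```

For example, `route_disk_alert(96.0)` would select the Help Desk rule, while `route_disk_alert(99.0)` would select the IT team rule. A production server and a development server could simply be given different rule lists.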
Display
- Include or integrate with a real-time display system (with colours: Red, Yellow, Green, Purple, White and Blue)
- Red:
- Yellow:
- Green:
- White:
- Purple:
- Display a time of last check
- Show a high-level "summary" of status, e.g. group Unix boxes together and show whether any have issues
- Ability to customise the display, e.g. a summary page for the IT helpdesk, a Unix page for Unix admins, a Network page for the networking team
- Ability to restrict access to the monitoring system (we do not want the general community to see everything monitored)
- Ability to search for a host
Monitor
- Microsoft Windows: Windows NT, Windows XP, Windows Vista.
- Be able to process Windows event logs and performance monitoring
- UNIX: Solaris, AIX, HP-UX, IRIX, Linux, MacOS X, Tru64.
- Services (DNS/FTP/SMTP/LDAP/etc)
- Applications (Outlook, Calendar, Exchange, Certificate Services, Apache, Tomcat, etc)
- HTTP Application Monitoring
- Expected Content returned
- Acceptable response time (10 seconds to load a web page is not okay)
- Simulate a Windows client application, e.g. click on an icon to launch Word, enter some text, save the document to a drive, close Word, and ensure the whole process worked.
- Service level testing
- e.g. a web application requires a web server, DNS, LDAP, etc. If the DNS server fails, then so will the web application.
- Allow for cluster testing (e.g. 1 web server out of a cluster of 5 fails, notify about the web server outage, but not the web service outage)
- Network File shares
- SAN Monitoring
- Citrix Servers and Services
- Printers
- Printer errors e.g. low toner
- Print Queues
- SNMP Devices
- Hardware (e.g. Dell DRAC, Sun Solaris), both via hardware card and OS software.
- UPS
- Other environmental inputs (temperature, humidity, etc)
- Nightly backup
- Warn if backups take longer than expected
- Alarm if some backups fail
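As an illustration of the HTTP application monitoring items above (expected content returned, acceptable response time), a check might look something like the sketch below. It uses only the Python standard library. The function names, the colour strings returned and the 10-second default are assumptions made for this example; the default simply follows the remark that 10 seconds to load a page is not okay.

```python
import time
import urllib.error
import urllib.request


def evaluate_http_check(status, body, elapsed, expected_text, max_seconds=10.0):
    """Map one HTTP result to a colour (assumed meaning: red=failed, yellow=slow, green=ok)."""
    if status != 200 or expected_text not in body:
        return "red"      # wrong status code or expected content missing
    if elapsed >= max_seconds:
        return "yellow"   # content is correct but the page loaded too slowly
    return "green"


def check_url(url, expected_text, max_seconds=10.0, timeout=30.0):
    """Fetch url, time the request, and evaluate the response."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        status, body = exc.code, ""
    except OSError:
        # Connection refused, DNS failure, timeout, and similar errors.
        return "red"
    elapsed = time.monotonic() - start
    return evaluate_http_check(status, body, elapsed, expected_text, max_seconds)
```

Keeping the evaluation separate from the fetch means the same rule could be reused for results gathered by an agent or by another polling server.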
Networking
- Provide integration with CiscoWorks, or have similar functionality
- WAN links, LAN links, VLANs, etc
- Verify link is up
- Verify Bandwidth is not saturated
- Cisco/Networking hardware
- CPU load
- Environmental e.g. Power supplies, temperature alarms, etc
- Ability to interact with probes (break down traffic to type and size)
- Capture and track changes to hardware configurations
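To make the "verify bandwidth is not saturated" item concrete, the sketch below estimates link utilisation from two readings of an interface octet counter (for instance an SNMP counter such as `ifInOctets`). The function names, the 32-bit counter default and the 90% saturation threshold are assumptions chosen for illustration, not values from this list.

```python
def link_utilisation_pct(octets_start, octets_end, interval_seconds,
                         link_speed_bps, counter_bits=32):
    """Estimate utilisation (%) of a link from two octet-counter samples."""
    if interval_seconds <= 0 or link_speed_bps <= 0:
        raise ValueError("interval and link speed must be positive")
    # The modulo handles a single counter wrap between the two samples.
    delta_octets = (octets_end - octets_start) % (2 ** counter_bits)
    bits_per_second = delta_octets * 8 / interval_seconds
    return 100.0 * bits_per_second / link_speed_bps


def is_saturated(utilisation_pct, threshold_pct=90.0):
    """Return True when utilisation is at or above the (assumed) threshold."""
    return utilisation_pct >= threshold_pct
```

With a 5-minute polling interval on a 10 Mbit/s link, 37,500,000 octets transferred works out to 10% utilisation. On fast links a 32-bit counter can wrap more than once between polls, in which case a 64-bit counter or a shorter interval would be needed.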
OS Monitoring
- Disk
- Memory
- Processes
- Response time
- CPU Load
- Hardware failures
- OS alerts (system event logs and syslog)
Database monitoring
- Oracle
- MySQL
- MSSQL
- Ingres
File Monitoring
- File growth, whether a file exists, etc.
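A file test of this kind could be as small as the following sketch, which reports whether a file exists, its current size, and whether it grew by more than an allowed amount since the previous check. The function names and the returned dictionary layout are hypothetical, and the caller is assumed to remember the previous size between polls.

```python
import os


def grew_too_fast(previous_size, current_size, max_growth_bytes):
    """Return True when the file grew by more than max_growth_bytes between checks."""
    return (current_size - previous_size) > max_growth_bytes


def check_file(path, previous_size=None, max_growth_bytes=None):
    """Report existence, size, and (optionally) excessive growth for one file."""
    if not os.path.exists(path):
        return {"exists": False, "size": None, "grew_too_fast": False}
    size = os.path.getsize(path)
    # Growth is only evaluated when both a previous size and a limit are supplied.
    too_fast = (
        previous_size is not None
        and max_growth_bytes is not None
        and grew_too_fast(previous_size, size, max_growth_bytes)
    )
    return {"exists": True, "size": size, "grew_too_fast": too_fast}
```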
Customise
- Easy to extend/customise with your own tests (API to integrate with)
Trending
- Alert on trends, e.g. 10% growth over 1 month might be OK, but over 2 hours it isn't.
- Provide trending for network bandwidth usage or any data collected
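One way to express the trend requirement is to normalise growth to a rate, so that the same 10 points of growth is judged differently over a month than over two hours. The sketch below assumes samples are `(timestamp_seconds, percent_used)` pairs ordered by time and compares only the first and last sample. The function names and the limit of 1 percentage point per hour are made up for the example.

```python
def growth_rate_per_hour(samples):
    """Return growth per hour between the first and last (timestamp, value) sample."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    hours = (t1 - t0) / 3600.0
    if hours <= 0:
        raise ValueError("samples must span a positive time interval")
    return (v1 - v0) / hours


def trend_alert(samples, max_rate_per_hour=1.0):
    """Return True when the value is growing faster than the allowed rate."""
    return growth_rate_per_hour(samples) > max_rate_per_hour


# 10 points of growth over 30 days: roughly 0.014 points per hour, no alert.
slow = [(0, 50.0), (30 * 24 * 3600, 60.0)]
# The same 10 points over 2 hours: 5 points per hour, alert.
fast = [(0, 50.0), (2 * 3600, 60.0)]
```

A real implementation would probably fit a line through all samples rather than use only the endpoints, which are sensitive to a single noisy reading.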
Integration
- Integrate with a helpdesk/trouble ticket system
- Automatically Submit Tickets
- Automatically Update existing Tickets
- Integrate with (or include) an Asset management system
- Display serial number, manufacturer, warranty periods, history of repairs/replacement, etc
- Integrate with other monitoring systems, e.g. CiscoWorks, Oracle Enterprise Manager, HP, Compaq Insight Manager, etc
- Integrate with Microsoft Operations Manager (MOM), or offer functionality similar to that available in MOM
Agents
- Locally installed agent to collect data (and temporarily store data locally)
- Ability of central polling server to contact agent to get gathered data
- Local agent has ability to send data to polling server
- Ability to remotely update agents
Misc
- History retention
- Provide reports
- Must be able to assign multiple IP addresses to each device and test each IP address individually if needed.
- Minimal impact on service being monitored
- Minimal effort to monitor (and manage) clients (remote devices)
- Do not require upgrades to existing infrastructure (e.g. a device should not have to run the latest version of its software before it can be monitored)
- Ability for remote monitoring servers to report to a central server
- Dependency aware (if a core router fails, do not send 100 alarms for devices behind it)
- Allow for scheduled downtime (disable a test in the future)
- Require authorisation
- Require a reason to be displayed
- Allow for regular maintenance windows (e.g. an application is restarted every Sunday night; do not send out alarms)
- Ability to delegate testing to other devices (e.g. a tiered management structure)
- Audit history in the monitoring system (date a server was added, when monitoring was disabled and why, etc.)
- The system must be able to self-monitor
- Be able to monitor 1000+ devices
- Allow variable polling (some tests every 5 mins, some tests every 1 min)
- Highly Reliable
- Redundancy (if your main monitoring server fails, have a second server on standby)
- Apply default thresholds to groups of devices, and allow "one off" exceptions to these thresholds. For example, all file systems must be less than 90% full, but for serverX /opt must be less than 94% full, since it is currently at 93% and is not expected to change.
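The default-plus-exception rule in the last item could be modelled as a lookup that falls back to a group default. This is a minimal sketch using the numbers from the example (90% default, 94% for /opt on serverX); the names `DEFAULT_MAX_PCT`, `OVERRIDES`, `max_pct_for` and `filesystem_ok` are hypothetical.

```python
DEFAULT_MAX_PCT = 90.0  # default: every file system must be less than 90% full

# One-off exceptions, keyed by (host, mount point).
OVERRIDES = {
    ("serverX", "/opt"): 94.0,
}


def max_pct_for(host, mount, overrides=OVERRIDES, default=DEFAULT_MAX_PCT):
    """Return the threshold for one file system, falling back to the default."""
    return overrides.get((host, mount), default)


def filesystem_ok(host, mount, used_pct):
    """Return True when usage is strictly below the applicable threshold."""
    return used_pct < max_pct_for(host, mount)
```

Here `filesystem_ok("serverX", "/opt", 93.0)` passes because of the exception, while the same 93% on any other host or mount point fails against the 90% default.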