Linux Guide/Monitoring
Introduction
This page is in a TODO state; anyone is free to complete or contribute to it. For now (2010-06-11) it contains random notes I have been collecting over time.
TODO is a marker meaning "to do" (some editing tools automatically recognize "TODO" as a pending task).
HARDWARE MONITORING
Rescanning the SCSI Bus
The following link provides a quick script for rescanning the SCSI bus in Linux.
There is a simpler way that works properly most of the time:
echo "- - -" > /sys/class/scsi_host/host0/scan
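The command above rescans only host0. A machine with several HBAs exposes one hostN directory per adapter, and the three dashes are wildcards for channel, target and LUN, so a small loop can rescan every adapter. This is a minimal sketch, assuming bash and root privileges; the function name and its optional directory argument are hypothetical, added only so the loop can be tried against a scratch directory.

```bash
#!/bin/bash
# Rescan every SCSI host adapter, not just host0, by writing the wildcard
# triple "- - -" (channel, target, LUN) to each host's scan file.
# The optional base-directory argument is an illustrative addition so the
# function can be tried outside /sys; it defaults to the real sysfs path.
rescan_all_scsi_hosts() {
    local base=${1:-/sys/class/scsi_host} scan
    for scan in "$base"/host*/scan; do
        # If no hostN directories exist the glob stays unexpanded; skip it
        [[ -e "$scan" ]] || continue
        echo "- - -" > "$scan" && echo "rescanned ${scan%/scan}"
    done
}
```

Source the function and run `rescan_all_scsi_hosts` as root, then check `dmesg` for newly detected devices.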
A slightly more complex script example for a QLogic card:
#!/bin/bash
# Ask each QLogic (qla2xxx) HBA listed under /proc to rescan its SCSI bus
for HBA in /proc/scsi/qla2xxx/*; do
    echo "scsi-qlascan" > "$HBA"
done
Alternatively, for iSCSI storage, iscsiadm (from open-iscsi) can be used if it is available:
iscsiadm -m discovery --type sendtargets --portal <IP>
iscsiadm -m node --targetname <targetname> --portal <IP> --login
Among other documents available online, the Red Hat Enterprise Linux 5 Online Storage Reconfiguration Guide can also be helpful.
Dmidecode reports information about your system's hardware as described in the system BIOS according to the SMBIOS/DMI standard (see a sample output). This information typically includes the system manufacturer, model name, serial number, BIOS version and asset tag, as well as many other details whose interest and reliability vary by manufacturer. It often includes the usage status of the CPU sockets, expansion slots (e.g. AGP, PCI, ISA) and memory module slots, and the list of I/O ports (e.g. serial, parallel, USB).
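Because dmidecode reports memory module slots, its output lends itself to quick checks such as counting how many slots are populated. A minimal sketch, assuming that `dmidecode -t 17` (Memory Device records) prints each empty slot as "Size: No Module Installed"; the exact wording may differ between dmidecode versions, and the function name is hypothetical.

```bash
#!/bin/bash
# Count populated and total memory slots from `dmidecode -t 17` output on stdin.
# Only the plain "Size" key is counted, so fields such as "Volatile Size"
# printed by newer dmidecode versions are ignored.
count_dimm_slots() {
    awk -F': ' '
        { key = $1; sub(/^[ \t]+/, "", key) }
        key == "Size" {
            total++
            if ($2 !~ /^No Module Installed/) used++
        }
        END { printf "%d of %d slots populated\n", used, total }
    '
}
```

Usage: `sudo dmidecode -t 17 | count_dimm_slots`.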
TODO IPMI
What is IPMI? The Intelligent Platform Management Interface (IPMI) specification defines a set of interfaces for platform management. It is implemented by a large number of hardware manufacturers to support system management on motherboards. The IPMI features most users will be interested in are sensor monitoring (e.g. CPU temperatures, fan speeds), remote power control, and serial-over-LAN (SOL).

What is FreeIPMI? FreeIPMI provides in-band and out-of-band IPMI software based on the IPMI v1.5/2.0 specification. It includes tools and libraries for reading IPMI sensors, system event log (SEL) entries, serial-over-LAN (SOL), remote power control functions, field replaceable unit (FRU) device information, and more. More information about FreeIPMI can be found on the FreeIPMI webpage: http://www.gnu.org/software/freeipmi/index.html
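Sensor monitoring is usually done with FreeIPMI's `ipmi-sensors` tool, and its table can be filtered so that only sensors needing attention are shown. A minimal sketch, assuming the pipe-separated column layout `ID | Name | Type | Reading | Units | Event` that recent FreeIPMI releases print by default, with healthy sensors reporting an Event of `'OK'` or `N/A`; check the layout of your version before relying on it. The function name is hypothetical.

```bash
#!/bin/bash
# Print the sensors that need attention from the table printed by FreeIPMI's
# ipmi-sensors (read on stdin). Assumes the default pipe-separated layout
#   ID | Name | Type | Reading | Units | Event
# and treats an Event of 'OK' (with quotes) or N/A as healthy.
ipmi_problem_sensors() {
    awk -F'|' -v q="'" '
        # Strip leading and trailing blanks from a field
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        NF >= 6 && trim($1) != "ID" {
            event = trim($6)
            if (event != (q "OK" q) && event != "N/A")
                print trim($2) ": " trim($4) " " trim($5) " (" event ")"
        }
    '
}
```

In-band use: `sudo ipmi-sensors | ipmi_problem_sensors`. For out-of-band access, ipmi-sensors also accepts `--hostname`, `--username` and `--password` for the remote BMC.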
TODO smartctl:
Example: SMART data for each physical disk behind an HP Smart Array (cciss) controller, queried with smartctl -d cciss,N, where N is the disk number on the controller:

************************************************************************
~# smartctl -d cciss,0 -a /dev/cciss/c0d0
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: HP DH072ABAA6 Version: HPD7
Serial number: 3PD19ZMN0000983153B8
Device type: disk
Transport protocol: SAS
Local Time is: Sat Jul 19 20:09:09 2008 CEST
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK

Current Drive Temperature: 29 C
Drive Trip Temperature: 68 C
Elements in grown defect list: 0
Vendor (Seagate) cache information
  Blocks sent to initiator = 899299930
  Blocks received from initiator = 14843797
  Blocks read from cache and sent to initiator = 3793967485
  Number of read and write commands whose size <= segment size = 48565840
  Number of read and write commands whose size > segment size = 0
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 945.00
  number of minutes until next internal SMART test = 7

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0          0.000           0
write:         0        0         0         0          0          0.000           0

Non-medium error count: 0
No self-tests have been logged
Long (extended) Self Test duration: 840 seconds [14.0 minutes]
************************************************************************
~# smartctl -d cciss,1 -a /dev/cciss/c0d0
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: HP DH072ABAA6 Version: HPD7
Serial number: 3PD19ZPV000098315CX2
Device type: disk
Transport protocol: SAS
Local Time is: Sat Jul 19 20:09:12 2008 CEST
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK

Current Drive Temperature: 30 C
Drive Trip Temperature: 68 C
Elements in grown defect list: 0
Vendor (Seagate) cache information
  Blocks sent to initiator = 920490987
  Blocks received from initiator = 14368268
  Blocks read from cache and sent to initiator = 3755437180
  Number of read and write commands whose size <= segment size = 48820139
  Number of read and write commands whose size > segment size = 0
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 945.02
  number of minutes until next internal SMART test = 8

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0          0.000           0
write:         0        0         0         0          0          0.000           0

Non-medium error count: 0
No self-tests have been logged
Long (extended) Self Test duration: 840 seconds [14.0 minutes]
************************************************************************
~# smartctl -d cciss,2 -a /dev/cciss/c0d0
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: HP DH072ABAA6 Version: HPD7
Serial number: 3PD1A0SD000098300K39
Device type: disk
Transport protocol: SAS
Local Time is: Sat Jul 19 20:09:15 2008 CEST
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK

Current Drive Temperature: 31 C
Drive Trip Temperature: 68 C
Elements in grown defect list: 0
Vendor (Seagate) cache information
  Blocks sent to initiator = 913141941
  Blocks received from initiator = 11455509
  Blocks read from cache and sent to initiator = 3697098775
  Number of read and write commands whose size <= segment size = 49159966
  Number of read and write commands whose size > segment size = 0
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 944.93
  number of minutes until next internal SMART test = 18

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0          0.000           0
write:         0        0         0         0          0          0.000           0

Non-medium error count: 0
No self-tests have been logged
Long (extended) Self Test duration: 840 seconds [14.0 minutes]
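The per-disk smartctl commands above can be run in a loop that keeps only the health status and temperature. A minimal sketch, assuming smartmontools is installed, the controller is /dev/cciss/c0d0 with three disks as in the sample, and the SAS-style "SMART Health Status:" line shown above (ATA disks print a different health line, which this parser does not handle). The function names and defaults are hypothetical.

```bash
#!/bin/bash
# Extract "health temperature" from `smartctl -a` text on stdin (SAS format).
# Prints UNKNOWN / ? when the corresponding line is missing.
smart_summary() {
    awk -F': *' '
        /^SMART Health Status/       { health = $2 }
        /^Current Drive Temperature/ { temp = $2 }
        END { printf "%s %s\n", (health == "" ? "UNKNOWN" : health), (temp == "" ? "?" : temp) }
    '
}

# Query disks 0..N-1 behind a cciss controller, one summary line per disk.
# Defaults (device and disk count) mirror the sample output above.
check_cciss_disks() {
    local device=${1:-/dev/cciss/c0d0} disks=${2:-3} n
    for ((n = 0; n < disks; n++)); do
        printf 'cciss,%d: ' "$n"
        smartctl -d "cciss,$n" -a "$device" | smart_summary
    done
}
```

Running `check_cciss_disks /dev/cciss/c0d0 3` as root prints one line per disk; for the first disk in the sample above that line would be `cciss,0: OK 29 C`.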
Dell OMSA monitoring
Installing OMSA for hardware monitoring on Dell servers:
OMSA (OpenManage Server Administrator) lets you monitor RAID health and motherboard, disk and chassis temperatures, generate alarms, view and modify BIOS settings, and view the installed devices.
To install under Debian:
1.- Add the following line to /etc/apt/sources.list:
deb ftp://ftp.sara.nl/pub/sara-omsa dell sara
2.- Run:
apt-get update && apt-get install dellomsa
This installs OMSA under /opt/dell.
3.- Start the OMSA services:
~# /opt/dell/srvadmin/dataeng/bin/dsm_sa_datamgr32d -run
~# /opt/dell/srvadmin/dataeng/bin/dsm_sa_eventmgr32d -run
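Step 3 can be wrapped in a small function that starts both daemons in the same order and stops at the first missing binary or failed start. A minimal sketch, assuming the /opt/dell/srvadmin/dataeng/bin location used above; the function name and its directory argument are hypothetical conveniences.

```bash
#!/bin/bash
# Start the two OMSA daemons from step 3, stopping at the first failure.
# The default directory is the one used above; the argument allows
# pointing at an install that lives elsewhere.
start_omsa() {
    local bindir=${1:-/opt/dell/srvadmin/dataeng/bin} daemon
    for daemon in dsm_sa_datamgr32d dsm_sa_eventmgr32d; do
        if [[ ! -x "$bindir/$daemon" ]]; then
            echo "start_omsa: $bindir/$daemon not found or not executable" >&2
            return 1
        fi
        "$bindir/$daemon" -run || return 1
    done
}
```

Calling `start_omsa` as root from a boot script avoids retyping both commands; a non-zero return status signals that OMSA did not start.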
OMSA Usage Examples:
To check the health of the disks connected to controller 0:
~# /etc/delloma.d/oma/bin/omreport.sh storage pdisk controller=0
The output will look similar to:
List of Physical Disks on Controller PERC 4e/Di (Embedded)

Controller PERC 4e/Di (Embedded)
ID : 0:0
Status : Ok
Name : Physical Disk 0:0
State : Online
Failure Predicted : No
Progress : Not Applicable
Type : SCSI
Capacity : 68.24 GB (73274490880 bytes)
Used RAID Disk Space : 68.24 GB (73274490880 bytes)
Available RAID Disk Space : 0.00 GB (0 bytes)
Hot Spare : No
Vendor ID : MAXTOR
Product ID : ATLAS10K5_73SCA
Revision : JNZY
Serial No. : J20KVCTK
Negotiated Speed : 320
Capable Speed : 320
Manufacture Day : Not Available
Manufacture Week : Not Available
Manufacture Year : Not Available
SAS Address : Not Available

ID : 0:1
Status : Ok
Name : Physical Disk 0:1
State : Online
Failure Predicted : No
Progress : Not Applicable
Type : SCSI
Capacity : 68.24 GB (73274490880 bytes)
Used RAID Disk Space : 68.24 GB (73274490880 bytes)
Available RAID Disk Space : 0.00 GB (0 bytes)
Hot Spare : No
Vendor ID : MAXTOR
Product ID : ATLAS10K5_73SCA
Revision : JNZY
Serial No. : J20KV5RK
Negotiated Speed : 320
Capable Speed : 320
Manufacture Day : Not Available
Manufacture Week : Not Available
Manufacture Year : Not Available
SAS Address : Not Available

ID : 0:2
Status : Ok
Name : Physical Disk 0:2
State : Online
Failure Predicted : No
Progress : Not Applicable
Type : SCSI
Capacity : 68.24 GB (73274490880 bytes)
Used RAID Disk Space : 68.24 GB (73274490880 bytes)
Available RAID Disk Space : 0.00 GB (0 bytes)
Hot Spare : No
Vendor ID : MAXTOR
Product ID : ATLAS10K5_73SCA
Revision : JNZY
Serial No. : J20KTS8K
Negotiated Speed : 320
Capable Speed : 320
Manufacture Day : Not Available
Manufacture Week : Not Available
Manufacture Year : Not Available
SAS Address : Not Available
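For unattended checks (e.g. from cron), the pdisk report can be reduced to the disks that need attention. A minimal sketch, assuming the "Key : value" layout shown in the sample output; the function name is hypothetical. Disks in other legitimate states, such as hot spares, may also be listed, so adjust the conditions to your configuration.

```bash
#!/bin/bash
# Report physical disks that look unhealthy in `omreport storage pdisk` output
# read on stdin: Status other than Ok, State other than Online, or a
# predicted failure. Exits with status 1 if anything was reported.
check_pdisks() {
    awk '
        {
            # Split "Key : value" at the first " : " (IDs such as 0:1 contain colons)
            pos = index($0, " : ")
            if (pos == 0) next
            key = substr($0, 1, pos - 1)
            val = substr($0, pos + 3)
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            if (key == "ID") id = val
            else if (key == "Status" && val != "Ok") { print id ": status " val; bad = 1 }
            else if (key == "State" && val != "Online") { print id ": state " val; bad = 1 }
            else if (key == "Failure Predicted" && val == "Yes") { print id ": failure predicted"; bad = 1 }
        }
        END { exit (bad ? 1 : 0) }
    '
}
```

Usage: `/etc/delloma.d/oma/bin/omreport.sh storage pdisk controller=0 | check_pdisks || echo "check the disks"`.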
To check the state/configuration of the RAID:
~# /etc/delloma.d/oma/bin/omreport.sh storage vdisk controller=0
The output will look like:
Virtual Disk 0 on Controller PERC 4e/Di (Embedded)

Controller PERC 4e/Di (Embedded)
ID : 0
Status : Ok
Name : Virtual Disk 0
State : Ready
Progress : Not Applicable
Layout : RAID-5
Size : 136.48 GB (146548981760 bytes)
Device Name : /dev/sda
Type : SCSI
Read Policy : Adaptive Read Ahead
Write Policy : Write Back
Cache Policy : Direct I/O
Stripe Element Size : 64 KB
To get a summary of the server:
~# /etc/delloma.d/oma/bin/omreport.sh system summary
System Summary

------------------
Software Profile
------------------
Systems Management
Name : Information not available.
Version : 3.2.0
Description : Systems Management Software

Operating System
Name : Linux
Version : Kernel 2.6.18.2 (i686)
System Time : Sun Nov 25 18:30:37 2007
System Bootup Time : Fri Oct 12 15:20:31 2007

--------
System
--------
System Host Name : MySuperServidor
System Location : Please set the value

---------------------
Main System Chassis
---------------------
Chassis Information
Chassis Model : PowerEdge 2850
Chassis Service Tag :
Chassis Lock : Present
Chassis Asset Tag :

Processor 1
Processor Manufacturer : Intel
Processor Family : Xeon
Processor Version : Model 4 Stepping 3
Current Speed : 3200 MHz
Maximum Speed : 3600 MHz
External Clock Speed : 800 MHz
Voltage : 1400 mV

Processor 2
Processor Manufacturer : Intel
Processor Family : Xeon
Processor Version : Model 4 Stepping 3
Current Speed : 3200 MHz
Maximum Speed : 3600 MHz
External Clock Speed : 800 MHz
Voltage : 1400 mV

Memory
Total Installed Capacity : 2048 MB
Memory Available to the OS : 2023 MB
Total Maximum Capacity : 16384 MB
Memory Array Count : 1

Memory Array 1
Location : System Board or Motherboard
Use : System Memory
Installed Capacity : 2048 MB
Maximum Capacity : 16384 MB
Slots Available : 6
Slots Used : 2
ECC Type : Multibit ECC

Slot PCI1
Adapter : [Not Occupied]
Type : PCI X
Data Bus Width : 64 Bits
Speed : 133 MHz
Slot Length : Long
Voltage Supply : 3.3 Volts

Slot PCI2
Adapter : [Not Occupied]
Type : PCI X
Data Bus Width : 64 Bits
Speed : 133 MHz
Slot Length : Long
Voltage Supply : 3.3 Volts

Slot PCI3
Adapter : PRO/100 S Server Adapter
Type : PCI X
Data Bus Width : 64 Bits
Speed : 133 MHz
Slot Length : Short
Voltage Supply : 3.3 Volts

BIOS Information
Manufacturer : Dell Inc.
Version : A04
Release Date : 09/22/2005

--------------
Network Data
--------------
IP Address Data
IP Address 0 : 192.168.2.2
IP Address 1 : 192.168.0.115

--------------------
Storage Enclosures
--------------------
Storage Enclosures
Name : Backplane
Service Tag : 62P00P8
TODO logwatch
SOFTWARE MONITORING
TODO: Monit
Table of monitoring tools
Syntax | Brief explanation |
---|---|
top | Displays running processes and lets you manage them interactively (useful for killing processes). Press 'q' to quit or 'k' to kill a process. |
htop | Similar to top, but with a friendlier, menu-based user interface. |
lsof | Shows which processes have a given file or directory open, and the set of files a process is accessing (including network sockets, pipes and devices). |
netstat | Reports network statistics and connections (both established and listening). |
vmstat | Reports virtual memory statistics, together with process, paging, block I/O and CPU activity. |
iostat | Reports read/write statistics for block devices such as disks, as well as CPU utilization. |
inotifywatch, inotifywait | Modern Linux kernels can immediately notify processes (user applications) of any access to or change in a file. inotifywait waits for such kernel events on a set of files or directories, and inotifywatch collects statistics about them. |
strace -p <pid> | Traces the system calls (requests for services offered by the kernel) made by a running user application. |
stap | SystemTap: monitors the kernel in real time and in great detail. A tutorial can be read here |
oprofile and perfmon2 | Allow access to hardware performance counters. A tutorial can be browsed here |
AMD CodeAnalyst | Graphical front end to OProfile. An introduction/tutorial can be browsed here and here |
Intel VTune | Performance tuning on Intel hardware. |
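As an example of post-processing one of these tools, vmstat's CPU idle column can be averaged over a short sampling window. vmstat's first report gives averages since the last reboot, so the sketch below skips it. A minimal sketch, assuming vmstat's usual two header lines with an "id" column; the function name is hypothetical.

```bash
#!/bin/bash
# Average the CPU idle ("id") column of vmstat output read on stdin.
# The column is located by name in vmstat's header line, and the first
# sample is skipped because vmstat reports it as an average since boot.
# Exits with status 1 if no usable samples were found.
avg_idle() {
    awk '
        $1 == "r" { for (i = 1; i <= NF; i++) if ($i == "id") col = i; next }
        col && $1 ~ /^[0-9]+$/ { if (seen++) { sum += $col; n++ } }
        END { if (n) printf "%.1f\n", sum / n; else exit 1 }
    '
}
```

Usage: `vmstat 1 6 | avg_idle` averages five one-second samples after discarding the since-boot line. Looking the column up by name keeps the function working when a vmstat version adds or reorders columns.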