Managing Services and Faults in Oracle Solaris 11.1 (Oracle Solaris 11.1 Information Library)
The preferred method to display fault or defect information and determine the FRUs involved is the fmadm faulty command. However, the fmdump command is also supported. fmdump is often used to display a historical log of problems on the system, and fmadm faulty is used to display the active problems.
Caution - Do not base administrative action on the output of the fmdump command, but rather on the fmadm faulty output. The log files can contain error statements, which should not be considered faults or defects.
For more information, see How to Use Your Assigned Administrative Rights in Oracle Solaris 11.1 Administration: Security Services.
# fmadm faulty
See the following examples for a description of the text generated.
Example 3-1 fmadm Output With One Faulty CPU
1  # fmadm faulty
2  --------------- ------------------------------------ -------------- ---------
3  TIME            EVENT-ID                             MSG-ID         SEVERITY
4  --------------- ------------------------------------ -------------- ---------
5  Aug 24 17:56:03 7b83c87c-78f6-6a8e-fa2b-d0cf16834049 SUN4V-8001-8H  Minor
6
7  Host        : bur419-61
8  Platform    : SUNW,T5440        Chassis_id  : BEL07524BN
9  Product_sn  : BEL07524BN
10
11 Fault class : fault.cpu.ultraSPARC-T2plus.ireg
12 Affects     : cpu:///cpuid=0/serial=1F95806CD1421929
13                 faulted and taken out of service
14 FRU         : "MB/CPU0" (hc://:product-id=SUNW,T5440:server-id=bur419-61:\
15               serial=3529:part=541255304/motherboard=0/cpuboard=0)
16                 faulty
17 Serial ID.  : 3529
18               1F95806CD1421929
19
20 Description : The number of integer register errors associated with this thread
21               has exceeded acceptable levels.
22
23 Response    : The fault manager will attempt to remove the affected thread from
24               service.
25
26 Impact      : System performance may be affected.
27
28 Action      : Use 'fmadm faulty' to provide a more detailed view of this event.
29               Please refer to the associated reference document at
30               http://support.oracle.com/msg/SUN4V-8001-8H for the latest service
31               procedures and policies regarding this diagnosis.
Of primary interest is line 14, which shows the data for the impacted FRUs. The more human-readable location string is presented in quotation marks, "MB/CPU0". The quoted value is intended to match the label on the physical hardware. The FRU is also represented in a Fault Management Resource Identifier (FMRI) format, which includes descriptive properties about the system containing the fault, such as its host name and chassis serial number. On platforms that support it, the part number and serial number of the FRU are also included in the FRU's FMRI.
The Affects lines (lines 12 and 13) indicate the components that are affected by the fault and their relative state. In this example, a single CPU strand is affected. It is faulted and taken out of service.
Following the FRU description in the fmadm faulty command output, line 16 shows the state as faulty. The Action section might also include other specific actions instead of, or in addition to, the usual reference to the fmadm command.
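The FRU FMRI on lines 14 and 15 of Example 3-1 packs several properties into one string. The following Python sketch shows one way to pull those properties apart for use in a script. It is not part of the Fault Management tools: the function name parse_hc_fmri is hypothetical, and the layout it assumes (colon-separated name=value pairs before the first slash, slash-separated name=instance pairs after it) is inferred only from the examples in this section. Other hc-scheme FMRIs might not follow that layout.

```python
def parse_hc_fmri(fmri):
    """Split an hc-scheme FMRI into (authority properties, path components).

    Assumption: the layout matches the examples in this section. This is
    an illustrative helper, not an official FMRI parser.
    """
    # Rejoin an FMRI that the listing wrapped with a trailing backslash.
    fmri = "".join(part.strip() for part in fmri.split("\\\n"))
    scheme, sep, rest = fmri.partition("://")
    if scheme != "hc" or not sep:
        raise ValueError("not an hc-scheme FMRI: %r" % fmri)
    authority, _, path = rest.partition("/")
    # Authority: ":product-id=...:server-id=...". Elided text such as
    # "..." carries no name=value pair and is skipped.
    props = dict(
        item.split("=", 1) for item in authority.split(":") if "=" in item
    )
    # Path: "motherboard=0/cpuboard=0" becomes a list of (name, instance).
    components = [
        tuple(item.split("=", 1)) for item in path.split("/") if "=" in item
    ]
    return props, components


# FRU FMRI from Example 3-1, wrapped exactly as in the listing.
WRAPPED_FMRI = (
    "hc://:product-id=SUNW,T5440:server-id=bur419-61:\\\n"
    "    serial=3529:part=541255304/motherboard=0/cpuboard=0"
)

props, components = parse_hc_fmri(WRAPPED_FMRI)
print(props["serial"], props["part"])
print(components)
```

On platforms that do not report part and serial numbers, the corresponding keys would simply be absent from the returned dictionary, so a script should look them up with props.get() rather than assume that they exist.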
Example 3-2 fmadm Output With Multiple Faults
1  # fmadm faulty
2  --------------- ------------------------------------ -------------- -------
3  TIME            EVENT-ID                             MSG-ID         SEVERITY
4  --------------- ------------------------------------ -------------- -------
5  Sep 21 10:01:36 d482f935-5c8f-e9ab-9f25-d0aaafec1e6c PCIEX-8000-5Y  Major
6
7  Fault class : fault.io.pci.device-invreq
8  Affects     : dev:///pci@0,0/pci1022,7458@11/pci1000,3060@0
9                dev:///pci@0,0/pci1022,7458@11/pci1000,3060@1
10                 ok and in service
11               dev:///pci@0,0/pci1022,7458@11/pci1000,3060@2
12               dev:///pci@0,0/pci1022,7458@11/pci1000,3060@3
13                 faulty and taken out of service
14 FRU         : "SLOT 2" (hc://.../pciexrc=3/pciexbus=4/pciexdev=0)
15                 repair attempted
16               "SLOT 3" (hc://.../pciexrc=3/pciexbus=4/pciexdev=1)
17                 acquitted
18               "SLOT 4" (hc://.../pciexrc=3/pciexbus=4/pciexdev=2)
19                 not present
20               "SLOT 5" (hc://.../pciexrc=3/pciexbus=4/pciexdev=3)
21                 faulty
22
23 Description : The transmitting device sent an invalid request.
24
25 Response    : One or more device instances may be disabled
26
27 Impact      : Possible loss of services provided by the device instances
28               associated with this fault
29
30 Action      : Use 'fmadm faulty' to provide a more detailed view of this event.
31               Please refer to the associated reference document at
32               http://support.oracle.com/msg/PCIEX-8000-5Y for the latest service
33               procedures and policies regarding this diagnosis.
Following the FRU description in the fmadm faulty command output, line 21 shows the state as faulty. Other state values that you might see in other situations include repair attempted and acquitted, as shown for SLOT 2 and SLOT 3 in lines 15 and 17, and not present, as shown for SLOT 4 in line 19.
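When many FRUs are listed, it can help to reduce the FRU section to a map of location label to state. The following Python sketch does this for the FRU lines of Example 3-2. The function name fru_states is hypothetical, and the parsing rule (a quoted label followed by an opening parenthesis, with the state on the next non-blank line) is an assumption based on the two examples in this section. The reference line numbers shown in the listings are omitted from the sample, on the assumption that they are documentation aids and not part of the command output.

```python
import re

# FRU section of Example 3-2, without the listing's reference line numbers.
SAMPLE_FRU_SECTION = """\
FRU         : "SLOT 2" (hc://.../pciexrc=3/pciexbus=4/pciexdev=0)
                repair attempted
              "SLOT 3" (hc://.../pciexrc=3/pciexbus=4/pciexdev=1)
                acquitted
              "SLOT 4" (hc://.../pciexrc=3/pciexbus=4/pciexdev=2)
                not present
              "SLOT 5" (hc://.../pciexrc=3/pciexbus=4/pciexdev=3)
                faulty
"""

# A FRU entry starts with a quoted label, optionally preceded by "FRU :".
FRU_ENTRY = re.compile(r'(?:FRU\s*:\s*)?"([^"]+)"\s*\(')


def fru_states(text):
    """Return {location label: state} for the FRU entries found in text."""
    states = {}
    label = None
    continued = False
    for raw in text.splitlines():
        line = raw.strip()
        if continued:
            # Inside an FMRI wrapped with a trailing backslash: skip it.
            continued = line.endswith("\\")
            continue
        match = FRU_ENTRY.match(line)
        if match:
            label = match.group(1)
            continued = line.endswith("\\")
        elif label is not None and line:
            # First non-blank line after the entry is taken as its state.
            states[label] = line
            label = None
    return states


states = fru_states(SAMPLE_FRU_SECTION)
still_faulty = [label for label, state in states.items() if state == "faulty"]
print(states)
print(still_faulty)
```

The sketch expects only the FRU section as input. Passing a complete fmadm faulty report would require first isolating the lines between the FRU label and the next blank line or labeled field.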
Example 3-3 Showing Faults with the fmdump Command
Some console messages and knowledge articles might instruct you to use the older fmdump -v -u UUID command to display fault information. Although the fmadm faulty command is preferred, the fmdump command still operates, as shown in the following example:
1  % fmdump -v -u 7b83c87c-78f6-6a8e-fa2b-d0cf16834049
2  TIME                 UUID                                 SUNW-MSG-ID EVENT
3  Aug 24 17:56:03.4596 7b83c87c-78f6-6a8e-fa2b-d0cf16834049 SUN4V-8001-8H Diagnosed
4    100%  fault.cpu.ultraSPARC-T2plus.ireg
5
6          Problem in: -
7             Affects: cpu:///cpuid=0/serial=1F95806CD1421929
8                 FRU: hc://:product-id=SUNW,T5440:server-id=bur419-61:\
9                      serial=9999:part=541255304/motherboard=0/cpuboard=0
10           Location: MB/CPU0
The information about the affected FRUs is still present, although it is spread across three lines (lines 8 through 10). The Location line presents the human-readable FRU string. The FRU lines present the formal FMRI. Note that the severity, descriptive text, and action are not shown with the fmdump command unless you use the -m option. See the fmdump(1M) man page for more information.
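Because fmdump -v splits the FRU information across the FRU and Location lines, a script that consumes this output has to reassemble it. The following Python sketch collects the labeled fields from the listing in Example 3-3. The function name fmdump_fields is hypothetical, and the field names are taken only from that example. The trailing backslash might be a line wrap introduced by the documentation and not by fmdump itself; the sketch rejoins it in case it is present.

```python
import re

# Output from Example 3-3, without the listing's reference line numbers.
SAMPLE_FMDUMP = r"""TIME                 UUID                                 SUNW-MSG-ID EVENT
Aug 24 17:56:03.4596 7b83c87c-78f6-6a8e-fa2b-d0cf16834049 SUN4V-8001-8H Diagnosed
  100%  fault.cpu.ultraSPARC-T2plus.ireg

        Problem in: -
           Affects: cpu:///cpuid=0/serial=1F95806CD1421929
               FRU: hc://:product-id=SUNW,T5440:server-id=bur419-61:\
                    serial=9999:part=541255304/motherboard=0/cpuboard=0
          Location: MB/CPU0
"""

# Field labels as they appear in the example; other labels are ignored.
FIELD = re.compile(r"^\s*(Problem in|Affects|FRU|Location):\s*(.*)$")


def fmdump_fields(text):
    """Return the labeled fields of a single-event 'fmdump -v' listing."""
    # Rejoin any line that ends with a backslash with the line after it.
    joined = re.sub(r"\\\n\s*", "", text)
    fields = {}
    for line in joined.splitlines():
        match = FIELD.match(line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


fields = fmdump_fields(SAMPLE_FMDUMP)
print(fields["Location"])
print(fields["FRU"])
```

The sketch handles one event only. If the listing contained several events, later values would overwrite earlier ones, so the input would first need to be split per event.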
% /usr/sbin/psrinfo
0       faulted   since 05/13/2011 12:55:26
1       on-line   since 05/12/2011 11:47:26
The faulted state indicates that the CPU has been taken offline by a Fault Management response agent.
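To check for faulted CPUs from a script, the psrinfo output can be filtered on its second column. The following Python sketch assumes the column layout shown in the example above (processor ID, state, the word since, and a timestamp). The function name faulted_cpus is hypothetical.

```python
# psrinfo output copied from the example above.
SAMPLE_PSRINFO = """\
0       faulted   since 05/13/2011 12:55:26
1       on-line   since 05/12/2011 11:47:26
"""


def faulted_cpus(psrinfo_output):
    """Return the IDs of processors whose state column reads 'faulted'."""
    ids = []
    for line in psrinfo_output.splitlines():
        columns = line.split()
        # Expect at least "<id> <state>"; ignore anything else.
        if len(columns) >= 2 and columns[0].isdigit() and columns[1] == "faulted":
            ids.append(int(columns[0]))
    return ids


print(faulted_cpus(SAMPLE_PSRINFO))
```

On a live system, the text passed to the function would come from running /usr/sbin/psrinfo, for example through the subprocess module.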
For more information, see How to Use Your Assigned Administrative Rights in Oracle Solaris 11.1 Administration: Security Services.
# fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                             MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
May 12 22:52:47 915cb64b-e16b-4f49-efe6-de81ff96fce7 SMF-8000-YX    major

Host        : parity
Platform    : Sun-Fire-V40z     Chassis_id  : XG051535088
Product_sn  : XG051535088

Fault class : defect.sunos.smf.svc.maintenance
Affects     : svc:///system/intrd:default
                  faulted and taken out of service
Problem in  : svc:///system/intrd:default
                  faulted and taken out of service

Description : A service failed - it is restarting too quickly.

Response    : The service has been placed into the maintenance state.

Impact      : svc:/system/intrd:default is unavailable.

Action      : Run 'svcs -xv svc:/system/intrd:default' to determine the
              generic reason why the service failed, the location of any
              logfiles, and a list of other services impacted. Please refer to
              the associated reference document at
              http://support.oracle.com/msg/SMF-8000-YX for the latest service
              procedures and policies regarding this diagnosis.
Follow the instructions given in the Action section in the fmadm output.
# svcs -xv svc:/system/intrd:default
svc:/system/intrd:default (interrupt balancer)
 State: maintenance since Wed May 12 22:52:47 2010
Reason: Restarting too quickly.
   See: http://support.oracle.com/msg/SMF-8000-YX
   See: man -M /usr/share/man -s 1M intrd
   See: /var/svc/log/system-intrd:default.log
Impact: This service is not running.
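The See lines in the svcs -xv output point to the knowledge article, the man page, and the service log file. The following Python sketch gathers the labeled lines so that a script can, for example, locate the log file to inspect next. The function name parse_svcs_x is hypothetical, and the labels it recognizes are only those shown in the output above.

```python
# svcs -xv output copied from the example above.
SAMPLE_SVCS = """\
svc:/system/intrd:default (interrupt balancer)
 State: maintenance since Wed May 12 22:52:47 2010
Reason: Restarting too quickly.
   See: http://support.oracle.com/msg/SMF-8000-YX
   See: man -M /usr/share/man -s 1M intrd
   See: /var/svc/log/system-intrd:default.log
Impact: This service is not running.
"""

LABELS = ("State", "Reason", "See", "Impact")


def parse_svcs_x(text):
    """Return (service FMRI, {label: [values]}) for one service report."""
    fmri = None
    details = {label: [] for label in LABELS}
    for raw in text.splitlines():
        line = raw.strip()
        if fmri is None and line.startswith("svc:/"):
            # First line of the report: "<FMRI> (<description>)".
            fmri = line.split()[0]
            continue
        label, sep, value = line.partition(": ")
        if sep and label in details:
            details[label].append(value)
    return fmri, details


fmri, details = parse_svcs_x(SAMPLE_SVCS)
log_files = [ref for ref in details["See"] if ref.endswith(".log")]
print(fmri, details["State"])
print(log_files)
```

Selecting log files by the .log suffix is a convenience chosen for this sketch, not a documented rule.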
Refer to the knowledge article, SMF-8000-YX, for further instructions on fixing this problem.