8.2.2 Identifying a Fault


This section describes how to identify a fault.
Identify the location of a fault by following the troubleshooting flow in Figure 8-1.
(1) Checking the LED indicators
Check the LEDs on the OPNL, rear panel, and individual units to identify the FRU requiring maintenance. Before starting maintenance on a FRU, check its status with the LEDs.
  1. LEDs of the OPNL
    You can check the system status from the LEDs on the OPNL. For details, see "2.4.1 OPNL LEDs."
  2. Rear panel LEDs
    As with the CHECK LED on the OPNL, you can also check the system status from the CHECK LED on the rear of the SPARC M12-2/M12-2S. For details, see "2.4.2 System Locator."
  3. LEDs of individual FRUs
    From the FRU LEDs, you can determine not only that a hardware error has occurred in the SPARC M12-2/M12-2S but also which hardware caused it. For details, see "2.4.3 LEDs of Each Unit."
Note - Some FRUs such as memory do not have mounted LEDs. To check the status of a FRU without a mounted LED, use an XSCF command such as the showhardconf command. For details, see "(3) Checking the FRU status" below.
(2) Checking error messages
If a failure occurs in the system, analyze its cause from the time of occurrence, the abnormal event, and other data obtained from XSCF log information and Oracle Solaris messages.
Checking XSCF log information
The following example shows a check of system operation and failure information from XSCF log information. For details, see "12.1 Checking a Log Saved by the XSCF" in the Fujitsu SPARC M12 and Fujitsu M10/SPARC M10 System Operation and Administration Guide.
  1. Checking monitoring messages
XSCF> showlogs monitor
Jan 27 18:42:11 H4U2S115 Event: SCF:System powered on
Jan 27 18:45:41 H4U2S115 Event: SCF:PPAR-ID 0: Reset released
Jan 27 18:45:48 H4U2S115 Event: SCF:POST boot start from PPAR (PPAR ID 0)
Jan 27 18:45:49 H4U2S115 Event: SCF:Current PPARs' phase (PPARID 0 POST phase: Banner)
Jan 27 18:45:50 H4U2S115 Event: SCF:Current PPARs' phase (PPARID 0 POST phase: CPU Check)
Jan 27 18:45:51 H4U2S115 Event: SCF:Current PPARs' phase (PPARID 0 POST phase: CPU Register)
Jan 27 18:45:52 H4U2S115 Event: SCF:Current PPARs' phase (PPARID 0 POST phase: STICK Increment)
Jan 27 18:45:53 H4U2S115 Event: SCF:Current PPARs' phase (PPARID 0 POST phase: Extended Instruction)
Jan 27 18:45:54 H4U2S115 Event: SCF:Current PPARs' phase (PPARID 0 POST phase: MMU)
Jan 27 18:46:07 H4U2S115 Event: SCF:Current PPARs' phase (PPARID 0 POST phase: Memory Initialize)
Jan 27 18:46:43 H4U2S115 Event: SCF:Current PPARs' phase (PPARID 0 POST phase: MSCAN)
Jan 27 18:46:54 H4U2S115 Event: SCF:Current PPARs' phase (PPARID 0 POST phase: Cache)
Jan 27 18:46:59 H4U2S115 Event: SCF:Current PPARs' phase (PPARID 0 POST phase: Interrupt Queue)
Jan 27 18:47:00 H4U2S115 Event: SCF:Current PPARs' phase (PPARID 0 POST phase: Floating Point Unit)
Jan 27 18:47:10 H4U2S115 Event: SCF:Current PPARs' phase (PPARID 0 POST phase: Encryption)
Jan 27 18:47:12 H4U2S115 Event: SCF:Current PPARs' phase (PPARID 0 POST phase: Random number)
Jan 27 18:47:13 H4U2S115 Event: SCF:Current PPARs' phase (PPARID 0 POST phase: Cacheable Instruction)
Jan 27 18:47:22 H4U2S115 Event: SCF:Current PPARs' phase (PPARID 0 POST phase: Softint)
Jan 27 18:47:23 H4U2S115 Event: SCF:Current PPARs' phase (PPARID 0 POST phase: CPU Cross Call)
Jan 27 18:47:24 H4U2S115 Event: SCF:Current PPARs' phase (PPARID 0 POST phase: CMU-CH)
Jan 27 18:47:26 H4U2S115 Event: SCF:Current PPARs' phase (PPARID 0 POST phase: PCI-CH)
Jan 27 18:47:33 H4U2S115 Event: SCF:Current PPARs' phase (PPARID 0 POST phase: TOD)
Jan 27 18:47:34 H4U2S115 Event: SCF:Current PPARs' phase (PPARID 0 POST phase: MBC Check Before STICK Diag)
Jan 27 18:47:35 H4U2S115 Event: SCF:Current PPARs' phase (PPARID 0 POST phase: STICK Stop)
Jan 27 18:47:37 H4U2S115 Event: SCF:Current PPARs' phase (PPARID 0 POST phase: STICK Start)
Jan 27 18:47:38 H4U2S115 Event: SCF:Current PPARs' phase (PPARID 0 POST phase: CPU Speed Control)
Jan 27 18:47:39 H4U2S115 Event: SCF:Current PPARs' phase (PPARID 0 POST phase: SX)
Jan 27 18:47:40 H4U2S115 Event: SCF:Current PPARs' phase (PPARID 0 POST phase: RT)
Jan 27 18:47:41 H4U2S115 Event: SCF:Current PPARs' phase (PPARID 0 POST phase: RT/SX NC)
Jan 27 18:47:42 H4U2S115 Event: SCF:Current PPARs' phase (PPARID 0 POST phase: RT/SX Interrupt)
Jan 27 18:47:45 H4U2S115 Event: SCF:Current PPARs' phase (PPARID 0 POST phase: CPU Status Check)
Jan 27 18:47:46 H4U2S115 Event: SCF:Current PPARs' phase (PPARID 0 POST phase: System Configuration)
Jan 27 18:47:47 H4U2S115 Event: SCF:Current PPARs' phase (PPARID 0 POST phase: System Status Check)
Jan 27 18:47:48 H4U2S115 Event: SCF:Current PPARs' phase (PPARID 0 POST phase: Prepare To Start Hypervisor)
Jan 27 18:47:48 H4U2S115 Event: SCF:POST Diag complete from PPAR (PPAR ID 0)
Jan 27 18:47:54 H4U2S115 Event: SCF:SCF sets the active config to PPAR (PPARID 0 SP-Config:factory-default)
Jan 27 18:48:06 H4U2S115 Event: SCF:HV boot from PPAR (PPAR ID 0)
Jan 27 18:48:13 H4U2S115 Event: SCF:PPARID 0 GID 00000000 state change (OpenBoot initializing)
Jan 27 18:48:28 H4U2S115 Event: SCF:PPARID 0 GID 00000000 state change (OpenBoot Running)
Jan 27 18:48:28 H4U2S115 Event: SCF:PPARID 0 GID 00000000 state change (OpenBoot Primary Boot Loader)
Jan 27 18:48:46 H4U2S115 Event: SCF:PPARID 0 GID 00000000 state change (OpenBoot Running OS Boot)
Jan 27 18:49:32 H4U2S115 Event: SCF:PPARID 0 GID 00000000 state change (Solaris booting)
Jan 27 18:49:32 H4U2S115 Event: SCF:PPARID 0 GID 00000000 state change (Solaris booting)
Jan 27 18:49:32 H4U2S115 Event: SCF:PPARID 0 GID 00000000 state change (Solaris running)
XSCF>
  2. Checking error logs
XSCF> showlogs error
Date: Jan 11 16:33:43 JST 2017
Code: 40000000-014f210000ff0000ff-090101040000000000000000
Status: Warning Occurred: Jan 11 16:33:38.921 JST 2017
FRU: /BB#1/CMUL
Msg: A:mpt_sas9:mpt_sas:RAID status error
Date: Jan 11 18:06:55 JST 2017
Code: 80000000-0056000000ff0000ff-01a100020000000000000000
Status: Alarm Occurred: Jan 11 18:06:52.012 JST 2017
FRU: /BB#0/XSCFU
Msg: XSCF hang-up is detected
Date: Jan 11 20:31:31 JST 2017
Code: 80002000-007c20007811007811-019204050000000000000000
Status: Alarm Occurred: Jan 11 20:31:25.098 JST 2017
FRU: /BB#3/XBU#0/CBL#2L,/BB#3/XBU#0,/BB#0/XBU#0
Msg: XB-XB interface fatal error
XSCF>
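When an error log is long, it can help to split a saved copy of the output into individual entries before reading it. The following is a minimal Python sketch, not part of the XSCF firmware or any Fujitsu tool. It assumes the output was captured to text in the layout shown above (each entry starts with a Date line, and Status and Occurred share one line). The function name parse_error_log and the variable SAMPLE are hypothetical.

```python
import re

# Abridged copy of the sample output above (assumed to be saved as plain text).
SAMPLE = """\
XSCF> showlogs error
Date: Jan 11 16:33:43 JST 2017
Code: 40000000-014f210000ff0000ff-090101040000000000000000
Status: Warning Occurred: Jan 11 16:33:38.921 JST 2017
FRU: /BB#1/CMUL
Msg: A:mpt_sas9:mpt_sas:RAID status error
Date: Jan 11 20:31:31 JST 2017
Code: 80002000-007c20007811007811-019204050000000000000000
Status: Alarm Occurred: Jan 11 20:31:25.098 JST 2017
FRU: /BB#3/XBU#0/CBL#2L,/BB#3/XBU#0,/BB#0/XBU#0
Msg: XB-XB interface fatal error
XSCF>
"""


def parse_error_log(text):
    """Split captured `showlogs error` output into one dict per entry."""
    entries = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Date:"):
            # A Date line opens a new entry.
            current = {"Date": line[len("Date:"):].strip()}
            entries.append(current)
        elif current is None:
            continue  # command echo or prompt before the first entry
        elif line.startswith("Status:"):
            m = re.match(r"Status:\s*(\S+)\s+Occurred:\s*(.*)", line)
            if m:
                current["Status"] = m.group(1)
                current["Occurred"] = m.group(2)
        elif line.startswith("FRU:"):
            # Several suspect FRUs may be listed, separated by commas.
            current["FRU"] = [f.strip() for f in line[4:].split(",") if f.strip()]
        elif line.startswith(("Code:", "Msg:")):
            key, value = line.split(":", 1)
            current[key] = value.strip()
    return entries


entries = parse_error_log(SAMPLE)
for entry in entries:
    if entry.get("Status") == "Alarm":
        print(entry["Occurred"], ", ".join(entry["FRU"]), entry["Msg"])
```

Keeping the FRU field as a list preserves every suspect FRU in an entry, such as the three paths in the XB-XB interface error above.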
Checking messages of the predictive self-repairing tool
Check messages of Oracle Solaris Fault Manager, the predictive self-repairing tool, operating on Oracle Solaris. Oracle Solaris Fault Manager has the following functions:
  1. Receiving telemetry information relating to an error
  2. Troubleshooting
  3. Disabling the FRU where an error occurred
  4. Turning on the LED of the FRU where an error occurred and displaying details in a system console message
Table 8-4 lists typical messages generated when an error occurs. These messages indicate that the fault has already been diagnosed. If corrective actions can be taken, they have already been taken. Also, if the system is in operation, corrective actions continue to be taken.
The messages are displayed on the console and recorded in the /var/adm/messages file.
Table 8-4  Predictive Self-repairing Messages
Displayed Output | Description
EVENT-TIME: Thu Apr 19 10:48:39 JST 2012 | EVENT-TIME: Diagnosis time stamp
PLATFORM: ORCL,SPARC64-X, CSN: PP115300MX, HOSTNAME: 4S-LGA12-D0 | PLATFORM: Description of the server where the error occurred
SOURCE: eft, REV: 1.16 | SOURCE: Information about the diagnosis engine used to identify the error
EVENT-ID: fcbb42a5-47c3-c9c5-f0b0-f782d69afb01 | EVENT-ID: Universally unique event ID for this error
DESC: The diagnosis engine encountered telemetry from the listed devices for which it was unable to perform a diagnosis - ereport.io.pciex.rc.epkt@chassis0/cpuboard0/chip0/hostbridge0/pciexrc0 class and path are incompatible. | DESC: Basic description of the error
AUTO-RESPONSE: Error reports have been logged for examination. | AUTO-RESPONSE: Corrective actions taken (if any) by the system to alleviate any subsequent problems
IMPACT: Automated diagnosis and response for these events will not occur. | IMPACT: Description of assumed impact from the fault
REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this event. Use 'fmdump -eV' to view the unexpected telemetry. Please refer to the associated reference document at http://support.oracle.com/msg/SUNOS-8000-J0 for the latest service procedures and policies regarding this diagnosis. | REC-ACTION: Brief description of the corrective action that the system administrator needs to take
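The labeled fields in Table 8-4 can be separated mechanically once the message text has been copied out of the console or the /var/adm/messages file. The following Python sketch is illustrative only and is not an Oracle Solaris Fault Manager interface. It assumes that any syslog prefixes have already been removed, that each field starts on its own line with one of the labels from Table 8-4, and that long fields wrap onto following lines. The names FIELDS, parse_fault_message, and SAMPLE are hypothetical, and the sample text is abridged from Table 8-4 with the line wrapping assumed.

```python
# Field labels taken from Table 8-4.
FIELDS = ("EVENT-TIME", "PLATFORM", "SOURCE", "EVENT-ID",
          "DESC", "AUTO-RESPONSE", "IMPACT", "REC-ACTION")

# Abridged from Table 8-4; the line breaks inside DESC and REC-ACTION are assumed.
SAMPLE = """\
EVENT-TIME: Thu Apr 19 10:48:39 JST 2012
PLATFORM: ORCL,SPARC64-X, CSN: PP115300MX, HOSTNAME: 4S-LGA12-D0
SOURCE: eft, REV: 1.16
EVENT-ID: fcbb42a5-47c3-c9c5-f0b0-f782d69afb01
DESC: The diagnosis engine encountered telemetry from the listed devices
for which it was unable to perform a diagnosis
AUTO-RESPONSE: Error reports have been logged for examination.
IMPACT: Automated diagnosis and response for these events will not occur.
REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this event.
Use 'fmdump -eV' to view the unexpected telemetry.
"""


def parse_fault_message(text):
    """Return a dict mapping each Table 8-4 label to its (joined) text."""
    fields = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        label, sep, rest = line.partition(":")
        if sep and label in FIELDS:
            # A known label starts a new field.
            current = label
            fields[current] = rest.strip()
        elif current is not None:
            # Anything else is treated as a wrapped continuation line.
            fields[current] += " " + line
    return fields


fields = parse_fault_message(SAMPLE)
print(fields["EVENT-ID"])
print(fields["REC-ACTION"])
```

The EVENT-ID value extracted this way is the identifier to quote when following the REC-ACTION instructions.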
(3) Checking the FRU status
Check the system hardware configuration and status of each FRU. Table 8-5 shows the meaning of each FRU status displayed as a command execution result.
Table 8-5  FRU Status
Display | Description
Normal | Normal status
Faulted | The unit has stopped due to a fault.
Degraded | The unit has a fault somewhere but continues to operate.
Deconfigured | The unit is degraded as a result of the fault or degradation of another unit.
Maintenance | Maintenance work is being performed with the replacefru, addfru, or initbb command.
Checking the hardware configuration and status of each FRU
To check the hardware configuration and FRU status for the entire system, log in to the XSCF shell, and execute the showhardconf command. An asterisk (*) placed at the beginning of an output line indicates a faulty FRU.
XSCF> showhardconf
SPARC M12-2S;
+ Serial:PZ51649002; Operator_Panel_Switch:Service;
+ System_Power:On; System_Phase:Cabinet Power On;
Partition#0 PPAR_Status:Running;
BB#00 Status:Normal; Role:Master; Ver:3015h; Serial:PZ51649002;
+ FRU-Part-Number:CA20369-B17X 005AC/7341758 ;
+ Power_Supply_System: ;
+ Memory_Size:256 GB;
CMUL Status:Normal; Ver:2101h; Serial:PP164804GG ;
+ FRU-Part-Number:CA07855-D301 A5 /7341541 ;
+ Memory_Size:128 GB; Type: C ;
CPU#0 Status:Normal; Ver:4242h; Serial:00070051;
+ Freq:4.250 GHz; Type:0x30;
+ Core:12; Strand:8;
MEM#00A Status:Normal;
+ Code:ce8002M393A2K40BB1-CRC 00-31C04E6E;
+ Type:83; Size:16 GB;
MEM#01A Status:Normal;
+ Code:ce8002M393A2K40BB1-CRC 00-31C04ED1;
+ Type:83; Size:16 GB;
MEM#02A Status:Normal;
+ Code:ce8002M393A2K40BB1-CRC 00-31C0510D;
+ Type:83; Size:16 GB;
MEM#03A Status:Normal;
+ Code:ce8002M393A2K40BB1-CRC 00-31C04F51;
+ Type:83; Size:16 GB;
MEM#04A Status:Normal;
+ Code:ce8002M393A2K40BB1-CRC 00-31C04E6F;
+ Type:83; Size:16 GB;
MEM#05A Status:Normal;
+ Code:ce8002M393A2K40BB1-CRC 00-31C04F50;
+ Type:83; Size:16 GB;
MEM#06A Status:Normal;
+ Code:ce8002M393A2K40BB1-CRC 00-31C04EFB;
+ Type:83; Size:16 GB;
MEM#07A Status:Normal;
+ Code:ce8002M393A2K40BB1-CRC 00-31C051AE;
+ Type:83; Size:16 GB;
CMUU Status:Normal; Ver:2101h; Serial:PP164804GN ;
+ FRU-Part-Number:CA07855-D451 A4 /7341568 ;
+ Memory_Size:128 GB; Type: C ;
CPU#0 Status:Normal; Ver:4242h; Serial:00070043;
+ Freq:4.250 GHz; Type:0x30;
+ Core:12; Strand:8;
MEM#00A Status:Normal;
+ Code:ce8002M393A2K40BB1-CRC 00-31C04EF7;
+ Type:83; Size:16 GB;
MEM#01A Status:Normal;
+ Code:ce8002M393A2K40BB1-CRC 00-31C051AB;
+ Type:83; Size:16 GB;
MEM#02A Status:Normal;
+ Code:ce8002M393A2K40BB1-CRC 00-31C04EFD;
+ Type:83; Size:16 GB;
MEM#03A Status:Normal;
+ Code:ce8002M393A2K40BB1-CRC 00-31C04ED2;
+ Type:83; Size:16 GB;
MEM#04A Status:Normal;
+ Code:ce8002M393A2K40BB1-CRC 00-31C04EF6;
+ Type:83; Size:16 GB;
MEM#05A Status:Normal;
+ Code:ce8002M393A2K40BB1-CRC 00-31C04F57;
+ Type:83; Size:16 GB;
MEM#06A Status:Normal;
+ Code:ce8002M393A2K40BB1-CRC 00-31C04EAC;
+ Type:83; Size:16 GB;
MEM#07A Status:Normal;
+ Code:ce8002M393A2K40BB1-CRC 00-31C04EA8;
+ Type:83; Size:16 GB;
PCI#0 Name_Property:network;
+ Vendor-ID:8086; Device-ID:1521;
+ Subsystem_Vendor-ID:108e; Subsystem-ID:7b18;
+ Model:SUNW,pcie-igb;
PCI#2 Name_Property:network;
+ Vendor-ID:8086; Device-ID:1528;
+ Subsystem_Vendor-ID:108e; Subsystem-ID:7b15;
+ Model:ATO:7070007, PTO:7070005;
PCI#4 Name_Property:network;
+ Vendor-ID:8086; Device-ID:10fb;
+ Subsystem_Vendor-ID:108e; Subsystem-ID:7b11;
+ Model:X1109a-z/1109a-z;
PCI#6 Name_Property:QLGC,qlc;
+ Vendor-ID:1077; Device-ID:2532;
+ Subsystem_Vendor-ID:1077; Subsystem-ID:015d;
+ Model:QLE2562 ;
XBU#0 Status:Normal; Ver:1101h; Serial:PP164601DU ;
+ FRU-Part-Number:CA20369-B18X 004AB/7341570 ;
+ Type: C ;
XBU#1 Status:Normal; Ver:1101h; Serial:PP164601DV ;
+ FRU-Part-Number:CA20369-B18X 004AB/7341570 ;
+ Type: C ;
XSCFU Status:Normal; Ver:0101h; Serial:PP164603JA ;
+ FRU-Part-Number:CA20369-B08X 006AC/7341765 ;
+ Type: A ;
OPNL Status:Normal; Ver:0101h; Serial:PP164702EE ;
+ FRU-Part-Number:CA20365-B35X 006AC/7060922 ;
+ Type: A ;
PSUBP Status:Normal; Ver:1101h; Serial:PP164603HH ;
+ FRU-Part-Number:CA20369-B17X 005AC/7341758 ;
+ Type: C ;
PSU#0 Status:Normal; Ver:303242h; Serial:HWCD1622000551;
+ FRU-Part-Number:CA01022-0850/7334651 ;
+ Power_Status:ON; AC:200 V; Type: C ;
PSU#1 Status:Normal; Ver:303242h; Serial:HWCD1622000586;
+ FRU-Part-Number:CA01022-0850/7334651 ;
+ Power_Status:ON; AC:200 V; Type: C ;
PSU#2 Status:Normal; Ver:303242h; Serial:HWCD1622000524;
+ FRU-Part-Number:CA01022-0850/7334651 ;
+ Power_Status:ON; AC:200 V; Type: C ;
PSU#3 Status:Normal; Ver:303242h; Serial:HWCD1622000496;
+ FRU-Part-Number:CA01022-0850/7334651 ;
+ Power_Status:ON; AC:200 V; Type: C ;
FANU#0 Status:Normal; Type: C ;
FANU#1 Status:Normal; Type: C ;
FANU#2 Status:Normal; Type: C ;
FANU#3 Status:Normal; Type: C ;
FANU#4 Status:Normal; Type: C ;
FANU#5 Status:Normal; Type: C ;
FANU#6 Status:Normal; Type: C ;
FANU#7 Status:Normal; Type: C ;
HDDBP Status:Normal; Type: A ;
XSCF>
Checking for faulty FRUs
To check for faulty FRUs, log in to the XSCF shell, and execute the showstatus command. An asterisk (*) placed at the beginning of an output line indicates a faulty FRU.
XSCF> showstatus
BB#00 Status:Normal;
CMUL Status:Normal;
* MEM#00A Status:Faulted;
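Because both the showhardconf and showstatus commands mark faulty FRUs with a leading asterisk, a saved copy of either output can be scanned for those lines. The following Python sketch is one possible way to do so. It assumes the output was captured as plain text, one component per line, and it reports only the flagged line itself; it does not reconstruct which BB or CMU the FRU belongs to. The names faulty_lines and SAMPLE are hypothetical.

```python
import re

# Copy of the showstatus sample above (assumed to be saved as plain text).
SAMPLE = """\
XSCF> showstatus
BB#00 Status:Normal;
CMUL Status:Normal;
* MEM#00A Status:Faulted;
"""


def faulty_lines(output):
    """Return (name, status) for every line flagged with a leading asterisk."""
    flagged = []
    for raw in output.splitlines():
        line = raw.strip()
        if not line.startswith("*"):
            continue
        body = line[1:].strip()
        m = re.search(r"Status:\s*([^;]+);", body)
        if not m:
            continue  # not a component line (for example, a footnote)
        name = body.split()[0].rstrip(";")
        flagged.append((name, m.group(1).strip()))
    return flagged


for name, status in faulty_lines(SAMPLE):
    print(f"{name}: {status}")
```

Compare each reported status with Table 8-5 to decide whether the FRU has stopped or is still operating in a degraded state.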
Checking the hardware RAID volume status
From the control domain or root domain, execute the sas2ircu command of the SAS2IRCU utility on Oracle Solaris to check for a degraded hardware RAID volume and a faulty HDD/SSD.
root# ./sas2ircu 0 display
LSI Corporation SAS2 IR Configuration Utility.
Version 17.00.00.00 (2013.07.19)
(Omitted)
------------------------------------------------------------------------

IR Volume information (*1)
------------------------------------------------------------------------

(Omitted)
IR volume 2
Volume ID : 286
Volume Name : RAID1-SYS
Status of volume : Degraded (DGD) (*2)
Volume wwid : 0aa6d102f1bf517a
RAID level : RAID1
Size (in MB) : 571250
Physical hard disks :
PHY[0] Enclosure#/Slot# : 2:0
PHY[1] Enclosure#/Slot# : 0:0

------------------------------------------------------------------------

Physical device information (*3)
------------------------------------------------------------------------

Initiator at ID #0

Device is a Hard disk
Enclosure # : 0
Slot # : 0
SAS Address : 0000000-0-0000-0000
State : Failed (FLD) (*4)
Manufacturer : TOSHIBA
Model Number : MBF2600RC
Firmware Revision : 3706
Serial No : EA25PC700855
GUID : N/A
Protocol : SAS
Drive Type : SAS_HDD
(Omitted)
*1 RAID volume information
*2 Degraded operation of the RAID volume
*3 Physical device information
*4 Indicating a fault
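The volume and device sections of the sas2ircu display output can be read programmatically when several volumes or disks must be checked. The following Python sketch is an illustration only and is not part of the SAS2IRCU utility. It assumes the output was saved as text in the "key : value" layout shown above, without the (*1) to (*4) annotations, which are explanatory marks in this manual. It tests only for the status codes that appear in the sample, (DGD) and (FLD). The names parse_sas2ircu_display, report, and SAMPLE are hypothetical.

```python
# Abridged copy of the sample output above, without the manual's (*n) marks.
SAMPLE = """\
IR Volume information
------------------------------------------------------------------------
IR volume 2
Volume ID : 286
Volume Name : RAID1-SYS
Status of volume : Degraded (DGD)
RAID level : RAID1
PHY[0] Enclosure#/Slot# : 2:0
PHY[1] Enclosure#/Slot# : 0:0
------------------------------------------------------------------------
Physical device information
------------------------------------------------------------------------
Initiator at ID #0

Device is a Hard disk
Enclosure # : 0
Slot # : 0
State : Failed (FLD)
Serial No : EA25PC700855
"""


def parse_sas2ircu_display(text):
    """Collect key/value pairs per IR volume and per physical device."""
    volumes, devices = [], []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("IR volume "):
            current = {}
            volumes.append(current)
        elif line.startswith("Device is a"):
            current = {"Device": line[len("Device is a"):].strip()}
            devices.append(current)
        elif not line or line.startswith("---"):
            continue
        elif current is not None and ":" in line:
            # Split on the first colon only, so values such as "2:0" stay whole.
            key, value = line.split(":", 1)
            current[key.strip()] = value.strip()
    return volumes, devices


def report(volumes, devices):
    """List degraded volumes and failed devices found in the parsed output."""
    lines = []
    for vol in volumes:
        if "(DGD)" in vol.get("Status of volume", ""):
            lines.append(f"volume {vol.get('Volume Name')} is degraded")
    for dev in devices:
        if "(FLD)" in dev.get("State", ""):
            slot = f"{dev.get('Enclosure #')}:{dev.get('Slot #')}"
            lines.append(f"device at enclosure:slot {slot} has failed")
    return lines


volumes, devices = parse_sas2ircu_display(SAMPLE)
for line in report(volumes, devices):
    print(line)
```

In the sample, the failed device at enclosure 0, slot 0 matches the PHY[1] member of the degraded volume RAID1-SYS.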
Checking the PCI expansion unit status
To check for faulty FRUs in the PCI expansion unit, log in to the XSCF shell and execute the showhardconf and ioxadm commands.
XSCF> showhardconf
(Omitted)
PCI#0 Status:Normal; Name_Property:pci;
+ Vendor-ID:108e; Device-ID:9020;
+ Subsystem_Vendor-ID:0000; Subsystem-ID:0000;
+ Model:;
+ Connection:2003;
PCIBOX#2003; Status:Normal; Ver:1150h; Serial:PZ21332003;
+ FRU-Part-Number:;
IOB Status:Normal; Serial:PP133001CW ;
+ FRU-Part-Number:CA20365-B66X 020AM/7061033 ;

LINKBOARD Status:Normal; Serial:PP140801Z8 ;
+ FRU-Part-Number:CA20365-B60X 009AD/7061035 ;

PCI#11 Name_Property:pci;
+ Vendor-ID:104c; Device-ID:8231;
+ Subsystem_Vendor-ID:0000; Subsystem-ID:0000;
+ Model:;
FANBP Status:Normal; Serial:PP13310038 ;
+ FRU-Part-Number:CA20365-B68X 005AD/7061025 ;

PSU#0; Status:Normal; Serial:FEJD1245001507;
+ FRU-Part-Number:CA01022-0750-D/7060988 ;
* PSU#1; Status:Faulted; Serial:FEJD1245001483; (*1)
+ FRU-Part-Number:CA01022-0750-D/7060988 ;
FAN#0; Status:Normal;
FAN#1; Status:Normal;
FAN#2; Status:Normal;
(Omitted)
*1 Failed
XSCF> ioxadm -v list
Location Type FW Ver Serial Num Part Num State
PCIBOX#2003 PCIBOX - PZ21332003 On
PCIBOX#2003/PSU#0 PSU - FEJD1245001507 CA01022-0750-D/7060988 On
PCIBOX#2003/PSU#1 PSU - FEJD1245001483 CA01022-0750-D/7060988 Off (*1)
PCIBOX#2003/IOB IOBOARD 1150 PP133001CW CA20365-B66X 020AM/7061033 On
PCIBOX#2003/LINKBD BOARD - PP140801Z8 CA20365-B60X 009AD/7061035 On
PCIBOX#2003/FANBP FANBP - PP13310038 CA20365-B68X 005AD/7061025 On
BB#00-PCI#00 CARD 1150 PP123300S8 CA20365-B59X 001AA On
*1 Stopped
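A saved copy of the ioxadm -v list output can be filtered for stopped units by looking at the State column. The following Python sketch is one possible approach and is not part of the XSCF firmware. It assumes each unit occupies one line, that Location is the first whitespace-separated token, and that State is the last token, because the Part Num column can be empty or contain spaces. The (*1) annotation is a mark in this manual and is omitted from the sample. The names stopped_units and SAMPLE are hypothetical.

```python
# Abridged copy of the sample output above, without the manual's (*1) mark.
SAMPLE = """\
XSCF> ioxadm -v list
Location Type FW Ver Serial Num Part Num State
PCIBOX#2003 PCIBOX - PZ21332003 On
PCIBOX#2003/PSU#0 PSU - FEJD1245001507 CA01022-0750-D/7060988 On
PCIBOX#2003/PSU#1 PSU - FEJD1245001483 CA01022-0750-D/7060988 Off
PCIBOX#2003/IOB IOBOARD 1150 PP133001CW CA20365-B66X 020AM/7061033 On
BB#00-PCI#00 CARD 1150 PP123300S8 CA20365-B59X 001AA On
"""


def stopped_units(output):
    """Return (location, type) for every unit whose State column is Off."""
    stopped = []
    for raw in output.splitlines():
        tokens = raw.split()
        if len(tokens) < 3 or tokens[0] == "Location":
            continue  # blank line or column header
        location, unit_type, state = tokens[0], tokens[1], tokens[-1]
        if state == "Off":
            stopped.append((location, unit_type))
    return stopped


for location, unit_type in stopped_units(SAMPLE):
    print(f"{unit_type} at {location} is stopped")
```

A unit reported here should also appear with an asterisk and a Faulted status in the showhardconf output, as PSU#1 does in the example above.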