4.2.2 Identifying a failure
This section describes how to identify a failure. Use the troubleshooting flow described in "4.2.1  Confirming whether there is a failure" to determine an appropriate way to check the failure.
Checking the LED indications
Check the LEDs on the operation panel, rear panel, and each component to identify the FRU requiring maintenance. Check the status of a FRU from its LED before starting maintenance work on the FRU.
  1. Operation panel LEDs
    You can determine the status of the system by checking the LEDs on the operation panel. For details, see "2.4.1  Operation panel LEDs."
  2. Rear panel LED
    You can determine the status of the system by checking the CHECK LED on the rear panel of the chassis, which duplicates the CHECK LED on the operation panel. For details, see "2.4.2  LEDs on the rear panel (System locator)."
  3. LED of each FRU
    If an error occurs in the hardware in the chassis, you can determine the location of the error by checking the LED of the FRU that incorporates the failed hardware. For details, see "2.4.3  LEDs on each component."
    Note that some FRUs, such as memory, do not have LEDs mounted. To check the status of a FRU that does not have an LED, execute XSCF shell commands such as the showhardconf command from the maintenance terminal. For details, see "Checking the FRU status" below.
Checking error messages
Display error messages to check log information and an error overview.
You can use either of the following methods to check the error messages:
  1. Checking error log information with the XSCF shell
    For details, see "12.1  Checking a Log Saved by the XSCF" in the Fujitsu SPARC M12 and Fujitsu M10/SPARC M10 System Operation and Administration Guide.
  2. Checking messages with Oracle Solaris
    For details, see "12.2  Checking Warning and Notification Messages" in the Fujitsu SPARC M12 and Fujitsu M10/SPARC M10 System Operation and Administration Guide.
Checking the FRU status
Execute XSCF firmware commands to determine the system hardware configuration and the status of each FRU.
- showhardconf command
Execute the showhardconf command to check the information on the FRU list.
  1. Log in to the XSCF shell.
  2. Execute the showhardconf command to check the FRU list.
    A faulty FRU is indicated by an asterisk (*) at the beginning of the line.
XSCF> showhardconf
SPARC M10-4S;
    + Serial:2081229003; Operator_Panel_Switch:Service;
    + System_Power:On; System_Phase:Cabinet Power On;
    Partition#0 PPAR_Status:Running;
    BB#00 Status:Normal; Role:Master; Ver:2050h; Serial:2081229003;
        + FRU-Part-Number:CA07361-D202 A0                         ;
        + Power_Supply_System: ;
        + Memory_Size:320 GB;
------------------------Omitted------------------------

        PCI#0 Status:Normal; Name_Property:pci;
            + Vendor-ID:108e; Device-ID:9020;
            + Subsystem_Vendor-ID:0000; Subsystem-ID:0000;
            + Model:;
            + Connection:7001;
*           PCIBOX#7001; Status:Faulted; Ver:1110h; Serial:2121237001;
                + FRU-Part-Number:;
                IOB Status:Normal; Serial:PP123403JE  ;
                    + FRU-Part-Number:CA20365-B66X 008AG    ;
                LINKBOARD Status:Normal; Serial:PP1234026P  ;
                    + FRU-Part-Number:CA20365-B60X 001AA    ;
                PCI#1 Name_Property:ethernet;
                    + Vendor-ID:1077; Device-ID:8000;
                    + Subsystem_Vendor-ID:1077; Subsystem-ID:017e;
                    + Model:;
------------------------Omitted-----------------------
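When collecting status from several chassis, the asterisk-flagged lines can also be picked out programmatically. The following is a minimal sketch in Python (a hypothetical helper, not part of the XSCF firmware), assuming the showhardconf output has already been captured to a string, for example over an SSH session to the XSCF:

```python
def find_faulted_frus(showhardconf_output):
    """Return the FRU lines that showhardconf flags with a leading asterisk (*)."""
    faulted = []
    for line in showhardconf_output.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("*"):
            # Drop the marker and its padding, keeping the FRU entry itself.
            faulted.append(stripped.lstrip("*").strip())
    return faulted

# Sample fragment in the format shown above (hypothetical captured output).
sample = """\
    BB#00 Status:Normal; Role:Master; Ver:2050h; Serial:2081229003;
*           PCIBOX#7001; Status:Faulted; Ver:1110h; Serial:2121237001;
                + FRU-Part-Number:;
"""
print(find_faulted_frus(sample))
# → ['PCIBOX#7001; Status:Faulted; Ver:1110h; Serial:2121237001;']
```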
- showstatus command
Execute the showstatus command to check the FRU status.
  1. Log in to the XSCF shell.
  2. Execute the showstatus command to check the status.
    A faulty FRU is indicated by an asterisk (*) at the beginning of the line.
XSCF> showstatus
    MBU Status:Normal;
*       MEM#0A Status:Faulted;
The FRU status is displayed after the "Status:" string.
Table 4-3 describes the FRU status.
Table 4-3  FRU status
Display Description
Normal The unit is in the normal state.
Faulted The unit is faulty and is not operating.
Degraded A part of the unit has failed or degraded, but the unit is running.
Deconfigured Due to the failure or degradation of another unit, the target unit and the components in its underlying layers have been degraded, although there is no problem with them.
Maintenance Maintenance is being performed. The replacefru, addfru, or initbb command is being executed.
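As a sketch of how this output might be consumed in a script (a hypothetical helper; the field layout assumed is the one shown in the example above), the "Status:" value of each unit can be mapped per FRU:

```python
import re

def parse_fru_status(showstatus_output):
    """Map each unit name in showstatus output to its "Status:" value."""
    statuses = {}
    for line in showstatus_output.splitlines():
        # A line looks like: [*] <unit-name> Status:<state>;
        m = re.search(r"(\S+)\s+Status:(\w+);", line)
        if m:
            statuses[m.group(1)] = m.group(2)
    return statuses

# Sample fragment in the format shown above (hypothetical captured output).
sample = """\
    MBU Status:Normal;
*       MEM#0A Status:Faulted;
"""
print(parse_fru_status(sample))
# → {'MBU': 'Normal', 'MEM#0A': 'Faulted'}
```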
Checking the hardware RAID volume status
Check the hardware RAID volume status.
From the control domain or root domain, execute the sas2ircu command of the SAS2IRCU utility on Oracle Solaris to check for a degraded hardware RAID volume and a faulty HDD/SSD.
root# ./sas2ircu 0 display
LSI Corporation SAS2 IR Configuration Utility.
Version 19.00.00.00 (2014.03.17)
   (Omitted)
------------------------------------------------------------------------

IR Volume information  (*1)
------------------------------------------------------------------------

   (Omitted)
IR volume 2
  Volume ID                               : 286
  Volume Name                             : 0
  Status of volume                        : Degraded (DGD) (*2)
  Volume wwid                             : 01a0a262cfe15e62
  RAID level                              : RAID1
  Size (in MB)                            : 571250
  Physical hard disks                     :
  PHY[0] Enclosure#/Slot#                 : 2:0
  PHY[1] Enclosure#/Slot#                 : 2:1
------------------------------------------------------------------------

Physical device information  (*3)
------------------------------------------------------------------------

Initiator at ID #0
   (Omitted)
Device is a Hard disk
  Enclosure #                             : 2
  Slot #                                  : 0
  SAS Address                             : 5000039-7-584a-dada
  State                                   : Failed (FLD) (*4)
  Size (in MB)/(in sectors)               : 572325/1172123567
  Manufacturer                            : TOSHIBA
  Model Number                            : AL13SEB600AL14SE
  Firmware Revision                       : 3703
  Serial No                               : X6N0A01PF7TD
  GUID                                    : 50000397584adad9
  Protocol                                : SAS
  Drive Type                              : SAS_HDD
   (Omitted)
*1 RAID volume information
*2 Degraded RAID volume
*3 Physical device information
*4 Indicates a failure
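A script watching for RAID degradation could extract these two fields from the display output. The following is a minimal Python sketch (a hypothetical helper, assuming the field labels are exactly those shown in the example above):

```python
def scan_sas2ircu(display_output):
    """Collect RAID volume states and physical-drive states from
    `sas2ircu <controller> display` output."""
    volume_states, drive_states = [], []
    for line in display_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Status of volume"):
            volume_states.append(stripped.split(":", 1)[1].strip())
        elif stripped.startswith("State"):
            drive_states.append(stripped.split(":", 1)[1].strip())
    return volume_states, drive_states

# Sample fragment in the format shown above (hypothetical captured output).
sample = """\
  Status of volume                        : Degraded (DGD)
  PHY[0] Enclosure#/Slot#                 : 2:0
  State                                   : Failed (FLD)
"""
volumes, drives = scan_sas2ircu(sample)
print(volumes, drives)
# → ['Degraded (DGD)'] ['Failed (FLD)']
```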
Checking the status of a PCI expansion unit
If a PCI expansion unit is connected, execute the ioxadm command from the XSCF shell to check the status of the PCI expansion unit.
- ioxadm command
Execute the ioxadm command to check the environmental conditions (temperature, voltage, etc.) or LED indications for the PCI expansion unit.
  1. Log in to the XSCF shell.
  2. Execute the ioxadm command to check the environmental conditions of the specified PCI expansion unit.
    To specify a PCI expansion unit, enter the serial number of the PCI expansion unit after determining it by executing the ioxadm list command.
    The following example shows the environmental conditions for PCIBOX#2008. "2008" is the last four digits of the serial number of the PCI expansion unit.
XSCF> ioxadm env -te PCIBOX#2008
Location                     Sensor             Value Resolution Units
PCIBOX#2008                  AIRFLOW          180.000      0.000 CHM
PCIBOX#2008                  P_CONSUMPTION     68.000      0.000 W
PCIBOX#2008/PSU#0            FAN             3936.000      0.000 RPM
PCIBOX#2008/PSU#1            FAN             3584.000      0.000 RPM
PCIBOX#2008/FAN#0            FAN             3374.000      0.000 RPM
PCIBOX#2008/FAN#1            FAN             3374.000      0.000 RPM
PCIBOX#2008/FAN#2            FAN             3374.000      0.000 RPM
PCIBOX#2008/IOBT             T_INTAKE          26.000      0.000 C
PCIBOX#2008/IOBT             T_PART_NO0        31.500      0.000 C
PCIBOX#2008/IOBT             T_PART_NO1        30.750      0.000 C
PCIBOX#2008/IOBT             T_PART_NO2        31.500      0.000 C
PCIBOX#2008/IOBT             V_12_0V           12.069      0.000 V
PCIBOX#2008/IOBT             V_3_3_NO0          3.293      0.000 V
PCIBOX#2008/IOBT             V_3_3_NO1          3.295      0.000 V
PCIBOX#2008/IOBT             V_3_3_NO2          3.291      0.000 V
PCIBOX#2008/IOBT             V_3_3_NO3          3.300      0.000 V
PCIBOX#2008/IOBT             V_1_8V             1.804      0.000 V
PCIBOX#2008/IOBT             V_0_9V             0.900      0.000 V
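The table has five fixed columns (Location, Sensor, Value, Resolution, Units), so it splits cleanly on whitespace. The following is a minimal parsing sketch (a hypothetical helper), assuming the output above has been captured to a string:

```python
def parse_ioxadm_env(env_output):
    """Parse `ioxadm env -te` output into (location, sensor, value, units)
    tuples, skipping the header row."""
    readings = []
    for line in env_output.splitlines()[1:]:  # first line is the header
        parts = line.split()
        if len(parts) == 5:
            location, sensor, value, _resolution, units = parts
            readings.append((location, sensor, float(value), units))
    return readings

# Sample fragment in the format shown above (hypothetical captured output).
sample = """\
Location                     Sensor             Value Resolution Units
PCIBOX#2008                  AIRFLOW          180.000      0.000 CHM
PCIBOX#2008/PSU#0            FAN             3936.000      0.000 RPM
"""
print(parse_ioxadm_env(sample))
# → [('PCIBOX#2008', 'AIRFLOW', 180.0, 'CHM'), ('PCIBOX#2008/PSU#0', 'FAN', 3936.0, 'RPM')]
```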
Checking log information
Execute the showlogs command to check error log information.
  1. Log in to the XSCF shell.
  2. Execute the showlogs command to check the log information.
    The log information is listed in date order, with the oldest entries appearing first.
    The following example shows that an Alarm occurred in PSU#1 at 12:45:31 on Oct 20, and the status changed to Warning at 15:45:31 on the same day.
XSCF> showlogs error
Date: Oct 20 12:45:31 JST 2012
    Code: 00112233-445566778899aabbcc-8899aabbcceeff0011223344
    Status: Alarm                  Occurred: Oct 20 12:45:31.000 JST 2012
    FRU: /PSU#1
    Msg: ACFAIL occurred (ACS=3)(FEP type = A1)
Date: Oct 20 15:45:31 JST 2012
    Code: 00112233-445566778899aabbcc-8899aabbcceeff0011223344
    Status: Warning                Occurred: Oct 20 15:45:31.000 JST 2012
    FRU: /PSU#1
    Msg: ACFAIL occurred (ACS=3)(FEP type = A1)
Table 4-4 lists the logs that can be displayed by the showlogs command with an operand specified.
Table 4-4  Operands of the showlogs command and the log to be displayed
Operand Description
error Lists the error log.
event Lists the event log.
power Lists the power log.
env Lists the temperature history.
monitor Lists the monitoring message log.
console Lists the console message log.
ipl Lists the IPL message log.
panic Lists the panic message log.
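When forwarding the error log to a monitoring system, the entries shown above can be grouped per "Date:" header. The following is a minimal sketch (a hypothetical helper; only the fields shown in the example are extracted):

```python
import re

def parse_error_log(log_output):
    """Group `showlogs error` output into one dict per entry, keeping the
    Date, Status (severity word only), FRU, and Msg fields."""
    entries = []
    for line in log_output.splitlines():
        if line.startswith("Date:"):
            entries.append({"Date": line[len("Date:"):].strip()})
        elif entries:
            m = re.match(r"\s*(Status|FRU|Msg):\s*(.*)", line)
            if m:
                key, value = m.groups()
                if key == "Status":
                    value = value.split()[0]  # drop the trailing "Occurred: ..." part
                entries[-1][key] = value.strip()
    return entries

# Sample fragment in the format shown above (hypothetical captured output).
sample = """\
Date: Oct 20 12:45:31 JST 2012
    Status: Alarm                  Occurred: Oct 20 12:45:31.000 JST 2012
    FRU: /PSU#1
    Msg: ACFAIL occurred (ACS=3)(FEP type = A1)
Date: Oct 20 15:45:31 JST 2012
    Status: Warning                Occurred: Oct 20 15:45:31.000 JST 2012
    FRU: /PSU#1
    Msg: ACFAIL occurred (ACS=3)(FEP type = A1)
"""
for entry in parse_error_log(sample):
    print(entry["Date"], entry["Status"], entry["FRU"])
# → Oct 20 12:45:31 JST 2012 Alarm /PSU#1
# → Oct 20 15:45:31 JST 2012 Warning /PSU#1
```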
Checking the messages output by the predictive self-repairing tool
Check the messages output from the Oracle Solaris Fault Manager predictive self-repairing tool, running on Oracle Solaris. Oracle Solaris Fault Manager supports the following functions:
  1. Receives telemetry information about errors.
  2. Performs troubleshooting.
  3. Disables the FRUs that have experienced errors.
  4. Turns on the LED of a FRU that has experienced an error and displays the details in a system console message.
Table 4-5 lists typical messages that are generated if an error occurs. These messages indicate that the fault has already been diagnosed. Any corrective action that the system can take has already been taken, and if the system is still running, corrective action continues to be applied.
Messages are displayed on the console and are recorded in the /var/adm/messages file.
Table 4-5  Predictive self-repairing messages
Output Displayed Description
EVENT-TIME: Thu Apr 19 10:48:39 JST 2012 EVENT-TIME: Time stamp of the diagnosis
PLATFORM: ORCL,SPARC64-X, CSN: PP115300MX, HOSTNAME: 4S-LGA12-D0 PLATFORM: Description of the chassis in which the error occurred
SOURCE: eft, REV: 1.16 SOURCE: Information on the diagnosis engine used to identify the error
EVENT-ID: fcbb42a5-47c3-c9c5-f0b0-f782d69afb01 EVENT-ID: Universally unique event ID for this error
DESC: The diagnosis engine encountered telemetry from the listed devices for which it was unable to perform a diagnosis - ereport.io.pciex.rc.epkt@chassis0/cpuboard0/chip0/hostbridge0/pciexrc0 class and path are incompatible. DESC: Basic description of the error
AUTO-RESPONSE: Error reports have been logged for examination. AUTO-RESPONSE: What the system has done (if anything) to alleviate any subsequent problems
IMPACT: Automated diagnosis and response for these events will not occur. IMPACT: Description of the assumed impact of the failure
REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this event. Use 'fmdump -eV' to view the unexpected telemetry. Please refer to the associated reference document at http://support.oracle.com/msg/SUNOS-8000-J0 for the latest service procedures and policies regarding this diagnosis. REC-ACTION: Brief description of the corrective action the system administrator should apply
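Because these messages are also recorded in the /var/adm/messages file, the labeled fields can be pulled out mechanically, for example to extract the EVENT-ID for a service call. The following is a minimal sketch (a hypothetical helper; the labels are those listed in Table 4-5):

```python
FMA_LABELS = ("EVENT-TIME", "PLATFORM", "SOURCE", "EVENT-ID",
              "DESC", "AUTO-RESPONSE", "IMPACT", "REC-ACTION")

def parse_fma_message(message_text):
    """Map each labeled line of a Fault Manager console message to its value."""
    fields = {}
    for line in message_text.splitlines():
        stripped = line.strip()
        for label in FMA_LABELS:
            if stripped.startswith(label + ":"):
                fields[label] = stripped[len(label) + 1:].strip()
                break
    return fields

# Sample fragment in the format shown in Table 4-5 (hypothetical captured output).
sample = """\
EVENT-TIME: Thu Apr 19 10:48:39 JST 2012
EVENT-ID: fcbb42a5-47c3-c9c5-f0b0-f782d69afb01
IMPACT: Automated diagnosis and response for these events will not occur.
"""
print(parse_fma_message(sample)["EVENT-ID"])
# → fcbb42a5-47c3-c9c5-f0b0-f782d69afb01
```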
Identifying the location of the chassis requiring maintenance
Execute the setlocator command to identify the location of the chassis requiring maintenance by causing the CHECK LED on the operation panel and the CHECK LED (locator) on the rear panel to blink.
  1. Log in to the XSCF shell.
  2. Execute the setlocator command to identify the location of the chassis requiring maintenance, by causing the CHECK LED of the chassis to blink.
    The CHECK LEDs on the operation and rear panels blink.
    The chassis requiring maintenance in the following execution example is the master chassis.
XSCF> setlocator blink
    If the chassis requiring maintenance is not the master chassis, execute "setlocator -b bb_id blink".
    For the locations of the CHECK LEDs and details on how to check them, see "2.4  Checking the LED Indications."