4800 keeps going down; fault messages below, please take a look. I suspect a problem with the SB board or the fans

ilinch

Serial console output:
You have new mail.
Jan 31 00:04:58 SF4800-2-SC0 Platform.SC: Notice: /N0/SB4 temperature is approaching warning limit of 100C.
Jan 31 00:04:58 SF4800-2-SC0 Platform.SC: /N0/SB4 SDC 0 Temp. 0 value: 96 Degrees C
EJMAIN2% Jan 31 00:10:53 SF4800-2-SC0 Platform.SC: WARNING: /N0/SB4 temperature is approaching max limit of 93C
Jan 31 00:10:53 SF4800-2-SC0 Platform.SC: /N0/SB4 CPU 3 Temp. 0 value: 88 Degrees C
Jan 31 00:10:53 SF4800-2-SC0 Platform.SC: Check for abnormal environmental operating conditions.
Jan 31 00:10:53 SF4800-2-SC0 Platform.SC: /N0/SB4, sensor status, outside acceptable limits (7,1,0x204040603030000)
Jan 31 00:12:31 SF4800-2-SC0 Platform.SC: WARNING: /N0/SB4 temperature is approaching max limit of 93C
Jan 31 00:12:31 SF4800-2-SC0 Platform.SC: /N0/SB4 CPU 0 Temp. 0 value: 88 Degrees C
Jan 31 00:12:31 SF4800-2-SC0 Platform.SC: Check for abnormal environmental operating conditions.
Jan 31 00:12:31 SF4800-2-SC0 Platform.SC: /N0/SB4, sensor status, outside acceptable limits (7,1,0x204040600030000)
Jan 31 00:22:06 SF4800-2-SC0 Platform.SC: WARNING: /N0/SB4 temperature is approaching max limit of 105C
Jan 31 00:22:06 SF4800-2-SC0 Platform.SC: /N0/SB4 SDC 0 Temp. 0 value: 101 Degrees C
Jan 31 00:22:06 SF4800-2-SC0 Platform.SC: Check for abnormal environmental operating conditions.
Jan 31 00:22:06 SF4800-2-SC0 Platform.SC: /N0/SB4, sensor status, outside acceptable limits (7,1,0x204040200030000)
Jan 31 00:25:59 SF4800-2-SC0 Platform.SC: WARNING: /N0/SB4 temperature is approaching max limit of 93C
Jan 31 00:25:59 SF4800-2-SC0 Platform.SC: /N0/SB4 CPU 2 Temp. 0 value: 88 Degrees C
Jan 31 00:25:59 SF4800-2-SC0 Platform.SC: Check for abnormal environmental operating conditions.
Jan 31 00:25:59 SF4800-2-SC0 Platform.SC: /N0/SB4, sensor status, outside acceptable limits (7,1,0x204040602030000)
logout

EJMAIN2 console login: nari
Password:
Last login: Wed Jan 30 16:35:15 from ejop1
Sun Microsystems Inc.   SunOS 5.9       Generic May 2002
You have new mail.
EJMAIN2% Jan 31 12:05:49 SF4800-2-SC0 Platform.SC: Notice: Shutting down /N0/SB4 as temperature exceeds max limit of 93C
Jan 31 12:05:49 SF4800-2-SC0 Platform.SC: /N0/SB4 CPU 3 Temp. 0 value: 93 Degrees C
Jan 31 12:05:49 SF4800-2-SC0 Platform.SC: Check for abnormal environmental operating conditions.
Jan 31 12:05:50 SF4800-2-SC0 Platform.SC: /N0/SB4: has been queued for power off.
SF4800-2-SC0:A> Jan 31 12:05:50 SF4800-2-SC0 Domain-A.SC: /N0/SB4: powering off active board
Jan 31 12:05:50 SF4800-2-SC0 Domain-A.SC: changing domain A keyswitch position to standby
Jan 31 12:05:50 SF4800-2-SC0 Platform.SC: /N0/SB4, sensor status, shutdown (7,3,0x204040603030000)
Jan 31 12:06:01 SF4800-2-SC0 Platform.SC: /N0/SB4: powered off
Jan 31 12:08:13 SF4800-2-SC0 Platform.SC: FT0, fan speed, Low (4,1)
Jan 31 12:08:13 SF4800-2-SC0 Platform.SC: FT2, fan speed, Low (4,1)
Jan 31 12:08:13 SF4800-2-SC0 Platform.SC: FT1, fan speed, Low (4,1)

You have new mail.
EJMAIN2% Feb 01 10:15:35 SF4800-2-SC0 Platform.SC: Notice: /N0/SB4 temperature is approaching warning limit of 100C.
Feb 01 10:15:35 SF4800-2-SC0 Platform.SC: /N0/SB4 SDC 0 Temp. 0 value: 96 Degrees C
Feb 01 10:20:07 SF4800-2-SC0 Platform.SC: Notice: /N0/SB4 temperature is approaching warning limit of 88C.
Feb 01 10:20:07 SF4800-2-SC0 Platform.SC: /N0/SB4 CPU 1 Temp. 0 value: 83 Degrees C
Feb 01 10:20:15 SF4800-2-SC0 Platform.SC: WARNING: /N0/SB4 temperature is approaching max limit of 93C
Feb 01 10:20:15 SF4800-2-SC0 Platform.SC: /N0/SB4 CPU 3 Temp. 0 value: 88 Degrees C
Feb 01 10:20:15 SF4800-2-SC0 Platform.SC: Check for abnormal environmental operating conditions.
Feb 01 10:20:15 SF4800-2-SC0 Platform.SC: /N0/SB4, sensor status, outside acceptable limits (7,1,0x204040603030000)
Feb 01 10:24:01 SF4800-2-SC0 Platform.SC: WARNING: /N0/SB4 temperature is approaching max limit of 93C
Feb 01 10:24:01 SF4800-2-SC0 Platform.SC: /N0/SB4 CPU 0 Temp. 0 value: 88 Degrees C
Feb 01 10:24:01 SF4800-2-SC0 Platform.SC: Check for abnormal environmental operating conditions.
Feb 01 10:24:01 SF4800-2-SC0 Platform.SC: /N0/SB4, sensor status, outside acceptable limits (7,1,0x204040600030000)
Feb 01 10:33:20 SF4800-2-SC0 Platform.SC: WARNING: /N0/SB4 temperature is approaching max limit of 105C
Feb 01 10:33:20 SF4800-2-SC0 Platform.SC: /N0/SB4 SDC 0 Temp. 0 value: 101 Degrees C
Feb 01 10:33:20 SF4800-2-SC0 Platform.SC: Check for abnormal environmental operating conditions.
Feb 01 10:33:20 SF4800-2-SC0 Platform.SC: WARNING: /N0/SB4 temperature is approaching max limit of 93C
Feb 01 10:33:20 SF4800-2-SC0 Platform.SC: /N0/SB4 CPU 2 Temp. 0 value: 88 Degrees C
Feb 01 10:33:20 SF4800-2-SC0 Platform.SC: Check for abnormal environmental operating conditions.
Feb 01 10:33:20 SF4800-2-SC0 Platform.SC: /N0/SB4, sensor status, outside acceptable limits (7,1,0x204040200030000)
Feb 01 10:33:20 SF4800-2-SC0 Platform.SC: /N0/SB4, sensor status, outside acceptable limits (7,1,0x204040602030000)
Feb 01 14:28:13 SF4800-2-SC0 Platform.SC: Notice: Shutting down /N0/SB4 as temperature exceeds max limit of 93C
Feb 01 14:28:13 SF4800-2-SC0 Platform.SC: /N0/SB4 CPU 3 Temp. 0 value: 93 Degrees C
Feb 01 14:28:13 SF4800-2-SC0 Platform.SC: Check for abnormal environmental operating conditions.
SF4800-2-SC0:A> Feb 01 14:28:13 SF4800-2-SC0 Platform.SC: /N0/SB4: has been queued for power off.
Feb 01 14:28:13 SF4800-2-SC0 Domain-A.SC: /N0/SB4: powering off active board
Feb 01 14:28:13 SF4800-2-SC0 Domain-A.SC: changing domain A keyswitch position to standby
Feb 01 14:28:13 SF4800-2-SC0 Platform.SC: /N0/SB4, sensor status, shutdown (7,3,0x204040603030000)
Feb 01 14:28:24 SF4800-2-SC0 Platform.SC: /N0/SB4: powered off
Feb 01 14:30:36 SF4800-2-SC0 Platform.SC: FT0, fan speed, Low (4,1)
Feb 01 14:30:36 SF4800-2-SC0 Platform.SC: FT2, fan speed, Low (4,1)
Feb 01 14:30:36 SF4800-2-SC0 Platform.SC: FT1, fan speed, Low (4,1)

[Last edited by ilinch on 2008-2-3 12:12]

ilinch
Output of prtdiag -v; no hardware errors found:

System Configuration:  Sun Microsystems  sun4u Sun Fire 4800
System clock frequency: 150 MHz
Memory size: 4096 Megabytes

========================= CPUs ===============================================

            CPU      Run    E$   CPU      CPU
FRU Name     ID      MHz    MB   Impl.    Mask
----------  -------  ----  ----  -------  ----
/N0/SB4/P0   16      1200   8.0  US-III+  11.0   
/N0/SB4/P1   17      1200   8.0  US-III+  11.0   
/N0/SB4/P2   18      1200   8.0  US-III+  11.0   
/N0/SB4/P3   19      1200   8.0  US-III+  11.0   

========================= Memory Configuration ===============================

                     Logical  Logical  Logical
               Port  Bank     Bank     Bank         DIMM    Interleave  Interleave
FRU Name        ID   Num      Size     Status       Size    Factor      Segment
-------------  ----  ----     ------   -----------  ------  ----------  ----------
/N0/SB4/P0/B0   16    0       512MB    pass          256MB     8-way       0
/N0/SB4/P0/B0   16    2       512MB    pass          256MB     8-way       0
/N0/SB4/P1/B0   17    0       512MB    pass          256MB     8-way       0
/N0/SB4/P1/B0   17    2       512MB    pass          256MB     8-way       0
/N0/SB4/P2/B0   18    0       512MB    pass          256MB     8-way       0
/N0/SB4/P2/B0   18    2       512MB    pass          256MB     8-way       0
/N0/SB4/P3/B0   19    0       512MB    pass          256MB     8-way       0
/N0/SB4/P3/B0   19    2       512MB    pass          256MB     8-way       0

========================= IO Cards =========================

                                Bus  Max                                             
            IO   Port Bus       Freq Bus  Dev,                                       
FRU Name    Type  ID  Side Slot MHz  Freq Func State Name                              Model
----------  ---- ---- ---- ---- ---- ---- ---- ----- --------------------------------  ----------------------
/N0/IB6/P0  PCI   24   B    0    33   33  1,0  ok    network-pci108e,abba.11           SUNW,pci-ce            
/N0/IB6/P0  PCI   24   A    3    66   66  1,0  ok    pci-pci8086,b154.0/network (netw+ pci-bridge              
/N0/IB6/P0  PCI   24   A    3    66   66  0,0  ok    network-pci108e,abba.20           SUNW,pci-ce            
/N0/IB6/P0  PCI   24   A    3    66   66  1,0  ok    network-pci108e,abba.20           SUNW,pci-ce            
/N0/IB6/P0  PCI   24   A    3    66   66  2,0  ok    scsi-pci1000,b.1000.1000.7/disk +                        
/N0/IB6/P0  PCI   24   A    3    66   66  2,1  ok    scsi-pci1000,b.1000.1000.7/disk +                        
/N0/IB6/P1  PCI   25   A    7    66   66  1,0  ok    SUNW,qlc-pci1077,2300.1077.106.1+ 0x106                  
/N0/IB8/P0  PCI   28   A    3    66   66  1,0  ok    SUNW,qlc-pci1077,2300.1077.106.1+ 0x106                  
/N0/IB8/P1  PCI   29   A    7    66   66  1,0  ok    network-pci108e,abba.11           SUNW,pci-ce            

========================= Active Boards for Domain ===========================

           Board        Receptacle    Occupant                                                        
FRU Name   Type         Status        Status        Condition Info                                    
---------  -----------  -----------   ------------  --------- ----------------------------------------
/N0/SB4    CPU_V2       connected     configured    ok        powered-on, assigned                  
/N0/IB6    PCI_I/O_Boa  connected     configured    ok        powered-on, assigned                  
/N0/IB8    PCI_I/O_Boa  connected     configured    ok        powered-on, assigned                  

========================= Available Boards/Slots for Domain ===========================

           Board        Receptacle    Occupant                                                        
FRU Name   Type         Status        Status        Condition Info                                    
---------  -----------  -----------   ------------  --------- ----------------------------------------
/N0/SB0    unknown      empty         unconfigured  unknown   assigned                              
/N0/SB2    unknown      empty         unconfigured  unknown   assigned                              

========================= Hardware Failures ==================================
No Hardware failures found in System

========================= HW Revisions =======================================

System PROM revisions:
----------------------
OBP 5.20.5 02/07/07 13:51

IO ASIC revisions:
------------------
                          Port
FRU Name    Model            ID  Status Version
----------- --------------- ---- ------ -------
/N0/IB6/P0  SUNW,schizo      24   ok     4      
/N0/IB6/P1  SUNW,schizo      25   ok     4      
/N0/IB8/P0  SUNW,schizo      28   ok     4      
/N0/IB8/P1  SUNW,schizo      29   ok     4      
/N0/IB6/P0  SUNW,sgsbbc      24   ok     2      
/N0/IB8/P0  SUNW,sgsbbc      28   ok     2

ilinch
This is all I see in messages, just these warnings:

Jan 22 17:30:38 EJMAIN1 genunix: [ID 408789 kern.warning] WARNING: ce0: fault detected external to device; service degraded
Jan 22 17:30:38 EJMAIN1 genunix: [ID 451854 kern.warning] WARNING: ce0: xcvr addr:0x00 - link down
Jan 22 17:30:38 EJMAIN1 in.routed[218]: [ID 238047 daemon.warning] interface ce0 to 172.16.0.130 turned off
Jan 22 17:30:38 EJMAIN1 genunix: [ID 408789 kern.notice] NOTICE: ce0: fault cleared external to device; service available
Jan 22 17:30:38 EJMAIN1 genunix: [ID 451854 kern.notice] NOTICE: ce0: xcvr addr:0x00 - link up 1000 Mbps full duplex
Jan 22 17:30:38 EJMAIN1 in.routed[218]: [ID 300549 daemon.warning] interface ce0 to 172.16.0.130 restored
Jan 22 17:30:53 EJMAIN1 genunix: [ID 408789 kern.warning] WARNING: ce3: fault detected external to device; service degraded
Jan 22 17:30:53 EJMAIN1 genunix: [ID 451854 kern.warning] WARNING: ce3: xcvr addr:0x00 - link down
Jan 22 17:30:53 EJMAIN1 in.routed[218]: [ID 238047 daemon.warning] interface ce3 to 172.16.1.2 turned off
Jan 22 17:30:53 EJMAIN1 genunix: [ID 408789 kern.notice] NOTICE: ce3: fault cleared external to device; service available
Jan 22 17:30:53 EJMAIN1 genunix: [ID 451854 kern.notice] NOTICE: ce3: xcvr addr:0x00 - link up 1000 Mbps full duplex
Jan 22 17:30:53 EJMAIN1 in.routed[218]: [ID 300549 daemon.warning] interface ce3 to 172.16.1.2 restored
Jan 22 17:30:54 EJMAIN1 genunix: [ID 408789 kern.warning] WARNING: ce3: fault detected external to device; service degraded
Jan 22 17:30:54 EJMAIN1 genunix: [ID 451854 kern.warning] WARNING: ce3: xcvr addr:0x00 - link down
Jan 22 17:30:54 EJMAIN1 in.routed[218]: [ID 238047 daemon.warning] interface ce3 to 172.16.1.2 turned off
Jan 22 17:30:54 EJMAIN1 genunix: [ID 408789 kern.notice] NOTICE: ce3: fault cleared external to device; service available
Jan 22 17:30:54 EJMAIN1 genunix: [ID 451854 kern.notice] NOTICE: ce3: xcvr addr:0x00 - link up 1000 Mbps full duplex
Jan 22 17:30:54 EJMAIN1 in.routed[218]: [ID 300549 daemon.warning] interface ce3 to 172.16.1.2 restored
Jan 22 17:30:55 EJMAIN1 genunix: [ID 408789 kern.warning] WARNING: ce3: fault detected external to device; service degraded
Jan 22 17:30:55 EJMAIN1 genunix: [ID 451854 kern.warning] WARNING: ce3: xcvr addr:0x00 - link down
Jan 22 17:30:55 EJMAIN1 in.routed[218]: [ID 238047 daemon.warning] interface ce3 to 172.16.1.2 turned off
Jan 22 17:30:55 EJMAIN1 genunix: [ID 408789 kern.notice] NOTICE: ce3: fault cleared external to device; service available
Jan 22 17:30:55 EJMAIN1 genunix: [ID 451854 kern.notice] NOTICE: ce3: xcvr addr:0x00 - link up 1000 Mbps full duplex
Jan 22 17:30:55 EJMAIN1 in.routed[218]: [ID 300549 daemon.warning] interface ce3 to 172.16.1.2 restored
Jan 22 17:33:32 EJMAIN1 genunix: [ID 408789 kern.warning] WARNING: ce0: fault detected external to device; service degraded
Jan 22 17:33:32 EJMAIN1 genunix: [ID 451854 kern.warning] WARNING: ce0: xcvr addr:0x00 - link down
Jan 22 17:33:32 EJMAIN1 in.routed[218]: [ID 238047 daemon.warning] interface ce0 to 172.16.0.130 turned off
Jan 22 17:33:32 EJMAIN1 genunix: [ID 408789 kern.warning] WARNING: ce3: fault detected external to device; service degraded
Jan 22 17:33:32 EJMAIN1 genunix: [ID 451854 kern.warning] WARNING: ce3: xcvr addr:0x00 - link down
Jan 22 17:33:32 EJMAIN1 in.routed[218]: [ID 238047 daemon.warning] interface ce3 to 172.16.1.2 turned off

ilinch
One last follow-up question:
why doesn't prtdiag -v show the fan status, or the temperature status?

柯雅
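The messages mainly say that the temperature of system board SB4 is too high and has exceeded the warning threshold. Connect to the main SC and run the following commands to check the detailed status: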

sc> showboards -v
sc> showenvironment -v
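
If you have saved the console log to a file, a small script can also summarize which SB4 sensors ran hottest. This is only a rough sketch: the function name `peak_temperatures` and the regular expression are my own assumptions, based purely on the line format in the log posted above, and are not part of any Sun tool.

```python
import re
from collections import defaultdict

# Matches reading lines of the form seen in the posted log, e.g.
#   "... Platform.SC: /N0/SB4 CPU 3 Temp. 0 value: 88 Degrees C"
# The pattern is an assumption derived from that sample output only.
READING = re.compile(
    r"(?P<board>/N\d+/SB\d+) (?P<sensor>\w+ \d+) "
    r"Temp\. \d+ value: (?P<temp>\d+) Degrees C"
)


def peak_temperatures(log_text):
    """Return {(board, sensor): highest reading in degrees C}."""
    peaks = defaultdict(int)
    for match in READING.finditer(log_text):
        key = (match["board"], match["sensor"])
        peaks[key] = max(peaks[key], int(match["temp"]))
    return dict(peaks)


# Three lines copied from the console output in the first post.
sample = """\
Jan 31 00:10:53 SF4800-2-SC0 Platform.SC: /N0/SB4 CPU 3 Temp. 0 value: 88 Degrees C
Jan 31 00:22:06 SF4800-2-SC0 Platform.SC: /N0/SB4 SDC 0 Temp. 0 value: 101 Degrees C
Jan 31 12:05:49 SF4800-2-SC0 Platform.SC: /N0/SB4 CPU 3 Temp. 0 value: 93 Degrees C
"""

for (board, sensor), temp in sorted(peak_temperatures(sample).items()):
    print(f"{board} {sensor}: peak {temp} C")
```

On the three sample lines this reports a peak of 93 C for CPU 3 and 101 C for SDC 0, which matches the shutdown message at the 93C max limit.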

tomboy
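The SB4 board is running too hot, which is what triggers the alarms. It may be caused by too much dust on the front dust filter restricting airflow; I have run into the same problem before. Cleaning the dust filter fixed it.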