V890 + ST6140 cluster reboot failure

lem0

Rebooting EAS002:
                                                                     
Rebooting with command: boot
Boot device: /pci@8,600000/SUNW,qlc@2/fp@0,0/disk@w500000e018062ad1,0:a  File and args:
SunOS Release 5.10 Version Generic_127111-06 64-bit
Copyright 1983-2007 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
WARNING: ce2: fault detected external to device; service degraded
WARNING: ce2: xcvr addr:0x01 - link down
NOTICE: ce2: fault cleared external to device; service available
NOTICE: ce2: xcvr addr:0x01 - link up 1000 Mbps full duplex
Hostname: EAS002
/usr/cluster/bin/scdidadm:  Could not stat "../../devices/scsi_vhci/ssd@g600a0b800032c60c0000041448585c4b:c,raw" - No such file or directory.
Warning: Path node loaded - "../../devices/scsi_vhci/ssd@g600a0b800032c60c0000041448585c4b:c,raw".
/usr/cluster/bin/scdidadm:  Could not stat "../../devices/scsi_vhci/ssd@g600a0b800032c6be0000060a48585b78:c,raw" - No such file or directory.
Warning: Path node loaded - "../../devices/scsi_vhci/ssd@g600a0b800032c6be0000060a48585b78:c,raw".
/usr/cluster/bin/scdidadm:  Could not stat "../../devices/scsi_vhci/ssd@g600a0b800032c60c0000041348585af9:c,raw" - No such file or directory.
Warning: Path node loaded - "../../devices/scsi_vhci/ssd@g600a0b800032c60c0000041348585af9:c,raw".
/usr/cluster/bin/scdidadm:  Could not stat "../../devices/scsi_vhci/ssd@g600a0b800032c6be00000608485859e4:c,raw" - No such file or directory.
Warning: Path node loaded - "../../devices/scsi_vhci/ssd@g600a0b800032c6be00000608485859e4:c,raw".

Booting as part of a cluster
NOTICE: CMM: Node EAS001 (nodeid = 1) with votecount = 1 added.
NOTICE: CMM: Node EAS002 (nodeid = 2) with votecount = 1 added.
NOTICE: CMM: Quorum device 2 (/dev/did/rdsk/d12s2) added; votecount = 1, bitmask of nodes with configured paths = 0x3.
WARNING: CMM: Open failed for quorum device 2 with gdevname '/dev/did/rdsk/d12s2'.
NOTICE: clcomm: Adapter ce3 constructed
NOTICE: clcomm: Adapter ce2 constructed
NOTICE: CMM: Node EAS002: attempting to join cluster.
NOTICE: CMM: Node EAS001 (nodeid: 1, incarnation #: 1213861949) has become reachable.
NOTICE: clcomm: Path EAS002:ce3 - EAS001:ce3 online
NOTICE: clcomm: Path EAS002:ce2 - EAS001:ce2 online
NOTICE: CMM: Cluster has reached quorum.
NOTICE: CMM: Node EAS001 (nodeid = 1) is up; new incarnation number = 1213861949.
NOTICE: CMM: Node EAS002 (nodeid = 2) is up; new incarnation number = 1213862900.
NOTICE: CMM: Cluster members: EAS001 EAS002.
WARNING: Received non interrupt heartbeat on EAS002:ce3 - EAS001:ce3 - path timeouts are likely.
WARNING: Received non interrupt heartbeat on EAS002:ce2 - EAS001:ce2 - path timeouts are likely.
NOTICE: CMM: node reconfiguration #8 completed.
NOTICE: CMM: Node EAS002: joined cluster.
ip: joining multicasts failed (18) on clprivnet0 - will use link layer broadcasts for multicast
/usr/cluster/bin/scdidadm:  Could not stat "../../devices/scsi_vhci/ssd@g600a0b800032c60c0000041448585c4b:c,raw" - No such file or directory.
Warning: Path node loaded - "../../devices/scsi_vhci/ssd@g600a0b800032c60c0000041448585c4b:c,raw".
/usr/cluster/bin/scdidadm:  Could not stat "../../devices/scsi_vhci/ssd@g600a0b800032c6be0000060a48585b78:c,raw" - No such file or directory.
Warning: Path node loaded - "../../devices/scsi_vhci/ssd@g600a0b800032c6be0000060a48585b78:c,raw".
/usr/cluster/bin/scdidadm:  Could not stat "../../devices/scsi_vhci/ssd@g600a0b800032c60c0000041348585af9:c,raw" - No such file or directory.
Warning: Path node loaded - "../../devices/scsi_vhci/ssd@g600a0b800032c60c0000041348585af9:c,raw".
/usr/cluster/bin/scdidadm:  Could not stat "../../devices/scsi_vhci/ssd@g600a0b800032c6be00000608485859e4:c,raw" - No such file or directory.
Warning: Path node loaded - "../../devices/scsi_vhci/ssd@g600a0b800032c6be00000608485859e4:c,raw".
/dev/md/rdsk/d50 is clean
/dev/md/rdsk/d40 is clean
/dev/md/rdsk/d70 is clean

EAS002 console login: obtaining access to all attached disks
Jun 19 16:09:17 EAS002 login: ROOT LOGIN /dev/pts/1 FROM 192.168.88.90

At this point EAS001 is fine, and after the reboot scstat also looks normal.

Rebooting EAS001:

EAS001 console login: /etc/rc0.d/K05stoprgm: Calling scswitch -S (evacuate)
Jun 19 16:17:41 EAS001 ip: TCP_IOC_ABORT_CONN: local = 192.168.088.100:0, remote = 000.000.000.000:0, start = -2, end = 6
Jun 19 16:17:41 EAS001 ip: TCP_IOC_ABORT_CONN: local = 192.168.088.101:0, remote = 000.000.000.000:0, start = -2, end = 6
Jun 19 16:17:41 EAS001 ip: TCP_IOC_ABORT_CONN: aborted 0 connection
/etc/rc0.d/K05stoprgm: disabling failfasts
svc.startd: The system is coming down.  Please wait.
svc.startd: 132 system services are now being stopped.
Jun 19 16:17:41 EAS001 last message repeated 1 time
Jun 19 16:17:45 EAS001 cl_eventlogd[1178]: Going down on signal 15.
Jun 19 16:18:02 EAS001 syslogd: going down on signal 15
Jun 19 16:18:02 rpc.metad: Terminated
Jun 19 16:18:23 Cluster.RGM.fed: SCSLM thread WARNING pools facility is disabled
umount: /global/.devices/node@2 busy
umount: /global/.devices/node@1 busy
svc.startd: The system is down.
syncing file systems... done
WARNING: CMM: Node being shut down.
rebooting...
Resetting ...
Software Reset

Enabling system bus....... Done
Initializing CPUs......... Done
Initializing boot memory.. Done
Initializing OpenBoot
Probing system devices
ChassisSerialNumber 014311
Probing I/O buses
screen not found.
keyboard not found.
Keyboard not present.  Using ttya for input and output.
Probing system devices
ChassisSerialNumber 014311
Probing I/O buses

Sun Fire V890, No Keyboard
Copyright 2005 Sun Microsystems, Inc.  All rights reserved.
OpenBoot 4.18.11, 16384 MB memory installed, Serial #75878574.
Ethernet address 0:14:4f:85:d0:ae, Host ID: 8485d0ae.
                                                                     
Rebooting with command: boot
Boot device: disk  File and args:
SunOS Release 5.10 Version Generic_127111-06 64-bit
Copyright 1983-2007 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
Hostname: EAS001
WARNING: ce2: fault detected external to device; service degraded
WARNING: ce2: xcvr addr:0x01 - link down
WARNING: ce3: fault detected external to device; service degraded
WARNING: ce3: xcvr addr:0x01 - link down
NOTICE: ce3: fault cleared external to device; service available
NOTICE: ce3: xcvr addr:0x01 - link up 1000 Mbps full duplex
NOTICE: ce2: fault cleared external to device; service available
NOTICE: ce2: xcvr addr:0x01 - link up 1000 Mbps full duplex
/usr/cluster/bin/scdidadm:  Could not stat "../../devices/scsi_vhci/ssd@g600a0b800032c60c0000041448585c4b:c,raw" - No such file or directory.
Warning: Path node loaded - "../../devices/scsi_vhci/ssd@g600a0b800032c60c0000041448585c4b:c,raw".
/usr/cluster/bin/scdidadm:  Could not stat "../../devices/scsi_vhci/ssd@g600a0b800032c6be0000060a48585b78:c,raw" - No such file or directory.
Warning: Path node loaded - "../../devices/scsi_vhci/ssd@g600a0b800032c6be0000060a48585b78:c,raw".
/usr/cluster/bin/scdidadm:  Could not stat "../../devices/scsi_vhci/ssd@g600a0b800032c60c0000041348585af9:c,raw" - No such file or directory.
Warning: Path node loaded - "../../devices/scsi_vhci/ssd@g600a0b800032c60c0000041348585af9:c,raw".
/usr/cluster/bin/scdidadm:  Could not stat "../../devices/scsi_vhci/ssd@g600a0b800032c6be00000608485859e4:c,raw" - No such file or directory.
Warning: Path node loaded - "../../devices/scsi_vhci/ssd@g600a0b800032c6be00000608485859e4:c,raw".
Booting as part of a cluster
NOTICE: CMM: Node EAS001 (nodeid = 1) with votecount = 1 added.
NOTICE: CMM: Node EAS002 (nodeid = 2) with votecount = 1 added.
NOTICE: CMM: Quorum device 2 (/dev/did/rdsk/d12s2) added; votecount = 1, bitmask of nodes with configured paths = 0x3.
WARNING: CMM: Open failed for quorum device 2 with gdevname '/dev/did/rdsk/d12s2'.
NOTICE: clcomm: Adapter ce3 constructed
NOTICE: clcomm: Adapter ce2 constructed
NOTICE: CMM: Node EAS001: attempting to join cluster.
NOTICE: CMM: Node EAS002 (nodeid: 2, incarnation #: 1213863689) has become reachable.
NOTICE: clcomm: Path EAS001:ce3 - EAS002:ce3 online
NOTICE: CMM: Cluster has reached quorum.
NOTICE: CMM: Node EAS001 (nodeid = 1) is up; new incarnation number = 1213863656.
NOTICE: CMM: Node EAS002 (nodeid = 2) is up; new incarnation number = 1213863689.
NOTICE: CMM: Cluster members: EAS001 EAS002.
NOTICE: clcomm: Path EAS001:ce2 - EAS002:ce2 online
NOTICE: CMM: node reconfiguration #1 completed.
NOTICE: CMM: Node EAS001: joined cluster.
WARNING: Received non interrupt heartbeat on EAS001:ce3 - EAS002:ce3 - path timeouts are likely.
WARNING: Received non interrupt heartbeat on EAS001:ce2 - EAS002:ce2 - path timeouts are likely.
ip: joining multicasts failed (18) on clprivnet0 - will use link layer broadcasts for multicast
/dev/md/rdsk/d50 is clean
/dev/md/rdsk/d40 is clean
/dev/md/rdsk/d70 is clean

At this point EAS002 rebooted as well.
root@EAS002 # scstat -D

-- Device Group Servers --

                         Device Group        Primary             Secondary
                         ------------        -------             ---------
  Device group servers:  oraset              -                   -


-- Device Group Status --

                              Device Group        Status              
                              ------------        ------              
  Device group status:        oraset              Offline


-- Multi-owner Device Groups --

                              Device Group        Online Status
                              ------------        -------------
newfs /dev/did/rdsk/d11s2  
/dev/did/rdsk/d11s2: No such device or address
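(d11 is a volume on the array; the other array volumes behave the same way.)
However, format and scdidadm -L can both see all the volumes, and format can even partition them.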

I have to rerun the following on both machines before the devices are recognized again:
# devfsadm -Cv
# scgdevs
# scdidadm -c
# scdidadm -C
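Before rerunning that cleanup, it can help to list exactly which device paths scdidadm complains about at boot, so they can be compared with what format and scdidadm -L report. Below is a minimal sketch, assuming the console output has been saved as text. The names BOOT_LOG, unstatable_paths and guids are made up for illustration, and the sample lines are copied from the boot messages earlier in this thread.

```python
import re

# Sample lines copied from the boot messages earlier in this thread.
BOOT_LOG = '''\
/usr/cluster/bin/scdidadm:  Could not stat "../../devices/scsi_vhci/ssd@g600a0b800032c60c0000041448585c4b:c,raw" - No such file or directory.
Warning: Path node loaded - "../../devices/scsi_vhci/ssd@g600a0b800032c60c0000041448585c4b:c,raw".
/usr/cluster/bin/scdidadm:  Could not stat "../../devices/scsi_vhci/ssd@g600a0b800032c6be0000060a48585b78:c,raw" - No such file or directory.
/usr/cluster/bin/scdidadm:  Could not stat "../../devices/scsi_vhci/ssd@g600a0b800032c60c0000041448585c4b:c,raw" - No such file or directory.
'''

# Matches the scdidadm error line and captures the quoted device path.
STAT_ERR = re.compile(r'scdidadm:\s+Could not stat "([^"]+)"')
# Captures the hex GUID that follows "ssd@g" in a scsi_vhci path.
GUID = re.compile(r"ssd@g([0-9a-f]+):")


def unstatable_paths(log):
    """Return each path scdidadm failed to stat, once, in first-seen order."""
    seen = []
    for line in log.splitlines():
        match = STAT_ERR.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


def guids(paths):
    """Extract the LUN GUID from each scsi_vhci path that contains one."""
    found = []
    for path in paths:
        match = GUID.search(path)
        if match:
            found.append(match.group(1))
    return found


if __name__ == "__main__":
    for guid in guids(unstatable_paths(BOOT_LOG)):
        print(guid)
```

The script only summarizes the log; it does not tell you whether a path is really stale. That still has to be checked on the nodes themselves.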
root@EAS002 # scstat -q

-- Quorum Summary --

  Quorum votes possible:      3
  Quorum votes needed:        2
  Quorum votes present:       2


-- Quorum Votes by Node --

                    Node Name           Present Possible Status
                    ---------           ------- -------- ------
  Node votes:       EAS001              1        1       Online
  Node votes:       EAS002              1        1       Online


-- Quorum Votes by Device --

                    Device Name         Present Possible Status
                    -----------         ------- -------- ------
  Device votes:     /dev/did/rdsk/d12s2 0        1       Offline

The quorum device is lost too, so it has to be removed and re-created. Can anyone share a fix?
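To show why an offline quorum device matters here, the vote counts in the scstat -q output above can be worked through by hand. The sketch below assumes the usual simple-majority rule (votes needed = possible // 2 + 1), which matches the "3 possible, 2 needed" summary shown. The function names and the trimmed SCSTAT_Q sample are illustrative only, not part of any Sun Cluster tool.

```python
import re

# Vote rows trimmed from the scstat -q output shown above.
SCSTAT_Q = """\
  Node votes:       EAS001              1        1       Online
  Node votes:       EAS002              1        1       Online
  Device votes:     /dev/did/rdsk/d12s2 0        1       Offline
"""

ROW = re.compile(r"^\s*(?:Node|Device) votes:\s+(\S+)\s+(\d+)\s+(\d+)\s+(\S+)\s*$")


def parse_votes(text):
    """Return {name: (present, possible, status)} for node and device rows."""
    votes = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            name, present, possible, status = match.groups()
            votes[name] = (int(present), int(possible), status)
    return votes


def votes_needed(possible):
    """Simple majority of all configured votes (assumed rule)."""
    return possible // 2 + 1


def survives_loss_of(votes, name):
    """Would the remaining members still hold a majority if `name` left?"""
    possible = sum(p for _, p, _ in votes.values())
    present = sum(pr for n, (pr, _, _) in votes.items() if n != name)
    return present >= votes_needed(possible)


if __name__ == "__main__":
    votes = parse_votes(SCSTAT_Q)
    for node in ("EAS001", "EAS002"):
        print(node, "can leave safely:", survives_loss_of(votes, node))
```

With the device contributing 0 of its 1 vote, taking either node down leaves 1 vote present against 2 needed. If the device were contributing its vote, the surviving node plus the device would still reach 2.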

lem0
I rebooted one node again while watching the other node's console. It turns out the quorum device was lost, and that crashed the system.
Notifying cluster that this node is panicking
NOTICE: clcomm: Path EAS002:ce2 - EAS001:ce2 being drained
NOTICE: clcomm: Path EAS002:ce3 - EAS001:ce3 being drained
TCP_IOC_ABORT_CONN: local = 000.000.000.000:0, remote = 172.016.004.001:0, start = -2, end = 6

panic[cpu17]/thread=30003d9dc60: CMM: Cluster lost operational quorum; aborting.

000002a102f87550 cl_runtime:__1cZsc_syslog_msg_log_no_args6Fpviipkc0_nZsc_syslog_msg_status_enum__+30 (6000f17f000, 3, 0, 43, 2a102f87750, 7054224c)
  %l0-3: 0000000070541e28 0000000000000047 000006000dc83eeb 0000000000000047
  %l4-7: 0000000003bf8ce2 000006000dc758be 0000000000000000 00000000704c7000
000002a102f87600 cl_runtime:__1cCosNsc_syslog_msgDlog6MiipkcE_nZsc_syslog_msg_status_enum__+1c (6000e2236d0, 3, 0, 7054224c, 1, 2a102f87740)
  %l0-3: 0000000000000009 000000007057dcdc 0000000000000001 0000000000000000
  %l4-7: 0000000000000002 000006000e01b550 0000000000000000 0000000000000001
000002a102f876b0 cl_haci:__1cOautomaton_implbAstate_machine_qcheck_state6M_nVcmm_automaton_event_t__+4a4 (6000de7e008, de, 6000de7ee40, 6000de7ecd8, de0, 70542192)
  %l0-3: 00000000704f2000 0000000070542000 00000000000704f2 0000000000070400
  %l4-7: 0000000070541000 0000000000070541 0000000000070400 000006000de7fc20
000002a102f87930 cl_haci:__1cIcmm_implStransitions_thread6M_v_+f4 (6000de7e008, 1, 1, 6000de7e9d8, 1d354, 926a6e8e4c)
  %l0-3: 000006000de7e078 000006000de9b5a4 000000000001d59c 000000000001d400
  %l4-7: 000000000000002c 0000000000000004 000000000001d354 000006000de7e498
000002a102f879e0 cl_orb:cllwpwrapper+c4 (2a102f87b70, 7b2781f8, 2, 2, 2, 0)
  %l0-3: 0000000000001800 0000000000000007 fffffffffffffffd 000000000000003a
  %l4-7: 00000000704f5000 00000000000704f5 0000000000070400 00000000018bf800
000002a102f87ac0 unix:___const_seg_900002901+3a2c (2a102f87b70, 18, 0, 0, 0, 0)
  %l0-3: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
  %l4-7: 0000000000000000 0000000000000000 0000000000000000 0000000000000000

syncing file systems... done
dumping to /dev/dsk/c1t0d0s1, offset 65536, content: kernel
100% done: 125598 pages dumped, compression ratio 6.73, dump succeeded
rebooting...
Resetting ...

doging
When you rebooted node 2, was the quorum device online or offline?

bencyber
You should also install kernel patches such as 118833 and 120011. This looks a bit like a SAN-side problem to me.

lem0
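I have already applied the latest patches from the EIS-DVD, and both of the patches mentioned above are installed. When I rebooted node 2 the quorum device was reported as online, but I don't think it really was. After the first node finished rebooting, it never properly rejoined the cluster.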

john.you
It's a quorum problem. Go into maintenance mode and fix it there.

lem0
I also think it is a quorum problem, but how do I fix it?
Do I need to boot -x and then run some steps?

huanglao2002
Personally, I think you first need to make sure the disk devices are visible,
and then verify the multipathing.

lem0
format and scdidadm -L both already show the correct devices.

lem0
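While reinstalling the cluster software I noticed that the swap space check reported 0 MB. I don't know whether that is a problem.
See the attached screenshot.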

fanliwei20
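OP: I have run into this problem too. It is caused by Sun Cluster suddenly losing its quorum device. This is a bug, and fixing it requires installing the latest Solaris recommended patch cluster, though I have forgotten the exact revision number.
The EIS CD is only updated every few months, so I suggest downloading the patches from Sun's website and installing both the Sun Cluster core patch and the array patch. That will definitely solve the problem; it is not a mistake in your configuration.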

fanliwei20
楼主:这个问题我也曾经遇到过,这个问题的引起就是sun cluster 突然丢失quorum device,这个是一个bug,需要安装solairs 的最新推荐补丁集,才能解决  

       但是具体的版本号我忘记了,
EIS CD 一般几个月更新一次,所以建议去sun 的网站上下载,同时安装sun lcuter core patch 和阵列的patch。问题肯定能解决,并不是你配置上的问题

whr25
Is this your first installation?

The globaldevices partition numbers on the two servers must be different.