lem0
V890+ST6140 cluster reboot failure
Rebooting EAS002:
Rebooting with command: boot
Boot device: /pci@8,600000/SUNW,qlc@2/fp@0,0/disk@w500000e018062ad1,0:a File and args:
SunOS Release 5.10 Version Generic_127111-06 64-bit
Copyright 1983-2007 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
WARNING: ce2: fault detected external to device; service degraded
WARNING: ce2: xcvr addr:0x01 - link down
NOTICE: ce2: fault cleared external to device; service available
NOTICE: ce2: xcvr addr:0x01 - link up 1000 Mbps full duplex
Hostname: EAS002
/usr/cluster/bin/scdidadm: Could not stat "../../devices/scsi_vhci/ssd@g600a0b800032c60c0000041448585c4b:c,raw" - No such file or directory.
Warning: Path node loaded - "../../devices/scsi_vhci/ssd@g600a0b800032c60c0000041448585c4b:c,raw".
/usr/cluster/bin/scdidadm: Could not stat "../../devices/scsi_vhci/ssd@g600a0b800032c6be0000060a48585b78:c,raw" - No such file or directory.
Warning: Path node loaded - "../../devices/scsi_vhci/ssd@g600a0b800032c6be0000060a48585b78:c,raw".
/usr/cluster/bin/scdidadm: Could not stat "../../devices/scsi_vhci/ssd@g600a0b800032c60c0000041348585af9:c,raw" - No such file or directory.
Warning: Path node loaded - "../../devices/scsi_vhci/ssd@g600a0b800032c60c0000041348585af9:c,raw".
/usr/cluster/bin/scdidadm: Could not stat "../../devices/scsi_vhci/ssd@g600a0b800032c6be00000608485859e4:c,raw" - No such file or directory.
Warning: Path node loaded - "../../devices/scsi_vhci/ssd@g600a0b800032c6be00000608485859e4:c,raw".
Booting as part of a cluster
NOTICE: CMM: Node EAS001 (nodeid = 1) with votecount = 1 added.
NOTICE: CMM: Node EAS002 (nodeid = 2) with votecount = 1 added.
NOTICE: CMM: Quorum device 2 (/dev/did/rdsk/d12s2) added; votecount = 1, bitmask of nodes with configured paths = 0x3.
WARNING: CMM: Open failed for quorum device 2 with gdevname '/dev/did/rdsk/d12s2'.
NOTICE: clcomm: Adapter ce3 constructed
NOTICE: clcomm: Adapter ce2 constructed
NOTICE: CMM: Node EAS002: attempting to join cluster.
NOTICE: CMM: Node EAS001 (nodeid: 1, incarnation #: 1213861949) has become reachable.
NOTICE: clcomm: Path EAS002:ce3 - EAS001:ce3 online
NOTICE: clcomm: Path EAS002:ce2 - EAS001:ce2 online
NOTICE: CMM: Cluster has reached quorum.
NOTICE: CMM: Node EAS001 (nodeid = 1) is up; new incarnation number = 1213861949.
NOTICE: CMM: Node EAS002 (nodeid = 2) is up; new incarnation number = 1213862900.
NOTICE: CMM: Cluster members: EAS001 EAS002.
WARNING: Received non interrupt heartbeat on EAS002:ce3 - EAS001:ce3 - path timeouts are likely.
WARNING: Received non interrupt heartbeat on EAS002:ce2 - EAS001:ce2 - path timeouts are likely.
NOTICE: CMM: node reconfiguration #8 completed.
NOTICE: CMM: Node EAS002: joined cluster.
ip: joining multicasts failed (18) on clprivnet0 - will use link layer broadcasts for multicast
/usr/cluster/bin/scdidadm: Could not stat "../../devices/scsi_vhci/ssd@g600a0b800032c60c0000041448585c4b:c,raw" - No such file or directory.
Warning: Path node loaded - "../../devices/scsi_vhci/ssd@g600a0b800032c60c0000041448585c4b:c,raw".
/usr/cluster/bin/scdidadm: Could not stat "../../devices/scsi_vhci/ssd@g600a0b800032c6be0000060a48585b78:c,raw" - No such file or directory.
Warning: Path node loaded - "../../devices/scsi_vhci/ssd@g600a0b800032c6be0000060a48585b78:c,raw".
/usr/cluster/bin/scdidadm: Could not stat "../../devices/scsi_vhci/ssd@g600a0b800032c60c0000041348585af9:c,raw" - No such file or directory.
Warning: Path node loaded - "../../devices/scsi_vhci/ssd@g600a0b800032c60c0000041348585af9:c,raw".
/usr/cluster/bin/scdidadm: Could not stat "../../devices/scsi_vhci/ssd@g600a0b800032c6be00000608485859e4:c,raw" - No such file or directory.
Warning: Path node loaded - "../../devices/scsi_vhci/ssd@g600a0b800032c6be00000608485859e4:c,raw".
/dev/md/rdsk/d50 is clean
/dev/md/rdsk/d40 is clean
/dev/md/rdsk/d70 is clean
EAS002 console login: obtaining access to all attached disks
Jun 19 16:09:17 EAS002 login: ROOT LOGIN /dev/pts/1 FROM 192.168.88.90
At this point EAS001 is fine, and once the reboot has finished scstat also looks normal.
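The boot log above already contains the two symptoms that matter later: scdidadm cannot stat several array device paths, and CMM fails to open the quorum device. As a rough sketch (the function name `scan_boot_log` and the regular expressions are illustrative assumptions based only on the message text quoted in this thread), a saved console log could be scanned for both like this:

```python
import re

# Patterns derived from the console messages quoted in this thread.
STAT_FAIL = re.compile(r'scdidadm: Could not stat "([^"]+)"')
QUORUM_FAIL = re.compile(
    r"CMM: Open failed for quorum device (\d+) with gdevname '([^']+)'"
)


def scan_boot_log(text):
    """Return (missing_paths, failed_quorum_devices) found in a console log."""
    missing = []
    for path in STAT_FAIL.findall(text):
        if path not in missing:  # each path is reported more than once per boot
            missing.append(path)
    quorum = [(int(num), dev) for num, dev in QUORUM_FAIL.findall(text)]
    return missing, quorum


sample = """\
/usr/cluster/bin/scdidadm: Could not stat "../../devices/scsi_vhci/ssd@g600a0b800032c60c0000041448585c4b:c,raw" - No such file or directory.
WARNING: CMM: Open failed for quorum device 2 with gdevname '/dev/did/rdsk/d12s2'.
/usr/cluster/bin/scdidadm: Could not stat "../../devices/scsi_vhci/ssd@g600a0b800032c60c0000041448585c4b:c,raw" - No such file or directory.
"""

missing, quorum = scan_boot_log(sample)
print(missing)
print(quorum)
```

Seeing the quorum device in the second list on a node that otherwise joins the cluster is a hint that the node count alone is holding quorum.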
Rebooting EAS001:
EAS001 console login: /etc/rc0.d/K05stoprgm: Calling scswitch -S (evacuate)
Jun 19 16:17:41 EAS001 ip: TCP_IOC_ABORT_CONN: local = 192.168.088.100:0, remote = 000.000.000.000:0, start = -2, end = 6
Jun 19 16:17:41 EAS001 ip: TCP_IOC_ABORT_CONN: local = 192.168.088.101:0, remote = 000.000.000.000:0, start = -2, end = 6
Jun 19 16:17:41 EAS001 ip: TCP_IOC_ABORT_CONN: aborted 0 connection
/etc/rc0.d/K05stoprgm: disabling failfasts
svc.startd: The system is coming down. Please wait.
svc.startd: 132 system services are now being stopped.
Jun 19 16:17:41 EAS001 last message repeated 1 time
Jun 19 16:17:45 EAS001 cl_eventlogd[1178]: Going down on signal 15.
Jun 19 16:18:02 EAS001 syslogd: going down on signal 15
Jun 19 16:18:02 rpc.metad: Terminated
Jun 19 16:18:23 Cluster.RGM.fed: SCSLM thread WARNING pools facility is disabled
umount: /global/.devices/node@2 busy
umount: /global/.devices/node@1 busy
svc.startd: The system is down.
syncing file systems... done
WARNING: CMM: Node being shut down.
rebooting...
Resetting ...
Software Reset
Enabling system bus....... Done
Initializing CPUs......... Done
Initializing boot memory.. Done
Initializing OpenBoot
Probing system devices
ChassisSerialNumber 014311
Probing I/O buses
screen not found.
keyboard not found.
Keyboard not present. Using ttya for input and output.
Probing system devices
ChassisSerialNumber 014311
Probing I/O buses
Sun Fire V890, No Keyboard
Copyright 2005 Sun Microsystems, Inc. All rights reserved.
OpenBoot 4.18.11, 16384 MB memory installed, Serial #75878574.
Ethernet address 0:14:4f:85:d0:ae, Host ID: 8485d0ae.
Rebooting with command: boot
Boot device: disk File and args:
SunOS Release 5.10 Version Generic_127111-06 64-bit
Copyright 1983-2007 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
Hostname: EAS001
WARNING: ce2: fault detected external to device; service degraded
WARNING: ce2: xcvr addr:0x01 - link down
WARNING: ce3: fault detected external to device; service degraded
WARNING: ce3: xcvr addr:0x01 - link down
NOTICE: ce3: fault cleared external to device; service available
NOTICE: ce3: xcvr addr:0x01 - link up 1000 Mbps full duplex
NOTICE: ce2: fault cleared external to device; service available
NOTICE: ce2: xcvr addr:0x01 - link up 1000 Mbps full duplex
/usr/cluster/bin/scdidadm: Could not stat "../../devices/scsi_vhci/ssd@g600a0b800032c60c0000041448585c4b:c,raw" - No such file or directory.
Warning: Path node loaded - "../../devices/scsi_vhci/ssd@g600a0b800032c60c0000041448585c4b:c,raw".
/usr/cluster/bin/scdidadm: Could not stat "../../devices/scsi_vhci/ssd@g600a0b800032c6be0000060a48585b78:c,raw" - No such file or directory.
Warning: Path node loaded - "../../devices/scsi_vhci/ssd@g600a0b800032c6be0000060a48585b78:c,raw".
/usr/cluster/bin/scdidadm: Could not stat "../../devices/scsi_vhci/ssd@g600a0b800032c60c0000041348585af9:c,raw" - No such file or directory.
Warning: Path node loaded - "../../devices/scsi_vhci/ssd@g600a0b800032c60c0000041348585af9:c,raw".
/usr/cluster/bin/scdidadm: Could not stat "../../devices/scsi_vhci/ssd@g600a0b800032c6be00000608485859e4:c,raw" - No such file or directory.
Warning: Path node loaded - "../../devices/scsi_vhci/ssd@g600a0b800032c6be00000608485859e4:c,raw".
Booting as part of a cluster
NOTICE: CMM: Node EAS001 (nodeid = 1) with votecount = 1 added.
NOTICE: CMM: Node EAS002 (nodeid = 2) with votecount = 1 added.
NOTICE: CMM: Quorum device 2 (/dev/did/rdsk/d12s2) added; votecount = 1, bitmask of nodes with configured paths = 0x3.
WARNING: CMM: Open failed for quorum device 2 with gdevname '/dev/did/rdsk/d12s2'.
NOTICE: clcomm: Adapter ce3 constructed
NOTICE: clcomm: Adapter ce2 constructed
NOTICE: CMM: Node EAS001: attempting to join cluster.
NOTICE: CMM: Node EAS002 (nodeid: 2, incarnation #: 1213863689) has become reachable.
NOTICE: clcomm: Path EAS001:ce3 - EAS002:ce3 online
NOTICE: CMM: Cluster has reached quorum.
NOTICE: CMM: Node EAS001 (nodeid = 1) is up; new incarnation number = 1213863656.
NOTICE: CMM: Node EAS002 (nodeid = 2) is up; new incarnation number = 1213863689.
NOTICE: CMM: Cluster members: EAS001 EAS002.
NOTICE: clcomm: Path EAS001:ce2 - EAS002:ce2 online
NOTICE: CMM: node reconfiguration #1 completed.
NOTICE: CMM: Node EAS001: joined cluster.
WARNING: Received non interrupt heartbeat on EAS001:ce3 - EAS002:ce3 - path timeouts are likely.
WARNING: Received non interrupt heartbeat on EAS001:ce2 - EAS002:ce2 - path timeouts are likely.
ip: joining multicasts failed (18) on clprivnet0 - will use link layer broadcasts for multicast
/dev/md/rdsk/d50 is clean
/dev/md/rdsk/d40 is clean
/dev/md/rdsk/d70 is clean
At this point EAS002 rebooted as well.
root@EAS002 # scstat -D
-- Device Group Servers --
                         Device Group        Primary             Secondary
                         ------------        -------             ---------
  Device group servers:  oraset              -                   -
-- Device Group Status --
                              Device Group        Status
                              ------------        ------
  Device group status:        oraset              Offline
-- Multi-owner Device Groups --
                              Device Group        Online Status
                              ------------        -------------
newfs /dev/did/rdsk/d11s2
/dev/did/rdsk/d11s2: No such device or address
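(d11 is a volume on the array; the other volumes on the array behave the same way.)
However, both format and scdidadm -L can see all the volumes, and format can partition them.
The following has to be rerun on both machines before the volumes are recognized again:
# devfsadm -Cv
# scgdevs
# scdidadm -c
# scdidadm -C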
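Since the four commands above have to be repeated on both nodes after every reboot, they could be wrapped in a small helper. This is only a sketch: the function name `rediscover_devices` and the dry-run default are assumptions for illustration, the commands and their order are copied from the post as-is, and a real run would need root on each cluster node.

```python
import subprocess

# Commands exactly as listed in the post, in the same order.
REDISCOVERY_STEPS = [
    ["devfsadm", "-Cv"],
    ["scgdevs"],
    ["scdidadm", "-c"],
    ["scdidadm", "-C"],
]


def rediscover_devices(dry_run=True):
    """List the rediscovery steps, and execute them when dry_run is False.

    Execution stops at the first command that exits non-zero.
    """
    done = []
    for cmd in REDISCOVERY_STEPS:
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit status
            subprocess.run(cmd, check=True)
        done.append(" ".join(cmd))
    return done


for line in rediscover_devices(dry_run=True):
    print("#", line)
```

With `dry_run=True` nothing is executed, so the plan can be reviewed first; this only works around the symptom and does not explain why the device paths disappear on reboot.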
root@EAS002 # scstat -q
-- Quorum Summary --
  Quorum votes possible:      3
  Quorum votes needed:        2
  Quorum votes present:       2
-- Quorum Votes by Node --
                    Node Name           Present Possible Status
                    ---------           ------- -------- ------
  Node votes:       EAS001              1        1       Online
  Node votes:       EAS002              1        1       Online
-- Quorum Votes by Device --
                    Device Name         Present Possible Status
                    -----------         ------- -------- ------
  Device votes:     /dev/did/rdsk/d12s2 0        1       Offline
The quorum device is lost as well and has to be removed and re-created. Could anyone share a solution?
lem0
I rebooted one node again while monitoring the other, and found that the quorum device was lost, which caused the system to crash.
Notifying cluster that this node is panicking
NOTICE: clcomm: Path EAS002:ce2 - EAS001:ce2 being drained
NOTICE: clcomm: Path EAS002:ce3 - EAS001:ce3 being drained
TCP_IOC_ABORT_CONN: local = 000.000.000.000:0, remote = 172.016.004.001:0, start = -2, end = 6
panic[cpu17]/thread=30003d9dc60: CMM: Cluster lost operational quorum; aborting.
000002a102f87550 cl_runtime:__1cZsc_syslog_msg_log_no_args6Fpviipkc0_nZsc_syslog_msg_status_enum__+30 (6000f17f000, 3, 0, 43, 2a102f87750, 7054224c)
%l0-3: 0000000070541e28 0000000000000047 000006000dc83eeb 0000000000000047
%l4-7: 0000000003bf8ce2 000006000dc758be 0000000000000000 00000000704c7000
000002a102f87600 cl_runtime:__1cCosNsc_syslog_msgDlog6MiipkcE_nZsc_syslog_msg_status_enum__+1c (6000e2236d0, 3, 0, 7054224c, 1, 2a102f87740)
%l0-3: 0000000000000009 000000007057dcdc 0000000000000001 0000000000000000
%l4-7: 0000000000000002 000006000e01b550 0000000000000000 0000000000000001
000002a102f876b0 cl_haci:__1cOautomaton_implbAstate_machine_qcheck_state6M_nVcmm_automaton_event_t__+4a4 (6000de7e008, de, 6000de7ee40, 6000de7ecd8, de0, 70542192)
%l0-3: 00000000704f2000 0000000070542000 00000000000704f2 0000000000070400
%l4-7: 0000000070541000 0000000000070541 0000000000070400 000006000de7fc20
000002a102f87930 cl_haci:__1cIcmm_implStransitions_thread6M_v_+f4 (6000de7e008, 1, 1, 6000de7e9d8, 1d354, 926a6e8e4c)
%l0-3: 000006000de7e078 000006000de9b5a4 000000000001d59c 000000000001d400
%l4-7: 000000000000002c 0000000000000004 000000000001d354 000006000de7e498
000002a102f879e0 cl_orb:cllwpwrapper+c4 (2a102f87b70, 7b2781f8, 2, 2, 2, 0)
%l0-3: 0000000000001800 0000000000000007 fffffffffffffffd 000000000000003a
%l4-7: 00000000704f5000 00000000000704f5 0000000000070400 00000000018bf800
000002a102f87ac0 unix:___const_seg_900002901+3a2c (2a102f87b70, 18, 0, 0, 0, 0)
%l0-3: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
%l4-7: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
syncing file systems... done
dumping to /dev/dsk/c1t0d0s1, offset 65536, content: kernel
100% done: 125598 pages dumped, compression ratio 6.73, dump succeeded
rebooting...
Resetting ...
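The panic message "Cluster lost operational quorum" fits the vote counts reported by scstat -q earlier in the thread. A minimal sketch of the arithmetic, assuming the usual majority rule (needed = possible // 2 + 1, which matches the reported 3 possible / 2 needed); the function names are made up for illustration:

```python
def votes_needed(votes_possible):
    """Smallest majority of the configured votes."""
    return votes_possible // 2 + 1


def has_quorum(votes_present, votes_possible):
    """True when the votes present form a majority."""
    return votes_present >= votes_needed(votes_possible)


# Figures from the scstat -q output: two nodes plus one quorum device.
VOTES_POSSIBLE = 3

scenarios = {
    "both nodes up, quorum device offline": 1 + 1 + 0,
    "one node up, quorum device offline": 1 + 0,
    "one node up, quorum device online": 1 + 1,
}

for name, present in scenarios.items():
    state = "quorum held" if has_quorum(present, VOTES_POSSIBLE) else "quorum lost"
    print(f"{name}: {present}/{VOTES_POSSIBLE} votes, {state}")
```

With the quorum device offline, the surviving node holds only 1 of the 2 votes needed as soon as its peer goes down for a reboot, so it aborts; with a working quorum device the same node would keep 2 votes and stay up.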