Josh-D. S. Davis

Xaminmo / Omnimax / Max Omni / Mad Scientist / Midnight Shadow / Radiation Master

AIX 7.2 crash removing adapters from etherchannel
If I remove the first main adapter, and re-add it, then I can add/remove either adapter or IP interface after that.

If I remove the second main adapter, and re-add it, then I cannot remove the first, and dropping the IP interface crashes.

So, assuming EtherChannel ent17 with adapter_names=ent2,ent6 and IP interface en17:

This works everywhere:
/usr/lib/methods/ethchan_config -d ent17 ent2
/usr/lib/methods/ethchan_config -a ent17 ent2
/usr/lib/methods/ethchan_config -d ent17 ent6
/usr/lib/methods/ethchan_config -a ent17 ent6
/usr/sbin/rmdev -Rl en17
/usr/sbin/mkdev -l en17
/usr/sbin/cfgmgr
# Can do any combination of the above after remove/readd first adapter in advance.
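The ordering that worked for me can be written down as a dry-run sketch. This only prints the commands in the order observed above, so they can be reviewed before anything is run. plan_safe_cycle and its arguments are names made up for this example; it is not an IBM-supplied tool or a recommended workaround, only a restatement of what I saw in testing.

```shell
#!/bin/sh
# Dry-run sketch: prints the command order that worked in the tests above.
# plan_safe_cycle is a hypothetical helper, not an AIX tool.
# Arguments: EtherChannel device, first adapter, second adapter, IP interface.
plan_safe_cycle() {
    chan=$1 first=$2 second=$3 ifname=$4
    cfg=/usr/lib/methods/ethchan_config

    # Cycle the first adapter before touching anything else.
    echo "$cfg -d $chan $first"
    echo "$cfg -a $chan $first"

    # After that, the second adapter and the IP interface could be cycled.
    echo "$cfg -d $chan $second"
    echo "$cfg -a $chan $second"
    echo "/usr/sbin/rmdev -Rl $ifname"
    echo "/usr/sbin/mkdev -l $ifname"
    echo "/usr/sbin/cfgmgr"
}

# Example: review the plan for the devices used in this post.
plan_safe_cycle ent17 ent2 ent6 en17
```

Nothing is executed against the devices here. On an unpatched system I would still treat any of these commands as able to crash the LPAR.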


And this crashes everywhere:
/usr/lib/methods/ethchan_config -d ent17 ent6
/usr/lib/methods/ethchan_config -a ent17 ent6
# crashed here on one server
/usr/lib/methods/ethchan_config -d ent17 ent2
ethchan_config: 0950-021 Unable to delete adapter ent2 from the
EtherChannel because it could not be found, errno = 2
/usr/sbin/rmdev -Rl en17

# crash here on several others

Crash analysis follows:

(96)> stat
SYSTEM_CONFIGURATION:
CHRP_SMP_PCI POWER_PC POWER_8 machine with 160 available CPU(s) (64-bit
registers)

SYSTEM STATUS:
sysname... AIX
nodename.. testnode001
release... 2
version... 7
build date Mar 2 2018
build time 13:02:46
label..... 1809C_72H
machine... 00DEADBEEF00
nid....... FBCAFE4C
time of crash: Wed May 9 04:45:59 2018
age of system: 25 day, 10 hr., 54 min., 41 sec.
xmalloc debug: enabled
FRRs active... 0
FRRs started.. 0

CRASH INFORMATION:
CPU 96 CSA F00000002FF47600 at time of crash, error code for LEDs:
30000000
pvthread+1A0E00 STACK:
[00009324].unlock_enable_mem+000018 ()
[06058D54]shientdd:entcore_disable_tx_timeout_timers@AF123_105+000074
(??, ??)
[060592E8]shientdd:entcore_suspend_nic+000028 (??, ??)
[0605FB20]shientdd:entcore_suspend+0001E0 (??, ??, ??)
[06129A68]shientdd:entcore_close_common+000668 (??)
[0612A0B0]shientdd:entcore_close+000490 (??)
[060103CC]shientdd:shi2ent_close+00000C (??)
[F1000000C04911C0]ethchandd:ethchan_close+0001A0 (??)
[00014D70].hkey_legacy_gate+00004C ()
[0057A914]ns_free+000074 (??)
[00014F50].kernel_add_gate_cstack+000030 ()
[069E503C]if_en:en_ioctl+0002DC (??, ??, ??)
[0057126C]if_detach+0001CC (??)
[0056E1DC]ifioctl+00081C (F00000002FF473D0, 8020696680206966,
00000000066EB8A0)
[005EA764]soo_ioctl+0005C4 (??, ??, ??)
[007A4754]common_ioctl+000114 (??, ??, ??, ??)
[00003930]syscall+000228 ()
[kdb_get_virtual_memory] no real storage @ 2FF22358
[D011C92C]D011C92C ()
[kdb_read_mem] no real storage @ FFFFFFFFFFF5D60

(96)> status | grep -v wait
CPU INTR TID TSLOT PID PSLOT PROC_NAME
96 20E03BF 6670 380324 3128 ifconfig

(96)> vmlog
Most recent VMM errorlog entry
Error id = DSI_PROC
Exception DSISR/ISISR = 000000000A000000
Exception srval = 00007FFFFFFFD080
Exception virt addr = 0000000000000004
Exception value = 00000086 EXCEPT_PROT

0x86:
Protection exception. An attempt was made to write to a protected
address in memory

(96)> th -n ifconfig
SLOT NAME STATE TID PRI RQ CPUID CL WCHAN
pvthread+1A0E00 6670*ifconfig RUN 20E03BF 03E 96 0
shientdd:.entcore_disable_tx_timeout_timers@AF123_105+000074
bla <.unlock_enable>
.
2390 ! SUNLOCK(TX_QUEUE_SLOCK, tx_pri);
.

---- NDD INFO ----( F1000B003952B410)----
name............. ent6 alias............ en6
ndd_next......... 0000000000000000
ndd_flags........ 00610812
(BROADCAST!NOECHO!64BIT!CHECKSUM_OFFLOAD)
ndd_2_flags...... 00000930
(IPV6_LARGESEND!IPV6_CHECKSUM_OFFLOAD!LARGE_RECEIVE!ECHAN_ELEM)

(96)> print entcore_acs_t F1000B00393F0000
struct entcore_acs_t
struct entcore_tx_queue_t
< ...>
struct entcore_ras_cb_t *ffdc_ras_cb = 0xF1000B0039537D40;
struct entcore_tx_atomics_t *atomics = 0x0000000000000000;
struct mbuf *overflow_queue = 0x0000000000000000;
struct mbuf *overflow_queue_tail = 0x0000000000000000;
uint64_t ofq_cnt = 0x0000000000000000;
struct entcore_lock_info_t *p_lock_info = 0x0000000000000000; <- NULL, so DSI
void *p_acs = 0xF1000B00393F0000;

(96)> dd F1000B00393F78D0
F1000B00393F78D0: 0000000000000000 <- p_lock_info


(96)> xm F1000B00393F78D0
Page Information:
heap_vaddr = F1000B0000000000
P_allocrange (range of 2 or more allocated full pages)
page........... 00003937 start.. F1000B00393F0000 page_cnt....... 0017
allocated_size. 00170000 pd_size........ 00010000 pinned......... yes
XMDBG: ALLOC_RECORD

Allocation Record:
F1000B00E4306600: addr......... F1000B00393F0000 allocated pinned
F1000B00E4306600: req_size..... 1458712 act_size..... 1507328
F1000B00E4306600: tid.......... 033F0187 comm......... cfgshien
XMDBG: ALLOC_RECORD
Trace during xmalloc() on CPU 00
0604FCB0(.entcore_allocate_acs+000310)
060129C4(.entcore_config_state_machine+
0601A884(.entcore_perform_init+0000A4)

Free History:
105D 40.955808 SHIENTDD GEN: L3 Close__B d1=F1000B00393F0000
105D 40.955808 SHIENTDD GEN: L3 CloseC_B d1=F1000B00393F0000
105D 40.955809 SHIENTDD GEN: L3 HwClos_B d1=F1000B00393F0000
105D 40.955810 SHIENTDD GEN: L3 HwClos_B -HW| d1=0000000000000000
105D 40.955810 SHIENTDD GEN: L3 HwClos10 -HW| d1=0000000000000000
105D 40.955810 SHIENTDD GEN: L3 HwClos_E -HW| d1=0000000000000000
105D 40.955811 SHIENTDD GEN: L3 HwClos_E d1=0000000000000000

< ...>

105D 41.039269 SHIENTDD GEN: L3 CloseC_E d1=F1000B00393F0000
105D 41.039269 SHIENTDD GEN: L3 Close__E d1=0000000000000000
105D 41.039273 SHIENTDD GEN: L3 Close__B d1=F1000B00393F0000

another close ? >>

105D 41.039273 SHIENTDD GEN: L3 CloseC_B d1=F1000B00393F0000
105D 41.039274 SHIENTDD GEN: L3 HwClos_B d1=F1000B00393F0000
105D 41.039275 SHIENTDD GEN: L3 HwClos_B -HW| d1=0000000000000000
105D 41.039275 SHIENTDD GEN: L3 HwClos10 -HW| d1=0000000000000000
105D 41.039276 SHIENTDD GEN: L3 HwClos_E -HW| d1=0000000000000000
105D 41.039276 SHIENTDD GEN: L3 HwClos_E d1=0000000000000000
105D 41.039276 SHIENTDD GEN: L3 Suspnd_B d1=F1000B00393F0000
105D 41.039279 SHIENTDD GEN: L3 MctSyn_B d1=F1000B00393F0000
105D 41.039281 SHIENTDD GEN: L3 MctSyn_E d1=0000000000000000
END


It seems that two closes happened, which would have led to a double free and the crash.
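A quick way to spot this pattern in saved trace text is to count the close-begin and close-end events. The sketch below assumes the trace has been saved as plain text in the same layout as above, with Close__B and Close__E as whitespace-separated fields. count_closes is a name made up for this example.

```shell
#!/bin/sh
# count_closes is a hypothetical helper: it reads trace text on stdin and
# counts close-begin (Close__B) and close-end (Close__E) events.
# CloseC_B / CloseC_E lines are deliberately not counted.
count_closes() {
    awk '
        / Close__B / { nb++ }
        / Close__E / { ne++ }
        END { printf "begin=%d end=%d\n", nb, ne }
    '
}

# Example using the relevant lines from the trace in this post.
count_closes <<'EOF'
105D 40.955808 SHIENTDD GEN: L3 Close__B d1=F1000B00393F0000
105D 41.039269 SHIENTDD GEN: L3 CloseC_E d1=F1000B00393F0000
105D 41.039269 SHIENTDD GEN: L3 Close__E d1=0000000000000000
105D 41.039273 SHIENTDD GEN: L3 Close__B d1=F1000B00393F0000
EOF
# prints: begin=2 end=1
```

Two begins against one end matches the picture above: a second close started on the same adapter structure and the system went down before it finished. The count alone does not prove the root cause; it only flags where to look.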

A debug efix was tested for 2 weeks on 24 systems; the problem was resolved and the patch was stable.

APAR IJ06720 was generated, and a public efix will be released for it.
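Once the fix ships in a service pack, a check along these lines should show whether a given system has it. This is a minimal sketch built around the AIX instfix command. check_apar and the INSTFIX override variable are names made up for this example, and I have not confirmed how this particular APAR is reported on every level.

```shell
#!/bin/sh
# check_apar is a hypothetical wrapper around the AIX instfix command.
# INSTFIX can be overridden, which also allows testing off AIX.
check_apar() {
    apar=$1
    # instfix -ik <APAR> exits 0 when the APAR's filesets are all installed.
    if "${INSTFIX:-instfix}" -ik "$apar" >/dev/null 2>&1; then
        echo "$apar: installed"
    else
        echo "$apar: not found"
        return 1
    fi
}
```

Usage would be check_apar IJ06720. As far as I know, an interim fix installed through emgr is not reported by instfix, so emgr -l is the place to look for the efix itself.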

http://omnitech.net/reference/2018/05/29/aix-7-2-crash-removing-adapters-from-etherchannel/

This entry was originally posted at https://xaminmo.dreamwidth.org/1506880.html. Please comment there using OpenID.
