We recently implemented Brocade MAPS (Monitoring and Alerting Policy Suite) across all our SAN directors. MAPS replaces the Fabric Watch capabilities starting in FOS v7.4. It's much simpler to implement and manage than Fabric Watch and provides well over 200 rules by default.

Shortly after implementation we observed selected directors reporting CPU utilization on the Control Processor (CP) consistently at 100%. While this does not impact FC performance, it does impact managing the directors; even logging in was taking minutes at times. There is also a risk the director could panic as a result of this sustained high-utilization state.

After several sessions with support, we were making no progress. We had even performed an HA failover, but the problem quickly returned on the alternate CP. Logging in as root and running top, we quickly found that the snmpd process was consuming the vast majority of CPU cycles. Why? And why only on select directors? We were advised to upgrade FOS, but we were already running a target path version, 7.4.1e, and there was no evidence an upgrade would resolve this problem.

It seemed the problem might be caused by remote systems or tools polling the directors, but we reviewed all of our storage tools and systems and found nothing of note.

The goal at this point was to restrict SNMP access to Network Advisor only, eliminating polling from other tools as the possible cause. The Brocade Command Reference Guide offers the following information on access control via snmpconfig.

Access Control Configuration Parameters
The ACL check is as follows: there are six ACLs to restrict SNMP get, set, and trap operations to hosts under a host-subnet-area. The host-subnet-area is defined by comparing nonzero IP octets. For example, an ACL of 192.168.64.0 enables access by any hosts that start with the specified octets. The connecting host is enabled to set each host-subnet-area to be read-write or read-only. The closest match out of six entries is given access. The ACL check is turned off when all six entries contain 0.0.0.0. The default values of all six entries are 0.0.0.0. For IPv6 subnets, the format is specified by an IPv6 address followed by the number of fixed bits in the address.
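The "closest match" behavior is worth spelling out. The following sketch is my own model of the documented behavior, not Brocade's code: an entry matches when every nonzero octet equals the corresponding octet of the requesting host, and the entry with the most nonzero (most specific) octets wins.

```python
# Illustrative model of the SNMP ACL "closest match" check described above.
# Not Brocade's implementation -- just the documented IPv4 behavior.

# Six slots, as on the switch; "0.0.0.0" entries are unconfigured.
ACL = [
    ("192.168.0.0", "ro"),   # any host in 192.168.x.x: read-only
    ("192.168.64.0", "rw"),  # hosts in 192.168.64.x: read-write
    ("0.0.0.0", None), ("0.0.0.0", None),
    ("0.0.0.0", None), ("0.0.0.0", None),
]

def acl_lookup(host_ip, acl):
    """Return the access level for host_ip, or None if no entry matches."""
    configured = [(e, a) for e, a in acl if e != "0.0.0.0"]
    if not configured:
        return "rw"  # all entries 0.0.0.0: ACL check is off, access is open
    host = host_ip.split(".")
    best, best_len = None, -1
    for entry, access in configured:
        octets = entry.split(".")
        # every nonzero octet must equal the host's octet
        if all(o == "0" or o == h for o, h in zip(octets, host)):
            nonzero = sum(1 for o in octets if o != "0")
            if nonzero > best_len:  # most specific entry wins
                best, best_len = access, nonzero
    return best

print(acl_lookup("192.168.64.10", ACL))  # rw (closest match)
print(acl_lookup("192.168.99.5", ACL))   # ro
print(acl_lookup("10.1.1.1", ACL))       # None -> denied
```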

Our directors had no entries, so any external device could issue SNMP requests. This is the default behavior.

SANA_DEFAULT:FID128:admin> snmpconfig --show accessControl

SNMP access list configuration:
Entry 0:  No access host configured yet
Entry 1:  No access host configured yet
Entry 2:  No access host configured yet
Entry 3:  No access host configured yet
Entry 4:  No access host configured yet
Entry 5:  No access host configured yet

So I ran the following command and restricted all SNMP access to the BNA server only, with RW access.

SANA_DEFAULT:FID128:admin> snmpconfig --set accessControl

SNMP access list configuration:
Access host subnet area : [0.0.0.0] xxx.xx.xxx.xxx (BNA IP here)
Read/Write? (true, t, false, f): [false] t
Access host subnet area : [0.0.0.0]
Read/Write? (true, t, false, f): [true]
Access host subnet area : [0.0.0.0]
Read/Write? (true, t, false, f): [true]
Access host subnet area : [0.0.0.0]
Read/Write? (true, t, false, f): [true]
Access host subnet area : [0.0.0.0]
Read/Write? (true, t, false, f): [true]
Access host subnet area : [0.0.0.0]
Read/Write? (true, t, false, f): [true]
Committing configuration.....done.

SANA_DEFAULT:FID128:admin> snmpconfig --show accessControl

SNMP access list configuration:
Entry 0:  Access host subnet area xxx.xx.xxx.xxx (rw)
Entry 1:  No access host configured yet
Entry 2:  No access host configured yet
Entry 3:  No access host configured yet
Entry 4:  No access host configured yet
Entry 5:  No access host configured yet
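If you manage many switches, you could collect this `snmpconfig --show accessControl` output from each one (e.g. over SSH, not shown) and flag any switch still wide open. A rough sketch, assuming the output format shown above:

```python
# Rough sketch: flag switches whose SNMP access list is still at defaults.
# Assumes the "snmpconfig --show accessControl" output format shown above;
# collecting the output from each switch is left out.

def acl_is_open(show_output):
    """Return True if no access host is configured (SNMP open to everyone)."""
    return all(
        "No access host configured yet" in line
        for line in show_output.splitlines()
        if line.strip().startswith("Entry")
    )

sample = """SNMP access list configuration:
Entry 0:  Access host subnet area 10.20.30.40 (rw)
Entry 1:  No access host configured yet
Entry 2:  No access host configured yet
Entry 3:  No access host configured yet
Entry 4:  No access host configured yet
Entry 5:  No access host configured yet"""

print(acl_is_open(sample))  # False: entry 0 is configured
```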

After a few minutes... bingo! The CPU utilization dropped drastically, settling consistently in the 20-30% range. I watched the CPU use with the following command.

SANA_DEFAULT:FID128:admin> mapssam --show cpu
Showing Cpu Usage:
   CPU Usage   : 21.0%

Going forward we will be restricting access to SNMP services on our SAN devices. While we were fortunate there was no impact, this could have caused problems similar to a DoS attack, even an unintentional one.

If you haven't implemented MAPS yet in your Brocade SAN, I highly recommend it! It's super easy to do and brings a tremendous amount of visibility into the health status of your fabrics and switches.
