The Cisco Wireless Controller nightmare™

We run a wireless network with about 80 APs and two 4404 WLCs, and we ran into several issues with these wireless controllers.

When we first put the network into production in spring 2008, we lost management access to both controllers after a few hours of uptime. Of course the two wireless controllers were installed at remote sites and we had no direct access to them (is this part of Murphy’s Law?).

TAC: The problem you are hitting is bug CSCsm98250.

CSCsm98250: Webauth stop working after upgrade to 5.0.

  • Severity: 1 – catastrophic
  • Symptom: Webauth and controller access via HTTP or telnet/SSH stop working.
  • Conditions: After the controller was upgraded to 5.0, webauth and controller access via HTTP or telnet/SSH randomly stop working.
  • Workaround: Reboot controller.
  • Affects: 5.0(148.0)
  • Fixed in: 5.0(158.0)

TAC provided a development image, as the release containing the bugfix was still pending. Issue resolved. Shit happens.

The two controllers ran on this development image for more than a year without any major problems. However, in summer 2009 we had to upgrade to the 5.2 train because we needed the Cisco Mesh feature (everyone knows the saying about never touching a running system…).

We upgraded both controllers to the latest 5.2 release, and the nightmare began.

  • One of the controllers suddenly stopped responding to DHCP packets (‘handshake incomplete’ when debugging DHCP). A reboot didn’t help (!), so we simply shut down the controller’s ports.
  • After a few hours in service, the (working) controller stopped forwarding the external WebAuth page. A reboot resolved the problem only for a few hours.
  • Under high load, clicking on DHCP Allocated Leases on the active controller crashed it. I stopped clicking on it.

Great, 3 different issues with the new release train. When I opened (one of the) SRs, I asked whether I could downgrade and return to the old (5.0) release. TAC said no: there was a 99% probability that the configuration would become corrupt. I couldn’t even downgrade, wtf? I had to track down all 3 issues.

So here we go:

Problem #1: One of the WLCs didn’t respond to DHCP packets.

I opened an SR. After a lot of debugs and configuration checks, I searched the Bug Toolkit myself and found CSCsy96551.

CSCsy96551: Internal WLC-DHCP not sending out NAK

  • Severity: 2 – severe
  • Symptom: When a PC tries to renew an IP address it had before (for example from a home network), the WLC-internal DHCP server does not send a NAK frame. Instead the controller sends a DHCP ignore message and does not assign an IP address to the machine.
  • Conditions: Using WLC internal DHCP.
  • Workaround: Manually release/renew on the PC
  • Affects: 4.2(176.0)
  • Fixed in: 5.2(185.0)

Problem #2: Webauth didn’t work under load.

I opened a second SR for this one. Guess what, it’s a bug!

CSCsx07878: Webauth: web page not displayed under heavy load

  • Severity: 2 – severe
  • Symptom: Clients intermittently cannot log into WLAN with web authentication (webauth)
  • Conditions: Controller version 4.2.176.0 with WLAN that has webauth configured and over 400 clients
  • Workaround: Rebooting the controller may stop the problem temporarily.
  • Affects:
    • 5.1(151.0)
    • 4.2(176.0)
    • 5.2(157.0)
    • 6.0(120.0)
    • 6.0(140.0)
  • Fixed in:
    • 6.0(127.0)
    • 5.2(183.0)
    • 4.2(196.0)
    • 6.0(144.0)
    • 4.2(198.0)
    • 4.2(130.191)
    • 6.0(182.0)
    • 5.2(193.0)
    • 4.2(176.51)
    • 3.2(214.0)

Ok, now that’s interesting, another bug which triggers only under load…

However, it’s incomprehensible to me why they didn’t fix this bug properly in the 6.0 train. The 6.0 train had not even been released when I opened the SR, and the bug was already known. According to the listing, it still affects two 6.0 releases, even though it was fixed as far back as the 3.2 train. The developers failed miserably here. Don’t they do any regression tests?
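To make the affected/fixed lists above easier to read, here is a minimal Python sketch that groups the version strings by train and sorts them by build number. The helper names `parse` and `timeline_by_train` are made up for illustration, and the sketch assumes that build numbers within a train order numerically, which may not hold for special builds such as 4.2(130.191).

```python
import re
from collections import defaultdict

# Version strings copied from the CSCsx07878 listing above.
AFFECTED = ["5.1(151.0)", "4.2(176.0)", "5.2(157.0)", "6.0(120.0)", "6.0(140.0)"]
FIXED = ["6.0(127.0)", "5.2(183.0)", "4.2(196.0)", "6.0(144.0)", "4.2(198.0)",
         "4.2(130.191)", "6.0(182.0)", "5.2(193.0)", "4.2(176.51)", "3.2(214.0)"]

# Matches strings of the form "major.minor(build.patch)", e.g. "6.0(140.0)".
VERSION_RE = re.compile(r"^(\d+)\.(\d+)\((\d+)\.(\d+)\)$")


def parse(version):
    """Split a version string into (train, build) integer tuples."""
    match = VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    major, minor, build, patch = map(int, match.groups())
    return (major, minor), (build, patch)


def timeline_by_train(affected, fixed):
    """Return {train: [(version, status), ...]} sorted by build number.

    Assumption: builds within one train are ordered numerically.
    """
    trains = defaultdict(list)
    for status, versions in (("affected", affected), ("fixed", fixed)):
        for version in versions:
            train, build = parse(version)
            trains[train].append((build, status, version))
    return {
        train: [(version, status) for _, status, version in sorted(entries)]
        for train, entries in sorted(trains.items())
    }


if __name__ == "__main__":
    for train, entries in timeline_by_train(AFFECTED, FIXED).items():
        print("%d.%d train:" % train)
        for version, status in entries:
            print(f"  {version:<14} {status}")
```

Under that ordering assumption, the 6.0 train comes out as affected, fixed, affected, fixed, fixed: 6.0(140.0) is listed as affected even though the earlier 6.0(127.0) is listed as fixed, which is exactly the regression I am complaining about.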

Problem #3: Controller crashes when viewing DHCP scope leases.

Needless to say, it’s a bug.

CSCsw38078: 5.2.157 anchor WLC crashes on gui view of dhcp scope leases.

  • Severity: 2 – severe
  • Symptom: Anchor WLC running 5.2.157 crashes when dhcp leases viewed from the gui.
  • Workaround: Avoid viewing dhcp leases using the gui. The CLI can be used to view the leases in seconds. Hours & minutes don’t work due to CSCsw34627.
  • Affects: 5.2(157.0)
  • Fixed in: 5.2(163.0)

I got another (my second, hehe) development image through one of the SRs, which resolved all 3 bugs.

A few weeks later, 6.0 came out and a TAC engineer contacted me to ask whether I wanted to upgrade. LOL…
