At least 90% of my customers utilize their existing VMware environment to run Cisco ISE instead of buying hardware (SNS-3xx5) servers. There are issues you need to be aware of when utilizing a VM environment. Here are the two most common issues I’ve seen in the field.
The first issue is enabling VMware snapshots to back up the ISE nodes. The answer to “Should I use snapshots?” is a huge “NO!” There is a small blurb in the Cisco ISE installation guide that I’m certain 99.9% of users never read. Here is the entry in its entirety:
Cisco ISE does not support VMware snapshots for backing up ISE data because a VMware snapshot saves the status of a VM at a given point in time. In a multi-node Cisco ISE deployment, data in all the nodes are continuously synchronized with current database information. Restoring a snapshot might cause database replication and synchronization issues. Cisco recommends that you use the backup functionality included in Cisco ISE for archival and restoration of data.
Using VMware snapshots to back up ISE data results in stopping Cisco ISE services. A reboot is required to bring up the ISE node.
Informing customers about this is one of my first steps during an installation. If you snapshot an ISE node, you will run into issues where the node suddenly goes offline. Use the backup/restore function built into ISE instead. A distributed deployment with two (2) admin nodes already has redundancy because both admin nodes retain the configuration for the deployment. The only reasons I’ve found that you would need to do a restore are a) a bad configuration was applied and you need to roll back, or b) you lose both admin nodes. Either scenario is easily rectified by restoring an ISE backup, or by rebuilding an admin node and then restoring a backup.
The second issue is trying to resize an ISE node. When you install the VM, the ISE installation determines what configuration (small or large) you are using and writes that configuration to the underlying ADE-OS. There are two scenarios that I see most often:
- You installed using the 3515 OVA, which has 6 cores. Your deployment grows and it’s determined that you really need the CPU and RAM of the 3595-sized appliance because utilization is high. Your VM team says “No problem. We’ll just shut down the server and add the extra CPU and memory.” They proceed to do just that and everything comes back up, but the performance doesn’t really improve. The reason is that even though ISE sees the extra resources, it is not configured to actually use them.
- Your Monitor node is running out of space because you set it up with 300GB of disk. You ask the VM admin to increase the drive space. The VM admin does, and suddenly the server no longer boots. Why? Because you’ve changed the underlying configuration and ADE-OS doesn’t know how to handle the drive parameter change. The same thing happens to customers who see that drives are underutilized, think “I don’t need all that space, so I’ll shrink the drive in order to use it for other servers,” and shrink the drive.
If you need to change the hardware configuration, the best option is to delete the VM and rebuild it with the new settings. Back up the SSL certificates for that node, remove it from the deployment, rebuild it, add the SSL certificates back, and then join it back into the deployment.
Update: The above information about resizing the ISE node was what I saw up through ISE release 2.2. With ISE release 2.3 and up, you can increase the CPU and memory in order to change the VM from a 3x15 to 3x55. Make sure you shut the node down first. After the node is started back up, check the ISE Counters to verify the changes took effect. Hard drive size still cannot be changed without a reinstall.
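To make the resize rules above easier to hand to a VM team, here is a minimal Python sketch of the decision as I've described it in this post. It is only an illustration, not a Cisco tool: the names `ResizeRequest`, `plan_resize`, `REBUILD_STEPS`, and `IN_PLACE_STEPS` are hypothetical, the release is modeled as a simple `(major, minor)` tuple, and it only covers the cases mentioned here (a CPU/memory increase and any change to disk size).

```python
from dataclasses import dataclass

# Steps taken from the rebuild procedure described in this post.
REBUILD_STEPS = [
    "Back up the node's SSL certificates",
    "Remove the node from the deployment",
    "Delete the VM and rebuild it with the new settings",
    "Add the SSL certificates back",
    "Join the node back into the deployment",
]

# Steps taken from the update for ISE release 2.3 and up.
IN_PLACE_STEPS = [
    "Shut the node down",
    "Increase the CPU and memory on the VM",
    "Start the node back up",
    "Check the ISE Counters to verify the changes took effect",
]


@dataclass(frozen=True)
class ResizeRequest:
    """Hypothetical description of a requested VM change."""
    ise_release: tuple            # (major, minor), e.g. (2, 3)
    cpu_or_memory_increase: bool  # adding CPU and/or RAM
    disk_change: bool             # growing OR shrinking the disk


def plan_resize(req):
    """Return (action, steps) following the rules in this post."""
    if not (req.cpu_or_memory_increase or req.disk_change):
        return ("none", [])
    # Disk size cannot be changed without a reinstall on any release.
    if req.disk_change:
        return ("rebuild", REBUILD_STEPS)
    # CPU/memory can be increased in place starting with release 2.3.
    if req.ise_release >= (2, 3):
        return ("in_place", IN_PLACE_STEPS)
    # Up through release 2.2, the node has to be rebuilt.
    return ("rebuild", REBUILD_STEPS)


if __name__ == "__main__":
    action, steps = plan_resize(ResizeRequest((2, 2), True, False))
    print(action)
    for number, step in enumerate(steps, start=1):
        print(f"{number}. {step}")
```

The disk check comes first on purpose: a request that adds CPU and also changes the disk still ends in a rebuild, so there is no point doing the in-place steps beforehand.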
I have an issue with one of my clients regarding snapshots. Someone made a snapshot of our Primary Admin Node this Easter (a deployment of four nodes: two Admin and Monitoring nodes and two Policy nodes).
Synchronization with the other nodes stopped, obviously. When I performed a manual synchronization, everything went fine with the Secondary Admin node, but it failed on both Policy nodes after two hours.
Have you ever seen anything like this before? Any suggestions? I have already opened a TAC case, but any help would be appreciated.