Backing out Patches from an Unbootable Server

patching, Solaris, UNIX

Sometimes when patching gets interrupted (by a user, a power outage, a hardware failure, etc.), you can end up with incomplete or mis-installed patches. If these are important patches, like kernel patches, your system may not even boot from disk. Often this causes an endless reboot cycle of kernel panics.

Of course you could have prevented this by breaking your root mirror before installing the patches, or by using LiveUpgrade. But I know sometimes we just don’t do these things, for various reasons.

One solution is to boot from the network into single-user mode, mount your root disk, and back out the patch or patches on disk, hopefully repairing the damage and returning you to a bootable state. This assumes you have a JumpStart server on your network. (NOTE: I tested this with a Solaris 10 06/06 boot image on the JumpStart server; I haven't tested earlier versions.)

Where I am, all the root disks are mirrored with SVM. A procedure I'd used in the past was to boot from the network, run patchrm on the first disk in the mirror to back out the patch, and then disable the mirror so the second disk would not be used on reboot, re-mirroring later. This was rather tedious and error-prone, especially with multiple metadevices, soft partitions, etc. I found a better way to accomplish this task: keep the mirror intact and back out the patches while booted from the network. Our systems also use zones, and their zonepaths are on soft partitions; this procedure backs the patches out of the zones as well. Here is the exact procedure:

1. Boot from network into single-user mode
ok> boot net -s
2. Mount root file system READ ONLY from the first disk in the mirror:
# mount -o ro /dev/dsk/c1t0d0s0 /mnt
3. Copy the SVM configuration to the running OS:
# cp /mnt/kernel/drv/md.conf /kernel/drv/md.conf
4. Unmount the root disk
# umount /mnt
5. Update the SVM driver to load the new configuration (ignore error messages)
# update_drv -f md
6. Set up metadevices in configuration
# metainit -r
7. Run metasync on root mirror metadevice
# metasync d10
8. Mount root metadevice on /mnt
# mount /dev/md/dsk/d10 /mnt
9. If the system has zones, run metasync on the metadevice containing the soft partitions, and mount all zone root file systems
# metasync d40
# mount /dev/md/dsk/d53 /mnt/zones/zonepath1
# mount /dev/md/dsk/d56 /mnt/zones/zonepath2
10. Roll back the failed patch:
# patchrm -R /mnt $patch 2>&1 | tee -a /mnt/backout.log
11. Unmount the file systems and reboot the server (sketched below)
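
For completeness, that last step might look something like this; the zone mount points are the made-up ones from step 9, so substitute your own:

# umount /mnt/zones/zonepath1
# umount /mnt/zones/zonepath2
# umount /mnt
# reboot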


Solaris Patching Strategy & Tools

UNIX

There are many strategies for patching your servers. I’ve worked in environments where we followed the “if it ain’t broke, don’t patch” philosophy, and 10 years ago this seemed reasonable in Solaris. Other environments try to apply every patch available from their vendors. I think the best approach is of course somewhere in between.

Sun makes several patch bundles available that many customers use for patching. The most popular is the Recommended patch cluster for your Solaris version. This cluster provides the latest revision of every patch that addresses a Sun Alert, meaning any fix for a security, data corruption, or system availability issue. It is a well-tested combination of patches, known to be stable and compatible.

A second patch cluster is also available from Sun – the Sun Alert patch cluster. This cluster is similar to the recommended cluster, but with one difference. It contains the minimum revision of any patch that addresses a Sun Alert. Applying this cluster will fix all Sun Alert issues, while introducing the least amount of change to your systems.

Once you have more than 50 servers, it can take months to patch them all, depending on maintenance windows and uptime requirements. Sometimes it's difficult to know whether a particular server has been patched yet, or when the last patching took place. One trick I learned a while back is to create a nearly-empty Solaris package containing only a text file that records the installed patch level. The package is installed or updated with each patch cycle, so you know exactly which set of patches is on a system at any time.
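
Here's a rough sketch of how such a marker package might be put together; the SITEpatchlvl name, version string, and file contents are all invented for illustration:

# mkdir /tmp/patchlvl && cd /tmp/patchlvl
# echo "Recommended cluster Jan 2009, applied Feb 2009" > patchlevel.txt
# cat > pkginfo <<EOF
PKG=SITEpatchlvl
NAME=site patch level marker
VERSION=2009.01
ARCH=sparc
CATEGORY=application
BASEDIR=/opt/SITEpatchlvl
EOF
# cat > prototype <<EOF
i pkginfo
f none patchlevel.txt 0444 root sys
EOF
# pkgmk -o -r . -d /var/spool/pkg
# pkgadd -d /var/spool/pkg SITEpatchlvl

After that, pkginfo -l SITEpatchlvl (or just cat-ing the text file) on any server tells you which patch cycle it last received; bump VERSION and the file contents each cycle.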

Sun has come up with several patch management tools over the years, some better than others (PatchDiag, PatchCheck, PatchPro, smpatch). The most recent are the "Patch Update Manager" in Solaris 10 and the xVM Ops Center product. If you're looking for something low-cost (free), it's worth looking at Patch Check Advanced (PCA). This tool analyzes your system, then downloads and installs the missing patches fairly easily. It's a great tool.
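
From memory, a typical PCA run looks roughly like this; the first command only reports what's missing and the second downloads and installs it, but check pca's own usage output since I'm quoting the options from recollection:

# pca missing
# pca -i missing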


Oracle Dynamic SGA in Solaris Zones

performance, Solaris, UNIX

Since 9i, Oracle has included a feature to dynamically resize the database's SGA when needed, without restarting the database. On Solaris, it uses "Dynamic Intimate Shared Memory" (DISM) to accomplish this.

DISM provides dynamically resizable shared memory. Any process that uses a DISM segment can lock and unlock parts of a memory segment, and by doing so, the application can dynamically adjust to the addition (or removal) of physical memory from a server.

In the initial releases of Solaris 10, DISM was unavailable within Solaris Zones, because processes in a zone could not lock memory segments. If you try to run Oracle with DISM in a zone on a release before 11/06, you'll see completely awful database performance (I've seen it). The fix was to disable DISM in Oracle by setting the parameters sga_max_size and sga_target to the same value, so the SGA never resizes.
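
In the pfile/init.ora, that workaround might look something like this (the sizes are placeholders, not recommendations):

# keeping sga_target equal to sga_max_size stops the SGA from ever
# resizing, so Oracle falls back to plain ISM instead of DISM
sga_max_size = 4096M
sga_target   = 4096M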

Solaris 10 update 3 (11/06) introduced a new zone privilege: proc_lock_memory, which gives processes within the zone the ability to lock memory segments. So DISM will now work if this privilege is enabled. To enable it, just turn it on in the zone config and reboot the zone:

# zonecfg -z oraclezone
zonecfg:oraclezone> set limitpriv=default,proc_lock_memory
zonecfg:oraclezone> commit
zonecfg:oraclezone> exit
# zoneadm -z oraclezone reboot

If you see an error after the "set limitpriv" line when you try this, make sure you are running Solaris 10 11/06 or later (or the patched equivalent).


Cleaning out /var in Solaris

Solaris, UNIX

Ever since your Solaris 10 installation, the data in the /var file system has grown each time you apply a patch. Depending on your patching strategy, over time you could find yourself running out of space if you use a dedicated /var partition. Mail and logging from all kinds of applications only make the problem worse.

I’d say the best strategy is to increase the size of /var. If you’re using the standard UFS file system with no volume management, this means backing up, re-creating the partition, and restoring the data. If you do have some sort of volume management, sometimes the answer is a simple metattach/growfs or vxresize command.
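
As a sketch of the SVM case, assuming /var sits on a soft partition d30 (an invented device name) with free space left in its parent device, growing it online might look like:

# metattach d30 1g
# growfs -M /var /dev/md/rdsk/d30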

If you want another option, just to get you by until you have time to grow /var, there is another easy method. When patchadd adds a patch to the system, the files being replaced are saved off in case you need to remove the patch later and restore them. These saved files are compressed, named undo.Z, and stored under /var/sadm/pkg/<pkg>/save/ and /var/sadm/pkg/<pkg>/save/pspool/<pkg>/save/.

Note: It is completely safe to delete these undo.Z files, as long as you are sure you will never need to back out the associated patches! Doing this can free up significant space.

I've even done things like this in a pinch (the shotgun approach):

# find /var -name undo.Z -exec rm {} \;
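
Before going that far, it can be worth totaling up how much space the undo.Z files actually consume, with something along these lines:

# find /var/sadm/pkg -name undo.Z -exec du -k {} \; | awk '{s += $1} END {print s " KB in undo.Z files"}'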


Link-based IPMP setup with VCS

network, Solaris, vcs

With Solaris 10 came a nice feature: link-based IP Multipathing (IPMP). It determines NIC availability solely from the NIC driver reporting the physical link status, UP or DOWN. Previous Solaris versions used "probe-based" IPMP, where connectivity is tested by pinging something on the network from each interface. While probe-based is a more thorough test (it exercises network layer 3 as well as layer 2), it is much more cumbersome to configure, and you need an extra "test" IP address for each interface. IMO, link-based IPMP is sufficient for most applications.
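
Outside of VCS, a link-based IPMP setup is nothing more than the group keyword in the hostname files, something like the following, where ce0/ce4, myhost, and the ipmp0 group name are placeholders for your own values:

# cat /etc/hostname.ce0
myhost netmask + broadcast + group ipmp0 up
# cat /etc/hostname.ce4
group ipmp0 standby up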

For some reason, configuring link-based IPMP in VCS is somewhat tricky, and the documentation doesn’t seem to help much. It seems all the default values for VCS are for probe-based IPMP only.

To achieve link-based IPMP, here’s how I’ve configured my MultiNICB resource:

(screenshot: Link-based IPMP MultiNICB properties)


These are the values you must change from the defaults:

UseMpathd: 1
Tells VCS to use in.mpathd for network link status.
MpathdCommand: /usr/lib/inet/in.mpathd -a
The default, /usr/sbin/in.mpathd, is simply wrong; the daemon doesn't live there.
ConfigCheck: 0
If you leave this at 1, it will overwrite your /etc/hostname.xxx files with a probe-based IPMP configuration.
Device: (your IPMP interfaces here)
The "interface alias" for each device is not needed; leave them blank.
IgnoreLinkStatus: 0
You want VCS to NOT ignore link status, since link status is how link-based IPMP determines availability.
GroupName: (leave blank)
Do not put your IPMP group name here; it's not needed. VCS is not monitoring the group, in.mpathd is.

Here’s how it looks in main.cf:

MultiNICB csgmultinic (
    UseMpathd = 1
    MpathdCommand = "/usr/lib/inet/in.mpathd -a"
    ConfigCheck = 0
    Device = { ce0 = "", ce4 = "" }
    IgnoreLinkStatus = 0
    )


Changing timeouts on SMF services

Solaris, UNIX

I've run into an issue where the default timeout value (120 seconds) was not long enough for a service's start method to finish; in my case it was the psncollector service.

The psncollector service runs a 'prtfru -x' command, which can take several minutes to complete on a large server like an E2900. With the 120-second timeout, the start method fails:

# svcs -x
svc:/application/psncollector:default (Product Serial Number Collector)
State: maintenance since Sun 25 Jan 2009 10:01:34 AM PST
Reason: Start method failed repeatedly, last died on Killed (9).
See: http://sun.com/msg/SMF-8000-KS
See: /var/svc/log/application-psncollector:default.log
Impact: This service is not running.

# tail /var/svc/log/application-psncollector:default.log
[ Jan 25 08:59:51 Executing start method ("/lib/svc/method/svc-psncollector") ]
Using /var/run
[ Jan 25 09:02:01 Method or service exit timed out. Killing contract 48 ]
[ Jan 25 09:02:05 Method or service exit timed out. Killing contract 48 ]
[ Jan 25 09:02:18 Method "start" failed due to signal KILL ]

The easy fix was to increase the service start timeout value:

# svccfg -s psncollector setprop start/timeout_seconds=480
# svccfg -s psncollector setprop restart/timeout_seconds=480
# svcadm refresh psncollector
# svcadm clear psncollector

Once cleared, the service started up, taking its usual 3+ minutes.
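
If you want to confirm the new value stuck, svcprop should echo back what you set (the FMRI here assumes the default instance):

# svcprop -p start/timeout_seconds svc:/application/psncollector:default
480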
