Cleaning out /var in Solaris

Solaris, UNIX

From the day you install Solaris 10, the data in the /var file system grows each time you apply a patch. Depending on your patching strategy, a dedicated /var partition can eventually run out of space. Mail spools and logging from all kinds of applications make the problem worse.

I’d say the best strategy is to increase the size of /var. If you’re using the standard UFS file system with no volume management, this means backing up, re-creating the partition, and restoring the data. If you do have some sort of volume management, sometimes the answer is a simple metattach/growfs or vxresize command.

If you want another option, just to get you by until you have the time to increase /var, there is another easy method. When patchadd adds a patch to the system, it saves the files being replaced in case you need to remove the patch later and restore them. These files are compressed and stored in /var/sadm/pkg/ /save/ and in /var/sadm/pkg/ /save/pspool/ /save/. The files are called undo.Z.

Note: It is safe to delete these undo.Z files, as long as you are sure you will never need to back out the associated patches. Doing this can free up significant space.

In a pinch, I’ve even taken the shotgun approach:

# find /var -name undo.Z -exec rm {} \;
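
Before pulling the trigger, it can be worth checking what would be deleted and how much space it holds. Here is a rough sketch; the function names are made up for illustration, and the default of /var/sadm/pkg is an assumption based on the save locations described above.

```shell
# Hypothetical helpers; the names are illustrative, not part of any Solaris tool.

# List every undo.Z file under the given directory (default: /var/sadm/pkg).
list_undo_files() {
    find "${1:-/var/sadm/pkg}" -type f -name undo.Z -print
}

# Print the total space used by those files, in kilobytes.
undo_files_kb() {
    find "${1:-/var/sadm/pkg}" -type f -name undo.Z -exec du -k {} \; |
        awk '{ total += $1 } END { print total + 0 }'
}
```

Running undo_files_kb first shows whether the cleanup will free enough space to be worth giving up the ability to back out those patches.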

Link-based IPMP setup with VCS

network, Solaris, vcs

With Solaris 10 came a nice feature: link-based IP Multipathing (IPMP). It determines NIC availability solely from the physical link status (UP or DOWN) reported by the NIC driver. Previous versions used “probe-based” IPMP, where connectivity is tested by pinging something on the network from each interface. While probe-based is a more thorough test (it tests network layer 3 as well as layer 2), it is much more cumbersome to configure, and each interface needs an extra IP address to serve as its “test” address. IMO link-based IPMP is sufficient for most applications.

For some reason, configuring link-based IPMP in VCS is somewhat tricky, and the documentation doesn’t seem to help much. It seems all the default values for VCS are for probe-based IPMP only.

To achieve link-based IPMP, here’s how I’ve configured my MultiNICB resource:

Link-based IPMP MultiNICB properties


These are the values you must change from the defaults:

UseMpathd: 1
Tells VCS to use mpathd for network link status
MpathdCommand: /usr/lib/inet/in.mpathd -a
The default, /usr/sbin/in.mpathd, is just incorrect; it doesn’t live there.
ConfigCheck: 0
If you leave this at 1, it will overwrite your /etc/hostname.xxx files with probe-based IPMP configuration
Device: (your IPMP interfaces here)
The “interface alias” for each device is not needed; leave it blank.
IgnoreLinkStatus: 0
You want VCS to NOT ignore link status, since this is how link-based IPMP works.
GroupName:
Do not use your IPMP group name here; it’s not needed. VCS is not monitoring the group, mpathd is.

Here’s how it looks in main.cf:

MultiNICB csgmultinic (
    UseMpathd = 1
    MpathdCommand = "/usr/lib/inet/in.mpathd -a"
    ConfigCheck = 0
    Device = { ce0 = "", ce4 = "" }
    IgnoreLinkStatus = 0
    )

Changing timeouts on SMF services

Solaris, UNIX

I’ve run into an issue where the default timeout value (120 seconds) was not long enough for the start methods of some system services to finish, in particular the psncollector service.

The psncollector service runs a ‘prtfru -x’ command, which can take several minutes to complete on a large server like an E2900. With the 120 second timeout, the start method fails:

# svcs -x
svc:/application/psncollector:default (Product Serial Number Collector)
State: maintenance since Sun 25 Jan 2009 10:01:34 AM PST
Reason: Start method failed repeatedly, last died on Killed (9).
See: http://sun.com/msg/SMF-8000-KS
See: /var/svc/log/application-psncollector:default.log
Impact: This service is not running.

# tail /var/svc/log/application-psncollector:default.log
[ Jan 25 08:59:51 Executing start method ("/lib/svc/method/svc-psncollector") ]
Using /var/run
[ Jan 25 09:02:01 Method or service exit timed out. Killing contract 48 ]
[ Jan 25 09:02:05 Method or service exit timed out. Killing contract 48 ]
[ Jan 25 09:02:18 Method "start" failed due to signal KILL ]

The easy fix was to increase the service start timeout value:

# svccfg -s psncollector setprop start/timeout_seconds=480
# svccfg -s psncollector setprop restart/timeout_seconds=480
# svcadm refresh psncollector
# svcadm clear psncollector

Once cleared, the service started up, taking its usual 3+ minutes.
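
To confirm that the new value took effect after the refresh, the property can be read back with svcprop. This is only a sketch: the wrapper function and its name are made up for illustration, and it assumes the standard Solaris svcprop command is on the PATH.

```shell
# Hypothetical helper: print the timeout (in seconds) configured for one
# method of a service. Relies on the Solaris svcprop command.
# usage: method_timeout <service> <method>
method_timeout() {
    svcprop -p "$2/timeout_seconds" "$1"
}

# Example (on a Solaris 10 host):
#   method_timeout psncollector start
```

If the command still reports the old value, the refresh has probably not been run yet.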

What is Load Average in Solaris?

Solaris, UNIX

What is load average? I’ve heard all kinds of vague explanations over the years, and it bothers me to continue hearing all the absolutely wrong descriptions of the term and what are “high” values for this number. I’ve heard things like “anything higher than 3X your number of CPUs is bad”, or “as long as it’s under 10 everything should be fine.” Not so.

Some of the misconceptions come from other UNIX and Linux OS’s, which measure the value differently. So an incorrect definition doesn’t necessarily demonstrate a lack of knowledge, just some unfamiliarity with the way Solaris does it. Linux, for example, also includes threads waiting for I/O in its calculation, not just threads waiting for CPU.

In previous versions of Solaris (2.3 through 9), load average was a simple calculation. It was the average number of runnable and running threads. In other words, it was the number of threads running on the CPUs, plus the number of threads in the run queue waiting for CPUs, averaged over time.

In Solaris 10, load average is calculated slightly differently than in previous versions.

The calculation is made by summing high-resolution user time, system time, and thread wait time, then processing this total to generate averages with exponential decay.

This calculation is slightly more comprehensive (and complex), because it takes into account CPU latency – the time taken to move a thread from the run queue onto a CPU. However, the older way of calculating this will yield almost identical results, so either definition I’d call “correct”. I still use the older definition because it is just easier to understand.

So what is a “high” number for load average? Well, first it depends on how many CPUs you have on your system, since the calculations do not take that into account. If you have one CPU, then a load average of 1.0 would mean you are, on average, consuming exactly 100% of that one CPU over the measurement period. If your number climbs above 1.0, then you have threads in the run queue at some point, waiting for CPU time. Solaris actually handles CPU saturation very well, so this may not mean your performance will degrade; it just means your CPU is well-used.

On the other hand, if you have 8 CPUs and a load average of 32, you may be seeing a performance degradation, as your system is somewhat CPU-bound. Each CPU is, on average, 100% utilized by running threads, and there are, on average, 24 more threads in the run queue. Depending on the application, this may be acceptable – it just depends on the expected response-time or expected processing time for your application.
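
The arithmetic in these two examples can be sketched in a few lines of shell, using the older “running plus runnable threads” definition. The function names are made up for illustration, and the load and CPU count are passed in by hand; on a real system you would probably take them from uptime and from a count of the lines printed by psrinfo.

```shell
# Hypothetical helpers for the arithmetic above; the names are illustrative.

# Load average divided by CPU count: roughly, threads per CPU.
# usage: load_per_cpu <load_average> <cpu_count>
load_per_cpu() {
    echo "$1 $2" | awk '{ printf "%.2f\n", $1 / $2 }'
}

# Average number of threads left waiting in the run queue (never below 0).
# usage: queued_threads <load_average> <cpu_count>
queued_threads() {
    echo "$1 $2" | awk '{ w = $1 - $2; if (w < 0) w = 0; printf "%.2f\n", w }'
}
```

For the 8-CPU example, load_per_cpu 32 8 prints 4.00 and queued_threads 32 8 prints 24.00, matching the 24 waiting threads described above.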

Silencing/Automating Solaris Package Installs

Solaris, UNIX

If you’ve ever been faced with the chore of installing many packages across many hosts, you’ve either 1) spent all day hitting the ‘Y’ key on your keyboard to answer pkgadd’s questions, 2) gotten someone else to hit the ‘Y’ key all day, or 3) given pkgadd the proper information so it can proceed without your input.

pkgadd takes a -n argument, which tells it to operate in non-interactive mode. However, this alone will not let you install much of anything: if pkgadd needs any input from the user, it simply exits and your package is not installed. To give pkgadd the information to act on its own and install your package, you have to provide the -a option and specify an “installation administration file”.

This “admin” file contains all the parameters pkgadd needs to operate. The default file is /var/sadm/install/admin/default. Copy it to your home directory and take a look at it.

mail=
instance=unique
partial=ask
runlevel=ask
idepend=ask
rdepend=ask
space=ask
setuid=ask
conflict=ask
action=ask
networktimeout=60
networkretries=3
authentication=quit
keystore=/var/sadm/security
proxy=
basedir=default

You can get information on all of the parameters in the file with:

# man -s 4 admin

What I usually do, to forcefully install the packages without asking anything, is just replace all the occurrences of “ask” with “nocheck”. The following command takes the default file and creates a new one, changing ask to nocheck.

# sed 's/ask/nocheck/' < /var/sadm/install/admin/default > /home/user/admin.file

Now you can run pkgadd without any questions being asked:

# pkgadd -n -a /home/user/admin.file SUNWblah

Another handy parameter in the admin file, especially when you are installing packages across multiple hosts, is the “mail” parameter. When you set this to your email address, you will be notified when the package installs on each system.
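
Both changes, ask to nocheck and filling in mail, can be made in one pass. The sketch below is illustrative only: the function name and the example address are made up, and it assumes the address contains no “/” or “&” characters, which would confuse sed.

```shell
# Hypothetical helper: build a no-questions admin file from a default one.
# usage: make_admin_file <source> <destination> <email>
make_admin_file() {
    sed -e 's/=ask$/=nocheck/' \
        -e "s/^mail=.*/mail=$3/" "$1" > "$2"
}

# Example:
#   make_admin_file /var/sadm/install/admin/default /home/user/admin.file admin@example.com
```

Settings that are not “ask”, such as instance=unique, pass through unchanged.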

Sparse vs. Whole-root Zones patching times

Solaris, UNIX

Someone had a question for me – does patching a sparse zone take less time than patching a whole-root zone? My first instinct was “yes”, because there would be much less data to copy into each zone, as some file systems like /usr are shared with the global zone. I decided to test both scenarios.

Here is the procedure I used:

1. Rebooted the server to clear the inode cache, memory pages, etc.
2. With 3 whole-root zones installed, used PCA (http://www.par.univie.ac.at/solaris/pca/) to add 106 patches to the system.
3. Rebooted, removed the zones, and backed out the 106 patches.
4. Rebooted, installed 3 sparse zones, and performed the same patching again.

Patching whole-root zones: 88 minutes to install 106 patches

Patching sparse zones: 72 minutes to install 106 patches

Hmm, not all that different. Not quite what I expected, but good to know.

Setup: SunFire V440, all local disks (SVM mirrored), 4GB RAM, Solaris 10 update 4, patching to the EIS May 2008 level.

UPDATE: I also patched the server without any zones, and it took only 30 minutes. My sub-par math tells me each sparse zone added about 14 minutes to the patching, and each whole-root zone about 19 minutes.
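
The per-zone figures come from subtracting the 30-minute no-zone baseline and dividing by the 3 zones. As a quick sketch (the function name is made up for illustration, and expr truncates to whole minutes):

```shell
# Hypothetical helper: extra minutes each zone added to a patch run.
# usage: per_zone_minutes <total_minutes> <baseline_minutes> <zone_count>
per_zone_minutes() {
    expr \( "$1" - "$2" \) / "$3"
}

per_zone_minutes 72 30 3   # sparse zones: prints 14
per_zone_minutes 88 30 3   # whole-root zones: prints 19
```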
