Wednesday, September 21, 2011

ArcSDE / Oracle Cannot Initialize Shared Memory

Someone (not me! ha ha!) rebooted a server running ArcSDE 9.3 with local Oracle 10g, and afterward ArcSDE was giving these errors on startup:
ERROR in clearing lock and process tables.
Error: -51
DBMS error code: -12704
Error PL/SQL block to clean up hanging entries 
ORA-12704: character set mismatch
 
ERROR: Cannot Initialize Shared Memory (-51)

... so... I guess... the character set is wrong in Oracle? BUT NO IT'S CORRECT.
Or there's some permissions issue? NO THAT'S NOT IT EITHER.
The fix, though certainly not the recommended solution, was to delete all the files owned by the sde user in /tmp/ - in this case, just these helpful socket files:
/tmp/SDE_9.3_esri_sde_iomgr_shared_semaphore  /tmp/sde_server_to_client_FIFO_esri_sde_0
/tmp/sde_client_to_server_FIFO_esri_sde_0     /tmp/sde_server_to_client_FIFO_esri_sde_1
/tmp/sde_client_to_server_FIFO_esri_sde_1     /tmp/s.esri_sde.iomgr
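
If you find yourself equally desperate, something like this will find and remove them - assuming your SDE processes run as a user named sde and the standard sdemon tool is on your path, and with the obvious caveat that deleting another process's IPC files is exactly as sketchy as it sounds:
sdemon -o status                                       # make sure ArcSDE is actually down first
find /tmp -mindepth 1 -maxdepth 1 -user sde            # preview what would be removed
find /tmp -mindepth 1 -maxdepth 1 -user sde -delete    # remove the FIFOs, socket, and semaphore file
sdemon -o start                                        # restart ArcSDE so it can recreate them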

Users experienced some instability until ArcSDE was able to regenerate all the files, which happened after about 20 minutes. So even though this was a bad thing to do and should not have worked anyway, I offer it to you, o interwebz, in case you get as desperate as I did. At the very least, perhaps it will be reassuring to be able to find "DBMS error code: -12704" in Google. It's not just you!

Useful references I found: Administering ArcSDE for Oracle (pdf)

Thursday, September 15, 2011

Xen "blocked for 120 seconds" I/O issues on G7 blades

The issue manifests as crazy load during times of high I/O, with a very high amount of "dirty" memory in /proc/meminfo, and messages like "INFO: task syslogd:1500 blocked for more than 120 seconds" in dmesg that reference fsync further down the stack. It seems to primarily affect HP CCISS disk arrays. I'm mostly concerned about it in RedHat/CentOS, but it may be a bug for users of other distros too. When I wrote about this issue before, I was using G6 HP BL460c Blades - now I'm using G7 blades as well.
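
If you want to spot a box sliding into this state, the symptoms are easy to check from the command line (the grep patterns are just what I'd look for, adjust to taste):
grep -i dirty /proc/meminfo                          # Dirty: climbing into the hundreds of MB is a bad sign
dmesg | grep "blocked for more than 120 seconds"     # the telltale hung-task messages
uptime                                               # load average spiking while the CPUs are mostly idle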

RedHat and others claim that you just need to update the driver to the latest from HP, but this emphatically did not work for me. What does seem to work, for both the G6s and G7s, is a band-aid solution that puts a ceiling on I/O throughput and uses the noop scheduler. Additionally, for G7s, the cciss kernel module available from EL Repo seems to help (though the HP driver is still ineffective, as far as I can tell). With a 50MB/sec max in place, xen guests seem to consistently be able to write at about 47-49MB/sec. Without the max in place, xen guest I/O can be as low as 2MB/sec.

As of this moment, the stable configuration for Blade G7s seems to be:

Blade
CentOS 5.7
kernel - 2.6.18-238.12.1.el5xen (Tue May 31 2011! So old!)
with added cciss module (explained below)
using noop for scheduler (explained below)
/proc/sys/dev/raid/speed_limit_max set to 50000

Xen Guest
CentOS 5.7
kernel - 2.6.18-274.el5xen
using noop for scheduler (explained below)
No additional kernel modules, no need to set /proc/sys/dev/raid/speed_limit_max

How to... Add cciss module from EL Repo
Previously, changing the driver never seemed to help anything, but a particular version of the cciss driver that ultimately derives from this project, available at EL Repo, seems to help with the G7 blades.

modinfo cciss  # see what you have now
rpm --import http://elrepo.org/RPM-GPG-KEY-elrepo.org
rpm -Uvh http://elrepo.org/elrepo-release-5-3.el5.elrepo.noarch.rpm
yum --enablerepo=elrepo-testing install kmod-cciss-xen.x86_64
modinfo cciss  # info should have changed

My G6s are currently running without this module added and no I/O issues, so I have no advice one way or the other whether to use this on your G6s.
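
To confirm a G7's kernel is actually going to use the ELRepo module instead of the stock one, compare the version and file path before and after installing - the exact path is my assumption of where a kmod package lands, so don't take it as gospel:
modinfo -F version cciss    # version string should change after the install
modinfo -F filename cciss   # should now point at the elrepo module (e.g. under extra/ or weak-updates/)
dmesg | grep -i cciss       # after a reboot, the driver banner should report the new version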

How to... set /proc/sys/dev/raid/speed_limit_max
To set it temporarily, just until you reboot:
echo "50000" > /proc/sys/dev/raid/speed_limit_max

To set it permanently, taking effect after you reboot, add the line
dev.raid.speed_limit_max = 50000
to /etc/sysctl.conf.
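
Or do both at once - append the line and load it immediately, no reboot required:
echo "dev.raid.speed_limit_max = 50000" >> /etc/sysctl.conf   # persist it (skip if the line is already there)
sysctl -p                                                     # re-read /etc/sysctl.conf now
sysctl dev.raid.speed_limit_max                               # confirm it took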

How to... Use noop for the scheduler on both blade and Xen guests
You can temporarily set the I/O scheduler on your machine with:
echo noop > /sys/block/[your-block-device-name]/queue/scheduler
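
Reading the same file back shows the active scheduler in brackets. On the blades, the cciss array shows up in sysfs with a name like cciss!c0d0 (your device name may differ); on the guests it's xvda:
cat '/sys/block/cciss!c0d0/queue/scheduler'   # e.g. [noop] anticipatory deadline cfq
cat /sys/block/xvda/queue/scheduler           # same idea on a xen guest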

Long-term, edit /etc/grub.conf with elevator=noop so that the scheduler is always set on startup:
        title CentOS (2.6.18-274.el5xen)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-274.el5xen ro root=/dev/VolGroup00/LogVol00 elevator=noop console=xvc0
        initrd /initrd-2.6.18-274.el5xen.img
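
After rebooting, a quick way to confirm the option actually made it onto the kernel command line:
cat /proc/cmdline   # should now include elevator=noop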

What's the deal with the scheduler? Noop performs fewer transactions per second in exchange for being less of a burden on the system. Wikipedia lovingly calls it "the simplest" I/O scheduler. It's not clear to me if the reason this works is that it has fewer moving parts, as it were, to foul up with the driver, or if it's just slower, so it's effectively working like the ceiling on raid/speed_limit_max.