Work Stuff – MuchTall.com

SourceForge Spam (or How I Learned to Stop Worrying and Just Cancel My Account)

After having a SourceForge account for about 10 years, I'm canceling it for one reason and one reason alone: spam. SourceForge has this "feature" where anyone can email you at yourusername@users.sourceforge.net, and they will happily forward it on to the email address you have on file with SF. That would be great, except that I don't want spam bots emailing me. It would be nice if you could either opt out of this behavior or specify a human-readable anti-spam address as the forwarding address, but SF doesn't offer either option. Either you have an account and all the crap that comes with it, or you have nothing. So I choose nothing.

Thanks for nothing, SourceForge.

So I guess if I actually want to use the SF site for forum activity, I'll have to re-create my account each time.

Update: Ok I lied, sort of. Turns out that if you delete your SF account, you can't come back and re-create it. This sucks because I like my persona, and should I come back, I'd like to re-use it. So now I'm willing to give feedback a chance. Vote for Solution 3!: SourceForge Spam Solution

Update 10/15/2009: You complained, and SF listened. I got a Tweet last week from @sourceforge informing me that they have added an option to allow you control over your alias behavior. You can find this setting at https://sourceforge.net/account/. Go there. Now! That is all.

PROC Machine Chk processor sensor transitioned to non-recoverable

Just making a quick post to help out any Googlers out there looking for the solution to this problem. I had a Dell SC1425 pizza box server with an error/alarm light on its face recently. The alarm showed up when the server had been rebooted after a full lockup. I downloaded the Dell DSET utility and ran a report, which found this error:

PROC Machine Chk processor sensor transitioned to non-recoverable

Basically this means that the mainboard detected that the processor stopped responding (i.e., locked up), which of course we already knew.

To get rid of the alarm light, just run DSET again and tell it to clear the log. After you do so, your server lights should once again be a calming, cool blue.

Resolving a Persistent spoolsv.exe Error/Crash

Today I was troubleshooting a couple of network printers. One of them, I believed, had an incompatibility with "Standard TCP/IP" printing via RAW communications. I tried other printer port options, including the HP TCP/IP port. I found out later on that the printer itself (a multifunction copier, actually) did not have printing capability enabled. In the process of installing a printer driver later in the day, I ran into this error:

The instruction at "0x00000000" referenced memory at "0x00000000". The memory could not be "written".

The error wouldn't go away. Reboots did nothing. Clearing out all registry references to the printers and drivers that I had been troubleshooting did nothing. Deleting my printer drivers folder only made things worse (as you'll read later). Even reverting to an earlier System Restore point failed. Finally, after finding tidbits of information online here and there, none with a complete solution, I tried removing this key:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Print\Monitors\HP Standard TCP IP port

I then restarted "Print Spooler" in the Services control panel (services.msc), and my Printer control panel finally displayed properly, without crash. Unfortunately for me, I then had to delete all my printers, as I had deleted all of my printer drivers and none of the installed printers were functional.

So if you run into this error, try deleting the key above before attempting anything else.

Cisco: Error opening tftp://url (Undefined error)

Though I still know my way around, it's been a while since I messed with a Cisco router to any great degree. But today at work I had to configure a Cisco 1750 to deploy to one of our offices. I like backing up configs before I blow away a router, just so that I have a copy of what was on there. I knew that I could TFTP it to my PC, assuming I set up a TFTP server. So I went and installed Tftpd32 on my PC, set up a secondary IP on the Cisco router on my local network, and fired up the following command on the Cisco console:

copy startup-config tftp://172.16.1.142/config.txt

I then saw on Tftpd32 that the router connected and created the file "config.txt". However, the transfer failed and the resulting file was 0 bytes. This is the error I got on the console:

%Error opening tftp://172.16.1.142/config.txt (Undefined error)

Clearly, the Cisco was connecting, it just wasn't sending the content of the file. After running a standard FTP test, I noticed that the Cisco showed up to the server as the PRIMARY interface IP, not the secondary one. Once I re-assigned the primary IP address to the local IP range, the transfer succeeded.

So, long story short: If you see this error, check that the primary IP on the interface facing your network is in the same subnet as you (or the next hop), otherwise the Cisco might show up as coming from an unreachable IP.

Why "root" isn't a Domain Admin on Fedora w/smbldap-tools

For about 2 years now at work, our "root" (aka, Administrator) account hasn't been showing up as being part of the "Domain Admins" group within Windows, or when running "id root" or "net rpc user info root". It used to, but for whatever unknown reason, stopped working.

The root account in our LDAP directory was, admittedly, messed up. However, it worked on our local servers, which were talking with our PDC directly. It just didn't work on our remote BDC-connected systems.

Back around this time, I'm pretty sure we made a change to our enterprise-wide /etc/ldap.conf config: We added "root" to the nss_initgroups_ignoreusers list. The effect is that the local auth mechanisms only use the local files (passwd and group) for users in this list, skipping LDAP checks. Therefore root will never get the "Domain Admins" group membership in this configuration.
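
If you want to check whether a given account is affected, something like the following could help. This is only a sketch: the function name in_ignore_list is made up for illustration, and it assumes the option sits on a single nss_initgroups_ignoreusers line with comma-separated names.

```shell
#!/bin/sh
# in_ignore_list USER [CONF]
# Hypothetical helper: succeeds if USER is named on the
# nss_initgroups_ignoreusers line of CONF (default /etc/ldap.conf).
in_ignore_list() {
    user=$1
    conf=${2:-/etc/ldap.conf}
    # Strip the option name, split the comma-separated list, match exactly.
    sed -n 's/^[[:space:]]*nss_initgroups_ignoreusers[[:space:]]*//p' "$conf" |
        tr ',' '\n' |
        grep -qx "$user"
}
```

For example, running in_ignore_list root && echo "root skips LDAP group lookups" on one of the affected servers would confirm the cause before you go digging through the directory itself.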

Now, the question is, is this normal, or am I missing something? I really don't care at this point. The workaround for me is to simply create an admin user that gets treated as THE admin account. The alternative is to remove root from the ignore list. However, I would advise against this, as it could create startup and login delays if the LDAP database doesn't start for any reason.

Hope this helped somebody. I struggled with it and searched for a solution long enough that I figure it merits a quick post.

Squid: The request or reply is too large.

Today at work we had a complaint that one of our remote offices couldn't make a new job posting to Monster.com. Many of our offices have transparent squid implemented, and this is one of them.

The error that the user would see was:

The following error was encountered: The request or reply is too large.

According to a few places, the problem potentially lay with the reply_body_max_size setting. However, in most recent versions of Squid, this is set to unlimited by default. After some poking around in the Squid docs, I found that upping the reply_header_max_size setting from the default of 20 KB to 40 KB seemed to resolve the issue. The applicable setting is:

reply_header_max_size 40 KB

If this setting does not work for you, try upping the request_header_max_size as well, which would cause failure for similar reasons. Good luck!

100% / High CPU usage by udev and/or nscd

At work we had an FC3 system that had apparently undergone an abrupt reboot or power outage. After it came back up, the CPU usage was at 100%, so bad that I could not run top. Stopping httpd and nscd seemed to help make the system responsive, though udev kept working hard.

Deleting /var/run/nscd and /var/db/nscd, then rebooting seemed to help. Oddly enough, I restarted nscd after removing these files, and that seemed to have no effect. It wasn't until I rebooted that everything seemed to go back to normal.

Weird. Just wanted to throw that out to anyone out there having the same problem. Just delete your nscd DB files or folders and re-create the folders if necessary (Fedora's nscd init script does this automatically), then reboot.

Simple HOWTO: Linux Source-Based Routing

At work we had a network that had no internet access, except what we provided via their VPN connection that had been allowed through the firewall. Simple, until a small problem arose: our VPN server, which is not configured for NAT, has its default gateway set to its direct connection to the internet, and not to the router which serves as default gateway for the rest of the network.

Put simply, we needed to set up a source route on the VPN server that took any packets coming from 192.168.76.0/24 and redirected them to an alternate default gateway of 172.16.1.100 on eth1, instead of the default gateway on eth0.

Here's a quick description of how to do that:

# Create a custom route table
echo 200 remotesite >> /etc/iproute2/rt_tables
# Add your source network
ip rule add from 192.168.76.0/24 table remotesite
# Set the default route
ip route add default via 172.16.1.100 dev eth1 table remotesite
# Flush the route cache to immediately apply the change
ip route flush cache

200 = A table number you come up with (200 is fine, unless you have already created a 200 table)
192.168.76.0/24 = The network from which you want to redirect traffic
172.16.1.100 = The gateway that you wish to send 192.168.76.0/24 traffic to
eth1 = The interface that's local to 172.16.1.100
remotesite = A table name you come up with

Do this and, tada! You've redirected traffic from a specific network to an alternate gateway, a.k.a. source-based routing.

Now, before you go, make sure you place these lines (all but the first) in /etc/rc.local to make it persistent across reboots.
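
The first line is left out because appending to rt_tables on every boot would keep piling up duplicate entries. If you'd rather have one script that is safe to re-run, a guarded append is one option. A minimal sketch, assuming a made-up helper name add_rt_table (it is not part of iproute2):

```shell
#!/bin/sh
# add_rt_table ID NAME [FILE]
# Hypothetical helper: append "ID NAME" to the route table list
# only if that exact entry is not already there.
add_rt_table() {
    id=$1
    name=$2
    file=${3:-/etc/iproute2/rt_tables}
    # Look for a line that starts with ID followed by NAME as a whole word.
    if ! grep -Eq "^[[:space:]]*${id}[[:space:]]+${name}([[:space:]]|\$)" "$file"; then
        echo "$id $name" >> "$file"
    fi
}
```

With that in place, add_rt_table 200 remotesite can sit in rc.local next to the ip rule, ip route, and flush commands without the table list growing on each reboot.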

Broadcom BCM5754 NIC on FC5

At work we got in a brand spankin' new Dell PowerEdge SC440. And what's the problem with brand spankin' new hardware? Driver support. Fedora Core 5 didn't install drivers for our ethernet interface. As seen in lspci...

05:00.0 Ethernet controller: Broadcom Corporation Unknown device 167a (rev 02)

How do you solve this? Well, first off, you need to make sure that you have kernel 2.6.18 or higher on your system. If you don't have this yet, and you have some other means of network access, run yum update. Otherwise you'll probably have to use sneakernet to get the kernel RPM file on there. Then, after installing and rebooting into the new kernel, add this line to /etc/modprobe.conf:

alias eth0 tg3

Reboot and run netconfig.

UPDATE: You can alternatively let kudzu try to find the device itself, after you do the kernel update. First you need to make kudzu believe that this is the first time the device has been inserted. Remove these lines from /etc/sysconfig/hwconf:

-
class: NETWORK
bus: PCI
detached: 0
device: dev1804289383
driver: tg3
desc: "Broadcom Corporation Unknown device 167a"
network.hwaddr: 00:1a:a0:18:aa:98
vendorId: 14e4
deviceId: 167a
subVendorId: 1028
subDeviceId: 01df
pciType: 1
pcidom: 0
pcibus: 5
pcidev: 0
pcifn: 0

Run "rmmod tg3" to unload the module, then run "/etc/init.d/kudzu start" and "modprobe tg3". Run "ifconfig -a" and you should now see the ethernet interface. Run "netconfig" or "netconfig -d eth1" (if you have this as a secondary interface) to configure it.

MuchTallWare: Cron Runonce

I wrote up a small Perl script that will run any executable file in /etc/cron.runonce once and then remove that file. It's handy when we need to remotely deploy a change to multiple servers and make sure that we don't leave remnants of those scripts in cron. Download it and evaluate the variables for your needs. Currently it assumes you have set up a cron.1min directory.

cron-runonce
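
If you just want the gist without downloading anything, the idea boils down to a few lines. This is a shell sketch of the behavior described above, not the Perl script itself, and the function name run_once_dir is made up for illustration:

```shell
#!/bin/sh
# run_once_dir [DIR]
# Hypothetical sketch: run every executable regular file in DIR once,
# then delete it so it never runs again.
run_once_dir() {
    dir=${1:-/etc/cron.runonce}
    for job in "$dir"/*; do
        # Skip directories, non-executables, and the unexpanded glob
        # that is left over when DIR is empty.
        if [ -f "$job" ] && [ -x "$job" ]; then
            "$job"
            rm -f "$job"
        fi
    done
}
```

Called from something that fires every minute (such as a script in cron.1min), this picks up whatever you drop into the directory and cleans up after itself. Note that the sketch removes the file whether or not the job succeeded.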

MuchTallWare: fixmyself.pl

Today at work we had a situation where a server wasn't coming back on the network after a reboot. Normally this wouldn't be a big deal, but the server was across the ocean in a vastly different timezone, so troubleshooting normally has to be done in a 2-hour window.

It turned out that the sk98lin module that we were using for our NIC had been superseded/deprecated by skge. So I wanted to test out the new module on the new kernel version, but didn't have someone on the other end to reboot and type things in on the console should something go wrong. I needed a way to make the server roll back the changes and reboot if it did not see me come back to the server after 10 minutes. I didn't have any handy script to do this, so I wrote one up.

fixmyself.pl checks for a condition that you specify in the subroutine test_condition() and if the test fails (such as not finding any processes running on pts0 through 9), then it executes a response_action() subroutine. In this case it finds the changes I made, changes them back, and reboots the server.
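
For the curious, the overall shape looks roughly like this. It's a shell sketch of the idea rather than the Perl script itself: the names test_condition and response_action come from the description above, while watchdog, the who-based check, and the echo placeholder are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical sketch of the fixmyself idea, not the original script.

# Succeeds if anyone is logged in on a pseudo-terminal (e.g. over SSH).
test_condition() {
    who | grep -q 'pts/'
}

# What to do if nobody came back in time. Placeholder only: swap in
# your real rollback steps and a reboot.
response_action() {
    echo "rollback: restoring saved config and rebooting"
}

# watchdog TIMEOUT INTERVAL
# Poll test_condition every INTERVAL seconds for up to TIMEOUT seconds.
# Returns 0 if the condition was met, otherwise runs response_action
# and returns 1.
watchdog() {
    timeout=$1
    interval=$2
    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        if test_condition; then
            echo "condition met, leaving changes in place"
            return 0
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    response_action
    return 1
}
```

Calling watchdog 600 30 would give you the 10-minute window, checking every 30 seconds. You'd want to launch it detached from your own session (from cron or with nohup) so that losing your connection doesn't kill it.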

I hope you find it useful: fixmyself.pl

MuchTallWare: winfax2pyla.pl

I recently designed a deployment of Pyla/HylaFax for one of our offices. Part of this deployment required that we convert their WinFax address book(s) to Pyla's address book format. To do this I wrote up a short Perl script. You can get it here:

winfax2pyla.pl

Do me a favor and let me know if it helped you out!

Installing Fedora Core 6 using XFS

At work we use XFS primarily on / to allow us greater flexibility with file size, filesystem size, and inode limits. It had been working out great until FC6 came out. For some reason when you install an FC6 system using XFS (boot the install with "linux xfs"), the install goes great, but the system can't seem to write to the drive after it reboots. I'm not sure what the bug is all about, but it's been reported and is being discussed on Red Hat's Bugzilla (XFS on FC6).

I think I've found a workaround that seems to do the job. If you install the system with SELinux disabled (linux selinux=0 xfs), the system will boot up just fine. If you really want to re-enable SELinux, you can do so after first boot (edit /etc/selinux/config) and reboot to apply the change.
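
If you have several boxes to flip back, the config edit can be scripted. A small sketch, assuming GNU sed (for the -i flag) and a helper name, set_selinux_mode, that I made up for illustration:

```shell
#!/bin/sh
# set_selinux_mode MODE [CONF]
# Hypothetical helper: rewrite the SELINUX= line in CONF
# (default /etc/selinux/config) to the given mode.
set_selinux_mode() {
    mode=$1
    conf=${2:-/etc/selinux/config}
    # Matches only the SELINUX= line; SELINUXTYPE= is left untouched.
    sed -i "s/^SELINUX=.*/SELINUX=${mode}/" "$conf"
}
```

Running set_selinux_mode enforcing (or permissive, if you want to ease into it) and then rebooting does the same thing as the manual edit.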

Adding SATA Controller Support in initrd

Today we had a problem at work where, after adding two 400G drives and a SATA controller to a server and adding the two drives to the root logical volume, the system would not boot. It would kernel panic because LVM could not find the drives. Turns out the SATA controller driver wasn't being loaded on boot, before the drives are accessed.

The solution was to run mkinitrd with SATA device probing. It was further surmised that RPM kernel upgrades handle this automatically by reading in the required devices from modprobe.conf.

This solution saved nearly a full day of labor by avoiding a complete server re-install.

Roundup of Work Accomplishments

I've been meaning to create a blog of work accomplishments and have never quite gotten around to it, so I'm just going to start dropping them here. Here's a roundup of things that I have accomplished so far:

cleanupsmbd.pl: Wrote a workaround script to kill off smbd processes before they start to bring a server to a crawl
FC5 Deployment: Readied and bugtested FC5 installation procedures for enterprise deployment
Overcame LDAP integration issues on FC5
Identified kernel panic cause on FC5 systems (running kernel 2.6.17-1.2174 w/e1000 NIC)
Configured anaconda and prepared Kickstart for FC5 (near hands-free install)
Preliminary work on fully automating branch server configuration (from fresh install to deployment)
Wrote runonce script useful for creating things like cron.1min and cron.5min
Corrected a long-standing shorewall startup issue relating to linefeeds
Wrote a patch to fix the LDAP BDB database on startup if corruption is detected (Redhat Bugzilla 207821)
Troubleshot and adapted Samba configuration procedures to follow new printer permissions standards (deprecated printer admin option)
Identified shorewall slow (10 minute) startup cause (Redhat Bugzilla 211338)