If you’re using raw access to storage LUNs with VMWare, and you’re using Windows, you can use the LSI Logic SAS virtual SCSI adapter option and create virtual drives. This is better than using the Microsoft iSCSI initiator because you can edit the drive mappings with the machine powered off and you can clone the machine and easily redirect all of your storage before powering the machine on.
The correct driver to be using is the LSI SAS 1068 driver. You’ll need to make a floppy image using an image tool: if you have access to a Linux box, just use dd to create the image, then mount it and write files to it. If you’re on Windows, the venerable WinImage and other utilities exist. Either way, you’ll need to rename the file with a .flp extension and mount it on boot in your Windows VM in order to load the driver to see the drives.
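If dd isn’t handy, the blank image can be produced just as easily from a script. This is a minimal sketch, assuming a standard 1.44 MB floppy geometry (2,880 sectors of 512 bytes) and a made-up output name of lsi_sas.flp. It only creates the zero-filled file, so formatting it and copying the driver files onto it is still a job for your OS tools.

```python
from pathlib import Path

SECTOR_BYTES = 512
SECTORS_1440K = 2880  # standard 1.44 MB floppy: 80 tracks x 2 heads x 18 sectors


def make_blank_floppy(path):
    """Write a zero-filled 1.44 MB image, like dd if=/dev/zero bs=512 count=2880."""
    image = Path(path)
    image.write_bytes(bytes(SECTOR_BYTES * SECTORS_1440K))
    return image


if __name__ == "__main__":
    img = make_blank_floppy("lsi_sas.flp")  # hypothetical file name
    print(img, img.stat().st_size, "bytes")
```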
I’ve been getting some pretty darned good performance with that and iSCSI LUNs on my Solaris server. I haven’t (yet) put together a decent test and some metrics to back it, but the machines on raw device LUNs feel a *lot* snappier than the machines that are on a 400GB VMFS. A good basic tutorial with iSCSI and ZFS is here: Running ZFS Over iSCSI as a vmware VMFS store — but note that I’m using raw LUNs after not being happy with the VMFS performance with a half-dozen hosts doing heavy I/O.
VMWare explorations…
GhettoVCB – VCB for free. Doesn’t get better than that.
Understanding VMWare Snapshots – Also, it’s probably a good idea to learn this stuff.
Biggest Problem in Airbus A380: Software
Software problems are the #1 thing that will keep an Airbus A380 on the ground. Yes, airplanes are complicated things … but at the same time, not much is required to keep most of them in the air.
What speaks volumes to me about these problems is a few key quotes.
Clark says that the problem with the nuisance warnings has been their diverse nature, but “the common thread” is the software. He says Airbus executive vice-president programmes Tom Williams and his team “have sat in my office many times and said they can’t identify trends, which is the worst possible thing”.
Clark blames the software’s design. “There was a philosophy of utopia – I suspect that Airbus was blessed with some boffins who said ‘we’ve got to make this absolutely perfect – no flexibility’. The slightest surge causes one [sensor] to trip and then six more as they’re all linked,” he says.
Anyone willing to take guesses about the type of architecture and software developers at Airbus?
In the last issue of my current consulting saga, Detecting and Resolving LAMP Stack Performance Problems, we talked about a Drupal site that was being brought offline every few hours due to poor tuning of the LAMP stack. With the default settings, a site isn’t going to take much before it just falls flat on its face.
After triaging and addressing the main issues based on the logs, we were left with two more issues. The first was the inability of Drupal to perform well in an environment where it had to rebuild every page from source for every page view. This is well documented in the Drupal community; there are many pages in the documentation area of Drupal that deal with caching and performance optimization. The second issue was MySQL performance and the long table lock/scan times we were seeing on some queries that could not be further optimized.
We scheduled a 2 hour downtime with the customer to install some tools. Our checklist was installing memcached and PHP-APC. I also wanted to take the time to back up the MySQL database and run a good check_table on each of the MyISAM tables. (Yes, I know. MyISAM. More on that later.)
Side note: I would typically prefer xcache, which in my mind is superior to APC because I have an easier time working with it and prefer its management interface and tuning parameters. However, APC was available as a binary package for the platform we were on, and xcache was not. To make things faster and easier, we chose APC. Despite the endless debate about which is superior, both are usable and work. I have not run into problems using APC on an 8-core system, despite oft-reported-but-never-proven flock() issues.
APC was fast to install and required minimal tuning. It produced a noticeable performance improvement. However, the number of deadlocked apache threads (and total number of apache threads) went up, and the other Apache errors that dealt with clients timing out did not cease.
We installed the Drupal Memcache implementation along with the appropriate PECL module. We configured two pools, both using up to 1 GB of RAM (which we had to spare on the web server.) The ‘hot’ pool would mostly handle cached pages for non-logged-in users, and the other one would handle some higher volume caching for users that are logged in, as well as some internal/custom functionality to go along with specialized RSS feed parsing. (Side note: We found that the Cache and Cacherouter plugins did not work as expected. Rather than waste downtime troubleshooting them, we used what worked.)
Again, we saw a huge performance boost. We needed to do some tuning (changing certain cache settings and analyzing performance), but that was essentially everything that we could find to do from a single-server web server side of things.
While we’re on the topic of Drupal: Don’t forget that Drupal has a ‘cron’ program that should be getting called remotely. It’s sort of a poor man’s cron solution, but it works. It was causing our load to spike every 20 minutes. We occasionally disabled it during testing to be sure we understood its effects.
The next beast to tackle was the database. As previously mentioned, it was on MyISAM tables. Obviously, this isn’t ideal. We found that node lookups, statistics lookups, and searches were taking up a disproportionate amount of server time. The weirdest part was that we were seeing some full table scans in the slow query log (i.e. 3 million rows scanned) but a later ‘explain’ statement couldn’t replicate the performance recorded in the slow query log.
We batted around adding indexes. The issue was that Drupal’s search and nodes tables are frequently altered, which means the indexes become scrambled quickly. And really, what was taking time was the size of the table we were dealing with — the table wouldn’t fit in memory, so it was copying it to a disk temporary table and then doing a filesort.
Running check_table did the trick to re-sort the indexes and ‘defrag’ the files, but the benefits only lasted so long.
What we ended up doing was taking the database down, dumping everything out to a SQL file, and re-importing everything to InnoDB. Make sure that innodb_file_per_table is enabled, or you might end up with some unexpectedly big files; this depends on your architecture and filesystem. Remember that InnoDB files can not currently shrink. (Also: You can do the table changes online, but it’s really not recommended. It takes a long time, especially when some of your tables are larger than 1 GB.) Don’t forget to set innodb_buffer_pool_size appropriately.
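One way to handle the re-import is to rewrite the storage engine in the dump file before loading it. This is a rough sketch, assuming a mysqldump-style file where each CREATE TABLE statement closes with a line starting ") ENGINE=MyISAM". The function name is made up for illustration, and this isn’t necessarily how it was done on this gig.

```python
import re

# Match the engine clause that mysqldump writes on the closing line of CREATE TABLE.
_ENGINE_RE = re.compile(r"^(\)\s*ENGINE=)MyISAM\b", re.IGNORECASE | re.MULTILINE)


def myisam_to_innodb(dump_sql):
    """Return the dump text with table definitions switched to InnoDB.

    Only lines that start with ') ENGINE=MyISAM' are touched, so row data
    that happens to contain the same string is left alone.
    """
    return _ENGINE_RE.sub(r"\g<1>InnoDB", dump_sql)


if __name__ == "__main__":
    sample = "CREATE TABLE `node` (\n  `nid` int(10) NOT NULL\n) ENGINE=MyISAM;\n"
    print(myisam_to_innodb(sample))
```

Try it on a copy of the dump and diff the result before importing anything; tables that lean on MyISAM-only features will still need attention by hand.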
The change to InnoDB, the caching of both PHP opcode at the engine level and fully built pages, and the careful tuning of Apache and MySQL parameters led to stability for this client.
There were some further problems, but they were with an unrelated product that causes a nightly load spike on the database machine. Tomorrow night I’ll be covering the cleanup work: NFS iops vs. local disk, binary logging and the lack of backups in the original configuration, and building some redundancy into the system so that it can tolerate faults more smoothly.
As sysadmins, we sometimes run into performance problems with multiple angles and portions. It’s sometimes not particularly obvious where the actual performance problem is, and resolving one problem that you can see might bring another couple of problems to the surface.
The below comes from a consulting gig that I’ve been working on recently. The parties will remain nameless. I’m going to break this into several parts, since it took over three weeks to resolve all of the immediate problems with the site, and we’re still not all the way done with the task list.
Going in, I knew that we were dealing with a heavily loaded Drupal site that shared a MySQL database with a wiki and a forum. The site would go down at random times, sometimes multiple times per hour. Upon logging into the server the first time, it seemed slow, so I immediately called ‘uptime’ and the answer came back with all three time period load averages over 90 on an 8-core server. There were 125 Apache processes running, but most of them were in Deadlocked state. The very second command I ran on the server was killall -9 httpd, which is never the way you want to start out a consulting gig…
While that was busy killing off processes, I checked the Apache configuration. Sure enough, it was still at the stock settings. I immediately cranked up the requests per process to 20,000 and upped the server limit to 300. (Remember, we’re dealing with prefork here.) I restarted Apache and watched it churn. It handled the load far more gracefully with some room to move around, and I quickly saw the number of Apache processes spike, and then sink down to about 80 and stay there.
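A ceiling like 300 only helps if the box has the RAM to back it. The back-of-the-envelope check looks something like the sketch below. The memory figures are made-up placeholders, not measurements from this server, so plug in your own numbers from top or ps.

```python
def max_prefork_children(total_ram_mb, reserved_mb, per_child_mb):
    """Estimate how many prefork children fit in RAM without swapping.

    reserved_mb is what you hold back for the OS and other daemons;
    per_child_mb is the typical resident size of one httpd child.
    """
    if per_child_mb <= 0:
        raise ValueError("per_child_mb must be positive")
    usable = max(total_ram_mb - reserved_mb, 0)
    return usable // per_child_mb


if __name__ == "__main__":
    # Hypothetical numbers: 16 GB box, 2 GB held back, 45 MB per child.
    print(max_prefork_children(16384, 2048, 45))
```

If the estimate comes out lower than the server limit you were about to set, the limit is too high for the hardware.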
The next step was looking through the logs. A quick aside about logs: I like my logs to be clean. I don’t like debug messages, I don’t like status messages, and I don’t want to see either of them. If I have a lot of a certain type of status message that I *do* want to trap, I make sure that syslog puts it into its own file or I handle the problem that’s causing it. In this case, /var/log/messages had a bunch of SNMP messages logging each get, and some messages about martian packets. The martian packets issue could be (and was) resolved with a quick firewall tweak to reject packets from an illegal source. The SNMP issue was resolved by editing snmpd’s startup configuration to log to local1 instead of the default (check your man page for snmpd to make sure you get the right flags, it’s changed…), and then editing syslog’s configuration to log everything on local1 to /var/log/snmpd. And don’t forget to add it to logrotate!
Now we were down to two classes of errors. The first was obvious and sort of easy to troubleshoot: “MySQL server has gone away.” Log into the MySQL server. See if there are slow-running queries. Nope? Well, double check the timeout that’s set in /etc/my.cnf. On this server, slow-query-time was set to twenty seconds, but timeout was set to ten seconds. Well, that’s not very useful. Also, check your caches and table types. In this case, everything was MyISAM. More on that later; for now, just make sure you’re using the right kind of caching strategy for your table type and system specs, which in this case is MyISAM key cache (and lots of it!). Try to fit all of your most-used tables in memory.
On this gig, we got the site back on its feet with these things. Downtime went from multiple events an hour down to one or two events per six hour period. Unfortunately, we were also out of easy things to change. Next time I post, we’ll start to get into fixes that will cause downtime.
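That kind of mismatch is easy to check for mechanically. This is a small sketch that assumes the two settings are MySQL’s long_query_time and wait_timeout under [mysqld]. Which of MySQL’s several timeout variables was actually involved on this server is my assumption, so swap in the right name. It also assumes a simple my.cnf with no !include lines, which Python’s configparser would choke on.

```python
import configparser


def timeout_mismatch(cnf_text):
    """Return True when wait_timeout is shorter than long_query_time."""
    parser = configparser.ConfigParser(
        allow_no_value=True,  # my.cnf allows bare flags such as skip-name-resolve
        strict=False,         # tolerate repeated options
        interpolation=None,   # treat '%' in values literally
    )
    parser.read_string(cnf_text)
    section = parser["mysqld"]
    # Fall back to MySQL's documented defaults when a setting is absent.
    long_query = float(section.get("long_query_time", fallback="10"))
    wait = float(section.get("wait_timeout", fallback="28800"))
    return wait < long_query


if __name__ == "__main__":
    cnf = "[mysqld]\nlong_query_time = 20\nwait_timeout = 10\n"
    print(timeout_mismatch(cnf))
```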
Sun/Oracle OpenWorld & Flash Storage
At the Sun OpenWorld conference keynote today, there were a few new products listed in the Flash storage arena, most notably the F5100 that everyone’s jibber-jabbering about.
As a smaller customer, I’m far more interested in the SunFlash F20 PCIe card — which I don’t see many people blogging about. Looks like I could add that to not only my existing systems, but non-Sun systems that can make use of that sort of storage. That, ladies and germs, is something worth the name “OpenWorld” — as in, a world of open wallets.
Reading/Googling List
- vSphere, SAN or iSCSI-related:
- Using iSCSI with vSphere – Pretty much the bible, they covered it all.
- 2TB drives are here, but Stephen Foskett identifies the issues with bringing them to the enterprise. In the same vein, he covers the death of RAID as a storage technology, and what lies beyond.
- I need to research if our iSCSI TOE cards are supported by vSphere…
- Ubiquitous Talk might be my new favorite high-quality techie blog.
- Other:
- Streaming live webcams to your iPhone
- This has been linked all over, but the Ghost Fleet of the Recession is anchored just off of Singapore, and it doesn’t look like it’s going anywhere soon. Sis wondered why she hadn’t seen the Florida in port recently; she ships a lot of containers with Maersk.
Dude, you’re not getting a Dell.
Despite the recent pot-banging around the Sun/Oracle merger and the allegations that Sun’s getting its customer base stolen out from under it, I just pushed the button on a fairly large cluster with Sun as the hardware vendor.
Simply put, I couldn’t find machines with better stats for the money. Even with the academic matching grant program tabled for now, we STILL got amazing promotional pricing on the x4150. I can’t even find anything that can compare to an x4250 for on-board storage — 16 on-board drives. Dell’s MD1000 chassis supports only .. 15 drives. There’s no better hardware to run Solaris on. The Sun ILOM support is leagues better than Dell’s DRAC or even HP’s ILO. All the machines come with at least four on-board ethernet ports. The storage array options are also superior. No one else sells a 24 slot SATA chassis with hot-swap drives backed with three controllers.
Simply put, the Sun option was the fastest, most scalable option. The hardware is put together well, with the same sort of build quality you’ve come to expect from HP… far superior to Dell or IBM. And the management and tuning options are awesome. I’m really, really excited to see the hardware racked in a few weeks. They also maintain and stock a parts “locker”/cache on our campus so that a technician has access to all the parts they might need for our systems without having to courier them or drive for them.
Am I concerned about Sun going away? Not now that they’ve been bought by Oracle. They’ve got so many compelling offerings, and I hope that IW and other tech rags stop trashing Sun — I’m a fanboy from here on out.
Struggling with Budget SAN Speed
It’s Monday morning. Your boss strolls into your office. You just finished with the trouble tickets from the weekend, and this is his favorite time to ruin your entire week. He says, “I have a project for you. I need a cluster with a primary and backup SAN that is going to store about 8TB of infrequently accessed images and it will also need to host virtual machines and an Oracle database. You’ll have to fit a budget for two sites in there, but the second site is a cold, hourly-synch backup. And it has to scale. And we’d prefer if you used a vendor solution and didn’t homebrew things.”
Talk about a list of contradictory feature requests! You’ve got a limited budget, it’s hard to squeak 8 usable TB out of your average entry level 12-disk arrays (i.e. HP MSA60 or Sun J4200 disk array, with Dell’s MD1000 15 disks and AC&NC’s 516-series with 16 disks being notable exceptions) when you factor in a double parity stripe and a couple of hot spares.
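The arithmetic is worth spelling out. This is a quick sketch, assuming 1 TB disks, one double-parity stripe (two parity disks), and two hot spares. The exact layout is my assumption for illustration, and filesystem overhead is ignored.

```python
def usable_tb(disks, disk_tb, parity_disks=2, hot_spares=2):
    """Raw usable capacity of one double-parity stripe, before filesystem overhead."""
    data_disks = disks - parity_disks - hot_spares
    if data_disks <= 0:
        raise ValueError("not enough disks for that layout")
    return data_disks * disk_tb


if __name__ == "__main__":
    for name, slots in [("12-disk shelf", 12), ("15-disk shelf", 15), ("16-disk shelf", 16)]:
        print(name, usable_tb(slots, 1.0), "TB")
```

With 12 slots, that layout leaves 8 data disks, so you land on exactly 8 TB before the filesystem takes its cut and with no room to grow.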
In most cases, you’ll do just fine. What happens when the load on the infrequently accessed (slow) portion of the array is ‘peaky’ though? During one of those peaks, you’ll max out a gigabit-per-second line in; depending on what you’ve got driving the array, that might be your entire bandwidth budget. What’s Oracle, which is also running in one of the VMs, going to do then? It doesn’t like having slow access to its log files, which means it’ll be consuming RAM and swapping heavily on its VM, which means the VM image will also be trying to write to disk. Triple-whammy until something gives: either load decreases or something fails.
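To put a number on how long a peak can hog the wire, here is a simple estimate. It assumes decimal units (1 GB is 8 gigabits) and an optional efficiency factor for protocol overhead. The 10 GB burst in the example is a hypothetical size, not a measurement.

```python
def transfer_seconds(size_gb, link_gbps, efficiency=1.0):
    """Seconds to move size_gb gigabytes over a link_gbps gigabit/s link."""
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    return size_gb * 8 / (link_gbps * efficiency)


if __name__ == "__main__":
    # Hypothetical burst: one 10 GB write over a single gigabit link.
    print(transfer_seconds(10, 1), "seconds at line rate")
```

Even at a perfect line rate, that one burst owns the link for well over a minute, and everything else that needs the array waits behind it.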
The obvious choice is ZFS and Solaris. And the obvious choice for hardware is also Sun; you get four NICs by default on Sun hardware, with management ports and ILO ports out of band on their own interfaces. (Side note: When you have 7 Cat5e cables, a KVM dongle, and 2 power cords running to a 1u chassis, yes, you really do want the cable management arm.) ZFS support with Sun is excellent. Their storage products are also excellent.
By the time you get done buying storage, you’re through most of your budget; those 1TB disks aren’t cheap, and neither are the arrays themselves. Your maximum speed across the SAS backplane for the J4200 or J4400 series is going to be 3 or 6 Gb/s, and your input is only going to be 1 Gb/s actual even with bonded ports, but you’d probably rather not skip all over the place on the array as you try to write 3,000 10GB (compressed) images and then try to write to the Oracle logs. The question still remains: how do you squeeze in a budget for some faster storage for the VM images and database storage while still paying for the bulk storage you need and some room to grow?
Answer: What are you using to drive that array? Buy a bigger chassis, and put it inboard. The 2.5″ 10k SAS drives aren’t hideously expensive, and the additional grand for a larger chassis beats the hell out of buying an entire extra J4200. Note that you can’t mix the 10k SAS disks in an array with the 7200 SATA disks… on any vendor that I know of, at least. But inboard on the system’s backplane, you can run SAS and then run SATA on the outside.
Bonus points: This may not last, and it might just be the academic pricing that we get at work, but right now I can buy a half-full J4400 (24 disks) for less than I can buy a fully-loaded J4200. Guess which we’re getting? It’ll be half full of blanks, but those are free. As our 8TB grows over the next year, we’re going to just slot additional disks in. ZFS’s ability to add disks to pools relatively painlessly has made this a realistic goal. ZFS also has a built-in management server (which we’ll restrict to our private network and people will have to VPN in, but that’s trivial…) which makes management’s acceptance of the technology dead simple.
Also, don’t forget that if you can acquire some SSDs, you’ll be able to drive your storage even faster by offloading the ZFS log writes to the much-faster SSD. They have a limited lifespan, though, so consider if it’s really worth it to you and make sure that you plan for their obsolescence and replacement considering that a log buffer is a r/w-intensive application.
Our total server budget for this project (a compute-/storage-intensive academic project where data loss is not acceptable) was only $70k total. We managed to squeeze an insanely fast cluster out of a paltry budget.
I usually wait a few weeks, until the first .1 patch is released, to update to a new version of OS X. That’s been my procedure for a while, and it seems that there are the usual growing pains with 10.6. I don’t know if I could keep myself from giving in to 10.7’s charms right away…