Because a SAN environment contains so many moving parts, from the storage subsystem, Fibre Channel HBAs, SFPs and optical cabling, through the FC and Ethernet switches, all the way to the metadata disks and Quantum's StorNext file system, it can hardly be called trivial. Every now and then you may struggle with performance issues that can be caused by any of these components.
In such a complex environment you need solid troubleshooting skills or you will end up tearing your hair out. Although there are many parts involved, this article focuses on the StorNext metadata side, as I have seen this issue a few times now and users ended up recreating the file system.
Your SAN is working as expected with decent performance; then suddenly you hear from your users that browsing a folder takes a very long time. You cannot reproduce it by clicking through the directories yourself, but later another user complains about the same issue.
Where to start with troubleshooting? Where to look first?
Usually, in a clustered file system environment, the metadata travels over Ethernet, and browsing a directory tree generates no I/Os on the FC fabric, so the fabric can be ruled out as the root cause.
That said, graphical file managers like Windows Explorer or the macOS Finder may read the first few bytes of each file to obtain header information, which does generate I/Os. To also rule out an issue with the file manager on that particular client, repeat the check from the command line/terminal. Since the OS caches directory entries, browse through a few different directories first to push the cached information out. Listing the slow directory again should then show whether the file manager is the culprit or not.
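A terminal check along these lines can separate file-manager overhead from real metadata latency. The directory below is a stand-in created on the fly so the sketch runs anywhere; in practice, point SLOWDIR at the directory your users complain about:

```shell
# Stand-in for the slow StorNext directory (hypothetical); in practice
# set SLOWDIR to the real path, e.g. a directory on your StorNext mount.
SLOWDIR=${SLOWDIR:-$(mktemp -d)}
touch "$SLOWDIR/sample.txt"

# Walk an unrelated tree first so cached entries get pushed out...
ls -lR /etc > /dev/null 2>&1

# ...then time a plain listing of the suspect directory.
time ls -la "$SLOWDIR" > /dev/null
```

If the plain `ls` is fast while the graphical file manager is slow, the file manager (reading file headers) is the problem; if `ls` is slow too, keep digging on the metadata side.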
If the file manager has been ruled out, the next step is to run the same listing on your MDC. If it shows the same behavior as on the client, even though no Ethernet component (aside from the loopback device) is involved, everything points to the metadata rather than the network, which leaves us with two possibilities:
- something is wrong with the metadata itself
- a slow or bad disk/array
You can most likely rule out file fragmentation at this point: even a heavily fragmented file will not slow down a directory listing, because only the inode is queried for the file name, location and time stamps, not for the number of extents the file has been split into.
- A raw read test with dd or a similar tool against the metadata device should show whether the raw performance is consistent.
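As a rough sketch of such a raw read test: in real life METADEV would be your metadata LUN (run read-only, and double-check you have the right device). The version below substitutes a temporary file for the device so it is safe to run anywhere:

```shell
# METADEV would normally be the metadata LUN, e.g. /dev/mapper/snfs_meta
# (hypothetical name). A temp file stands in here so nothing real is touched.
METADEV=$(mktemp)
dd if=/dev/zero of="$METADEV" bs=1M count=64 2>/dev/null

# Read the device sequentially; dd's last status line reports throughput.
STATS=$(dd if="$METADEV" of=/dev/null bs=1M 2>&1 | tail -n 1)
echo "$STATS"

rm -f "$METADEV"
```

Repeat the read a few times; a healthy array delivers stable, consistent throughput, while a failing disk or degraded RAID set shows erratic or poor numbers.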
- So it is the metadata itself. You can try stopping and starting the file system, but in general that will not fix the issue. You are facing what, for lack of a better term, we will call metadata fragmentation.
I urge the reader to consult the user manual before attempting this procedure. And as always, a backup is a good idea.
1. Unmount the active file system from all clients.
2. Run 'cvadmin -e "stop <FsName>"'
3. Run 'cvfsck -j <FsName>' to replay the journal; this should not be required, but it does not hurt.
4. Run 'cvfsck <FsName>' and expect no errors; stop here on any errors and get in touch with support.
5. Create a snapshot of the metadata with snmetadump, writing it to a location with enough free space.
6. Being very cautious helps: verify the metadump with snmetadump.
7. Make a copy of that healthy metadump.
8. Run the optimization process against the dump: 'snmetadump -am <FsName> -f <metadata file>'
This performs an optimization (maintenance) pass on the snapshot file:
-a  apply updates and optimize the metadata dump
-m  perform an SNMS format recovery
In certain circumstances you have to run two separate passes, first with -m and then with -a.
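Wrapped into a small shell function, the preparation phase of the steps above might look like the following sketch. The function name, arguments and paths are assumptions, and the snapshot/verify flags for snmetadump vary by StorNext release, so that step is left as a placeholder (check the manual for your version):

```shell
# Sketch only -- FS name, dump path and the snapshot step are assumptions;
# consult the StorNext manual for your release before running anything.
optimize_snfs_metadata() {
    fsname="$1"          # StorNext file system name
    dumpfile="$2"        # path of the metadata dump file

    cvadmin -e "stop $fsname"      || return 1
    cvfsck -j "$fsname"            || return 1   # replay the journal
    cvfsck "$fsname"               || return 1   # must report no errors

    # Create and verify the metadump with snmetadump here; the exact
    # capture/verify flags depend on your StorNext release (see the manual).

    cp "$dumpfile" "$dumpfile.bak" || return 1   # keep a healthy copy

    # Optimization (maintenance) pass; some cases need -m first, then -a.
    snmetadump -am "$fsname" -f "$dumpfile"
}
```

The function only bundles the commands already listed above; it stops at the first failing step instead of blindly continuing.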
Assuming the optimization process was successful and no errors showed, restore the optimized dump.
You are now overwriting the actual metadata disks. Be certain that you have a good backup of your metadata dump file!
1. Run 'snmetadump -r <FsName> -f <metadata file>'
2. Run 'cvadmin -e "start <FsName>"'
3. Run 'cvadmin -e "activate <FsName>"'
4. Mount the file system and verify its content.
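The restore phase can be sketched the same way; again, the function name and arguments are placeholders, and this step overwrites the live metadata disks, so keep the backup copy of the dump safe:

```shell
# Sketch only -- verify every step against the StorNext manual first.
restore_snfs_metadata() {
    fsname="$1"
    dumpfile="$2"

    # This overwrites the actual metadata disks -- keep a backup of the dump!
    snmetadump -r "$fsname" -f "$dumpfile" || return 1
    cvadmin -e "start $fsname"             || return 1
    cvadmin -e "activate $fsname"
}
```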
List the directories again and browse through the tree from the client side. In most cases the issue will have vanished.
The key here is the SNMS format recovery, which appears to realign and optimize the layout of the inodes.
Anyone choosing to follow this procedure acknowledges that they do so at their own risk and that I am not liable for any damage.