This might be more of a serverfault.com question but a) it doesn't exist yet and b) I need more rep for when it does :~)

My employer has a few hundred servers (all *NIX) spread across several locations. As I suspect is common, we don't really know how many servers we have: more than once I've been surprised to find a server that's been up for 5 years, apparently doing nothing but elevating the earth's temperature slightly. We have a number of databases that store bits of server information -- Puppet, Cobbler, Nagios, Cacti, our load balancers, DNS, various internal spreadsheets and so on -- but it's all very disparate, incomplete and overlapping. Maintaining this mess costs time and money.

So, I'd like to come up with a single database which holds details of what each server is (hardware specs, role, etc.) and replaces (or at least supplies data for) the databases mentioned above. The database and web interface are likely to be a Rails app, as this is what I have most experience with. I'm more of a sysadmin than a coder.
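
To give an idea of the sort of thing I'm picturing, here's a very rough first stab at a servers table (the column names are just placeholders, not a worked-out design):

    class CreateServers < ActiveRecord::Migration
      def self.up
        create_table :servers do |t|
          t.string  :hostname, :null => false    # canonical name
          t.string  :serial_number               # from dmidecode / iLO
          t.string  :manufacturer                # "HP", "Dell", ...
          t.string  :model
          t.string  :role                        # "web", "db", "mystery heater"
          t.string  :location                    # site / rack
          t.integer :ram_mb
          t.timestamps
        end
        add_index :servers, :hostname, :unique => true
      end

      def self.down
        drop_table :servers
      end
    end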

Has this problem already been solved? I can't find any open source software that really fits the bill and I'm generally not too keen on bloaty, GUI vendor-supplied solutions.

How should I implement the device information collection bit? For instance, it'd be great to have the database update device records when disks are added or removed, or when the server serial number changes because HP replaces the board. This information comes from many different sources: dmidecode, command-line disk tools, SNMP against the server or its onboard lights-out card, and so on. I could expose all this through custom scripts and net-snmp, or I could run a local poller that reported the information back to the central DB (maybe via a RESTful interface or something). It must be easily extensible.
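
To make the poller idea concrete, something along these lines is what I have in mind (a sketch only; the URL and field names are invented):

    #!/usr/bin/env ruby
    # Minimal poller sketch: gather a few facts and push them to a central
    # inventory app over HTTP. The URL and JSON field names are placeholders.
    require 'net/http'
    require 'json'
    require 'socket'

    def dmi(keyword)
      # dmidecode -s prints a single value, e.g. the chassis serial number
      `dmidecode -s #{keyword}`.strip
    end

    facts = {
      'hostname'      => Socket.gethostname,
      'serial_number' => dmi('system-serial-number'),
      'manufacturer'  => dmi('system-manufacturer'),
      'product'       => dmi('system-product-name')
    }

    uri = URI.parse('http://inventory.example.com/servers')   # placeholder URL
    request = Net::HTTP::Post.new(uri.path, 'Content-Type' => 'application/json')
    request.body = facts.to_json
    Net::HTTP.new(uri.host, uri.port).request(request)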

Have you done this? How? Tell me your experiences, discoveries, mistakes and recommendations!

A: 

I'm contemplating how to go about this myself, and I think this is one of the key pieces of infrastructure whose absence keeps us in the dark ages. Hopefully this will be a popular question on serverfault.com. :)

It's not just a matter of installing a single tool to collect this data (that isn't possible cheaply anyway); ideally you want everything from the hardware up to the applications on the network feeding into this thing.

I think the only approach that makes sense is a modular one. The range of devices and types of information is too disparate to come under a single tool. Also the collection of data needs to be as passive and asynchronous as possible - the reality of running infrastructure means that there will be interruptions and you can't rely on being able to get the data at all times.

I think the tools you've pointed out form something of an ecosystem that could work together - Cobbler can install from bare-metal and hand over to Puppet, which has support for generating Nagios configs, and storing configs in a database; for me only Cacti is a bit opaque in terms of programmatically inserting new devices, templates, etc., but I know this is possible.

Ultimately you have to sit down and work out which pieces of information are important for the business you work for, and design a DB schema around that. Then work out how to get the information you need into the DB, whether it's from Facter, Nagios, Cacti, or direct SNMP calls.

Since you asked about collection of data, I think if you have quite disparate kit (Dell, HP etc.) then it makes sense to create a library to abstract away the differences between them as much as possible, so your scripts just make standard calls such as "checkdiskhealth". When you add new hardware you can add to the library rather than having to write a completely new script.
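
As a rough illustration of that idea (the vendor detection and commands below are simplified guesses, not a tested library):

    # Sketch of a thin hardware-abstraction layer: callers ask generic questions,
    # vendor-specific classes hide the tool differences underneath.
    class HardwareInfo
      def self.for_this_host
        vendor = `dmidecode -s system-manufacturer`.strip
        case vendor
        when /HP|Hewlett/i then HpHardware.new
        when /Dell/i       then DellHardware.new
        else                    GenericHardware.new
        end
      end
    end

    class HpHardware
      def check_disk_health
        `hpacucli ctrl all show status`        # HP's array CLI (if installed)
      end
    end

    class DellHardware
      def check_disk_health
        `omreport storage pdisk controller=0`  # Dell OMSA (if installed)
      end
    end

    class GenericHardware
      def check_disk_health
        `smartctl -H /dev/sda`                 # fall back to SMART
      end
    end

    puts HardwareInfo.for_this_host.check_disk_health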

Andrew Hodgson
Thanks for your reply; it sounds like you're thinking along the same lines as I am. Judging by the lack of responses, it seems this question might be more appropriate for serverfault.com and that I should just start making stuff and return when I have specific coding problems :~) Re: Cacti: I find it hard to add things manually, let alone programmatically. Munin is a lot easier to automate, if a little less pretty.
markdrayton
Re: Cacti; yep, I hear ya! It needs a bit of work, but I think they are taking it in a new direction as it approaches 1.0. I haven't played with Munin much, but isn't it host-based? That's fine except when the box is down and you want to look at recent stats.
Andrew Hodgson
A: 

Sounds like a common problem that larger organizations would have. I know our sysadmin (we're a 50-person company) has a little Access database of information about every server, license, and piece of hardware installed. He's very meticulous, and when it comes time to replace or repair hardware, he knows everything about it from his little DB.

You and your organization could sponsor an open source project to get you what you need, and give back to the community so that additional features (that you may not need now) can be developed at no cost to you.

Jeff Fritz
+2  A: 

My team have been dumping all our systems into RDF for a month or two now. We have the systems implementation people create the initial data in Excel, which is then transformed to N3 (RDF) using Perl.

We view the data in Gruff (http://www.franz.com/downloads.lhtml) and keep the resulting RDF in Allegro (a triple store from the same guys that do Gruff).

It's incredibly simple and flexible - no schema means we simply augment the data on the fly, and with a wide variety of RDF viewers and reasoning engines the presentation options are endless.

The best part for me? No coding: just create triples, throw them in the store, then view them as graphs.
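
To give a flavour of what the triples look like, here's a tiny sketch (Ruby purely for illustration -- our real pipeline is Perl -- and the namespace and predicates are invented):

    # Emit a few N3 triples describing one server. Everything here is made up
    # for illustration; real predicates would come from your own vocabulary.
    host = {
      'serial'   => 'CZ12345678',
      'role'     => 'webserver',
      'location' => 'rack 12, London'
    }

    prefix  = '@prefix inv: <http://example.com/inventory#> .'
    subject = 'inv:web01'

    triples = host.map do |predicate, object|
      "#{subject} inv:#{predicate} \"#{object}\" ."
    end

    puts prefix
    puts triples

    # Produces, e.g.:
    #   @prefix inv: <http://example.com/inventory#> .
    #   inv:web01 inv:serial "CZ12345678" .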

A: 

Maybe a simple web service? Just something that accepts a machine name or IP address. When the service gets input, it sticks it in a queue and kicks off a task to collect the data from the machine that notified it. The nature of the task (SNMP interrogation, remote call to a Perl script, whatever) could be stored as part of the machine information in the database. If the task fails, the machine ID stays in the queue and the machine is periodically re-polled until the information is collected. Of course, you also have to have some kind of monitor running on your servers to notice that something has changed and send the notification; hopefully this is easily accomplished with whatever server monitoring software you've already got in place.
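
A bare-bones sketch of that idea (Sinatra used purely as an example; the path and parameter names are made up):

    # Tiny notification endpoint: a machine (or a monitor watching it) POSTs its
    # name, we queue it, and a worker thread polls it for details later.
    require 'sinatra'
    require 'thread'

    def collect_data_from(hostname)
      # placeholder -- look up the collection method for this host and run it
      # (SNMP interrogation, remote call to a script, whatever fits)
    end

    COLLECTION_QUEUE = Queue.new

    # Worker: pull hostnames off the queue and collect their data.
    Thread.new do
      loop do
        hostname = COLLECTION_QUEUE.pop          # blocks until something arrives
        begin
          collect_data_from(hostname)
        rescue StandardError
          COLLECTION_QUEUE.push(hostname)        # failed? leave it queued for retry
          sleep 60
        end
      end
    end

    post '/notify' do
      COLLECTION_QUEUE.push(params[:hostname])
      "queued\n"
    end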

TMN
+3  A: 

This sounds like a problem looking for an LDAP solution. LDAP is designed for this kind of thing: a catalog of items that is optimized for data searches and retrieval (but not necessarily writes). There are many LDAP servers to choose from (OpenLDAP, Sun's OpenDS, and Microsoft Active Directory, to name a few), and I've seen LDAP used to catalog servers before. LDAP is very standardized, and a "database" of information that is usually searched or read, but not frequently updated, is LDAP's strong suit.
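
Once the entries exist, querying them is only a few lines with any LDAP client library. For example, a sketch using the Ruby net-ldap gem (the base DN and attributes here are invented):

    # Look up server entries in a hypothetical "ou=servers" subtree.
    require 'net/ldap'   # gem install net-ldap

    ldap = Net::LDAP.new(:host => 'ldap.example.com', :port => 389)

    filter = Net::LDAP::Filter.eq('objectClass', 'device')
    ldap.search(:base => 'ou=servers,dc=example,dc=com', :filter => filter) do |entry|
      puts "#{entry.dn}: #{entry[:description]}"
    end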

Ogre Psalm33
Hm. I don't have any experience with LDAP but maybe now's the time to learn. Thanks!
markdrayton
LDAP is simply a hierarchical database, as opposed to a relational one. It does not provide any support for actively monitoring machines. It might be a good choice for storing info acquired from monitoring, but that depends on the situation.
Todd Stout
A: 

There are some solutions from the big vendors for managing monstrous sets of machines - such as some of the Tivoli stuff from IBM. That is probably, however, overkill for mere hundreds of machines.

Jonathan Leffler
A: 

There are some free software server database solutions, but I do not know if they provide hooks to update information automatically from the machines with dmidecode or SNMP. One I have heard about (but have no personal experience with, sorry) is GLPI.

bortzmeyer
+1  A: 

The collection of detailed machine information is a very frustrating problem (many vendors want to keep it this way). Even if you can spend a large amount of money, you probably will not find a simple solution to this problem. IBM and HP offer products that achieve what you are seeking, but they are very, very expensive, and will leave a bad taste in your mouth once you realize that probably all you needed was 40-50% of the functionality they offer.

You say that you need to monitor *NIX servers; most (if not all) Unixes support RFC 1514 (Windows also supports this RFC as of Windows 2000). The Host MIB support defined by RFC 1514 has its drawbacks, however. Since it is SNMP based, it requires that SNMP be enabled on the machine, which is typically not the default for Unix and Windows machines. The reason for this is that SNMP was created before the entire world was using the Internet, and thus the old, crusty nature of its security is of concern. In many environments, this may not be acceptable for security reasons. However, if you are only dealing with machines behind the firewall, this might not be an issue (I suspect this is true in your case).

Several years ago, I was working on a product that monitored hundreds of Unix and Windows machines. At the time, I did extensive research into the mechanics of how to acquire detailed information from each machine, such as disk info, running processes, installed software, up-time, memory pressure, and CPU and IO load (including network), without running a custom client on each machine. This info can be collected in a centralized fashion. As of three or four years ago, the RFC 1514 Host MIB spec was the only "standard" for acquiring detailed real-time machine info without resorting to OS-specific software. Sun and Microsoft announced a web-service-based initiative many years ago to address some of this, but I suspect it never gained any traction, since I cannot at the moment even remember its marketing name.

I should mention that RFC 1514 is certainly no panacea. You are at the mercy of the OS-provided SNMP service unless you have the luxury of deploying a custom info-collecting client to each machine. The RFC 1514 spec dictates that several parameters are optional, and if your target OS does not implement them, then you are back to custom code to provide the information.
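
To make that concrete, polling a couple of Host MIB values from a central box can be as simple as wrapping the net-snmp command-line tools (a sketch; the host and community string are placeholders, and the values only come back if the agent actually implements them):

    # Query a few Host Resources MIB (RFC 1514 / RFC 2790) values via snmpget.
    HOST      = 'server1.example.com'   # placeholder
    COMMUNITY = 'public'                # placeholder

    OIDS = {
      'uptime'    => '1.3.6.1.2.1.25.1.1.0',   # hrSystemUptime
      'memory_kb' => '1.3.6.1.2.1.25.2.2.0'    # hrMemorySize
    }

    OIDS.each do |name, oid|
      value = `snmpget -v2c -c #{COMMUNITY} -Ovq #{HOST} #{oid}`.strip
      puts "#{name}: #{value}"
    end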

Todd Stout
If you don't want to spend the big bucks, and have a little time to do a small amount of custom dev, look into the RFC-1514 MIB and the plethora of free SNMP stacks available. I dislike SNMP, but the IETF never did anything about this decades ago, so we are stuck with it as a "standard".
Todd Stout
Experts in this area have informed me (via serverfault) that RFC-1514 has been replaced by RFC 2790.
Todd Stout
A: 

I believe you are looking for Zabbix. It's open source and easy to install and use. I installed it for a client a few years ago, and if I remember right it has a client application that connects to the Zabbix server to update it with the requested information. I really recommend it: http://www.zabbix.com

Eduardo