In our Fuzzing book (by Takanen, DeMott, Miller) we have several chapters dedicated for metrics and coverage in negative testing (robustness, reliability, grammar testing, fuzzing, many names for the same thing). Also I tried to summarize most important aspects in our company whitepaper here:
http://www.codenomicon.com/products/coverage.shtml
Snippet from there:
Coverage can be seen as the sum of two features, precision and accuracy. Precision is concerned with protocol coverage. The precision of testing is determined by how well the tests cover the different protocol messages, message structures, tags and data definitions. Accuracy, on the other hand, measures how accurately the tests can find bugs within different protocol areas. Therefore, accuracy can be regarded as a form of anomaly coverage. However, precision and accuracy are fairly abstract terms, thus, we will need to look at more specific metrics for evaluating coverage.
The first coverage analysis aspect is related to the attack surface. Test requirement analysis always starts off by identifying the interfaces that need testing. The number of different interfaces and the protocols they implement in various layers set the requirements for the fuzzers. Each protocol, file format, or API might require its own type of fuzzer, depending on the security requirements.
Second coverage metric is related to the specification that a fuzzer supports. This type of metric is easy to use with model-based fuzzers, as the basis of the tool is formed by the specifications used to create the fuzzer, and therefore they are easy to list. A model-based fuzzer should cover the entire specification. Whereas, mutation-based fuzzers do not necessarily fully cover the specification, as implementing or including one message exchange sample from a specification does not guarantee that the entire specification is covered. Typically when a mutation-based fuzzer claims specification support, it means it is interoperable with test targets implementing the specification.
Especially regarding protocol fuzzing, the third-most critical metric is the level of statefulness of the selected Fuzzing approach. An entirely random fuzzer will typically only test the first messages in complex stateful protocols. The more state-aware the fuzzing approach you are using is, the deeper the fuzzer can go in complex protocols exchanges. The statefulness is a difficult requirement to define for Fuzzing tools, as it is more a metric for defining the quality of the used protocol model, and can, thus, only be verified by running the tests.
I hope this was helpful. We also have studies in other metrics such as looking at code coverage and other more or less useless data. ;) Metrics is a great topic for a thesis. Email me at [email protected] if you are interested to get access to our extensive research on this topic.