Validation of TSV file in Java

That depends on your definition of a TSV file.

Do all rows have the same number of columns? Or is it possible to omit trailing empty columns?

If all rows must have the same number of columns, you can do a first validation on that. If it fails, you know the file is not valid.
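A column-count check could be sketched as follows. This is only an illustration: the class and method names are made up for this answer, it assumes the file has already been read into a list of lines, and it assumes no quoting (so every tab is a separator).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
final class TsvColumnCheck {

    /**
     * Returns the 1-based numbers of the lines whose column count differs
     * from that of the first line. An empty result means every line agrees.
     */
    static List<Integer> inconsistentLines(List<String> lines) {
        List<Integer> bad = new ArrayList<>();
        if (lines.isEmpty()) {
            return bad;
        }
        // The limit -1 keeps trailing empty columns; without it split() drops them.
        int expected = lines.get(0).split("\t", -1).length;
        for (int i = 1; i < lines.size(); i++) {
            int actual = lines.get(i).split("\t", -1).length;
            if (actual != expected) {
                bad.add(i + 1);
            }
        }
        return bad;
    }
}
```

If your files are allowed to omit trailing empty columns, the comparison would have to become "no more columns than the first line" instead of strict equality.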

Do all files have a header row? If so, you can use it to answer the question above (the header gives you the expected column count) and to validate the parsing of each row.
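As a rough sketch of using the header this way, the first line can define the expected width and the key names for every following row. The names here are invented for illustration, and the code assumes unquoted fields and unique header names (a duplicate name would silently overwrite an earlier column in the map).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of any library.
final class TsvHeaderParser {

    /** Parses lines into one map per data row, keyed by the header names. */
    static List<Map<String, String>> parse(List<String> lines) {
        if (lines.isEmpty()) {
            throw new IllegalArgumentException("missing header row");
        }
        String[] header = lines.get(0).split("\t", -1);
        List<Map<String, String>> rows = new ArrayList<>();
        for (int i = 1; i < lines.size(); i++) {
            String[] cells = lines.get(i).split("\t", -1);
            // Reject the whole input as soon as one row disagrees with the header.
            if (cells.length != header.length) {
                throw new IllegalArgumentException("line " + (i + 1) + ": expected "
                        + header.length + " columns but found " + cells.length);
            }
            Map<String, String> row = new LinkedHashMap<>();
            for (int c = 0; c < header.length; c++) {
                row.put(header[c], cells[c]);
            }
            rows.add(row);
        }
        return rows;
    }
}
```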

Is quoting allowed? If so, is it allowed to place carriage returns or tabs inside the quotes? (This will not necessarily help with validation, but you will have to think about it when parsing.)

Is your text strictly text? You can test for non-printable characters and reject the file on that basis. Be careful here about the character encoding used for the file (UTF vs. ASCII, etc.). This check can be placed in the code that does the first parsing from flat files into a data structure (a list of maps, for example).
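One way to sketch this check, assuming the files are supposed to be UTF-8 (substitute whatever encoding yours really use), is to decode strictly so that malformed bytes raise an error, and then scan for control characters other than tab, CR and LF. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class TsvTextCheck {

    /** Decodes the bytes as UTF-8 and fails on malformed input instead of substituting. */
    static String decodeStrictUtf8(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    /** True if the text contains a control character other than tab, CR or LF. */
    static boolean hasUnexpectedControlChars(String text) {
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (Character.isISOControl(ch) && ch != '\t' && ch != '\n' && ch != '\r') {
                return true;
            }
        }
        return false;
    }
}
```

Note that `Character.isISOControl` only covers the C0 and C1 control ranges; what counts as "non-printable" beyond that is a decision you have to make for your own data.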

Drilling further into the file itself: if it has a fixed format, or the type of some of the data is known, you can write a secondary parser to validate that data (dates, timestamps or other fixed-format strings). This second level can be done once you have discovered more about the content and are processing the data from the structure built above.
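For a date column, such a second-level check might look like the sketch below. It assumes the column is meant to hold ISO-style yyyy-MM-dd dates, which is purely an assumption for the example, and the class name is hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;

// Hypothetical helper, not part of any library.
final class TsvFieldCheck {

    // "uuuu" is used instead of "yyyy" because strict resolution of
    // year-of-era ("yyyy") would also require an era field.
    private static final DateTimeFormatter ISO_DAY =
            DateTimeFormatter.ofPattern("uuuu-MM-dd").withResolverStyle(ResolverStyle.STRICT);

    /** True if the value is a real calendar date in yyyy-MM-dd form. */
    static boolean isValidDate(String value) {
        if (value == null) {
            return false;
        }
        try {
            LocalDate.parse(value, ISO_DAY);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }
}
```

Strict resolution matters here: without it, an impossible date such as 29 February in a non-leap year would be quietly adjusted to the last valid day of the month instead of being rejected.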

All of the above are empirical checks, so you must expect false positives (invalid files that pass) to slip through. A false negative (a valid file being rejected) should not happen if you pick rules that your input files MUST adhere to. Therefore, all along the processing stack, expect to encounter invalid data and be prepared to reject the whole input file. In other words, never assume that passing these tests gives complete assurance that the file is correct.

I hope this helps.
