I want to write a Java tool that scans the HTML pages of an existing site and, for any image that has no alt attribute, inserts alt="" into that image's tag. One approach is to use an HTML parser (such as HtmlCleaner) to build a DOM, add the alt attribute to the images in the DOM, and then write the HTML back out.
However, this approach won't keep the original HTML intact and will probably cause unpredictable side effects, especially since the number of existing pages is huge and there is no guarantee that they are well-formed.
Is there a safer way to accomplish this, i.e. one that keeps the original HTML intact and only adds the missing alt attribute?
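
To make the goal concrete, here is a rough sketch of the kind of text-level edit I have in mind. It is only an illustration under stated assumptions, not a vetted solution: the class name `AltInserter` and the method `addMissingAlt` are hypothetical, it uses only `java.util.regex`, and it assumes that attribute quotes inside `<img>` tags are balanced. It would also touch `<img>` text that appears inside comments or scripts.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: edits only <img> tags and copies everything else verbatim.
final class AltInserter {
    // An <img ...> tag; quoted attribute values are allowed to contain '>'.
    private static final Pattern IMG_TAG = Pattern.compile(
            "<img\\b(?:[^>\"']|\"[^\"]*\"|'[^']*')*>", Pattern.CASE_INSENSITIVE);

    // A quoted attribute value, blanked out before looking for attribute names.
    private static final Pattern QUOTED = Pattern.compile("\"[^\"]*\"|'[^']*'");

    // An attribute literally named "alt" (not data-alt, and not text inside a value).
    private static final Pattern ALT_ATTR = Pattern.compile(
            "\\salt(?=[\\s=/>])", Pattern.CASE_INSENSITIVE);

    static String addMissingAlt(String html) {
        Matcher m = IMG_TAG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String tag = m.group();
            // Empty the quoted values so "alt" inside a value is not mistaken for an attribute.
            String unquoted = QUOTED.matcher(tag).replaceAll("\"\"");
            if (!ALT_ATTR.matcher(unquoted).find()) {
                // Insert right after "<img" so both ">" and "/>" endings stay untouched.
                tag = tag.substring(0, 4) + " alt=\"\"" + tag.substring(4);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(tag));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Called as `AltInserter.addMissingAlt(pageSource)`, this returns the page text with every character outside the modified `<img>` tags left exactly as it was, which is the property I am after. What I don't know is whether something like this is robust enough for a large set of possibly malformed pages.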