I've been thinking of using a picture-diff tool such as PDiff to test OpenGL code, by taking snapshots, saving them to disk and comparing with previous regression output. That way, really bad stuff (missing textures) pop up but humanly unnoticable things (such as the above mentioned small diffs between implementation) go through fine.
Also, for automating user interaction, either the GUI classes should be sufficiently open for you to send events or call 'clicked' on a button, or you have to inject OS events manually to your apps. This is possible, but much more troublesome. Might be easier to open up the GUI layer, if it's Open Source.