I have two sets of files containing newspaper articles: about 20 files with about 2000 articles, and a single file with about 100 articles.
The 100 articles in the single file should not appear anywhere in the other files, but in fact each of them is duplicated exactly once, at a random position somewhere in the 20 files.
Any ideas for an easy way to find and remove the 100 duplicate articles from the 20 files without going through them by hand?
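
One possible approach, sketched in Python under some assumptions: each file has already been split into a list of article strings (how to split depends on the file format, which is not covered here), and a duplicate differs from its original at most in whitespace or letter case. The names `fingerprint` and `remove_duplicates` are made up for illustration. The idea is to hash a normalized copy of each of the 100 articles, then drop any article from the 20 files whose hash is in that set.

```python
import hashlib
import re


def fingerprint(article: str) -> str:
    """Hash an article after collapsing whitespace and lowercasing it."""
    normalized = re.sub(r"\s+", " ", article).strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def remove_duplicates(articles, known_duplicates):
    """Split `articles` into (kept, removed) by matching against `known_duplicates`.

    Both arguments are lists of article strings; parsing the files into
    such lists is assumed to happen elsewhere.
    """
    unwanted = {fingerprint(a) for a in known_duplicates}
    kept, removed = [], []
    for article in articles:
        if fingerprint(article) in unwanted:
            removed.append(article)
        else:
            kept.append(article)
    return kept, removed
```

Running `remove_duplicates` once per file and checking that the `removed` lists add up to 100 articles overall would be a quick sanity check. If the copies differ by more than whitespace or case (edited headlines, for instance), exact hashing will miss them and some fuzzy comparison would be needed instead.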