views:

86

answers:

1

I have two sets of files with newspaper articles in them; about 20 files with about 2000 articles, and 1 file with about 100 articles.

The 100 articles in the single file should be disjoint from the others, but are in fact duplicated randomly throughout the 20 files (once each).

Any ideas for an easy way to find and remove the 100 duplicate articles from the 20 files without going through them by hand?

A: 

I have two questions: 1. Does each article have a start tag and an end tag? If not, how can we know the start and end positions of an article? 2. Can we copy all the articles into one file and try to find the duplicate articles?

Dracoder
There are start and end tags. We could copy everything into one file, but then we would have to put the articles back where they belong afterwards; i.e., there is one file named consumer, one named sports, etc., and all the sports articles need to go in the sports file.
Tom Hagen
I am sorry, I can't find a good way to solve this issue. Maybe you need to write a little program to do it.
Dracoder
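Since the articles are delimited by start and end tags, a small script can avoid the copy-and-redistribute problem entirely: fingerprint each of the 100 articles in the single file, then edit each of the 20 files in place, dropping any article whose fingerprint matches. A minimal Python sketch, assuming a hypothetical `<article>…</article>` tag pair (substitute whatever tags your files actually use):

```python
import hashlib
import re
from pathlib import Path

# Hypothetical tag names -- replace with the real start/end tags in your files.
ARTICLE_RE = re.compile(r"<article>.*?</article>", re.DOTALL)

def normalize(text):
    # Collapse whitespace so trivial formatting differences don't defeat matching.
    return " ".join(text.split())

def fingerprint(article):
    return hashlib.sha256(normalize(article).encode("utf-8")).hexdigest()

def load_fingerprints(single_file):
    """Fingerprint every article in the file of ~100 known duplicates."""
    text = Path(single_file).read_text(encoding="utf-8")
    return {fingerprint(a) for a in ARTICLE_RE.findall(text)}

def remove_duplicates(big_file, dupes):
    """Rewrite big_file with duplicate articles removed; return removal count."""
    text = Path(big_file).read_text(encoding="utf-8")
    removed = 0

    def keep_or_drop(match):
        nonlocal removed
        if fingerprint(match.group(0)) in dupes:
            removed += 1
            return ""  # drop the duplicate article
        return match.group(0)

    Path(big_file).write_text(ARTICLE_RE.sub(keep_or_drop, text),
                              encoding="utf-8")
    return removed
```

Because each of the 20 files is edited in place, the surviving articles stay in their original files (sports articles stay in the sports file), so no re-sorting is needed afterwards. Hashing a normalized copy of each article means the match tolerates whitespace-only differences; if the duplicates may differ in more than whitespace, a fuzzier comparison would be needed.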