How to remove duplicate paragraphs in multiple files? | ansaurus

Q:

How to remove duplicate paragraphs in multiple files?

I have two sets of files with newspaper articles in them: about 20 files with about 2000 articles, and one file with about 100 articles.

The 100 articles in the single file should be disjoint from the others, but are in fact duplicated randomly throughout the 20 files (once each).

Any ideas for an easy way to find and remove the 100 duplicate articles from the 20 files without going through them by hand?

A:

I have two questions: 1. Does each article have a start tag and an end tag? If not, how can we know where an article starts and ends? 2. Can we copy all the articles into one file and look for the duplicate articles there?

Dracoder 2009-10-21 01:22:20

There are start and end tags. We could copy everything to one file, but then we would have to put the articles back where they belong afterwards: there is one file named consumer, one named sports, etc., and all the sports articles need to go in the sports file.

Tom Hagen 2009-10-21 07:12:52

I am sorry, I can't find a good way to solve this issue. Maybe you need to write a little program to do it.

Dracoder 2009-10-23 00:51:48
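
For what it's worth, the "little program" could look roughly like the Python sketch below. It processes each of the 20 files separately, so nothing has to be merged and put back afterwards. The thread never says what the start and end tags look like, so `<article>` and `</article>` are placeholders, and it assumes a duplicate matches the original word for word apart from whitespace. The function names are made up for illustration.

```python
import re
from pathlib import Path

# Hypothetical tag names: the thread confirms that start and end tags exist
# but never shows them. Replace these with the real ones.
START_TAG = "<article>"
END_TAG = "</article>"

# One article = start tag, body (non-greedy), end tag, optional trailing newline.
ARTICLE_RE = re.compile(
    re.escape(START_TAG) + r"(.*?)" + re.escape(END_TAG) + r"\n?", re.DOTALL
)


def normalize(body):
    """Collapse whitespace so re-wrapped copies of an article still compare equal."""
    return " ".join(body.split())


def load_article_keys(text):
    """Return the set of normalized article bodies found in text."""
    return {normalize(m.group(1)) for m in ARTICLE_RE.finditer(text)}


def strip_duplicates(text, duplicate_keys):
    """Return (cleaned_text, removed_count) for the text of one category file."""
    removed = 0

    def replace(match):
        nonlocal removed
        if normalize(match.group(1)) in duplicate_keys:
            removed += 1
            return ""  # drop the whole article, tags included
        return match.group(0)  # leave every other article untouched

    return ARTICLE_RE.sub(replace, text), removed


def clean_files(single_file, category_files):
    """Rewrite each category file in place, saving a .bak copy before changing it."""
    keys = load_article_keys(Path(single_file).read_text(encoding="utf-8"))
    for name in category_files:
        path = Path(name)
        original = path.read_text(encoding="utf-8")
        cleaned, removed = strip_duplicates(original, keys)
        if removed:
            path.with_name(path.name + ".bak").write_text(original, encoding="utf-8")
            path.write_text(cleaned, encoding="utf-8")
        print(f"{name}: removed {removed} article(s)")
```

A call such as `clean_files("extra.txt", ["sports.txt", "consumer.txt"])` (file names invented here) would then strip the 100 articles from whichever category files contain them. If the duplicates differ by more than whitespace, for example in headline or byline, the `normalize` step would need to be loosened, or replaced with a fuzzier comparison.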
