I'm writing a program that deletes consecutive duplicate words from a text file and then replaces the file with the de-duplicated text. I know my current code doesn't handle the case where a duplicate word sits at the end of one line and the beginning of the next, because I read each line into an ArrayList, find the duplicates, and remove them. After writing it, though, I wasn't sure this was an OK way to do it, since now I don't know how to write the result back out. I'm not sure how to keep track of the punctuation at the beginning and end of sentences, the correct spacing, and the line breaks from the original text file. Is there a way to handle those things (spacing, punctuation, etc.) with what I have so far, or do I need a redesign? The other thing I thought of was returning an array of the indices of the words that need deleting, but I wasn't sure that's much better. Anyway, here is my code (thanks in advance!):

/** Removes consecutive duplicate words from text files.
It accepts exactly one argument, which is either a text file
or a directory.  It finds all text files in the directory and
its subdirectories and removes duplicate words from those files
as well.  It replaces the original file. */

import java.io.*;
import java.util.*;

public class RemoveDuplicates {

    public static void main(String[] args) {


        if (args.length != 1) {
            System.out.println("Program accepts one command-line argument.  Exiting!");
            System.exit(1);
        }
        File f = new File(args[0]);
        if (!f.exists()) {
            System.out.println("Does not exist!");
        }

        else if (f.isDirectory()) {
            System.out.println("is directory");

        }
        else if (f.isFile()) {
            System.out.println("is file");
            String fileName = f.toString();
            RemoveDuplicates dup = new RemoveDuplicates(f);
            dup.showTextFile();
            List<String> noDuplicates = dup.doDeleteDuplicates();
            showTextFile(noDuplicates);
            //writeOutputFile(fileName, noDuplicates);
        }
        else {
            System.out.println("Shouldn't happen");
        }   
    }

    /** Reads in each line of the passed in .txt file into the lineOfWords array. */
    public RemoveDuplicates(File fin) {
        lineOfWords = new ArrayList<String>();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader in = new BufferedReader(new FileReader(fin))) {
            for (String s; (s = in.readLine()) != null; ) {
                lineOfWords.add(s);
            }
        }
        catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void showTextFile() {
        for (String s : lineOfWords) {
            System.out.println(s);
        }
    }

    public static void showTextFile(List<String> list) {
        for (String s : list) {
            System.out.print(s);
        }
    }

    public List<String> doDeleteDuplicates() {
        List<String> noDup = new ArrayList<String>(); // List to be returned without duplicates
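        // Go through each line and split it into words. The "+" sits outside the
        // character class so a run of spaces/punctuation counts as one separator
        // instead of producing empty strings between the words.
        for (String s : lineOfWords) {
            String[] endString = s.split("[\\s\\p{Punct}]+");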
            // add each word to the arraylist
            for (String word : endString) {
                noDup.add(word);
            }
        }
        for (int i = 0; i < noDup.size() - 1; i++) {
            if (noDup.get(i).toUpperCase().equals(noDup.get(i + 1).toUpperCase())) {
                System.out.println("Removing: " + noDup.get(i+1));
                noDup.remove(i + 1);
                i--;
            }
        }
        return noDup;
    }

    public static void writeOutputFile(String fileName, List<String> newData) {
        try {
            PrintWriter outputFile = new PrintWriter(new BufferedWriter(new FileWriter(fileName)));
            for (String str : newData) {
                outputFile.print(str + " ");
            }
            outputFile.close();
        }
        catch (IOException e) {
            e.printStackTrace();
        }
    }

    private List<String> lineOfWords;
}

An example.txt:

Hello hello this is a test test in order
order to see if it deletes duplicates Duplicates words.
A: 

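How about something like this? I'm assuming the comparison should be case-insensitive. The \b word boundaries keep the pattern from matching inside a word (without them, the "is is" in "this is" would count as a duplicate). It only matches duplicates separated by a single space, so the "order" pair split across the line break is left alone.

    // Needs: import java.util.regex.Matcher; import java.util.regex.Pattern;
    Pattern p = Pattern.compile("\\b(\\w+) \\1\\b");
    String line = "Hello hello this is a test test in order\norder to see if it deletes duplicates Duplicates words.";

    // Match against an upper-cased copy so the comparison ignores case;
    // for this ASCII text the offsets line up with the original string.
    Matcher m = p.matcher(line.toUpperCase());

    StringBuilder sb = new StringBuilder(1000);
    int idx = 0;

    while (m.find()) {
        // Copy everything up to the end of the first word, then skip its duplicate
        sb.append(line.substring(idx, m.end(1)));
        idx = m.end();
    }
    // Copy whatever follows the last match
    sb.append(line.substring(idx));

    System.out.println(sb.toString());

Here's the output:

Hello this is a test in order
order to see if it deletes duplicates words.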
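
The snippet above only looks for duplicates separated by a single space. One possible variation, assuming it is acceptable to read the whole file into a single string, lets the separator be any whitespace (including a line break) and writes the result back to the same file. This is only a sketch: the names `DedupSketch`, `removeDuplicates`, and `rewriteFile` are made up for illustration, and the file is assumed to be UTF-8.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Pattern;

class DedupSketch {
    // A whole word followed by one or more repeats of itself, separated by any
    // whitespace (spaces, tabs, line breaks). CASE_INSENSITIVE also makes the
    // \1 backreference ignore case.
    private static final Pattern DUPLICATE =
            Pattern.compile("\\b(\\w+)(?:\\s+\\1\\b)+", Pattern.CASE_INSENSITIVE);

    /** Keeps only the first word of each run of repeated words. */
    static String removeDuplicates(String text) {
        return DUPLICATE.matcher(text).replaceAll("$1");
    }

    /** Hypothetical helper: rewrites the file in place (assumes UTF-8 text). */
    static void rewriteFile(Path file) throws IOException {
        String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        Files.write(file, removeDuplicates(text).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 1) {
            rewriteFile(Paths.get(args[0]));
        } else {
            System.out.println(removeDuplicates("Hello hello this is a test test in order\norder to see."));
        }
    }
}
```

Because the whitespace in front of a repeated word is dropped together with the word, a duplicate that straddles a line break joins the two lines. Punctuation and all other spacing are left exactly as they were, since nothing is split and re-joined.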
limc
Can you explain your code more, starting with the sb.append part? I'm not sure exactly how it works. Thanks.
Crystal
The "1" in m.end(1) refers to the first capturing group in the regex (the part in parentheses). m.end(1) returns the offset just past the last character of that group, while m.end() returns the offset just past the entire match. Basically, I'm skipping everything between m.end(1) and m.end(), which is the space plus the duplicate of the text between m.start(1) and m.end(1). I don't use m.start(1) here because I don't need it. Hope this helps.
limc
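
To make those offsets concrete, here is a small sketch that reports them for a short sample string. The class name `GroupOffsets`, the helper `firstMatchOffsets`, and the sample text are invented for illustration. The pattern here adds `\\b` word boundaries so that it only matches whole words.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupOffsets {
    /** Returns {start(1), end(1), end()} for the first duplicate found, or null if there is none. */
    static int[] firstMatchOffsets(String text) {
        Matcher m = Pattern.compile("\\b(\\w+) \\1\\b").matcher(text.toUpperCase());
        if (!m.find()) {
            return null;
        }
        return new int[] { m.start(1), m.end(1), m.end() };
    }

    public static void main(String[] args) {
        String text = "a test test in";
        int[] o = firstMatchOffsets(text);
        // Group 1 is the first "test"; the full match is "test test".
        System.out.println("group 1: [" + o[0] + ", " + o[1] + "), match ends at " + o[2]);
        System.out.println("kept:    \"" + text.substring(0, o[1]) + "\"");
        System.out.println("skipped: \"" + text.substring(o[1], o[2]) + "\"");
    }
}
```

For this sample, group 1 spans offsets 2 to 6 and the match ends at 11, so the loop keeps "a test" and skips " test" (the space and the repeated word).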