views:

78

answers:

4

Hey everyone,

First of all, I understand that it's unusual that I want to up-convert like this, but please bear with me. We get these csv files via website export and we have no options to get it in any other form.

Now, onto the question:

I have this old code that will do this process for me. It basically reads each line, then picks out each value between the ,s. This worked great for some samples that I converted, but when it came down to working with the samples given, some values were out of place.

I opened the files in Notepad++ and realized that some of the cells, themselves, contained ,s. CSV files work around this by surrounding these cells with "s. See examples below:

.

This would work fine:

John,Smith,[email protected],burgers

This, however, would not:

John,Smith,[email protected],"burgers, french fries"

.

Unfortunately, my code (String strar[] = thisLine.split(",");) doesn't take into account that some cells contain commas, and will separate them into different columns like: "burgers and french fries".

.

How can I get my program to efficiently treat text surrounded by "s as a single value, not two separate ones?

.

Let me know if I can clear anything else up for you guys.

Many thanks for the help,

Justian

A: 

Rather than depend on split(), write your own parser to handle this situation. Have your grammar treat all the characters between a pair of " " or ' ' as a single token.

duffymo
+3  A: 

Just check the CSV char by char and set a toggle whenever a quote occurs. Here's a kickoff example:

public static List<List<String>> parseCsv(InputStream input, char separator) 
    throws IOException 
{
    BufferedReader reader = null;
    List<List<String>> csv = new ArrayList<List<String>>();
    try {
        reader = new BufferedReader(new InputStreamReader(input, "UTF-8"));
        for (String record; (record = reader.readLine()) != null;) {
            boolean quoted = false;
            StringBuilder fieldBuilder = new StringBuilder();
            List<String> fields = new ArrayList<String>();
            for (int i = 0; i < record.length(); i++) {
                char c = record.charAt(i);
                fieldBuilder.append(c);
                if (c == '"') {
                    quoted = !quoted;
                }
                if ((!quoted && c == separator) || i + 1 == record.length()) {
                    fields.add(fieldBuilder.toString().replaceAll(separator + "$", "")
                        .replaceAll("^\"|\"$", "").replace("\"\"", "\"").trim());
                    fieldBuilder = new StringBuilder();
                }
            }
            csv.add(fields);
        }
    } finally {
        if (reader != null) try { reader.close(); } catch (IOException logOrIgnore) {}
    }
    return csv;
}

You can however also just grab any 3rd party Java CSV API which may have some more features and so on.

BalusC
Nice code, but won't deal with embedded newlines inside a field... +1 to the 'get a 3rd-party implementation' suggestion, where hopefully the 3rd-party has dealt with all this stuff so you don't have to.
Cowan
This doesn't look very efficient, but it does look like the only way, doesn't it =/. I'll take a look at this tomorrow and make my final decision. Thanks!
Justian Meyer
Try to write something more efficient and let us know :)
BalusC
A: 

I know this doesn't really help you with your immediate problem, but my advice is: don't do it at all. You'll be very lucky if you get away with just dealing with embedded commas. What about embedded double quotes? Embedded line breaks? etc. etc...

Quite honestly? The answer is to find a library which parses CSVs and use that. I'm pretty sure nearly every single developer in the world has fallen into the "oh, CSV is such a simple format, I'll parse it myself" trick. I know I have.

There's a great post about the problems with roll-your-own CSV Parsers which I love referring people to (I'm cruel like that). It's a .NET-related post, but it still applies to your situation. Note that you're only up to step #2 of 5... there's a lot to go.

Cowan
Haha! Funny article. It's true though, there's no point in re-inventing the wheel. Unfortunately, all my searches have lead to a dead end and I need to get this done :P.
Justian Meyer
+1  A: 

Managed to answer my own question. With a bit of searching, I managed to find this little pdf here:

http://www.objectmentor.com/resources/articles/tfd.pdf

From there, I managed to adopt the code on page 35 to work with my program. All credit goes to Jeff Langr, 2001. All I did was make it work with some of Java's current standards.

Here's the code for all the people who may encounter this problem in the future.

import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;

public class CSVReader {

    private BufferedReader reader;
    private String line;
    private static final String DOUBLE_QUOTE = "\"";
    private static final String COMMENT_SYMBOL = "#";
    private static final char stateINIT = 'S';
    private static final char stateCOMMENT = '#';
    private static final char stateQUOTED_DATA = 'q';
    private static final char stateQUOTE_IN_QUOTED_DATA = 'Q';
    private static final char stateDATA = 'D';
    private static final char stateNEW_TOKEN = 'N';
    private static final char stateWHITESPACE = 'W';

    public CSVReader(String filename) throws IOException {
        reader = new BufferedReader(new java.io.FileReader(filename));
        loadNextNonCommentLine();
    }

    public ArrayList<String> next() throws IOException {
        if (line == null)
            throw new IOException("Read past end of file");
        ArrayList<String> columns = columnsFromCSVRecord(line);
        loadNextNonCommentLine();
        return columns;
    }

    public boolean hasNext() {
        return line != null;
    }

    void loadNextNonCommentLine() throws IOException {
        do
            line = reader.readLine();
        while (line != null && line.startsWith(COMMENT_SYMBOL));
        if (line == null)
            reader.close();
    }

    public ArrayList<String> columnsFromCSVRecord(String line) throws IOException {
        char state = stateINIT;
        char ch;
        int i = 0;
        ArrayList<String> tokens = new ArrayList<String>();
        StringBuffer buffer = new StringBuffer();
        char[] charArray = line.toCharArray();
        while (i < charArray.length) {
            ch = charArray[i++];
            switch (state) {
            case stateINIT:
                switch (ch) {
                case '"':
                    buffer.append(ch);
                    state = stateQUOTED_DATA;
                    break;
                case ',':
                    state = stateNEW_TOKEN;
                    tokens.add(clean(buffer));
                    buffer = new StringBuffer();
                    break;
                case '\t':
                case ' ':
                    break;
                case '#':
                    state = stateCOMMENT;
                    break;
                default:
                    state = stateDATA;
                    buffer.append(ch);
                    break;
                }
                break;
            case stateCOMMENT:
                break;
            case stateQUOTED_DATA:
                switch (ch) {
                case '"':
                    buffer.append(ch);
                    state = stateQUOTE_IN_QUOTED_DATA;
                    break;
                default:
                    buffer.append(ch);
                    break;
                }
                break;
            case stateQUOTE_IN_QUOTED_DATA:
                switch (ch) {
                case '"':
                    state = stateQUOTED_DATA;
                    break;
                case ',':
                    state = stateNEW_TOKEN;
                    tokens.add(clean(buffer));
                    buffer = new StringBuffer();
                    break;
                case ' ':
                case '\t':
                    break;
                case '#':
                    tokens.add(clean(buffer));
                    state = stateCOMMENT;
                    break;
                default:
                    throw new IOException("badly formed CSV record:" + line);
                }
                break;
            case stateDATA:
                switch (ch) {
                case '#':
                    tokens.add(clean(buffer));
                    state = stateCOMMENT;
                    break;
                case ',':
                    state = stateNEW_TOKEN;
                    tokens.add(clean(buffer));
                    buffer = new StringBuffer();
                    break;
                default:
                    buffer.append(ch);
                    break;
                }
                break;
            case stateNEW_TOKEN:
                switch (ch) {
                case '#':
                    tokens.add(clean(buffer));
                    state = stateCOMMENT;
                    break;
                case ',':
                    tokens.add(clean(buffer));
                    buffer = new StringBuffer();
                    break;
                case ' ':
                case '\t':
                    state = stateWHITESPACE;
                    break;
                case '"':
                    buffer.append(ch);
                    state = stateQUOTED_DATA;
                    break;
                default:
                    state = stateDATA;
                    buffer.append(ch);
                    break;
                }
                break;
            case stateWHITESPACE:
                switch (ch) {
                case '#':
                    state = stateCOMMENT;
                    break;
                case ',':
                    state = stateNEW_TOKEN;
                    // ACCEPT NEW EMPTY COLUMN HERE??
                    break;
                case '"':
                    buffer.append(ch);
                    state = stateQUOTED_DATA;
                    break;
                case ' ':
                case '\t':
                    break;
                default:
                    state = stateDATA;
                    buffer.append(ch);
                    break;
                }
                break;
            default:
                break;
            }
        }
        if (state == stateQUOTED_DATA)
            throw new IOException("Unmatched quotes in line:\n" + line);
        if (state != stateCOMMENT)
            tokens.add(clean(buffer));
        return tokens;
    }

    public String clean(StringBuffer buffer) {
        String string = buffer.toString().trim();
        if (string.startsWith(DOUBLE_QUOTE))
            return string.substring(1, string.length() - 1);
        return string;
    }
}
Justian Meyer