I have an Excel spreadsheet being converted into a CSV file in C#, but am having a problem dealing with line breaks. For instance:

"John","23","555-5555"

"Peter","24","555-5
555"

"Mary","21","555-5555"

When I read the CSV file, if a record does not start with a double quote (") then a line break is there by mistake and I have to remove it. I have some CSV reader classes from the internet, but I am concerned that they will fail on the line breaks.

How should I handle these line breaks?

+1  A: 

Maybe you could count the double quotes (") on each line as you call ReadLine(). If the count is odd, that raises the flag: the line break falls inside a quoted field. You could either ignore those lines, or read the next line as well and eliminate the "\n" between the two when you merge them.
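
A minimal sketch of that quote-counting idea, assuming every line break inside a field sits between an opening and a closing double quote. The class and method names (QuoteParityReader, ReadRecords) are made up for illustration; they are not from any library.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class QuoteParityReader
{
    // Yields one logical record per item. Physical lines are merged while the
    // running count of double quotes is odd, i.e. while a quoted field is open.
    public static IEnumerable<string> ReadRecords(TextReader reader)
    {
        string pending = null;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // Merge without the line break, as the answer suggests.
            pending = pending == null ? line : pending + line;

            // Escaped quotes ("") add two to the count, so parity is unaffected.
            if (pending.Count(ch => ch == '"') % 2 == 0)
            {
                yield return pending;
                pending = null;
            }
        }

        // A trailing record with an unbalanced quote is returned as-is.
        if (pending != null) yield return pending;
    }
}
```

With the sample from the question, the two physical lines of the "Peter" record come back as a single record. If some fields are not quoted at all, a break inside them cannot be detected this way.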

Freddy
string.IsNullOrEmpty(value.Trim()) would likely be safer.
John Fisher
+4  A: 

Rather than check if the current line is missing the (") as the first character, check instead to see if the last character is a ("). If it is not, you know you have a line break, and you can read the next line and merge it together.

I am assuming your example data was accurate - fields were wrapped in quotes. If quotes might not delimit a text field (or new-lines are somehow found in non-text data), then all bets are off!

Doug
Some CSV apps do not wrap every field with quotes when generating CSV files, so this might be a buggy solution.
John Fisher
I was, of course, making the assumption that his example data was accurate - fields were wrapped in quotes. If quotes might not delimit a text field (or new-lines are somehow found in non-text data), then all bets are off!
Doug
Doug, maybe put the assumption in your answer
phsr
A: 

I might be misunderstanding, but are you parsing the Excel file into a CSV and then having a problem when you try to read it back? If that is the case, what does the code you use to convert the Excel file into a CSV look like?

Zman101
A: 

There is an example parser in C# that seems to handle your case correctly. You can then read your data in and purge the line breaks from it after reading. Part 2 is the parser, and there is a Part 1 that covers the writer portion.

Doug
A: 

Read the line.
Split into columns(fields).
If you have the number of columns expected for each line, process it.
If not, read the next line and capture the remaining columns until you have what you need.
Repeat.
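
The steps above can be sketched roughly as follows. This is only an illustration: ColumnCountReader, ReadRows and their parameters are hypothetical names, the split is a plain string.Split that does not understand quotes, and a break inside the last field of a record cannot be detected because the column count is already satisfied at that point.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class ColumnCountReader
{
    // Keeps appending physical lines until the merged text splits into at
    // least expectedColumns fields, then yields those fields as one row.
    public static IEnumerable<string[]> ReadRows(TextReader reader, int expectedColumns, char separator)
    {
        string pending = null;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            pending = pending == null ? line : pending + line;

            // Naive split: a separator inside a quoted field is counted too.
            string[] fields = pending.Split(separator);
            if (fields.Length >= expectedColumns)
            {
                yield return fields;
                pending = null;
            }
        }

        // Whatever is left over is returned even if it is short.
        if (pending != null) yield return pending.Split(separator);
    }
}
```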

FlappySocks
The split can be dangerous if there is a comma between the quotes. A well-crafted regular expression would be safer.
John Fisher
+4  A: 

CSV has predefined ways of handling that. This site provides an easy to read explanation of the standard way to handle all the caveats of CSV.
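
As a rough illustration of the usual convention (the one RFC 4180 describes): a field that contains a comma, a double quote or a line break is wrapped in double quotes, and any double quote inside it is doubled. The names CsvWriterSketch, Escape and ToLine below are invented for this sketch.

```csharp
using System;
using System.Linq;

static class CsvWriterSketch
{
    // Wraps the field in quotes only when it needs them and doubles any
    // embedded double quotes.
    public static string Escape(string field)
    {
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        return needsQuotes ? "\"" + field.Replace("\"", "\"\"") + "\"" : field;
    }

    // Joins already-separated field values into one CSV line (no terminator).
    public static string ToLine(params string[] fields)
    {
        return string.Join(",", fields.Select(f => Escape(f)));
    }
}
```

A reader that follows the same rules treats a line break inside a quoted field as data rather than as the end of the record.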

Nevertheless, there is really no reason to not use a solid, open source library for reading and writing CSV files to avoid making non-standard mistakes. LINQtoCSV is my favorite library for this. It supports reading and writing in a clean and simple way.

Alternatively, this SO question on CSV libraries will give you the list of the most popular choices.

Michael La Voie
+1 for LINQtoCSV
Graham Miller
A: 

A somewhat simple regular expression could be used on each line. When it matches, you process each field from the match. When it doesn't find a match, you skip that line.

The regular expression could look something like this.

Match match = Regex.Match(line, @"^(?:,?(?:(?<q>['""])(?<field>.*?)\k<q>|(?<field>[^,]*)))+$");
if (match.Success)
{
  // Capture is named explicitly so that .Value resolves on older frameworks too.
  foreach (Capture capture in match.Groups["field"].Captures)
  {
    string fieldValue = capture.Value;
    // Use the value.
  }
}
John Fisher
A: 

Thanks everybody very much for your help.

Here is what I've done so far. My records have a fixed format and all start with JTW;...;....;...;

JTW;...;...;....

JTW;....;...;..

..;...;... (wrong record, line break inserted)

JTW;...;...

So I check for the ; at position [3] of each line. If it is there, I write the line as a new record; if not, I append it to the previous line (removing the line break).

I'm having problems now because I'm saving the file as a TXT.

By the way, I am converting the Excel spreadsheet to CSV by saving it as CSV in Excel, but I'm not sure whether the client is doing the same.

So the file as a TXT is perfect; I've checked the records and totals. But now I have to convert it back to CSV, and I would really like to do it in the program. Does anybody know how?

here is my code:

using System;
using System.IO;

namespace EditorCSV
{
    class Program
    {
        static void Main(string[] args)
        {
            ReadFromFile("c:\\source.csv");
        }

        static void ReadFromFile(string filename)
        {
            int records = 0;

            using (StreamReader reader = File.OpenText(filename))
            using (StreamWriter writer = File.CreateText("c:\\target.csv"))
            {
                // The first line is copied as-is and, as before, not counted.
                string line = reader.ReadLine();
                if (line != null) writer.Write(line);

                while ((line = reader.ReadLine()) != null)
                {
                    // Lines too short to hold "JTW;" used to throw; skip them.
                    if (line.Length < 4) continue;

                    if (line[3] == ';')
                    {
                        // A new record starts here.
                        writer.Write("\r\n" + line);
                        records++;
                    }
                    else
                    {
                        // Continuation of the previous record: append it
                        // without a line break.
                        writer.Write(line);
                    }
                }
            }

            Console.WriteLine("Records processed: " + records + ".");
            Console.WriteLine("File created successfully");
            Console.ReadKey();
        }
    }
}
+1  A: 

What I usually do is read the text in character by character, as opposed to line by line, because of this very problem.

As you read each character, you should be able to figure out where each cell starts and stops, and also tell a line break that ends a row from one inside a cell. If I remember correctly, for Excel-generated files anyway, rows end with \r\n, and newlines in cells are only \n.
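
A small sketch of that character-by-character approach, assuming (as recalled above, and worth checking against a real export) that rows are terminated by the \r\n pair while a break inside a cell is a single lone character. RowSplitter and SplitRows are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class RowSplitter
{
    // Splits the input into rows on the "\r\n" pair only. A lone '\r' or '\n'
    // is taken to be a break inside a cell and is replaced with a space.
    public static List<string> SplitRows(TextReader reader)
    {
        var rows = new List<string>();
        var current = new StringBuilder();
        int next;
        while ((next = reader.Read()) != -1)
        {
            char c = (char)next;
            if (c == '\r' && reader.Peek() == '\n')
            {
                reader.Read();                 // consume the '\n' of the pair
                rows.Add(current.ToString());
                current.Length = 0;
            }
            else
            {
                current.Append(c == '\r' || c == '\n' ? ' ' : c);
            }
        }

        // Keep a final row that has no terminator.
        if (current.Length > 0) rows.Add(current.ToString());
        return rows;
    }
}
```

Each returned row still has to be split into fields afterwards; this only decides where the rows end.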

John
A: 

SW = File.CreateText("c:\\target.txt")

If it was SW = File.CreateText("c:\\target.csv") I am able to read the file, but I get format errors when opening it in Excel.

+1  A: 

Heed the advice from the experts and Don't roll your own CSV parser.

Your first thought is, "How do I handle new line breaks?"

Your next thought is, "I need to handle commas inside of quotes."

Your next thought will be, "Oh, crap, I need to handle quotes inside of quotes. Escaped quotes. Double quotes. Single quotes..."

It's a road to madness. Don't write your own. Find a library with extensive unit test coverage that hits all the hard parts and has gone through hell for you. For .NET, use the free FileHelpers library.

Judah Himango
A: 

I'm also having problems with characters like ó, í, á, etc.

A: 

I've used this piece of code recently to parse rows from a CSV file (this is a simplified version):

    // separator is assumed to be a char field of the containing class (e.g. ',').
    private void Parse(TextReader reader)
    {
        var row = new List<string>();
        var isStringBlock = false;
        var sb = new StringBuilder();

        long charIndex = 0;
        int currentLineCount = 0;

        while (reader.Peek() != -1)
        {
            charIndex++;

            char c = (char)reader.Read();

            if (c == '"')
                isStringBlock = !isStringBlock;

            if (c == separator && !isStringBlock) //end of word
            {
                row.Add(sb.ToString().Trim()); //add word
                sb.Length = 0;
            }
            else if (c == '\n' && !isStringBlock) //end of line
            {
                row.Add(sb.ToString().Trim()); //add last word in line
                sb.Length = 0;

                //DO SOMETHING WITH row HERE!

                currentLineCount++;

                row = new List<string>();
            }
            else
            {
                if (c != '"' && c != '\r') sb.Append(c == '\n' ? ' ' : c);
            }
        }

        row.Add(sb.ToString().Trim()); //add last word

        //DO SOMETHING WITH LAST row HERE!
    }
A: 

The LINQy solution (it assumes every field is quoted, no field contains a comma, and each record has three fields):

string csvText = File.ReadAllText("C:\\Test.txt");

var query = csvText
    .Replace(Environment.NewLine, string.Empty)
    .Replace("\"\"", "\",\"").Split(',')
    .Select((i, n) => new { i, n }).GroupBy(a => a.n / 3);
Yuriy Faktorovich
A: 

Have a look at the FileHelpers library. It supports reading/writing CSV with line breaks as well as reading/writing to Excel.

+1  A: 

To follow up on the regex solution, the common one that can be found on the net can first be used to split the lines:

public static string[] SplitSV(char separator, string values)
{
    if (values == null)
        return new string[] { };

    // Split on the separator only when it is followed by an even number of quotes,
    // i.e. when it is not inside a quoted field.
    Regex regex = new Regex((separator == '^' ? @"\^" : "[" + separator + "]") + "(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))");
    string[] result = regex.Split(values);

    return Array.ConvertAll<string, string>(result, delegate(string s)
    {
        // When splitting a file into lines, return each line as-is: it would fail
        // the validation below and must not be unescaped yet.
        if (separator == '\n' || separator == '\r')
            return s;

        // Check that a field containing special characters is wrapped in quotes.
        if (!(s.Length >= 2 && s.StartsWith("\"") && s.EndsWith("\"")) && (s.Contains("\"") || s.Contains(separator) || s.Contains('\r') || s.Contains('\n')))
            throw new Exception("Invalid CSV, contains unescaped characters.");

        // Remove the start and end quotes if they exist.
        if (s.StartsWith("\""))
        {
            s = s.Substring(1, s.Length - 2);

            // Check that the remaining string doesn't contain an unescaped ".
            if (new Regex("\"+").Matches(s).Cast<Match>().Any(m => m.Value.Length % 2 == 1))
                throw new Exception("Invalid CSV, contains unescaped characters.");
        }

        // Unescape quotes.
        return s.Replace("\"\"", "\"");
    });
}

Then on your input call:

string[][] result = SplitSV('\n', input).Select(l => SplitSV(',', l.TrimEnd('\r'))).ToArray();
Michael
A: 

Try CsvHelper http://github.com/JoshClose/CsvHelper. It ignores empty rows. I believe there is a flag you can set in FastCsvReader to have it handle empty rows also.

Josh Close