+1  Q: 

Text manipulation

I have a malformed tab-delimited CSV file:

Name  AA BB CC AA BB CC
XX5            2  7  8b
YY4            2  6  2
ZZ3            8  21 9
RR2   1  2  6
SS1            6  7  23

It should be like this

Name  AA BB CC
XX5   2  7  8b
YY4   2  6  2
ZZ3   8  21 9
RR2   1  2  6
SS1   6  7  23

I can't do this manually because there are far too many rows. Is there an algorithm that can automate this?

The first row is the header.

This is just an example; the actual file has 50 columns and over 10,000 rows.

+2  A: 

I'd consider a different approach. If the data's going to end up in a table in a database upon which you can perform SQL queries, import into a table that looks like:

mytable:
NAME nvarchar(10) PRIMARY KEY NOT NULL
AA nvarchar(10)
BB nvarchar(10)
CC nvarchar(10)
AA2 nvarchar(10)
BB2 nvarchar(10)
CC2 nvarchar(10)

After importing the data, try the following SQL:

UPDATE mytable SET AA = AA2 WHERE AA2 IS NOT NULL
UPDATE mytable SET BB = BB2 WHERE BB2 IS NOT NULL
UPDATE mytable SET CC = CC2 WHERE CC2 IS NOT NULL

... which will copy the values from the second set of "columns" into the first.

Then simply drop the columns AA2, BB2 and CC2.

Another option (again, I'm making assumptions here): open the file in a text editor and replace every occurrence of three consecutive tab characters with nothing.

Bob Kaufman
D'OH!!! Good catch. Fixing now.
Bob Kaufman
You should be careful, because in columns AA2, BB2, CC2 there can be NULLs or empty strings and you don't want to overwrite values in fields AA, BB, CC then.
Lukasz Lysik
Damnit, I already fixed it twice. You keep overwriting my edits! ;)
John Gietzen
+2  A: 

I don't love the string.Format, but perhaps something like the code below. Note that the Length == 7 test assumes there are no more \t characters after the end of the data, but you could replace this with a test for blank strings, etc.

    static void Main() {
        var qry = from line in ReadLines("data.tsv")
                  let cells = line.Split('\t')
                  let format = cells.Length == 7 ? "{0}\t{4}\t{5}\t{6}"
                     : "{0}\t{1}\t{2}\t{3}"
                  select string.Format(format, cells);
        using (var writer = File.CreateText("new.tsv")) {
            foreach(string line in qry) {
                writer.WriteLine(line);
            }
        }
    }
    static IEnumerable<string> ReadLines(string path) {
        using (var reader = File.OpenText(path)) {
            string line;
            while ((line = reader.ReadLine()) != null) {
                yield return line;
            }
        }
    }


Edit: to simply remove blanks:

    static string Join(this IEnumerable<string> data, string delimiter) {
        using (var iter = data.GetEnumerator()) {
            if (!iter.MoveNext()) return "";
            StringBuilder sb = new StringBuilder(iter.Current);
            while (iter.MoveNext()) {
                sb.Append(delimiter).Append(iter.Current);
            }
            return sb.ToString();
        }
    }
    static void Main() {
        var qry = from line in ReadLines("data.tsv")
                  let cells = line.Split('\t').Where(s => s != "")
                  select cells.Join("\t");
        using (var writer = File.CreateText("new.tsv")) {
            foreach(string line in qry) {
                writer.WriteLine(line);
            }
        }
    }
Marc Gravell
How can I customize this to support more columns? The actual file has 50 columns and over 10,000 rows.
newbie
+1 for the elegant (imho) reading of the text file
andyp
The 10k rows are fine, as the ReadLines approach only handles one row at a time, so there is no chance of blowing the memory. I've added an edit re the 50 columns.
Marc Gravell
Is there a way to quickly convert the header to a SQL Server table? Assume nvarchar(1000) for all columns. I don't want to enter all 50 column names manually.
newbie
Where did the question mention anything about a database?
Marc Gravell
Nowhere, but that's my next question :) I need to import this file into a database and want to know if there's a quick way to do it.
newbie
+3  A: 

Quick trick!

Depending on the exact pattern found in the input file, it may also be possible to fix this with a simple text editor (or with sed), essentially replacing any sequence of 3 tabs with nothing.
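
The same replacement can be sketched in C#, the language used elsewhere in this thread. This is only a minimal sketch, assuming the blank cells always appear as a run of exactly three tabs, as in the sample data; `TabFixer` and `CollapseTabRuns` are hypothetical names chosen for illustration. The header row contains no empty cells, so its duplicated column names would still need separate handling.

```csharp
using System;

static class TabFixer
{
    // Removes every run of three consecutive tabs, i.e. three empty cells.
    // Assumption: blank cells always come in groups of exactly three,
    // as in the sample rows from the question.
    public static string CollapseTabRuns(string line)
    {
        return line.Replace("\t\t\t", "");
    }
}
```

For a row such as "XX5", four tabs, then "2", "7", "8b", the call `TabFixer.CollapseTabRuns(line)` removes three of the four tabs and leaves a single separator after the name. Trailing runs of three tabs (as a row like RR2 might have) are removed the same way.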

mjv
A: 
Wim Hollebrandse
+1  A: 

Hi,

this works too (without thinking that much):

        string csv = @"
Name  AA BB CC AA BB CC
XX5            2  7  8b
YY4            2  6  2
ZZ3            8  21 9
RR2   1  2  6
SS1            6  7  23";

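        // Split on either line-ending style so the result does not depend on the platform.
        string[] lines = csv.Split(new string[] { "\r\n", "\n" },
            StringSplitOptions.RemoveEmptyEntries);
        foreach (string line in lines)
        {
            // Collapse every run of whitespace into one separator; this assumes
            // that no cell value contains a space.
            string[] fields = Regex.Split(line, @"\s+");
            // Join avoids the trailing tab that a per-field Write would leave.
            Console.WriteLine(string.Join("\t", fields));
        }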
andyp
+1  A: 

Assuming you read the file into a string you could do something like this:

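var newFile = new StringBuilder();
// The de-duplicated header is written by hand; the original header row is skipped below.
newFile.AppendLine("Name\tAA\tBB\tCC");
string oldFile = "data"; // placeholder for the full contents of the file
// Split on both '\r' and '\n' so Windows line endings do not leave a stray "\r" cell.
var rows = oldFile.Split(new char[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries).Skip(1).ToList();
foreach (var row in rows)
   // Dropping empty entries removes the blank cells wherever they sit in the row.
   newFile.AppendLine(string.Join("\t", row.Split(new char[] { '\t' }, StringSplitOptions.RemoveEmptyEntries).ToArray()));
return(newFile.ToString());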
jarrett
A: 

Looks like you can just OR the columns with the same name together.
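
One way to read this suggestion: for each header name, take the first non-empty cell among all columns that share that name. The following is a rough sketch under that reading, not a definitive implementation; `ColumnMerger`, `GroupColumns` and `MergeRow` are hypothetical names, and it assumes at most one column per name is filled in any given row. Because the grouping is derived from the header, it should carry over to the 50-column file without listing the names by hand.

```csharp
using System;
using System.Collections.Generic;

static class ColumnMerger
{
    // For each distinct header name (in first-seen order), collects the
    // indexes of all columns that carry that name.
    public static List<List<int>> GroupColumns(string[] header)
    {
        var groups = new List<List<int>>();
        var byName = new Dictionary<string, List<int>>();
        for (int i = 0; i < header.Length; i++)
        {
            List<int> indexes;
            if (!byName.TryGetValue(header[i], out indexes))
            {
                indexes = new List<int>();
                byName[header[i]] = indexes;
                groups.Add(indexes);
            }
            indexes.Add(i);
        }
        return groups;
    }

    // Builds one output row: within each group the first non-empty cell wins.
    // Rows shorter than the header (no trailing tabs) are tolerated.
    public static string MergeRow(string line, List<List<int>> groups)
    {
        string[] cells = line.Split('\t');
        var merged = new List<string>();
        foreach (List<int> indexes in groups)
        {
            string value = "";
            foreach (int i in indexes)
            {
                if (i < cells.Length && cells[i] != "")
                {
                    value = cells[i];
                    break;
                }
            }
            merged.Add(value);
        }
        return string.Join("\t", merged.ToArray());
    }
}
```

To use it, split the header line on tabs, pass the result to `GroupColumns` once, then call `MergeRow` for every line, including the header line itself, which collapses to one copy of each name.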

AdamBT
A: 

Hello, you can also try this regular expression:

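(\w{3})\t*(\w*)\t*(\w*)\t*(\w*)

Here is a code sample:

static void Main(string[] args)
    {
        // Rows are tab-delimited; escape sequences keep the tabs visible.
        string input = "XX5\t\t\t\t2\t7\t8b\n" +
                       "YY4\t\t\t\t2\t6\t2\n" +
                       "ZZ3\t\t\t\t8\t21\t9\n" +
                       "RR2\t1\t2\t6\n" +
                       "SS1\t\t\t\t6\t7\t23\n";
        string pattern = @"(\w{3})\t*(\w*)\t*(\w*)\t*(\w*)";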

        try
        {

            if (Regex.IsMatch(input, pattern))
            {
                Regex r = new Regex(pattern);
                StringBuilder sBuilder = new StringBuilder();
                Match m;
                int i = 0;
                for (m = r.Match(input); m.Success; m = m.NextMatch())
                {
                    //sBuilder.Append(String.Format("Match[{0}]: ", i));
                    // Group 0 is the whole match, so iterate over the capture groups only.
                    for (int j = 1; j < m.Groups.Count; j++)
                    {
                        sBuilder.Append(String.Format("{0} ", m.Groups[j].Value));
                    }
                    sBuilder.AppendLine("");
                    i++;
                }
                Console.WriteLine(sBuilder.ToString());
            }
            else
            {
                Console.WriteLine("No match");

            }
            Console.ReadLine();
        }
        catch (Exception e)
        {
            StringBuilder sBuilder = new StringBuilder();
            sBuilder.Append("Error parsing: \"");
            sBuilder.Append(pattern);
            sBuilder.Append("\" - ");
            sBuilder.Append(e.ToString());
            Console.WriteLine(sBuilder.ToString());
        }
    }
Arturo Molina