I've got a CSV with 35K rows containing, among others, the following columns: articleID, description, class1, class2, class3.
The class columns represent the categories to which the products belong. class1 is the main category, class2 is a subcategory of class1 and class3 is a subcategory of class2.
Now I want to extract the categories in a tree structure, but I'm kind of lost.

The only thing I could come up with is the following LINQ query to get a distinct list. (I am not an expert in either LINQ or C#/.NET in general...)
The ParseStream function returns a list of rows, each an array of column values. i[3], i[4] and i[5] represent class1, class2 and class3.

List<string[]> infoList = ParseStream(infoFile);
List<string> categories = (from i in infoList
                           select new StringBuilder().Append(i[3]).Append(";").Append(i[4]).Append(";").Append(i[5]).ToString())
                          .Distinct().ToList();

This just gives me a semicolon-separated list of all category paths...
What is the best data type to store a hierarchical list in? And how do I select this with LINQ?

+1  A: 

This can be done with LINQ, but I could not find a way that performs well.

A simple way to do it is based on Dictionary and HashSet:

IList<string[]> infoList = ParseStream(infoFile);
var dictionary = new Dictionary<string, Dictionary<string, HashSet<string>>>();
foreach (var articleInfo in infoList)
{
    string class1 = articleInfo[3];
    string class2 = articleInfo[4];
    string class3 = articleInfo[5];

    Dictionary<string, HashSet<string>> class1Categories;
    if (!dictionary.TryGetValue(class1, out class1Categories))
    {
        class1Categories = new Dictionary<string, HashSet<string>>();
        dictionary[class1] = class1Categories;
    }

    HashSet<string> class2Categories;
    if (!class1Categories.TryGetValue(class2, out class2Categories))
    {
        class2Categories = new HashSet<string>();
        class1Categories[class2] = class2Categories;
    }

    class2Categories.Add(class3);
}

The result is hierarchical data, where the first level is class1, the second is class2 and the last is class3. Dictionary keys are always distinct, and so are HashSet values, so the structure cannot contain duplicate categories.
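
To make the duplicate handling concrete, here is a minimal, self-contained sketch of the same loop run over a few made-up rows. The class name CategoryTreeDemo, the Build method and the sample data are hypothetical and only serve the illustration; the column positions 3, 4 and 5 follow the question.

```csharp
using System;
using System.Collections.Generic;

public static class CategoryTreeDemo
{
    // Builds class1 -> class2 -> set of class3 from rows whose
    // columns 3, 4 and 5 hold the three category levels.
    public static Dictionary<string, Dictionary<string, HashSet<string>>> Build(
        IEnumerable<string[]> rows)
    {
        var tree = new Dictionary<string, Dictionary<string, HashSet<string>>>();
        foreach (string[] row in rows)
        {
            Dictionary<string, HashSet<string>> level2;
            if (!tree.TryGetValue(row[3], out level2))
            {
                level2 = new Dictionary<string, HashSet<string>>();
                tree[row[3]] = level2;
            }

            HashSet<string> level3;
            if (!level2.TryGetValue(row[4], out level3))
            {
                level3 = new HashSet<string>();
                level2[row[4]] = level3;
            }

            // Add returns false and leaves the set unchanged for a repeated value.
            level3.Add(row[5]);
        }
        return tree;
    }

    public static void Main()
    {
        // Hypothetical rows: articleID, description, some other column, class1, class2, class3.
        var rows = new List<string[]>
        {
            new[] { "1", "Claw hammer",   "", "Tools", "Hand tools",  "Hammers" },
            new[] { "2", "Sledge hammer", "", "Tools", "Hand tools",  "Hammers" },
            new[] { "3", "Cordless drill", "", "Tools", "Power tools", "Drills" },
        };

        var tree = Build(rows);

        // Two articles share the path Tools / Hand tools / Hammers,
        // but the set under "Hand tools" holds that class3 only once.
        Console.WriteLine(tree["Tools"]["Hand tools"].Count); // prints 1
    }
}
```

Subcategories of a known class1 can then be read directly, for example with tree["Tools"].Keys.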

For example, to print all values in a hierarchical, indented way:

// Use the dictionary built in the loop above; a freshly created one would be empty and print nothing.
var classes = dictionary;

foreach (var class1 in classes)
{
    Console.WriteLine(class1.Key);
    foreach (var class2 in class1.Value)
    {
        Console.WriteLine("\t{0}", class2.Key);
        foreach (var class3 in class2.Value)
        {
            Console.WriteLine("\t\t{0}", class3);
        }
    }
}
Elisha
Nice, working answer. Dictionary and HashSet are a good way to store this. But I would like to select it in one statement using LINQ...
Nicky De Maeyer
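
Regarding the single-statement request: one possible sketch, not taken from the answer above, nests GroupBy inside ToDictionary to produce the same Dictionary/HashSet shape in one expression. The class name CategoryTreeLinq, the Build method and the sample rows are hypothetical; the column positions 3, 4 and 5 follow the question. I have not measured how it compares with the loop on 35K rows.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CategoryTreeLinq
{
    // Single LINQ expression: group by class1, then by class2,
    // and collect the distinct class3 values of each group in a HashSet.
    public static Dictionary<string, Dictionary<string, HashSet<string>>> Build(
        IEnumerable<string[]> rows)
    {
        return rows
            .GroupBy(r => r[3])
            .ToDictionary(
                g1 => g1.Key,
                g1 => g1
                    .GroupBy(r => r[4])
                    .ToDictionary(
                        g2 => g2.Key,
                        g2 => new HashSet<string>(g2.Select(r => r[5]))));
    }

    public static void Main()
    {
        // Hypothetical rows: articleID, description, some other column, class1, class2, class3.
        var rows = new List<string[]>
        {
            new[] { "1", "Claw hammer",    "", "Tools",  "Hand tools",  "Hammers" },
            new[] { "2", "Sledge hammer",  "", "Tools",  "Hand tools",  "Hammers" },
            new[] { "3", "Cordless drill", "", "Tools",  "Power tools", "Drills" },
            new[] { "4", "Garden hose",    "", "Garden", "Watering",    "Hoses" },
        };

        // Print the tree with one tab of indentation per level.
        foreach (var class1 in Build(rows))
        {
            Console.WriteLine(class1.Key);
            foreach (var class2 in class1.Value)
            {
                Console.WriteLine("\t{0}", class2.Key);
                foreach (var class3 in class2.Value)
                {
                    Console.WriteLine("\t\t{0}", class3);
                }
            }
        }
    }
}
```

In the question's code this would be called as CategoryTreeLinq.Build(infoList), assuming ParseStream returns the rows as described there.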