I have a string of attribute names and definitions. I am trying to split the string on the attribute names into a Dictionary<string, string>, where the key is the attribute name and the value is the definition. I won't know the attribute names ahead of time, so I have been trying to split on the ":" character, but I am having trouble because the attribute name is not included in the split.

For example, I need to split this string on "Organization:", "OrganizationType:", and "Nationality:" into a Dictionary. Any ideas on the best way to do this with C#.Net?

Organization: Name of a governmental, military or other organization. OrganizationType: Organization classification to one of the following types: sports, governmental military, governmental civilian or political party. (required) Nationality: Organization nationality if mentioned in the document. (required)


Here is some sample code to help:

private static void Main()
{
    const string str = "Organization: Name of a governmental, military or other organization. OrganizationType: Organization classification to one of the following types sports, governmental military, governmental civilian or political party. (required) Nationality: Organization nationality if mentioned in the document. (required)";

    var array = str.Split(':');
    var dictionary = array.ToDictionary(x => x[0], x => x[1]);

    foreach (var item in dictionary)
    {
        Console.WriteLine("{0}: {1}", item.Key, item.Value);
    }

    // Expecting to see the following output:

    // Organization: Name of a governmental, military or other organization.
    // OrganizationType: Organization classification to one of the following types sports, governmental military, governmental civilian or political party.
    // Nationality: Organization nationality if mentioned in the document. (required)
}

Here is a visual explanation of what I am trying to do:

http://farm5.static.flickr.com/4081/4829708565_ac75b119a0_b.jpg

+1  A: 

Considering that each word in front of the colon always has at least one capital (please confirm), you could solve this by using regular expressions (otherwise you'd end up splitting on all colons, which also appear inside the sentences):

var resultDict = Regex.Split(input, @"(?<= [A-Z][a-zA-Z]+):")
                 .ToDictionary(a => a[0], a => a[1]);

The (?<=...) is a positive look-behind expression that doesn't "eat up" the characters, thus only the colon is removed from the output. Tested with your input here.

The [A-Z][a-zA-Z]+ means: a word that starts with a capital.

Note that, as others have suggested, a "smarter" delimiter will make parsing easier, as will escaping the delimiter (e.g. "::" or "\:") when you are required to use colons. I'm not sure whether those are options for you, hence the solution with regular expressions above.

Edit

For one reason or another, I kept getting errors when using ToDictionary, so here's the unrolled version, which at least works. Apologies for the earlier non-working version. Note that the regular expression has changed: the first one did not separate each key from the preceding definition.

var splitArray = Regex.Split(input, @"(?<=( |^)[A-Z][a-zA-Z]+):|( )(?=[A-Z][a-zA-Z]+:)")
                            .Where(a => a.Trim() != "").ToArray();

Dictionary<string, string> resultDict = new Dictionary<string, string>();
for(int i = 0; i < splitArray.Count(); i+=2)
{
    resultDict.Add(splitArray[i], splitArray[i+1]);
}

Note: the regular expression becomes a tad complex in this scenario. As suggested in the thread below, you can split it in smaller steps. Also note that the current regex creates a few empty matches, which I remove with the Where-expression above. The for-loop should not be needed if you manage to get ToDictionary working.
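As a rough sketch of how the loop could be folded back into ToDictionary: pair up the even and odd indexes of the filtered array. The class and method names below (PairParser, Parse) are made up for illustration, and the sketch assumes the filtered array always alternates key, value, key, value. ToDictionary throws an ArgumentException if the same key appears twice.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PairParser
{
    // Hypothetical helper, not part of any library.
    public static Dictionary<string, string> Parse(string input)
    {
        // Same regex and filter as above: the captured groups only ever
        // hold "" or " ", so the Where clause drops them.
        string[] splitArray = Regex
            .Split(input, @"(?<=( |^)[A-Z][a-zA-Z]+):|( )(?=[A-Z][a-zA-Z]+:)")
            .Where(a => a.Trim() != "")
            .ToArray();

        // Assumption: even indexes are keys, odd indexes are values.
        return Enumerable.Range(0, splitArray.Length / 2)
            .ToDictionary(i => splitArray[2 * i].Trim(),
                          i => splitArray[2 * i + 1].Trim());
    }

    public static void Main()
    {
        const string input = "Organization: Name of a governmental, military or other organization. OrganizationType: Organization classification to one of the following types: sports, governmental military, governmental civilian or political party. (required) Nationality: Organization nationality if mentioned in the document. (required)";

        foreach (var item in Parse(input))
        {
            Console.WriteLine("{0}: {1}", item.Key, item.Value);
        }
    }
}
```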

Abel
`An item with the same key has already been added`.
Darin Dimitrov
Very cool, and good to know about positive look-behind expressions. I tried it; however, I am still faced with the issue of the attribute name being left in the "next" item. See this image: http://farm5.static.flickr.com/4093/4830339922_1945af206d_b.jpg
Paul Fryer
@Darin and @Paul: aha, I notice that. Fixing...!
Abel
I personally think it's a bit complicated to try to split it all in one go. It is much easier to split first into your name/value pairs and then split those pairs apart. It's going to be much more understandable at the end of the day than if you're trying to split on both.
Chris
@Chris, I wholeheartedly agree. Feel free to split as much as needed. The first solution above splits in key+data per item. After that it becomes trivial. The second solution does it all in one go.
Abel
@Paul Fryer: that error is gone now, sorry for posting before testing.
Abel
+1  A: 

You would need some delimiter to indicate when it is the end of each pair as opposed to having one large string with sections in between e.g.

Organization: Name of a governmental, military or other organization.|OrganizationType: Organization classification to one of the following types: sports, governmental military, governmental civilian or political party. (required) |Nationality: Organization nationality if mentioned in the document. (required)

Notice the | character, which indicates the end of each pair. After that, it is just a case of using a very specific delimiter between name and value, something that is not likely to appear in the description text. Instead of one colon you could use two (::), since a single colon could crop up in the text on occasion, as others have suggested. That means you would just need to do:

// split the string into rows
string[] rows = myString.Split('|');
Dictionary<string, string> pairs = new Dictionary<string, string>();
foreach (var r in rows)
{
    // split each row into a pair and add to the dictionary
    string[] split = Regex.Split(r, "::");
    pairs.Add(split[0], split[1]);
}

You can use LINQ as others have suggested, the above is more for readability so you can see what is happening.
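For comparison, a minimal LINQ sketch of the same idea. It assumes the input really does use "|" between pairs and "::" between name and value, which is only the proposal made here and not the format of the original string. The names DelimitedParser and Parse are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DelimitedParser
{
    // Hypothetical helper: assumes "|" ends each pair and "::" separates
    // the name from its description.
    public static Dictionary<string, string> Parse(string input)
    {
        return input
            // split the string into rows, skipping empty ones
            .Split(new[] { '|' }, StringSplitOptions.RemoveEmptyEntries)
            // split each row at the first "::" only
            .Select(row => row.Split(new[] { "::" }, 2, StringSplitOptions.None))
            // ignore rows that contain no "::" at all
            .Where(parts => parts.Length == 2)
            .ToDictionary(parts => parts[0].Trim(), parts => parts[1].Trim());
    }

    public static void Main()
    {
        const string input = "Organization:: Name of an organization.|Nationality:: Organization nationality. (required)";

        foreach (var item in Parse(input))
        {
            Console.WriteLine("{0}: {1}", item.Key, item.Value);
        }
    }
}
```

Limiting the inner split to two parts keeps any later "::" inside the description. ToDictionary throws if a name occurs twice, so duplicate names would need separate handling.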

Another alternative is to devise some custom regex to do what you need but again you would need to be making a lot of assumptions of how the description text would be formatted etc.

James
It would be great if I could add a delimiter; unfortunately I don't have control over the input string (it comes from a 3rd party). I'm trying to do some normalization of it to make a structured model. I'm afraid I might have to do a special split like you mentioned. The assumptions I can make are: 1) The attribute name will have no spaces in it. 2) The attribute name will be immediately followed by a ":".
Paul Fryer
@Paul: Ah ok, I noticed as well that each description only ever includes one full stop. You could possibly even use that as the delimiter for each row? Although it is a big assumption...
James
+3  A: 

I'd do it in two phases, firstly split into the property pairs using something like this:

Regex.Split(input, @"\s(?=[A-Z][A-Za-z]*:)")

This looks for any whitespace followed by an alphabetic string followed by a colon. The alphabetic string must start with a capital letter. It then splits on that whitespace. That will get you three strings of the form "PropertyName: PropertyValue". Splitting on that first colon is then pretty easy (I'd personally probably just use Substring and IndexOf rather than another regular expression, but you sound like you can do that bit fine on your own). Shout if you do want help with the second split.

The only thing to say is to be careful in case you get false matches due to the input being awkward. In that case you'll just have to make the regex more complicated to try to compensate.
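Put together, the two phases might look roughly like the sketch below. The names AttributeParser and Parse are invented for illustration. It assumes attribute names start with a capital letter, contain only letters, and are immediately followed by a colon. Only the first colon in each pair is treated as the separator, so a colon inside the description (as in "types: sports") stays in the value.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AttributeParser
{
    // Hypothetical helper, not part of any library.
    public static Dictionary<string, string> Parse(string input)
    {
        var result = new Dictionary<string, string>();

        // Phase 1: split on whitespace that is followed by "Name:".
        // The look-ahead consumes nothing, so the name stays in its piece.
        string[] pairs = Regex.Split(input, @"\s(?=[A-Z][A-Za-z]*:)");

        foreach (string pair in pairs)
        {
            // Phase 2: split each piece at its first colon only.
            int colon = pair.IndexOf(':');
            if (colon < 0)
            {
                continue; // no colon, so not a name/definition pair
            }

            string key = pair.Substring(0, colon).Trim();
            string value = pair.Substring(colon + 1).Trim();
            result[key] = value; // a repeated name overwrites the earlier entry
        }

        return result;
    }

    public static void Main()
    {
        const string input = "Organization: Name of a governmental, military or other organization. OrganizationType: Organization classification to one of the following types: sports, governmental military, governmental civilian or political party. (required) Nationality: Organization nationality if mentioned in the document. (required)";

        foreach (var item in Parse(input))
        {
            Console.WriteLine("{0}: {1}", item.Key, item.Value);
        }
    }
}
```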

Chris
Nice Chris! That is splitting the right way (at least the way that matches my expectations). That basically gives me an array of 3 items. I can then simply split each item on ':', index 0 is the attribute name and index 1 is the definition.
Paul Fryer
+1 Splitting into rows first was definitely the best solution.
James