Let's say you need to funnel random, related data given to you into more succinct categories.

Example - You're given the following data. NOTE - There could be any number of other related columns:

 Customer      Product                Category
==========    =========    =================================
Customer A    Product A                 Cat 1
 CustomerA    Product B               Category 1
  Cust-A      Product C    Totally Lame & Unrelated Grouping

Task - Consolidate and normalize the above into clean, pre-defined groupings:

CustomerA
  Category1
    ProductA
    ProductB
    ProductC

Please don't worry about how the finished data will be persisted; focus instead on how you'll persist and manage the rules for grouping.

Only one assumption: You can't use a database to persist your grouping rules. So when we say "normalize", we're not speaking in terms of relational database normalization rules. Rather, we want to remove inconsistencies from the data inputs (as seen above) to bring the random data into a consistent state.

So what are the available options? Remain technology agnostic:

XML?

Config files?

Settings file (compiled or not)?

Ini File?

Code?

etc.

List pros & cons for each answer. And though this is indeed an exercise, it's a real-world problem, so assume your client/employer has tasked you with this.

A: 

This seems like a data cleansing exercise; perfection is pretty much impossible. Issues:

1). Can you specify up front the categories, or must you deduce from the data?

2). What rules can we use to accept equivalence?

Is "Cat 1" the same as "Category 1"? And "Category one"?

Is "Cat 1." the same as "Cat 1"? What about "Cat 1?" and "Cat 12"?

Just getting a good set of rules is a challenge.

3). How would you capture those rules? Code or config? If config, how would you express it? Do you end up just writing a new specialised programming language?
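
To make the equivalence question concrete, here is a minimal sketch in Python of one possible rule set: lowercase everything, drop punctuation and spacing, and expand a few known abbreviations and number words. The names `canonical_key`, `ABBREVIATIONS` and `NUMBER_WORDS` are made up for illustration, and the table entries are assumptions, not something the question specifies.

```python
import re

# Hypothetical lookup tables; the entries are assumptions for illustration only.
ABBREVIATIONS = {"cat": "category", "cust": "customer"}
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3"}


def canonical_key(raw: str) -> str:
    """Reduce a raw label to a key that equivalent labels share."""
    # Keep runs of letters and runs of digits; punctuation and spaces are dropped.
    tokens = re.findall(r"[a-z]+|\d+", raw.lower())
    # Expand number words first, then abbreviations; unknown tokens pass through.
    expanded = [NUMBER_WORDS.get(t, ABBREVIATIONS.get(t, t)) for t in tokens]
    return "".join(expanded)
```

Under these rules "Cat 1", "Category 1", "Category one", "Cat 1." and "Cat 1?" all reduce to `category1`, while "Cat 12" reduces to `category12` and stays distinct. Whether that is the right answer for "Cat 1?" is exactly the kind of decision someone still has to make.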

djna
Boydski
A: 
  1. A dictionary mapping for each value. 'Cat1' => 'Category1', 'Category 2' => 'Category2'. This is easy to store, and has no unintended consequences. The disadvantage is that creating all those mappings by hand is actual work.
  2. A series of regular expressions. That way, you can capture nearly all rules with relatively little work. The disadvantage is that regular expressions 'misfire' relatively easily, and the order of evaluation matters (i.e. when a value matches more than one 'rule').
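
A rough sketch of how the two options could be combined, in Python: the dictionary is checked first, then an ordered list of regular expressions where the first match wins. Everything here (`EXACT`, `PATTERNS`, `normalize`, and the patterns themselves) is a hypothetical illustration built from the sample data in the question, not a complete rule set.

```python
import re
from typing import Optional

# Option 1: exact-match dictionary. Explicit and predictable, but every
# variant has to be listed by hand.
EXACT = {
    "Cat 1": "Category1",
    "Category 1": "Category1",
    # No pattern could infer this one, so it needs an explicit entry.
    "Totally Lame & Unrelated Grouping": "Category1",
}

# Option 2: ordered regex rules. The first matching rule wins, so the order
# of this list is part of the rule set.
PATTERNS = [
    (re.compile(r"^cat(?:egory)?\s*(\d+)\W*$", re.IGNORECASE), r"Category\1"),
    (re.compile(r"^cust(?:omer)?[\s-]*([a-z])$", re.IGNORECASE), r"Customer\1"),
]


def normalize(value: str) -> Optional[str]:
    """Return the clean name for a raw value, or None if no rule applies."""
    value = value.strip()
    if value in EXACT:
        return EXACT[value]
    for pattern, template in PATTERNS:
        match = pattern.match(value)
        if match:
            # Substitute the captured groups into the replacement template.
            return match.expand(template)
    return None
```

With these rules, `normalize("Cust-A")` and `normalize("Customer A")` both give `"CustomerA"`, and `normalize("cat 1.")` gives `"Category1"`. Returning `None` for unmatched values keeps the misfires visible instead of guessing.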

As for how to persist them? I can't think of a more uninteresting question. You just use whatever's easiest in your preferred programming language.
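
For instance, in Python the easiest route is probably a JSON file. The sketch below assumes rules shaped as a dictionary of exact mappings plus an ordered list of (pattern, replacement) string pairs; the function names and the file layout are assumptions for illustration.

```python
import json
from pathlib import Path


def save_rules(path, exact, patterns):
    """Write exact mappings and ordered regex rules to a JSON file."""
    payload = {
        "exact": exact,
        # A JSON array preserves order, which matters for first-match-wins rules.
        "patterns": [{"match": m, "replace": r} for m, r in patterns],
    }
    Path(path).write_text(json.dumps(payload, indent=2), encoding="utf-8")


def load_rules(path):
    """Read the rules back as (exact_dict, list_of_pattern_pairs)."""
    payload = json.loads(Path(path).read_text(encoding="utf-8"))
    patterns = [(p["match"], p["replace"]) for p in payload["patterns"]]
    return payload["exact"], patterns
```

The same shape would fit XML or an INI file just as well; the only property worth preserving is the order of the regex rules.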

Michiel Buddingh'