views:

319

answers:

6

In the past we have been using a summary of historical performance by "Vendor" to decide how we allocate new business to each vendor.

Now what we want to do is to break that summary performance into smaller subsets such geography, size, age, etc to better allocate the new business we hand out to our vendors.

For example, Lets say that right now I have three Vendors and 9 pieces of business to hand out. Currently, I just hand out the pieces of business based their summary of performance. However, this does not take into consideration each of their strengths by geography, size, age, etc.

So, lets say Vendor-A is really good at handling small Texas accounts while Vendor-B is good at large Florida accounts. I want to make sure that each vendor gets what they are good at and doesn't get business that they are bad at.

We have all this information in the database, but no real algorithm or approach to extract it and base our decisions about how much and of what kind of business we hand out to who. We been looking into SQL Server Analysis Services for this task, but we could use some very general pointers about what algorithm we should use to do this.

Does anyone happen to know an approach or algorithm designed to handle the segmentation of data to make better predictions?

+5  A: 

If you have a fairly large amount of historical data to work with, you can use a naive Bayes classifier. Using the features of the accounts as input, you can train a family of classifiers to predict the performance of each vendor.

Free implementations probably already exist for your language of choice, though you'll want to do some reading about the underlying theory, to be sure you're using it properly and to give you a sense of what sorts of outputs are reasonable.

David Seiler
A: 

One way to do it is to list all the characteristics that you are interested in, for example geographical location, size of account, price, delivery time ...

Then score each vendor on each characteristic, for example 1 to 10.

Then give a priority to each characteristic, could also be 1 to 10.

This priority could be general or for each job.

Then you "simply" multiply vendor score * priortity for each desired characteristic. The one with the highest total gets the job.

Shiraz Bhaiji
+2  A: 

Converting Generic Quantities to Numeric Values: (added in edit)

You NEED some way to get a quality metric (measure of "goodness") from the data, and to convert a lot of these things into numeric quantities. Possible metrics might be time to complete a task (versus initial estimate, to see if they lie about this), cost, cost vs. quoted cost (to check if they give you truthful quotes), and some sort of conglomerate for customer satisfaction related to the project they worked on.

Regions/areas of previous contracts can be coded as a GPS coordinates with X,Y for center of region for a given transaction (IE "California" or "United States" and the width-x,width-y for that area (error bounds). To see if geography matters, you find all previous business within a given distance of your new contract (simple form: just check distance for all their prior transactions, complex form with lots of data: some sort of clustering algorithm or 3-d surface extraction).

Then you do a weighted average of (quality metric for previous business deal), reduced by (distance between current desired deal location, and previous deal location). Complicated, but it WILL give you regional business metrics.

Analyzing Data to Extract Meaning

I suggest you convert the data into standard, CSV tuples (company id, region code/coordinates, account size, performance metric), and then do statistical regressions & numerical analysis techniques. Your goal is a set of equations describing a company's performance in terms of each characteristic. Principal Components Analysis can simplify this process, by identifying the most important characteristics for a company's performance.

Software Choices:

Honestly, this sort of analysis is properly the domain of statistical or data mining software, not SQL analysis engines, since they are pure statistics and math. If your metrics & characteristics are numeric & clearly quantifiable, R, Matlab, Mathematica, etc are good choices. If your dataset is very large, you'll want to invest in the most robust package you can get.

Picking the best candidate company: simple case

The problem of assigning business to the best candidate is a classic optimization problem. If relationships between variables and performance are linear, you've got a really handy case. Linear programming techniques were designed to optimally solve exactly that kind of problem (assuming you can extract the relationships).

Picking the best candidate company: complex case

If the performance-vs-characteristics equations are nonlinear, you need fuzzy logic to find the best solution. This will use very,very advanced algorithms to find a very good (not necessarily best) solution for business when there are a LOT of variables. From experience, Mathematica has a top-notch set of optimization tools (based on genetic algorithms) for nonlinear optimization, and Matlab also has fuzzy logic/genetic algorithms for solving that problem. For further tools, look into "global nonlinear optimization."

BobMcGee
+3  A: 

Naive bayes is a good choice, but I'd also consider decision trees. The major pros to using decision trees are these:

  1. It's easy for a human to understand what's going on and tweak.
  2. It does well even if you only have a small set of data to train against.

But whatever you decide, the hardest part will be transforming your data to certain attributes. This is to both tailor it for a particular algorithm as well as to keep the training from becoming too specific.

For example: Instead of age as a number, you may be more interested in specifying an age range (0-10, 11-20, ... ) for decision purposes. Likewise with geography, city, etc.

hythlodayr
A: 

Calculate the variance of your historical data. That can be used to 'predict' what you'll see in the future. If the assumption "the future will be something like the past" makes sense for your process. It's fairly simple and many sql servers offer builtins to help do the calculations.

Jay
A: 

Regression analysis is what you need, Linear regression in particular.

I hope that helped. God bless :)

Leo Jweda