views: 98
answers: 4

Suppose you were able to keep track of the news mentions of different entities, like, say, "Steve Jobs" and "Steve Ballmer".

What are some ways you could tell whether the number of mentions of an entity in a given time period was unusual relative to its normal frequency of appearance?

I imagine that for a more popular person like Steve Jobs an increase of 50% might be unusual (say, from 1,000 mentions to 1,500), while for a relatively unknown CEO a jump from 2 mentions to 200 in a single day could be entirely possible. If you didn't have a way of scaling for that, your unusualness index could be dominated by unheard-ofs getting their 15 minutes of fame.

Update: to make it clearer, assume you are already able to consume a continuous news stream, identify the entities in each news item, and store all of this in a relational data store.

A: 

Way oversimplified: store people's names and the number of articles created in the past 24 hours that mention them. Compare to historical data.

Real life: if you're trying to dynamically pick out people's names, how would you go about doing that? When searching through articles, how do you grab names? Once you grab a new name, do you search all articles for it? How do you separate Steve Jobs of Apple from Steve Jobs the new star running back who is generating a lot of articles?

If you're looking for simplicity, create a table with 50 people's names that you insert yourself. Every day at midnight, have your program run a quick Google query over the past 24 hours and store the number of results. There are a lot of variables here, though, that we're not accounting for.
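A minimal sketch of that, assuming SQLite as the store and a placeholder count_mentions_last_24h() function standing in for whatever source you actually query (Google, your own article table, etc.):

    import sqlite3
    from datetime import date

    def count_mentions_last_24h(name):
        """Placeholder: plug in your news API or article-store query here."""
        raise NotImplementedError

    conn = sqlite3.connect("mentions.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS daily_counts
                    (name TEXT, day TEXT, mentions INTEGER,
                     PRIMARY KEY (name, day))""")

    tracked_names = ["Steve Jobs", "Steve Ballmer"]  # your fixed list of ~50 people

    for name in tracked_names:
        conn.execute("INSERT OR REPLACE INTO daily_counts VALUES (?, ?, ?)",
                     (name, date.today().isoformat(), count_mentions_last_24h(name)))
    conn.commit()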

Mike M.
You'd need to look for associations: group articles that have ["steve jobs", "apple", "iPhone"] separately from articles that have ["steve jobs", "football", "running back"]. Of course, there'd be some noise if the running back gave an interview about his new iProduct, but you can't expect this analysis to be perfect for all situations ;) Association rule analysis can be a useful data mining technique, and maybe it could be applied here.
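(Just to illustrate the grouping idea with simple keyword overlap rather than real association-rule mining; the context sets below are made up:)

    # Assign each article mentioning "Steve Jobs" to whichever hand-built
    # context shares the most keywords with it. Purely illustrative.
    CONTEXTS = {
        "Steve Jobs (Apple)":    {"apple", "iphone", "cupertino", "ceo"},
        "Steve Jobs (football)": {"football", "running back", "touchdown", "nfl"},
    }

    def disambiguate(article_text):
        text = article_text.lower()
        scores = {label: sum(kw in text for kw in kws)
                  for label, kws in CONTEXTS.items()}
        return max(scores, key=scores.get)

    print(disambiguate("Steve Jobs unveiled a new iPhone in Cupertino today"))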
FrustratedWithFormsDesigner
"If you're trying to dynamically pick out people's names, how would you go about doing that?" This question supposes that you already have a way of doing this. How you would "Compare to historical data" is what I'm interested in.
ʞɔıu
@ʞɔıu: You have to GET historical data somehow. You can scan old articles and build a database from that, or, if the articles are not available, start with current ones and build a database for a while before comparing to the next "current" set. How much data you need for a good sample size is hard to say.
FrustratedWithFormsDesigner
You don't know that in two weeks Steve Jobs, the breakout running back, will enter the scene. I'm saying that trying to do it all dynamically is never going to give you concrete stats. There are too many variables to isolate; choose your battles.
Mike M.
All right, so suppose that he feeds it to Mechanical Turk and every mention is vetted by a human being - now how does he identify which counts are high compared to the historical data? By eyeballing it?
Matt Parker
+2  A: 
  • Create a database and keep a history of stories with a time stamp. You then have a history of stories over time of each category of news item you're monitoring.
  • Periodically calculate the number of stories per unit of time (you choose the unit).
  • Test if the current value is more than X standard deviations away from the historical data.

Some data will be more volatile than others, so you may need to adjust X appropriately. X = 1 is a reasonable starting point.
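A rough sketch of that test, assuming you've already pulled the historical per-period counts for an entity out of your store (the numbers below are made up):

    from statistics import mean, stdev

    def is_unusual(historical_counts, current_count, x=1.0):
        """Flag current_count if it is more than x standard deviations
        away from the historical mean. Needs at least two history points."""
        mu, sigma = mean(historical_counts), stdev(historical_counts)
        if sigma == 0:
            return current_count != mu
        return abs(current_count - mu) > x * sigma

    history = [1000, 950, 1100, 980, 1020, 990]   # daily story counts
    print(is_unusual(history, 1500))              # True: outside normal variation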

Jay
I think this is a good start, and there's also the element that the variance for smaller populations is going to be larger. You'd need something like a variance-of-variance.
ʞɔıu
Multilevel models (aka hierarchical linear models) are one way to account for that variance-of-variance.
Matt Parker
Thanks Matt. I'll look into that technique. Stack Overflow is always a great resource.
Jay
+3  A: 

You could use a rolling average. This is how a lot of stock trackers work. By tracking the last n data points, you can see whether a change is substantial relative to the usual variance.
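A minimal sketch of that, in plain Python, with a made-up window size and threshold:

    from collections import deque
    from statistics import mean, stdev

    class RollingTracker:
        """Keep the last n counts and flag values far from the rolling mean."""
        def __init__(self, n=30):
            self.window = deque(maxlen=n)

        def add_and_check(self, count, x=3.0):
            unusual = False
            if len(self.window) >= 2:
                mu, sigma = mean(self.window), stdev(self.window)
                unusual = sigma > 0 and abs(count - mu) > x * sigma
            self.window.append(count)
            return unusual

    tracker = RollingTracker(n=7)
    for c in [2, 3, 1, 2, 4, 2, 200]:   # an unknown CEO's 15 minutes of fame
        print(c, tracker.add_and_check(c))   # only the final spike is flagged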

You could also try some normalization. One very simple approach: each entity has a total number of mentions (m), a percent change from the last time period (δ), and a normalized value (z) where z = m * δ. Let's look at the table below (m0 is the previous value of m):

Name                m    m0     δ       z
Steve Jobs       4950  4500   .10     495
Steve Ballmer     400   300   .33     132
Larry Ellison      50    10   4.0     200
Andy Nobody        50    40   .25    12.5

Here, a 400% change for the relatively unknown Larry Ellison results in a z value of 200, a 10% change for the much better known Steve Jobs gives 495, and my spike of 25% is still a low 12.5. You could tweak this algorithm depending on what you feel are good weights, or use the standard deviation or the rolling average to find whether a value is far away from the "expected" results.
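In code that normalization is only a couple of lines; this reproduces the table above (δ rounded to two decimals, as in the table):

    counts = {  # name: (current mentions m, previous mentions m0)
        "Steve Jobs":    (4950, 4500),
        "Steve Ballmer": ( 400,  300),
        "Larry Ellison": (  50,   10),
        "Andy Nobody":   (  50,   40),
    }

    for name, (m, m0) in counts.items():
        delta = round((m - m0) / m0, 2)   # percent change from the last period
        z = m * delta                     # volume-weighted change
        print(f"{name:15s} m={m:5d} delta={delta:5.2f} z={z:7.1f}")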

Andy
I also like this book -- [O'Reilly's Programming Collective Intelligence](http://oreilly.com/catalog/9780596529321) -- for algorithms and approaches to problems like this. It's in Python and uses real-world data corpora to demonstrate a lot of the concepts you might be interested in.
Andy
A: 

The method you use is going to depend on the distribution of the counts for each person. My hunch is that they are not going to be normally distributed, which means that some of the standard approaches to longitudinal data might not be appropriate - especially for the small-fry, unknown CEOs you mention, whose data will be sparse and anything but continuous.

I'm really not well-versed enough in longitudinal methods to give you a solid answer here, but here's what I'd probably do if you locked me in a room to implement this right now:

  1. Dig up a bunch of past data. Hard to say how much you'd need, but I would basically go until it gets computationally insane or the timeline gets unrealistic (not expecting Steve Jobs references from the 1930s).

  2. In preparation for creating a simulated "probability distribution" of sorts (I'm using terms loosely here), more recent data needs to be weighted more than past data - e.g., a thousand years from now, hearing one mention of (this) Steve Jobs might be considered a noteworthy event, so you wouldn't want to be using expected counts from today (Andy's rolling mean is using this same principle). For each count (day) in your database, create a sampling probability that decays over time. Yesterday is the most relevant datum and should be sampled frequently; 30 years ago should not.

  3. Sample out of that dataset using the weights and with replacement (i.e., same datum can be sampled more than once). How many draws you make depends on the data, how many people you're tracking, how good your hardware is, etc. More is better.

  4. Compare your actual count of stories for the day in question to that distribution. What percent of the simulated counts lie at or above your real count? That's roughly (god don't let any economists look at this) the probability of your real count or a larger one happening on that day. Now you decide what's relevant - 5% is the norm, but it's an arbitrary, stupid norm. Just browse your results for a while and see what seems relevant to you. The end. (A rough sketch of steps 2-4 appears after this list.)
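Here's a very rough sketch of steps 2-4, with a made-up decay rate and sample size, using random.choices for the weighted draws:

    import random

    def tail_probability(past_counts, todays_count,
                         half_life_days=90, n_draws=100_000):
        """past_counts[0] is yesterday, past_counts[1] the day before, etc.
        Returns the fraction of weighted resamples >= todays_count."""
        # Step 2: sampling weights that decay with the age of each day
        weights = [0.5 ** (age / half_life_days)
                   for age in range(len(past_counts))]
        # Step 3: sample with replacement according to those weights
        draws = random.choices(past_counts, weights=weights, k=n_draws)
        # Step 4: how often does the simulation meet or beat today's count?
        return sum(d >= todays_count for d in draws) / n_draws

    history = [3, 1, 0, 2, 1, 4, 0, 2, 1, 1] * 20   # an obscure CEO's usual days
    print(tail_probability(history, todays_count=200))  # ~0.0, i.e. very unusual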

Here's what sucks about this method: there's no trend in it. If Steve Jobs had 15,000 a week ago, 2000 three days ago, and 300 yesterday, there's a clear downward trend. But the method outlined above can only account for that by reducing the weights for the older data; it has no way to project that trend forward. It assumes that the process is basically stationary - that there's no real change going on over time, just more and less probable events from the same random process.

Anyway, if you have the patience and willpower, check into some real statistics. You could look into multilevel models (each day is a repeated measure nested within an individual), for example. Just beware of your parametric assumptions... mention counts, especially on the small end, are not going to be normal. If they fit a parametric distribution at all, it would be in the Poisson family: the Poisson itself (good luck), the overdispersed Poisson (aka negative binomial), or the zero-inflated Poisson (quite likely for your small-fry, no chance for Steve).
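If you do go parametric, the tail check itself is cheap. A minimal sketch with scipy, using an invented historical rate and today's count:

    from scipy.stats import poisson

    daily_rate = 2.0      # historical average mentions per day for this entity
    todays_count = 9

    # P(X >= todays_count) under a plain Poisson assumption
    p_tail = poisson.sf(todays_count - 1, daily_rate)
    print(f"P(count >= {todays_count}) = {p_tail:.5f}")   # ~0.0002: very unusual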

Awesome question, at any rate. Lend your support to the statistics StackExchange site, and once it's up you'll be able to get a much better answer than this.

Matt Parker