statistics

Statistical Summarization of records

Here is the real world issue that we are solving. We have some rather large data sets that need to be aggregated and summarized in real time with a number of filters and formulas applied to them. It works fine to apply these to each record in real time when the data set is less than 50,000 records but as we approach 100,000 and then 100+...

Create CDF for Anderson Darling test for Octave forge Statistics package function

I am using Octave and I would like to use the anderson_darling_test from the Octave forge Statistics package to test if two vectors of data are drawn from the same statistical distribution. Furthermore, the reference distribution is unlikely to be "normal". This reference distribution will be the known distribution and taken from the hel...

How do I evaluate the effectiveness of an algorithm that predicts probabilities?

I need to evaluate the effectiveness of algorithms which predict the probability of something occurring. My current approach is to use "root mean squared error", i.e. the square root of the mean of the squared errors, where the error is 1.0-prediction if the event occurred, or prediction if the event did not occur. The algorithms have n...
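The error measure described in this question can be sketched in a few lines of Python. This is only an illustration of the stated definition; the function name `rms_probability_error` and its two-list interface are assumptions, not something from the question. The mean of the squared errors (before taking the square root) is commonly known as the Brier score.

```python
from math import sqrt


def rms_probability_error(predictions, outcomes):
    """Root mean squared error of probability predictions.

    predictions: probabilities in [0, 1] that the event occurs.
    outcomes: booleans, True where the event actually occurred.
    (Hypothetical interface, chosen for illustration.)
    """
    if len(predictions) != len(outcomes) or len(predictions) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    squared = []
    for p, occurred in zip(predictions, outcomes):
        # Error is 1 - p when the event occurred, p when it did not.
        error = (1.0 - p) if occurred else p
        squared.append(error * error)
    return sqrt(sum(squared) / len(squared))
```

A perfect predictor scores 0.0, and always answering 0.5 scores 0.5 regardless of the outcomes, which gives a baseline to compare algorithms against.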

Creating a formula for calculating device "health" based on uptime/reboots

I have a few hundred network devices that check in to our server every 10 minutes. Each device has an embedded clock, counting the seconds and reporting elapsed seconds on every check in to the server. So, sample data set looks like

CheckinTime              Runtime
2010-01-01 02:15:00.000  101500
2010-01-01 02:25:00.000  102100
2010-...

blindly classifying new trends in incoming data

How do news outlets like Google News automatically classify and rank documents about emerging topics, like "Obama's 2011 budget"? I've got a pile of articles tagged with baseball data like player names and relevance to the article (thanks, OpenCalais), and would love to create a Google News-style interface that ranks and displays new po...

What R packages are available for binary data that is both correlated and clustered?

I'm working on a project now that's rather unlike anything I've done before. I have two tests with binary results that will be administered to the same sample, which is drawn from a clustered population (i.e., some subjects will be from the same family). I'd like to compare proportions of positive test results, but the clustering makes...

How can I estimate the logarithmic form of data points using R?

I have data points that represent a logarithmic function. Is there an approach where I can just estimate the function that describes this data using R? Thanks. ...

Find out If index and table statistics are out of date

Hi, I update indexes with a full scan weekly. So when I run:

SELECT name AS index_name, STATS_DATE(OBJECT_ID, index_id) AS StatsUpdated FROM sys.indexes

I expect it to show me that all indexes were updated this weekend. But there are several records which look like:

index_name  StatsUpdated
clust       2005-10-14 01:36:2...

Plot multiple functions in R

I previously asked this question which was useful in plotting a function. I want to try and plot twenty functions on the same axes to illustrate how a function varies between two ranges. I have successfully done this using individually specified functions, but I wanted to do this using a loop. What I have attempted doing is: ## add gg...

How can I correlate pageviews with memory spikes?

I'm having some memory problems with an application, but it's a bit difficult to figure out exactly where it is. I have two sets of data:

Pageviews
- The page that was requested
- The time said page was requested

Memory use
- The amount of memory being used
- The time this memory use was recorded

I'd like to see exactly which pageviews...

What programming languages are good for statistics?

I'm doing a bit more statistical analysis on some things lately, and I'm curious if there are any programming languages that are particularly good for this purpose. I know about R, but I'd kind of prefer something a bit more general-purpose (or is R pretty general-purpose?). What suggestions do you guys have? Are there any languages o...

Exporting Stata results

I'm sure this is an issue anyone who uses Stata for publications or reports has run into: how do you conveniently export your output to something that can be parsed by a scripting language or Excel? There are a few ADO files that do this for specific commands (try findit tabout or findit outreg2). But what about exporting the output of ...

print beautiful value with error

I want to display in an HTML page some data with errors, for example:

(value, error) -> string
(123, 12) -> (12 +- 1) x 10^1
(4234.3, 2) -> (4234 +- 2)
(0.02312, 0.003) -> (23 +- 3) x 10^-3

I've produced this:

from math import log10

def format_value_error(value,error):
    E = int(log10(abs(error)))
    val = float(value) / 10**E
    ...
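One way to reproduce the three examples in this question is sketched below. It is an assumption about the intended rule, not the asker's finished code: the exponent is taken from the error with `floor` rather than `int`, because `int` truncates toward zero and would give -2 instead of -3 for an error of 0.003.

```python
from math import floor, log10


def format_value_error(value, error):
    """Format a value and its error scaled to the error's leading digit."""
    if error == 0:
        raise ValueError("error must be non-zero")
    # Power of ten of the error's leading digit, e.g. 12 -> 1, 0.003 -> -3.
    exponent = floor(log10(abs(error)))
    scale = 10.0 ** exponent
    val = round(value / scale)
    err = round(abs(error) / scale)
    text = f"({val} +- {err})"
    # Omit the power-of-ten suffix when the exponent is zero.
    if exponent != 0:
        text += f" x 10^{exponent}"
    return text
```

With this rule `format_value_error(123, 12)` gives `(12 +- 1) x 10^1`. The sketch does not handle an error that rounds up to 10 (for example 9.7), which would need the exponent bumped by one.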

SQL function to calculate median

Possible Duplicate: Function to Calculate Median in Sql Server I have a table containing two fields (more, but not relevant). The fields are Price and Quantity. I want to find several statistics for this table, and among them is the median price when adjusted to quantity. Today I have a basic-slow-not so good looking funct...
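The logic of a quantity-adjusted median can be sketched outside SQL first. The Python below assumes that "median price when adjusted to quantity" means each row counts as Quantity copies of its Price, and it returns the lower median when the total quantity is even; both choices, and the name `weighted_median_price`, are assumptions for illustration.

```python
def weighted_median_price(rows):
    """Return the quantity-weighted (lower) median price.

    rows: iterable of (price, quantity) pairs with non-negative quantities.
    """
    ordered = sorted(rows)  # sort by price
    total = sum(quantity for _, quantity in ordered)
    if total <= 0:
        raise ValueError("total quantity must be positive")
    running = 0
    for price, quantity in ordered:
        running += quantity
        # First price at which the cumulative quantity reaches half the total.
        if 2 * running >= total:
            return price
```

The same idea carries over to SQL as a running total of Quantity ordered by Price, picking the first row whose running total reaches half of the overall sum.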

How do you compare the "similarity" between two dendrograms (in R)?

I have two dendrograms which I wish to compare to each other in order to find out how "similar" they are. But I don't know of any method to do so (let alone code to implement it, say, in R). Any leads? Thanks, Tal ...

Algorithm(s) for spotting anomalies ("spikes") in traffic data

I find myself needing to process network traffic captured with tcpdump. Reading the traffic is not hard, but what gets a bit tricky is spotting where there are "spikes" in the traffic. I'm mostly concerned with TCP SYN packets and what I want to do is find days where there's a sudden rise in the traffic for a given destination port. Ther...

Design pattern for ongoing survey analysis

I'm doing an ongoing survey, every quarter. We get people to sign up (where they give extensive demographic info). Then we get them to answer six short questions with 5 possible values: much worse, worse, same, better, much better. Of course over time we will not get the same participants; some will drop out and some new ones will si...

Fluent NHibernate: Model Markov chain

I want to use Fluent NHibernate to model a Markov chain. It's basically a set of different states with transition probabilities between the states. I want to map the transition probabilities into MarkovState.TransitionProbabilities as a Dictionary. I want to use the NEXT state as key (using either MarkovState or int as key), so that I c...

Drupal - Counting data in nodes, creating custom statistics

Hiya, I'm building some custom content types to capture customer data on a website. Admins will enter the data, users will be able to view it, but I also need to be able to bolt on some statistics and infographics to the data. The problem I have is that I can't see any simple way of doing this within Drupal. Are there modules which ca...

Statistics regarding Windows XP before SP2, is it worth still supporting it

We would like to drop support for our application on stock Windows XP and XP SP1 and thus require SP2 or higher. I tried finding some statistics about market share of the various service packs of Windows but failed. Do you have such links? Do you still support XP before SP2? ...