statistics

How many professional software developers are there worldwide?

I'd like to know the estimated supply of professional software developers globally and, wherever possible, regionally. Although it may seem an odd question, I hope it sheds some light on the global availability of software development services or, at the very least, on just how much of a commodity we are. Edit: by "profes...

Tools for sparse least squares regression

Hi, I want to do sparse, high-dimensional (a few thousand features) least squares regression with a few hundred thousand examples. I'm happy to use non-fancy optimisation; stochastic gradient descent is fine. Does anyone know of any existing software for doing this, so I don't have to write my own? Kind regards. ...
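
For readers unfamiliar with the approach named in the question, here is a minimal sketch of stochastic gradient descent for least squares on sparse rows. It is not an existing package and not the asker's code: the function name `sgd_least_squares`, the dict-based sparse row format, the learning rate, and the toy data are all assumptions made for illustration.

```python
def sgd_least_squares(examples, n_features, lr=0.1, epochs=200):
    """Fit weights minimising squared error with plain SGD.

    Each example is (features, target), where features is a dict
    mapping feature index -> non-zero value (a sparse row).
    """
    w = [0.0] * n_features
    for _ in range(epochs):
        for features, target in examples:
            # The prediction touches only the non-zero features.
            pred = sum(w[i] * v for i, v in features.items())
            err = pred - target
            # Gradient of 0.5 * err**2 with respect to w[i] is err * v,
            # so only weights of non-zero features are updated.
            for i, v in features.items():
                w[i] -= lr * err * v
    return w


# Toy data generated from y = 2*x0 + 3*x1 (illustrative only).
examples = [({0: 1.0}, 2.0), ({1: 1.0}, 3.0), ({0: 1.0, 1: 1.0}, 5.0)]
weights = sgd_least_squares(examples, n_features=2)
```

Because each update only visits the non-zero entries of a row, the cost per example scales with the row's sparsity rather than with the total number of features.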

Oracle V$OSSTAT

The Oracle view V$OSSTAT holds a few operating statistics, including:
IDLE_TICKS: Number of hundredths of a second that a processor has been idle, totalled over all processors
BUSY_TICKS: Number of hundredths of a second that a processor has been busy executing user or kernel code, totalled over all processors
The documentation I've ...

Algorithm to order 'tag line' campaigns based on resulting sales

I want to be able to introduce new 'tag lines' into a database that are shown 'randomly' to users. (These tag lines are shown as an introduction as animated text.) Based upon the number of sales that result from those taglines I'd like the good ones to trickle to the top, but still show the others less frequently. I could come up with ...

How to get the statistics existing on a column, if any?

I want to check in Transact-SQL whether a specific column in a table has statistics and, if so, retrieve them all. ...

What's the quickest way to get the mean of a set of numbers from the command line?

Using any tools you would expect to find on a *nix system (in fact, if you want, msdos is fine too), what is the easiest/fastest way to calculate the mean of a set of numbers, assuming you have them one per line in a stream or file? ...
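
As one possible illustration, assuming Python is available on the system, the mean of one-number-per-line input can be computed in a single pass. The function name `mean_of_lines` is hypothetical, and skipping blank lines is a design choice of this sketch, not something the question specifies.

```python
def mean_of_lines(lines):
    """Return the arithmetic mean of one-number-per-line input."""
    total = 0.0
    count = 0
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        total += float(line)
        count += 1
    if count == 0:
        raise ValueError("no numbers supplied")
    return total / count


# Any iterable of text lines works, including an open file or sys.stdin.
result = mean_of_lines(["1", "2", "3", "6"])
print(result)  # 3.0
```

Passing `sys.stdin` instead of the list lets the same function read from a pipe or a redirected file.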

tf-idf and previously unseen terms

TF-IDF (term frequency - inverse document frequency) is a staple of information retrieval. It's not a proper model though, and it seems to break down when new terms are introduced into the corpus. How do people handle it when queries or new documents have new terms, especially if they are high frequency? Under traditional cosine match...

Probability of selecting an element from a set

Hello, The expected probability of randomly selecting an element from a set of n elements is P=1.0/n. Suppose I check P using an unbiased method sufficiently many times. What is the distribution type of P? It is clear that P is not normally distributed, since it cannot be negative. Thus, may I correctly assume that P is gamma distributed? ...

How can I create an ordered list of the most common substrings inside of my MySQL varchar column?

I have a MySQL database table with a couple thousand rows. The table is set up like so:
id | text
The id column is an auto-incrementing integer, and the text column is a 200-character varchar. Say I have the following rows:
3 | I think I'll have duck tonight
4 | Maybe the chicken will be alright
5 | I have a pet duck now, awesome!
...

Simple Random Samples from a (My)Sql database

How do I take an efficient simple random sample in SQL? The database in question is running MySQL; my table is at least 200,000 rows, and I want a simple random sample of about 10,000. The "obvious" answer is to:
SELECT * FROM table ORDER BY RAND() LIMIT 10000
For large tables, that's too slow: it calls RAND() for every row (which al...
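
One alternative to sorting on RAND() is to draw the sample on the client in a single pass. The sketch below uses reservoir sampling, a different technique from the query in the question, and it runs in Python, not in SQL. The function name `reservoir_sample` and the use of `range(200_000)` as a stand-in for rows streamed from the table are assumptions.

```python
import random


def reservoir_sample(rows, k, rng=random):
    """Return k items chosen uniformly at random from an iterable, in one pass."""
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


# Stand-in for rows streamed from the table (hypothetical ids 0..199999).
picked = reservoir_sample(range(200_000), 10_000, rng=random.Random(0))
```

This needs memory only for the k sampled rows, but it still reads every row once, so it trades the sort for a full scan.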

What is a good free online poll/survey app?

I need to conduct a survey of 3 questions. The first question will be Yes/No. The second will be multiple choice, allowing more than one answer to be selected, with an "other" box for a free-form answer. The last will be a textarea in which respondents can enter general comments/suggestions. I would lov...

Is there an R package for learning a Dirichlet prior from counts data?

I'm looking for an R package which can be used to train a Dirichlet prior from counts data. I'm asking for a colleague who's using R, and don't use it myself, so I'm not too sure how to look for packages. It's a bit hard to search for, because "R" is such a nonspecific search string. There doesn't seem to be anything on CRAN, but ar...

Bug distribution

I have a program that I'm porting from one language to another. I'm doing this with a translation program that I'm developing myself. The relevant result of this is that I expect that there are a number of bugs in my system that I am going to need to find and fix. Each bug is likely to manifest in many places and fixing it will fix the b...

sas one-liner

Is there a way to run a one-liner in SAS, or do I have to create a file? I'm looking for something like the -e flag in Perl. ...

How can I calculate a fair overall game score based on a variable number of matches?

I have a game in which you can score from -40 to +40 on each match. Users are allowed to play any number of matches. I want to calculate a total score that implicitly takes into account the number of matches played. Calculating only the average is not fair. For example, if Peter plays four games and gets 40 points on each match, he wil...

calculate poisson probability percentage in python

When you use the POISSON function in Excel (or in OpenOffice Calc), it takes two arguments: an integer and an 'average' number, and returns a float. In Python (I tried RandomArray and NumPy) it returns an array of random Poisson numbers. What I really want is the percentage that this event will occur (it is a constant number and the array...
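
The probability the question is after can be computed directly from the Poisson formula P(X = k) = mean^k * e^(-mean) / k!, without drawing random numbers. The following is a minimal sketch using only the standard library; the function names `poisson_pmf` and `poisson_cdf` are made up for this example.

```python
import math


def poisson_pmf(k, mean):
    """P(X = k) for a Poisson distribution with the given mean."""
    return mean ** k * math.exp(-mean) / math.factorial(k)


def poisson_cdf(k, mean):
    """P(X <= k): the cumulative probability up to and including k."""
    return sum(poisson_pmf(i, mean) for i in range(k + 1))


# Probability of exactly 2 events when the average is 3.
p_exactly_two = poisson_pmf(2, 3.0)
```

Multiplying the returned value by 100 gives the percentage.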

Weighted Mean

I have an existing web app that allows users to "rate" items based on their difficulty. (0 through 15). Currently, I'm simply taking the average of each user's opinion and presenting the average straight from MySQL. However, it's becoming clear to me (and my users) that weighting the numbers would be more appropriate. Oddly enough, a...
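
For reference, a weighted mean divides the sum of each value times its weight by the sum of the weights. The sketch below assumes the weights are already known; how they would be chosen for this app is not stated in the excerpt, and the function name and sample numbers are hypothetical.

```python
def weighted_mean(values, weights):
    """Return sum(value * weight) / sum(weight)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Three hypothetical difficulty ratings; the last one counts double.
score = weighted_mean([4, 10, 12], [1, 1, 2])
print(score)  # 9.5
```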

Computing a mean confidence interval without storing all the data points.

For large n (see below for how to determine what's large enough), it's safe, by the central limit theorem, to treat the distribution of the sample mean as normal (Gaussian), but I'd like a procedure that gives a confidence interval for any n. The way to do that is to use a Student's t distribution with n-1 degrees of freedom. So the quest...
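
One way to keep only a constant amount of state is to update the count, mean, and sum of squared deviations as each point arrives (Welford's method), then form the interval as mean ± t * s / sqrt(n). The sketch below is an illustration under assumptions: the class name `RunningStats` is made up, and the critical value `t_crit` for n-1 degrees of freedom is supplied by the caller (for example from a t table) because the standard library has no Student's t quantile function.

```python
import math


class RunningStats:
    """Track count, mean, and squared deviations without storing the data."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        """Unbiased sample variance (divides by n - 1)."""
        if self.n < 2:
            raise ValueError("need at least two data points")
        return self.m2 / (self.n - 1)

    def confidence_interval(self, t_crit):
        """Interval mean +/- t_crit * s / sqrt(n); t_crit is for n - 1 df."""
        half_width = t_crit * math.sqrt(self.variance() / self.n)
        return self.mean - half_width, self.mean + half_width


stats = RunningStats()
for x in [1.0, 2.0, 3.0, 4.0, 5.0]:
    stats.add(x)

# 2.776 is the two-sided 95% critical value for 4 degrees of freedom.
low, high = stats.confidence_interval(t_crit=2.776)
```

Only three numbers are stored no matter how many points are added, and the interval can be recomputed at any time.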

Are there any free and open-source server-side analytics engines?

If you have any experience with them, what are your thoughts on them? ...

Interactive Statistical Analysis tool

I'm looking for basic software for statistical analysis. Most important is simple and intuitive use, getting started "right out of the box". At least basic operations should be interactive. Free would be a bonus :) The purpose is analysis of data dumps and logs of various processes. Importing a comma/tab-separated file, sorting and...