data-mining

Overwhelmed by Machine Learning: is there an ML101 book?

It seems like there are so many subfields linked to Machine Learning. Is there a book or a blog that gives an overview of those different fields and what each of them does, maybe how to get started, and what background knowledge is required? ...

Testing When Correctness is Poorly Defined?

I generally try to use unit tests for any code that has easily defined correct behavior given some reasonably small, well-defined set of inputs. This works quite well for catching bugs, and I do it all the time in my personal library of generic functions. However, a lot of the code I write is data mining code that basically looks for...

What kind of artificial intelligence jobs are out there?

Throughout my academic years in computer science I fell in love with many aspects of artificial intelligence, from expert systems and neural networks to data mining (classification). I wonder, if I were to turn this academic passion into a profession, what kind of AI-related jobs are out there? ...

Correcting a known bias in collected data

Ok, so here is a problem analogous to my problem (I'll elaborate on the real problem below, but I think this analogy will be easier to understand). I have a strange two-sided coin that only comes up heads (randomly) 1 in every 1,001 tosses (the remainder being tails). In other words, for every 1,000 tails I see, there will be 1 heads. ...

PHP immediate echo

I have quite a long data mining script, and in parts of it I echo some information to the page (during a foreach loop, actually). However, I am noticing that the information is being sent to the browser not immediately as I had hoped, but in 'segments'. Is there some function I can use after my echo to send all the data to the browser im...

R Random Forests Variable Importance

I am trying to use the random forests package for classification in R. The variable importance measures listed are: mean raw importance score of variable x for class 0; mean raw importance score of variable x for class 1; MeanDecreaseAccuracy; MeanDecreaseGini. Now I know what these "mean" as in I know their definitions. What I wa...

Best open source library or application to crawl and data mine web sites

I would like to know what the best open-source library for crawling and analyzing websites is. One example would be a crawler for property agencies, where I would like to grab information from a number of sites and aggregate it into my own site. For this I need to crawl the sites and extract the property ads. ...

Improving the accuracy of Microsoft Time Series Algorithm

The situation is that we have branches in every city, selling food. I feed the time series algorithm with the actual date as the key time, and the total sales of that day as the input and predict. The predictions are not bad, but I would like to know if I can improve them by, for example, feeding in the number of branches (a new bran...

Charting and Data Manipulation.

Hi, although there are some threads on here about .NET charting controls, I'm starting a new thread because I'm possibly looking for some advanced data manipulation (maybe this would fall under data mining, but I'm not sure) along with charting. I've been asked to research and prototype a Key Performance Indicators (KPI) system. Basically...

curl not working for getting a web page content, why?

Hi all, I am using a curl script to go to a link and get its content for further manipulation. The following is the link and curl script: <?php $url = 'http://criminaljustice.state.ny.us/cgi/internet/nsor/fortecgi?serviceName=WebNSOR&templateName=detail.htm&requestingHandler=WebNSORDetailHandler&ID=368343543'; //cur...

Clustering Algorithm with discrete and continuous attributes?

Does anyone know a good algorithm for performing clustering on both discrete and continuous attributes? I am working on a problem of identifying a group of similar customers, and each customer has both discrete and continuous attributes (think type of customer, amount of revenue generated by this customer, geographic location, etc.). Tr...

SQL Server Non-Standard Date Based Histogram

I have user login data with timestamps and what I would like to do is get a histogram of logins by year, but with the year starting at an arbitrary date. For example, I want the following sort of information: 1 May 2005 - 30 Apr 2006 | 525 1 May 2006 - 30 Apr 2007 | 673 1 May 2007 - 30 Apr 2008 | 892 1 May 2008 - 30 Apr 2009 | 1047 Th...

Data Mining open source tools

Hi, I'm due to take up a data mining project. Before I jump in, I wanted to probe around for different data mining tools (preferably open source) that allow web-based reporting. In my scenario all the data would be provided to me, so I'm not supposed to crawl for it. In a nutshell, I am looking for a tool which does - D...

Industry benchmarks to assess data mining tools

Hi, I'm looking for data mining tools for a project, and in line with that I have put up another post on SO. I'm currently looking at different tools and am wondering whether any industry benchmark exists to assess different data mining tools, so that I can refer to it to do a better evaluation of tools. Please let me know if any such benchmark ...

How is BI related to data mining?

Hi, I'm a little confused about how to connect BI with data mining. Can BI be termed some kind of manifestation of data mining? How different is a BI tool like Microsoft Analysis Services from a data mining tool like Weka? I guess BI involves more reporting and analysis of data, wherein the data undergoes some kind of aggregatio...

How do I visualize a large document set?

I have 100 GB of documents. I would like to characterize the set and get a general sense of what topics are prevalent. The documents are plain text. I have considered using a tool like Google Desktop to search, but the set is too large to really guess what to search for, and it is too time-consuming to perform enough searches to cover the entire se...

Data mining logs to locate a bug

I'm working on a data distribution application which receives data from a source and distributes that data to multiple target applications. After successfully distributing several messages each second for 8 days, it missed a single message and did not deliver it properly to the clients. As I was looking at the logs I tried to find someth...

Sparse parameter selection using Genetic Algorithm

Hello, I'm facing a parameter selection problem, which I would like to solve using a genetic algorithm (GA). I'm supposed to select no more than 4 parameters out of 3000 possible ones. Using the binary chromosome representation seems like a natural choice. The evaluation function punishes too many "selected" attributes and if the number o...

Business Intelligence: Data mining with MS SQL Server?

I have to study data mining using SQL Server. As far as I know, Business Intelligence in SQL Server supports data mining, but I'm not sure. Does BI really support data mining? How can I get started with data mining in SQL Server? I mean resources such as books, blogs, etc. Thank you all. ...

Data mining with SQL Server, how should I begin?

I have to study data mining with SQL Server, but I don't know how to begin. Can you suggest some books on this subject, or other sources for learning it? Thank you in advance. ...