data-mining

Splitting data into training/testing datasets in MATLAB?

After some research I found two functions in MATLAB to do the task: the cvpartition function in the Statistics Toolbox, and the crossvalind function in the Bioinformatics Toolbox. Now, I've used cvpartition to create n-fold cross-validation subsets before, along with the Dataset/Nominal classes from the Statistics Toolbox. So I'm just wondering ...

Data Mining project ideas?

Hi, I am looking for project ideas in the field of data mining. I expect to complete the project in a quarter and intend to use C++, with Linux as the environment. The course I'm taking aims to build the basics of data mining and covers topics like classification, regression modeling, clustering and association learning. Please point me to some good...

Recommended data mining books for a developer (not mathematician)?

I'm a developer, not too good at math, but I'm willing to learn fun things to do with data mining techniques. I've looked at pragmatic books on the subject, which give me some ideas (and maybe quickly introduce me to some tools). I insist on the fact that I'm not a mathematician! What do you advise? ...

Netflix prize dataset?

Hi, I am looking to work on a machine learning project for my course and I would like to use the Netflix Prize dataset. But it looks like the contest is closed and the dataset is not available for download on the Netflix website. Does anyone who worked on it have the dataset? If so, can you share it? ...

What techniques/tools are there for discovering common phrases in chunks of text?

Let's say I have 100000 email bodies and 2000 of them contain an arbitrary common string like "the quick brown fox jumps over the lazy dog" or "lorem ipsum dolor sit amet". What techniques could/should I use to "mine" these phrases? I'm not interested in mining single words or short phrases. Also I need to filter out phrases that I alread...
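One possible way to approach the question above is to count word n-grams by the number of documents they occur in. The following is only a minimal sketch in Python: the function names `word_ngrams` and `common_phrases`, the n-gram length, and the `min_docs` threshold are all assumptions made for illustration, not something taken from the question.

```python
import re
from collections import Counter


def word_ngrams(text, n):
    """Return the set of distinct word n-grams in one text (lowercased)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def common_phrases(bodies, n=5, min_docs=2, known=()):
    """Map each n-gram found in at least `min_docs` bodies to its document count.

    N-grams that are part of an already known phrase are filtered out.
    """
    known_lc = [k.lower() for k in known]
    doc_freq = Counter()
    for body in bodies:
        # A set per body, so a phrase is counted once per document.
        doc_freq.update(word_ngrams(body, n))
    return {
        phrase: count
        for phrase, count in doc_freq.items()
        if count >= min_docs and not any(phrase in k for k in known_lc)
    }
```

Overlapping n-grams that share the same document count can then be merged into longer phrases. For 100000 bodies the counter may grow large, so a higher `n` or pruning rare n-grams in batches might be needed.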

Postfix API/Lookup tables for sent messages information

Currently it seems common practice to parse Postfix log files in order to determine if a message has been sent. Is there an API for Postfix or a look up table in it that yields this information in a manner quicker than parsing (rather lengthy) log files? ...

Converting an HTML Document to Selector based index file

I'm looking for some sort of tool that can take an HTML document and pump out a selector-based representation of the file. For example: <div> Some text <ul class="foo"> <li>First</li> <li>Second</li> </ul> </div> And output a flat text file in the spirit of: div div #text Some text div ul.foo li First div ul.foo li Se...
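As a rough illustration of the idea in the question above, Python's standard `html.parser` module can track the stack of open elements and emit a selector path for each text node. This is only a sketch: the class name `SelectorIndexer`, the helper `index_html`, and the output shape (a list of `(path, text)` pairs) are assumptions, since the exact output format in the question is cut off.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class SelectorIndexer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []    # (tag, selector) for every currently open element
        self.entries = []  # (selector path, text) for every text node

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        selector = tag
        if attrs.get("id"):
            selector += "#" + attrs["id"]
        for cls in (attrs.get("class") or "").split():
            selector += "." + cls
        self.stack.append((tag, selector))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            path = " ".join(selector for _, selector in self.stack)
            self.entries.append((path, text))


def index_html(html):
    parser = SelectorIndexer()
    parser.feed(html)
    parser.close()
    return parser.entries
```

Writing each pair as one line (for example `path`, a separator, then the text) gives a flat file that can be searched with ordinary text tools.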

open source data mining/text analysis tools in python

I have a database full of reviews of various products. My task is to perform various calculations and "create" another "database/xml-export" with aggregated data. I am thinking of writing command line programs in Python to do that. But I know someone has done this before and I know that there is some open source Python solution or simila...

How can I extract information from OpenSocial-based networks?

How can I extract information from OpenSocial-based networks like Orkut? ...

News Data API or Feeds

I would like to know if there are any news feeds/APIs that can be used for coding/data mining. Skygrid, for example, gives live news feeds and says whether the news is good or bad, but it's all in Flash and they don't seem to provide any RSS other than their Twitter. ...

Best learning algorithm to make a decision tree in Java?

I have datasets with information like age, city, age of children, ... and a result (confirm, accept). To help model the "workflow", I want to automatically create a decision tree based on previous datasets. I have taken a look at http://en.wikipedia.org/wiki/Decision_tree_learning and I know that the problem is clearly not obvio...

How can I install "Data Mining Add-in for Office 2007" as part of my setup?

I'm writing a setup program that needs to install the Data Mining Add-in for Office 2007. 1) How do I detect if it's already installed? 2) If it is not installed, I download and run the MSI (SQLServer2008_DMAddin.msi). But how can I run the Server Configuration (Microsoft.SqlServer.DataMining.Office.ServerConfiguration.exe) tool myself...

How do I data mine text?

Here's the problem. I have a bunch of large text files with paragraphs and paragraphs of written matter. Each para contains references to a few people (names), and documents a few topics (places, objects). How do I data mine this pile to assemble some categorised library? ... in general, 2 things. I don't know what I'm looking for, so...

How do I extract keywords used in text?

How do I data mine a pile of text to get keywords by usage? ("Jacob Smith" or "fence") And is there software to do this already, even semi-automatically? If it can filter out simple words like "the", "and", "or", then I could get to the topics quicker. ...
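The simplest form of what the question above describes is a word-frequency count with a stop-word filter. Here is a minimal sketch in Python, assuming a tiny hand-written stop-word list (`STOP_WORDS`) and a hypothetical helper `top_keywords`; a real stop-word list would be much longer, and multi-word names like "Jacob Smith" would need extra handling.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "and", "or", "a", "an", "of", "to", "in", "is", "it",
              "on", "at", "by", "for", "with", "was"}


def top_keywords(text, limit=10, stop_words=STOP_WORDS):
    """Return the `limit` most frequent words that are not stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stop_words and len(w) > 1)
    return counts.most_common(limit)
```

Calling `top_keywords(open(path).read())` on each file (with `path` being whichever text file is being examined) gives a quick first look at the recurring topics.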

Lexical Analysis libraries

I would like to make a piece of software able to recognize whether a sentence is positive or negative. Are there any lexical analysis libraries around? I don't really know where I should start. ...

how to get similar texts from a lot of pages?

I want to get the x texts most similar to one given text from a large collection of texts. Maybe converting each page to plain text first is better. The text should not be compared to every other text, because that is too slow. ...
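One way to avoid comparing against every text, sketched here under assumptions, is an inverted index: only texts that share at least one word with the query are scored. The function names (`tokens`, `build_index`, `most_similar`) and the use of Jaccard similarity on word sets are choices made for this illustration, not something stated in the question.

```python
import re
from collections import defaultdict


def tokens(text):
    """Lowercased set of alphanumeric words in a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(texts):
    """Build word -> set of document ids, plus the token set of each document."""
    index = defaultdict(set)
    token_sets = []
    for doc_id, text in enumerate(texts):
        toks = tokens(text)
        token_sets.append(toks)
        for tok in toks:
            index[tok].add(doc_id)
    return index, token_sets


def most_similar(query, index, token_sets, x=3):
    """Return up to x (doc_id, jaccard_score) pairs, best match first."""
    q = tokens(query)
    # Only documents sharing at least one word with the query are candidates.
    candidates = set()
    for tok in q:
        candidates |= index.get(tok, set())
    scored = [
        (len(q & token_sets[d]) / len(q | token_sets[d]), d)
        for d in candidates
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(d, score) for score, d in scored[:x]]
```

Very common words make the candidate set large, so dropping stop words before indexing helps. For very large collections, approximate techniques such as MinHash with locality-sensitive hashing are another option.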

Datamining models in FORTRAN or C (or managed code)?

We are planning to develop a data mining package for Windows. The program core / calculation engine will be developed in F#, with GUI stuff / DB bindings etc. done in C# and F#. However, we have not yet decided on the model implementations. Since we need high performance, we probably can't use managed code here (any objections here?). The ...

Besides NLTK, what is the best information retrieval library for Python?

For use in analyzing documents on the Internet! ...

algorithms to evaluate user responses

I'm working on a web application which will be used for classifying photos of automobiles. The users will be presented with photos of various vehicles, and will be asked to answer a series of questions about what they see. The results will be recorded to a database, averaged, and displayed. I'm looking for algorithms to help me identify...

Set of books about Natural Language processing, Semantic Analysis and Data Mining.

So I'm starting to write my master's thesis next semester (it should be done before June). I already have the topic, and I need to write the state of the art by February. The main areas are intelligent systems, natural language processing, semantic analysis and data mining. I am researching the best books about Natural Language pro...