data-mining

Software to find dependencies in a database that does not define them as constraints

I have a database in SQL Server 2005 that originally comes from an old mainframe. All relations were set in the surrounding software and there are none in the database. I need to find the relations, not by field name but by the actual content of the records (as suggestions; I realize I'll have to check them). It would be nice with some ext...

Text mining on large database (data mining)

Hello, I have a large database of resumes (CVs) and a table, skills, grouping all users' skills. Inside that table there's a field skill_text that describes the skill in full text. I'm looking for an algorithm/software/method to extract significant terms/phrases from that table in order to build a new table with standardized skill...

Architecture for database analytics

Hi, we have an architecture where we provide each customer with Business Intelligence-like services for their website (internet merchant). Now, I need to analyze that data internally (for algorithmic improvement, performance tracking, etc.) and it is potentially quite heavy: we have up to millions of rows / customer / day, and I may w...

Data mining textbook

If you followed a DM course, which textbook was used? I know about Data Mining: Practical Machine Learning Tools and Techniques (Second Edition) and this poll. What did you actually use? ...

How to find maximum frequent item sets from a large transactional data file

Hi, I have an input file that contains a large number of transactions like:

Transaction ID   Items
T1               Bread, milk, coffee, juice
T2               Juice, milk, coffee
T3               Bread, juice
T4               Coffee, milk
T5               Bread, Milk
T6               Coffee, Bread
T7               Coffee, Bread, Juice
T8               Bread, Milk, Juice
T9               Milk, Bread, Coffee
T10              Bread
T11              Milk
T12              Milk, Coffee, Bread, Juice

I wan...
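
One possible way to approach the question above, as a rough sketch only: it assumes "maximum frequent item sets" means maximal frequent itemsets (frequent itemsets with no frequent superset), that item names are case-insensitive, and that a minimum support count of 3 is acceptable. The function names and the threshold are illustrative and not from the original post. It uses a small level-wise (Apriori-style) search that holds everything in memory, so a truly large file would need a streaming or FP-growth style implementation instead.

```python
from itertools import combinations

# The twelve sample transactions from the question, lower-cased.
TRANSACTIONS = [
    {"bread", "milk", "coffee", "juice"},  # T1
    {"juice", "milk", "coffee"},           # T2
    {"bread", "juice"},                    # T3
    {"coffee", "milk"},                    # T4
    {"bread", "milk"},                     # T5
    {"coffee", "bread"},                   # T6
    {"coffee", "bread", "juice"},          # T7
    {"bread", "milk", "juice"},            # T8
    {"milk", "bread", "coffee"},           # T9
    {"bread"},                             # T10
    {"milk"},                              # T11
    {"milk", "coffee", "bread", "juice"},  # T12
]


def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for every itemset seen in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    frequent = {}
    current = [frozenset([item]) for item in items]  # level 1 candidates
    while current:
        # Count how many transactions contain each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        keys = list(level)
        k = len(keys[0]) + 1 if keys else 0
        # Join step: merge pairs of frequent itemsets into (k)-item candidates.
        candidates = {a | b for a, b in combinations(keys, 2) if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        current = [
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


def maximal_itemsets(frequent):
    """Keep only the frequent itemsets that have no frequent proper superset."""
    return [s for s in frequent if not any(s < other for other in frequent)]


if __name__ == "__main__":
    freq = frequent_itemsets(TRANSACTIONS, min_support=3)
    for itemset in maximal_itemsets(freq):
        print(sorted(itemset), freq[itemset])
```

Raising or lowering `min_support` changes which itemsets count as frequent, and therefore which ones end up maximal.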

Data Mining - Predictive Analysis

We are looking at acquiring Data Mining software, primarily to run predictive analysis processes. How does the SQL Server Data Mining solution compare to other solutions like SPSS from IBM? Since SQL Server DM is included in the SQL Server Enterprise license, what would be the justification to spend an extra couple of 100K to buy separate software ...

Finding Common Phrases in SQL Server TEXT Column

Short Desc: I'm curious to see if I can use SQL Analysis Services or some other SQL Server service to mine some data for me that will show commonalities between SQL TEXT fields in a dataset. Long Desc: I am looking at a subset of data that consists of about 10,000 rows of TEXT blobs which are used as a notes column in an issue tracking ...

Algorithm to detect repeating/similar strings in a corpus of data -- say email subjects, in Python

I'm downloading a long list of my email subject lines, with the intent of finding email lists that I was a member of years ago so that I can purge them from my Gmail account (which is getting pretty slow). I'm specifically thinking of newsletters that often come from the same address, and repeat the product/service/group's name in...

Naive Bayesian for Topic detection using "Bag of Words" approach

I am trying to implement a naive Bayesian approach to find the topic of a given document or stream of words. Is there a Naive Bayesian approach that I might be able to look up for this? Also, I am trying to improve my dictionary as I go along. Initially, I have a bunch of words that map to topics (hard-coded). Depending on the occ...

Text mining with PHP

Hi, I'm doing a project for a college class I'm taking. I'm using PHP to build a simple web app that classifies tweets as "positive" (or happy) and "negative" (or sad) based on a set of dictionaries. The algorithm I'm thinking of right now is a Naive Bayes classifier or a decision tree. However, I can't find any PHP library that helps me do...

Can I use rattle on 64-bit R?

Trying to install rattle on a Windows Server 2008 R2 64-bit machine, using 64-bit R ver 2.11, I got the following message: install.packages("rattle", dependencies=TRUE) Warning: dependencies ‘RGtk2’, ‘rggobi’, ‘RSvgDevice’, ‘Biobase’, ‘multicore’, ‘marray’, ‘affy’, ‘snowFT’, ‘Rmpi’, ‘rpvm’ are not available. When I tried to install one o...

How can I extract similarities/patterns from a collection of binary strings?

I have a collection of binary strings of given size encoding effective solutions to a given problem. By looking at them, I can spot obvious similarities and intuitively see patterns of symmetry and periodicity. Are there mathematical/algorithmic tools I can "feed" this set of strings to and get results that might give me an idea of wh...

Performing a SVD on tweets. Memory problem

EDIT: The size of the wordlist is 10-20 times bigger than I wrote down. I simply forgot a zero. EDIT2: I will have a look into SVDLIBC and also see how to reduce a matrix to its dense version, so that might help too. I have generated a huge CSV file as output from my POS tagging and stemming. It looks like this: word1, w...

Keyword sorting algorithm

I have over 1000 surveys, many of which contain open-ended replies. I would like to be able to 'parse' all the words and get a ranking of the most used words (disregarding common words) to spot a trend. How can I do this? Is there a program I can use? EDIT: If a 3rd party solution is not available, it would be great if we can keep...
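
A minimal sketch of the ranking described above, assuming the replies are already available as plain strings. The stop-word list here is a tiny illustrative one and `rank_keywords` is a hypothetical helper name, neither comes from the original post. It only counts single words, with no stemming or phrase detection.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); a real one would be much longer.
STOPWORDS = frozenset({
    "a", "an", "and", "are", "but", "for", "in", "is", "it",
    "of", "on", "or", "the", "to", "was", "were", "with",
})


def rank_keywords(responses, stopwords=STOPWORDS, top_n=10):
    """Return the top_n (word, count) pairs across all responses, most frequent first."""
    counts = Counter()
    for text in responses:
        # Lower-case the reply and pull out runs of letters/apostrophes as words.
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if len(w) > 1 and w not in stopwords)
    return counts.most_common(top_n)


if __name__ == "__main__":
    sample = [
        "The support was slow",
        "Slow support and slow shipping",
    ]
    for word, count in rank_keywords(sample):
        print(word, count)
```

Passing a larger `stopwords` set, or a different `top_n`, tunes how much common vocabulary is filtered out and how long the ranking is.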

Indexing and Searching Over Word Level Annotation Layers in Lucene

I have a data set with multiple layers of annotation over the underlying text, such as part-of-speech tags, chunks from a shallow parser, named entities, and others from various natural language processing (NLP) tools. For a sentence like The man went to the store, the annotations might look like: Word POS Chunk NER ==== === ===== ...

'Similarity' in Data Mining

In the field of Data Mining, is there a specific sub-discipline called 'Similarity'? If yes, what does it deal with? Any examples, links, or references would be helpful. Also, being new to the field, I would like the community's opinion on how closely related Data Mining and Artificial Intelligence are. Are they synonyms, is one the subset of...

Implementing Naïve Bayes algorithm in Java - Need some guidance

Hello Stack Overflow people. As a school assignment I'm required to implement the Naïve Bayes algorithm, which I intend to do in Java. In trying to understand how it's done, I've read the book "Data Mining - Practical Machine Learning Tools and Techniques", which has a section on this topic, but I am still unsure about some primary points that are...

Data Mining / Analyzing responses to Multiple Choice Questions in a survey

Hi, I have a set of training data consisting of 20 multiple choice questions (A/B/C/D) answered by a hundred respondents. The answers are purely categorical and cannot be scaled to numerical values. 50 of these respondents were selected for a free product trial. The selection process is not known. What interesting knowledge can be mined f...

Python, web log data mining for frequent patterns

Hello! I need to develop a tool for web log data mining. Given many sequences of URLs, each requested in a particular user session (retrieved from web-application logs), I need to figure out the patterns of usage and groups (clusters) of users of the website. I am new to Data Mining, and am now searching Google a lot. Found some useful info,...

Confusion Matrix of Bayesian Network

Hi, I'm trying to understand Bayesian networks. I have a data file which has 10 attributes, and I want to obtain the confusion table for this data table. I thought I need to calculate TP, FP, FN, and TN for all fields. Is that right? If so, what do I need to do for a Bayesian network? I really need some guidance, I'm lost. ...