data-mining

Binning in Excel

Which formulae in MS Excel can we use for: equi-depth binning, equi-width binning ...
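
For readers unsure what the two binning schemes compute, here is a minimal sketch in Python rather than Excel. The function names and the sample values are made up for illustration. Equi-width splits the value range into k intervals of the same width; equi-depth puts roughly the same number of values in each bin.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k bins of equal width over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        # All values are identical, so everything goes in the first bin.
        return [0] * len(values)
    # The maximum value would compute to bin k, so clamp it into bin k - 1.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_depth_bins(values, k):
    """Assign each value to one of k bins holding roughly equal counts."""
    # Rank the positions by value, then cut the ranking into k equal slices.
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


# Illustrative data only (not taken from the question).
ages = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
width_bins = equal_width_bins(ages, 3)
depth_bins = equal_depth_bins(ages, 3)
```

Note that this equi-depth version can place tied values in different bins; whether that is acceptable depends on the use case.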

Binarization in Excel

How would you perform binarization of an attribute with five categorical values in Excel? ...
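
One common reading of "binarization" for a categorical attribute is one-hot encoding: one 0/1 column per distinct value. A minimal sketch in Python (not Excel), assuming that reading; the function name and the five colour values are hypothetical.

```python
def one_hot(values):
    """Return (categories, rows) where each row has a single 1 marking its category."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows


# Hypothetical attribute with five categorical values.
colours = ["red", "green", "blue", "black", "white", "red"]
categories, rows = one_hot(colours)
```

Each of the five categories becomes its own binary column, so every row sums to exactly 1.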

Find tags in a website's HTML

I'm using Perl. I have a tag, for example "XYZ_PKM_HTML". I would like to be able to provide a base URL, for example www.example.com, and then get the HTML page (not necessarily the main page, that's easy) where this tag appears. Is it possible? Any ideas? (Or are there ready-made modules? I looked on CPAN; there was some interesting stuff, bu...

Which web language can be used for data mining or web crawling?

If I want to build a complex website like Google News, which gathers data from other websites (data mining, crawling), in which language should I build the website? Currently I know only PHP. Can I do that in PHP? ...

How can I start with data mining for a small grocery shop

My company got a project to build a simple website for a grocery shop, with a catalogue only and no shopping cart. A few days ago I read something about data mining from here. I found that it is possible to do some predictive modelling. For example, one Midwest grocery chain used the data mining capacity of Oracle software to analyze local bu...

Is my path of learning data mining correct

Someone has just told my boss what data mining can do for a company, like recommendation and predictive modelling. Basically we are a website company. I am going on leave for 6 months, so my boss said that I can learn some DM techniques so that when I come back we can visit small shops or small companies to provide them with predictive data ...

How does data mining actually work?

Suppose I want to do some data mining on the database of a supermarket. What does that actually mean? 1) What will the output/results be like? 2) Will the output be different every day or change over time? 3) Before applying data mining, do I need to know what I want or will data mining give everything I want automatically? ...

Beginner question on investigating samples in Weka

Hello there, I've just used Weka to train my SVM classifier under the "Classify" tab. Now I want to further investigate which data samples are misclassified. I need to study their pattern, but I don't know where to look for this in Weka. Could anyone give me some help please? Thanks in advance. ...

How good is my error in data mining

Hi, I'm trying to calculate how good my measurements are in machine learning! Let's say that I have five choices, and that the errors are 4, 2, 0.002, 3, 6. Naturally, I will pick the third one for the hit, but I would like to say the following: I'm X% certain that the hit is the third pick; I'm Y% certain that the hit is the first (last) pick. Of course, X>>Y but I ...

Reducing dimension of dataset

Hi, I'm trying to reduce dataset dimension. PCA is a good technique, but it gives me a new dataset. My goal is to determine, from a number of events (e.g. 60) and a number of trials (e.g. 6), which events are more relevant. For example: the 1st, 3rd, 21st, 45th ... (N total) events are good enough to approximate the behavior of the dataset. That will al...

What is bootstrapped data in data mining?

Hello there, recently I came across this term, but I really have no idea what it refers to. I've searched online, but with little gain. Thanks. ...
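
As usually understood, a bootstrapped data set is one drawn from the original data by sampling with replacement, typically with the same size as the original. A minimal sketch under that reading; the function names, the seed, and the sample numbers are illustrative only.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]


def bootstrap_means(data, n_resamples, seed=0):
    """Mean of each bootstrap resample; their spread estimates the mean's variability."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        sample = bootstrap_sample(data, rng)
        means.append(sum(sample) / len(sample))
    return means


# Illustrative data only.
observations = [2.0, 3.5, 1.0, 4.5, 3.0]
resample = bootstrap_sample(observations, random.Random(42))
means = bootstrap_means(observations, 200)
```

Because sampling is with replacement, a resample usually repeats some observations and omits others.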

web services return type as complex

Hi all, I have written a web service which returns "Instances" from a data mining API. Now the problem is obvious: web services by default cannot handle "Instances" as a return type. What should my approach be? Or I may have to use user-defined data types; please point me to any documentation on how I can implement this. //////////////...

A question on classifying question categories on Yahoo! Answers

Hi all, I have a seemingly easy but challenging task. I need to develop a data set of questions, and I classify the questions into two categories: factoid questions, e.g. "Who is the current president of France?", and free questions, e.g. "Can you rate the cameras below for me, please?" Now I need to know the percentage of both categories on Yaho...

A question on sampling on Yahoo! Answers

Hello there. I wonder what is the best way to sample, say, 1000 questions completely at random from Yahoo! Answers. I want to achieve complete randomness, totally ignoring the categories, date of posting, etc. Doing this manually may result in bias, so could anyone give some suggestions here, like using Yahoo! Answer API or...

Oracle Data Miner (Need Tutorial)

What does ODM (Oracle Data Miner) do? Can you give me useful materials or brief information about this option? Thank you. ...

Newbie: where to start given a problem to predict future success or not

We have a production web-based product that allows users to make predictions about the future value (or demand) of goods. The historical data contains about 100k examples, and each example has about 5 parameters. Consider a class of data called a prediction: prediction { id: int predictor: int predictionDate: date p...

Application for generating pairs of frequent itemsets

Hi guys, I am writing an application that will compute all frequent itemsets of size 2 from a set of transactions. That is, the application will have as input a data file (a space-delimited text file, with the items encoded as integers) and a percentage, given as an integer (e.g. input 2 represents 2%). The application will output in a distinct...
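
A rough sketch of the counting step described above, assuming one transaction per line and a minimum support given as an integer percentage. The function names and the sample input are hypothetical, and this brute-force version counts every pair directly, without the Apriori-style pruning of infrequent single items first.

```python
from collections import Counter
from itertools import combinations


def parse_transactions(text):
    """Parse space-delimited integer items, one transaction per non-empty line."""
    return [[int(tok) for tok in line.split()]
            for line in text.splitlines() if line.strip()]


def frequent_pairs(transactions, min_support_percent):
    """Return {pair: count} for item pairs found in at least the given % of transactions."""
    n = len(transactions)
    counts = Counter()
    for items in transactions:
        # Deduplicate and sort so each pair is counted once per transaction,
        # always in (smaller, larger) order.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    # Integer comparison of count / n >= percent / 100 avoids float rounding.
    return {pair: c for pair, c in counts.items()
            if c * 100 >= min_support_percent * n}


# Illustrative input only.
sample = "1 2 3\n1 2\n2 3\n1 2 4\n"
pairs = frequent_pairs(parse_transactions(sample), 50)
```

For large item catalogues, filtering out items that are not frequent on their own before generating pairs keeps the counter much smaller.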

Data Mining-SCAD1/SCAD2-Subspace Clustering

Hi everyone, I am looking for the SCAD (Simultaneous Clustering and Attribute Discrimination) subspace clustering algorithm. If anyone has implemented it, please let me know where I can find/download this algorithm. Thank you. ...

N-gram related question - C# algorithm

Hi, I am intending to use the n-gram part/algorithm of this code: http://www.codeproject.com/KB/cs/tfidf.aspx The algorithm produces these tri-gram results: t th the he e q qu qui uic ick ck k r re red ed d for: the quick red However, this source: http://en.wikipedia.org/wiki/Trigram reckons it should be: the qui k_r he_ u...
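
The two outputs quoted above look like two different conventions. Below is a sketch of both in Python rather than C#. The first slides a 3-character window over the whole string, with spaces shown as underscores. The second pads each word with two spaces and trims each window, which happens to reproduce the first list in the question; whether the linked code works this way is an assumption. Function names are made up.

```python
def sliding_trigrams(text):
    """Character trigrams over the whole string, with spaces shown as '_'."""
    s = text.replace(" ", "_")
    return [s[i:i + 3] for i in range(len(s) - 2)]


def padded_word_trigrams(text):
    """Per-word trigrams: pad each word with two spaces, then trim each window."""
    grams = []
    for word in text.split():
        padded = "  " + word + "  "
        grams.extend(padded[i:i + 3].strip() for i in range(len(padded) - 2))
    return grams


whole = sliding_trigrams("the quick red")
per_word = padded_word_trigrams("the quick red")
```

The padded variant keeps word-boundary information in short grams such as "t" and "th", while the sliding variant produces grams that span word boundaries, such as "e_q".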

Flagging possible identical users in an account management system

Hi, I am working on a possible architecture for an abuse detection mechanism on an account management system. What I want is to detect possible duplicate users based on certain correlating fields within a table. To simplify the problem, let's say I have a USER table with the following fields: Name Nationality Current Address Logi...