data-mining

Hadoop beginners

Hi, I'm trying to practice some data mining algorithms on Hadoop. Can I do it with HDFS alone, or do I need to use sub-projects like Hive/HBase/Pig? Thanks, ram. ...

Segmenting a set of data with discrete and continuous data values into one of two groups without using analysis services?

Say I have a table with the following schema (note: this example is hypothetical, though the real use case is similar).

Type     | Name     | Notes
=====================================================================================
Gender   | Gender   | Either Male or Female (not null)
GeoCoord | Location | Latitude an...

Data Warehousing in Real World ebook

Can anyone please tell me where to download the ebook "Data Warehousing in Real World"? Or, if you have a copy, please mail it to me at [email protected] ...

Pricelist parser

I have to create a pricelist parser that imports data from Excel or CSV and puts it into a database. I have no problem getting the data from the source. I need to automatically find the columns that contain the price, product title, and description. What can you suggest? Are there common methods or libraries for this? Data sample 1: Intel Core 2 Duo E6...

Could anyone help me with ground-truth data?

Hello everyone, I recently came across a term in an email exchange with my supervisor. I am beginning a data-mining project on Facebook user profiles, and he said I should begin collecting ground-truth data. I am very new to this term; I searched online for it but found very few results about it in the data-mining sense. ...

Efficient methods for finding most common phrases in a body of text AKA trending topics

Hi, I previously asked a similar question on this topic and ended up deriving several solutions that worked: one based on Bloom filters + n-grams, the other based on hash tables + n-grams. Both solutions perform fine with small data sets (<1000 texts, usually tweets), but the computation time grew exponentially, meaning doing 10,000 could t...
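
For reference, the hash table + n-gram approach mentioned above can be sketched roughly as follows. This is a minimal illustration, not the asker's actual code: the function name `top_phrases`, its parameters, and the simple regex tokenizer are all assumptions made for the example. It does one hash update per n-gram, so the work is linear in the total number of words.

```python
from collections import Counter
import re


def top_phrases(texts, n=2, k=3):
    """Return the k most common word n-grams across all texts (hypothetical helper)."""
    counts = Counter()
    for text in texts:
        # Naive tokenizer (an assumption): lowercase runs of letters, digits, apostrophes.
        words = re.findall(r"[a-z0-9']+", text.lower())
        # Slide a window of n words over the text and count each n-gram in the hash table.
        counts.update(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return counts.most_common(k)


if __name__ == "__main__":
    sample = ["big data is big", "big data rocks"]
    print(top_phrases(sample, n=2, k=1))
```

If a version like this still slows down badly at 10,000 texts, the counting step itself is unlikely to be the cause, so other parts of the pipeline would be worth profiling first.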

Good ways to visualize a word-like document

What are some good ways to visualize a word-like document? Edit: It doesn't have to be a .doc, it could be a text file or blog post... ...

How would you group up articles by context? - Natural Language

Hi folks, I have lists of articles, each made up of a title, subtitle, and body. Now I need to parse all these articles and group them under different context categories or subcategories based on their possible keywords. E.g., if the article is likely to be related to sports cars, then the article would be associated with the car and/or veh...

datamining metadata

I build a bunch of data mining models on training data that is located in different folders. For example, for data in folder1 I build an SVM-based model, and for data in folder2 I build a naive Bayes model. I have almost 100 such folders, and each folder has different data (read: different attributes). Is there a framework which enables...

IR vs Data mining vs ML

People often throw around the terms IR, ML, and data mining, but I have noticed a lot of overlap between them. From people with experience in these fields, what exactly draws the line between these? ...

Anything better than ruby alchemy for extracting keywords?

I've written an algorithm in Ruby, based on the arc90 readability code, to extract an article from a web page. Now that I have the article, I want to extract keywords and specific information from it (names, author, etc.). I heard Alchemy is a great Ruby gem for doing this, though it consumes a lot of resources. Are there any bet...

How to open large (HUGE) text files

I am writing a program that produces random records in a format that can be specified in code and optionally writes them to disk as a text file so they can be used for data mining benchmarks. My problem is that I can verify that my program works with small text files, but I need to know whether this is true for large amounts of data (this program will...

Possible to make a consistently successful stock market playing bot?

Who has created a bot to play the stock market and what kind of return did you see? I'm currently still in very alpha stages but I can play the stock market with play money and get some very nice results using historical real time data. Currently I have around 8 parameters that go into configuring the buy and sell function. When I varied...

Existing Database for Nutrition Facts?

I'm developing an ecommerce store using MVC, and it will feature various health food products. We would like to display the Nutrition Facts label for each product, and I am wondering if there is an existing way to do this dynamically without images, and if there is a database out there with all the facts we can pull from, to minimize data en...

How to find out if a sentence is a question (interrogative)?

Is there an open source Java library/algorithm for finding out whether a particular piece of text is a question or not? I am working on a question answering system that needs to analyze whether the text input by the user is a question. I think the problem can probably be solved by using open source NLP libraries, but it's obviously more complicated than s...

need some suggestions on my SVM feature refinement

Hello all, I've trained an SVM-based system that, given a question, decides whether a webpage is a good one for answering it. The features I selected are "Term frequency in webpage", "Whether term matches with the webpage title", "number of images in the webpage", "length of the webpage", "is it a wikipedia page?", "the position of this ...

suggestions for a people similarity algorithm

Hello all, I want to get some suggestions for my "find similar people" algorithm :). I have one database where I store the following entities: person, article, keywords. So for each person I have a collection of keywords (with the number of mentions by the person) compiled from the keywords of that person's articles. So I need to get...

Percentile MSExcel query

Is it possible to find the percentile of a) a column called "Country" with text-based values, b) a column called "Salary" whose values are either of the following: <=50K, >50K ...

MAD formula for Excel

What are the standard Excel formulas for calculating 1) Median Absolute Difference (MAD) ...
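
To make the quantity concrete, here is a small Python sketch. It assumes that "MAD" here means the median absolute deviation from the median, i.e. median(|x - median(x)|); if the asker meant something else, the sketch does not apply. The function name `mad` is chosen only for this example.

```python
from statistics import median


def mad(values):
    """Median absolute deviation: median of |x - median(values)| (assumed definition)."""
    center = median(values)
    # Distance of every value from the median, then the median of those distances.
    return median(abs(x - center) for x in values)


if __name__ == "__main__":
    print(mad([1, 2, 3, 4, 100]))
```

The same two-step structure (median of the data, then median of the absolute distances from it) is what a spreadsheet formula would need to reproduce.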

Need help picking a datamining/neural-network API

I'm planning on building a feature for an e-commerce platform I developed in Java to display related products in much the same way Amazon does. There are a few different metrics for relating products that I want to explore:

Purchase history (purchased at the same time)
Related by family/type (similar product classifications)
Intention...