text-mining

N-gram function in vb.net -> create grams for words instead of characters

Hi! I recently found out about n-grams and how they can be used to compare the frequency of phrases in a text body. Now I'm trying to make a VB.NET app that simply takes a text body and returns a list of the most frequently used phrases (where n >= 2). I found a C# example of how to generate n-grams from a text body, so I started ...
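For illustration, word-level n-grams and their frequencies can be sketched in a few lines. This is a minimal, language-neutral sketch written in Python rather than VB.NET; the function names `word_ngrams` and `most_common_phrases` are hypothetical, and splitting on whitespace after lowercasing is an assumed tokenization, not something stated in the question.

```python
from collections import Counter

def word_ngrams(text, n):
    """Return the word-level n-grams of text as space-joined strings."""
    # Assumption: tokens are whitespace-separated, case-insensitive words.
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def most_common_phrases(text, n=2, top=5):
    """Return up to `top` (phrase, count) pairs for n-word phrases."""
    return Counter(word_ngrams(text, n)).most_common(top)

if __name__ == "__main__":
    sample = "the cat sat on the mat and the cat slept"
    print(most_common_phrases(sample, n=2, top=3))
```

The same sliding-window idea carries over to VB.NET: split the text into a word array, then join each run of n consecutive words.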

Retrieve Information From Different Unstructured Text Files - Text Mining?

Hello, I need some help in solving this problem. We have a large number of documents from a specific domain. These documents come from different sources, so their structure can also be very different. On the other hand, I have a table with some specified fields where some figures have to be filled in from the extract of the do...

Text mining on large database (data mining)

Hello, I have a large database of resumes (CVs) and a table, skills, grouping all users' skills. Inside that table there's a field, skill_text, that describes the skill in full text. I'm looking for an algorithm/software/method to extract significant terms/phrases from that table in order to build a new table with standardized skill...

Information Extraction Toolkits

I'm looking for information extraction libraries that can handle semi-structured information that may have either hidden or incomplete data. I want to train some classifiers to pull out content based on the structure. I'm working on building a tool where I can select text in the browser, and it will generate (via some web service call)...

How to identify ideas and concepts in a given text

I'm working on a project at the moment where it would be really useful to be able to detect when a certain topic/idea is mentioned in a body of text. For instance, if the text contained: Maybe if you tell me a little more about who Mr Jones is, that would help. It would also be useful if I could have a description of his appearance, ...

Indexing and Searching Over Word Level Annotation Layers in Lucene

I have a data set with multiple layers of annotation over the underlying text, such as part-of-speech tags, chunks from a shallow parser, named entities, and others from various natural language processing (NLP) tools. For a sentence like The man went to the store, the annotations might look like: Word POS Chunk NER ==== === ===== ...

n-grams from text in PostgreSQL

I am looking to create n-grams from a text column in PostgreSQL. I currently split the data (sentences) in a text column on whitespace into an array: select regexp_split_to_array(sentenceData,E'\s+') from tableName Once I have this array, how do I go about: Creating a loop to find n-grams, and write each to a row in another t...

Keeping Track of Word Proximity.

I am working on a small project that involves dictionary-based text searching within a collection of documents. My dictionary has positive signal words (a.k.a. good words), but in the document collection just finding a word does not guarantee a positive result, as there may be negative words, for example (not, not significant), that may be...

Decision Trees For Document Classification

Hi, I wanted to know whether it is possible to use decision trees for document classification and, if so, how the data should be represented. I am familiar with the R package party for decision trees. ...

Large scale Machine Learning

I need to run various machine learning techniques on a big dataset (10-100 billion records). The problems are mostly around text mining/information extraction and include various kernel techniques but are not restricted to them (we use some Bayesian methods, bootstrapping, gradient boosting, regression trees -- many different problems an...

How does Shingleprinting work in practice?

I'm trying to use shingleprinting to measure document similarity. The process involves the following steps: (1) create a 5-shingling of the two documents D1, D2; (2) hash each shingle with a 64-bit hash; (3) pick a random permutation of the numbers from 0 to 2^64-1 and apply it to the shingle hashes; (4) for each document, find the smallest of the resulting valu...
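The steps listed in the question can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a reference implementation: shingles are taken over whitespace-separated words, BLAKE2b truncated to 8 bytes stands in for "a 64-bit hash", and each "random permutation" is approximated by the map x -> (a*x + b) mod 2^64 with an odd multiplier a, which is a bijection on 0..2^64-1 but not a uniformly random permutation. All function names are hypothetical.

```python
import hashlib
import random

MASK = (1 << 64) - 1  # keeps values in the range 0 .. 2^64 - 1

def shingles(text, k=5):
    """Set of k-word shingles (assumes whitespace tokenization)."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def hash64(shingle):
    """64-bit hash of a shingle (BLAKE2b truncated to 8 bytes)."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def make_permutations(count, seed=0):
    """Pairs (a, b); an odd a makes x -> (a*x + b) mod 2^64 a bijection."""
    rng = random.Random(seed)
    return [(rng.getrandbits(64) | 1, rng.getrandbits(64)) for _ in range(count)]

def sketch(text, perms, k=5):
    """Smallest permuted hash per permutation.

    Raises ValueError if the text has fewer than k words (no shingles).
    """
    hashes = [hash64(s) for s in shingles(text, k)]
    return [min((a * h + b) & MASK for h in hashes) for a, b in perms]

def estimated_resemblance(sketch1, sketch2):
    """Fraction of permutations whose minima agree."""
    matches = sum(x == y for x, y in zip(sketch1, sketch2))
    return matches / len(sketch1)

if __name__ == "__main__":
    perms = make_permutations(100)
    d1 = "the quick brown fox jumps over the lazy dog near the river bank"
    d2 = "the quick brown fox jumps over the lazy dog near the old barn"
    print(estimated_resemblance(sketch(d1, perms), sketch(d2, perms)))
```

Both documents must be sketched with the same list of permutations; the fraction of matching minima then estimates the overlap of the two shingle sets.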

Text classification/categorization algorithm

My objective is to [semi]automatically assign texts to different categories. There's a set of user defined categories and a set of texts for each category. The ideal algorithm should be able to learn from a human-defined classification and then classify new texts automatically. Can anybody suggest such an algorithm and perhaps .NET libr...

vector space model algorithm in Java to get the similarity score between two people

Hello all, I am trying to use/implement a vector space model algorithm in Java to get the similarity score between two people based on their keywords. So I have the following classes: Person - Has a List of keywords; Keyword - String text; Integer score; The keyword score is the number of times the person has mentioned the keyword. ...
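One common way to score two keyword vectors in a vector space model is cosine similarity. The sketch below is in Python rather than Java and assumes each person is represented as a dictionary mapping keyword text to its mention count; that representation and the function name `cosine_similarity` are illustrative assumptions, not the asker's classes.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two {keyword: score} dictionaries."""
    # Only keywords present in both dictionaries contribute to the dot product.
    dot = sum(score * b.get(keyword, 0) for keyword, score in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a person with no keywords is treated as dissimilar
    return dot / (norm_a * norm_b)

if __name__ == "__main__":
    alice = {"java": 3, "lucene": 1}
    bob = {"java": 1, "python": 4}
    print(cosine_similarity(alice, bob))
```

With non-negative mention counts the score falls between 0 (no shared keywords) and 1 (identical keyword proportions).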

Topic modeling using mallet

Hey guys, I'm trying to use topic modeling with Mallet but have a question. How do I know when I need to rebuild the model? For instance, I have a set of documents I crawled from the web; using the topic modeling provided by Mallet, I might be able to create the models and infer documents with it. But over time, with new data that I...

Clustering text in MATLAB

I want to do hierarchical agglomerative clustering on texts in MATLAB. Say I have four sentences: I have a pen. I have a paper. I have a pencil. I have a cat. I want to cluster the above four sentences to see which are more similar. I know the Statistics Toolbox has commands like pdist to measure pair-wise distances, linkage to calculat...

How to compute similarity between two sentences (syntactical and semantical)

I'm supposed to take two sentences each time and compute if they are similar. By similar I mean, both syntactically and semantically. INPUT1: Obama signs the law. A new law is signed by Obama. INPUT2: A Bus is stopped here. A vehicle stops here. INPUT3: Fire in NY. NY is burnt down. ...

Is there a better way to create a keyword frequency table in R?

I want to take a CSV export of my BibTeX literature database and analyse the correlation between keywords and journals. I start off with a CSV file containing one row per piece of literature, each one with a journal name and a keyword list, which is a slash-delimited list. I want to end up with either a matrix of Journal by Keyword ...
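The counting step behind such a journal-by-keyword table can be sketched as below. This is a Python illustration rather than R, and the column names `Journal` and `Keywords` are assumptions, since the excerpt does not give the actual CSV headers; `journal_keyword_counts` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter

def journal_keyword_counts(csv_file, journal_col="Journal", keyword_col="Keywords"):
    """Count (journal, keyword) pairs from rows with slash-delimited keywords."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        for keyword in row[keyword_col].split("/"):
            keyword = keyword.strip()
            if keyword:  # skip empty entries such as a trailing slash
                counts[(row[journal_col], keyword)] += 1
    return counts

if __name__ == "__main__":
    # Made-up sample rows, for demonstration only.
    data = io.StringIO(
        "Journal,Keywords\n"
        "Journal A,ecology/modelling\n"
        "Journal A,ecology\n"
        "Journal B,modelling/\n"
    )
    for (journal, keyword), n in sorted(journal_keyword_counts(data).items()):
        print(journal, keyword, n)
```

The resulting pair counts are the cells of the Journal by Keyword matrix; pairs that never occur are simply absent and count as zero.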

Ideas for implementing a new algorithm or modifying an existing one

Hello everyone, I am doing a class project. I want to implement a new algorithm or modify an existing one (like dimension reduction, clustering, bagging, boosting, SVM, FP-tree, text mining, etc.). Please give me some ideas for the project. Thanks ...

How to extract words from text as per the context

Hello, I want to extract relevant words from a text statement provided by the user, e.g. for the question "How many sides are there in a rectangle?" the words should be 'rectangles', 'sides', 'many', 'how'. We've discovered that what I'm actually aiming to build is an NLP question-answering system. But right now I only want to extract the requi...