Hi! I recently found out about n-grams and the cool possibility of using them to compare the frequency of phrases in a text body. Now I'm trying to make a VB.NET app that simply takes a text body and returns a list of the most frequently used phrases (where n >= 2).
I found a C# example of how to generate n-grams from a text body, so I started ...
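A minimal sketch of the idea described above, written in Python rather than VB.NET for brevity. The function names (ngrams, top_phrases), the regex tokenizer, and the defaults for max_n and k are assumptions made for illustration, not part of the original question.

```python
from collections import Counter
import re


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))


def top_phrases(text, min_n=2, max_n=3, k=5):
    """Count all n-grams with min_n <= n <= max_n and return the k most frequent."""
    # Deliberately simple tokenizer: lowercase, keep runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for n in range(min_n, max_n + 1):
        counts.update(" ".join(gram) for gram in ngrams(tokens, n))
    return counts.most_common(k)


if __name__ == "__main__":
    sample = "the cat sat on the mat and the cat sat down"
    for phrase, count in top_phrases(sample, k=3):
        print(count, phrase)
```

Counting phrases in a single Counter keeps the sketch short; for large texts one would probably want to filter out n-grams made only of stop words before ranking.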
Hello,
I need some help in solving this problem.
We have a large number of documents from a given domain. These documents come from different sources, and therefore their structure can be very different too. On the other side I have a table with some specified fields where some figures have to be filled in from the extract of the do...
Hello,
I have a large database of resumes (CVs), and a table skills that groups all users' skills.
Inside that table there's a field skill_text that describes the skill in full text.
I'm looking for an algorithm/software/method to extract significant terms/phrases from that table in order to build a new table with standardized skill...
I'm looking for information extraction libraries that can handle semi-structured information that may have either hidden or incomplete data. I want to train some classifiers to pull out content based on the structure.
I'm working on building a tool where I can select text in the browser, and it will generate (via some web service call)...
I'm working on a project at the moment where it would be really useful to be able to detect when a certain topic/idea is mentioned in a body of text. For instance, if the text contained:
Maybe if you tell me a little more about who Mr Jones is, that would help. It would also be useful if I could have a description of his appearance, ...
I have a data set with multiple layers of annotation over the underlying text, such as part-of-speech tags, chunks from a shallow parser, named entities, and others from various natural language processing (NLP) tools. For a sentence like "The man went to the store", the annotations might look like:
Word POS Chunk NER
==== === ===== ...
I am looking to create n-grams from a text column in PostgreSQL. I currently split the data (sentences) in a text column on whitespace into an array.
select regexp_split_to_array(sentenceData,E'\\s+') from tableName
Once I have this array, how do I go about:
Creating a loop to find n-grams, and write each to a row in another t...
I am working on a small project which involves dictionary-based text searching within a collection of documents. My dictionary has positive signal words (a.k.a. good words), but in the document collection just finding a word does not guarantee a positive result, as there may be negative words, for example (not, not significant), that may be...
Hi, I wanted to know whether it is possible to use decision trees for document classification and, if so, how the data should be represented.
I know how to use the R package party for decision trees.
...
I need to run various machine learning techniques on a big dataset (10-100 billion records).
The problems are mostly around text mining/information extraction and include various kernel techniques but are not restricted to them (we use some Bayesian methods, bootstrapping, gradient boosting, regression trees -- many different problems an...
I'm trying to use shingleprinting to measure document similarity. The process involves the following steps:
Create a 5-shingling of the two documents D1, D2
Hash each shingle with a 64-bit hash
Pick a random permutation of the numbers from 0 to 2^64-1 and apply it to the shingle hashes
For each document find the smallest of the resulting valu...
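The steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the asker's implementation: the 64-bit hash is taken from the first 8 bytes of SHA-1 (an arbitrary choice), and the "random permutation" is approximated by the map x -> (a*x + b) mod 2^64 with a odd, which is a bijection on 0 to 2^64-1 but not a uniformly random permutation. Repeating the process with several permutations and comparing the minima is also an assumption about how the truncated steps continue.

```python
import hashlib
import random

MASK = (1 << 64) - 1  # arithmetic modulo 2^64


def shingles(text, k=5):
    """Set of k-word shingles of a document (k=5 as in the steps above)."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def hash64(shingle):
    """64-bit hash of a shingle: first 8 bytes of SHA-1 (illustrative choice)."""
    digest = hashlib.sha1(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def make_permutations(count, seed=0):
    """Hypothetical stand-in for random permutations: pairs (a, b) with a odd."""
    rng = random.Random(seed)
    return [(rng.getrandbits(64) | 1, rng.getrandbits(64)) for _ in range(count)]


def signature(text, perms, k=5):
    """For each permutation, the smallest permuted shingle hash of the document."""
    hashes = [hash64(s) for s in shingles(text, k)]
    return [min((a * h + b) & MASK for h in hashes) for a, b in perms]


def estimated_resemblance(sig1, sig2):
    """Fraction of permutations on which the two minima coincide."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)


if __name__ == "__main__":
    perms = make_permutations(100)
    d1 = "the quick brown fox jumps over the lazy dog near the river bank"
    d2 = "the quick brown fox jumps over the lazy dog near the old barn"
    print(estimated_resemblance(signature(d1, perms), signature(d2, perms)))
```

A document with fewer than k words has no shingles, so signature raises ValueError on it; a real implementation would need to decide how to handle such inputs.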
My objective is to [semi]automatically assign texts to different categories. There's a set of user-defined categories and a set of texts for each category. The ideal algorithm should be able to learn from a human-defined classification and then classify new texts automatically.
Can anybody suggest such an algorithm and perhaps .NET libr...
Hello all,
I am trying to use/implement a vector space model algorithm in Java to get the similarity score between two people based on their keywords. So I have the following classes:
Person - Has a List of keywords;
Keyword -
String text;
Integer score;
The keyword score is the number of times the person has mentioned the keyword.
...
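One common way to score two such keyword lists in a vector space model is cosine similarity over the keyword scores. The sketch below uses Python instead of Java and represents each person as a plain dict from keyword text to score; that representation and the function name cosine_similarity are assumptions for illustration, and the question itself does not say which similarity measure is intended.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two keyword vectors.

    a and b map keyword text -> score (number of mentions).
    Keywords missing from a dict count as score 0.
    """
    dot = sum(score * b[word] for word, score in a.items() if word in b)
    norm_a = math.sqrt(sum(s * s for s in a.values()))
    norm_b = math.sqrt(sum(s * s for s in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a person without keywords is similar to nobody
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    alice = {"java": 3, "nlp": 4}
    bob = {"java": 1, "nlp": 2, "cooking": 5}
    print(cosine_similarity(alice, bob))
```

Because the result is normalized by vector length, a person who mentions every keyword twice as often as another still gets a similarity of 1 with them, which may or may not be the desired behaviour.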
Hey guys,
I'm trying to use topic modeling with Mallet but have a question.
How do I know when I need to rebuild the model? For instance, I have a number of documents I crawled from the web; using the topic modeling provided by Mallet, I might be able to create the models and infer documents with it. But over time, with new data that I...
I want to do hierarchical agglomerative clustering on texts in MATLAB. Say, I have four sentences,
I have a pen.
I have a paper.
I have a pencil.
I have a cat.
I want to cluster the above four sentences to see which are more similar. I know the Statistics Toolbox has commands like pdist to measure pair-wise distances, linkage to calculat...
I'm supposed to take two sentences each time and determine whether they are similar. By similar I mean both syntactically and semantically.
INPUT1:
Obama signs the law.
A new law is signed by Obama.
INPUT2:
A bus is stopped here.
A vehicle stops here.
INPUT3:
Fire in NY.
NY is burnt down.
...
I want to take a CSV export of my BibTeX literature database and analyse the correlation between keywords and journals. I start off with a CSV file containing one row per piece of literature, each one with a journal name and a keyword list, which is a slash-delimited list. I want to end up with either a matrix of Journal by Keyword ...
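A rough sketch of how the Journal by Keyword counts could be built, in Python. The column names "Journal" and "Keywords" are hypothetical, since the question does not show the actual CSV header, and the function name journal_keyword_counts is likewise made up for illustration.

```python
import csv
import io
from collections import Counter


def journal_keyword_counts(csv_text, journal_col="Journal", keyword_col="Keywords"):
    """Count how often each keyword occurs for each journal.

    Column names are assumptions; keywords are split on "/" as described above.
    Returns a Counter keyed by (journal, keyword) pairs.
    """
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        journal = row[journal_col].strip()
        for keyword in row[keyword_col].split("/"):
            keyword = keyword.strip()
            if keyword:  # skip empty entries such as a trailing slash
                counts[(journal, keyword)] += 1
    return counts


if __name__ == "__main__":
    data = "Journal,Keywords\nNature,ecology/climate\nNature,climate\nScience,ecology/\n"
    for (journal, keyword), n in sorted(journal_keyword_counts(data).items()):
        print(journal, keyword, n)
```

The sparse (journal, keyword) counter can be turned into a dense matrix by iterating over the sorted sets of journals and keywords and looking up each pair, with missing pairs counting as zero.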
Hello Everyone,
I am doing a class project. I want to implement a new algorithm or modify existing ones (like dimension reduction, clustering, bagging, boosting, SVM, FP-tree, text mining, etc.). Please give me some ideas for a project.
Thanks
...
Hello,
I want to extract relevant words from a text statement provided by the user.
E.g., for the question "How many sides are there in a rectangle?"
The words should be 'rectangles', 'sides', 'many', 'how'.
We've discovered that what exactly I'm aiming to do is an NLP question-answering system.
But right now I only want to extract the requi...