text-mining

Natural Language/Text Mining and Reddit/social news site

I think there is a wealth of natural language data associated with sites like reddit or digg or news.google.com. I have done a little bit of research with text mining, but can't find how I could use those tools to parse something like reddit. What kind of applications can you come up with? ...

C# Sentiment Analysis

Does anyone know of a (preferably open source) C# library that can be used to calculate the overall sentiment of some given text? ...

Extracting meaningful content from web pages

I am doing some analysis by mining web content using my crawlers. Web pages often contain clutter (such as ads, unnecessary images and extraneous links) around the body of an article that distracts a user from actual content. To extract the sensible content is a difficult problem as I understand it, considering the fact that there is no...

Crawling The Internet

Hi All, I want to crawl for specific things. Specifically events that are taking place like concerts, movies, art gallery openings, etc. Anything that one might spend time going to. How do I implement a crawler? I have heard of Grub (grub.org -> Wikia) and Heritrix (http://crawler.archive.org/). Are there others? What opinions do...

Which NLP toolkit to use in Java?

Hello there, I'm working on a project that consists of a website that connects to the NCBI (National Center for Biotechnology Information) and searches for articles there. The thing is that I have to do some text mining on all the results. I'm using Java for the text mining and AJAX with ICEfaces for the development of the website. ...

Automatic document tagging

I started working on a project in which I must tag documents with keywords, and it is really hard and time consuming if you do it manually (especially if you have thousands of documents). So I am planning to automate the process (knowing that the result would not be perfect, but at least it gives you some suggested tags). In the latest fir...

How to Predict if Function Name Follows Convention

Suppose you have a repository of 10,000 function names and possibly their frequency of use in a corpus of code which can be in C/C#/C++. (they have different conventions usually prescribed) Some Samples may be: DoPaint OnPaint CloseWindow DeleteGraphOnClose FreeConnection ConnectInternat (smallTypo, but part of code) FreeSoH Now give...

Background reading for parsing sloppy / quirky / "almost structured" data?

I'm maintaining a program that needs to parse out data that is present in an "almost structured" form in text. i.e. various programs that produce it use slightly different formats, it may have been printed out and OCR'd back in (yeah, I know) with errors, etc. so I need to use heuristics that guess how it was produced and apply differen...

How to determine the (natural) language of a document?

I have a set of documents in two languages: English and German. There is no usable meta information about these documents, a program can look at the content only. Based on that, the program has to decide which of the two languages the document is written in. Is there any "standard" algorithm for this problem that can be implemented in a...
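One common heuristic for a two-language case like this is to count frequent function words ("stopwords") from each language and pick the language with more hits. The sketch below is only an illustration under stated assumptions: the two word lists are short hand-picked samples (not a standard resource), and the name `guess_language` is hypothetical.

```python
import re

# Assumption: small hand-picked samples of very common function words.
# A real word list would be much longer.
ENGLISH_WORDS = {"the", "and", "of", "to", "is", "that", "it", "with", "for", "on"}
GERMAN_WORDS = {"der", "das", "und", "ist", "nicht", "mit", "ein", "zu", "den", "dem"}


def guess_language(text):
    """Return 'english', 'german', or 'unknown' based on stopword counts."""
    # Lowercase and keep only alphabetic tokens (including German umlauts and ß).
    words = re.findall(r"[a-zäöüß]+", text.lower())
    english_hits = sum(word in ENGLISH_WORDS for word in words)
    german_hits = sum(word in GERMAN_WORDS for word in words)
    if english_hits == german_hits:
        return "unknown"  # no evidence either way, or a tie
    return "english" if english_hits > german_hits else "german"
```

For example, `guess_language("Der Hund ist nicht mit dem Ball im Garten")` counts five German hits and no English ones. Very short documents may contain no listed words at all, which is why the sketch has an explicit "unknown" result.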

Looking for an information retrieval / text mining application or library

We extract various information from e-mails - flights, car rentals, hotels and more. The method is to extract the body of the mail, usually in HTML form, but sometimes it's text or we use the information in a PDF/Word/RTF attachment. We then apply regular expressions (sometimes in several steps) in order to get information, which is provid...

Besides NLTK, what is the best information retrieval library for Python?

For use in analyzing documents on the Internet! ...

(python) text-mine PDF files with Python?

Is there a package/library for Python that would allow me to open a PDF and search the text for certain words? ...

What is "entropy and information gain"?

I am reading this book (NLTK) and it is confusing. Entropy is defined as: "Entropy is the sum of the probability of each label times the log probability of that same label". How can I apply entropy and maximum entropy in terms of text mining? Can someone give me an easy, simple example (visual)? ...
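As a rough illustration of the quoted definition: the usual Shannon entropy formula is H = -sum(p * log2(p)) over the labels, so the sum in the quote carries a leading minus sign. The sketch below computes it for a list of labels, plus the information gain of a split. The function names and the spam/ham labels are made up for the example and are not taken from the book.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of labels."""
    counts = Counter(labels)
    total = len(labels)
    # Negative sum of p * log2(p) over the distinct labels.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(labels, groups):
    """Entropy of `labels` minus the size-weighted entropy of `groups`.

    `groups` is assumed to be a partition of `labels`, e.g. the documents
    that do and do not contain some word.
    """
    total = len(labels)
    remainder = sum(len(group) / total * entropy(group) for group in groups)
    return entropy(labels) - remainder
```

With two equally likely labels, `entropy(["spam", "ham"])` is 1 bit; with only one label it is 0. A split that separates the labels perfectly recovers all of the original entropy as information gain.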

Find HEX patterns and number of occurrences

Hi, I'd like to find patterns and sort them by number of occurrences on an HEX file I have. I am not looking for some specific pattern, just to make some statistics of the occurrences happening there and sort them. DB0DDAEEDAF7DAF5DB1FDB1DDB20DB1BDAFCDAFBDB1FDB18DB23DB06DB21DB15DB25DB1DDB2EDB36DB43DB59DB32DB28DB2ADB46DB6FDB32DB44DB40D...
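One way to get such statistics is to cut the hex string into fixed-width chunks and count them. This is a minimal sketch, assuming the data is made of aligned 4-character (2-byte) values, which the sample suggests but the question does not confirm. The name `count_chunks` and the `width` parameter are hypothetical.

```python
from collections import Counter


def count_chunks(hex_string, width=4):
    """Count aligned, fixed-width chunks, most frequent first.

    Assumption: values are `width` hex characters wide and start at
    offset 0. A trailing partial chunk is ignored.
    """
    chunks = [
        hex_string[i:i + width]
        for i in range(0, len(hex_string) - width + 1, width)
    ]
    # most_common() returns (chunk, count) pairs sorted by descending count.
    return Counter(chunks).most_common()
```

For instance, `count_chunks("DB0DDAEEDB0D")` sees the chunks `DB0D`, `DAEE`, `DB0D`. If the patterns are not aligned or have variable length, a sliding window (step 1 instead of `width`) would be a different, also reasonable, choice.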

Perl within Python?

There is a Perl library I would like to access from within Python. How can I use it? FYI, the software is NCleaner. I would like to use it from within Python to transform an HTML string into text. (Yes, I know about aaronsw's Python html2text. NCleaner is better, because it removes boiler-plate.) I don't want to run the Perl program as...

term clustering library?

Hi, does anybody know of an open-source/free library that does term clustering? Thanks, yaniv ...

Text mining library or linguistic library?

I have a bunch of data harvested from a forum I own, and would like to do some text mining or use some linguistic library to extract useful information. Any text mining or data mining library in any language will do. Thank you. ...

Extracting useful data from arbitrary HTML pages?

Is there a library for Ruby or PHP that is able to parse HTML pages and extract unique data by comparing it with other similar pages? It should use some sort of text mining to identify which texts are more likely noise and repetitive, while other texts are more unique and useful... ...

Text mining, fact extraction, semantic analysis using .Net

I'm looking for any free tools/components/libraries that allow me to take advantage of text mining, fact extraction and semantic analysis in my .NET application. The GATE project is what I need, but it is written in Java. Is there something like GATE in the .NET world? My challenge is to extract certain facts out of website text conten...

Building an index of URLs, what features to include?

I am working towards building an index of URLs. The objective is to build and store a data structure which will have a domain URL (e.g. www.nytimes.com) as the key, and the value will be a set of features associated with that URL. I am looking for your suggestions for this set of features. For example I would like to store www.nytimes.com as f...