I think there is a wealth of natural language data associated with sites like reddit or digg or news.google.com.
I have done a little bit of research with text mining, but can't find how I could use those tools to parse something like reddit.
What kind of applications can you come up with?
...
Does anyone know of a (preferably open source) C# library that can be used to calculate the overall sentiment of some given text?
...
I am doing some analysis by mining web content using my crawlers. Web pages often contain clutter (such as ads, unnecessary images and extraneous links) around the body of an article that distracts a user from actual content.
To extract the sensible content is a difficult problem as I understand it, considering the fact that there is no...
Hi All,
I want to crawl for specific things: events that are taking place, like concerts, movies, art gallery openings, etc. Anything that one might spend time going to.
How do I implement a crawler?
I have heard of Grub (grub.org -> Wikia) and Heritrix (http://crawler.archive.org/).
Are there others?
What opinions do...
Hello there, I'm working on a project that consists of a website that connects to the NCBI (National Center for Biotechnology Information) and searches for articles there. The thing is that I have to do some text mining on all the results.
I'm using Java for the text mining and AJAX with ICEfaces for the development of the website.
...
I started working on a project in which I must tag documents with keywords, and doing it manually is really hard and time-consuming (especially if you have thousands of documents). So I am planning to automate the process (knowing that the result would not be perfect, but at least it gives you some suggested tags).
In the latest fir...
Suppose you have a repository of 10,000 function names and possibly their frequency of use in a corpus of code, which can be in C/C#/C++ (each usually has its own prescribed conventions).
Some samples may be:
DoPaint
OnPaint
CloseWindow
DeleteGraphOnClose
FreeConnection
ConnectInternat (smallTypo, but part of code)
FreeSoH
Now give...
I'm maintaining a program that needs to parse out data that is present in an "almost structured" form in text. That is, the various programs that produce it use slightly different formats, it may have been printed out and OCR'd back in (yeah, I know) with errors, etc., so I need to use heuristics that guess how it was produced and apply differen...
I have a set of documents in two languages: English and German. There is no usable meta information about these documents; a program can look at the content only. Based on that, the program has to decide which of the two languages the document is written in.
Is there any "standard" algorithm for this problem that can be implemented in a...
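One simple heuristic for the two-language case is to count common function words (stopwords) from each language and pick the language with more hits. The sketch below assumes that approach; it is not necessarily the "standard" algorithm the question asks about (character n-gram profiles are another common choice). The `guess_language` helper and the two tiny word lists are made up for illustration and would need to be much larger in practice.

```python
import re

# Tiny hand-picked stopword lists (assumption: for illustration only).
ENGLISH = {"the", "and", "of", "to", "is", "that", "it", "with", "for", "on"}
GERMAN = {"der", "die", "das", "und", "ist", "nicht", "ein", "mit", "für", "zu"}


def guess_language(text):
    """Return "en", "de", or "unknown" based on stopword counts."""
    # Split into lowercase runs of letters (Unicode-aware, so umlauts are kept).
    words = re.findall(r"[^\W\d_]+", text.lower())
    english_hits = sum(word in ENGLISH for word in words)
    german_hits = sum(word in GERMAN for word in words)
    if english_hits == german_hits:
        return "unknown"
    return "en" if english_hits > german_hits else "de"


print(guess_language("The cat is on the roof and it is happy"))
print(guess_language("Der Hund ist nicht mit der Katze im Garten"))
```

Returning "unknown" on a tie keeps very short or mixed documents from being labelled arbitrarily.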
We extract various information from e-mails - flights, car rentals, hotels and more. The method is to extract the body of the mail, usually in HTML form but sometimes it's text or we use the information in a PDF/Word/RTF attachment. We then apply regular expressions (sometimes in several steps) in order to get information, which is provid...
For use to analyze documents on the Internet!
...
Is there a package/library for Python that would allow me to open a PDF, and search the text for certain words?
...
I am reading this book (NLTK) and it is confusing. Entropy is defined as:
Entropy is the sum of the probability of each label
times the log probability of that same label
How can I apply entropy and maximum entropy in terms of text mining? Can someone give me an easy, simple example (visual)?
...
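A minimal sketch of the entropy definition quoted above, assuming the labels are given as a plain Python list. The `entropy` helper and the toy spam/ham labels are made up for illustration. Note that the usual formula carries a leading minus sign, H = -sum(p(l) * log2 p(l)), which is the same quantity as sum(p(l) * log2(1 / p(l))) used here. Maximum entropy modelling is not covered by this sketch.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of the label distribution in `labels`."""
    counts = Counter(labels)
    total = len(labels)
    # p * log2(1/p) summed over the distinct labels, with p = n / total.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


print(entropy(["spam", "spam", "ham", "ham"]))    # two equally likely labels
print(entropy(["spam", "spam", "spam", "spam"]))  # only one label
```

The more evenly the labels are spread, the higher the value; a list with a single label has zero entropy.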
Hi,
I'd like to find patterns in a hex file I have and sort them by number of occurrences.
I am not looking for any specific pattern; I just want to compute statistics on the occurrences found there and sort them.
DB0DDAEEDAF7DAF5DB1FDB1DDB20DB1BDAFCDAFBDB1FDB18DB23DB06DB21DB15DB25DB1DDB2EDB36DB43DB59DB32DB28DB2ADB46DB6FDB32DB44DB40D...
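One way to get such statistics is to cut the text into fixed-width chunks and count them. The sketch below assumes the data is a sequence of 4-digit (16-bit) hex words, which is only a guess from how the dump looks; `count_hex_patterns` and its `width` parameter are hypothetical names.

```python
from collections import Counter


def count_hex_patterns(hex_text, width=4):
    """Count non-overlapping chunks of `width` hex digits, most common first."""
    # Drop whitespace/newlines and normalise case before chunking.
    cleaned = "".join(hex_text.split()).upper()
    chunks = [
        cleaned[i:i + width]
        for i in range(0, len(cleaned) - width + 1, width)
    ]
    return Counter(chunks).most_common()


# Prefix of the dump shown above.
sample = "DB0DDAEEDAF7DAF5DB1FDB1DDB20DB1BDAFCDAFBDB1FDB18"
for pattern, occurrences in count_hex_patterns(sample)[:3]:
    print(pattern, occurrences)
```

Changing `width` (for example to 2 for single bytes) gives statistics at a different granularity; overlapping patterns would need a sliding window instead of fixed steps.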
There is a Perl library I would like to access from within Python.
How can I use it?
FYI, the software is NCleaner. I would like to use it from within Python to transform an HTML string into text. (Yes, I know about aaronsw's Python html2text. NCleaner is better, because it removes boiler-plate.)
I don't want to run the Perl program as...
Hi,
Does anybody know an open-source/free library that does term clustering?
Thanks,
yaniv
...
I have a bunch of data harvested from a forum I own, and would like to do some text mining or use some linguistic library to extract useful information.
Any text mining or data mining library in any language will do.
Thank you.
...
Is there a library for Ruby or PHP that is able to parse HTML pages and extract unique data by comparing them with other similar pages? It should use some sort of text mining to identify which texts are more likely noise and repetitive, while other texts are more unique and useful...
...
I'm looking for any free tools/components/libraries that allow me to take advantage of text mining, fact extraction and semantic analysis in my .NET application.
The GATE project is what I need but it is written in Java. Is there something like GATE in the .NET world?
My challenge is to extract certain facts out of website text conten...
I am working towards building an index of URLs. The objective is to build and store a data structure in which the key is a domain URL (e.g. www.nytimes.com) and the value is a set of features associated with that URL. I am looking for your suggestions for this set of features. For example I would like to store www.nytimes.com as f...