information-retrieval

Database row/record pointers

Hi, I don't know the correct terms for what I'm trying to find out about, so I'm having a hard time googling it. I want to know whether it's possible with databases (technology independent, but I would be interested to hear whether it's possible with Oracle, MySQL and Postgres) to point to specific rows instead of executing my query again. ...

Information Retrieval database formats?

I'm looking for some documentation on how Information Retrieval systems (e.g., Lucene) store their indexes for speedy "relevancy" lookups. My Google-fu is failing me: I've found a page which describes Lucene's file format, but it's more focused on how many bits each number is than on how the database is used in producing speedy queries....

Retrieve some info from the web automatically.

I need to retrieve some info from the web. For example, I can visit weather.com and search for my zip code to get an HTML file that contains the temperature or something. I need to make a Python script to do this automatically. I think there are two ways to do this. Run wget to download the web page, parse it to get the information I want. If th...

Interpreting Search Results

Hi all, I am tasked with writing a program that, given a search term and the HTML source of a page representing search results of some unknown search engine (it can really be anything: a blog, a shop, Google, eBay, ...), needs to build a data structure of the results containing "what's in the results": a title for each result, the "de...

Wikipedia text download

Hi, I am looking to download the full Wikipedia text for my college project. Do I have to write my own spider to download this, or is there a public dataset of Wikipedia available online? To give you an overview of my project, I want to find the interesting words in a few articles I am interested in. But to find these interesting w...

Find Tables in PDFs

Hello, are there any tools or tricks for automatically extracting tables from PDFs? Are there any C# libraries that could do that? Or do you know of other methods for handling this? Thank you very much ...

How to estimate the quality of a web page?

Hello, I'm doing a university project that must gather and combine data on a user-provided topic. The problem I've encountered is that Google search results for many terms are polluted with low-quality auto-generated pages, and if I use them, I can end up with wrong facts. How is it possible to estimate the quality/trustworthiness of a pa...

entity set expansion python

Do you know of any existing implementation in any language (preferably Python) of any entity set expansion algorithm, such as the one from Google Sets (http://labs.google.com/sets)? I couldn't find any library implementing such algorithms and I'd like to play with some of those to see how they would perform on some specific task I...

Gaining information from nodes of tree

I am working with a tree data structure and trying to come up with a way to calculate the information I can gain from the nodes of the tree. I am wondering if there are any existing techniques which can assign higher numerical importance to a node which appears less frequently at a lower level (distance from the root of the tree) than the s...

Create a dataset: extract features from text documents (TF-IDF)

I have to create a dataset from some text files, writing them as vectors of features. Something like this: doc1: 1,0.45 6,0.001 94,0.1 ... doc2: 3,0.5 98,0.2 ... ... Each position of the vector represents a word, and the score is given by something like TF-IDF. Do you know of any library or tool for this? (Java is preferred) ...
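For illustration only, here is a minimal sketch of the sparse "index,score" format described above, written in Python rather than the Java the asker prefers. The function names `tfidf_vectors` and `format_vector` are hypothetical, tokenization is a plain whitespace split, and the weighting is one common variant (relative term frequency times the natural log of N/df), not necessarily the one the asker has in mind.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocab, vectors) for a list of document strings.

    vocab maps each word to an integer position; vectors holds one sorted
    list of (position, score) pairs per document. Assumption: whitespace
    tokenization and tf * ln(N / df) weighting.
    """
    tokenized = [doc.lower().split() for doc in docs]

    # Assign vector positions in order of first appearance.
    vocab = {}
    for tokens in tokenized:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))

    # Document frequency: number of documents containing each word.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vec = []
        for tok, count in counts.items():
            score = (count / total) * math.log(n_docs / df[tok])
            if score > 0:  # keep the vector sparse: drop zero scores
                vec.append((vocab[tok], score))
        vectors.append(sorted(vec))
    return vocab, vectors


def format_vector(name, vec):
    """Render one vector as 'name: index,score index,score ...'."""
    return name + ": " + " ".join(f"{i},{s:.3f}" for i, s in vec)
```

Calling `format_vector("doc1", vectors[0])` yields a line in the same shape as the example in the question. Words that occur in every document score zero under this weighting and are omitted from the output.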

Writing a program to scrape forums

Hi, I need to write a program to scrape forums. Should I write the program in Python using the Scrapy framework, or should I use PHP cURL? Also, is there a PHP equivalent to Scrapy? Thanks ...

A software/hardware structure of the Google Search/Maps Linux clusters?

I am particularly interested in how one can deal with a huge amount of information for a commercial service like Google Search or Google Maps. We all know they use (or at least "did" use) some kind of Linux cluster, but how exactly are they organized? What kind of hardware do they use, what file systems, networking, what problems are the most fre...

Ngram IDF smoothing

I am trying to use IDF scores to find interesting phrases in my pretty huge corpus of documents. I basically need something like Amazon's Statistically Improbable Phrases, i.e. phrases that distinguish a document from all the others. The problem that I am running into is that some (3,4)-grams in my data which have super-high IDF actually ...

PHP: Working with video and timecodes

Are there any good libraries (preferably free) for working with video files and their timecodes? I especially need two kinds of functionality: Get information about video files in as many formats as possible, but most importantly QuickTime. For example duration, bit rate, frame rate, format, dimensions, display aspect ratio, pixel aspe...

How can I retrieve my Google search history?

In the Google Web History interface I can see all the search queries I have used over the years, and the pages I visited for a particular query. Is there a way I can retrieve this history using a computer program? I couldn't find a Google API that does it. Do you know of a tool that can do this, or suggest a way to achieve this? ...

How does Shingleprinting work in practice?

I'm trying to use shingleprinting to measure document similarity. The process involves the following steps: Create a 5-shingling of the two documents D1, D2. Hash each shingle with a 64-bit hash. Pick a random permutation of the numbers from 0 to 2^64-1 and apply it to the shingle hashes. For each document find the smallest of the resulting valu...
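The steps listed in the question can be sketched roughly as follows. This is a sketch under several assumptions, not the asker's implementation: shingles are taken over words (the question does not say whether they are words or characters), the 64-bit hash is BLAKE2b truncated to 8 bytes, and each "random permutation" is stood in for by an affine map (a*x + b) mod 2^64 with odd a, which is a bijection on 0..2^64-1 but not a uniformly random permutation. The final step, estimating similarity as the fraction of permutations on which the two minima agree, is the usual MinHash estimate and is also an assumption here, since the excerpt is cut off before that point. All function names are hypothetical.

```python
import hashlib
import random

MASK64 = (1 << 64) - 1


def shingles(text, k=5):
    """Set of k-word shingles; a text shorter than k words yields one shingle."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def hash64(shingle):
    """64-bit hash of a shingle (BLAKE2b with an 8-byte digest)."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def make_permutations(count, seed=0):
    """Affine maps x -> (a*x + b) mod 2^64; odd a makes each map a bijection."""
    rng = random.Random(seed)
    return [(rng.getrandbits(64) | 1, rng.getrandbits(64)) for _ in range(count)]


def signature(text, perms, k=5):
    """For each permutation, the smallest permuted shingle hash of the text."""
    hashes = [hash64(s) for s in shingles(text, k)]
    return [min((a * h + b) & MASK64 for h in hashes) for a, b in perms]


def estimate_similarity(sig1, sig2):
    """Fraction of permutations on which the two minima coincide."""
    matches = sum(x == y for x, y in zip(sig1, sig2))
    return matches / len(sig1)
```

Both documents must be signed with the same list of permutations, otherwise the minima are not comparable. More permutations give a less noisy estimate at the cost of a longer signature.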

Get an article's title/author/date info with Javascript

I'm trying to build a bookmarklet that will get the current page/article's author and date information, for referencing purposes. I know that I can get the page title and URL with document.title and document.URL but I'm drawing a blank when it comes to the other information. Any ideas? ...

SCORM data pictorial representation

How can I present the information in the imsmanifest.xml file of a SCORM package? What I need is to create a tree view or some other type of pictorial representation of the information in the package. ...

Good documentation on structure tcp_info

Hi folks, I am working on getting the performance parameters of a TCP connection, and one of these parameters is the bandwidth. I intend to use the tcp_info structure supported from Linux 2.6 onwards, which holds the metadata about a TCP connection. The information can be retrieved using the getsockopt() function call on tcp_info. I h...

Cosine Similarity of Vectors, with < O(n^2) complexity

Hi, having looked around this site for similar issues, I found this: http://math.nist.gov/javanumerics/jama/ and this: http://sujitpal.blogspot.com/2008/09/ir-math-with-java-similarity-measures.html However, it seems these run in O(n^2). I've been doing some document clustering and noticed this level of complexity wasn't feasible when ...
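One common way to avoid comparing every pair of documents is an inverted index over sparse vectors, so that only documents sharing at least one term are ever scored. The sketch below assumes that n in the question means the number of documents and that each document is a dict mapping term to weight; neither is confirmed by the truncated excerpt, and the function name `similar_pairs` is hypothetical.

```python
import math
from collections import defaultdict


def similar_pairs(docs):
    """Cosine similarity for every pair of documents sharing at least one term.

    docs: dict mapping doc id -> {term: weight}. Pairs with no term in
    common have cosine 0 and are never visited.
    """
    # Inverted index: term -> ids of the documents that contain it.
    index = defaultdict(list)
    for doc_id, vec in docs.items():
        for term in vec:
            index[term].append(doc_id)

    # Euclidean norm of each document vector.
    norms = {d: math.sqrt(sum(w * w for w in v.values())) for d, v in docs.items()}

    # Accumulate dot products term by term, touching only co-occurring pairs.
    dots = defaultdict(float)
    for term, ids in index.items():
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                dots[(a, b)] += docs[a][term] * docs[b][term]

    return {
        (a, b): dot / (norms[a] * norms[b])
        for (a, b), dot in dots.items()
        if norms[a] and norms[b]  # skip all-zero vectors
    }
```

The saving depends on sparsity: a term that occurs in every document still produces all pairs, so very frequent terms are usually dropped or down-weighted before indexing.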