information-retrieval

What are some alternatives to a bit array?

I have an information retrieval application that creates bit arrays on the order of tens of millions of bits. The number of "set" bits in the array varies widely, from all clear to all set. Currently, I'm using a straightforward bit array (java.util.BitSet), so each of my bit arrays takes several megabytes. My plan is to look at the cardina...

Dynamic regex for date time formats

Is there an existing solution for dynamically creating regular expressions from a given date-time format pattern? It does not matter which date-time format pattern is supported (Joda DateTimeFormat, java.text.SimpleDateFormat, or others). That is, for a given date-time format (for example "dd/MM/yyyy hh:mm"), it will generate corresponding regular exp...
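One way to approach this, as a minimal sketch rather than an existing library: map each pattern token to a regex fragment and escape everything else as a literal. The `TOKEN_REGEX` table and the `format_to_regex` helper are hypothetical names introduced here, and only a few numeric tokens in the SimpleDateFormat style are covered.

```python
import re

# Hypothetical token table (an assumption): only a handful of numeric,
# fixed-width tokens are handled. Text tokens such as month names are not.
TOKEN_REGEX = {
    "yyyy": r"\d{4}",
    "MM": r"(?:0[1-9]|1[0-2])",        # month 01-12
    "dd": r"(?:0[1-9]|[12]\d|3[01])",  # day 01-31 (no per-month validation)
    "HH": r"(?:[01]\d|2[0-3])",        # hour 00-23
    "hh": r"(?:0[1-9]|1[0-2])",        # hour 01-12
    "mm": r"[0-5]\d",                  # minute 00-59
    "ss": r"[0-5]\d",                  # second 00-59
}

# Try longer tokens first so "yyyy" is never split into shorter pieces.
_TOKEN_PATTERN = re.compile(
    "|".join(sorted(TOKEN_REGEX, key=len, reverse=True))
)


def format_to_regex(fmt: str) -> str:
    """Translate a date-time format pattern into a regex source string."""
    parts = []
    pos = 0
    for match in _TOKEN_PATTERN.finditer(fmt):
        # Everything between tokens is literal text (separators, spaces).
        parts.append(re.escape(fmt[pos:match.start()]))
        parts.append(TOKEN_REGEX[match.group()])
        pos = match.end()
    parts.append(re.escape(fmt[pos:]))
    return "".join(parts)


if __name__ == "__main__":
    pattern = re.compile(format_to_regex("dd/MM/yyyy hh:mm"))
    print(bool(pattern.fullmatch("25/12/2009 09:30")))
    print(bool(pattern.fullmatch("32/12/2009 09:30")))
```

The generated expression only checks the shape of the string. It accepts impossible dates such as 31/02, so a real parser is still needed to validate the value.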

Suggestion needed to learn Machine Learning and Information Retrieval

Hi! I want to learn about Information Retrieval and Machine Learning. Which books do you recommend, and in what order is it best to read them? The goal is to reach a good understanding of recommendation systems. Thanks! Jonathan ...

Ways to do "related searches" functionality

I've seen a few sites that list related searches when you perform a search, namely they suggest other search queries you may be interested in. I'm wondering the best way to model this in a medium-sized site (not enough traffic to rely on visitor stats to infer relationships). My initial thought is to store the top 10 results for each un...

Search Engines Inexact Counting (about xxx results)

When you search in Google (I'm almost sure that AltaVista did the same thing) it says "Results 1-10 of about xxxx"... This has always amazed me. What does "about" mean? How can they count roughly? I do understand why they can't come up with a precise figure in a reasonable time, but how do they even reach this "approximate" one? I...

Hidden features of MSDN documentation

This is both for the local version and msdn.microsoft.com. Generally I find the MSDN documentation to be very good, but only if you can find what you're looking for. So if anybody has any general tips and tricks, I'd love to hear them. ...

Signature files for document retrieval

Hi all, I was wondering if you know where I can find information on how to build a signature file for document retrieval. Do you know if there is some code out there that I can use or look at? I have to create a signature file in C++ on Linux. UPDATE: Sorry, I appreciate the help but I was referring to signature ...

Natural Language/Text Mining and Reddit/social news site

I think there is a wealth of natural language data associated with sites like reddit or digg or news.google.com. I have done a little bit of research with text mining, but can't find how I could use those tools to parse something like reddit. What kind of applications can you come up with? ...

Looking for books on Information Science, Information Retrieval

Here's what I have on my list so far. I'd like to know of others in the same vein, perhaps more technical, perhaps less: Blown to Bits: Your Life, Liberty, and Happiness After the Digital Explosion (Abelson, Ledeen, and Lewis); Glut: Mastering Information Through the Ages (Wright); Information Rules (Varian and Shapiro); Web Dragons: Ins...

Retrieving data with Selenium

Hi everyone, I want to get the business hours of the ScotiaBank branches near me. The base URL is: http://maps.scotiabank.com/ I then: click on the "Branches" radio button; click on the "Open Saturdays" checkbox; enter "B3H 1M7" (my postal code) into the search box; click the Search button; click on the first result that po...

DynamicFields in Solr

In my current project I need to index all e-mails and their attachments from multiple mailboxes. I will use Solr, and I don't know the best approach to structuring my index. My first approach was: <fields> <field name="id" require="true"/> <field name="uid" require="true"/> //A lot of other fields <dynamicField name="attachm...

Crawling The Internet

Hi All, I want to crawl for specific things. Specifically, events that are taking place, like concerts, movies, art gallery openings, etc. Anything that one might spend time going to. How do I implement a crawler? I have heard of Grub (grub.org -> Wikia) and Heritrix (http://crawler.archive.org/). Are there others? What opinions do...

Match rows containing a word with permutations

Say you've got a big table that contains a varchar column. How would you match rows that contain the word 'preferred' in the varchar col BUT the data is somewhat noisy and contains occasional spelling errors, e.g.: ['$2.10 Cumulative Convertible Preffered Stock, $25 par value', '5.95% Preferres Stock', 'Class A Preffered', 'Series A Pe...
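As a rough illustration of the matching problem, not a database-side solution, the sketch below uses Python's standard `difflib` to test whether any word in a string is close to a target word. The helper name `contains_fuzzy_word` and the 0.8 similarity cutoff are assumptions chosen for this example. It would run over rows fetched into the application rather than inside the SQL engine.

```python
import difflib
import re


def contains_fuzzy_word(text: str, target: str = "preferred",
                        cutoff: float = 0.8) -> bool:
    """Return True if some word in `text` is similar enough to `target`.

    Similarity is difflib's SequenceMatcher ratio. The cutoff of 0.8 is an
    assumed threshold and would need tuning against real data.
    """
    # Split into lowercase alphabetic words so case and punctuation
    # do not affect the comparison.
    words = re.findall(r"[a-z]+", text.lower())
    return bool(difflib.get_close_matches(target, words, n=1, cutoff=cutoff))


if __name__ == "__main__":
    rows = [
        "$2.10 Cumulative Convertible Preffered Stock, $25 par value",
        "5.95% Preferres Stock",
        "Class A Preffered",
        "Common Stock, $1 par value",
    ]
    for row in rows:
        print(contains_fuzzy_word(row), row)
```

A similarity threshold trades recall for precision: a genuinely different word such as "referred" also scores above 0.8 against "preferred", so results may need a manual review or a stricter cutoff.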

What is the best way to compute trending topics or tags?

Many sites offer some statistics like "The hottest topics in the last 24h". For example, Topix.com shows this in its section "News Trends". There, you can see the topics which have the fastest growing number of mentions. I want to compute such a "buzz" for a topic, too. How could I do this? The algorithm should weight the topics which a...
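One common starting point, sketched here under stated assumptions and not taken from Topix or any other site, is to score each topic by how far its latest mention count deviates from that topic's own history, expressed as a z-score. The function name `buzz_score` and the per-day count inputs are hypothetical.

```python
from statistics import mean, pstdev


def buzz_score(history: list[float], current: float) -> float:
    """Z-score of the current mention count against past counts.

    `history` holds a topic's mention counts for earlier periods (for
    example, one value per day) and `current` is the latest period.
    A large positive score means the topic is unusually active compared
    with its own baseline, regardless of its absolute popularity.
    """
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # Assumption: treat a perfectly flat history as deviation 1
        # to avoid dividing by zero.
        sigma = 1.0
    return (current - mu) / sigma


if __name__ == "__main__":
    # Hypothetical counts: a popular but stable topic and a small topic
    # that suddenly spikes.
    topics = {
        "stable-topic": ([1000, 1010, 990, 1000], 1005),
        "spiking-topic": ([2, 4, 4, 4, 5, 5, 7, 9], 13),
    }
    ranked = sorted(topics, key=lambda t: buzz_score(*topics[t]), reverse=True)
    print(ranked)
```

Normalizing by each topic's own mean and deviation keeps permanently popular topics from dominating the list. Topics with very short histories produce unstable scores, so a minimum history length is usually worth enforcing.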

Search index tool for personal knowledge base files

I have a large number of basic text, RTF, HTML, PDF and CHM files that I store on a USB key as a personal knowledge base. Up until now, to retrieve information, I've used standard file searching tools (Windows Search, grep, etc.). However, these days a brute-force search can take minutes due to sheer data size. Also PDF and CHM are als...

How quickly can 2 iPhones exchange information regarding tilt/position?

I was wondering how much time is required to convey information regarding the tilt and position (not GPS) of one particular iPhone to another. Could 2 iPhones send and receive this information simultaneously? What about 3 iPhones? I'm interested in an application that is able to simultaneously send and receive and make conditional ...

Russian Document Corpus for Search Engine

Hi, I'm working on a cross-language information retrieval system that takes queries in English and searches documents in Russian. To evaluate this system it would be nice to have a collection of Russian documents to search through. Does anyone out there know of a collection of documents I can search or websites from which I can easily scrape to...

How to add search functionality to my application

I am writing a Windows application (with Borland C++ Builder) that stores a large number of text files. I want users to be able to search these files very fast, so I need an indexing and search library. I do not use a database, but my own file format for storing the documents (all are in a single file). Are there such libraries for Windows?...

Google Alerts API?

It seems that there is no Google Alerts API. Firstly, how would you get Google Alerts information into a database other than by parsing the text of the email message that Google sends you? If you must parse text, how would you go about parsing out the relevant pieces of the email message? ...

Using spaced repetition to retain programming knowledge

Has anybody successfully used the spaced repetition concepts embodied in programs like SuperMemo in the context of programming? The motivation for this question: I'm increasingly having to look up things I once knew. Reading this Wired piece "Want to Remember Everything You'll Ever Learn? Surrender to This Algorithm" has me wondering if this...