I have an information retrieval application that creates bit arrays on the order of tens of millions of bits. The number of "set" bits in the array varies widely, from all clear to all set. Currently, I'm using a straightforward bit array (java.util.BitSet), so each of my bit arrays takes several megabytes.
My plan is to look at the cardina...
Is there an existing solution for generating regular expressions dynamically from a given date-time format pattern? Which date-time format syntax is supported does not matter (Joda DateTimeFormat, java.text.SimpleDateFormat or others).
I.e., for a given date-time format (for example "dd/MM/yyyy hh:mm"), it will generate corresponding regular exp...
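One possible approach to the question above, as a minimal sketch rather than an existing library: map each format token to a regex fragment and escape everything else as a literal. The token table `TOKEN_REGEX` and the function `format_to_regex` are hypothetical names introduced here, and only a handful of SimpleDateFormat-style tokens are covered.

```python
import re

# Hypothetical mapping from a few SimpleDateFormat-style tokens to regex
# fragments. This is an assumption for illustration, not a complete table.
TOKEN_REGEX = {
    "yyyy": r"\d{4}",
    "MM": r"(?:0[1-9]|1[0-2])",
    "dd": r"(?:0[1-9]|[12]\d|3[01])",
    "HH": r"(?:[01]\d|2[0-3])",
    "hh": r"(?:0[1-9]|1[0-2])",
    "mm": r"[0-5]\d",
    "ss": r"[0-5]\d",
}

# Try longer tokens first so that "yyyy" is matched as a whole.
_TOKEN_PATTERN = re.compile(
    "|".join(sorted(TOKEN_REGEX, key=len, reverse=True))
)


def format_to_regex(fmt: str) -> str:
    """Build an anchored regex string from a date-time format pattern."""
    parts = []
    pos = 0
    for match in _TOKEN_PATTERN.finditer(fmt):
        # Text between tokens (separators such as "/" or ":") is literal.
        parts.append(re.escape(fmt[pos:match.start()]))
        parts.append(TOKEN_REGEX[match.group()])
        pos = match.end()
    parts.append(re.escape(fmt[pos:]))
    return "^" + "".join(parts) + "$"


pattern = re.compile(format_to_regex("dd/MM/yyyy hh:mm"))
print(bool(pattern.match("25/12/2009 11:45")))
```

The generated expression only checks the shape and value ranges of each field; it does not validate calendar rules such as the number of days in a given month.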
Hi!
I want to learn about Information Retrieval and Machine Learning. Which books do you recommend, and in what order do you think it is best to read them?
The idea is to reach a good understanding of recommendation systems.
Thanks!
Jonathan
...
I've seen a few sites that list related searches when you perform a search, namely they suggest other search queries you may be interested in.
I'm wondering the best way to model this in a medium-sized site (not enough traffic to rely on visitor stats to infer relationships). My initial thought is to store the top 10 results for each un...
When you search in Google (I'm almost sure that AltaVista did the same thing) it says "Results 1-10 of about xxxx"...
This has always amazed me... What does "about" mean?
How can they count roughly?
I do understand why they can't come up with a precise figure in a reasonable time, but how do they even reach this "approximate" one?
I...
This is both for the local version and msdn.microsoft.com.
Generally I find the MSDN documentation to be very good, but only if you can find what you're looking for. So if anybody has any general tips and tricks, I'd love to hear them.
...
Hi all,
I was wondering if you know where I can find information on how to build a signature file for document retrieval.
Do you know if there is some code out there that I can use or look at?
I have to create a signature file in C++ on Linux.
UPDATE: Sorry, I appreciate the help but I was referring to signature ...
I think there is a wealth of natural language data associated with sites like reddit or digg or news.google.com.
I have done a little bit of research with text mining, but can't find how I could use those tools to parse something like reddit.
What kind of applications can you come up with?
...
Here's what I have on my list so far. I'd like to know of others in the same vein, perhaps more technical, perhaps less.
Blown to Bits: Your Life, Liberty, and Happiness After the Digital Explosion - Abelson, Ledeen, and Lewis
Glut: Mastering Information Through the Ages - Wright
Information Rules - Varian and Shapiro
Web Dragons: Ins...
Hi everyone,
I want to get the business hours of the ScotiaBank branches that are near me.
The base-URL is: http://maps.scotiabank.com/
I then:
Click on the "Branches" radio button.
Click on the "Open Saturdays" checkbox.
Enter "B3H 1M7" (my postal code) into the search box.
Click the Search button.
Click on the first result that po...
In my current project I need to index all e-mails and their attachments from multiple mailboxes.
I will use Solr and I don't know what the best approach is for structuring my index. My first approach was:
<fields>
<field name="id" require="true"/>
<field name="uid" require="true"/>
//A lot of other fields
<dynamicField name="attachm...
Hi All,
I want to crawl for specific things. Specifically events that are taking place like concerts, movies, art gallery openings, etc, etc. Anything that one might spend time going to.
How do I implement a crawler?
I have heard of Grub (grub.org -> Wikia) and Heritrix (http://crawler.archive.org/)
Are there others?
What opinions do...
Say you've got a big table that contains a varchar column.
How would you match rows that contain the word 'preferred' in the varchar col BUT the data is somewhat noisy and contains occasional spelling errors, e.g.:
['$2.10 Cumulative Convertible Preffered Stock, $25 par value',
'5.95% Preferres Stock',
'Class A Preffered',
'Series A Pe...
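One way to tolerate such misspellings, sketched here in Python outside the database, is to compare each word of the column value against the target word with a similarity threshold. This is only an illustration under stated assumptions: the helper `contains_fuzzy_word` and the 0.8 cutoff are hypothetical choices, and it uses the standard library's `difflib` rather than any database feature.

```python
import difflib
import re


def contains_fuzzy_word(text, target="preferred", cutoff=0.8):
    """Return True if any word in `text` is similar enough to `target`.

    `cutoff` is an assumed similarity threshold between 0 and 1;
    it would need tuning against real data.
    """
    # Split into lowercase alphabetic words, dropping digits and punctuation.
    words = re.findall(r"[A-Za-z]+", text.lower())
    matches = difflib.get_close_matches(target, words, n=1, cutoff=cutoff)
    return bool(matches)


rows = [
    "$2.10 Cumulative Convertible Preffered Stock, $25 par value",
    "5.95% Preferres Stock",
    "Class A Preffered",
    "Common Stock, $1 par value",  # made-up counterexample, not from the question
]
for row in rows:
    print(contains_fuzzy_word(row), row)
```

For a big table, scanning every row this way is slow; the same idea is usually pushed into the database through an edit-distance or phonetic function if the engine provides one.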
Many sites offer some statistics like "The hottest topics in the last 24h". For example, Topix.com shows this in its section "News Trends". There, you can see the topics which have the fastest growing number of mentions.
I want to compute such a "buzz" for a topic, too. How could I do this? The algorithm should weight the topics which a...
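As a rough sketch of one way to score such "buzz", a topic's mention count in the most recent window can be compared with its count in the preceding window, with additive smoothing so that topics with almost no history do not dominate. The function names, the two-window layout, and the smoothing constant are all assumptions made for illustration, and the sample counts are invented.

```python
def buzz_score(recent_count, baseline_count, smoothing=5.0):
    """Smoothed growth ratio of mentions: recent window vs. previous window.

    `smoothing` is an assumed constant that dampens the score of topics
    with very few mentions in either window.
    """
    return (recent_count + smoothing) / (baseline_count + smoothing)


def rank_topics(counts, smoothing=5.0):
    """Sort topics by buzz score, highest first.

    `counts` maps topic -> (mentions in last 24h, mentions in previous 24h).
    """
    return sorted(
        counts,
        key=lambda topic: buzz_score(*counts[topic], smoothing=smoothing),
        reverse=True,
    )


# Invented sample data: (last 24h, previous 24h)
sample = {
    "topic_a": (50, 10),    # fast growth from a small base
    "topic_b": (200, 190),  # popular but flat
    "topic_c": (5, 0),      # brand new, very few mentions
}
print(rank_topics(sample))
```

A larger smoothing value favors topics with substantial volume, while a smaller one favors brand-new topics; which is preferable depends on the weighting the site wants.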
I have a large number of basic text, rtf, html, pdf and chm files that I store on a USB key as a personal knowledge base.
Up until now, to retrieve information, I've used standard file-search tools (Windows Search, grep, etc.). However, these days a brute-force search can take minutes due to sheer data size. Also PDF and CHM are als...
I was wondering how much time is required to convey information regarding the tilt and position (not GPS) of one particular iPhone to another. Could 2 iPhones send and receive this information simultaneously? What about 3 iPhones? I'm interested in an application that is able to simultaneously send and receive and make conditional ...
Hi,
I'm working on a cross-language information retrieval system that takes queries in English and searches documents in Russian. To evaluate this system it would be nice to have a collection of Russian documents to search through. Does anyone out there know of a collection of documents I can search or websites from which I can easily scrape to...
I am writing a Windows application (with Borland C++ Builder) that stores a large number of text files. I want users to be able to search these files very fast, so I need an indexing and search library. I do not use a database, but my own file format for storing the documents (all are in a single file).
Are there such libraries for Windows?...
It seems that there is no Google Alerts API.
Firstly, how would you get Google Alerts information into a database other than by parsing the text of the email message that Google sends you?
If you must parse text, how would you go about parsing out the relevant pieces of the email message?
...
Has anybody successfully used the spaced repetition concepts embodied in programs like SuperMemo in the context of programming?
The motivation for this question: I'm increasingly having to look up things I knew.
Reading this Wired piece "Want to Remember Everything You'll Ever Learn? Surrender to This Algorithm" has me wondering if this...