information-retrieval

Fast Sequence Alignment on Unicode Strings

I want to run something like the BLAST algorithm to query a large database of Unicode strings. Most alignment software, such as BLAST, expects nucleotide or protein strings as input, but my input could contain any Unicode character. Is anyone aware of a piece of software that will let me do this? The scoring matrix coul...

IR vs Data mining vs ML

People often throw around the terms IR, ML, and data mining, but I have noticed a lot of overlap between them. From people with experience in these fields, what exactly draws the line between these? ...

Search for (Very) Approximate Substrings in a Large Database

I am trying to search for long, approximate substrings in a large database. For example, a query could be a 1000-character substring that differs from the match by a Levenshtein distance of several hundred edits. I have heard that indexed q-grams could do this, but I don't know the implementation details. I have also heard that L...
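One way to picture the indexed q-gram idea mentioned above is the following minimal sketch. It is only a candidate filter, not a full approximate matcher: every returned document would still need to be verified with a real edit-distance computation. The function names and the `min_shared` threshold are hypothetical and chosen for illustration.

```python
from collections import Counter, defaultdict


def qgrams(text, q=3):
    """Return all overlapping substrings of length q."""
    return [text[i:i + q] for i in range(len(text) - q + 1)]


def build_qgram_index(documents, q=3):
    """Map each q-gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for gram in qgrams(text, q):
            index[gram].add(doc_id)
    return index


def candidate_documents(index, query, q=3, min_shared=2):
    """Return ids of documents sharing at least min_shared distinct query q-grams."""
    counts = Counter()
    for gram in set(qgrams(query, q)):
        for doc_id in index.get(gram, ()):
            counts[doc_id] += 1
    return sorted(doc_id for doc_id, n in counts.items() if n >= min_shared)
```

The threshold trades recall for speed: a lower `min_shared` tolerates more edits but passes more candidates to the expensive verification step.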

Relevance Flow Graph

What is a Relevance Flow Graph, and how is it used in information retrieval? ...

How to cluster evolving data streams

Hi guys, I want to incrementally cluster text documents, reading them as data streams, but there seems to be a problem. Most term weighting options are based on the vector space model, using TF-IDF as the weight of a feature. However, in our case the IDF of an existing attribute changes with every new data point and hence previous clusterin...
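A possible way around the shifting IDF described above is to store only raw term counts and document frequencies, and compute the TF-IDF weights lazily whenever two documents are compared. The sketch below assumes the plain `tf * log(N / df)` weighting; the class and method names are hypothetical, and it does not address how the clusters themselves should be updated.

```python
import math
from collections import Counter


class StreamingTfIdf:
    """Keeps raw counts so weights always reflect the documents seen so far."""

    def __init__(self):
        self.num_docs = 0
        self.doc_freq = Counter()   # term -> number of documents containing it
        self.term_counts = []       # one Counter of raw term counts per document

    def add(self, tokens):
        """Register a new document and return its id."""
        counts = Counter(tokens)
        self.term_counts.append(counts)
        self.num_docs += 1
        self.doc_freq.update(counts.keys())
        return len(self.term_counts) - 1

    def weights(self, doc_id):
        """Compute TF-IDF weights for doc_id using the current statistics."""
        counts = self.term_counts[doc_id]
        return {
            term: tf * math.log(self.num_docs / self.doc_freq[term])
            for term, tf in counts.items()
        }
```

Because nothing derived from the IDF is stored, adding a document never invalidates earlier vectors; the cost is recomputing weights at comparison time.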

HTML\PHP - How to elicit user (visitor)'s info

How can I obtain a user's information when he/she is visiting my website? IP Address; MAC Address; User Profile Name; OS Name; OS Version; OS Registered to (Name/Company); Computer Name; Browser Name; Browser Version; ISP Name/Internet Connection Provider Name; Connection Type; Location - City/Country (based on IP) ...

Storing an inverted index

Hello, I am working on a project on Info Retrieval. I have made a full inverted index using Hadoop/Python. Hadoop outputs the index as (word, documentlist) pairs, which are written to a file. For quick access, I have created a dictionary (hash table) from that file. My question is, how do I store such an index on disk that also ha...

Which information is stored by Google crawler?

... and how does the web crawler infer the semantics of the information on a website? List the ranking signals in a separate answer. ...

Python: Storing a list value associated with a key in dictionary

I know how Python dictionaries store key: value tuples. In the project I'm working on, I'm required to store a key associated with a value that's a list. For example: key -> [0,2,4,5,8], where the key is a word from a text file and the list value contains ints that stand for the DocIDs in which the word occurs. as soon as I find the same word in another d...
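The key -> list-of-DocIDs mapping described above can be sketched with `collections.defaultdict(list)`, which creates the empty list the first time a word is seen. The function name and the whitespace tokenization are assumptions made for illustration.

```python
from collections import defaultdict


def build_postings(documents):
    """Map each word to the list of DocIDs (list positions) it occurs in."""
    postings = defaultdict(list)
    for doc_id, text in enumerate(documents):
        # set() ensures each DocID is appended at most once per word
        for word in set(text.split()):
            postings[word].append(doc_id)
    return postings
```

Since documents are visited in increasing DocID order, every posting list comes out sorted without an extra sort.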

Python: intersection of lists/sets

def boolean_search_and(self, text):
    results = []
    and_tokens = self.tokenize(text)
    tokencount = len(and_tokens)
    term1 = and_tokens[0]
    print ' term 1:', term1
    term2 = and_tokens[1]
    print ' term 2:', term2
    #for term in and_tokens:
    if term1 in self._inverted_index.keys():
        resultlist1 = self._i...
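A boolean AND over any number of query terms, as the excerpt above attempts for two, can be sketched with set intersection. This is a standalone illustration written for Python 3, not the asker's method: it assumes the inverted index is a plain dict mapping each term to an iterable of document ids, and the function name is hypothetical.

```python
def boolean_and(inverted_index, tokens):
    """Return the sorted ids of documents that contain every token."""
    if not tokens:
        return []
    posting_sets = []
    for token in tokens:
        if token not in inverted_index:
            return []  # a missing term makes the AND result empty
        posting_sets.append(set(inverted_index[token]))
    # Intersect starting from the smallest set to keep intermediate results small
    posting_sets.sort(key=len)
    return sorted(set.intersection(*posting_sets))
```

For example, with `{"spam": [0, 2, 5], "eggs": [2, 3, 5]}` the query `["spam", "eggs"]` yields `[2, 5]`.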

Wikipedia: pages across multiple languages

Hi, I want to use a Wikipedia dump for my project, which requires the following information: for a given Wikipedia entry, I want to know which other languages contain the page. I want downloadable data in CSV or another common format. Is there a way to get this data? Thanks, Bala ...

CF - Information about device, platform etc...

Hi! I would like to get some information about the device, OS, etc. from the device running my app. I need to log this data so I can run some diagnostics later. I think this data is located in Microsoft.Win32.Registry, but that means I need to know all the keys to access the values. Any ideas? ...

Kindly review the python code to boost its performance

Hello, I'm doing an Information Retrieval task. I built a simple search engine. The InvertedIndex is a Python dictionary object which is serialized (pickled, in Python terminology) to a file. The size of this file is just 6.5MB. So my code just unpickles it, searches it for the query, and ranks the matching documents according ...

N-gram related question - C# algorithm

Hi, I intend to use the n-gram part/algorithm of this code: http://www.codeproject.com/KB/cs/tfidf.aspx The algorithm produces these trigram results: t th the he e q qu qui uic ick ck k r re red ed d for the input: the quick red However, this source: http://en.wikipedia.org/wiki/Trigram reckons it should be: the qui k_r he_ u...
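Character n-grams are usually produced with a sliding window, and whether the ends of the string are padded is a matter of convention, which may explain why two sources list different results. The sketch below is in Python rather than the C# of the linked code, and the `pad` parameter is a hypothetical name for illustration.

```python
def char_ngrams(text, n=3, pad=None):
    """Return overlapping character n-grams, optionally padding both ends."""
    if pad is not None:
        # n - 1 pad characters let the first and last characters
        # appear in every window position
        text = pad * (n - 1) + text + pad * (n - 1)
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

Without padding, `char_ngrams("red")` gives only `["red"]`; with `pad="_"` it gives `["__r", "_re", "red", "ed_", "d__"]`.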

Python: Dictionary of list of lists

def makecounter():
    return collections.defaultdict(int)

class RankedIndex(object):
    def __init__(self):
        self._inverted_index = collections.defaultdict(list)
        self._documents = []
        self._inverted_index = collections.defaultdict(makecounter)

    def index_dir(self, base_path):
        num_files_indexed = 0
        allfiles = os.listd...

Besides BM25, what other ranking functions exist?

Besides BM25, what other ranking functions exist? Where can I find information on this topic? ...

Searching a normal query in an inverted index

I have a full inverted index in the form of a nested Python dictionary. Its structure is: {word : { doc_name : [location_list] } } For example, if the dictionary is called index, then for the word "spam" the entry would look like: { spam : { doc1.txt : [102,300,399], doc5.txt : [200,587] } } so that, the documents containing...
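Given the nested structure described above, a multi-word query can be answered by intersecting the document names stored under each word. The sketch assumes that a "normal" query means every word must appear in the document (AND semantics) and that words are lowercased and split on whitespace; the function name is hypothetical.

```python
def documents_containing_all(index, query):
    """Return sorted names of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return []
    # The keys of each inner dict are the document names for that word
    doc_sets = [set(index.get(word, {})) for word in words]
    return sorted(set.intersection(*doc_sets))
```

The location lists are untouched here; they would only be needed for phrase or proximity matching.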

How to obtain a log of web search queries?

It would help if I could do a search log analysis for my research. Is it possible to use a search API (Google, Yahoo, Bing) to create a log of web search queries over a specified time span, or is it available on request? ...

What is a proper tokenization algorithm? & Error: TypeError: coercing to Unicode: need string or buffer, list found

Hello, I'm doing an Information Retrieval task. As part of pre-processing I want to do stopword removal, tokenization, and stemming (Porter Stemmer). Initially, I skipped tokenization. As a result I got terms like this: broker broker' broker, broker. broker/deal broker/dealer' broker/dealer, broker/dealer. broker/dealer; broker/deale...
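Variants such as broker, and broker. collapse into one term once punctuation is stripped during tokenization. A minimal regex-based sketch follows; it assumes that splitting broker/dealer into two tokens is acceptable, uses a tiny illustrative stopword set rather than a real list, and leaves stemming out. The names `tokenize` and `STOPWORDS` are hypothetical.

```python
import re

# Illustrative stopword set only; a real list would be much longer
STOPWORDS = {"a", "an", "and", "of", "the", "to"}


def tokenize(text, stopwords=STOPWORDS):
    """Lowercase the text, keep alphanumeric runs, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in stopwords]
```

Note that `tokenize` expects a single string; passing it a list of strings would raise an error, so each document's text should be tokenized separately.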