I want to run something like the BLAST algorithm to query a large database of Unicode strings. Most alignment software, such as BLAST, expects nucleotide or protein strings as input, but my input could contain any Unicode character. Is anyone aware of software that will let me do this? The scoring matrix coul...
People often throw around the terms IR, ML, and data mining, but I have noticed a lot of overlap between them.
From people with experience in these fields, what exactly draws the line between these?
...
I am trying to search for long, approximate substrings in a large database. For example, a query could be a 1000 character substring that could differ from the match by a Levenshtein distance of several hundred edits. I have heard that indexed q-grams could do this, but I don't know the implementation details. I have also heard that L...
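For what it's worth, the indexed q-gram idea mentioned above can be sketched roughly as follows. This is only a minimal illustration in Python, not a tuned implementation: the function names (`qgrams`, `build_qgram_index`, `candidates`), the choice of `q=3`, and the `min_shared` threshold are all assumptions made up for the example. Python 3 strings are Unicode, so nothing here is limited to a nucleotide or protein alphabet.

```python
from collections import defaultdict


def qgrams(text, q=3):
    """Return every overlapping substring of length q."""
    return [text[i:i + q] for i in range(len(text) - q + 1)]


def build_qgram_index(documents, q=3):
    """Map each q-gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for gram in qgrams(text, q):
            index[gram].add(doc_id)
    return index


def candidates(index, query, q=3, min_shared=2):
    """Return ids of documents sharing at least min_shared distinct
    q-grams with the query (a filter step, not a final answer)."""
    counts = defaultdict(int)
    for gram in set(qgrams(query, q)):
        for doc_id in index.get(gram, ()):
            counts[doc_id] += 1
    return sorted(d for d, c in counts.items() if c >= min_shared)


# Hypothetical toy data, only to show the call pattern.
docs = ["the quick red fox", "lorem ipsum dolor", "a quick rad fox"]
index = build_qgram_index(docs)
matches = candidates(index, "quick red")
```

The index only narrows the search: each candidate would still need to be checked with an actual edit-distance or alignment computation.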
What is a Relevance Flow Graph, and how is it used in information retrieval?
...
Hi Guys,
I want to incrementally cluster text documents, reading them as data streams, but there seems to be a problem. Most term weighting options are based on the vector space model, using TF-IDF as the weight of a feature. However, in our case the IDF of an existing attribute changes with every new data point and hence previous clusterin...
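One possible way to look at the changing-IDF problem described above, as a sketch only: keep raw term counts and document frequencies, and compute the TF-IDF weights on demand from whatever has been seen so far. The class name `StreamingTfIdf` and its methods are hypothetical, and the plain `log(N / df)` form of IDF is an assumption; it is not taken from the question.

```python
import math
from collections import Counter


class StreamingTfIdf:
    """Stores raw counts so weights always reflect the current stream."""

    def __init__(self):
        self.num_docs = 0
        self.doc_freq = Counter()   # term -> number of documents containing it
        self.term_counts = []       # one Counter of raw term counts per document

    def add(self, tokens):
        """Register a new document and return its id."""
        counts = Counter(tokens)
        self.term_counts.append(counts)
        self.num_docs += 1
        self.doc_freq.update(counts.keys())  # each distinct term counts once
        return len(self.term_counts) - 1

    def weights(self, doc_id):
        """TF-IDF weights using the document frequencies seen so far."""
        return {
            term: tf * math.log(self.num_docs / self.doc_freq[term])
            for term, tf in self.term_counts[doc_id].items()
        }


stream = StreamingTfIdf()
first = stream.add(["x", "y"])
second = stream.add(["x", "z"])
```

With this layout nothing stored goes stale when a new document arrives; the cost is that weights are recomputed whenever two documents are compared.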
How can you collect a user's information when they visit your website?
IP Address
MAC Address
User Profile Name
OS Name
OS version
OS Registered to (Name/Company)
Computer Name
Browser Name
Browser Version
ISP Name/Internet Connection Provider Name
Connection Type
Location - City/Country (based on IP)
...
Hello,
I am working on a project on Information Retrieval.
I have built a full inverted index using Hadoop/Python.
Hadoop outputs the index as (word, documentlist) pairs, which are written to a file.
For quick access, I have created a dictionary (hash table) from the above file.
My question is, how do I store such an index on disk that also ha...
.. and how the web crawler infers the semantics of information on the website?
List each ranking signal in a separate answer.
...
I know how Python dictionaries store key: value pairs. In the project I'm working on, I need to store a key associated with a value that is a list.
ex:
key -> [0,2,4,5,8]
where,
key is a word from text file
the list value contains ints that stand for the DocIDs in which the word occurs.
as soon as I find the same word in another d...
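A minimal sketch of the key -> list-of-DocIDs layout described above, using `collections.defaultdict`. The names `postings` and `add_occurrence` are made up for the example, and the duplicate check assumes documents are processed in increasing DocID order.

```python
from collections import defaultdict

# word -> list of DocIDs in which the word occurs
postings = defaultdict(list)


def add_occurrence(word, doc_id):
    """Append doc_id to the word's list unless it was just recorded."""
    doc_ids = postings[word]
    # Assumes documents are processed in increasing DocID order,
    # so a duplicate can only be the last element of the list.
    if not doc_ids or doc_ids[-1] != doc_id:
        doc_ids.append(doc_id)


add_occurrence("key", 0)
add_occurrence("key", 0)   # same document again: ignored
add_occurrence("key", 2)
```

`defaultdict(list)` creates the empty list the first time a word is seen, so there is no need to test whether the key already exists.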
def boolean_search_and(self, text):
    results = []
    and_tokens = self.tokenize(text)
    tokencount = len(and_tokens)
    term1 = and_tokens[0]
    print ' term 1:', term1
    term2 = and_tokens[1]
    print ' term 2:', term2
    #for term in and_tokens:
    if term1 in self._inverted_index.keys():
        resultlist1 = self._i...
Hi,
I want to use a Wikipedia dump for my project. I need the following information:
For a given Wikipedia entry, I want to know which other languages have the page.
I want downloadable data in CSV or another common format.
Is there a way to get this data?
Thanks
Bala
...
Hi!
I would like to get some information about the device, OS, etc. from the device running my app.
I need to log this data so I can run some diagnostics later.
I think this data is located in Microsoft.Win32.Registry, but that means I need to know all the keys to access the values.
Any ideas?
...
Hello,
I'm doing an Information Retrieval task. I built a simple search engine. The InvertedIndex is a Python dictionary object which is serialized (pickled, in Python terminology) to a file. The size of this file is just 6.5MB.
So my code just unpickles it, searches it for the query, and ranks the matching documents according ...
Hi,
I am intending to use the n-gram part/algorithm of this code:
http://www.codeproject.com/KB/cs/tfidf.aspx
The algorithm produces these tri-gram results:
t
th
the
he
e q
qu
qui
uic
ick
ck
k r
re
red
ed
d
for:
the quick red
However, this source:
http://en.wikipedia.org/wiki/Trigram
reckons it should be:
the qui k_r
he_ u...
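As a rough illustration of why the two outputs can differ: a sliding window over the raw string gives only full three-character grams, while padding both ends with spaces first produces the shorter-looking grams at the boundaries, like the list quoted from the CodeProject code above. The function `char_ngrams` and its `pad` parameter are hypothetical names for this sketch, not the API of either source.

```python
def char_ngrams(text, n=3, pad=None):
    """Return overlapping character n-grams of text.

    If pad is given, n - 1 copies of it are added to both ends, which
    yields the partial-looking grams at the string boundaries.
    """
    if pad is not None:
        text = pad * (n - 1) + text + pad * (n - 1)
    return [text[i:i + n] for i in range(len(text) - n + 1)]


plain = char_ngrams("the quick red")
padded = char_ngrams("the quick red", pad=" ")
```

Here `plain` starts at "the" and ends at "red", while `padded` also includes grams such as two spaces followed by "t" at the start and "d" followed by two spaces at the end.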
def makecounter():
    return collections.defaultdict(int)
class RankedIndex(object):
    def __init__(self):
        self._inverted_index = collections.defaultdict(list)
        self._documents = []
        self._inverted_index = collections.defaultdict(makecounter)
    def index_dir(self, base_path):
        num_files_indexed = 0
        allfiles = os.listd...
Besides BM25, what other ranking functions exist? Where can I find information on this topic?
...
I have a full inverted index in the form of a nested Python dictionary. Its structure is:
{word : { doc_name : [location_list] } }
For example, if the dictionary is called index, then for the word "spam" the entry would look like:
{ spam : { doc1.txt : [102,300,399], doc5.txt : [200,587] } }
so that, the documents containing...
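A small sketch of how lookups against the nested structure above might look. The "spam" entry is copied from the example; the "eggs" entry and the helper names `docs_containing` and `docs_containing_all` are made up for illustration.

```python
index = {
    "spam": {"doc1.txt": [102, 300, 399], "doc5.txt": [200, 587]},
    "eggs": {"doc1.txt": [103], "doc2.txt": [7]},  # hypothetical second entry
}


def docs_containing(index, word):
    """Return the sorted names of documents that contain word."""
    return sorted(index.get(word, {}))


def docs_containing_all(index, words):
    """Return the sorted names of documents that contain every word."""
    doc_sets = [set(index.get(w, {})) for w in words]
    return sorted(set.intersection(*doc_sets)) if doc_sets else []


spam_docs = docs_containing(index, "spam")
both = docs_containing_all(index, ["spam", "eggs"])
```

Because the inner dictionary is keyed by document name, the set of documents for a word is just the inner dictionary's keys, and a boolean AND becomes a set intersection.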
It would help if I could do a search log analysis for my research. Is it possible to use a search API (Google, Yahoo, Bing) to create a log of web search queries over a specified time span, or is it available on request?
...
Hello,
I'm doing an Information Retrieval task. As part of pre-processing, I want to do the following:
Stopword removal
Tokenization
Stemming (Porter Stemmer)
Initially, I skipped tokenization. As a result I got terms like this:
broker
broker'
broker,
broker.
broker/deal
broker/dealer'
broker/dealer,
broker/dealer.
broker/dealer;
broker/deale...
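To illustrate how a tokenization step could collapse variants like the ones listed above, here is a minimal sketch using a regular expression. The pattern and the name `tokenize` are assumptions for the example; in particular, splitting on "/" turns "broker/dealer" into two terms, which may or may not be what is wanted.

```python
import re

# Assumption: a token is a run of ASCII letters or digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and return runs of letters and digits."""
    return TOKEN_RE.findall(text.lower())


tokens = tokenize("broker' broker, broker. broker/dealer;")
```

Stopword removal and Porter stemming would then run on the resulting tokens rather than on the raw whitespace-separated strings.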