information-retrieval

Information mining, classification, modification

Any examples, tips, or guidance for the following scenario? I have retrieved updates from several different news websites. I then analyse that information to predict current trends in the world. I could only find information on data mining when searching for the above idea, but it is for database systems. While data mining is similar to...

Similarity Between Users Based On Votes

Let's say I have a set of users, a set of songs, and a set of votes on each song:

=========== =========== =======
User        Song        Vote
=========== =========== =======
user1       song1       [score]
user1       song2       [score]
user1       song3       [score]
user2       song1       [score]
user2       song2       [score]
user...

How to retrieve Google pages

Dear all, I am now using a web tool, http://fiddesktop.cs.northwestern.edu/mmp/scrape?url=, to parse a webpage. For example, to parse the New York Times homepage, we enter http://fiddesktop.cs.northwestern.edu/mmp/scrape?url=http%3A//www.nytimes.com/pages/world/index.html in the address bar of our browser, and it parses things nicely for us. ...

Is there a popular tool for crawling web data?

I'm doing work on information extraction, and I need a tool to crawl data from web pages. Is there a popular one for Windows? ...

Trying to get Facebook/Twitter/MySpace statuses and other data for statistics

Hi, I was wondering if anyone knows how to gather data from millions of people around the globe via these social networks in order to compute statistics. I need this for a project I'm trying to do, and I do not need to know the actual person posting such information (such as statuses, comments, information about them, etc.) so as not to brea...

Term clustering library?

Hi, does anybody know of an open-source/free library that does term clustering? Thanks, yaniv ...

Finding sets that are a subset of a specific set

Let's say I have 4 different values A, B, C, D with sets of identifiers attached: A={1,2,3,4,5}, B={8,9,4}, C={3,4,5}, D={12,8}. Given a set S of identifiers {1,30,3,4,5,12,8}, I want it to return C and D, i.e. retrieve all sets from a group of sets for which S is a superset. Are there any algorithms to perform this task efficiently (Prefer...
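
One possible way to approach the example above, sketched in Python: build an inverted index from each identifier to the sets that contain it, then count how many members of each set appear in S. A set is a subset of S exactly when its count equals its size. The names `build_index` and `subsets_of` are hypothetical, and this is only one option among several, not necessarily the method the asker ended up with.

```python
from collections import defaultdict

def build_index(candidates):
    """Map each identifier to the names of the sets that contain it."""
    index = defaultdict(list)
    for name, members in candidates.items():
        for ident in members:
            index[ident].append(name)
    return index

def subsets_of(candidates, index, s):
    """Return the names of candidate sets fully contained in s.

    Note: an empty candidate set is never reported by this sketch,
    because it has no identifier through which it could be reached.
    """
    hits = defaultdict(int)
    for ident in set(s):
        for name in index.get(ident, ()):
            hits[name] += 1
    # A set is a subset of s when every one of its members was seen.
    return sorted(n for n, count in hits.items() if count == len(candidates[n]))

candidates = {
    "A": {1, 2, 3, 4, 5},
    "B": {8, 9, 4},
    "C": {3, 4, 5},
    "D": {12, 8},
}
S = {1, 30, 3, 4, 5, 12, 8}
index = build_index(candidates)
print(subsets_of(candidates, index, S))  # ['C', 'D']
```

The index is built once and reused across queries, so the cost of each lookup depends on the identifiers in S rather than on the total number of stored sets.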

Search in Folksonomies. How to tackle synonymy problem?

Hi all, can someone shed some light on how searching is done on websites like del.icio.us? If I enter "js" (1), "javascript" (2) or "java script" (3) as my query on delicious, I'm pointed to resources about JavaScript. However, depending on the query, the returned result sets are different (the del.icio.us system returns a different set of boo...

How to do related questions autopopulate

I want to get related [things/questions] in my app, similar to what StackOverflow does when you tab out of the Title field. I can think of only one way to do it, which I think might be fast enough: do a search for the title in the corpus of titles of all [things], and return the first x matches. We can use whatever search is being used for ...

C#: How should I save my data?

I have two structs like so: public struct KeyLog { Keys key; DateTime time; } public struct MouseLog { MouseEvents mouse; Point coordinates; DateTime time; } For every keyboard key press and mouse button click I wish to save this data, but I do not know which way would be the most efficient to store/handle it. Ar...

Self-indexing (and traditional indexing) algorithms - Implementations and advice to share?

As part of a research project I'm currently looking for open-source implementations of self-indexing algorithms, i.e. a compressed form of the traditional inverted index yielding nice characteristics such as faster lookup and/or less consumed space. Do you know of any open-source implementations of self-indexing algorithms? Do you have ...

What is a non-serializable schedule in a transaction database?

Can anyone explain to me what non-serializable means in a transaction DB? Please give me an example. r1(x) r2(x) w1(y) c2 c1: is this non-serializable? ...

Best Java Open Source Text Mining Framework

Hello everyone, I want to know what the best open-source Java-based framework for text mining is, to use both machine learning and dictionary methods. I'm using Mallet, but there is not much documentation and I do not know if it will fit all my requirements. Thanks in advance. Best regards, ukrania ...

Text similarity function for strict document similarity

I'm writing a piece of Java software that has to make the final judgement on the similarity of two documents encoded in UTF-8. The two documents are very likely to be the same, or slightly different from each other, because they have many features in common like date, location, creator, etc., but their text is what decides if they reall...

How do search engines conduct 'AND' operation?

Consider the following search results: Google for 'David' - 591 million hits in 0.28 sec; Google for 'John' - 785 million hits in 0.18 sec. OK. Pages are indexed, so it only needs to look up the count and the first few items in the index table, and the speed is understandable. Now consider the following search with an AND operation: Goog...
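
How Google implements this internally is not stated here. As a hedged illustration only, the textbook way to evaluate an AND query over an inverted index is to merge the two sorted posting lists (lists of document IDs) in a single linear pass. The tiny `index` below and the function name `intersect` are made up for the example.

```python
def intersect(p1, p2):
    """Intersect two sorted lists of document IDs with a linear merge."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Document contains both terms.
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical posting lists: term -> sorted IDs of documents containing it.
index = {
    "david": [1, 4, 7, 9, 12],
    "john": [2, 4, 9, 10, 12, 15],
}
print(intersect(index["david"], index["john"]))  # [4, 9, 12]
```

The merge touches each posting at most once, so its cost is linear in the combined length of the two lists.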

Getting Lucene to return only unique threads (indexing both threads and posts)

I have a StackOverflow-like system where content is organised into threads, each thread having content of its own (the question body / text), and posts / replies. I'm building the ability to search this content via Lucene, and I have decided that, if possible, I would like to index individual posts (it makes the index easier to update, and m...

Access information from one webpart and use it in another webpart in SharePoint 2010

My problem is this: I am using SharePoint 2010, and I have a form created in SharePoint Designer 2010. Above that form I have a Silverlight webpart. Now I need to be able to access information from the Silverlight webpart when I click on it and insert that information in the form below it. Does anyone have any insight on how to do that?...

TF-IDF, am I understanding it right?

Hey everyone, I am interested in doing some document clustering, and right now I am considering using TF-IDF for this. If I am not wrong, TF-IDF is particularly used for evaluating the relevance of a document given a query. If I do not have a particular query, how can I apply TF-IDF to clustering? ...

Assistance with building an inverted-index

It's part of an information retrieval thing I'm doing for school. The plan is to create a hashmap of words, using the first two letters of the word as a key and any words with those two letters saved as a string value. So, hashmap["ba"] = "bad barley base". Once I'm done tokenizing a line I take that hashmap, serialize it, and append i...
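
A minimal Python sketch of the bucketing step described above, assuming whitespace tokenization and lowercasing (neither is specified in the question). The function name `bucket_by_prefix` is hypothetical, and the serialization step is left out because the excerpt does not say how it is done. Note that this reproduces the asker's prefix-to-words map; a conventional inverted index would instead map each term to the documents that contain it.

```python
from collections import defaultdict

def bucket_by_prefix(line):
    """Group the words of one line under their first two letters."""
    buckets = defaultdict(list)
    for word in line.lower().split():
        # Words shorter than two letters have no two-letter key; skip them.
        if len(word) >= 2 and word not in buckets[word[:2]]:
            buckets[word[:2]].append(word)
    # Store each bucket as a single space-separated string, as in the question.
    return {prefix: " ".join(words) for prefix, words in buckets.items()}

hashmap = bucket_by_prefix("bad barley base")
print(hashmap["ba"])  # bad barley base
```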

Any tips for developing an advertisement system like Google's AdSense?

In order to show a best-match ad each time, there are at least these things to do: retrieve the main information of the current page; get an ad that's related to the information retrieved above. But the above is almost impossible for a non-search-engine company. So what's the practical way for a non-Google company to approach a best ...