views:

351

answers:

5

Hi All, I am looking for an algorithm, hint or any source code that can solve my following problem.

I have a folder it contains many text files. I read them and store all text in STRING. Now I want to to calculate, if any of the word appeared in other files or no. ( I know its not clear let me give an example)

For example i have two documents: Doc A => "brown fox jump" Doc B => "dog not jump" Doc C = > "fox jump dog"

Lets say my program read the first document and now first word is "brown" now my program will check if this word is also appeared in any other document? So the answer would be 0. Now it will check again for 2nd word "fox", it will give output that yes it appeared in (Doc C) so on...... Now it will read Doc B and it will check if dog appeared in other document? Answer would be (Doc C) so on....

Any advice or pseudo code?

Hint: It is also called inverse document frequency ( Idf ). I know what is idf.

+3  A: 

Hint: HashMap mapping Strings to Lists of files.

GregS
+4  A: 

Like GregS said, use HashMap. I'm not posting any code, because I think this is a homework and I want to give to you the opportunity to create it on your own, but the outline is:

  1. Open new document
  2. For every word, look at your hashmap if it's already there. If it isn't, create a new key in HashMap with this word, and in that position add the new document (the filename). If it is, just add the filename of the document.

For example, if you have: DocA: Brown fox jump DocB: Fox jump dog

You would open DocA and traverse its contents. 'brown' is not in your hashmap, so you would add a new element with key 'brown' and value 'DocA'. The same with 'fox' and 'jump'. Then you would open DocB. 'fox' is already in your hashmap, so you would add to its value DocB, (the value would be 'DocA DocB'). Maybe using an ArrayList (in Java) would help.

Alex
Thanks alex. It's good tip. I will try to make it. I am laughing to read that you all great people think its assignment. It reminded me my school days. I am professional programmer. I just recently moved to Java from php. So you can say it is my current project's small part. Main project is online book store with some new ideas.
+1  A: 

It might be helpful to think about the problem in terms of 'I have this set of words for all documents together' and 'I could store somehow in which of the documents each of these words appear'. Given such a representation of your data it would be very easy to determine if a given word appears in multiple documents. On how to do this, others have provided hints here.

Fabian Steeg
A: 

HashMap mapping Strings to Integers. Integers are a immutable so there is a bit of hustle to "increment" but not too much. You can override the put() method do that.

David Soroko
I assume you mean mapping Strings to *Lists* of Integers? This serves the same purpose as GregS and Alex's solution but less clear. A list of documents, in Java, is really just a list of pointers, so a list of documents is definitely more readable and easier to work with.
MatrixFrog
+1  A: 

Just another idea different then all valuable answers, i admit hash looks better, i just wanted to see it in another angle.

i would sort all words in each document and compare each document with each other.

For example docA > brown, fox, jump; docB-> doc, jump, not docC-> dog, fox, jump

comparing them comes like this

 until there is a single document with words
   get first element of documents
   compare the most descending first element if that element exists more than once reserve it
   throw the one that is the most descending (in my case)

so in first comparison

docA -> fox, jump docB -> doc, jump, not docC -> dog, fox, jump

in second comparison

docA -> fox, jump docB -> jump, not docC -> dog, fox

in third comparison

docA -> fox, jump docB -> jump, not docC -> fox, jump

reserve fox in 4 th comparison, reserve jump in 5 th comparison.

erdemoo