views: 611
answers: 13
Hello

I have an array of integers.

I need an O(n) algorithm to find if the array contains a number and its square; one pair is sufficient.

I tried to do it myself, but I have only managed to find a solution in O(n²).

Thanks

Edit: as I was asked: I thought about using counting sort, but the memory usage would be too big.

A: 

In which range are your integers? Can there be negative integers? If you give me more info about your integers, I can probably write you an O(n) solution.

Webinator
Remember, this is `homework`; the idea is to help the student on his/her way to finding his/her own solution...
mjv
A: 

If the array is not sorted, you won't be able to do O(n).

If it is sorted, you can make use of that property to do it in one pass, like so:

foreach e in array
    if squares contains e
        return true
    remove all less than e from squares
    add e * e to squares
return false

Where squares is, say, a HashSet.

If it's not sorted, you can sort it in O(n log n) and then use this method to check for squares, which will still be faster than the naive solution on a large enough data set.
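
For concreteness, here is a minimal C++ sketch of this idea (my code, not part of the original answer). It realizes 'squares' as a deque rather than a HashSet: because the input is sorted, squares are generated in increasing order, so "remove all less than e" is just popping from the front. Assumes non-negative values and that e*e does not overflow:

#include <deque>
#include <vector>

bool containsNumberAndSquare(const std::vector<long long>& sorted)
{
    std::deque<long long> squares;  // pending squares, kept in increasing order
    for (long long e : sorted)
    {
        while (!squares.empty() && squares.front() < e)  // remove all less than e
            squares.pop_front();
        if (!squares.empty() && squares.front() == e)    // does squares contain e?
            return true;
        squares.push_back(e * e);
    }
    return false;
}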

Anon.
Well, if the array is sorted then it's very easy to do it in O(n). Sorting the array in O(n log n) won't help, since that means the whole algorithm is O(n log n) even in the good case...
gillyb
+11  A: 

  1. Create a new array twice the length of the input array. O(2N)
  2. Copy all of the numbers in. O(N)
  3. Copy the squares of the numbers in. O(N)
  4. Radix sort (we can, since they are all ints). O(N)
  5. Iterate over once to see if there are two identical numbers one after the other. O(N)
  6. Profit! O(1)
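
A hedged C++ sketch of this recipe (my code, not part of the answer). To address the duplicate issue raised in the comments below, each entry carries a tag bit: originals are tagged 0 and squares 1, so equal originals never match each other. Assumes non-negative 32-bit inputs; squares above UINT32_MAX can never match an input, so they are skipped:

#include <cstddef>
#include <cstdint>
#include <vector>

// LSD radix sort on bytes: each pass is a stable counting sort, O(N).
static void radixSortByBytes(std::vector<std::uint64_t>& v)
{
    std::vector<std::uint64_t> buf(v.size());
    for (int shift = 0; shift < 40; shift += 8)  // keys fit in 33 bits: 5 passes
    {
        std::size_t count[257] = {0};
        for (std::uint64_t x : v) ++count[((x >> shift) & 0xFF) + 1];
        for (int i = 0; i < 256; ++i) count[i + 1] += count[i];
        for (std::uint64_t x : v) buf[count[(x >> shift) & 0xFF]++] = x;
        v.swap(buf);
    }
}

bool hasNumberAndSquare(const std::vector<std::uint32_t>& in)
{
    std::vector<std::uint64_t> keys;
    keys.reserve(2 * in.size());
    for (std::uint32_t x : in)
    {
        keys.push_back(std::uint64_t{x} << 1);                // tag 0: original value
        if (x <= 0xFFFF)                                      // x*x could be in the input
            keys.push_back((std::uint64_t{x} * x << 1) | 1);  // tag 1: its square
    }
    radixSortByBytes(keys);
    // (v, tag 0) sorts immediately before (v, tag 1), so a value/square match
    // appears as two adjacent keys differing by exactly 1. (Caveat: 0 and 1
    // are their own squares, so a lone 0 or 1 matches itself.)
    for (std::size_t i = 0; i + 1 < keys.size(); ++i)
        if ((keys[i] & 1) == 0 && keys[i + 1] == keys[i] + 1)
            return true;
    return false;
}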

Chris H
Note that you can't do this if you copy in the square roots of the ints, since radix sort only works with integers. Squares of integers are integers, so this is OK.
Chris H
It is, of course, unnecessary to copy the square roots AND squares. And it would definitely make more sense to only copy squares, or perfect square roots.
nlucaroni
that's what I said...
Chris H
What if there are duplicates in the original array?
MAK
+1: good point. I hadn't thought of that. It's still pretty easy to keep track in O(N) though.
Chris H
How can you do radix sort if the integers in the array aren't bounded?
gillyb
Every integer is bounded on a computer. Note that I answered this question assuming you're taking an algorithms class, not writing actual code. If the prof said nothing about using too much space, then use as much as you can.
Chris H
By the same argument that radix sort is O(N) on bounded integers, quicksort is O(1) on systems where SIZE_MAX (or equivalent) provides a finite upper bound on the problem size. The O(N)-ness of radix sort is quite interesting if you have a truly huge input consisting of a lot of duplicates, but not otherwise.
Steve Jessop
That said, I haven't thought of a better solution than radix sort either, so it's possible my complaint is with the question, not the answer...
Steve Jessop
@Steve, I'm not sure what you mean... quicksort is always O(1) where SIZE_MAX is bounded? I'm talking about a bound on the max value of an integer, not the number of integers.
Chris H
@gillyb: if the integers in the array are not bounded, then I'm pretty sure it can't be done. Just calculating the square of an integer requires greater than linear time, so consider an input consisting of two integers of equal size. Testing whether one is the square of the other cannot be done in O(n). I think.
Steve Jessop
@Chris: yes, but your justification for the max value of an integer being bounded is that "on real computers, they always are". Equally, on real computers, arrays are always of bounded size and hence can be comparison-sorted with a fixed upper bound on the number of comparisons (about 2^37 comparisons for 32-bit array sizes, and 2^70 comparisons for 64-bit array sizes). Not practically computable, perhaps, but still O(1) ;-)
Steve Jessop
@Steve: You are correct, radix-sort is **O(n log(k))** (where k is the value of the largest integer), so the overall algorithm is also **O(n log(k))**. Algorithms like this are known as pseudo-polynomial (http://en.wikipedia.org/wiki/Pseudo-polynomial_time). See my answer for a true O(n) algorithm.
BlueRaja - Danny Pflughoeft
+4  A: 

There are basically two ways to do this.

  1. Sort the array and then perform a binary search for the square of each number. Overall complexity would be O(n log n), but it would need sorting, which would destroy the original ordering (which might be important for your case).

  2. Insert all items of the array into a hashtable (or any fast set data structure). Then iterate over the elements of the array again, checking whether each element's square exists in the hashtable; a sketch follows below. Using a hashtable gives an overall complexity of O(n), but you will need O(n) extra space. You could also use a tree-based set (e.g. std::set in C++ or TreeSet in Java), which would give you a complexity of O(n log n).
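
For concreteness, a sketch of option 2 in C++ (my code; the function name is mine, and overflow of x*x is ignored). Note that 0 and 1 are their own squares, so under this formulation a single 0 or 1 matches itself:

#include <unordered_set>
#include <vector>

bool containsNumberAndSquare(const std::vector<long long>& a)
{
    std::unordered_set<long long> seen(a.begin(), a.end());  // O(n) expected
    for (long long x : a)
        if (seen.count(x * x))   // is x's square also in the array?
            return true;
    return false;
}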

MAK
A: 

If I understand the problem correctly, you have to check whether a specified number and its square are in the array, not find all the numbers in the array that have their square in the array too. Simply maintain two booleans (one recording whether the number has been found, another for the square), iterate over the elements of the array, and test each element. Return the AND of the two booleans.

In pseudo code :

bool ArrayContainsNumberAndSquare(int number, int[] array):
boolean numberFound = false, squareFound = false;
int square = number * number;
foreach int i in array
{
  numberFound = numberFound || (i == number);
  squareFound = squareFound || (i == square);
}
return numberFound && squareFound;
Sylvestre Equy
Nope, as far as I understand, the OP is looking for any pair of number/square that happens to be in the array
Kena
+1  A: 

While I can't add to the suggestions above, you can reduce the average run time by first finding the min and max values in your data set (both O(n)) and confining your search to that range. For instance, if the maximum value is 620, no integer of 25 or more has its square in the list, since 25² = 625 > 620.
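
A sketch of this pruning layered on the hash-set approach from MAK's answer (my code; overflow ignored):

#include <algorithm>
#include <unordered_set>
#include <vector>

bool containsNumberAndSquare(const std::vector<long long>& a)
{
    if (a.empty()) return false;
    auto [loIt, hiIt] = std::minmax_element(a.begin(), a.end());  // both O(n)
    std::unordered_set<long long> seen(a.begin(), a.end());
    for (long long x : a)
    {
        long long sq = x * x;
        if (sq < *loIt || sq > *hiIt) continue;  // square can't be in the array
        if (seen.count(sq)) return true;
    }
    return false;
}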

Grembo
+2  A: 

If we're allowed to assume that the input can be sorted in O(N) by radix sort, I'd improve a bit on Chris's solution:

  • radix sort the input.
  • For the first element of the result, linear search forward until we find either its square (in which case stop with true), or else the end (in which case stop with false) or else a value larger than the square (in which case continue searching for the square of the second and subsequent elements of the sorted array).

Each of the two "pointers" is moving strictly forward, so the overall complexity is O(N), assuming that the radix sort is O(N) and that squaring and comparison are O(1). Presumably whoever set the question intended these assumptions to be made.
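
A sketch of that two-pointer scan in C++ (my code; assumes the sorted input is non-negative, that squaring does not overflow, and accepts that 0 and 1 match themselves):

#include <cstddef>
#include <vector>

bool hasSquarePair(const std::vector<unsigned long long>& s)  // s sorted ascending
{
    std::size_t j = 0;  // candidate position for the current square
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        unsigned long long sq = s[i] * s[i];    // non-decreasing as i advances
        while (j < s.size() && s[j] < sq) ++j;  // so j never moves backwards
        if (j < s.size() && s[j] == sq) return true;
    }
    return false;
}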

In response to a comment by the questioner on another answer: if the integers in the input are not bounded, then I don't think it can be done. Just calculating the square of an integer requires greater than linear time (at least: no linear algorithm for multiplication is known), so consider an input of size n bits, consisting of two integers of size n / 3 bits and 2 * n / 3 bits. Testing whether one is the square of the other cannot be done in O(n). I think. I could be wrong.

Steve Jessop
I gave this post a -1 because it confuses the 'normal' definition of 'input' size. On 'standard' RAM models of computation, it is assumed that the integers are small enough (or registers are large enough) to fit in O(1) registers, and MUL/DIV etc. are O(1).
Moron
I thought I stated what assumptions were made in order to give an O(N) solution. I've added that the professor presumably intended them to be made. Is there anything else you would add in order for this not to cause confusion?
Steve Jessop
When judging the time-complexity of algorithms, we consider multiplications (et al.) as single operations. The algorithm is then judged by the number of operations it takes to complete (or, more accurately, how the number of operations grows as the size of the input grows).
BlueRaja - Danny Pflughoeft
We sometimes do that, and we sometimes calculate the complexity as the number of operations performed by a Turing machine (or other abstract machine). Depends whether we need to handle arbitrary integer input, which sometimes you do and sometimes you don't. That's why I said what happens in both cases.
Steve Jessop
If you need to handle arbitrary integer input, the size of the input is not really n. For instance, say you had n integers, each of log n bits. Then the size of the input is n log n and not n. So even if arithmetic is O(log n) now, the algorithm is still linear in the size of the input. So if you do consider the bits of the individual integers when considering the time complexity, you also need to use the right input size.
Moron
n is always the number of bits of input in "proper" complexity calculations, although sometimes, as BlueRaja says, we involve the values of the input integers. So for "proper" complexity you wouldn't have n integers of log n bits, you'd have m integers of sizes s(0), s(1), ..., s(m-1) bits, and n would be equal to the sum over i of s(i). Plus presumably some overhead for some kind of separator in the encoding, which you ignore because it's at worst proportional to n. Arithmetic is therefore not O(log n); it's O(n) for addition (where m=2) and worse than that for multiplication.
Steve Jessop
http://en.wikipedia.org/wiki/Computational_complexity_of_mathematical_operations might help. In my example I specifically chose an obviously bad class of cases of the problem stated, which is where the largest integer in the input has size in bits proportional to the total size in bits of the input, and this is why: because O(n) is incredibly restrictive for any algorithm on arbitrarily-sized integers. Therefore, presumably, there's an assumption somewhere in the course notes but unstated in this question, that integers are fixed size and thus (in my notation) m and n are proportional.
Steve Jessop
@Steve: We are saying the same thing. I was talking about the time complexity of addition of log n bit integers as being O(log n). You are talking about addition of n bit integers being O(n). But if you are looking at an array of n integers of log n bits each, addition of two integers among those is O(log n) and not O(n). In any case, it is tedious to always describe the model of computation each time you state an algorithm problem. When not stated, the standard model assumed is the RAM word model which I described earlier, where addition/multiplication are assumed to be O(1) (due to hardware support).
Moron
Well, I did assume the fixed-size model in the first part of my answer. But in response to a follow-up question from the questioner ("what if the integers are not bounded?"), I also considered the general model, to explain why this could not be the model in use by whoever set the question. In the RAM model, integers are constant size, and in the general model their size is not necessarily bound by the log of the number of integers, so I still don't understand where "n integers, each of size log n" is coming from. Since there are worse cases, I don't quite see what your example illustrates.
Steve Jessop
I mean, if the input was how you say, then a particular algorithm would be Theta(total size of the input). Which is fair enough, but I was talking about arbitrarily-sized integers, as mentioned by the questioner, not about integers bounded in size by log n, and for arbitrarily-sized integers the same algorithm would not be Theta(total size of the input).
Steve Jessop
I have removed the -1. I re-read your answer. Sorry, it is probably my reading comprehension which was confused.
Moron
In fact, now I want to give the answer a +1, but unfortunately, my vote has become too old. Sorry if I have caused you any distress, Steve.
Moron
The vote doesn't bother me as long as we understand roughly what each other are talking about :-) I did edit the answer a bit after your first comment, to make it more clear I was talking about two separate sets of assumptions, leading to different answers, so thanks.
Steve Jessop
+1  A: 

You may be able to do it with a couple of hashsets helping you out.

While iterating:

  • If the value is in the squares hashset, you've got a pair (the value is the square of a previously seen value).
  • If the square of the value is in the values hashset, you've got a pair (the square of this value was already passed).
  • Otherwise, store the value in one set and its square in the other.
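
In C++, the pass might look like this (a sketch; the names are mine and overflow is ignored):

#include <unordered_set>
#include <vector>

bool containsNumberAndSquare(const std::vector<long long>& a)
{
    std::unordered_set<long long> values;   // numbers seen so far
    std::unordered_set<long long> squares;  // squares of numbers seen so far
    for (long long x : a)
    {
        if (squares.count(x)) return true;     // x is the square of an earlier value
        if (values.count(x * x)) return true;  // x's square appeared earlier
        values.insert(x);
        squares.insert(x * x);
    }
    return false;
}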

SB
+1  A: 

Personally I think that Anon's answer (the little algorithm with 'squares') is more useful than it appears to be: remove the 'remove all less than e from squares' line from it, and the algorithm can handle an unsorted input array.

If we assume the typical Homework machine with Sufficient Space, the 'squares' data structure could be modelled as an array of boolean flags, yielding true O(1) lookup time.

Lars
+1  A: 

If we're using C/C++ 32-bit unsigned ints, the maximum value that can be stored is 4294967295 = 2^32 - 1. The largest number whose square we can still store is (1<<16)-1 = 65535. Now if we create an array of bits and store in the array whether we've seen the number and/or its square (2 bits per "slot"), we can get the total storage down to 65536/4 = 16384 bytes.

IMO this is not excessive memory consumption, so we should be able to do this without radix sorting. An O(N) algorithm could look like this:

#include <cmath>
#include <cstdint>
#include <vector>

uint32_t index( uint32_t i ) { return i/4; }
unsigned char bit1( uint32_t i ) { return 1<<( (i%4)*2 ); }
unsigned char bit2( uint32_t i ) { return 1<<( (i%4)*2 +1 ); }


bool hasValueAndSquare( std::vector<uint32_t> & v )
{
   const uint32_t max_square=65535;  // largest value whose square fits in 32 bits

   unsigned char found[(max_square+1)/4]={0};  // 2 bits per slot, 16384 bytes
   for(unsigned int i=0; i<v.size(); ++i)
   {
      if (v[i]<=max_square)
      {
          found[ index(v[i]) ] |= bit1(v[i]);  // mark v[i] as a value we've seen
          if ((found[ index(v[i])] & bit2(v[i])) == bit2(v[i])) return true;
      }
      uint32_t w = (uint32_t)std::round(std::sqrt((double)v[i]));
      if( w*w == v[i] )  // v[i] is a perfect square
      {
          found[ index(w) ] |= bit2(w);  // mark that w's square has been seen
          if ((found[index(w)] & bit1(w)) == bit1(w)) return true;
      }
   }
   return false;
}

This is not tested and not very optimized, and a proper integer square root would be better. However, the compiler should inline all the bit-accessing functions, so they'll be OK.

Note that if we're using 64-bit ints the memory consumption becomes much larger: instead of a 16 KB array we need a 1 GB array, which is possibly less practical.

Michael Anderson
A: 

1) With the hashmap you get O(n).

2) If you use std::set on 2 sets, the evens and the odds, you can get

2*O((n/2) log(n/2)) = O(n log(n/2)) = O(n log n)

assuming there are roughly as many evens as odds.
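
A number and its square always have the same parity, so a value/square pair can never straddle the two sets. A sketch of this variant (my code; overflow ignored, and as elsewhere 0 and 1 match themselves):

#include <set>
#include <vector>

bool containsNumberAndSquare(const std::vector<long long>& a)
{
    std::set<long long> evens, odds;  // x and x*x always share parity
    for (long long x : a)
        (x % 2 == 0 ? evens : odds).insert(x);  // O(n log(n/2)) total
    for (long long x : a)
    {
        const std::set<long long>& side = (x % 2 == 0) ? evens : odds;
        if (side.count(x * x)) return true;
    }
    return false;
}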

Bob Yoplait
+1  A: 

Without sorting, works with duplicates:

  1. Iterate the array to find the smallest and largest integers. O(n)
  2. Create an array of bits the size of the difference. O(1) time, O(k) space
     (Now each possible integer between the smallest and largest values has a corresponding bit in the array.)
  3. Iterate the old array, setting the bit corresponding to each integer found to 1. O(n)
  4. Iterate the old array again, checking whether the integer's square has its corresponding bit set. O(n)

(Though I didn't sort, this algorithm can be very easily modified to create a sorting algorithm which sorts in O(n+k) time and O(k) space)
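
A sketch of these steps (my code; assumes the range k = max - min is small enough that k+1 flags fit in memory, and as elsewhere 0 and 1 match themselves):

#include <algorithm>
#include <vector>

bool containsNumberAndSquare(const std::vector<long long>& a)
{
    if (a.empty()) return false;
    auto [loIt, hiIt] = std::minmax_element(a.begin(), a.end());  // O(n)
    const long long lo = *loIt, hi = *hiIt;
    std::vector<bool> present(hi - lo + 1, false);  // one bit per possible value, O(k)
    for (long long x : a) present[x - lo] = true;   // O(n)
    for (long long x : a)
    {
        long long sq = x * x;                       // fine for negative x too
        if (sq >= lo && sq <= hi && present[sq - lo]) return true;  // O(n)
    }
    return false;
}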

BlueRaja - Danny Pflughoeft
Of course, in real life this is O(n+k), since you need to zero the array; but we do not usually consider things like that when defining algorithms. This is very likely the "correct" answer.
BlueRaja - Danny Pflughoeft
"we do not usually consider things like that". Some of us do, some of us don't. For instance when looking at the complexity of an algorithm for ISPRIME, O(n + k) would be a disaster, since k is about 2^n. It all depends on context, and never having studied CompSci at a university, I can't guess what assumptions would have been stated in the lecture, but not in this SO question...
Steve Jessop
@Steve: I meant considering things like how long it takes to zero an array. Of course `k` is an important factor in most other cases. However, from an algorithms point-of-view, an array can be *"declared"* as zeroed in O(1), so in this case `k` is not counted as part of the complexity (a real-world example: an array could, theoretically, be zeroed in hardware with a simple switch in **O(1)** ; I do not know of any computers which actually implement this).
BlueRaja - Danny Pflughoeft
Yes, fair point, because abstract machine definitions treat storage as having 0 initial value (or at any rate a defined initial value, which in this case you'd use to mean "false"), whereas malloc provides non-deterministic values. In practice, the time cost of malloc and calloc are both indeterminate from an algorithms POV, to analyse the real performance on a real machine you need more information...
Steve Jessop
A: 

Optimization notes

Both the hashset and radix sort algorithms can be optimized by noting three facts:

  1. Odd and even values can be handled separately
  2. Calculating an integer square root is a very fast operation (typically consists of 3-5 divides and a few adds)
  3. Cache locality is important for both of these algorithms
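
As an illustration of fact 2 (my code, not taken from the answer): seed with the hardware square root, then correct by at most a step or two to get the exact floor(sqrt(n)):

#include <cmath>
#include <cstdint>

// Exact integer square root: floor(sqrt(n)). Assumes n is well below 2^64
// so that (r+1)*(r+1) cannot overflow.
std::uint32_t isqrt(std::uint64_t n)
{
    std::uint64_t r = (std::uint64_t)std::sqrt((double)n);
    while (r * r > n) --r;               // double rounding may overshoot by one
    while ((r + 1) * (r + 1) <= n) ++r;  // ...or undershoot by one
    return (std::uint32_t)r;
}

// n is a perfect square exactly when isqrt(n) * isqrt(n) == n.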

The optimized algorithms below will typically perform 5x faster and use less than half the RAM of the unoptimized case. In some cases where the data size is similar to the L2/L3 cache size they may perform 100x faster or more.

Optimized algorithm based on radix sort

Data structure is five lists of integers: IN, Aodd, Bodd, Aeven, Beven. The A and B lists use half the integer size of IN (e.g. if IN is 64 bits, A and B are 32 bits).

  1. Scan list IN to find the largest odd and even numbers MAXodd and MAXeven
  2. Let LIMITodd = floor(sqrt(MAXodd))
  3. Let LIMITeven = floor(sqrt(MAXeven))
  4. For each number in list IN:
     a. Compute the square root if the number is positive. If the root is exact, add it to list Aodd/Aeven.
     b. If the number is >= 0 and <= LIMITodd/LIMITeven, add it to list Bodd/Beven.
  5. Radix sort list Aodd and Bodd using just log2(LIMITodd) bits
  6. Linear scan Aodd and Bodd for a match
  7. Radix sort list Aeven and Beven using just log2(LIMITeven) bits
  8. Linear scan Aeven and Beven for a match

If either linear scan finds a match, return that match immediately.

The reason this is much faster than the straightforward radix sort algorithm is that:

  • The arrays being sorted typically have less than 1/4 the number of values and need only half the number of bits per integer, so a total of less than 1/8 the RAM is in use in a given sort, which is good for the cache.
  • The radix sort is done on much fewer bits, leading to fewer passes; so even if it does exceed your L1 or L2 cache, you read RAM fewer times, and you read much less RAM.
  • The linear scan is typically much faster because the A list contains only exact square roots and the B list only contains small values

Optimized algorithm based on hashset

Data structure is a list of integers IN, plus two hashsets A and B. The A and B sets use half the integer size of IN.

  1. Scan list IN to find the largest odd and even numbers MAXodd and MAXeven
  2. Let LIMITodd = floor(sqrt(MAXodd))
  3. Let LIMITeven = floor(sqrt(MAXeven))
  4. For each odd number in list IN:
     a. Compute the square root if the number is positive. If the root is exact, check whether it exists in B and return true if so; otherwise add it to A.
     b. If the number is >= 0 and <= LIMITodd/LIMITeven, check whether it exists in A and return true if so; otherwise add it to B.
  5. Clear A and B and repeat step 4 for even numbers

The reason this is faster than the straightforward hashset algorithm is that:

  • The hashsets are typically 1/8 the size in RAM, leading to much better cache performance
  • Only exact squares and small numbers have hashset entries, so much less time is spent hashing and adding/removing values

There is an additional small optimization available here: A and B can be a single hashset which stores a bit flag with each entry to say whether the integer is in A or B (it can't be in both, because then the algorithm would have terminated).

Ray Burns