ansaurus

Question

Finding frequent sequence of numbers in an array

Answer 1

+1 A:

In Python3

>>> from collections import Counter
>>> count_hash=Counter()
>>> T=(3, 5, 1, 3, 5, 48, 4, 7, 13, 55, 65, 4, 7, 13, 32)
>>> for i in range(2,len(T)+1):
...     for j in range(len(T)+1-i):
...         count_hash[T[j:j+i]]+=1
... 
>>> for k,v in count_hash.items():
...     if v >= 2:
...         print(k,v)
... 
(3, 5) 2
(4, 7, 13) 2
(7, 13) 2
(4, 7) 2

Do you need to filter the (7,13) and the (4,7) out? What happens if there was also (99, 7, 14) in the sequence?

a Counter is just like a hash used to keep track of the number of times we see each substring
The two nested for loops produce all the substrings of T, using count_hash to accumulate the count of each substring.
The final for loop filters all those substrings that only occurred once

Here is a version with a filter

from collections import Counter
def substrings(t, minlen=2):
    tlen = len(t)
    return (t[j:j+i] for i in range(minlen, tlen+1) for j in range(tlen+1-i))

def get_freq(*t):
    counter = Counter(substrings(t))
    for k in sorted(counter, key=len):
        v=counter[k]
        if v < 2:
            del counter[k]
            continue
        for t in substrings(k):
            if t in counter:
                if t==k:
                    continue
                counter[k]+=counter[t]-v
                del counter[t]
    return counter

print(get_freq(3, 5, 1, 3, 5, 48, 4, 7, 13, 55, 65, 4, 7, 13, 32, 4, 7))
print(get_freq(1,2,3,4,1,2,3,4,1,2,7,8,7,8,3,4,3,4,1,2))

the output is

Counter({(4, 7, 13): 3, (3, 5): 2})
Counter({(1, 2, 3, 4, 1, 2): 8, (7, 8): 2}) # Is this the right answer?

Which is why I asked how the filtering should work for the sequence I gave in the comments

gnibbler 2010-08-18 06:36:10

Python !== PHP. ;)

cdburgess 2010-08-18 06:44:35

Yes I need to filter them, I search only for the longest sequence of numbers and ignore any sub-sequence included already in it, Can you write the code in general or by any other language like Java or C++ or PHP

D3VELOPER 2010-08-18 06:45:34

@cdburgess, The questions asks for an algorithm or pseudo code. This is an algorithm

gnibbler 2010-08-18 06:49:51

@D3VELOPER, but what happens if the substring `(7,14)` also occurs in other places (not following a `4`)?

gnibbler 2010-08-18 06:55:53

it will be included in the frequency of the largest one

D3VELOPER 2010-08-18 06:59:06

@D3VELOPER, what would be the output for `(1,2,3,4,1,2,3,4,1,2,7,8,7,8,3,4,3,4,1,2)`?

gnibbler 2010-08-18 07:11:11

D3VELOPER 2010-08-18 08:15:34

@D3VELOPER, but if you don't count overlapping substrings, you'll need to choose between (1,2,3,4) having a count of 2 or (3,4,1,2) having a count of 2. The other one should have a count of 1

gnibbler 2010-08-18 08:26:56

Answer 2

A:

Ok, just to start off the discussion.

Create another array/map, call this weightage array.
Start iterating on the values array.
For each value in values array,increment the corresponding position in weightage array. Eg: for 3 increase weightage[3]++, for 48 weightage[48]++.
After the iteration the weightage array contains repetitions

questzen 2010-08-18 06:41:01

Answer 3

+4 A:

** EDIT ** : slightly better implementation, now also returns frequences and has a better sequence filter.

function getFrequences($input, $minimalSequenceSize = 2) {
   $sequences = array();
   $frequences = array();

   $len = count($input);
   for ($i=0; $i<$len; $i++) {
      $offset = $i;

      for ($j=$i+$minimalSequenceSize; $j<$len; $j++) {
         if ($input[$offset] == $input[$j]) {
            $sequenceSize = 1;
            $sequence = array($input[$offset]);
            while (($offset + $sequenceSize < $j) 
               && ($input[$offset+$sequenceSize] == $input[$j+$sequenceSize])) {

               if (false !== ($seqIndex = array_search($sequence, $frequences))) {
                  // we already have this sequence, since we found a bigger one, remove the old one
                  array_splice($sequences, $seqIndex, 1);
                  array_splice($frequences, $seqIndex, 1);
               }              

               $sequence[] = $input[$offset+$sequenceSize];
               $sequenceSize++;
            }

            if ($sequenceSize >= $minimalSequenceSize) {
               if (false !== ($seqIndex = array_search($sequence, $sequences))) {
                  $frequences[$seqIndex]++;
               } else {
                  $sequences[] = $sequence;
                  $frequences[] = 2;  // we have two occurances already
               }
               // $i += $sequenceSize;  // move $i so we don't reuse the same sub-sequence
               break;
            }
         }
      }
   }

   // remove sequences that are sub-sequence of another frequence
   // ** comment this to keep all sequences regardless ** 
   $len = count($sequences);
   for ($i=0; $i<$len; $i++) {
      $freq_i = $sequences[$i];
      for ($j=$i+1; $j<$len; $j++) {
         $freq_j = $sequences[$j];
         $freq_inter = array_intersect($freq_i, $freq_j);
         if (count($freq_inter) != 0) {
            $len--;
            if (count($freq_i) > count($freq_j)) {
               array_splice($sequences, $j, 1);
               array_splice($frequences, $j, 1);
               $j--;
            } else {
               array_splice($sequences, $i, 1);
               array_splice($frequences, $i, 1); 
               $i--;
               break;
            }
         }
      }
   }

   return array($sequences, $frequences);
};

Test case

header('Content-type: text/plain');

$input = array(3, 5, 1, 3, 5, 48, 4, 7, 13, 55, 3, 5, 65, 4, 7, 13, 32, 5, 48, 4, 7, 13);

list($sequences, $frequences) = getFrequences($input);
foreach ($sequences as $i => $s) {
   echo "(" . implode(',', $s) . ') f=' . $frequences[$i] . "\n";
}

** EDIT ** : here's an update to the function. It was almost completely rewritten... tell me if this is what you were looking for. I also added a redundancy check to prevent counting the same sequence, or subsequence, twice.

function getFrequences2($input, $minSequenceSize = 2) {
  $sequences = array();

  $last_offset = 0;
  $last_offset_len = 0;

  $len = count($input);
  for ($i=0; $i<$len; $i++) {
     for ($j=$i+$minSequenceSize; $j<$len; $j++) {
        if ($input[$i] == $input[$j]) {
           $offset = 1;
           $sub = array($input[$i]);
           while ($i + $offset < $j && $j + $offset < $len) {
              if ($input[$i + $offset] == $input[$j + $offset]) {
                 array_push($sub, $input[$i + $offset]);
              } else {
                 break;
              }
              $offset++;
           }

           $sub_len = count($sub);
           if ($sub_len >= $minSequenceSize) {
              // $sub must contain more elements than the last sequence found
              // otherwise we will count the same sequence twice
              if ($last_offset + $last_offset_len >= $i + $sub_len) {
                 // we already saw this sequence... ignore
                 continue;
              } else {
                 // save offset and sub_len for future check
                 $last_offset = $i;
                 $last_offset_len = $sub_len;
              }

              foreach ($sequences as & $sequence) {
                 $sequence_len = count($sequence['values']);
                 if ($sequence_len == $sub_len && $sequence['values'] == $sub) {
                    //echo "Found add-full ".var_export($sub, true)." at $i and $j...\n";
                    $sequence['frequence']++;
                    break 2;
                 } else {
                    if ($sequence_len > $sub_len) {
                       $end = $sequence_len - $sub_len;
                       $values = $sequence['values'];
                       $slice_len = $sub_len;
                       $test = $sub;
                    } else {
                       $end = $sub_len - $sequence_len;
                       $values = $sub;
                       $slice_len = $sequence_len;
                       $test = $sequence['values'];
                    }
                    for ($k=0; $k<=$end; $k++) {
                       if (array_slice($values, $k, $slice_len) == $test) {
                          //echo "Found add-part ".implode(',',$sub)." which is part of ".implode(',',$values)." at $i and $j...\n";
                          $sequence['values'] = $values;
                          $sequence['frequence']++;
                          break 3;
                       }
                    }
                 }
              }

              //echo "Found new ".implode(',',$sub)." at $i and $j...\n";
              array_push($sequences, array('values' => $sub, 'frequence' => 2));
              break;
           }
        }
     }
  }

  return $sequences;
};

Yanick Rochon 2010-08-18 06:59:03

Worked for me. Works well. It even eliminates the duplicated 4,7 and only shows the 4,7,13. Nice work!

cdburgess 2010-08-18 07:12:16

I found some potential problems and fixed them in this one. The algorithm now also return the frequence for each sequence. Cheers! If you find any error, please tell me so I can update/fix this answer.

Yanick Rochon 2010-08-18 07:37:30

D3VELOPER 2010-08-18 08:25:58

I think I understand. I'll update my solution later, however. I'm in a little rush at the moment.

Yanick Rochon 2010-08-18 09:05:08

solution updated

Yanick Rochon 2010-08-18 23:35:50

Thanks, I will test it and back to you with the results

D3VELOPER 2010-08-18 23:46:25

Furthering my tests, I have found out that (5, 1, 3, 5, 1, 3) was found 3 times -- because (1, 3) was also found within the first (5, 1, 3) and should not have --, which was wrong. So I added a redundancy check.

Yanick Rochon 2010-08-19 01:19:54

Awesome Work! and the code very simple and efficient.

D3VELOPER 2010-08-19 02:35:44

ansaurus

tags:

views:

answers:

Finding frequent sequence of numbers in an array

related questions