I have a file with 450,000+ rows of entries. Each entry is about 7 characters in length. What I want to know is which unique characters appear in this file.

For instance, if my file were the following:

Entry
-----
Yabba
Dabba
Doo

Then the result would be

Unique characters: {abdoy}

Notice I don't care about case and don't need to order the results. Something tells me this is very easy for the Linux folks to solve.

Update

I'm looking for a very fast solution. I really don't want to have to create code to loop over each entry, loop through each character...and so on. I'm looking for a nice script solution.

Update 2

By Fast, I mean fast to implement...not necessarily fast to run.

+6  A: 

Use a set data structure. Most programming languages / standard libraries come with one flavour or another. If they don't, use a hash table (or, more generally, a dictionary) implementation and just omit the value field, using your characters as keys. These data structures filter out duplicate entries (hence the name: mathematical sets have no particular order and contain only unique values).
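For illustration, a minimal Python sketch of the set approach (the file name entries.txt is hypothetical):

unique_chars = set()
for line in open('entries.txt'):              # hypothetical file name
    unique_chars.update(line.strip().lower()) # add each character of the line
print "Unique characters: {%s}" % ''.join(sorted(unique_chars))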

Konrad Rudolph
A set will be slower than the bitmap evilteach is suggesting
Sam Saffron
But my solution works, the array doesn't (of course I'm using Unicode files). ;-)
Konrad Rudolph
I also claim that the array solution *is* in fact an optimized and specialized set.
Konrad Rudolph
A: 

In C++ I would first loop through the letters of the alphabet, then run strchr() on each with the file as a string. This will tell you whether that letter exists; then just add it to the list.
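A sketch of that idea, written in Python rather than C++ for brevity, with the in operator standing in for strchr() and assuming a plain ASCII alphabet (the file name is hypothetical):

import string

text = open('entries.txt').read().lower()  # hypothetical file name
# one scan of the whole file per letter, mirroring the strchr() approach
found = [letter for letter in string.ascii_lowercase if letter in text]
print "Unique characters: {%s}" % ''.join(found)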

Russ Bradberry
Which alphabet? What if the alphabet you use is significantly smaller than the file? What if it's considerably larger?
Dustin
Well, he already gave his file size in the question. I'm not sure of the performance differences of each; I imagine it would depend on how much memory you have.
Russ Bradberry
Or what if the file is very large but only contains a few characters from the alphabet? This will scan through the whole input file multiple times.
Greg Hewgill
Point taken. Then maybe after finding a character you remove all instances of it from the search string; this way it speeds up as it finds more.
Russ Bradberry
A: 

Python without using a set.

file = open('location', 'r')
letters = []
for line in file:
    for character in line:
        if character not in letters:
            letters.append(character)
Jough
Why do you avoid the set? The list version is a lot slower...
André
I dunno.. just an alternative.
Jough
Why not just use a dictionary? It should be within the same order of magnitude as a set, and it is as integrated into the language as lists are.
Cervo
+2  A: 

A very fast solution would be to make a small C program that reads its standard input, does the aggregation and spits out the result.

Why the arbitrary limitation that you need a "script" that does it?

What exactly is a script anyway?

Would Python do?

If so, then this is one solution:

import sys

s = set()
while True:
    line = sys.stdin.readline()
    if not line:
        break
    line = line.rstrip()
    for c in line.lower():
        s.add(c)

print("".join(sorted(s)))
Lasse V. Karlsen
Using a set is slightly faster than using a list and checking for contains every time, as @Jough did.
Joachim Sauer
Not only “slightly” actually. For large files, the difference is truly significant, i.e. in the order of O(n^2) vs. O(n).
Konrad Rudolph
You're confusing n's here. The n in question for this algorithm is the number of unique characters, not the size of the file, so it's not that big a performance hit. That said, if you only care about ASCII characters then a bitfield is the easiest approach.
Yuliy
There are ways to make this super-fast by piping assembly code to an assembler, thus keeping in line with the "script" part of the question, but my solution is quick to implement and runs decently fast. Yes, it can be improved, but is it necessary?
Lasse V. Karlsen
+2  A: 

Algorithm: Slurp the file into memory.

Create an array of unsigned ints, initialized to zero.

Iterate through the in-memory file, using each byte as a subscript into the array,
    incrementing that array element.

Discard the in-memory file.

Iterate through the array of unsigned ints:
       if the count is not zero,
           display the character and its corresponding count.
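For illustration, a minimal Python rendering of this algorithm (a sketch assuming an 8-bit encoding, as the comment below points out; the file name is borrowed from other answers in the thread):

data = open('location.txt', 'rb').read()  # slurp the file into memory
counts = [0] * 256                        # one counter per possible byte value
for byte in data:
    counts[ord(byte)] += 1                # each byte is a subscript into the array
for value in range(256):
    if counts[value]:
        print repr(chr(value)), counts[value]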
EvilTeach
This only works if you assume an 8-bit encoding, and therefore doesn't support Unicode characters. Or at least you'll need to modify the bitfield size.
Joachim Sauer
He is talking characters, not Unicode. He also could probably have implemented an adequate solution by now, without help.
EvilTeach
+5  A: 

As requested, a pure shell-script "solution":

sed -e "s/./\0\n/g" inputfile | sort -u

It's not nice, it's not fast and the output is not exactly as specified, but it should work ... mostly.

For even more ridiculousness, I present the version that dumps the output on one line:

sed -e "s/./\0\n/g" inputfile | sort -u | while read c; do echo -n "$c" ; done
Joachim Sauer
You're missing a `wc -l` at the end. Other than that, nice solution. I tried something similar but didn't get it to work (I forgot the `g` option on sed).
Konrad Rudolph
"wc -l"? As I understand he's not interested in the number of unique characters but in which ones there are. Did I get that wrong?
Joachim Sauer
saua, forget it … I actually misread the question.
Konrad Rudolph
@Konrad: but I stole your sed-magic using "\0" it looks much nicer that way. Thanks ;-)
Joachim Sauer
You're welcome. I'll probably delete my other answer in due course since it's now redundant.
Konrad Rudolph
This doesn't ignore case; if you pipe the sed output through "tr [A-Z] [a-z]" before passing it to sort, it'll ignore case too, and it's only a few tenths of a second slower.
Jay
The while at the end could be replaced by tr -d "\n".
Abdul
+10  A: 

BASH shell script version (no sed/awk):

while read -n 1 char; do echo "$char"; done < entry.txt | tr [A-Z] [a-z] |  sort -u

UPDATE: Just for the heck of it, since I was bored and still thinking about this problem, here's a C++ version using std::set. If run time is important, this would be my recommended option, since the C++ version takes slightly more than half a second to process a file with 450,000+ entries.

#include <iostream>
#include <set>
#include <cctype>  // for isspace() and tolower()

int main() {
    std::set<char> seen_chars;
    std::set<char>::const_iterator iter;
    char ch;

    /* ignore whitespace and case */
    while ( std::cin.get(ch) ) {
        if (! isspace(ch) ) {
            seen_chars.insert(tolower(ch));
        }
    }

    for( iter = seen_chars.begin(); iter != seen_chars.end(); ++iter ) {
        std::cout << *iter << std::endl;
    }

    return 0;
}

Note that I'm ignoring whitespace and it's case-insensitive, as requested.

For a 450,000+ entry file (chars.txt), here's a sample run time:

[user@host]$ g++ -o unique_chars unique_chars.cpp 
[user@host]$ time ./unique_chars < chars.txt
a
b
d
o
y

real    0m0.638s
user    0m0.612s
sys     0m0.017s
Jay
To me sed/awk are part of the shell: if the shell is available and sed/awk are not, then I'm in bizarro-world ;-)
Joachim Sauer
But that's some nice work there ... always good to learn some new tricks.
Joachim Sauer
But: "sort -u" should be a lot more efficient than "sort | uniq"
Joachim Sauer
@saua - Noted, I removed the call to uniq after I checked the manpage to sort. Old habit ;)
Jay
I'm afraid the while-loop at the beginning really slows things down; this is still by far the slowest bash alternative in this thread, it seems.
Joachim Sauer
@saua - yes, not exactly speedy as requested, but then if speed is the major concern I'd probably want to write this in C anyway.
Jay
+1  A: 
cat yourfile | 
 perl -e 'while(<>){chomp;$k{$_}++ for split(//, lc $_)}print keys %k,"\n";'
codelogic
+1  A: 

Alternative solution using bash:

sed "s/./\l\0\n/g" inputfile | sort -u | grep -vc ^$

EDIT: Sorry, I actually misread the question. The above code counts the unique characters. Just omitting the c switch at the end obviously does the trick, but then this solution has no real advantage over saua's (especially since he now uses the same sed pattern instead of explicit captures).

Konrad Rudolph
(Go people! Code golf!)
Konrad Rudolph
A: 

Try this script with JSDB JavaScript (which includes the JavaScript engine from the Firefox browser):

var seenAlreadyMap={};
var seenAlreadyArray=[];
while (!system.stdin.eof)
{
  var L = system.stdin.readLine();
  for (var i = L.length; i-- > 0; )
  {
    var c = L[i].toLowerCase();
    if (!(c in seenAlreadyMap))
    {
      seenAlreadyMap[c] = true;
      seenAlreadyArray.push(c);
    }
  }
}
system.stdout.writeln(seenAlreadyArray.sort().join(''));
Jason S
+1  A: 

While not a script, this Java program will do the job. It's easy to understand and fast (to run).

import java.util.*;
import java.io.*;

public class Unique {
    public static void main( String [] args ) throws IOException {
        int c;
        Set<Character> s = new TreeSet<Character>();
        // read() returns -1 at EOF; the original "> 0" would also stop at a NUL byte
        while( ( c = System.in.read() ) != -1 ) {
            s.add( Character.toLowerCase((char)c) );
        }
        System.out.println( "Unique characters:" + s );
    }
}

You'll invoke it like this:

type yourFile | java Unique

or

cat yourFile | java Unique

For instance, the unique characters in the HTML of this question are:

Unique characters:[ , , ,  , !, ", #, $, %, &, ', (, ), +, ,, -, ., /, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, :, ;, <, =, >, ?, @, [, \, ], ^, _, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, {, |, }]
OscarRyz
A: 

Python using a dictionary. I don't know why people are so tied to sets or lists to hold stuff. Granted, a set is probably more efficient than a dictionary, but both are supposed to take constant time to access items, and both run circles around a list, where for each character you search the whole list to see whether the character is already in it. Also, lists and dictionaries are built-in Python datatypes that everyone should be using all the time. So even if set doesn't come to mind, dictionary should.

file = open('location.txt', 'r')
letters = {}
for line in file:
  if line == "":
    break
  for character in line.strip():
    if character not in letters:
      letters[character] = True
file.close()
print "Unique Characters: {" + "".join(letters.keys()) + "}"
Cervo
+3  A: 

Python w/sets (quick and dirty)

s = open("data.txt", "r").read()
print "Unique Characters: {%s}" % ''.join(set(s))

Python w/sets (with nicer output)

import re

text = open("data.txt", "r").read().lower()
unique = re.sub(r'\W', '', ''.join(set(text))) # Ignore non-alphanumeric

print "Unique Characters: {%s}" % unique
Triptych
A: 

A C solution. Admittedly it is not the fastest-to-code solution in the world. But since it is already coded and can be cut and pasted, I think it counts as "fast to implement" for the poster :) I didn't actually see any C solutions, so I wanted to post one for the pure sadistic pleasure :)

#include <stdio.h>
#include <stdlib.h>  /* for calloc() */

#define CHARSINSET 256
#define FILENAME "location.txt"

char buf[CHARSINSET + 1];

char *getUniqueCharacters(int *charactersInFile) {
    int x;
    char *bufptr = buf;
    for (x = 0; x < CHARSINSET; x++) {
        if (charactersInFile[x] > 0)
            *bufptr++ = (char)x;
    }
    *bufptr = '\0';  /* terminate the string (the original assigned to the pointer itself) */
    return buf;
}

int main() {
    FILE *fp;
    int c;  /* int, not char, so that EOF is detected correctly */
    int *charactersInFile = calloc(CHARSINSET, sizeof(int));
    if (NULL == (fp = fopen(FILENAME, "rt"))) {
        printf("File not found.\n");
        return 1;
    }
    while ((c = getc(fp)) != EOF) {
        if (c != '\n' && c != '\r')
            charactersInFile[c]++;
    }

    fclose(fp);
    printf("Unique characters: {%s}\n", getUniqueCharacters(charactersInFile));
    return 0;
}
Cervo
+4  A: 

Quick and dirty C program that's blazingly fast:

#include <stdio.h>

int main(void)
{
  int chars[256] = {0}, c;
  while((c = getchar()) != EOF)
    chars[c] = 1;
  for(c = 32; c < 127; c++)  // printable chars only
  {
    if(chars[c])
      putchar(c);
  }

  putchar('\n');

  return 0;
}

Compile it, then do

cat file | ./a.out

To get a list of the unique printable characters in file.

Adam Rosenfield
Nice, very fast! However it doesn't handle accented characters (anything non-ASCII). Try this test case: cat /usr/share/dict/american-english | ./a.out
codelogic
I assumed that Brig only cared about ASCII. If not, that's easy to fix - just change the 127 in the loop bound to 256.
Adam Rosenfield
I agree, nice and fast! Doesn't ignore the case of the characters as requested, but easy enough to fix with a call to tolower() in the while loop.
Jay
A: 

Where C:/data.txt contains 454,863 rows of seven random alphabetic characters, the following code

using System;
using System.IO;
using System.Collections;
using System.Diagnostics;

namespace ConsoleApplication {
    class Program {
        static void Main(string[] args) {
            FileInfo fileInfo = new FileInfo(@"C:/data.txt");
            Console.WriteLine(fileInfo.Length);

            Stopwatch sw = new Stopwatch();
            sw.Start();

            Hashtable table = new Hashtable();

            StreamReader sr = new StreamReader(@"C:/data.txt");
            while (sr.EndOfStream == false) {
                char c = Char.ToLower((char)sr.Read());
                if (table.Contains(c) == false) {
                    table.Add(c, null);
                }
            }
            sr.Close();

            foreach (char c in table.Keys) {
                Console.Write(c);
            }
            Console.WriteLine();

            sw.Stop();
            Console.WriteLine(sw.ElapsedMilliseconds);
        }
    }
}

produces output

4093767
mytojevqlgbxsnidhzupkfawr
c
889
Press any key to continue . . .

The first line of output tells you the number of bytes in C:/data.txt (454,863 * (7 + 2) = 4,093,767 bytes). The next two lines of output are the unique characters in C:/data.txt (including a newline). The last line of output tells you the number of milliseconds the code took to execute on a 2.80 GHz Pentium 4.

Jason
A: 

Quick and dirty solution using grep (assuming the file name is "file"):

for char in a b c d e f g h i j k l m n o p q r s t u v w x y z; do 
    if [ ! -z "`grep -li $char file`" ]; then 
        echo -n $char; 
    fi; 
done; 
echo

I could have made it a one-liner, but I just wanted to make it easier to read.

(EDIT: forgot the -i switch to grep)

PolyThinker
+2  A: 

Here's a PowerShell example:

gc file.txt | select -Skip 2 | % { $_.ToCharArray() } | sort -CaseSensitive -Unique

which produces:

D
Y
a
b
o

I like that it's easy to read.

EDIT: Here's a faster version:

$letters = @{} ; gc file.txt | select -Skip 2 | % { $_.ToCharArray() } | % { $letters[$_] = $true } ; $letters.Keys
Jay Bazuzi
As much as I like PowerShell, it is not a good solution for this many rows. The script has been running for at least 5 min using 1.7 GB of memory.
Bummer. I wonder if PowerShell v2 will do any better? I bet replacing the `sort` with an accumulation set (like some of the other answers) would improve the memory performance. I will ponder it.
Jay Bazuzi
There, I wrote one that uses a hashtable as a set.
Jay Bazuzi
A: 

Well my friend, I think this is what you had in mind....At least this is the python version!!!

f = open("location.txt", "r") # open file

ll = sorted(list(f.read().lower())) #Read file into memory, split into individual characters, sort list
ll = [val for idx, val in enumerate(ll) if (idx == 0 or val != ll[idx-1])] # eliminate duplicates
f.close()
print "Unique Characters: {%s}" % "".join(ll) #print list of characters, carriage return will throw in a return

It does not explicitly iterate over each character, and it is relatively short as well. You wouldn't want to open a 500 MB file with it (depending upon your RAM), but for shorter files it is fun :)

I also have to add my final attack!!!! Admittedly I eliminated two lines by using standard input instead of a file, and I reduced the active code from 3 lines to 2. Basically, if I replaced ll in the print line with the expression from the line above it, I could have had 1 line of active code and one line of imports... Anyway, now we are having fun :)

import itertools, sys

# read standard input into memory, split into characters, eliminate duplicates
ll = map(lambda x:x[0], itertools.groupby(sorted(list(sys.stdin.read().lower()))))
print "Unique Characters: {%s}" % "".join(ll) #print list of characters, carriage return will throw in a return
Cervo
A: 

An answer above mentioned using a dictionary.

If so, the code presented there can be streamlined a bit, since the Python documentation states:

It is best to think of a dictionary as an unordered set of key: value pairs, with the requirement that the keys are unique (within one dictionary).... If you store using a key that is already in use, the old value associated with that key is forgotten.

Therefore, this line of the code can be removed, since the dictionary keys will always be unique anyway:

    if character not in letters:

And that should make it a little faster.
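For reference, the loop from that answer with the check removed would then read:

file = open('location.txt', 'r')
letters = {}
for line in file:
  if line == "":
    break
  for character in line.strip():
    letters[character] = True  # re-assigning an existing key is harmless
file.close()
print "Unique Characters: {" + "".join(letters.keys()) + "}"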