I have a file with 450,000+ rows of entries. Each entry is about 7 characters in length. What I want to know is which unique characters appear in this file.

For instance, if my file were the following:

Entry
-----
Yabba
Dabba
Doo

Then the result would be

Unique characters: {abdoy}

Notice I don't care about case and don't need to order the results. Something tells me this is very easy for the Linux folks to solve.

Update

I'm looking for a very fast solution. I really don't want to have to create code to loop over each entry, loop through each character...and so on. I'm looking for a nice script solution.

Update 2

By Fast, I mean fast to implement...not necessarily fast to run.

+6  A: 

Use a set data structure. Most programming languages / standard libraries come with one flavour or another. If they don't, use a hash table (or, more generally, a dictionary) implementation and just omit the value field, using your characters as keys. These data structures filter out duplicate entries (hence the name: mathematical sets have no particular order and contain only unique values).
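For illustration, a minimal Python sketch of the set approach (the file name entries.txt is hypothetical):

unique_chars = set()
for line in open('entries.txt'):              # hypothetical file name
    unique_chars.update(line.strip().lower()) # add each character of the line
print "Unique characters: {%s}" % ''.join(sorted(unique_chars))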

Konrad Rudolph
A set will be slower than the bitmap evilteach is suggesting
Sam Saffron
But my solution works, the array doesn't (of course I'm using Unicode files). ;-)
Konrad Rudolph
I also claim that the array solution *is* in fact an optimized and specialized set.
Konrad Rudolph
A: 

In C++ I would first loop through the letters of the alphabet, then run strchr() on each with the file as a string. This will tell you whether that letter exists; then just add it to the list.
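A sketch of that idea, written in Python rather than C++ for brevity, with the in operator standing in for strchr() and assuming a plain ASCII alphabet (the file name is hypothetical):

import string

text = open('entries.txt').read().lower()  # hypothetical file name
# one scan of the whole file per letter, mirroring the strchr() approach
found = [letter for letter in string.ascii_lowercase if letter in text]
print "Unique characters: {%s}" % ''.join(found)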

Russ Bradberry
Which alphabet? What if the alphabet you use is significantly smaller than the file? What if it's considerably larger?
Dustin
Well, he already gave his file size in the question. I'm not sure of the performance differences of each; I imagine it would depend on how much memory you have.
Russ Bradberry
Or what if the file is very large but only contains a few characters from the alphabet? This will scan through the whole input file multiple times.
Greg Hewgill
Point taken. Then maybe after finding a character you remove all instances of it from the search string; this way it speeds up as it finds more.
Russ Bradberry
A: 

Python without using a set.

file = open('location', 'r')
letters = []
for line in file:
    for character in line:
        if character not in letters:
            letters.append(character)
Jough
Why do you avoid the set? The list version is a lot slower...
André
I dunno.. just an alternative.
Jough
Why not just use a dictionary? It should be within the same order of magnitude as a set, and it is as integrated into the language as lists are.
Cervo
+2  A: 

A very fast solution would be to make a small C program that reads its standard input, does the aggregation and spits out the result.

Why the arbitrary limitation that you need a "script" that does it?

What exactly is a script anyway?

Would Python do?

If so, then this is one solution:

import sys

s = set()
while True:
    line = sys.stdin.readline()
    if not line:
        break
    line = line.rstrip()
    for c in line.lower():
        s.add(c)

print("".join(sorted(s)))
Lasse V. Karlsen
Using a set is slightly faster than using a list and checking for contains every time, as @Jough did.
Joachim Sauer
Not only “slightly” actually. For large files, the difference is truly significant, i.e. in the order of O(n^2) vs. O(n).
Konrad Rudolph
You're confusing n's here. The n in question for this algorithm is the number of unique characters, not the size of the file, so it's not that big a performance hit. That said, if you only care about ASCII characters then a bitfield is the easiest approach.
Yuliy
There are ways to make this super-fast by piping assembly code to an assembler, thus keeping in line with the "script" part of the question, but my solution is quick to implement and runs decently fast. Yes, it can be improved, but is it necessary?
Lasse V. Karlsen
+2  A: 

Algorithm: Slurp the file into memory.

Create an array of unsigned ints, initialized to zero.

Iterate through the in-memory file, using each byte as a subscript into the array,
    incrementing that array element.

Discard the in-memory file.

Iterate through the array of unsigned ints:
       if the count is not zero,
           display the character and its corresponding count.
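For illustration, a minimal Python rendering of this algorithm (a sketch assuming an 8-bit encoding, as the comment below points out; the file name is borrowed from other answers in the thread):

data = open('location.txt', 'rb').read()  # slurp the file into memory
counts = [0] * 256                        # one counter per possible byte value
for byte in data:
    counts[ord(byte)] += 1                # each byte is a subscript into the array
for value in range(256):
    if counts[value]:
        print repr(chr(value)), counts[value]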
EvilTeach
This only works if you assume an 8-bit encoding, and therefore doesn't support Unicode characters. Or at least you'll need to modify the bitfield size.
Joachim Sauer
He is talking characters, not Unicode. He also could probably have implemented an adequate solution by now, without help.
EvilTeach
+5  A: 

As requested, a pure shell-script "solution":

sed -e "s/./\0\n/g" inputfile | sort -u

It's not nice, it's not fast and the output is not exactly as specified, but it should work ... mostly.

For even more ridiculousness, I present the version that dumps the output on one line:

sed -e "s/./\0\n/g" inputfile | sort -u | while read c; do echo -n "$c" ; done
Joachim Sauer
You're missing a `wc -l` at the end. Other than that, nice solution. I tried something similar but didn't get it to work (I forgot the `g` option on sed).
Konrad Rudolph
"wc -l"? As I understand he's not interested in the number of unique characters but in which ones there are. Did I get that wrong?
Joachim Sauer
saua, forget it … I actually misread the question.
Konrad Rudolph
@Konrad: but I stole your sed-magic using "\0" it looks much nicer that way. Thanks ;-)
Joachim Sauer
You're welcome. I'll probably delete my other answer in due course since it's now redundant.
Konrad Rudolph
This doesn't ignore case; if you pipe the sed output through "tr [A-Z] [a-z]" before passing it to sort, it'll ignore case too, and it's only a few tenths of a second slower.
Jay
The while at the end could be replaced by tr -d "\n".
Abdul
+10  A: 

BASH shell script version (no sed/awk):

while read -n 1 char; do echo "$char"; done < entry.txt | tr [A-Z] [a-z] |  sort -u

UPDATE: Just for the heck of it, since I was bored and still thinking about this problem, here's a C++ version using std::set. If run time is important, this would be my recommended option, since the C++ version takes slightly more than half a second to process a file with 450,000+ entries.

#include <iostream>
#include <set>
#include <cctype>  // for isspace() and tolower()

int main() {
    std::set<char> seen_chars;
    std::set<char>::const_iterator iter;
    char ch;

    /* ignore whitespace and case */
    while ( std::cin.get(ch) ) {
        if (! isspace(ch) ) {
            seen_chars.insert(tolower(ch));
        }
    }

    for( iter = seen_chars.begin(); iter != seen_chars.end(); ++iter ) {
        std::cout << *iter << std::endl;
    }

    return 0;
}

Note that I'm ignoring whitespace and it's case-insensitive, as requested.

For a 450,000+ entry file (chars.txt), here's a sample run time:

[user@host]$ g++ -o unique_chars unique_chars.cpp 
[user@host]$ time ./unique_chars < chars.txt
a
b
d
o
y

real    0m0.638s
user    0m0.612s
sys     0m0.017s
Jay
To me sed/awk are part of the shell: if the shell is available and sed/awk are not, then I'm in bizarro-world ;-)
Joachim Sauer
But that's some nice work there ... always good to learn some new tricks.
Joachim Sauer
But: "sort -u" should be a lot more efficient than "sort | uniq"
Joachim Sauer
@saua - Noted, I removed the call to uniq after I checked the manpage to sort. Old habit ;)
Jay
I'm afraid the while-loop at the beginning really slows things down; this is still by far the slowest bash alternative in this thread, it seems.
Joachim Sauer
@saua - yes, not exactly speedy as requested, but then if speed is the major concern I'd probably want to write this in C anyway.
Jay
+1  A: 
cat yourfile | 
 perl -e 'while(<>){chomp;$k{$_}++ for split(//, lc $_)}print keys %k,"\n";'
codelogic
+1  A: 

Alternative solution using bash:

sed "s/./\l\0\n/g" inputfile | sort -u | grep -vc ^$

EDIT: Sorry, I actually misread the question. The above code counts the unique characters. Just omitting the c switch at the end obviously does the trick, but then this solution has no real advantage over saua's (especially since he now uses the same sed pattern instead of explicit captures).

Konrad Rudolph
(Go people! Code golf!)
Konrad Rudolph
A: 

Try this script with JSDB JavaScript (which includes the JavaScript engine from the Firefox browser):

var seenAlreadyMap={};
var seenAlreadyArray=[];
while (!system.stdin.eof)
{
  var L = system.stdin.readLine();
  for (var i = L.length; i-- > 0; )
  {
    var c = L[i].toLowerCase();
    if (!(c in seenAlreadyMap))
    {
      seenAlreadyMap[c] = true;
      seenAlreadyArray.push(c);
    }
  }
}
system.stdout.writeln(seenAlreadyArray.sort().join(''));
Jason S
+1  A: 

While not a script, this Java program will do the job. It's easy to understand and fast (to run).

import java.util.*;
import java.io.*;

public class Unique {
    public static void main( String [] args ) throws IOException {
        int c;
        Set<Character> s = new TreeSet<Character>();
        // read() returns -1 at EOF; the original "> 0" would also stop at a NUL byte
        while( ( c = System.in.read() ) != -1 ) {
            s.add( Character.toLowerCase((char)c) );
        }
        System.out.println( "Unique characters:" + s );
    }
}

You'll invoke it like this:

type yourFile | java Unique

or

cat yourFile | java Unique

For instance, the unique characters in the HTML of this question are:

Unique characters:[ , , ,  , !, ", #, $, %, &, ', (, ), +, ,, -, ., /, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, :, ;, <, =, >, ?, @, [, \, ], ^, _, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, {, |, }]
OscarRyz
A: 

Python using a dictionary. I don't know why people are so tied to sets or lists to hold stuff. Granted, a set is probably more efficient than a dictionary, but both are supposed to take constant time to access items, and both run circles around a list, where for each character you search the whole list to see whether the character is already in it. Also, lists and dictionaries are built-in Python datatypes that everyone should be using all the time. So even if set doesn't come to mind, dictionary should.

file = open('location.txt', 'r')
letters = {}
for line in file:
  if line == "":
    break
  for character in line.strip():
    if character not in letters:
      letters[character] = True
file.close()
print "Unique Characters: {" + "".join(letters.keys()) + "}"
Cervo
+3  A: 

Python w/sets (quick and dirty)

s = open("data.txt", "r").read()
print "Unique Characters: {%s}" % ''.join(set(s))

Python w/sets (with nicer output)

import re

text = open("data.txt", "r").read().lower()
unique = re.sub(r'\W', '', ''.join(set(text))) # Ignore non-alphanumeric

print "Unique Characters: {%s}" % unique
Triptych
A: 

A C solution. Admittedly it is not the fastest-to-code solution in the world. But since it is already coded and can be cut and pasted, I think it counts as "fast to implement" for the poster :) I didn't actually see any C solutions, so I wanted to post one for the pure sadistic pleasure :)

#include <stdio.h>
#include <stdlib.h>  /* for calloc() */

#define CHARSINSET 256
#define FILENAME "location.txt"

char buf[CHARSINSET + 1];

char *getUniqueCharacters(int *charactersInFile) {
    int x;
    char *bufptr = buf;
    for (x = 0; x < CHARSINSET; x++) {
        if (charactersInFile[x] > 0)
            *bufptr++ = (char)x;
    }
    *bufptr = '\0';  /* terminate the string (the original assigned to the pointer itself) */
    return buf;
}

int main() {
    FILE *fp;
    int c;  /* int, not char, so that EOF is detected correctly */
    int *charactersInFile = calloc(CHARSINSET, sizeof(int));
    if (NULL == (fp = fopen(FILENAME, "rt"))) {
        printf("File not found.\n");
        return 1;
    }
    while ((c = getc(fp)) != EOF) {
        if (c != '\n' && c != '\r')
            charactersInFile[c]++;
    }

    fclose(fp);
    printf("Unique characters: {%s}\n", getUniqueCharacters(charactersInFile));
    return 0;
}
Cervo
+4  A: 

Quick and dirty C program that's blazingly fast:

#include <stdio.h>

int main(void)
{
  int chars[256] = {0}, c;
  while((c = getchar()) != EOF)
    chars[c] = 1;
  for(c = 32; c < 127; c++)  // printable chars only
  {
    if(chars[c])
      putchar(c);
  }

  putchar('\n');

  return 0;
}

Compile it, then do

cat file | ./a.out

To get a list of the unique printable characters in file.

Adam Rosenfield
Nice, very fast! However it doesn't handle accented characters (anything non-ASCII). Try this test case: cat /usr/share/dict/american-english | ./a.out
codelogic
I assumed that Brig only cared about ASCII. If not, that's easy to fix - just change the 127 in the loop bound to 256.
Adam Rosenfield
I agree, nice and fast! Doesn't ignore the case of the characters as requested, but easy enough to fix with a call to tolower() in the while loop.
Jay
A: 

Where C:/data.txt contains 454,863 rows of seven random alphabetic characters, the following code

using System;
using System.IO;
using System.Collections;
using System.Diagnostics;

namespace ConsoleApplication {
    class Program {
        static void Main(string[] args) {
            FileInfo fileInfo = new FileInfo(@"C:/data.txt");
            Console.WriteLine(fileInfo.Length);

            Stopwatch sw = new Stopwatch();
            sw.Start();

            Hashtable table = new Hashtable();

            StreamReader sr = new StreamReader(@"C:/data.txt");
            while (sr.EndOfStream == false) {
                char c = Char.ToLower((char)sr.Read());
                if (table.Contains(c) == false) {
                    table.Add(c, null);
                }
            }
            sr.Close();

            foreach (char c in table.Keys) {
                Console.Write(c);
            }
            Console.WriteLine();

            sw.Stop();
            Console.WriteLine(sw.ElapsedMilliseconds);
        }
    }
}

produces output

4093767
mytojevqlgbxsnidhzupkfawr
c
889
Press any key to continue . . .

The first line of output tells you the number of bytes in C:/data.txt (454,863 * (7 + 2) = 4,093,767 bytes). The next two lines of output are the unique characters in C:/data.txt (including a newline). The last line of output tells you the number of milliseconds the code took to execute on a 2.80 GHz Pentium 4.

Jason
A: 

Quick and dirty solution using grep (assuming the file name is "file"):

for char in a b c d e f g h i j k l m n o p q r s t u v w x y z; do 
    if [ ! -z "`grep -li $char file`" ]; then 
        echo -n $char; 
    fi; 
done; 
echo

I could have made it a one-liner, but I just wanted to make it easier to read.

(EDIT: forgot the -i switch to grep)

PolyThinker
+2  A: 

Here's a PowerShell example:

gc file.txt | select -Skip 2 | % { $_.ToCharArray() } | sort -CaseSensitive -Unique

which produces:

D
Y
a
b
o

I like that it's easy to read.

EDIT: Here's a faster version:

$letters = @{} ; gc file.txt | select -Skip 2 | % { $_.ToCharArray() } | % { $letters[$_] = $true } ; $letters.Keys
Jay Bazuzi
As much as I like PowerShell, it is not a good solution for this many rows. The script has been running for at least 5 min using 1.7 GB of memory.
Bummer. I wonder if PowerShell v2 will do any better? I bet replacing the `sort` with an accumulation set (like some of the other answers) would improve the memory performance. I will ponder it.
Jay Bazuzi
There, I wrote one that uses a hashtable as a set.
Jay Bazuzi
A: 

Well my friend, I think this is what you had in mind....At least this is the python version!!!

f = open("location.txt", "r") # open file

ll = sorted(list(f.read().lower())) #Read file into memory, split into individual characters, sort list
ll = [val for idx, val in enumerate(ll) if (idx == 0 or val != ll[idx-1])] # eliminate duplicates
f.close()
print "Unique Characters: {%s}" % "".join(ll) #print list of characters, carriage return will throw in a return

It does not explicitly iterate over each character, and it is relatively short as well. You wouldn't want to open a 500 MB file with it (depending upon your RAM), but for shorter files it is fun :)

I also have to add my final attack!!!! Admittedly I eliminated two lines by using standard input instead of a file, and I reduced the active code from 3 lines to 2. Basically, if I replaced ll in the print line with the expression from the line above it, I could have had 1 line of active code and one line of imports... Anyway, now we are having fun :)

import itertools, sys

# read standard input into memory, split into characters, eliminate duplicates
ll = map(lambda x:x[0], itertools.groupby(sorted(list(sys.stdin.read().lower()))))
print "Unique Characters: {%s}" % "".join(ll) #print list of characters, carriage return will throw in a return
Cervo
A: 

An answer above mentioned using a dictionary.

If so, the code presented there can be streamlined a bit, since the Python documentation states:

It is best to think of a dictionary as an unordered set of key: value pairs, with the requirement that the keys are unique (within one dictionary).... If you store using a key that is already in use, the old value associated with that key is forgotten.

Therefore, this line of the code can be removed, since the dictionary keys will always be unique anyway:

    if character not in letters:

And that should make it a little faster.
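For reference, the loop from that answer with the check removed would then read:

file = open('location.txt', 'r')
letters = {}
for line in file:
  if line == "":
    break
  for character in line.strip():
    letters[character] = True  # re-assigning an existing key is harmless
file.close()
print "Unique Characters: {" + "".join(letters.keys()) + "}"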