views:

326

answers:

5

I need to find names which contain three 7s in random order.

My attempt

First, we need to find the names which do not contain a seven:

ls | grep [^7]

Then, we could remove these matches from the whole space

ls [remove] ls | grep [^7]

The problem is that my pseudo-code quickly starts to repeat itself.

How can you find the names which contain three 7s in random order using AWK/Python/Bash?

[edit] The name can contain any number of other characters; I am looking for the names that contain exactly three 7s.

+7  A: 

I don't understand the part about "random order". How do you differentiate between the "order" when it's the same token that repeats? Is "a7b7" different from "c7d7" in the order of the 7s?

Anyway, this ought to work:

 ls *7*7*7*

It just lets the shell solve the problem, but maybe I didn't understand properly.

EDIT: The above is wrong; it also matches names with more than three 7s, which is not wanted. Assuming this is bash, and extended globbing is enabled, this works:

ls *([^7])7*([^7])7*([^7])7*([^7])

This reads as "zero or more characters which are not sevens, followed by a seven, followed by zero or more characters that are not sevens", and so on. It's important to understand that the asterisk is a prefix operator here, operating on the expression ([^7]) which means "any character except 7".
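A quick way to try this without touching an interactive shell's settings is to run the pattern in a subshell started with extended globbing already enabled via bash's -O option; the scratch directory and filenames below are invented for illustration:

```shell
# Create a scratch directory with invented names: two with exactly
# three 7s, one with four, and one with none.
mkdir -p /tmp/sevens_demo && cd /tmp/sevens_demo
touch a7b7c7 777 x7777 plain

# -O extglob turns on extended globbing before the pattern is parsed.
bash -O extglob -c 'ls *([^7])7*([^7])7*([^7])7*([^7])'
```

Only 777 and a7b7c7 should be listed: x7777 has a fourth 7 and plain has none.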

unwind
It's not clear what happens if there are more than 3 7s.
Georg
Your command does not work. I mean that the filename can contain any number of letters and digits, and you need to find the names with exactly three 7s.
Masi
Aah, thanks, that clarifies it. I knew there was something I was missing. :) I'll edit.
unwind
+5  A: 

I'm guessing you want to find files that contain exactly three 7s, but no more. Using GNU grep with the extended regexp switch (-E):


ls | grep -E '^([^7]*7){3}[^7]*$'

Should do the trick.

Basically that matches 3 occurrences of "not 7 followed by a 7", then any number of "not 7" characters, with the ^ and $ at the beginning and end anchoring the pattern to the whole string.
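A quick sanity check of the pattern, using a few invented names instead of a real directory:

```shell
# Only the names with exactly three 7s should pass the filter:
# a7b7c7 and 777x pass; 7777 has four 7s and 77 has only two.
printf '%s\n' a7b7c7 7777 77 777x | grep -E '^([^7]*7){3}[^7]*$'
```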

John Montgomery
It worked on my host's Linux box. Tested it by echoing some text strings and all seemed OK... Perhaps the regexp isn't understood by the version of grep you are using?
John Montgomery
Your code works! Thank you.
Masi
+2  A: 

Something like this:

printf '%s\n' * | awk -F7 'NF == 4'
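To see why NF == 4 corresponds to exactly three 7s, here is a quick check on a few invented names; with the field separator set to 7, each 7 adds one field boundary:

```shell
# A name with exactly three 7s splits into four fields on FS="7".
# a7b7c7 -> "a","b","c","" and 777 -> "","","","" both have NF == 4;
# 7777 has five fields and 77 has three, so they are filtered out.
printf '%s\n' a7b7c7 7777 77 777 | awk -F7 'NF == 4'
```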
radoulov
Neat solution using awk but replace the printf with an 'ls -1' or simply an 'ls' (which defaults to -1 if the output is a pipe).
Andrew Dalke
No, ls is an external command (an expensive unnecessary fork) that is not needed in this case. On most modern shells printf is a builtin command so it's faster.
radoulov
What does %s mean in your code, and the last part -F7 NF==4?
Masi
"Faster"? With 2 files I see 0.007s using ls vs. 0.004s with printf. Subtract a smidgeon as my timings needed an extra 'sh' to capture the glob time. With 1000 files the times are 0.010 vs. 0.009. With 10000 it's 0.031 vs 0.057 (printf slower) and "Argument list too long" with * on 100,000 files.
Andrew Dalke
So while it is faster for the common case (few directories have >2000 files), it fails for some cases, and that failure bothers me more than the performance. Also, I just realized that ls should be 'ls -f' since the OP didn't care about sorted order.
Andrew Dalke
Arguably, anyone who has a Unix system running where 'ls' ISN'T cached in the VM is probably on a system that isn't doing anything. While the 'ls' fork won't be "free", when cached, it will most certainly be damn cheap.
Will Hartung
Actually you cannot get "argument list too long" with builtin commands, so only ls could eventually break (assuming printf is a builtin and not an external command, as already pointed out).
radoulov
While I agree that in most cases the performance benefit will be insignificant, I don't see how ls could be better in this case.
radoulov
@Masi, %s is a format conversion specifier which converts an argument to a string and copies it to the standard output. -F7 sets the awk input field separator FS to the digit 7. NF == 4 means match only records that have exactly four fields; NF is the number of fields (to be continued...)
radoulov
i.e. the records that contain exactly three occurrences of the digit 7.
radoulov
I'm on a Mac. 'which printf' gives me /usr/bin/printf for both bash and tcsh. That's why I get "Argument list too long". I did, after all, test my timing numbers.
Andrew Dalke
@dalke,you should use type printf to verify, not which.
radoulov
Interesting. Didn't know about type. Why does 'which cd' report cd is a built-in command and not printf? I did my timing tests with two bash scripts, 'time x1.sh' 'time x2.sh' being 'printf "%s\n" * > /dev/null' and 'ls > /dev/null' and I got Arg list too long. Why? Can you do better timings?
Andrew Dalke
You get the "Arg list too long" error when you exceed the ARG_MAX system limit. This is on a SunOS 5.8 machine with bash 2.03.0(1):

bash-2.03$ getconf ARG_MAX
1048320
bash-2.03$ printf '%s ' * | perl -nle 'print length'
1083857

(continues ...)
radoulov
A builtin printf will not fail.
radoulov
P.S. I don't know how to insert code tags in a comment ...
radoulov
Your time printf does not work to measure the glob time because globbing is done by the shell, which passes the argv to time, which forwards the argv to printf. The globbing there occurs *before* time starts, so the results aren't comparable. That's why I had to make a shell script to do the timings
Andrew Dalke
Reading your posts I realize that I probably was not clear: every builtin command is faster and more efficient than an external command because there is no need to start a subprocess. My point was about how many times you execute a command, not about extreme cases with 399714 files (continues...)
radoulov
I believe this example illustrates what I was talking about:

time bash -c 'for c in {1..1000}; do ls >/dev/null; done'
real 0m15.307s
user 0m33.299s
sys 0m10.276s

time bash -c 'for c in {1..1000}; do printf "%s\n" * >/dev/null; done'
real 0m0.537s
user 0m0.265s
sys 0m0.234s
radoulov
My timings agree with you: printf is faster than ls for small directories. Once there are ~1000 files, my tests show ls as being faster. Perhaps due to poor malloc/string use? And printf fails for extreme cases. While slightly slower, I prefer 'ls -1f' as it doesn't break and it's easily understood
Andrew Dalke
+1  A: 

Or instead of doing it in a single grep, use one grep to find files with 3-or-more 7s and another to filter out 4-or-more 7s.

ls -f | egrep '7.*7.*7' | grep -v '7.*7.*7.*7'

You could move some of the work into the shell glob with the shorter

ls -f *7*7*7* | grep -v '7.*7.*7.*7'

though if there are a large number of files which match that pattern, then the latter won't work because the expanded glob exceeds the system's argument-list limit (ARG_MAX).

The '-f' in the 'ls' is to prevent 'ls' from sorting the results. If there is a huge number of files in the directory then the sort time can be quite noticeable.

This two-step filter process is, I think, more understandable than using the [^7] patterns.
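The two-step filter can be checked the same way, feeding a few invented names through the pipeline in place of the ls output (plain grep suffices here, since the patterns use no extended-regexp features):

```shell
# Keep names with at least three 7s, then drop those with four or more.
printf '%s\n' a7b7c7 7777 77 777x | grep '7.*7.*7' | grep -v '7.*7.*7.*7'
```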

Also, here's the solution as a Python script, since you asked for that as an option.

import os

# Print the names in the current directory with exactly three 7s.
for filename in os.listdir("."):
    if filename.count("7") == 3:
        print(filename)

This will handle a few cases that the shell commands won't, like (evil) filenames which contain a newline character. Though even here the output in that case would likely still be wrong, or at least unprepared for by downstream programs.

Andrew Dalke
The first command works, while the last does not.
Masi
Indeed. I needed a leading "*" and trailing "*". My test set for that case was too limited (only '777' and '7777'). Fixed. Thanks!
Andrew Dalke
+2  A: 

A Perl solution:

$ ls | perl -ne 'print if (tr/7/7/ == 3)'
3777
4777
5777
6777
7077
7177
7277
7377
7477
7577
7677
...

(I happen to have a directory with 4-digit numbers. 1777 and 2777 don't exist. :-)

Jon Ericson
Great - your command works too.
Masi