views:

326

answers:

5

I need to find names which contain three 7s in random order.

My attempt

First, we need to find the names which do not contain a seven:

ls | grep [^7]

Then, we could remove these matches from the whole space

ls [remove] ls | grep [^7]

The problem is that my pseudo-code quickly starts to repeat itself.

How can you find the names which contain three 7s in random order using AWK/Python/Bash?

[edit] The name can contain any number of other characters; I am looking for the names that contain exactly three 7s.

+7  A: 

I don't understand the part about "random order". How do you differentiate between the "order" when it's the same token that repeats? Is "a7b7" different from "c7d7" in the order of the 7s?

Anyway, this ought to work:

 ls *7*7*7*

It just lets the shell solve the problem, but maybe I didn't understand properly.

EDIT: The above is wrong; it also matches names with more than three 7s, which is not wanted. Assuming this is bash, and extended globbing is enabled, this works:

ls *([^7])7*([^7])7*([^7])7*([^7])

This reads as "zero or more characters which are not sevens, followed by a seven, followed by zero or more characters that are not sevens", and so on. It's important to understand that the asterisk is a prefix operator here, operating on the expression ([^7]) which means "any character except 7".
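A quick way to try this without touching an interactive shell's settings is to run the pattern in a subshell started with extended globbing already enabled via bash's -O option; the scratch directory and filenames below are invented for illustration:

```shell
# Create a scratch directory with invented names: two with exactly
# three 7s, one with four, and one with none.
mkdir -p /tmp/sevens_demo && cd /tmp/sevens_demo
touch a7b7c7 777 x7777 plain

# -O extglob turns on extended globbing before the pattern is parsed.
bash -O extglob -c 'ls *([^7])7*([^7])7*([^7])7*([^7])'
```

Only 777 and a7b7c7 should be listed: x7777 has a fourth 7 and plain has none.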

unwind
It's not clear what happens if there are more than 3 7s.
Georg
Your command does not work. I mean that the filename can contain any number of letters and digits, and you need to find the names with exactly three 7s.
Masi
Aah, thanks, that clarifies it. I knew there was something I was missing. :) I'll edit.
unwind
+5  A: 

I'm guessing you want to find files that contain exactly three 7s, but no more. Using GNU grep with the extended regexp switch (-E):


ls | grep -E '^([^7]*7){3}[^7]*$'

Should do the trick.

Basically that matches 3 occurrences of "not 7 followed by a 7", then any number of "not 7" characters, with the ^ and $ at the beginning and end anchoring the pattern to the whole string.
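A quick sanity check of the pattern, using a few invented names instead of a real directory:

```shell
# Only the names with exactly three 7s should pass the filter:
# a7b7c7 and 777x pass; 7777 has four 7s and 77 has only two.
printf '%s\n' a7b7c7 7777 77 777x | grep -E '^([^7]*7){3}[^7]*$'
```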

John Montgomery
It worked on my host's Linux box. Tested it by echoing some text strings and all seemed OK... Perhaps the regexp isn't understood by the version of grep you are using?
John Montgomery
Your code works! Thank you.
Masi
+2  A: 

Something like this:

printf '%s\n' * | awk -F7 'NF == 4'
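To see why NF == 4 corresponds to exactly three 7s, here is a quick check on a few invented names; with the field separator set to 7, each 7 adds one field boundary:

```shell
# A name with exactly three 7s splits into four fields on FS="7".
# a7b7c7 -> "a","b","c","" and 777 -> "","","","" both have NF == 4;
# 7777 has five fields and 77 has three, so they are filtered out.
printf '%s\n' a7b7c7 7777 77 777 | awk -F7 'NF == 4'
```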
radoulov
Neat solution using awk but replace the printf with an 'ls -1' or simply an 'ls' (which defaults to -1 if the output is a pipe).
Andrew Dalke
No, ls is an external command (an expensive unnecessary fork) that is not needed in this case. On most modern shells printf is a builtin command so it's faster.
radoulov
What does %s mean in your code, and the last part -F7 NF==4?
Masi
"Faster"? With 2 files I see 0.007s using ls vs. 0.004s with printf. Subtract a smidgeon as my timings needed an extra 'sh' to capture the glob time. With 1000 files the times are 0.010 vs. 0.009. With 10000 it's 0.031 vs 0.057 (printf slower) and "Argument list too long" with * on 100,000 files.
Andrew Dalke
So while it is faster for the common case (few directories have >2000 files), it fails for some cases, and that failure bothers me more than the performance. Also, I just realized that ls should be 'ls -f' since the OP didn't care about sorted order.
Andrew Dalke
Arguably, anyone who has a Unix system running where 'ls' ISN'T cached in the VM is probably on a system that isn't doing anything. While the 'ls' fork won't be "free", when cached, it will most certainly be damn cheap.
Will Hartung
Actually you cannot get "argument list too long" with builtin commands, so only ls could eventually break (assuming printf is a builtin and not an external command, as already pointed out).
radoulov
While I agree that in most cases the performance benefit will be insignificant, I don't see how ls could be better in this case.
radoulov
@Masi, %s is a format conversion specifier which converts an argument to a string and copies it to the standard output. -F7 sets the awk input field separator FS to the digit 7. NF == 4 means match only records that have exactly four fields; NF is the number of fields (to be continued...)
radoulov
i.e. the records that contain exactly three occurrences of the digit 7.
radoulov
I'm on a Mac. 'which printf' gives me /usr/bin/printf for both bash and tcsh. That's why I get "Argument list too long". I did, after all, test my timing numbers.
Andrew Dalke
@dalke,you should use type printf to verify, not which.
radoulov
Interesting. Didn't know about type. Why does 'which cd' report cd is a built-in command and not printf? I did my timing tests with two bash scripts, 'time x1.sh' 'time x2.sh' being 'printf "%s\n" * > /dev/null' and 'ls > /dev/null' and I got Arg list too long. Why? Can you do better timings?
Andrew Dalke
You get the "Arg list too long" error when you exceed the ARG_MAX system limit. This is on a SunOS 5.8 machine with bash 2.03.0(1):

bash-2.03$ getconf ARG_MAX
1048320
bash-2.03$ printf '%s ' * | perl -nle 'print length'
1083857

(continues ...)
radoulov
A builtin printf will not fail.
radoulov
P.S. I don't know how to insert code tags in a comment ...
radoulov
Your time printf does not work to measure the glob time because globbing is done by the shell, which passes the argv to time, which forwards the argv to printf. The globbing there occurs *before* time starts, so the results aren't comparable. That's why I had to make a shell script to do the timings
Andrew Dalke
Reading your posts I realize that I probably was not clear: every builtin command is faster and more efficient than an external command because there is no need to start a subprocess. My point was about how many times you execute a command, not about extreme cases with 399714 files (continues...)
radoulov
I believe this example illustrates what I was talking about:

time bash -c 'for c in {1..1000}; do ls >/dev/null; done'
real 0m15.307s
user 0m33.299s
sys 0m10.276s

time bash -c 'for c in {1..1000}; do printf "%s\n" * >/dev/null; done'
real 0m0.537s
user 0m0.265s
sys 0m0.234s
radoulov
My timings agree with you: printf is faster than ls for small directories. Once there are ~1000 files, my tests show ls as being faster. Perhaps due to poor malloc/string use? And printf fails for extreme cases. While slightly slower, I prefer 'ls -1f' as it doesn't break and it's easily understood
Andrew Dalke
+1  A: 

Or instead of doing it in a single grep, use one grep to find files with 3-or-more 7s and another to filter out 4-or-more 7s.

ls -f | egrep '7.*7.*7' | grep -v '7.*7.*7.*7'

You could move some of the work into the shell glob with the shorter

ls -f *7*7*7* | grep -v '7.*7.*7.*7'

though if there are a large number of files which match that pattern, then the latter won't work because the expanded glob exceeds the system's argument-list limit (ARG_MAX).

The '-f' in the 'ls' is to prevent 'ls' from sorting the results. If there is a huge number of files in the directory then the sort time can be quite noticeable.

This two-step filter process is, I think, more understandable than using the [^7] patterns.
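The two-step filter can be checked the same way, feeding a few invented names through the pipeline in place of the ls output (plain grep suffices here, since the patterns use no extended-regexp features):

```shell
# Keep names with at least three 7s, then drop those with four or more.
printf '%s\n' a7b7c7 7777 77 777x | grep '7.*7.*7' | grep -v '7.*7.*7.*7'
```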

Also, here's the solution as a Python script, since you asked for that as an option.

import os

# Print the names in the current directory with exactly three 7s.
for filename in os.listdir("."):
    if filename.count("7") == 3:
        print(filename)

This will handle a few cases that the shell commands won't, like (evil) filenames which contain a newline character. Though even here the output in that case would likely still be wrong, or at least unprepared for by downstream programs.

Andrew Dalke
The first command works, while the last does not.
Masi
Indeed. I needed a leading "*" and trailing "*". My test set for that case was too limited (only '777' and '7777'). Fixed. Thanks!
Andrew Dalke
+2  A: 

A Perl solution:

$ ls | perl -ne 'print if (tr/7/7/ == 3)'
3777
4777
5777
6777
7077
7177
7277
7377
7477
7577
7677
...

(I happen to have a directory with 4-digit numbers. 1777 and 2777 don't exist. :-)

Jon Ericson
Great - your command works too.
Masi