ansaurus

Question

Grab a line's whitespace/indention with Python

Answer 1

A:

How about using the regex \s*, which matches any run of whitespace characters? You only want the whitespace at the beginning of the line, so either `search` with the regex ^\s* or simply `match` with \s*.
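A minimal sketch of the two equivalent approaches. The helper names `leading_ws` and `leading_ws_search` are made up for illustration and are not part of the original answer:

```python
import re

def leading_ws(line):
    # re.match is implicitly anchored at the start of the string,
    # and \s* can match the empty string, so a match object always exists.
    return re.match(r"\s*", line).group()

def leading_ws_search(line):
    # re.search can match anywhere, so the ^ anchor pins it to the start.
    return re.search(r"^\s*", line).group()

print(repr(leading_ws("\t\tfoo bar")))     # '\t\t'
print(repr(leading_ws_search("    foo")))  # '    '
print(repr(leading_ws("foo")))             # ''
```

Neither call has to consume the whole string; only the leading run of whitespace is returned.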

MatrixFrog 2010-02-15 20:00:29

If you want the match to start at the beginning, use `match`. If the match can start anywhere, use `search`.

Mark Byers 2010-02-15 20:02:35

`match` has to match the entire string, right? So you would have to add a separate group that matches the entire *rest* of the string and then the whitespace would just be in that first group. I think.

MatrixFrog 2010-02-15 20:10:01

@MatrixFrog: Read the doc (http://docs.python.org/library/re.html#matching-vs-searching). The only difference between `match` and `search` is the implicit anchor at the start. There's no restriction to match the entire string.

KennyTM 2010-02-15 20:48:35

@KennyTM I've edited my answer to reflect that information. Please comment again if it's still wrong.

MatrixFrog 2010-02-15 21:45:16

Answer 2

+11 A:

import re
s = "\t\tthis line has two tabs of indention"
re.match(r"\s*", s).group()
# "\t\t"
s = "    this line has four spaces of indention"
re.match(r"\s*", s).group()
# "    "

And to strip leading spaces, use lstrip.

As there are down votes probably questioning the efficiency of regex, I've done some profiling to check the efficiency of each case.

Very long string, very short leading space

RegEx > Itertools >> lstrip

>>> timeit.timeit('r.match(s).group()', 'import re;r=re.compile(r"\s*");s="          hello world!"*10000', number=100000)
0.10037684440612793
>>> timeit.timeit('"".join(itertools.takewhile(lambda x:x.isspace(),s))', 'import itertools;s="          hello world!"*10000', number=100000)
0.7092740535736084
>>> timeit.timeit('"".join(itertools.takewhile(str.isspace,s))', 'import itertools;s="          hello world!"*10000', number=100000)
0.51730513572692871
>>> timeit.timeit('s[:-len(s.lstrip())]', 's="          hello world!"*10000', number=100000)
2.6478431224822998

Very short string, very short leading space

lstrip > RegEx > Itertools

If you can limit the string's length to thousounds of chars or less, the lstrip trick may be better.

>>> timeit.timeit('r.match(s).group()', 'import re;r=re.compile(r"\s*");s="          hello world!"*100', number=100000)
0.099548101425170898
>>> timeit.timeit('"".join(itertools.takewhile(str.isspace,s))', 'import itertools;s="          hello world!"*100', number=100000)
0.53602385520935059
>>> timeit.timeit('s[:-len(s.lstrip())]', 's="          hello world!"*100', number=100000)
0.064291000366210938

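This shows the lstrip trick scales roughly as O(n) in the string length n (about 40x slower for a 100x longer string), because lstrip has to copy the rest of the string, while the RegEx and itertools methods are O(1) in n as long as there are not many leading spaces.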

Very short string, very long leading space

lstrip >> RegEx >>> Itertools

If there are a lot of leading spaces, don't use RegEx.

>>> timeit.timeit('s[:-len(s.lstrip())]', 's=" "*2000', number=10000)
0.047424077987670898
>>> timeit.timeit('r.match(s).group()', 'import re;r=re.compile(r"\s*");s=" "*2000', number=10000)
0.2433168888092041
>>> timeit.timeit('"".join(itertools.takewhile(str.isspace,s))', 'import itertools;s=" "*2000', number=10000)
3.9949162006378174

Very long string, very long leading space

lstrip >>> RegEx >>>>>>>> Itertools

>>> timeit.timeit('s[:-len(s.lstrip())]', 's=" "*200000', number=10000)
4.2374031543731689
>>> timeit.timeit('r.match(s).group()', 'import re;r=re.compile(r"\s*");s=" "*200000', number=10000)
23.877214908599854
>>> timeit.timeit('"".join(itertools.takewhile(str.isspace,s))', 'import itertools;s=" "*200000', number=100)*100
415.72158336639404

This shows all methods scale roughly as O(m), where m is the number of leading spaces, as long as the non-space part is short.
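For anyone who wants to rerun the comparison, here is a small harness sketch. The function names and the `bench` helper are my own for illustration, the inputs are only examples, and absolute timings will differ by machine and Python version:

```python
import itertools
import re
import timeit

WS = re.compile(r"\s*")  # compiled once, outside the timed code

def by_regex(s):
    return WS.match(s).group()

def by_takewhile(s):
    # Stops at the first non-whitespace character.
    return "".join(itertools.takewhile(str.isspace, s))

def by_lstrip(s):
    # len(s) - len(...) instead of -len(...), so all-whitespace input works.
    return s[:len(s) - len(s.lstrip())]

def bench(s, number=1000):
    """Return {function name: seconds for `number` calls} for one input."""
    funcs = (by_regex, by_takewhile, by_lstrip)
    return {f.__name__: timeit.timeit(lambda: f(s), number=number) for f in funcs}

if __name__ == "__main__":
    cases = {
        "long string, short indent": "    hello world!" * 10000,
        "short string, long indent": " " * 2000 + "x",
    }
    for label, s in cases.items():
        print(label, bench(s))
```

Calling through a lambda adds a little per-call overhead to every method equally, so the ratios are more meaningful than the raw numbers.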

KennyTM 2010-02-15 20:01:07

For a line with *no* indentation, this gives an AttributeError because `match` returns `None`. Assuming the desired result is `''`, the empty string, changing the `+` to a `*` seems to solve this.

MatrixFrog 2010-02-15 20:05:01

I prefer my solution of abusing lstrip! No slow regexes needed...

Phil H 2010-02-15 20:08:03

@Phil: Even after implementing Adam Bernier's fix, your `lstrip()` trick is still 10x slower than the regex version. Try it: `timeit.timeit('re.match(r"\s*", s)', 'import re;s=" hello world!"*10000', number=1000000)` (0.25 sec) vs `timeit.timeit('s[:len(s)-len(s.lstrip())]', 's=" hello world!"*10000', number=100000)` (2.7 sec)

KennyTM 2010-02-15 20:29:15

@Kenny: Remove `*10000` (so the test better models real lines) and lstrip is *much* faster than re: `timeit.timeit('s[:-len(s.lstrip())]', 's=" hello world!"', number=100000)` (4.810 vs 0.074 for me).

Roger Pate 2010-02-15 20:41:49

@Roger: That means Phil's method does not scale to very long strings (O(string length) vs O(number of leading spaces)). Not a good algorithm I would say.

KennyTM 2010-02-15 20:45:05

@Kenny: That may be true, and yours may even be better for other reasons, but it's pointless to say yours is faster on lines of 130k(!) characters if it's extremely rare to have lines longer than 80 characters.

Roger Pate 2010-02-15 20:48:37

probably half the time is taken compiling the regex, why don't you do that in the setup?

gnibbler 2010-02-15 20:58:48

@Roger: That's a good point. I've updated the test cases to show the time in each extreme case. @gnibbler: From the doc, *"The compiled versions of the most recent patterns passed to re.match(), re.search() or re.compile() are cached,"* and compiling at setup doesn't affect the time much either.

KennyTM 2010-02-15 21:07:26

Compiling the regex, even though I know it's otherwise cached, more than halves the time in these rudimentary tests. http://codepad.org/Oc5KDQhU

Roger Pate 2010-02-15 21:18:19

@Roger: Ah, that's good. I'd just re-tested the last case when replying and saw no difference.

KennyTM 2010-02-15 21:23:59

@Kenny: What I was trying to say is that *common cases* matter much more than extreme cases. I think all this testing is missing the real point, however, as this is not likely to be a bottleneck and could be implemented in C if it was, for better performance than any pure-Python we'll come up with. (It wasn't my downvote, btw, and I only see the one.)

Roger Pate 2010-02-15 21:26:09

"If you can limit the string's length to thousounds [sic] of chars or less..."?! So according to these tests, lstrip always beats regex, except for the cases that never happen. Hmm... wonder which way I'd use....

John Y 2010-02-15 22:06:13

Answer 3

A:

If you're interested in using regular expressions, you can do that. /\s/ usually matches one whitespace character, so /^\s+/ would match the whitespace starting a line.
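One thing to watch for with `+`: the pattern needs at least one whitespace character, so in Python `re.match` returns `None` for a line with no indentation. A rough sketch of both variants (the function names `indent_plus` and `indent_star` are hypothetical, chosen only for this example):

```python
import re

def indent_plus(line):
    # \s+ requires at least one whitespace character, so the match may be None.
    m = re.match(r"\s+", line)
    return m.group() if m else ""

def indent_star(line):
    # \s* also matches the empty string, so a match object always exists.
    return re.match(r"\s*", line).group()

print(re.match(r"\s+", "no indent"))      # None
print(repr(indent_plus("no indent")))     # ''
print(repr(indent_star("no indent")))     # ''
print(repr(indent_plus("    indented")))  # '    '
```

Which one is preferable depends on whether "no indentation" should be reported as an empty string or treated as a separate case.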

adamse 2010-02-15 20:02:29

Answer 4

+2 A:

A sneaky way: abuse lstrip!

fullstr = "\t\tthis line has two tabs of indentation"
startwhites = fullstr[:len(fullstr)-len(fullstr.lstrip())]

This way you don't have to work through all the details of whitespace!

(Thanks Adam for the correction)
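Wrapped in a function, the trick looks like the sketch below. It also compares the shorter negative-slice spelling that comes up in the comments; the function names are mine, not from the answer:

```python
def indent_by_lstrip(line):
    # Leading whitespace length = total length - length after lstrip().
    return line[:len(line) - len(line.lstrip())]

def indent_by_negative_slice(line):
    # Shorter spelling; behaves differently when lstrip() returns ''.
    return line[:-len(line.lstrip())]

print(repr(indent_by_lstrip("\t\tcode")))          # '\t\t'
print(repr(indent_by_negative_slice("\t\tcode")))  # '\t\t'
print(repr(indent_by_lstrip("   ")))               # '   '
print(repr(indent_by_negative_slice("   ")))       # ''
```

The last line shows the catch: for an all-whitespace string `lstrip()` returns `''`, and `line[:-0]` is the same as `line[:0]`, so the negative slice returns an empty string instead of the whitespace.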

Phil H 2010-02-15 20:06:32

Your shadowing of the str builtin is confusing

prestomation 2010-02-15 20:12:04

Nice idea. Doesn't work as written. `lstrip()` returns a copy of the string with leading characters removed. Instead: `startwhites = s[:len(s)-len(s.lstrip())]` Btw, I wasn't the one who downvoted.

Adam Bernier 2010-02-15 20:22:12

You probably meant `[:-len(..`.

Roger Pate 2010-02-15 20:51:40

Answer 5

+3 A:

This can also be done with str.isspace and itertools.takewhile instead of regex.

import itertools

tests=['\t\tthis line has two tabs of indention',
       '    this line has four spaces of indention']

def indention(astr):
    # Using itertools.takewhile is efficient -- the looping stops immediately after the first
    # non-space character.
    return ''.join(itertools.takewhile(str.isspace,astr))

for test_string in tests:
    print(indention(test_string))

unutbu 2010-02-15 20:12:08

you can use `str.isspace` in place of the `lambda` function

gnibbler 2010-02-15 20:43:45

@gnibbler: Indeed. Thank you for that!

unutbu 2010-02-15 20:50:18

Answer 6

A:

def whites(a):
    # Everything before the first occurrence of the stripped text is leading whitespace.
    return a[0:a.find(a.strip())]

Basically, my idea is:

Strip the original line
Find where the stripped text starts in the original line; everything before that point is the leading whitespace
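A quick check of how this behaves, including one edge case. `whites_safe` is a hypothetical variant, and it assumes that an all-whitespace line should count entirely as indentation, which may or may not be what you want:

```python
def whites(a):
    return a[0:a.find(a.strip())]

def whites_safe(a):
    stripped = a.strip()
    # str.find('') returns 0, so an all-whitespace line needs its own branch.
    return a[:a.find(stripped)] if stripped else a

print(repr(whites("  code  ")))  # '  '
print(repr(whites("   ")))       # ''
print(repr(whites_safe("   ")))  # '   '
```

For lines that contain any non-whitespace text the two functions agree.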

woo 2010-02-15 20:12:08
