How do you find the phone numbers in 50,000 HTML pages?

Jeff Atwood posted five questions for programmers applying for jobs:


In an effort to make life simpler for phone screeners, I've put together this list of Five Essential Questions that you need to ask during an SDE screen. They won't guarantee that your candidate will be great, but they will help eliminate a huge number of candidates who are slipping through our process today.

1) Coding: The candidate has to write some simple code, with correct syntax, in C, C++, or Java.

2) OO design: The candidate has to define basic OO concepts, and come up with classes to model a simple problem.

3) Scripting and regexes: The candidate has to describe how to find the phone numbers in 50,000 HTML pages.

4) Data structures: The candidate has to demonstrate basic knowledge of the most common data structures.

5) Bits and bytes: The candidate has to answer simple questions about bits, bytes, and binary numbers.

Please understand: what I'm looking for here is a total vacuum in one of these areas. It's OK if they struggle a little and then figure it out. It's OK if they need some minor hints or prompting. I don't mind if they're rusty or slow. What you're looking for is candidates who are utterly clueless, or horribly confused, about the area in question.

>>> The Entirety of Jeff's Original Post <<<


Note: Steve Yegge originally posed the question.

+1  A: 

Perl Solution

By: "MH" via codinghorror.com on September 5, 2008 07:29 AM

#!/usr/bin/perl
use strict;
use warnings;

my @files;
my $phone = qr/\(?(\d\d\d)\)?[ -]?(\d\d\d)-?(\d\d\d\d)/;

FILE: for my $filename (glob '*.html')
{
    # Read the whole file; skip it if it cannot be opened
    open(my $fh, '<', $filename) or next FILE;
    my @data = <$fh>;
    close($fh);

    # Loop once through with simple search
    for my $line (@data)
    {
        if ($line =~ $phone)
        {
            push(@files, $filename);
            next FILE;
        }
    }

    # None found, strip html
    my $text = join('', @data);
    $text =~ s#<[^>]+>##gs;

    # Strip line breaks
    $text =~ s#\n|\r##g;

    # Check for occurrence.
    if ($text =~ $phone)
    {
        push(@files, $filename);
    }
}

# Print out result
print join("\n", @files), "\n";
_ande_turner_
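
The two-pass idea in the Perl answer above (search the raw markup first, then strip tags and line breaks and search again) could be sketched in Python roughly as follows. The function name `has_phone_number` is made up for illustration, and the pattern is copied from that answer, so it inherits the same looseness (any run of ten digits also matches).

```python
import re

# Same pattern as the Perl answer: optional parentheses around the area
# code, an optional space or hyphen, then groups of 3 and 4 digits.
PHONE = re.compile(r"\(?(\d\d\d)\)?[ -]?(\d\d\d)-?(\d\d\d\d)")
TAG = re.compile(r"<[^>]+>")


def has_phone_number(html: str) -> bool:
    """Return True if a number appears in the raw or tag-stripped text."""
    # First pass: search the raw markup.
    if PHONE.search(html):
        return True
    # Second pass: drop tags and line breaks, then search again.
    text = TAG.sub("", html)
    text = text.replace("\n", "").replace("\r", "")
    return PHONE.search(text) is not None
```

The second pass is what catches a number that markup has split up, such as `<b>(555)</b> <i>123</i>-4567`, which the raw search misses.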
+2  A: 

Made this in Java. The regex was borrowed from this forum.

 final String regex = "[\\s](\\({0,1}\\d{3}\\){0,1}" +
   "[- \\.]\\d{3}[- \\.]\\d{4})|" +
   "(\\+\\d{2}-\\d{2,4}-\\d{3,4}-\\d{3,4})";
 final Pattern phonePattern = Pattern.compile(regex);

 /* The result set */
 Set<File> files = new HashSet<File>();

 File dir = new File("/initDirPath");
 if (!dir.isDirectory()) return;

 for (File file : dir.listFiles()) {
  if (file.isDirectory()) continue;

  /* try-with-resources closes each reader, so 50,000 files don't leak handles */
  try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
   String line;
   boolean found = false;
   while (!found && (line = reader.readLine()) != null) {

    if (found = phonePattern.matcher(line).find()) {
     files.add(file);
    }
   }
  }
 }

 for (File file : files) {
  System.out.println(file.getAbsolutePath());
 }

Performed some tests and it went OK! :) Remember, I'm not trying to use the best design here. I just implemented the algorithm.

Marcio Aguiar
Seems to me like this only lists files that have phone numbers; it doesn't actually extract and list the actual numbers.
Outlaw Programmer
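
As a rough sketch of what the comment above asks for, the following Python collects the matched numbers themselves instead of only the file names. The names `extract_numbers` and `scan_directory` are hypothetical, and the pattern is the Java answer's with the leading whitespace requirement dropped, which is an assumption made here so that a number at the start of a line still matches.

```python
import re
from pathlib import Path

# The Java answer's pattern, minus the leading whitespace requirement
# (an assumption for this sketch, not part of the original answer).
PHONE = re.compile(
    r"\(?\d{3}\)?[- .]\d{3}[- .]\d{4}"
    r"|\+\d{2}-\d{2,4}-\d{3,4}-\d{3,4}"
)


def extract_numbers(text: str) -> list[str]:
    """Return every phone-number-like match in text, in order of appearance."""
    return [m.group(0) for m in PHONE.finditer(text)]


def scan_directory(root: str) -> dict[str, list[str]]:
    """Map each *.html file under root to the numbers found in it."""
    results = {}
    for path in Path(root).rglob("*.html"):
        # Undecodable bytes are ignored rather than aborting the scan.
        numbers = extract_numbers(path.read_text(errors="ignore"))
        if numbers:
            results[str(path)] = numbers
    return results
```

Calling `scan_directory("/initDirPath")` would then return a dictionary keyed by file path, with files that contain no match left out.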
+2  A: 

egrep '\(?[0-9]{3}\)?[-[:space:].]?[0-9]{3}[-.][0-9]{4}' *.html

Pat
+10  A: 
Unkwntech
+13  A: 

egrep "(([0-9]{1,2}.)?[0-9]{3}.[0-9]{3}.[0-9]{4})" -R *.html

Jeremy Banks
+1  A: 

I love doing these little problems, can't help myself.

Not sure if it was worth doing though, since it's very similar to the Java answer.

private readonly Regex phoneNumExp = new Regex(@"(\({0,1}\d{3}\){0,1}[- \.]\d{3}[- \.]\d{4})|(\+\d{2}-\d{2,4}-\d{3,4}-\d{3,4})");

public HashSet<string> Search(string dir)
{
    var numbers = new HashSet<string>();

    string[] files = Directory.GetFiles(dir, "*.html", SearchOption.AllDirectories);

    foreach (string file in files)
    {
        using (var sr = new StreamReader(file))
        {
            string line;

            while ((line = sr.ReadLine()) != null)
            {
                // Matches() returns every number on the line, not just the first
                foreach (Match match in phoneNumExp.Matches(line))
                {
                    numbers.Add(match.Value);
                }
            }
        }
    }

    return numbers;
}
sieben
+1  A: 

Here's why phone interview coding questions don't work:

phone screener: how do you find the phone numbers in 50,000 HTML pages?

candidate: hang on one second (covers phone) hey (roommate/friend/etc who's super good at programming), how do you find the phone numbers in 50,000 HTML pages?

Save the coding questions for early in the in-person interview, and make the interview questions more personal, i.e. "I'd like details about the last time you solved a problem using code". That's a question that will beg follow-ups to their details and it's a lot harder to get someone else to answer it for you without sounding weird over the phone.

devinmoore
+1  A: 

Borrowing 2 things from the C# answer from sieben, here's a little F# snippet that will do the job. All it's missing is a way to call processDirectory, which is left out intentionally :)


open System
open System.IO
open System.Text.RegularExpressions

let rgx = Regex(@"(\({0,1}\d{3}\){0,1}[- \.]\d{3}[- \.]\d{4})|(\+\d{2}-\d{2,4}-\d{3,4}-\d{3,4})", RegexOptions.Compiled)

let processFile (contents: string) = rgx.Matches(contents) |> Seq.cast<Match> |> Seq.map (fun m -> m.Value)

let processDirectory path = Directory.GetFiles(path, "*.html", SearchOption.AllDirectories) |> Seq.map(File.ReadAllText >> processFile) |> Seq.concat
emaster70