I have a PHP code base of a few million lines without true separation of display and logic, and I am trying to extract all of the strings in the code for localization. Separating display and logic is a long-term goal, but for now I just want to be able to localize.

In the code, strings are represented in every possible format for PHP, so I need a theoretical (or practical) way to parse our entire source and at the very least LOCATE where each string lives. Ideally, of course, I'd replace every string with a function call, for example

"this is a string"

would be replaced with

_("this is a string")

Of course I'd need to support both single- and double-quoted strings. I'm not too concerned about the other formats; they appear so infrequently that I can change them manually.

Also, I wouldn't want to localize array indexes of course. So strings like

$arr["value"]

should not become

$arr[_("value")]

Can anyone help me get started on this?

+10  A: 

You could use token_get_all() to get all the tokens from a PHP file, e.g.:

<?php

$fileStr = file_get_contents('file.php');

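foreach (token_get_all($fileStr) as $token) {
    //single-character tokens such as ';' are plain strings, so check for an array first
    if (is_array($token) && $token[0] === T_CONSTANT_ENCAPSED_STRING) {
        echo "found string {$token[1]}\r\n";
        //$token[2] is line number of the string
    }
}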

You could do a really dirty check that it isn't being used as an array index by something like:

$fileLines = file('file.php');

//inside the loop and if
$line = $fileLines[$token[2] - 1];
if (false === strpos($line, "[{$token[1]}]")) {
    //not an array index
}

but you will struggle to do this properly, because someone may have written something you are not expecting, e.g.:

$str = 'string that is not immediately an array index';
doSomething($array[$str]);


Edit: As Ant P. says, you would probably be better off looking for [ and ] in the surrounding tokens for the second part of this answer rather than using my strpos hack, something like this:

$tokens = token_get_all(file_get_contents('file.php'));
$num = count($tokens);
for ($i = 0; $i < $num; $i++) {
    $token = $tokens[$i];

    //single-character tokens such as '[' are plain strings, not arrays
    if (!is_array($token) || $token[0] !== T_CONSTANT_ENCAPSED_STRING) {
        //not a string, ignore
        continue;
    }

    //?? avoids an undefined-offset notice at either end of the token list
    $prev = $tokens[$i - 1] ?? null;
    $next = $tokens[$i + 1] ?? null;
    if ($prev === '[' && $next === ']') {
        //immediately used as an array index, ignore
        continue;
    }

    echo "found string {$token[1]}\r\n";
    //$token[2] is line number of the string
}
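
To go from locating strings to actually replacing them, one option is to rebuild the source from the token list, since concatenating the text of every token gives back the original file. The following is only a rough sketch: wrapStrings() is a hypothetical helper name, and it assumes a _() function (as provided by gettext) is available at run time. It shares the limitations of the check above: whitespace inside the brackets defeats the index test, a one-element short array like ['x'] looks like an index and is skipped, double-quoted strings with interpolated variables are not T_CONSTANT_ENCAPSED_STRING tokens and are left alone, and strings in places that only accept constant expressions (class constant values, for example) would break if wrapped.

```php
<?php
// Sketch only: wrap plain string literals in _() unless they sit directly
// between '[' and ']'. wrapStrings() is a hypothetical name, not a PHP built-in.
function wrapStrings(string $source): string
{
    $tokens = token_get_all($source);
    $num = count($tokens);
    $out = '';

    for ($i = 0; $i < $num; $i++) {
        $token = $tokens[$i];

        // Single-character tokens such as '[' or ';' are plain strings.
        if (!is_array($token)) {
            $out .= $token;
            continue;
        }

        $prev = $tokens[$i - 1] ?? null;
        $next = $tokens[$i + 1] ?? null;
        $isIndex = $prev === '[' && $next === ']';

        if ($token[0] === T_CONSTANT_ENCAPSED_STRING && !$isIndex) {
            // $token[1] still includes the original quotes.
            $out .= '_(' . $token[1] . ')';
        } else {
            // Every other token is copied through unchanged.
            $out .= $token[1];
        }
    }

    return $out;
}

echo wrapStrings('<?php echo "hello"; $a["key"] = \'x\';');
// prints: <?php echo _("hello"); $a["key"] = _('x');
```

I would run something like this on a copy of the code base and review the diff rather than trust it blindly.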
Tom Haigh
+1 never knew about this function. That's awesome.
cletus
Only thing is that for $_SESSION['logsession'] it actually gives me found string 'logsession', which is of course not what I want for localization.
Ray
Ah you have since edited.
Ray
@tomhaigh: I would do a second up-vote, if I could. Hats off.
Tomalak
@ray: You can probably figure out whether a string's being used as a string or an array ID by looking at it in context of surrounding tokens. I haven't tried it myself though. YMMV.
Ant P.
A: 

Instead of trying to solve this with an overly-clever command line hack using perl or grep, you should write a program to do this :)

Write a perl/python/ruby/whatever script to search through each file for a pair of single or double quotes. Each time it finds a match, it should prompt you to replace it with your underscore function, and you can either tell it to do it or to skip to the next one.

In a perfect world, you'd write something that would do it all for you, but this would probably take less time in the end, and you'd be faced with fewer errors.

Pseudo:

for fname in yourBigFileList:
    create file handle for actual source file
    create temp file handle (like fname +".tmp" or something)
    for fline in fname:
        get quoted strings
        for qstring in quoted_strings:
            show it in context, i.e. the entire line of code.
            replace with _()?
                if Y, replace and write line to tmp file
                if N, just write that line to the tmp file
    close file handles
    rename it to current name + ".old"
    rename ".tmp" file to name of original file
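
The "get quoted strings" step in the pseudocode could be sketched in PHP with a regular expression, as below. This is a rough illustration, not a real parser: findQuotedStrings() is a hypothetical name, and a pattern like this will also match quotes inside comments or inline HTML and knows nothing about heredocs, so a tokenizer is likely to be more reliable.

```php
<?php
// Sketch only: return every single- or double-quoted literal found on a line.
// findQuotedStrings() is a hypothetical helper name.
function findQuotedStrings(string $line): array
{
    // "..." or '...', where a backslash escapes the following character.
    $pattern = '/"(?:[^"\\\\]|\\\\.)*"|\'(?:[^\'\\\\]|\\\\.)*\'/';
    preg_match_all($pattern, $line, $matches);

    return $matches[0];
}

print_r(findQuotedStrings('echo "hi" . \'there\';'));
```

Each returned string still carries its quotes, so it can be shown to the reviewer exactly as it appears in the source.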

I'm sure there's a more *nix-fu way of doing this, but this method would let you look at each instance yourself and decide. If it's a million lines, each one contains a string, and each one takes you 1 second to evaluate, then it'll take you about 280 hours to do the whole thing... Perhaps you should ignore this post :)

inkedmn
Sorry, but the only relevant part of this answer is the "get quoted strings" step in your pseudocode, which you don't address, so I'm not sure why you've given this answer.
cletus
+4  A: 

Besides associative arrays, there are other situations likely to exist in the code base that an automatic search and replace will utterly break.

SQL queries:

$myname = "steve";
$sql = "SELECT foo FROM bar WHERE name = " . $myname;

Indirect variable references:

$bar = "Hello, World"; // a string that needs localization
$foo = "bar"; // a string that should not be localized
echo($$foo);

SQL string manipulation:

$sql = "SELECT CONCAT('Greetings, ', firstname) as greeting from users where id = ?";

There is no automatic way to filter for all possibilities. Perhaps the solution would be to write an application that creates a "moderation" queue of possible strings and displays each one highlighted and in context of several lines of code. You could then glance at the code to determine if it is a string that needs localization or not and hit a single key to localize or ignore the string.
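
As a starting point for such a queue, something like the following could gather each candidate string together with a few surrounding lines. It is only a sketch under my own assumptions: collectCandidates() and the array keys are made-up names, it uses token_get_all() to find the strings, and the actual approve/ignore interface is left out.

```php
<?php
// Sketch only: build a review queue of string literals with surrounding lines.
// collectCandidates() and the 'string'/'line'/'context' keys are hypothetical.
function collectCandidates(string $source, int $context = 2): array
{
    $lines = explode("\n", $source);
    $queue = [];

    foreach (token_get_all($source) as $token) {
        if (!is_array($token) || $token[0] !== T_CONSTANT_ENCAPSED_STRING) {
            continue;
        }

        $lineNo = $token[2];                              // 1-based line number
        $start  = max(0, $lineNo - 1 - $context);         // first context line (0-based)
        $end    = min(count($lines), $lineNo + $context); // one past the last context line
        $queue[] = [
            'string'  => $token[1],
            'line'    => $lineNo,
            'context' => array_slice($lines, $start, $end - $start),
        ];
    }

    return $queue;
}

$source = "<?php\n\$sql = \"SELECT 1\";\necho 'Hello';\n";
foreach (collectCandidates($source, 1) as $item) {
    echo "line {$item['line']}: {$item['string']}\n";
}
```

In this example both the SQL string and the greeting end up in the queue; deciding that only the second one needs localization is exactly the part that stays manual.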

postfuturist