Sorry for duplicating this question, but here I have tried to explain it in more detail. I need to parse the data from a certain file and store it in a database (MySQL). This is how the data appears in the file:

戚谊 
戚誼 
    [m1][b]qīyì[/b][/m] 
    [m2]translation 1[/m] 
    [m1][b]qīyi[b][/m] 
    [m2]translation 2[/m] 
三州府 
    [m1][b]sānzhōufǔ[/b][/m] 
    [m2]translation of other character[/m]
etc.

The first and second lines represent the same character, but the first line is the simplified form and the second line is the traditional form. I need to store them in the ch_simplified and ch_trad columns respectively.

The third line, which begins with [m1], is the transcription (pinyin); the fourth line, which begins with [m2], is the translation of the character. There is also a second translation of the character, and you can see that it has a different transcription.

We need to store all the transcriptions (sometimes there are more than two for the same character) in a separate column (transcription), and then store the entire translation part in the translation column.

The table in the MySQL database should look like this:

ID  |  ch_simplified  |  ch_trad    | transcription           |   translation               | 
--------------------------------------------------------------------------------------------- 
1.        戚谊             戚誼        [m1][b]qīyì[/b][/m];     [m1][b]qīyì[/b][/m] 
                                      [m1][b]qīyi[b][/m]       [m2]translation 1[/m] 
                                                               [m1][b]qīyi[b][/m] 
                                                               [m2]translation 2[/m] 
---------------------------------------------------------------------------------------------
2.        三州府           三州府      [m1][b]sānzhōufǔ[/b][/m]  [m1][b]sānzhōufǔ[/b][/m] 
                                                               [m2]translation of other character[/m]

The problem is that I don't know how to parse this data using PHP. I tried to start with

$content = file_get_contents('myfile.txt', true);

but got stuck at the step where I have to separate the data for the first character from the data for the second character (戚谊 and 三州府).

Any help would be greatly appreciated!

P.S. Sorry for such a long text and confusing explanation.

A: 

You could use explode() and split on a space or any other character.

Phill Pafford
A: 

Your data fields are on separate lines, so Phill's explode() call would split on the newline character. The basic field acquisition is something like this:

$content = file_get_contents('myfile.txt', true);
if ($content === false) {
  die("Cannot read myfile.txt\n");
}

foreach (explode("\n", $content) as $line)
{
  $line = trim($line);  // remove leading and trailing whitespace (including "\r")
  if ($line === '') {
    continue;  // skip empty lines
  }
  switch (substr($line, 0, 4)) // examine first four bytes
  {
    case '[m1]':
      // brackets and the slash are escaped in the regular expression
      if (preg_match('/^\[m1\](.+)\[\/m\]$/', $line, $matches)) {
        $field = $matches[1];
        echo "pinyin: '$field'\n";
      }
      break;

    case '[m2]':
      if (preg_match('/^\[m2\](.+)\[\/m\]$/', $line, $matches)) {
        $field = $matches[1];
        echo "translation: '$field'\n";
      }
      break;

    default:
      $field = $line;  // for clarity
      echo "character: '$field'\n";
      break;
  }
}

Here, I have not attempted to identify (a) the start of a new record, or (b) which character line is simplified and which is traditional. Both can probably be handled by counting character lines: the first one is simplified, the second is traditional, and the first character line after a run of [m1]/[m2] lines starts a new record. But that's your job.
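A rough sketch of that counting idea follows. It assumes that any non-empty line not starting with [m1] or [m2] is a character line, and that a record with only one character line uses the same text for both forms (as the 三州府 row in the question's table suggests). The function name parseEntries and the array layout are made up for illustration; the keys simply borrow the column names from the question.

```php
<?php
// Hypothetical helper: group the dictionary lines into one array per record.
function parseEntries(string $content): array
{
    $records = [];
    $current = null;   // record currently being built
    $inBody  = false;  // true once a [m1]/[m2] line has been seen for $current

    foreach (explode("\n", $content) as $line) {
        $line = trim($line);
        if ($line === '') {
            continue;
        }
        $tag = substr($line, 0, 4);
        if ($tag === '[m1]' || $tag === '[m2]') {
            if ($current === null) {
                continue; // markup before any character line: ignored in this sketch
            }
            $inBody = true;
            if ($tag === '[m1]') {
                $current['transcription'][] = $line;
            }
            // The question's table keeps both [m1] and [m2] lines in "translation".
            $current['translation'][] = $line;
            continue;
        }
        if ($current === null || $inBody) {
            // First character line after a body (or at the start): new record.
            if ($current !== null) {
                $records[] = $current;
            }
            $current = [
                'ch_simplified' => $line,
                'ch_trad'       => $line, // default when no second line follows
                'transcription' => [],
                'translation'   => [],
            ];
            $inBody = false;
        } else {
            // Another character line straight after the first: traditional form.
            $current['ch_trad'] = $line;
        }
    }
    if ($current !== null) {
        $records[] = $current;
    }
    return $records;
}
```

Each record's lists can then be joined into column values, for example with implode(";\n", $record['transcription']) for the transcription column, before inserting the row.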

Nor have I assessed any issues relating to the non-ASCII character set. I assume you are on top of that stuff.
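For what it's worth, two encoding problems are cheap to rule out before parsing: a byte order mark glued to the first character line, and input that is not valid UTF-8 at all. The sketch below assumes the file is meant to be UTF-8, which the question does not actually state; prepareUtf8 is a hypothetical name.

```php
<?php
// Hypothetical helper: prepare raw file contents for line-by-line parsing.
// Assumption: the file is UTF-8; any other encoding would need converting first.
function prepareUtf8(string $raw): string
{
    // Drop a UTF-8 byte order mark, which would otherwise stick to the first line.
    if (strncmp($raw, "\xEF\xBB\xBF", 3) === 0) {
        $raw = substr($raw, 3);
    }
    // With the /u modifier preg_match() fails on invalid UTF-8,
    // so an empty pattern works as a validity check.
    if (preg_match('//u', $raw) !== 1) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $raw;
}
```

On the MySQL side, the connection and the columns also need a Unicode character set such as utf8mb4, or the characters will be damaged on the way in.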

I have taken the opportunity to separate the content from the markup that wraps it (the [m1] and [m2] tags). Note that the inner [b] tags are still part of the captured field. It's just good practice to keep presentational markup separate from the data proper.
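If the [b] tags are not wanted in a stored field, they could be removed once the field has been captured. A minimal sketch, with stripBold as a hypothetical name; it uses plain string replacement rather than a paired-tag regex so that the malformed sample line from the question ([b]qīyi[b]) is cleaned as well:

```php
<?php
// Hypothetical helper: remove every [b] and [/b] marker from a field.
function stripBold(string $field): string
{
    return str_replace(['[b]', '[/b]'], '', $field);
}
```

Whether to strip them at all depends on the target table: the layout shown in the question keeps the full markup in both columns.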

Ewan Todd
Thank you! That's what I needed.
Josh