I'm working on a parser in PHP that extracts MySQL records from a text file. A line begins with a string naming the table the records (rows) should be inserted into, followed by the records themselves. Each record is introduced by a backslash, and the fields (columns) within a record are separated by commas. For the sake of simplicity, let's assume that we have a table representing people in our database, with the fields First Name, Last Name, and Occupation. Thus, one line of the file might look as follows:
[People] = "\Han,Solo,Smuggler\Luke,Skywalker,Jedi..."
where the ellipsis (...) stands for additional people. One straightforward approach might be to use fgets() to extract a line from the file, and use preg_match() to extract the table name, records, and fields from that line.
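To make the two-pass approach concrete, here is a minimal sketch of it. The helper name parseLine() and the exact regex are my own illustration, not a fixed part of the format, and I use explode() for the splitting once preg_match() has isolated the table name and the quoted payload. It assumes fields never contain a backslash, a comma, or a double quote.

```php
<?php
// Hypothetical helper: parse one line of the form
//   [Table] = "\f1,f2,f3\f1,f2,f3"
// Returns null when the line does not match that shape.
function parseLine(string $line): ?array
{
    // Pass over the line once to capture the table name and the quoted payload.
    if (!preg_match('/^\[(\w+)\]\s*=\s*"(.*)"\s*$/', $line, $m)) {
        return null;
    }

    $records = [];
    // Pass over the payload again: every record is introduced by a backslash,
    // so the piece before the first backslash is empty and gets skipped.
    foreach (explode('\\', $m[2]) as $record) {
        if ($record === '') {
            continue;
        }
        $records[] = explode(',', $record);
    }

    return ['table' => $m[1], 'records' => $records];
}

// Usage: an in-memory stream stands in for the real file.
$fh = fopen('php://memory', 'w+');
fwrite($fh, '[People] = "\Han,Solo,Smuggler\Luke,Skywalker,Jedi"' . "\n");
rewind($fh);

$result = parseLine(fgets($fh)); // the whole line is read into memory first
fclose($fh);
```

Note that the line exists in memory as a whole string, then again as the captured payload, and then a third time as the exploded pieces, which is exactly the redundancy I'm worried about.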
However, let's suppose that we have an awful lot of Star Wars characters to track. So many, in fact, that this line ends up being 200,000+ characters/bytes long. In such a case, taking the above approach to extract the database information seems a bit inefficient. You have to first read hundreds of thousands of characters into memory, then read back over those same characters to find regex matches.
Is there a way, similar to the Java method String next(String pattern) of the Scanner class constructed from a file, that allows you to match patterns in-line while scanning through the file?
The idea is that you don't have to scan through the same text twice (once to read it from the file into a string, and once more to match patterns) or store the text redundantly in memory (in both the file line string and the matched patterns). Would this even yield a significant increase in performance? It's hard to tell exactly what PHP or Java is doing behind the scenes.
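To illustrate the kind of single-pass behaviour I have in mind, here is a rough hand-written sketch. The function name scanRecords(), the state names, and the chunk size are all made up for this example, and it assumes fields never contain a backslash, a comma, or a double quote. I don't know whether something like this would actually outperform fgets() plus preg_match() in PHP; it only shows the shape of the idea.

```php
<?php
// Hypothetical single-pass scanner: reads the stream in fixed-size chunks and
// yields [table, fields] as soon as each record is complete, so the full line
// is never held in memory at once.
function scanRecords($handle, int $chunkSize = 8192): Generator
{
    $state   = 'outside'; // 'outside', 'table', 'header' or 'records'
    $table   = '';        // name of the table currently being read
    $fields  = [];        // fields collected for the record in progress
    $token   = '';        // characters of the current field or table name
    $started = false;     // true once the first backslash of a line was seen

    while (!feof($handle)) {
        $chunk = fread($handle, $chunkSize);
        if ($chunk === false) {
            break;
        }
        $len = strlen($chunk);
        for ($i = 0; $i < $len; $i++) {
            $c = $chunk[$i];
            if ($state === 'outside') {
                if ($c === '[') { $state = 'table'; $token = ''; }
            } elseif ($state === 'table') {
                if ($c === ']') { $table = $token; $token = ''; $state = 'header'; }
                else            { $token .= $c; }
            } elseif ($state === 'header') {
                // Skip the " = " part until the opening quote.
                if ($c === '"') { $state = 'records'; $fields = []; $started = false; }
            } else { // 'records'
                if ($c === ',') {
                    $fields[] = $token;
                    $token = '';
                } elseif ($c === '\\' || $c === '"') {
                    if ($started) {
                        // A backslash or the closing quote ends the current record.
                        $fields[] = $token;
                        yield [$table, $fields];
                    }
                    $fields = [];
                    $token = '';
                    $started = true;
                    if ($c === '"') { $state = 'outside'; }
                } else {
                    $token .= $c;
                }
            }
        }
    }
}

// Usage: a tiny chunk size forces records to span chunk boundaries.
$fh = fopen('php://memory', 'w+');
fwrite($fh, '[People] = "\Han,Solo,Smuggler\Luke,Skywalker,Jedi"' . "\n");
rewind($fh);
$records = iterator_to_array(scanRecords($fh, 7), false);
fclose($fh);
```

Each character is looked at exactly once and each record can be turned into SQL as soon as it is yielded, but the per-character loop runs in PHP itself rather than in C, which is part of why I'm unsure about the performance.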
On fgetcsv()
This function makes it very easy to split lines in a file based on some delimiter, and I'm sure it checks for the delimiter character by character as it scans through the file. However, the problem is that there are essentially two delimiters that I'm looking for, and fgetcsv() only accepts one. For example:
I could use ',' as the delimiter. Provided I changed the file format so that each backslash is also accompanied by a comma, I could read the entire line into an array of fields. The problem, then, is that I need to iterate over all of the fields again to determine where records start and end and to prepare the SQL. Similarly, if I use '\\' as the delimiter (a single backslash, escaped here), then I'll need to iterate over all of the records again to extract the fields and prepare the SQL.
What I am trying to do is to check for both commas and backslashes (and perhaps other things, like the [tablename]) in one fell swoop for maximum performance. If fgetcsv() allowed me to specify multiple delimiters (or a regex), or allowed me to change what it considers to be the "end of a line" (from \n or \r\n to just \), then it would work perfectly, but that doesn't seem possible.