views:

151

answers:

2

I use currently Vim to read SO's data dump. However, my Macbook slows down when I roll down just a few rows. This suggests me that there must be more efficient ways to read the data.

I know little MySQL. The files are in .xml -format. It is rather hard to read the data at the moment in .xml. It may be more efficient to convert the xml -files to MySQL and then read the files. I know only MS db -tool for such actions. However, I would like to know another tool too.

Problems

  1. to parse .xml to SQL -queries such that MySQL understand it. We need to know data structures of the data.
  2. to run the data in MySQL
  3. to find some tool similar to MS db -tool by which we can read the data effectively

How do you read SO's data dump effectively?

--

[edit]

  1. How can you run the 523 SQL queries to create the database in your terminal? I have the commands at the moment in a text -file.
  2. How can you "switch to [the recovery mode] to a simple recovery mode in the database?
+1  A: 

I found this link.

Masi
+2  A: 

I made my first ever python program to read them and output SQL insert statements for use with mysql (It's ugly but worked). You'll need to create the tables first though by hand.

import xml.sax.handler
import xml.sax
import sys
class SOHandler(xml.sax.handler.ContentHandler):
        def __init__(self):
                self.errParse = 0

        def startElement(self, name, attributes):
                if name != "row":
                        self.table = name;
                        self.outFile = open(name+".sql","w")
                        self.errfile = open(name+".err","w")
                else:
                        skip = 0
                        currentRow = u"insert into "+self.table+"("
                        for attr in attributes.keys():
                                currentRow += str(attr) + ","
                        currentRow = currentRow[:-1]
                        currentRow += u") values ("
                        for attr in attributes.keys():
                                try:
                                        currentRow += u'"{0}",'.format(attributes[attr].replace('\\','\\\\').replace('"', '\\"').replace("'", "\\'"))
                                except UnicodeEncodeError:
                                        self.errParse += 1;
                                        skip = 1;
                                        self.errfile.write(currentRow)
                        if skip != 1:
                                currentRow = currentRow[:-1]
                                currentRow += u");"
                                #print len(attributes.keys())
                                self.outFile.write(currentRow.encode("utf-8"))
                                self.outFile.write("\n")
                                self.outFile.flush()
                                print currentRow.encode("utf-8");

        def characters(self, data):
                pass

        def endElement(self, name):
                pass

if len(sys.argv) < 2:
        print "Give me an xml file argument!"
        sys.exit(1)

parser = xml.sax.make_parser()
handler = SOHandler()
parser.setContentHandler(handler)
parser.parse(sys.argv[1])
print handler.errParse
nos
Your code parses the XML -syntax out, and converts the syntax apparently to HTML. --- How did you manage to put, for instance, the posts.xml to MySQL after running the to it. --- I have now the file posts.sql. --- I run unsuccessfully: `mysql -D posts.sql`.
Masi
I run `sudo mysql; \. /users/cs/SO_db/posts.sql` . I get a lot of the following error message `ERROR 1046 (3D000): No database selected`.
Masi
@ I get now the following error after selecting a database: `ERROR 1146 (42S02): Table 'personal.posts' doesn't exist`
Masi
I opened a new thread based on the error message at http://stackoverflow.com/questions/1147955/to-run-a-sql-file-in-mysql
Masi
I don't convert anything to HTML.Also. You NEED to manually create the tables first. Either peek at the XML directly or peek at the SQL output to see what columns you'll need.
nos