I have a Word document in .docx format whose data follows a repeating pattern.

I would like to take each record from the repeating set and load it into a row of a SQL table.

Sample of data here:

Question No : 1
How is LINQ to SQL different from Entities?

A. Answer 1
B. Answer 1
C. Answer 1
D. Answer 1

Answer : D
Explanations : 
Some explanation.

Question No : 2
How is NVARCHAR different from VARCHAR

A. Answer 1
B. Answer 1
C. Answer 1
D. Answer 1

Answer : D
Explanations : 
Some explanation.

I can think of a few approaches:
- Read the document as docx using the Office API
- Save the document as XML from Word and parse the XML [the converted XML document doesn't seem to have a structure/schema]
- Save the document as HTML from Word and parse the HTML [the DOM structure is not well formed]

Which of the above would you suggest, and why? Are there any tools that help convert a document and upload it to a SQL table or an Access DB?

Thanks!

+1  A: 

A DOCX file is just a ZIP archive containing a directory tree of XML files. Use WinZip or 7-Zip to extract it to a set of subdirectories. Upload those XML files to SQL Server, recording each file's name and folder path. Then use the SQL Server XML methods (.nodes(), .value(), etc.) to shred them into the relational form that you want.

Note that these do have XML schemas and structures.
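To illustrate the "ZIP of XML files" point outside SQL Server, here is a minimal Python sketch that opens a .docx directly and pulls out the text of each paragraph from word/document.xml. The function name docx_paragraphs is made up for this example, and it assumes each line of the sample is its own Word paragraph; manual line breaks inside a paragraph would need extra handling.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the WordprocessingML elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the text of every paragraph in a .docx file.

    `source` may be a file path or a binary file-like object.
    (Hypothetical helper name, for illustration only.)
    """
    # The main document body lives in word/document.xml inside the ZIP.
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text can be split across several <w:t> elements.
        text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        paragraphs.append(text)
    return paragraphs
```

Calling docx_paragraphs("questions.docx") (file name assumed) would return a list of strings that can then be parsed line by line, with no intermediate export from Word.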

RBarryYoung
A: 

If you are only going to process these files occasionally, I'd say save the document in a different format that is easier to process (maybe even plain text). If the import into the DB is going to be performed on a regular basis, go for native DOCX processing without converting to an intermediate format. A quick Google search shows that there are components available that can read the docx format into a database (e.g. http://www.brothersoft.com/code-library-for-.net-%28sql-server-msde%29-22050.html)
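As a rough sketch of the plain-text route, the Python below parses lines in the layout shown in the question into one record per question and inserts them into a table. Everything named here is an assumption for illustration: the functions parse_questions and load_into_sqlite, the questions table and its columns, and the use of sqlite3 as a stand-in for SQL Server or Access. It also assumes exactly the markers from the sample ("Question No :", "A." to "D.", "Answer :", "Explanations :").

```python
import re
import sqlite3

# Patterns for the marker lines seen in the sample data.
QUESTION_RE = re.compile(r"^Question No\s*:\s*(\d+)\s*$")
OPTION_RE = re.compile(r"^([A-D])\.\s*(.*)$")
ANSWER_RE = re.compile(r"^Answer\s*:\s*([A-D])\s*$")
EXPLANATION_RE = re.compile(r"^Explanations?\s*:\s*(.*)$")


def parse_questions(lines):
    """Turn an iterable of text lines into a list of question dicts."""
    records, current, section = [], None, None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        m = QUESTION_RE.match(line)
        if m:
            # A new "Question No" line starts a new record.
            current = {"number": int(m.group(1)), "question": "",
                       "options": {}, "answer": None, "explanation": ""}
            records.append(current)
            section = "question"
            continue
        if current is None:
            continue  # ignore anything before the first question
        m = ANSWER_RE.match(line)
        if m:
            current["answer"], section = m.group(1), None
            continue
        m = EXPLANATION_RE.match(line)
        if m:
            current["explanation"], section = m.group(1), "explanation"
            continue
        m = OPTION_RE.match(line)
        if m and section != "explanation":
            current["options"][m.group(1)] = m.group(2)
            section = "options"
            continue
        if section in ("question", "explanation"):
            # Free text belongs to the question or explanation body.
            current[section] = (current[section] + " " + line).strip()
    return records


def load_into_sqlite(records, conn):
    """Insert parsed records into a hypothetical `questions` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS questions ("
        "number INTEGER PRIMARY KEY, question TEXT, "
        "option_a TEXT, option_b TEXT, option_c TEXT, option_d TEXT, "
        "answer TEXT, explanation TEXT)"
    )
    rows = [
        (r["number"], r["question"],
         r["options"].get("A"), r["options"].get("B"),
         r["options"].get("C"), r["options"].get("D"),
         r["answer"], r["explanation"])
        for r in records
    ]
    # Parameterized inserts avoid quoting problems in the question text.
    conn.executemany("INSERT INTO questions VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
```

With the document saved as text, something like load_into_sqlite(parse_questions(open("questions.txt", encoding="utf-8")), sqlite3.connect("questions.db")) would do the import; for SQL Server the same parameterized inserts could go through a different database driver.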

DmitryK