[Information Archive]
Title: [mssql] Alarums and Excursions
Author: Kim Young-il
Date: 06-05-10 10:29
Views: 1783
Trackback URL: http://netcop.woobi.co.kr/bbs/tb.php/k_pds/28

Martin Heller builds a data access layer with SQL Server 2005 and Visual Studio 2005, parses a Word document, puts his world in his briefcase, and surveys his wireless neighborhood.
-------------------------------------------------------------------------------------------------
Alarums and Excursions
By Martin Heller
May 1, 2006
Recently, I've been working on an application that reads Word documents, extracts key information, and stuffs that into a database. I know that sounds like a joke definition of a brain, but it's for real.
I have a client that does a lot of technology assessments, all conforming to a small number of Word templates. In the past, investigators could look at old assessment documents on a network share and pick out any information that interested them. That didn't work very well once there were thousands of old documents, until one of the administrators built a Google Desktop index of the archive directory.
The local Google index made finding things very easy, at least for the in-house people. Before we could figure out a solution for the outside contractors, however, the lawyers put in their two cents, and all of a sudden nobody was allowed to read old assessments except the executives and editors, on the theory that the documents might contain proprietary information.
Fortunately, it turned out that the genuinely useful information in old assessments is non-proprietary: market size estimates, the contact information of the people the investigator talked to, the patents the investigator found, and the references the investigator consulted. All of that stuff happens to be contained in tables or footnotes, and all the indexing information that describes the area of the assessment--the Library of Congress headings, the patent classes, and so on--is held in one table on the first page.
Parsing a Word Document
I've known for a long time that Microsoft Word had an extensive document object model, but as a C++ developer I found that model difficult to use. It was really designed for VB programmers, and it wasn't supported very well with professional development tools.
In the last couple of years, Microsoft has come out with Visual Studio Tools for Office, which makes the experience of programming against Word, Excel, Outlook, and InfoPath more like writing a managed code application. This is documented nicely in Visual Studio Tools for Office, by Eric Carter and Eric Lippert (Addison Wesley, 2006). Full disclosure: This is a book in the .NET Development Series, of which I am Series Editor.
Following the excellent advice I gleaned from the book, I built a console application in C# using Visual Studio 2005 Team Edition for Developers, added references to Microsoft.Office.Core, VBIDE, and Word, and set up the following using statements:
using Office = Microsoft.Office.Core;
using Word = Microsoft.Office.Interop.Word;
With these in place, I could open a Word document with:
Word.Application theWordApp = new Word.Application();
object missing = Type.Missing;
object read_only = true;
object confirm = false;     // don't prompt to confirm format conversions
object saveChanges = false; // discard any changes when quitting
...
Word.Document theDoc = null;
try
{
    theDoc = theWordApp.Documents.Open(ref filename, ref confirm,
        ref read_only, ref missing, ... ref missing);
}
catch (COMException e)
{
    Console.WriteLine("COM exception opening document\n{0}", e);
    theWordApp.Quit(ref saveChanges, ref missing, ref missing);
    return 3;
}
Once the document was open, I could walk through the various collections in the object model and grab what I needed, for example:
char[] trimChars = { '\r', '\a' }; // Word table cell text ends with "\r\a"
string Customer = theDoc.Tables[1].Rows[1].Cells[1].Range.Text.TrimEnd(trimChars).Trim();
This is not pretty code, by any means, but it isn't quite as hard to write as you'd think from looking at it, since Visual Studio's IntelliSense pops up options every time you type a period.
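To show what walking the rest of that first-page indexing table looks like, here's a minimal sketch that continues the snippet above. The two-column label/value layout is my assumption about the template, not something from the original code, and like all Word interop code it needs Word installed to run:

```csharp
// Walk the label/value pairs in the first table of the open document.
// Word collections are 1-based, and cell text ends with "\r\a",
// so we trim those characters off each cell.
char[] trimChars = { '\r', '\a' };
Word.Table indexTable = theDoc.Tables[1];
for (int row = 1; row <= indexTable.Rows.Count; row++)
{
    string label = indexTable.Cell(row, 1).Range.Text.TrimEnd(trimChars).Trim();
    string value = indexTable.Cell(row, 2).Range.Text.TrimEnd(trimChars).Trim();
    Console.WriteLine("{0}: {1}", label, value);
}
```

The Table.Cell(row, column) accessor is handy here because it avoids allocating Row and Cell collection objects for every lookup.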
An Easy Data Access Layer
Carter and Lippert don't address building a data access layer, but another book in my series does: Brian Noyes' Data Binding with Windows Forms 2.0 (Addison Wesley, 2006). The book is about Windows Forms applications, and I was writing a console application, but the advice worked just the same: Brian recommends building a data access layer as a DLL using Visual Studio 2005's ability to create a Typed Data Set from a Data Source graphically with the Data Set Designer.
It's trivially easy to drag tables from a SQL Server 2005 database schema to the Visual Studio 2005 Data Set Designer. In this version of the Designer, the relations between tables carry over from the database schema automatically, and strongly typed TableAdapter classes are generated when you save the design. If you do everything right, working with the database from your application code becomes almost automatic:
// for all footnotes: extract footnote and store
FootnoteTableAdapter fta = new FootnoteTableAdapter();
foreach (Word.Footnote footnote in theDoc.Footnotes)
{
    string Note = footnote.Range.Text.TrimEnd(trimChars).Trim();
    fta.Insert(Proj_nr, Note); // ID is generated automatically
    Console.WriteLine("Note saved: {0}", Note);
}
On the other hand, if you don't do things in the right order, the code generation can get fouled up. I forgot to mark the primary key fields of some of my tables as "autoincrement" before I added them to the Data Set Designer, so the Designer generated Insert statements for them that included the key fields.
When I changed the field properties in both the database and the Designer and regenerated the TableAdapter classes, the Insert statements didn't update, and the incorrect Insert statements threw SQL exceptions at run time. I worked around the problem by adding my own InsertQuery methods to the TableAdapter classes, without the primary key fields. That worked, but it was ugly.
What I really should have done was to save the SQL for my other custom query methods elsewhere, delete the changed table from the diagram in the Data Set Designer, drag the updated table from the SQL schema to the Data Set Designer, and finally add my other custom query methods back to the TableAdapter class. Live and learn.
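The workaround described above can take advantage of the fact that designer-generated TableAdapters are partial classes, so a hand-written insert can live in a separate file that survives regeneration. This is a sketch under assumptions: the namespace name, the table columns, and the accessibility of the generated Connection property all depend on your typed DataSet and designer settings:

```csharp
using System.Data;
using System.Data.SqlClient;

// The generated adapters live in a namespace named after your typed
// DataSet; "AssessmentDataSetTableAdapters" is an assumed example.
namespace AssessmentDataSetTableAdapters
{
    public partial class FootnoteTableAdapter
    {
        // Insert that omits the identity key column, so SQL Server
        // can generate the ID itself.
        public int InsertWithoutKey(int projNr, string note)
        {
            using (SqlCommand cmd = new SqlCommand(
                "INSERT INTO Footnote (Proj_nr, Note) VALUES (@Proj_nr, @Note)",
                this.Connection))
            {
                cmd.Parameters.AddWithValue("@Proj_nr", projNr);
                cmd.Parameters.AddWithValue("@Note", note);
                if (cmd.Connection.State != ConnectionState.Open)
                    cmd.Connection.Open();
                return cmd.ExecuteNonQuery();
            }
        }
    }
}
```

Because the method lives in a hand-written partial class file, regenerating the DataSet from the Designer won't overwrite it.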
Why is a strongly typed TableAdapter an advantage over sending SQL strings to the database directly through ADO.NET? There are two good reasons--security and maintainability. If you have strings that run as SQL queries, then you have to be very careful to check your code for possible SQL injection attacks, especially if you ever let users enter strings. In addition, if you change your database in the future and forget to change your SQL strings, you will break the application and only find out about it at runtime, which is potentially dangerous. With a strongly typed TableAdapter, the SQL strings are much less vulnerable, and belong to the TableAdapter, not to your application. You still have to keep everything in synch, but the tools, compiler, and runtime system help you out quite a bit.
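The injection risk is easiest to see side by side. This is an illustrative sketch, not code from the article: the table, the userInput variable, and the connection string are stand-ins:

```csharp
using System.Data.SqlClient;

string userInput = "alarum"; // stands in for text an end user typed
string connectionString = "..."; // your SQL Server 2005 connection string

// UNSAFE: the user's text becomes part of the SQL statement itself.
// Input like  '; DROP TABLE Footnote; --  would execute as SQL.
string unsafeSql =
    "SELECT Note FROM Footnote WHERE Note LIKE '%" + userInput + "%'";

// SAFER: the value travels as a typed parameter, never as SQL text,
// which is essentially what the generated TableAdapter queries do.
using (SqlConnection conn = new SqlConnection(connectionString))
using (SqlCommand cmd = new SqlCommand(
    "SELECT Note FROM Footnote WHERE Note LIKE @pattern", conn))
{
    cmd.Parameters.AddWithValue("@pattern", "%" + userInput + "%");
    conn.Open();
    // ... cmd.ExecuteReader() ...
}
```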
Your World In your Briefcase
I didn't know that David Em would be looking at Seagate's 120-GB Portable Drive when I asked Seagate for a review unit of a "paperback" 160-GB USB 2.0 Portable Drive, but I independently came to essentially the same conclusion that he did: these Seagate USB-2 drives are absolutely killer peripherals if you need to take your "world" on the road. It takes about a minute to hook one up, they don't need any power other than the USB connection, they're plenty fast, and they run cool and quietly. They're small (1x3.7x5 inches), they're light (about 10 ounces), they're sturdy, and they fit in a briefcase.
They're also very effective as external backup devices, although a little expensive for the purpose. The Seagate 160-GB Portable External Hard Drive currently sells online for about $350, or a little over $2/GB.
Over at Information Week, Fred Langa has built a rather non-portable terabyte PC server for $500, or about $0.50/GB, using four generic drives and a bare-bones free-standing PC kit. In between, you can find larger brand name external hard drives for $0.50-1.00/GB. These typically have a larger case (2.08x6.81x6.45 inches for a single-drive 400-GB unit I looked at online, and 6.6x8.7x9.5 inches for a 4-drive 1 TB unit), and need external power. They aren't nearly as convenient to carry around as the "paperback" drives, but at 3 pounds or so for a single-drive unit they won't give you a hernia.
The Seagate 160-GB Portable External Hard Drive is absolutely worth the price premium if you need to transport lots of data conveniently, but it's not if you just need more storage on your network. Portability has its price in storage, just as it does in computers.
WiFi Woes
In my testing of the super-portable Nokia 770 Internet Tablet last month, I ran into a wireless networking issue which I didn't write up then because I was unsure of the resolution. The issue simmered at the non-reproducible edge of annoyance for a week, but one day the device would simply no longer connect with my office WiFi network, and I had something solid that I could try to diagnose and fix.
My office wireless router is a D-Link DI-624, which is capable of going into "Super-G" mode. If you enable Super-G with turbo, the device is restricted to broadcasting on Channel 6.
I was optimistic when I configured the router over a year ago, and set it up for the fastest possible mode. At the time, there was only one other wireless network that could be detected from my office, its signal was weak, and none of the laptops that passed through my office had any trouble connecting to my wireless network once I typed in the WPA key.
When the Nokia 770 had problems connecting, it was seeing three other WiFi networks as well as mine, and one was reasonably strong. On general principles, I set the router to disable Super-G mode completely and to auto-select a channel, I rebooted the router, and I tried connecting the device again. It authenticated immediately, Bob's your uncle, all's right with the world.
Well, not so fast. My PhD is in experimental physics, and I can't just let things go. Why was there a problem, and why did my bubble-gum fix work? Without more information about the wireless signals, I couldn't tell, so I asked Canary Wireless for a Digital Hotspotter.
A Scan with the Digital Hotspotter
My first scan of the office with the Hotspotter told me what I needed to know: there were five networks getting to my office (one more than the Nokia 770 could identify), including my own, and three of them were on Channel 6, which is the default for many routers. The fifth network was on Channel 11, another common default. Two of the three foreign Channel 6 networks were strong enough to interfere with my network, so when I did my ad-hoc router settings change and reboot, my router had selected Channel 9 to avoid them.
Shamefully, two of the outside networks visible in my office were (and still are) wide open, one called "NETGEAR", and one called "linksys". These are the default SSID names for those brands of router, and what that told me is that two of the businesses in my building installed wireless routers, turned them on, and never configured them for security. I may go on a snark hunt with the Hotspotter to find them and fix them some day when I have the time, although I don't know how friendly my reception will be. Meanwhile, I'm satisfied that my own network is secure, and is now working pretty well.
As a lark, I took the Hotspotter on a walk home with me. There isn't anywhere in my neighborhood that lacks a WiFi signal of some sort. There isn't anywhere in my neighborhood where you would have to go more than 100 yards to find an open WiFi network that you could hop onto and use to browse the Internet if you had the inclination. I didn't try to do that, and I certainly didn't and wouldn't try to hack into my neighbors' networks and computers - but I wouldn't be at all surprised to find at least one that shares its root drive to all and sundry.
O tempora, o mores!
BIO: Martin Heller is a Web and Windows programming consultant, an author, and a Senior Contributing Editor at BYTE.com. You can reach Martin at MrCLP@mheller.com.