Hi,

I'm pretty new to C#. Could somebody please point me in the right direction on how to parse the following text file?

The program I am trying to implement will do the following:

It will ask the user to enter a directory, search that directory for text files, loop through them, parse each one, and save the results in a single-table database. The text files have the following structure:

(This is text file 1)

001 - Milan (Citizens)

Pitch Street

  John Doe               15, F1 2             35022I        
  Janette Doe            17, F7 2             32345I            

Angel Street

  Mark Skate             12, F3 2             35532I        
  Jacqueline Skate       18, F6 2             54343I

(This is text file 2)

002 - Rome (Citizens)

Colosseum Street

  Christian Troy         21, F8 5             21354I        
  Janette Doe            17, F7 2             23453T            

Pope Street

  Sean McNamara          Villa McNamara       12424I        
  Julia McNamara         Villa McNamara       43344I

etc...

"001 - Milan", etc. is the town. It appears once, at the beginning of every text file. "Colosseum Street", etc. is the street name. Under every street there is a list with 3 columns: name, address, ID card.

What I need is to insert every citizen into a database. The database will have one table with the following format:

name, surname, address, id_card, town, street

Therefore, every citizen must be stored in some kind of array, and each entry will also contain the citizen's respective town and street.

If somebody can give me some ideas on how to parse the format of this text file it would be great, since it has a bit of an unusual format. Also please note that the spaces between name, address and id card are actual spaces and not tabs.

Many thanks in advance!

Regards, Chris

A: 

Are you stuck with this file format? (Because it's terrible! ;) At the moment there is no clear way for the parser to distinguish between a street and a person. If you are creating this file structure from scratch, it would be better to do it in XML, or even CSV.

UpTheCreek
vote +1 for changing to XML
Simon
Looks pretty structured to me.
Kev
streets don't have padding whereas persons do.
Kris
Streets have padding too. This is only going to work based on fixed columns; I wouldn't like to write a regex to match this. Some names will have 3 parts.
Henk Holterman
There is a clear way to distinguish them: '\t'
Lars Mæhlum
@Lars - that was my first thought, but no tabs in the file.
Kev
Just for the record, when I commented on this it had much less structure (I don't think it was in a code block). You couldn't see anything that looked like a tab for instance. Now it's better, but still far from ideal.
UpTheCreek
+8  A: 

Try breaking the problem into smaller problems
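
For instance, here is a minimal sketch of how the first few steps from the question (find the text files, read one, throw away the blank lines) could each become a small method. The class and method names (ImportSteps, FindTextFiles, ReadLines, NonBlankLines) are made up for illustration, and it assumes the files use a .txt extension:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ImportSteps
{
    // Step 1: find the text files in the directory the user entered.
    // Assumption: the files have a .txt extension.
    public static string[] FindTextFiles(string directory)
    {
        return Directory.GetFiles(directory, "*.txt");
    }

    // Step 2: read one file into memory, one string per line.
    public static string[] ReadLines(string path)
    {
        return File.ReadAllLines(path);
    }

    // Step 3: drop blank (or whitespace-only) lines so that later steps
    // only ever see town, street and citizen lines.
    public static List<string> NonBlankLines(IEnumerable<string> lines)
    {
        var result = new List<string>();
        foreach (string line in lines)
        {
            if (line.Trim().Length > 0)
                result.Add(line);
        }
        return result;
    }
}
```

Parsing the remaining lines and saving them to the database are then two further, separate problems.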

Matt Breckon
+1 for common sense.
Kev
Thanks for the tutorial links. I have now managed to parse a text file with thousands of rows, and it looks good.
Chris
+1  A: 

You have two options:

  1. Read one line at a time: the first line will be your city information, a line starting in column 0 (no leading spaces) will be your street name, and lines beginning with two spaces will be your citizen information.
  2. Build a regular expression that matches the file format and apply it to the whole file at once.
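
A rough sketch of the first option, assuming the layout in the sample files holds everywhere: the town line starts with a digit, citizen lines start with at least two spaces, and anything else that is not blank is a street. LineKind and LineClassifier are hypothetical names, and a street that itself starts with a digit would be misread as a town line:

```csharp
using System;

public enum LineKind { Blank, Town, Street, Citizen }

public static class LineClassifier
{
    public static LineKind Classify(string line)
    {
        // Empty or whitespace-only lines carry no data.
        if (line.Trim().Length == 0)
            return LineKind.Blank;

        // Citizen rows are indented by at least two spaces.
        if (line.StartsWith("  "))
            return LineKind.Citizen;

        // Assumption: only the town line begins with a digit ("001 - Milan ...").
        if (char.IsDigit(line[0]))
            return LineKind.Town;

        // Everything else is treated as a street name.
        return LineKind.Street;
    }
}
```

The reading loop then becomes a switch on Classify(line) that remembers the current town and street and emits one record per citizen line.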
Rubens Farias
or: lines beginning with *3 digits* are city information, lines beginning with *a letter* are the street name, lines beginning with *spaces* are citizen information. You can do it with regular expressions.
pavium
@pavium, thanks for your comments; I'm trying to craft a single regex to do that; can you help?
Rubens Farias
A: 

Here's some code that might help you get started. I've made a number of assumptions based on the data file format:

  1. Each row in the person address has the Name, Building/Flat and Card ID at fixed positions.
  2. The name of the person is Firstname and Surname (although the code can cope with any number of middle names/initials)
  3. The Town ID and name are on the first row
  4. The person row always starts with at least two spaces
  5. Empty rows are just that, empty

It's a bit of a hack and doesn't use any regular expressions, but it does work for the layout examples given above (I'm presuming these are machine generated). The code just parses a single file into Citizen objects, which you can then insert into a database table; I'm assuming you know how to do that.

I'm sure there's plenty of room for optimisation, but it's there to get you going:

using System;
using System.IO;

namespace AddressParser
{
  class Program
  {
    public class TownInfo
    {
      public int TownID { get; set; }
      public string TownIDAsString { get; set; }
      public string Town { get; set; }
    }

    public class Citizen
    {
      public TownInfo Town { get; set; }
      public string Street { get; set; }
      public string FirstName { get; set; }
      public string Surname { get; set; }
      public string Building { get; set; }
      public string Flat { get; set; }
      public string CardID { get; set; }
    }

    static void Main(string[] args)
    {
      string dataFile = @"d:\testdata\TextFile1.txt";

      ParseAddressFileToDatabase(dataFile);
    }

    static void ParseAddressFileToDatabase(string dataFile)
    {
      using(StreamReader sr = new StreamReader(dataFile))
      {
        string line;
        bool isFirstLine = true;

        string currentStreet = null;
        TownInfo townInfo = null;

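        while((line = sr.ReadLine()) != null)
        {
          // Skip empty (or whitespace-only) rows
          if(line.Trim() == String.Empty)
            continue;

          if(isFirstLine)
          {
            // The first non-empty row holds the town ID and name
            townInfo = ParseTown(line);
            isFirstLine = false;
            continue;
          }

          if(line.StartsWith("  "))
          {
            // Person rows always start with at least two spaces
            Citizen citizen = ParseCitizen(line, townInfo, currentStreet);

            //
            // Insert record into DB here
            //

            continue;
          }

          // Any other non-empty row is a street name
          currentStreet = line.Trim();
        }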
      }
    }

    private static TownInfo ParseTown(string line)
    {
      // Split on the first '-' only, so a hyphenated town name stays in one piece
      string[] town = line.Split(new[] { '-' }, 2);
      return new TownInfo()
      {
        TownID = Int32.Parse(town[0].Trim()),
        TownIDAsString = town[0].Trim(),
        Town = town[1].Replace("(Citizens)","").Trim()
      };
    }

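    private static Citizen ParseCitizen(string line, TownInfo townInfo, string currentStreet)
    {
      // Pad short rows so the fixed-width Substring calls below cannot run past the end
      line = line.PadRight(46);

      // Name column: the first token is the first name, the last token is the surname
      string[] name = line.Substring(2, 23).Trim().Split(' ');

      string firstName = name[0];
      string surname = name[name.Length - 1];

      // Assumes fixed positions for some fields
      string buildingOrFlat = line.Substring(24, 22).Trim();
      string cardID = line.Substring(46).Trim();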

      // Split building or flat
      string[] flat = buildingOrFlat.Split(',');

      return new Citizen()
      {
        Town = townInfo,
        Street = currentStreet,
        FirstName = firstName,
        Surname = surname,
        // Split always returns at least one element, so flat[0] is the whole field when there is no comma
        Building = flat[0].Trim(),
        Flat = flat.Length == 2 ? flat[1].Trim() : "",
        CardID = cardID
      };
    }
  }
}
Kev
+1  A: 

It would be nice if the OP could change the format, but that is not stated as a possibility.

I think ONE approach is to ...

  1. Generate a lot of examples of the text file that cover all the possible scenarios.
  2. Use that as a guide to compose regular expressions for structure of the text (or parts of it).
  3. Write parsing code that takes, as input, the text each expression has matched, with one parser for each regex you created.
  4. Stuff the parsed stuff into whatever data structure.

The regular expressions serve as a cheap and fast way to validate the format, and also as a "staging" step that makes your parser simpler.
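
As a starting point, here is a sketch of two such expressions built only from the sample lines in the question. It assumes the three columns of a citizen line are separated by at least two spaces, that no field contains two consecutive spaces, and that the ID card has no spaces in it. LinePatterns and the group names are invented for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinePatterns
{
    // "001 - Milan (Citizens)" -> id = "001", name = "Milan"
    public static readonly Regex Town =
        new Regex(@"^(?<id>\d{3})\s*-\s*(?<name>.+?)\s*\(Citizens\)\s*$");

    // Two or more leading spaces, then name, address and ID card,
    // each separated from the next by two or more spaces.
    public static readonly Regex Citizen =
        new Regex(@"^\s{2,}(?<name>\S.*?)\s{2,}(?<address>\S.*?)\s{2,}(?<card>\S+)\s*$");
}
```

A line that fails to match any expression can be logged and skipped, which gives you the format validation for free. If real files turn out to contain fields with double spaces, fixed column positions are the safer choice.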

Angelo
A: 

I hope I'm not too late to suggest that your database structure needs work (there should be plenty of answers to help you solve your main problem).

You shouldn't store your address against your citizen - you'll come a cropper in the future. Instead, have separate tables:

Citizen: ID, Name, Surname, IDCard

Address: ID, Address, Town, Street

CitizenAddress: CitizenID, AddressID

So you have one table with the name and id card details of the citizen and another that holds addresses - then the address is linked to the citizen using the "CitizenAddress" table.

What benefit does this give you?

Well, if you have two citizens at one address, you only need to store the address once. Also, if you have a scenario where a citizen may be listed at two addresses, the same applies. You can expand this structure to maintain a history of where a citizen lived at a point in time - as you don't need to overwrite the address when they move.
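
To make the idea concrete, here is a small in-memory sketch of the three tables. It is only a stand-in for real database tables: the class names are invented, the IDs are simple counters, and the lookup key assumes none of the values contains a '|' character.

```csharp
using System;
using System.Collections.Generic;

public class CitizenRow { public int ID; public string Name; public string Surname; public string IDCard; }
public class AddressRow { public int ID; public string Address; public string Town; public string Street; }
public class CitizenAddressRow { public int CitizenID; public int AddressID; }

public class AddressBook
{
    public readonly List<CitizenRow> Citizens = new List<CitizenRow>();
    public readonly List<AddressRow> Addresses = new List<AddressRow>();
    public readonly List<CitizenAddressRow> Links = new List<CitizenAddressRow>();

    // Maps "town|street|address" to the ID of the address row already stored.
    private readonly Dictionary<string, int> addressIds = new Dictionary<string, int>();

    public void Add(string name, string surname, string idCard,
                    string address, string town, string street)
    {
        var citizen = new CitizenRow { ID = Citizens.Count + 1, Name = name, Surname = surname, IDCard = idCard };
        Citizens.Add(citizen);

        // Store each distinct address only once.
        string key = town + "|" + street + "|" + address;
        int addressId;
        if (!addressIds.TryGetValue(key, out addressId))
        {
            addressId = Addresses.Count + 1;
            Addresses.Add(new AddressRow { ID = addressId, Address = address, Town = town, Street = street });
            addressIds[key] = addressId;
        }

        // Link the citizen to the address.
        Links.Add(new CitizenAddressRow { CitizenID = citizen.ID, AddressID = addressId });
    }
}
```

With the sample data, adding Sean and Julia McNamara produces two citizens, two links, and a single "Villa McNamara" address row.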

Sohnee