I'm looking to create a hash table from text output that looks like this (the whitespace between words is tabs):

GCOLLECTOR     123456     77889     uno  
BLOCK     unique111    error     fullunique111     ...     ...     ...  
DAY     ... ... ...  
LABEL     detail     unique111     Issue     Broken - The truck broke  
LABEL     detail     unique111     Folder    3c1  
LABEL     detail     unique111     Datum     bar_1666.9  
GCOLLECTOR     234567     77889     uno  
BLOCK     unique222    error     fullunique111     ...     ...     ...  
DAY     ... ... ...  
DAY     ... ... ...
LABEL     detail     unique222     Issue     Broken - The truck broke  
LABEL     detail     unique222     Datum     bar_9921.2
LABEL     detail     unique222     Folder    6a3  
GCOLLECTOR     345678     77889     uno  
BLOCK     unique333    error     fullunique111     ...     ...     ...    
LABEL     detail     unique333     Datum     bar_7766.2
LABEL     detail     unique333     Folder    49k  
LABEL     detail     unique333     Issue     Broken - The truck broke

I would like to create a hash for each record with the following keys:
gcollectors = Hash.new
gcollectors = { "UniqueID" => uniqueXXX,
"Datum" => bar_XXXX.X,
"FullUniqueID" => fulluniqueXXX,
"IssueGroup" => Broken
}

The uniqueXXX fields always match for the BLOCK and associated LABELs.

I am having a few issues:
1- How do I assign just those fields to the hashes?
2- How can I split off the text before the hyphen (in LABEL ... Issue) and assign it to IssueGroup?
3- How can this be done reliably when the order of the LABEL lines differs?
.. Same question for when there are multiple DAY lines or no DAY lines.

A: 

This is how I'd go about it:

records     = [] # init an array to hold everything
gcollectors = {} # init the hash holding info for one record

# loop over the file
File.readlines('text.txt').each do |l|

  # split the line into columns
  columns = l.chomp.split("\t")

  # if the first column is...
  case columns[0]
  when 'GCOLLECTOR'
    # we don't care about the columns, but instead use this record to tell us to
    # store the hash and reinitialize it.
    if gcollectors.any?
      records << gcollectors
      gcollectors = {}
    end
  when 'BLOCK'
    gcollectors['UniqueID']     = columns[1]
    gcollectors['FullUniqueID'] = columns[3]
  when 'LABEL'
    # a LABEL record could have two different values we care about so figure out
    # which it is.
    case columns[3]
    when 'Datum'
      gcollectors['Datum'] = columns[4]
    when 'Issue'
      gcollectors['IssueGroup'] = columns[4].split('-').first.strip
    end
  end

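end

# the loop only stores a record when it reaches the next GCOLLECTOR line, so
# the last record has to be stored once the loop has finished.
records << gcollectors if gcollectors.any?

require 'ap' # from the awesome_print gem
ap records
# >> [
# >>     [0] {
# >>             "UniqueID" => "unique111",
# >>         "FullUniqueID" => "fullunique111",
# >>           "IssueGroup" => "Broken",
# >>                "Datum" => "bar_1666.9"
# >>     },
# >>     [1] {
# >>             "UniqueID" => "unique222",
# >>         "FullUniqueID" => "fullunique111",
# >>           "IssueGroup" => "Broken",
# >>                "Datum" => "bar_9921.2"
# >>     },
# >>     [2] {
# >>             "UniqueID" => "unique333",
# >>         "FullUniqueID" => "fullunique111",
# >>                "Datum" => "bar_7766.2",
# >>           "IssueGroup" => "Broken"
# >>     }
# >> ]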
Greg
Thank you, perfect!
I've needed this sort of ability many times over the years. Unfortunately, not all incoming data is symmetrical or in a standard, constant format, so we have to find ways to determine the beginning or end of a block that constitutes a record.
Greg
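The idea of detecting where a record begins can also be expressed with Ruby's `Enumerable#slice_before`, which starts a new chunk at every line that matches a condition. This is only a minimal sketch, assuming every record begins with a GCOLLECTOR line and the columns are tab-separated; the inline `sample` data and the `parse_record` helper are hypothetical stand-ins, not part of the answer above.

```ruby
# Hypothetical sample data standing in for text.txt (columns are tab-separated).
sample = [
  "GCOLLECTOR\t123456\t77889\tuno",
  "BLOCK\tunique111\terror\tfullunique111",
  "DAY\t...",
  "LABEL\tdetail\tunique111\tIssue\tBroken - The truck broke",
  "LABEL\tdetail\tunique111\tDatum\tbar_1666.9",
  "GCOLLECTOR\t234567\t77889\tuno",
  "BLOCK\tunique222\terror\tfullunique111",
  "LABEL\tdetail\tunique222\tDatum\tbar_9921.2",
  "LABEL\tdetail\tunique222\tIssue\tBroken - The truck broke"
].join("\n")

# Hypothetical helper: turn the lines of one record into a hash.
def parse_record(lines)
  lines.each_with_object({}) do |line, record|
    columns = line.chomp.split("\t")
    case columns[0]
    when 'BLOCK'
      record['UniqueID']     = columns[1]
      record['FullUniqueID'] = columns[3]
    when 'LABEL'
      case columns[3]
      when 'Datum' then record['Datum'] = columns[4]
      when 'Issue' then record['IssueGroup'] = columns[4].split('-').first.strip
      end
    end
  end
end

# slice_before starts a new chunk at every line that begins with GCOLLECTOR,
# so the last record is included without any special handling.
records = sample.each_line
                .slice_before { |line| line.start_with?('GCOLLECTOR') }
                .map { |lines| parse_record(lines) }

p records
```

Because each chunk is complete before it is parsed, the order of the LABEL lines and the number of DAY lines inside a chunk do not matter.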
Exactly. If it were in the same order for each record, I could have figured it out, but I like your solution even better. Do you have a quick way to display each unique value of a field in the records (e.g. IssueGroup) along with its count? Again, I really appreciate the assistance.
If you want to keep track of its progress or accumulate a count, add the code in the `when 'GCOLLECTOR'` block before `gcollectors = {}`.
Greg
...or, for an accumulated count after the fact just see how many elements are in the records array, i.e., `records.size`
Greg
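For a count per distinct value rather than a total, a hash with a default value of 0 is one option. This is a minimal sketch, assuming `records` is an array of hashes shaped like the ones built in the answer; the sample values below are made up for illustration.

```ruby
# Hypothetical records, shaped like the hashes built by the parser.
records = [
  { 'UniqueID' => 'unique111', 'IssueGroup' => 'Broken' },
  { 'UniqueID' => 'unique222', 'IssueGroup' => 'Broken' },
  { 'UniqueID' => 'unique333', 'IssueGroup' => 'Missing' }
]

# Count how many records share each IssueGroup value.
# Hash.new(0) makes every unseen key start at 0.
counts = records.each_with_object(Hash.new(0)) do |record, tally|
  tally[record['IssueGroup']] += 1
end

counts.each { |group, count| puts "#{group}: #{count}" }
```

On Ruby 2.7 or later, `records.map { |r| r['IssueGroup'] }.tally` should give the same hash.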
A: 
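text = File.read('text.txt') # assumption: the same input file as in the other answer

# each match is one whole record: a GCOLLECTOR line plus the BLOCK/DAY/LABEL lines after it
gcollectors = text.scan(/^GCOLLECTOR.+\n(?:(?:BLOCK|DAY|LABEL).+\n?)+/).map { |collector|
    # named captures in a regex literal on the left of =~ become local variables
    /^BLOCK\t(?<uniqueid>\S+)\t\S+\t(?<fulluniqueid>\S+).+/ =~ collector
    /^LABEL\t\S+\t\S+\tDatum\t(?<datum>.+)/ =~ collector
    # \S+ stops at the first whitespace, so this captures the word before " - "
    /^LABEL\t\S+\t\S+\tIssue\t(?<issue>\S+)/ =~ collector
    Hash[
        "UniqueID",uniqueid,
        "Datum",datum,
        "FullUniqueID",fulluniqueid,
        "IssueGroup",issue
    ]
}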

gcollectors.each { |i| p i }
{"UniqueID"=>"unique111", "Datum"=>"bar_1666.9", "FullUniqueID"=>"fullunique111", "IssueGroup"=>"Broken"}
{"UniqueID"=>"unique222", "Datum"=>"bar_9921.2", "FullUniqueID"=>"fullunique111", "IssueGroup"=>"Broken"}
{"UniqueID"=>"unique333", "Datum"=>"bar_7766.2", "FullUniqueID"=>"fullunique111", "IssueGroup"=>"Broken"}
Nakilon
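The `\S+` capture in the Issue pattern takes only the first word of the Issue text. If the part before the hyphen can contain spaces, a lazy capture up to the hyphen is one possible variation. This is a minimal sketch; the "Not Working" issue text and the `unique444` ID are made-up examples, not from the original data.

```ruby
# Hypothetical LABEL line whose issue group is two words (not from the original data).
line = "LABEL\tdetail\tunique444\tIssue\tNot Working - The truck broke"

# \S+ stops at the first whitespace character.
first_word = line[/^LABEL\t\S+\t\S+\tIssue\t(\S+)/, 1]

# Lazily capture everything up to optional whitespace followed by a hyphen
# (or the end of the line when there is no hyphen).
before_hyphen = line[/^LABEL\t\S+\t\S+\tIssue\t([^-\n]+?)\s*(?:-|$)/, 1]

p first_word    # "Not"
p before_hyphen # "Not Working"
```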
Thank you, but I prefer Greg's answer.