I have a text file:

DATE 20090105
1 2.25 1.5
3 3.6 0.099
4 3.6 0.150
6 3.6 0.099
8 3.65 0.0499
DATE 20090105
DATE 20090106
1 2.4 1.40
2 3.0 0.5
5 3.3 0.19
7 2.75 0.5
10 2.75 0.25
DATE 20090106
DATE 20090107
2 3.0 0.5
2 3.3 0.19
9 2.75 0.5
DATE 20090107

On each day I have:

Time Rating Variance

I want to work out the average variance at a specific time on the biggest time scale.

The file is massive and this is just a small edited sample. This means I don't know the exact earliest and latest times in advance: the earliest is around 2600 and the latest may be around 50000.

So for example on all the days I only have 1 value at time t=1, hence that is the average variance at that time.

At time t=2, on the first day the variance takes the value 1.5, as the reading at t=1 lasts until t=3; on the second day it takes the value 0.5; and on the third day it takes the value ((0.5+0.19)/2). So the average variance over all the days at time t=2 is the sum of all the variances at that time, divided by the number of different variances at that time.

The last reading in each day is taken to last for one time unit.

I'm just wondering as to how I would even go about this.

As a complete beginner I'm finding this quite complicated. I am a uni student, but university is finished and I am trying to learn Java to help out with my Dad's business over the summer. So any help with regards to solutions is greatly appreciated.

A: 

You have to follow the steps below:

  • Create a class with date and trv (time, rating, variance) properties
  • Create a list of the above class
  • Read the file using IO classes.
  • Read in chunks and convert to string
  • Split whole string by "DATE" and trim
  • Split by space (" ")
  • The first item would be your date.
  • Convert all other items to float and find average.
  • Add it to the list. Now you have a list of daily averages.
  • You can persist it to disk and query it for your required data.
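
The steps above can be sketched roughly as follows. This is only a minimal illustration, assuming the DATE lines open and close each day's block as in the sample and that the third column is the variance; the class name `DailyVarianceParser` and its methods are hypothetical names, not part of any library.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: groups "time rating variance" rows by their DATE block. */
class DailyVarianceParser {

    /** Returns date -> list of variance values, in file order. */
    static Map<String, List<Double>> parse(List<String> lines) {
        Map<String, List<Double>> byDate = new LinkedHashMap<>();
        String currentDate = null;
        for (String raw : lines) {
            String[] parts = raw.trim().split("\\s+");
            if (parts[0].equals("DATE") && parts.length > 1) {
                // In the sample a DATE line both opens and closes a day's block,
                // so seeing the same date twice must not reset its list.
                currentDate = parts[1];
                byDate.computeIfAbsent(currentDate, d -> new ArrayList<>());
            } else if (currentDate != null && parts.length >= 3) {
                // Columns are assumed to be: time, rating, variance.
                byDate.get(currentDate).add(Double.parseDouble(parts[2]));
            }
        }
        return byDate;
    }

    /** Plain arithmetic mean; returns 0 for an empty list. */
    static double mean(List<Double> values) {
        if (values.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }
}
```

The lines could come from `Files.readAllLines(Paths.get(...))` for a small file; for a very large file, reading line by line with a `BufferedReader` is probably the safer choice.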

EDIT: You have edited your question and now it looks totally different. I think you need help in parsing the file. Correct me if I am wrong.

Manjoor
Thanks for your answer, but I don't think that's quite what I'm after. I want the average at that specific time, over all the days. For example:

MONDAY    People in store
9:00am    5
9:05am    10
9:10am    15

TUESDAY
9:00am    1
9:05am    2
9:10am    1

So an average model to predict how many people would be in the shop at these times is as below.

9:00am    3
9:05am    6
9:10am    8

As I have taken the average of the number of people in the store at that time. I think what you are describing is just the daily average? Thanks
Sam Hank
Then you need to parse the file and transfer the data to a database (e.g. MySQL). Then you can query it using your parameter.
Manjoor
I literally have no idea, I've been doing Java for a week just trying to help out. I'm just trying to get this done as soon as possible
Sam Hank
A: 

If I understand you correctly, you are after a moving average that is calculated on a stream of data. The following class I wrote provides some such statistics.

  • moving average
  • decaying average (reflects the average of the last few samples, based on the decay factor).
  • moving variance
  • decaying variance
  • min and max.

Hope it helps.

/**
 * omry 
 * Jul 2, 2006
 * 
 * Calculates:
 * 1. running average 
 * 2. running standard deviation.
 * 3. minimum
 * 4. maximum
 */
public class Statistics
{
    private double m_lastValue;
    private double m_average = 0;
    private double m_stdDevSqr = 0;

    private int m_n = 0;
    private double m_max = Double.NEGATIVE_INFINITY;
    private double m_min = Double.POSITIVE_INFINITY;

    private double m_total;

    // decay factor.
    private double m_d;
    private double m_decayingAverage;
    private double m_decayingStdDevSqr;

    public Statistics()
    {
        this(2);
    }

    public Statistics(float d)
    {
        m_d = d;
    }

    public void addValue(double value)
    {
        m_lastValue = value;
        m_total += value;

        // see http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
        m_n++;
        double delta = value - m_average;
        m_average = m_average + delta / (float)m_n;
        double md = (1/m_d);
        if (m_n == 1)
        {
            m_decayingAverage = value;
        }
        m_decayingAverage = (md * m_decayingAverage + (1-md)*value);

        // This expression uses the new value of mean
        m_stdDevSqr = m_stdDevSqr + delta*(value - m_average);

        m_decayingStdDevSqr = m_decayingStdDevSqr + delta*(value - m_decayingAverage);

        m_max = Math.max(m_max, value);
        m_min = Math.min(m_min, value);     
    }

    public double getAverage()
    {
        return round(m_average);
    }

    public double getDAverage()
    {
        return round(m_decayingAverage);
    }   

    public double getMin()
    {
        return m_min;
    }

    public double getMax()
    {
        return m_max;
    }

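    // Note: despite the name, this returns the sample standard deviation
    // (the square root of the variance), rounded to two decimals.
    public double getVariance()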
    {
        if (m_n > 1)
        {
            return round(Math.sqrt(m_stdDevSqr/(m_n - 1)));
        }
        else
        {
            return 0;
        }
    }


    // Note: like getVariance(), this returns a square root, not a variance.
    public double getDVariance()
    {
        if (m_n > 1)
        {
            return round(Math.sqrt(m_decayingStdDevSqr/(m_n - 1)));
        }
        else
        {
            return 0;
        }
    }

    public int getN()
    {
        return m_n;
    }

    public double getLastValue()
    {
        return m_lastValue;
    }

    public void reset()
    {
        m_lastValue = 0;
        m_average = 0;
        m_stdDevSqr = 0;
        m_n = 0;
        m_max = Double.NEGATIVE_INFINITY;
        m_min = Double.POSITIVE_INFINITY;
        m_decayingAverage = 0;
        m_decayingStdDevSqr = 0;
        m_total = 0;
    }

    public double getTotal()
    {
        return round(m_total);
    }

    private double round(double d)
    {
        return Math.round((d * 100))/100.0;
    }
}
Omry
Hi, thank you for your input and that is very useful and can help me learn about how to programme moving averages. However I am after an average AT a specific time. Not a moving average. But once again thank you, this will probably help me with my next task!
Sam Hank
Well, the average at a specific time is the value returned by getAverage() after you have inserted all the values prior to that time.
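
One way to read this, as a rough sketch only: keep one incremental mean per time value and feed it every reading recorded at exactly that time. The class `PerTimeAverage` is a hypothetical name, and the sketch deliberately does not carry a reading forward to later times, which the question also asks for.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: one incremental mean per time value. */
class PerTimeAverage {
    // time -> {count, mean}
    private final Map<Integer, double[]> stats = new TreeMap<>();

    /** Folds one reading into the running mean kept for its time. */
    void add(int time, double variance) {
        double[] s = stats.computeIfAbsent(time, t -> new double[2]);
        s[0] += 1;                        // number of readings seen at this time
        s[1] += (variance - s[1]) / s[0]; // incremental mean update
    }

    /** Mean of the readings recorded at exactly this time, or NaN if there are none. */
    double averageAt(int time) {
        double[] s = stats.get(time);
        return s == null ? Double.NaN : s[1];
    }
}
```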
Omry
What suggestions do you have for my code I've added below?
Sam Hank
A: 

This is definitely wrong, but it outlines my ideas (and how little I know about Java...)

But yep so as I said, any help is appreciated thanks.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.AbstractMap.SimpleEntry;

public class VarEvolution {

public static void main(String[] args) throws IOException {
    BufferedReader br = null;
    try {
        String InputFile = "C:\\Textfile.txt";
        br = new BufferedReader(new FileReader(InputFile));
        String line;
        String date = null;
        while ((line = br.readLine()) != null) {
            line = line.trim();
            if (!line.startsWith("DATE")) {
                if (!line.equals(date)){
                Scanner s = new Scanner(line);
                int time = s.nextInt();
                s.nextDouble();
                double var = s.nextDouble();
                ArrayList<Integer> Time = new ArrayList<Integer>();
                ArrayList<Double> Vars[] = new ArrayList<Double>(time);//I want that spread to correspond to that Time
                ArrayList<Double> AvgVars[] = new ArrayList<Double>(time);//The Average spread corresponding to that Time
                List<SimpleEntry<Integer, ArrayList<ArrayList<Double>>>> VarEvolution = new List<SimpleEntry<Integer, ArrayList<ArrayList<Double>>>>();
                /* I'm trying to get a list that looks like this for example:
                 * Time     AvgVars
                 * 1          x_1
                 * 2          x_2
                 * 3          x_3 
                 * 
                 * Where x_1 etc are the average variancess for that specific time given all the data in the days.
                 */
                if(!Time.contains(time)){
                    if(time>0 && time<11){ 
                        Time.add(time);
                        Vars[time].add(var);
                    }
                    else(Time.contains(time)){
                        Vars[time].add(var);
                    }
                AvgVars[time].add(Vars[time]/(Vars[time].size(Vars));
                /*Here I want to take the sum of the vars at that time and then divide by the number of vars at that time. 
                 * I.e. the size of the list?
                 * But I don't think I've summed up the vars at that specific time or if 
                 * the size is available... 
                 */

                    }
        VarEvolution.addAll(Time,AvgVars);
        //This is just adding them to the arraylist. No idea if its right...
                }               

}

        }
    }finally {
        if (br != null)
            br.close();
    }
}

}

Sam Hank
Imaginary -1 because your coding style made me throw up in my mouth a little. (no offense, j/k :P)
atamanroman
Haha please, I've been coding for one week...
Sam Hank
Have a look at Clean Code by Robert C. Martin.
atamanroman
A: 

I think I understand. You want to

  1. find the average variance at a given time t on each day - which is given by the highest timestamp on that day that is less than t
  2. deal with cases where there are multiple readings at the same time by averaging them.
  3. find the average variance on all days at time t

So I'd suggest, once you parse the data as @Manjoor suggested, then, (pseudocode!)

function getAverageAt(int t)
  float lastvariance = 0; // what value to start on, 
                        // if no variance is specified at t=1 on day 1
                        // also acts as accumulator if several values at one 
                        // timestamp
  float allDaysTotal = 0; // cumulative sum of the variance at time t for all days
  for each day {
    float time[], rating[], variance[];
    //read these from table
    int found=0; //how many values found at time t today
    for(int i=0;i<time.length;i++){
       if(time[i]<t) lastvariance=variance[i];  // find the most recent value
                        // before t.
                        // This relies on your data being in order!
       else if(time[i]==t){  // time 
         found++;
         if (found==1) lastvariance=variance[i]; // no previous occurrences today
         else lastvariance+=variance[i];
       }
       else if(time[i]>t) break;
    }
    if(found>1) lastvariance/=found;  // calculate average of several simultaneous
    // readings, if more than one value found today at time t.
    // Note that: if found==0, this means you're using a previous
    // timestamp's value.
    // Also note that, if at t=1 you have 2 values of variance, that 
    // averaged value will not continue over to time t. 
    // You could easily reimplement that if that's the behaviour you desire,
    // the code is similar, but putting the time<t condition along with the 
    // time==t condition 
    allDaysTotal+=lastvariance;
  }
  allDaysMean = allDaysTotal / nDays

Your problem isn't a simple one, as the catch-cases I pointed out show.
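
A runnable Java rendering of the pseudocode might look like the sketch below. The `Day` container and the `AverageAtTime` class are hypothetical names. Two choices here are assumptions that differ from the pseudocode: the carried value is reset at the start of each day, and a day with no reading at or before t is skipped instead of counted.

```java
import java.util.List;

/** Hypothetical container for one day's readings, sorted by time. */
class Day {
    final int[] time;
    final double[] variance;

    Day(int[] time, double[] variance) {
        this.time = time;
        this.variance = variance;
    }
}

class AverageAtTime {
    /**
     * Mean over all days of the variance in force at time t.
     * Assumption: days with no reading at or before t are skipped.
     */
    static double getAverageAt(List<Day> days, int t) {
        double total = 0.0;
        int daysCounted = 0;
        for (Day day : days) {
            double last = Double.NaN; // most recent reading strictly before t
            double sumAtT = 0.0;      // sum of the readings recorded exactly at t
            int found = 0;            // how many readings were recorded exactly at t
            for (int i = 0; i < day.time.length; i++) {
                if (day.time[i] < t) {
                    last = day.variance[i];
                } else if (day.time[i] == t) {
                    sumAtT += day.variance[i];
                    found++;
                } else {
                    break; // relies on the readings being in time order
                }
            }
            // Readings at exactly t win; otherwise fall back to the carried value.
            double value = found > 0 ? sumAtT / found : last;
            if (!Double.isNaN(value)) {
                total += value;
                daysCounted++;
            }
        }
        return daysCounted == 0 ? Double.NaN : total / daysCounted;
    }
}
```

With the three sample days from the question and t=2, the per-day values are 1.5, 0.5 and (0.5+0.19)/2 = 0.345, so the result is 2.345/3, about 0.78.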

Sanjay Manohar
Thank you for your help. I don't know what parsing is, but I'm sure I'll be able to find it out. Is this fairly complicated then? I mean, after a week of coding, should I be able to do this?
Sam Hank
Yes. The 'timestamp' of each data point is the first column of your data. So 'the highest timestamp on that day that is less than t' is a more computer-friendly way of writing what you just said. Parsing, in your case, is turning text into tables of numbers. The task isn't complicated if you follow the steps I've described, but you need to consider the catch-cases, like when the last recorded point in time was itself an average of several data points. I mean, if you strip my code down, you could get it to 10 lines of code!
Sanjay Manohar
Hi Sanjay, what do you think of the code I've written? It's very slow but it works...
Sam Hank
A: 

OK, I've got code which works. But it takes a very long time (there are around 7 months' worth of days, with 30,000 variances a day) because it has to loop round so many times. Are there any better suggestions?

I mean this code, for something seemingly simple, would take around 24-28 hours...

package VarPackage;

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;

public class ReadText {

public static void main(String[] args) throws Exception {
    String inputFileName="C:\\MFile";


    ArrayList<String> fileLines = new ArrayList<String>();
    FileReader fr;
    BufferedReader br;

    // Time
    int t = 1;


    fr = new FileReader(inputFileName);
    br = new BufferedReader(fr);
    String line;


    while ((line=br.readLine())!=null) {
     fileLines.add(line);
    }

    AvgVar myVar = new AvgVar(fileLines);

    for(t=1; t<10; t++){
        System.out.print("Average Var at Time t=" + t + " = " + myVar.avgVar(t)+"\n");
    }

}

}

===================================

NewClass

package VarPackage;

import java.util.ArrayList;

public class AvgVar {

// Class Variables
private ArrayList<String> inputData = new ArrayList<String>();

// Constructor
AvgVar(ArrayList<String> fileData){
 inputData = fileData;
}

public double avgVar(int time){

 double avgVar = 0;

 ArrayList<double[]> avgData = avgDuplicateVars(inputData);

 for(double[] arrVar : avgData){
 avgVar += arrVar[time-1];
 //System.out.print(arrVar[time-1] + "," + arrVar[time] + "," + arrVar[time+1] + "\n");
 //System.out.print(avgVar + "\n");
 }

 avgVar /= numDays(inputData);

 return avgVar;
}

private int numDays(ArrayList<String> varData){

 int n = 0;
 int flag = 0;

for(String line : varData){

String[] myData = line.split(" ");

if(myData[0].equals("DATE") && flag == 0){

    flag = 1;

   }
   else if(myData[0].equals("DATE") && flag == 1){

    n = n + 1;
    flag = 0;

   }

}

return n;

}

private ArrayList<double[]> avgDuplicateVars(ArrayList<String> varData){

 ArrayList<double[]> avgData = new ArrayList<double[]>();

 double[] varValue = new double[86400];
 double[] varCount = new double[86400];

 int n = 0;
 int flag = 0;

for(String iLine : varData){

String[] nLine = iLine.split(" ");
   if(nLine[0].equals("DATE") && flag == 0){

    for (int i=0; i<86400; i++){
    varCount[i] = 0;
    varValue[i] = 0;
    }

    flag = 1;

   }
   else if(nLine[0].equals("DATE") && flag == 1){

    for (int i=0; i<86400; i++){
    if (varCount[i] != 0){
    varValue[i] /= varCount[i];
    }
    }

    varValue = fillBlankSpreads(varValue, 86400);

    avgData.add(varValue.clone());

    flag = 0;

   }
   else{

    n = Integer.parseInt(nLine[0])-1;

    varValue[n] += Double.parseDouble(nLine[2]);
    varCount[n] += 1;

   }

}

return avgData;

}

private double[] fillBlankSpreads(double[] varValue, int numSpread){
//Fill each blank (zero) slot with the previous slot's value, so a reading carries forward until the next one
 for (int i=1; i<numSpread; i++){
 if(varValue[i] == 0){
 varValue[i] = varValue[i-1];
 }
 }

 return varValue;
}

}

Sam Hank
The problem arises in the for loop, where it goes through 86400 iterations, one for each second in the day. I don't know, maybe an array of an array of an array? The code is fine and gets me the correct answer, as I said, but it doesn't get me the right answer fast as well. So I take it it's just slow.
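
One cost that stands out in the code above is that every call to avgVar(t) runs avgDuplicateVars and numDays again, so the whole file is re-processed and an 86400-element array is cloned per day for each value of t. A possible single-pass alternative is sketched here. The class `FastAvgVar` is a hypothetical name, and unlike the original it uses a per-slot count, not a zero value, to detect blank slots, so a genuine variance of 0 would be treated differently.

```java
import java.util.Arrays;

/** Hypothetical single-pass rewrite: every time slot is averaged in one sweep. */
class FastAvgVar {
    static final int SLOTS = 86400; // one slot per second, as in the original code

    /** Returns avg where avg[t-1] is the mean over days of the variance in force at time t. */
    static double[] averageAllTimes(Iterable<String> lines) {
        double[] total = new double[SLOTS]; // sum over all days, per slot
        double[] sum = new double[SLOTS];   // today's readings, per slot
        int[] count = new int[SLOTS];       // today's number of readings, per slot
        int days = 0;
        boolean inDay = false;

        for (String raw : lines) {
            String[] parts = raw.trim().split("\\s+");
            if (parts[0].equals("DATE")) {
                if (!inDay) {
                    // Opening DATE line: clear today's buffers.
                    Arrays.fill(sum, 0.0);
                    Arrays.fill(count, 0);
                } else {
                    // Closing DATE line: average duplicate readings, carry each
                    // value forward over empty slots, and add the day to the totals.
                    double current = 0.0; // slots before the first reading stay 0
                    for (int i = 0; i < SLOTS; i++) {
                        if (count[i] > 0) {
                            current = sum[i] / count[i];
                        }
                        total[i] += current;
                    }
                    days++;
                }
                inDay = !inDay;
            } else if (inDay && parts.length >= 3) {
                int slot = Integer.parseInt(parts[0]) - 1;
                sum[slot] += Double.parseDouble(parts[2]);
                count[slot]++;
            }
        }
        for (int i = 0; i < SLOTS && days > 0; i++) {
            total[i] /= days;
        }
        return total;
    }
}
```

After one call, the average for any t is just a lookup in the returned array, so printing all times costs one pass over the file plus one pass over the array. Because the method takes an `Iterable<String>`, the lines do not all have to be held in memory at once.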
Sam Hank