Announcement

**Mercury** · March 24, 2011, 08:34 AM

Re: Custom SQL aggregate functions

Why wouldn't you join the pre-calculated deviation file in the Select stm together with your UDF ?

**BenThurley** · March 24, 2011, 09:02 AM

Re: Custom SQL aggregate functions

Sorry I think I've confused you with my pseudo SQL. The requests file is too big to query over, this is why I have created the pre-calculated file.

The key to the pre-calculated file is something like:
Date:Time:Program:Action

There will be a record for each program/action combination. The calculated mean, standard deviation and number of requests are all for ten minute periods. This means the times are values like 1300, 1310, 1320 etc.

So to get the mean and standard deviation for the day I might want something like this.

PHP Code:


select date, combinedMean(mean), combinedStdDev(stddev) from precalcd
group by date
order by date

This all has to return within a short time so that I can display it within a web page.

I thought I would be ok. Oracle has the ODCIAggregate interface, SQL server has CREATE AGGREGATE, even Postgre SQL which is free supports CREATE AGGREGATE. It would seem that of the big vendors it's just DB2 that doesn't support this. Isn't it always the way!

Perhaps there is another way round this?

*** Edit ***
I've also just seen that MySQL supports user defined aggregate functions!

**Mercury** · March 24, 2011, 09:41 AM

Re: Custom SQL aggregate functions

Shouldn't the UDFs accept the date as input parameter and return the deviations in order to get something like :

PHP Code:


select date, sum(combinedMean(date)) / count(date) as Mean, 
             sum(combinedStdDev(date)) / count(date) as StdDev 
from precalcd
group by date
order by date

sum(combinedMean(date)) / count(date) would return the average of the sum of mean dev in a day
and
sum(combinedStdDev(date)) / count(date) would return the average of the sum of std dev for the same day.

Anyway something near of that.

Am I way way far in this issue ?

BTW DB2 supports aggregate functions.

**BenThurley** · March 24, 2011, 10:10 AM

Re: Custom SQL aggregate functions

Those calculations will only work if the sample sizes are identical and the variances are similar.

For example with differing means and standard deviations you get this problem.
If you have the following two datasets
5, 10, 15 (mean=10, std dev=4.08)
and
50, 100, 150 (mean=100, std dev=40.8)

You can take the average of the means which gives you 55. This is correct but if you take the average of the standard deviations you get 22.44. This is wrong, the correct value should be 53.54.

You get similar problems if the sample sizes vary.
5, 10, 15 (mean=10, std dev=4.08)
and
50, 100, 150, 200 (mean=125, std dev=55.9)

Now the average of the means is 67.5 whereas the correct value is 75.7. Notice it is higher because there are more results in the set with larger numbers. Similarly the standard deviation will be off again.

I wrote a Java class to test the equations I have for combining the means and standard deviations. I did this so I could use jUnit to unit test this method against calculating over the whole population. This is the method that combines the mean and standard deviation.

PHP Code:


    private DataSet combineDataSets(DataSet ds1, DataSet ds2)

    {

        DataSet newDS = new DataSet();

        

        // Combined mean        

        double n1 = ds1.getSize();

        double n2 = ds2.getSize();

        double xbar1 = ds1.getMean();

        double xbar2 = ds2.getMean();

        

        double xbarMerged = (n1 / (n1+n2))*xbar1 + (n2 / (n1+n2))*xbar2;

        

        newDS.setMean(xbarMerged);

        

        // Combined variance

        double s1 = ds1.getVariance();

        double s2 = ds2.getVariance();

        

        double varianceMerged = ( Math.pow(n1, 2)*s1 + Math.pow(n2, 2)*s2 

                - n1*s1 - n1*s2 - n2*s1 -

                n2*s2 + n1*n2*s1 + n1*n2*s2 + n1*n2 * 

                Math.pow((xbar1 - xbar2), 2) ) 

                / ( (n1+n2-1)*(n1+n2) );

        

        // Combined std dev

        newDS.setStandardDeviation(Math.sqrt(varianceMerged));

        

        return newDS;

    }

Basically, for each row I would need to know the mean, standard deviation and number of results. Given this for each row in the group you can work out an aggregate mean and standard deviation.

**BenThurley** · March 24, 2011, 10:11 AM

Re: Custom SQL aggregate functions

Sorry, just seen the last bit in your post too. DB2 does support aggregate functions but the difference is it doesn't support user defined aggregate functions.

**Mercury** · March 24, 2011, 12:31 PM

Re: Custom SQL aggregate functions

Try this out please (last attempt).

My precalcd test file:

Code:

DATE           DATA  
2011-03-24        5
2011-03-24       10
2011-03-24       15
2011-03-24       50
2011-03-24      100
2011-03-24      150
2011-03-24      200
[COLOR="deepskyblue"]2011-03-25       20
2011-03-25       25
2011-03-25       30
2011-03-25       35
2011-03-25       40
2011-03-25       45
2011-03-25       50
2011-03-25       55
2011-03-25       60
2011-03-25       65
2011-03-25       70[/COLOR]

PHP Code:


select date, dec(sum(data)/count(data), 7, 2) as mean,
dec(stddev(data), 7, 2) as stddev           
from precalcd 
group by date
order by date

This is what I get :

Code:

DATE             MEAN      STDDEV 
2011-03-24      75,00       70,93 
[COLOR="deepskyblue"]2011-03-25      45,00       15,81[/COLOR]

**K2r400** · March 24, 2011, 03:59 PM

Re: Custom SQL aggregate functions

You can do this :

Code:

[COLOR=#000000][SIZE=3][FONT=Courier New][COLOR=#0000bb]With t1 as (select date[/COLOR][COLOR=#007700], [/COLOR][COLOR=#0000bb]combinedMean[/COLOR][COLOR=#007700]([/COLOR][COLOR=#0000bb]mean[/COLOR][COLOR=#007700]) cMEAN, [/COLOR][COLOR=#0000bb]combinedStdDev[/COLOR][COLOR=#007700]([/COLOR][COLOR=#0000bb]stddev[/COLOR][COLOR=#007700]) cSTDDEV [/COLOR][/FONT][/SIZE][/COLOR][COLOR=#000000][COLOR=#0000bb][FONT=Courier New][SIZE=3]from precalcd)[/SIZE][/FONT][/COLOR][/COLOR]
[COLOR=#000000][COLOR=#0000bb][FONT=Courier New][SIZE=3]Select date, sum(cMean), sum(cSTDDEV) from t1[/SIZE][/FONT][/COLOR]
[SIZE=3][FONT=Courier New][COLOR=#000000]group by date[/COLOR][/FONT][/SIZE]
[SIZE=3][FONT=Courier New][COLOR=#000000]order by date  [/COLOR][/FONT][/SIZE]
[/COLOR]

**BenThurley** · March 25, 2011, 08:47 AM

Re: Custom SQL aggregate functions

Merci � vous deux.

RE: the sub select
Before I realised you couldn't do custom aggregate functions I did try something similar myself. I downloaded an example UDF from IT Jungle that calculated rolled up prices and then tried testing it with a group by. It is a different calculation to the one I wanted but the theory was the same.

The inner query worked a bit like this:

PHP Code:


SELECT col1,                   

SumVal(col1, col2)           

FROM myfile order by col1

This gave results a bit like this:

PHP Code:

I can surround this query with a "with x as" but I then found it wasn't possible to reliably get the correct col2 value. Here we are simply adding so a max() ought to do it, although I couldn't get this to work. But in my example max wouldn't work because the correct mean and std dev may not be the max value. I really need the opposite to a TOP function (which is another one I don't think the AS400 supports).

RE: using the avg() and stddev() functions
I have looked at these too but these will only work over the raw data. If you ran these over the pre-calculated values, you would just be calculating the average mean and standard deviation of the standard deviations. This would be wrong.

I've had to go back to the drawing board on this one. I think the only way of coding this (on the AS400), is to add some more fields and indexes to the large requests file so that I can group by on that. At the moment that file performs like a dog but maybe with some tweaking I can get it to return something in an acceptable time-frame. It's the only option I have now. I may yet need those equations for calculating a total of the groups.

I'll let you know how I get on. Thanks for the suggestions.

**MichaelCatalani** · March 25, 2011, 09:13 AM

Re: Custom SQL aggregate functions

I just want you guys to know that my head exploded.

**Mercury** · March 25, 2011, 09:23 AM

Re: Custom SQL aggregate functions

Yeah, reading your latter post I sahre your opinion that you need in fact to work out deeply on the big requests file.

Tiens-nous au courant stp.

Keep us informed pls.

**BenThurley** · April 6, 2011, 09:46 AM

Re: Custom SQL aggregate functions

Ok, bit of an update. I wrote a specification based on improving the requests file for performance, but I forgot that this file is purged more frequently than the rolled up stats file. In other words doing it this way would mean we would not be able to look as far back when comparing the statistics. I still wasn't that happy with having to use this file because we really do have a lot of requests to process.

Instead I've had a rethink on calculating the mean and standard deviation. A colleague of mine pointed out that the mean could be aggregated something like this:

select hour, sum(numRequests*responseTime)/sum(numRequests)
from XPSTATS
group by hour

When researching how to calculate and combine these I was looking at best practices. Best practice with computations like this usually involve avoiding large numbers due to loss of precision. In the case of the means here it may not be too bad. The biggest problem area was with the standard deviations. I had forgotten that I had seen an alternative to aggregate these which uses the sum of the squares.

If I store the sum of the squares on the rolled up file which has the mean and number of requests, then I should be able to use this method to get the standard deviation.

stddev = sqrt( (sumofsquares - (sum*sum) / numberofrequests ) / numberofrequests )

The big problem with this method is that I need a new field on my file to store the sum of the squares. This could end up quite a large number so there are potentially some bad rounding errors. Sadly I've tried the best practice route and it isn't really possible on this system so I think I'll have to plump with this approach and just make this field as big (or precise) as possible.

**Mercury** · April 7, 2011, 05:03 AM

Re: Custom SQL aggregate functions

Thanks Ben for the feedback.

**jamief** · April 7, 2011, 05:22 AM

Re: Custom SQL aggregate functions

Hey Ben, when will your complex SQL be put into production at the Department of Energy's Fermi National Accelerator Laboratory?
My head hurt tooo

Nice job guys..
jamie

**BenThurley** · April 7, 2011, 07:17 AM

Re: Custom SQL aggregate functions

My head hurts when looking at subfiles so it's all relative

Announcement

Custom SQL aggregate functions

Custom SQL aggregate functions

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment