ibmi-brunch-learn

Announcement

Collapse
No announcement yet.

Custom SQL aggregate functions

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Custom SQL aggregate functions

    I have an interesting problem and I'm not sure how best to solve it.

    Background:
    We have a file that contains each request and how long that request takes. The problem is that this file gets quite big so any group by queries take an age. To get round this I have another file that pre-calculates the mean and standard deviation for request times for ten minute periods. This works fine and the group bys are much much quicker.

    The problem:
    What I want now is to be able to do an SQL group by query and call an aggregate function to combine the means and standard deviations. I've already done the maths and found the equations I need to calculate this.

    I want to be able to say something like this
    PHP Code:
    select datecombinedMean(mean), combinedStdDev(stddevfrom requests
    group by date
    order by date 
    The solution?
    I've just spent a bit of time looking at UDF's but you can only do scalar functions it seems. In other words I can't pass a field to my customer function that isn't included in the select and group statements. Bummer!

    So it's now looking like perhaps I need to use native I/O or muck about and create a third file that maybe performs a bit better. I don't really want to have to do complicated groupings using native I/O and neither do I really want to be querying over that large file.

    Any other ideas guys?
    Thanks
    Ben

  • #2
    Re: Custom SQL aggregate functions

    Why wouldn't you join the pre-calculated deviation file in the Select stm together with your UDF ?
    Philippe

    Comment


    • #3
      Re: Custom SQL aggregate functions

      Sorry I think I've confused you with my pseudo SQL. The requests file is too big to query over, this is why I have created the pre-calculated file.

      The key to the pre-calculated file is something like:
      Date:Time:Program:Action

      There will be a record for each program/action combination. The calculated mean, standard deviation and number of requests are all for ten minute periods. This means the times are values like 1300, 1310, 1320 etc.

      So to get the mean and standard deviation for the day I might want something like this.
      PHP Code:
      select datecombinedMean(mean), combinedStdDev(stddevfrom precalcd
      group by date
      order by date 
      This all has to return within a short time so that I can display it within a web page.

      I thought I would be ok. Oracle has the ODCIAggregate interface, SQL server has CREATE AGGREGATE, even Postgre SQL which is free supports CREATE AGGREGATE. It would seem that of the big vendors it's just DB2 that doesn't support this. Isn't it always the way!

      Perhaps there is another way round this?

      *** Edit ***
      I've also just seen that MySQL supports user defined aggregate functions!
      Last edited by BenThurley; March 24, 2011, 09:19 AM.
      Ben

      Comment


      • #4
        Re: Custom SQL aggregate functions

        Shouldn't the UDFs accept the date as input parameter and return the deviations in order to get something like :

        PHP Code:
        select datesum(combinedMean(date)) / count(date) as Mean
                     
        sum(combinedStdDev(date)) / count(date) as StdDev 
        from precalcd
        group by date
        order by date 
        sum(combinedMean(date)) / count(date) would return the average of the sum of mean dev in a day
        and
        sum(combinedStdDev(date)) / count(date) would return the average of the sum of std dev for the same day.

        Anyway something near of that.

        Am I way way far in this issue ?

        BTW DB2 supports aggregate functions.
        Last edited by Mercury; March 24, 2011, 09:45 AM.
        Philippe

        Comment


        • #5
          Re: Custom SQL aggregate functions

          Those calculations will only work if the sample sizes are identical and the variances are similar.

          For example with differing means and standard deviations you get this problem.
          If you have the following two datasets
          5, 10, 15 (mean=10, std dev=4.08)
          and
          50, 100, 150 (mean=100, std dev=40.8)

          You can take the average of the means which gives you 55. This is correct but if you take the average of the standard deviations you get 22.44. This is wrong, the correct value should be 53.54.

          You get similar problems if the sample sizes vary.
          5, 10, 15 (mean=10, std dev=4.08)
          and
          50, 100, 150, 200 (mean=125, std dev=55.9)

          Now the average of the means is 67.5 whereas the correct value is 75.7. Notice it is higher because there are more results in the set with larger numbers. Similarly the standard deviation will be off again.

          I wrote a Java class to test the equations I have for combining the means and standard deviations. I did this so I could use jUnit to unit test this method against calculating over the whole population. This is the method that combines the mean and standard deviation.

          PHP Code:
              private DataSet combineDataSets(DataSet ds1DataSet ds2)
              {
                  
          DataSet newDS = new DataSet();
                  
                  
          // Combined mean        
                  
          double n1 ds1.getSize();
                  
          double n2 ds2.getSize();
                  
          double xbar1 ds1.getMean();
                  
          double xbar2 ds2.getMean();
                  
                  
          double xbarMerged = (n1 / (n1+n2))*xbar1 + (n2 / (n1+n2))*xbar2;
                  
                  
          newDS.setMean(xbarMerged);
                  
                  
          // Combined variance
                  
          double s1 ds1.getVariance();
                  
          double s2 ds2.getVariance();
                  
                  
          double varianceMerged = ( Math.pow(n12)*s1 Math.pow(n22)*s2 
                          
          n1*s1 n1*s2 n2*s1 -
                          
          n2*s2 n1*n2*s1 n1*n2*s2 n1*n2 
                          
          Math.pow((xbar1 xbar2), 2) ) 
                          / ( (
          n1+n2-1)*(n1+n2) );
                  
                  
          // Combined std dev
                  
          newDS.setStandardDeviation(Math.sqrt(varianceMerged));
                  
                  return 
          newDS;
              } 
          Basically, for each row I would need to know the mean, standard deviation and number of results. Given this for each row in the group you can work out an aggregate mean and standard deviation.
          Ben

          Comment


          • #6
            Re: Custom SQL aggregate functions

            Sorry, just seen the last bit in your post too. DB2 does support aggregate functions but the difference is it doesn't support user defined aggregate functions.
            Ben

            Comment


            • #7
              Re: Custom SQL aggregate functions

              Try this out please (last attempt).

              My precalcd test file:
              Code:
              DATE           DATA  
              2011-03-24        5
              2011-03-24       10
              2011-03-24       15
              2011-03-24       50
              2011-03-24      100
              2011-03-24      150
              2011-03-24      200
              [COLOR="deepskyblue"]2011-03-25       20
              2011-03-25       25
              2011-03-25       30
              2011-03-25       35
              2011-03-25       40
              2011-03-25       45
              2011-03-25       50
              2011-03-25       55
              2011-03-25       60
              2011-03-25       65
              2011-03-25       70[/COLOR]
              PHP Code:
              select datedec(sum(data)/count(data), 72) as mean,
              dec(stddev(data), 72) as stddev           
              from precalcd 
              group by date
              order by date 
              This is what I get :

              Code:
              DATE             MEAN      STDDEV 
              2011-03-24      75,00       70,93 
              [COLOR="deepskyblue"]2011-03-25      45,00       15,81[/COLOR]
              Philippe

              Comment


              • #8
                Re: Custom SQL aggregate functions

                You can do this :

                Code:
                [COLOR=#000000][SIZE=3][FONT=Courier New][COLOR=#0000bb]With t1 as (select date[/COLOR][COLOR=#007700], [/COLOR][COLOR=#0000bb]combinedMean[/COLOR][COLOR=#007700]([/COLOR][COLOR=#0000bb]mean[/COLOR][COLOR=#007700]) cMEAN, [/COLOR][COLOR=#0000bb]combinedStdDev[/COLOR][COLOR=#007700]([/COLOR][COLOR=#0000bb]stddev[/COLOR][COLOR=#007700]) cSTDDEV [/COLOR][/FONT][/SIZE][/COLOR][COLOR=#000000][COLOR=#0000bb][FONT=Courier New][SIZE=3]from precalcd)[/SIZE][/FONT][/COLOR][/COLOR]
                [COLOR=#000000][COLOR=#0000bb][FONT=Courier New][SIZE=3]Select date, sum(cMean), sum(cSTDDEV) from t1[/SIZE][/FONT][/COLOR]
                [SIZE=3][FONT=Courier New][COLOR=#000000]group by date[/COLOR][/FONT][/SIZE]
                [SIZE=3][FONT=Courier New][COLOR=#000000]order by date  [/COLOR][/FONT][/SIZE]
                [/COLOR]


                Patrick

                Comment


                • #9
                  Re: Custom SQL aggregate functions

                  Merci à vous deux.

                  RE: the sub select
                  Before I realised you couldn't do custom aggregate functions I did try something similar myself. I downloaded an example UDF from IT Jungle that calculated rolled up prices and then tried testing it with a group by. It is a different calculation to the one I wanted but the theory was the same.

                  The inner query worked a bit like this:
                  PHP Code:
                  SELECT col1,                   
                  SumVal(col1col2)           
                  FROM myfile order by col1 
                  This gave results a bit like this:
                  PHP Code:
                  Col1  Col2
                   1     1
                   1     2
                   1     3
                   1     4
                   1     5
                   2     1 
                   2     2
                   3     1
                   3     2
                   3     3 
                  I can surround this query with a "with x as" but I then found it wasn't possible to reliably get the correct col2 value. Here we are simply adding so a max() ought to do it, although I couldn't get this to work. But in my example max wouldn't work because the correct mean and std dev may not be the max value. I really need the opposite to a TOP function (which is another one I don't think the AS400 supports).

                  RE: using the avg() and stddev() functions
                  I have looked at these too but these will only work over the raw data. If you ran these over the pre-calculated values, you would just be calculating the average mean and standard deviation of the standard deviations. This would be wrong.


                  I've had to go back to the drawing board on this one. I think the only way of coding this (on the AS400), is to add some more fields and indexes to the large requests file so that I can group by on that. At the moment that file performs like a dog but maybe with some tweaking I can get it to return something in an acceptable time-frame. It's the only option I have now. I may yet need those equations for calculating a total of the groups.

                  I'll let you know how I get on. Thanks for the suggestions.
                  Ben

                  Comment


                  • #10
                    Re: Custom SQL aggregate functions

                    I just want you guys to know that my head exploded.
                    Michael Catalani
                    IS Director, eCommerce & Web Development
                    Acceptance Insurance Corporation
                    www.AcceptanceInsurance.com
                    www.ProvatoSys.com

                    Comment


                    • #11
                      Re: Custom SQL aggregate functions

                      Yeah, reading your latter post I sahre your opinion that you need in fact to work out deeply on the big requests file.

                      Tiens-nous au courant stp.

                      Keep us informed pls.
                      Philippe

                      Comment


                      • #12
                        Re: Custom SQL aggregate functions

                        Ok, bit of an update. I wrote a specification based on improving the requests file for performance, but I forgot that this file is purged more frequently than the rolled up stats file. In other words doing it this way would mean we would not be able to look as far back when comparing the statistics. I still wasn't that happy with having to use this file because we really do have a lot of requests to process.

                        Instead I've had a rethink on calculating the mean and standard deviation. A colleague of mine pointed out that the mean could be aggregated something like this:

                        select hour, sum(numRequests*responseTime)/sum(numRequests)
                        from XPSTATS
                        group by hour

                        When researching how to calculate and combine these I was looking at best practices. Best practice with computations like this usually involve avoiding large numbers due to loss of precision. In the case of the means here it may not be too bad. The biggest problem area was with the standard deviations. I had forgotten that I had seen an alternative to aggregate these which uses the sum of the squares.

                        If I store the sum of the squares on the rolled up file which has the mean and number of requests, then I should be able to use this method to get the standard deviation.

                        stddev = sqrt( (sumofsquares - (sum*sum) / numberofrequests ) / numberofrequests )

                        The big problem with this method is that I need a new field on my file to store the sum of the squares. This could end up quite a large number so there are potentially some bad rounding errors. Sadly I've tried the best practice route and it isn't really possible on this system so I think I'll have to plump with this approach and just make this field as big (or precise) as possible.
                        Ben

                        Comment


                        • #13
                          Re: Custom SQL aggregate functions

                          Thanks Ben for the feedback.
                          Philippe

                          Comment


                          • #14
                            Re: Custom SQL aggregate functions

                            Hey Ben, when will your complex SQL be put into production at the Department of Energy's Fermi National Accelerator Laboratory?
                            My head hurt tooo

                            Nice job guys..
                            jamie
                            All my answers were extracted from the "Big Dummy's Guide to the As400"
                            and I take no responsibility for any of them.

                            www.code400.com

                            Comment


                            • #15
                              Re: Custom SQL aggregate functions

                              My head hurts when looking at subfiles so it's all relative
                              Ben

                              Comment

                              Working...
                              X