ibmi-brunch-learn

Job performance Issue

  • Job performance Issue

    Hi all

    We are encountering very strange behavior on the iSeries, and any help would be appreciated. I'll explain the situation:

    We have developed a web application (Java) that uses a socket server on the iSeries, so from the iSeries point of view, each connection is seen as a job. The Java application calls COBOL services.
    Everything is OK.
    We are trying to test the limits of the application by using a load-testing tool to increase the number of simultaneous users working on the app.

    Now here is the awkward situation: past a certain point, the iSeries gets very, very slow, to the point that even a READ statement in COBOL takes up to 150 or 200 ms, creating peaks in response time.

    After monitoring and adding traces to the application, we found that after a certain time the job performs worse (the same programs take 10x longer than on the initial calls).
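
    One way to put a number on that observation is to compare the first few traced calls of a program with the last few. The following is only a sketch: the function name, the window size, and the idea that the traces can be reduced to a list of per-call durations in milliseconds are all assumptions, not something taken from the actual application.

```python
from statistics import median

def slowdown_ratio(durations_ms, window=5):
    """Median of the last `window` calls divided by the median of the first `window`.

    `durations_ms` is assumed to be the per-call elapsed times of ONE program,
    in call order, as extracted from the application traces (hypothetical format).
    """
    if len(durations_ms) < 2 * window:
        raise ValueError("need at least 2 * window samples")
    early = median(durations_ms[:window])
    late = median(durations_ms[-window:])
    return late / early

if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    sample = [10, 11, 9, 10, 12, 15, 40, 90, 100, 110, 95, 105]
    print(f"late/early median ratio: {slowdown_ratio(sample):.1f}")
```

    Medians are used instead of means so that a single extreme peak does not dominate the comparison.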

    We can't understand what is happening. It is as if the job is getting tired :P, or more logically, as if the resources allocated to the job have been used up completely!
    Is there a system value to increase the resources allocated to jobs?
    Does our understanding make any sense at all? Is it true that the resources allocated to a job are limited?

    We have excluded until now the following:
    CPU (it is OK)
    Memory (OK)
    record locking (we have checked this point)
    the HDD arms (it is a data center)

    We have also checked the following system values:
    QMAXJOB: 163520 (default value by IBM)
    QACTJOB: 200
    QTOTJOB: 200
    QADLTOTJ: 30
    QDYNPTYSCD: 1
    QJOBMSGQMX: 64
    QMAXACTLVL: *NOMAX
    QPFRADJ: 2


    any ideas?
    thanks

  • #2
    Re: Job performance Issue

    Originally posted by Imad_M2014
    Now here is the awkward situation: past a certain point, the iSeries gets very, very slow, to the point that even a READ statement in COBOL takes up to 150 or 200 ms, creating peaks in response time.
    It sounds like you have reached a max number of server jobs allowed, so certain requests are getting queued until a previous request completes.
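
    To illustrate the queuing effect, here is a minimal simulation sketch. It is a toy model, not a description of how the iSeries schedules work: it assumes all requests arrive at the same moment, every request takes the same fixed service time, and waiting requests are served first-in, first-out. The function and parameter names are hypothetical.

```python
import heapq

def simulate_latencies(num_requests, max_server_jobs, service_ms):
    """Return per-request latency (ms) when all requests arrive at t=0.

    Toy model (assumption): each server job handles one request at a time,
    every request takes exactly `service_ms`, and extra requests wait in a
    FIFO queue until a job becomes free.
    """
    # Min-heap of the times at which each server job next becomes free.
    free_at = [0.0] * max_server_jobs
    heapq.heapify(free_at)
    latencies = []
    for _ in range(num_requests):
        start = heapq.heappop(free_at)   # earliest available server job
        finish = start + service_ms
        heapq.heappush(free_at, finish)
        latencies.append(finish)         # arrival was at t=0, so latency == finish
    return latencies

if __name__ == "__main__":
    for users in (4, 8, 16):
        worst = max(simulate_latencies(users, max_server_jobs=4, service_ms=10))
        print(f"{users} simultaneous requests -> worst latency {worst:.0f} ms")
```

    In this model the work done per request never changes, yet the worst-case latency grows as soon as the number of simultaneous requests exceeds the number of server jobs.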


    Let's get some clarification on the statement above. Is the read statement taking 200ms of CPU to complete, or is it taking 200ms of time to complete?

    I'm also a little confused about the setup, so a little more info there would probably be helpful. You say the Java application is calling COBOL services. Can you explain that a bit? Are these COBOL programs wrapped into a web service?
    Michael Catalani
    IS Director, eCommerce & Web Development
    Acceptance Insurance Corporation
    www.AcceptanceInsurance.com
    www.ProvatoSys.com



    • #3
      Re: Job performance Issue

      Originally posted by Imad_M2014
      ...by using a test loader tool, increasing the simultaneous number of users working on the app.
      All servers will reach a point where significant performance degradation will be seen when the load is continually increased. Nothing surprising with that.

      We have excluded until now the following:
      CPU (it is OK)
      Memory (OK)
      record locking (we have checked this point)
      the HDD arms (it is a data center)
      Well, you have effectively excluded everything. It's not clear what we can do.

      That's especially true since there is no description of the environment. We would need to know what your system model is, what its hardware specifications are, how many disk arms are available, what the DASD utilization is, what OS level you're working with, what your general PTF levels are and perhaps other details. We need to know more than that you've already checked everything and found that all is well.

      What is your system environment?

      We have also checked the following system values:
      QMAXJOB: 163520 (default value by IBM)
      QACTJOB: 200
      QTOTJOB: 200
      QADLTOTJ: 30
      QDYNPTYSCD: 1
      QJOBMSGQMX: 64
      QMAXACTLVL: *NOMAX
      QPFRADJ: 2
      Those probably are unrelated to the problem. The last item, though, might need some expansion.

      You have the system set to attempt performance adjustments. Now, how have you configured your subsystems? What work management settings have been changed to allow performance adjustments to be helpful? The performance adjuster could be causing more trouble than helping if your subsystems have never been configured for the types of workloads you have.

      How are your subsystems configured for work management?
      Tom

      There are only two hard things in Computer Science: cache invalidation, naming things and off-by-one errors.

      Why is it that all of the instruments seeking intelligent life in the universe are pointed away from Earth?



      • #4
        Re: Job performance Issue

        Thank you all for your participation.

        To be more specific:
        Our application is a J2EE application deployed on WebSphere application servers, and it communicates with the iSeries machine through a socket server.
        On the iSeries, we have C and COBOL programs to handle the XML data received from the application server (so it is not a web service), and we use a pool of connections on the socket side so that connections can be shared among users, just for the service calls.

        Once the XML data is received by the front-end programs, it is processed and a chain of COBOL programs is called, performing certain tasks and then returning data or responses to the calling program on the application server.

        @Michael: yes, it seems we've reached a certain max, but we can't see which one!
        It is 200 ms of elapsed time, not CPU time.
        Which is really awkward, because COBOL statements on the iSeries are normally very, very fast.
        We have also noticed that when the results were at their worst, the number of active jobs (WRKACTJOB) was stable and not moving (1380), whereas it normally goes up and down (it increases as additional users create more connections, and decreases as we clean up unused connections). It is as if the iSeries was not able to give us more connections at that time! (Yet this was not the maximum number we have seen; we saw 1426 active jobs...)

        @Tom:
        Yes, we don't understand what is happening, as we have gone through many possibilities trying to explain the strange behavior!
        As for the hardware: it is a very powerful machine (POWER7), but I have to mention that the application servers are on this machine (using different partitions through VMWare), and the HDDs are external (data center), also very powerful.

        As for the workload, we have configured our servers so that they should normally support it (8 instances of application servers with a load balancer), and I think that the iSeries (especially a POWER7) is able to support 600 simultaneous users.

        On the statistics side: we didn't record any CPU overload on either the application servers or the iSeries, and the same goes for memory on both sides.

        But what I really can't understand:
        Let's suppose there is something wrong with the sockets, or with any other layer of the application. Once we are in the native iSeries environment, these issues should be irrelevant to the COBOL programs, which have been running for years in optimal conditions!
        The OPEN OUTPUT statement on the iSeries works without any record-locking issue, and the READ statement should not take this long, yet it happens across all the jobs. It is as if the machine hangs at some level, then goes back to normal, then hangs again, and while it is hanging we record these extreme times all over the system, even though all other tasks are working just fine. (For example, while we were seeing a "hang" stage, we were running SQL requests on other files (using STRSQL), and we didn't see any delays... And of course it is not the SQL that is hanging the system, because with or without it these peaks occur, and they are driving us crazy.)

        what can we look at on the subsystems? (btw, we have also created 4 subsystems to distribute the workload on the iSeries)



        • #5
          Re: Job performance Issue

          This sounds like a problem I had with some long-running SQL. It seems the performance adjustment was killing the subsystem. The SQL was optimizing for the memory available, but after a few minutes the available memory had changed. The 2-hour query went to 24+ hours.
          I don't know if this relates to Java at all, but it is an easy test. In the subsystem that is running the job, set the MIN and MAX memory to the same value. Run the job and see if performance flattens out.
          It may not help, but it is an easy test and you might get lucky with it.



          • #6
            Re: Job performance Issue

            Originally posted by Imad_M2014
            @Michael: yes, it seems we've reached a certain max, but we can't see which one!
            It is 200 ms of elapsed time, not CPU time.
            Which is really awkward, because COBOL statements on the iSeries are normally very, very fast.
            We have also noticed that when the results were at their worst, the number of active jobs (WRKACTJOB) was stable and not moving (1380), whereas it normally goes up and down (it increases as additional users create more connections, and decreases as we clean up unused connections). It is as if the iSeries was not able to give us more connections at that time! (Yet this was not the maximum number we have seen; we saw 1426 active jobs...)
            200 ms is a blink of the eye to us, but it is an eternity to a Power7 machine. If the job is not using CPU during this time, it means it is sitting idle waiting for something to occur (i.e., a disk read, the CPU to become available to it, some resource which is locked, etc.).
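
            The distinction between elapsed time and CPU time can be shown with a small generic sketch. This is not IBM i code: `waiting_call` and `busy_call` are hypothetical stand-ins, and the sleep only imitates a job that is blocked on something.

```python
import time

def measure(func):
    """Run func once; return (elapsed_seconds, cpu_seconds) for this process."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func()
    cpu_used = time.process_time() - cpu_start
    elapsed = time.perf_counter() - wall_start
    return elapsed, cpu_used

def waiting_call():
    # Stand-in (assumption) for a call that waits on a disk, a lock, or a queue.
    time.sleep(0.2)

def busy_call():
    # Stand-in (assumption) for a call that actually burns CPU.
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total

if __name__ == "__main__":
    for name, func in (("waiting", waiting_call), ("busy", busy_call)):
        elapsed, cpu = measure(func)
        print(f"{name}: elapsed={elapsed * 1000:.0f} ms, cpu={cpu * 1000:.0f} ms")
```

            The waiting call shows roughly 200 ms of elapsed time with almost no CPU time, which is the same signature as a job that is stalled on contention rather than doing work.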

            This has all the markings of being a contention issue (maybe multiple issues). I would probably focus on the subsystems you have set up. What would help is to get two screen snapshots of WRKSYSSTS. (We need the view that shows the DB and non-DB faults.) The first snapshot should be taken under a low stress load. The second should be taken when the stress load is high enough that you are seeing performance issues. This would allow us to rule memory or subsystem settings in or out as the source of the contention.

            I have always disliked the automatic performance adjustment. It can cause a machine to wobble due to a workload change, especially one to which it should not react at all. And as Arrow pointed out, the wobbling effect can affect how the SQL optimizer works.
            Michael Catalani


            • #7
              Re: Job performance Issue

              Originally posted by Imad_M2014
              what can we look at on the subsystems? (btw, we have also created 4 subsystems to distribute the workload on the iSeries)
              The first thing I'd want to see is the initial WRKSBS display. If that looks good, then WRKSYSSTS as Michael asked would be useful.

              The WRKSBS display gives a very basic view of how workloads might be distributed across memory pools. That's the first needed detail in determining if the performance adjuster is actually able to help or if it's simply stealing CPU cycles without the ability to make a difference.
              Tom



              • #8
                Re: Job performance Issue

                Ok

                I will take a snapshot when this strange behavior occurs again.

                I do have something new, though; I don't know if it helps:
                While tracing the logs on the iSeries yesterday, I discovered that a simple COBOL program, which receives a string and converts it to uppercase, is taking over 700 ms to complete! There are absolutely no other instructions in this program.
                So, in this case, there is no disk access, no complicated algorithm...

                The only thing is that the program is ILE COBOL, and it was compiled with QILE as the activation group, so I changed it to *CALLER and will run the test again.

                I don't think this can be the solution, but it's extraordinary... this simple program takes that long, while other programs are very, very fast. I wonder what is happening!



                • #9
                  Re: Job performance Issue

                  last updates:

                  the *CALLER in ACTGRP did the trick, now the program is finishing in 10 ms or less.

                  The other things that are still consuming too much time:
                  - the creation of a file in QTEMP and the OVRDBF related to it (too long)
                  - the other statements (READ...), but I think that something is bringing the system to its knees...

                  We can't control the activation group of the programs creating the file in QTEMP, as they are written in OPM COBOL, not ILE.



                  • #10
                    Re: Job performance Issue

                    Originally posted by Imad_M2014
                    last updates:

                    the *CALLER in ACTGRP did the trick, now the program is finishing in 10 ms or less.
                    That makes it sound like something is removing the QILE activation group at the end of the process. Changing from QILE to *CALLER should not have produced that kind of performance gain unless QILE is getting wiped out after every call. If it is, then that is likely a big part of the rest of your problem. (Especially if many of the other programs used in the process are activating in QILE.)
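
                    The reasoning can be pictured with a toy counting model. This is only an analogy and does not reproduce real ILE behavior: the `ActivationGroup` class, its methods, and the program name `UPPERCASE` are all hypothetical, and "activation" here simply stands for whatever one-time setup a program pays when it is first called in a group.

```python
class ActivationGroup:
    """Toy stand-in for a named activation group (not real IBM i behavior)."""

    def __init__(self):
        self.activated = set()   # programs currently activated in this group
        self.activations = 0     # how many times the one-time setup was paid

    def call(self, program):
        if program not in self.activated:
            self.activations += 1        # setup cost is paid only on activation
            self.activated.add(program)

    def reclaim(self):
        # Models the group being destroyed; the next call must activate again.
        self.activated.clear()

def count_activations(calls, reclaim_after_each_call):
    group = ActivationGroup()
    for _ in range(calls):
        group.call("UPPERCASE")          # hypothetical program name
        if reclaim_after_each_call:
            group.reclaim()
    return group.activations

if __name__ == "__main__":
    print("group persists: ", count_activations(100, False), "activation(s)")
    print("group reclaimed:", count_activations(100, True), "activation(s)")
```

                    If the group persists, the setup cost is paid once; if it is torn down after every call, it is paid on every call, which would match the difference seen after switching to *CALLER.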
                    Last edited by MichaelCatalani; May 22, 2014, 09:09 AM.
                    Michael Catalani