**MichaelCatalani** · May 20, 2014, 09:59 AM

Re: Job performance Issue

Originally posted by Imad_M2014

Now here is the awkward situation: past a certain point, the iSeries gets very, very slow, to the point that even a READ statement in COBOL takes 150 to 200 ms, which creates peaks.

It sounds like you have reached a max number of server jobs allowed, so certain requests are getting queued until a previous request completes.
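As a rough illustration of that queuing effect, here is a minimal sketch. It assumes every request arrives at the same moment and takes the same service time; the function name, the 10-job limit and the 50 ms service time are invented for the example and are not taken from the system being discussed.

```python
import math

def response_times_ms(requests, workers, service_ms):
    """Response time of each request when all arrive at once and only
    `workers` server jobs can run at the same time (the rest wait in a queue)."""
    return [math.ceil((i + 1) / workers) * service_ms for i in range(requests)]

# Hypothetical limit of 10 server jobs, 50 ms of work per request.
for users in (5, 10, 20):
    times = response_times_ms(users, workers=10, service_ms=50)
    print(users, "users -> worst response", max(times), "ms")
```

Up to the limit, every request finishes in one service time; one request past the limit, somebody waits a full extra service time, which is the kind of step change that shows up as a peak.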

Let's get some clarification on the statement above. Is the read statement taking 200ms of CPU to complete, or is it taking 200ms of time to complete?

I'm also a little confused about the setup, so a little more info there would probably be helpful. You say the Java application is calling COBOL services. Can you explain that a bit? Are these COBOL programs wrapped in a web service?

**tomliotta** · May 20, 2014, 04:01 PM

Re: Job performance Issue

Originally posted by Imad_M2014

...by using a test loader tool, increasing the simultaneous number of users working on the app.

All servers will reach a point where significant performance degradation will be seen when the load is continually increased. Nothing surprising with that.

We have excluded until now the following:
CPU (it is OK)
Memory (OK)
record locking (we have checked this point)
the HDD arms (it is a data center)

Well, you have effectively excluded everything. It's not clear what we can do.

That's especially true since there is no description of the environment. We would need to know what your system model is, what its hardware specifications are, how many disk arms are available, what the DASD utilization is, what OS level you're working with, what your general PTF levels are and perhaps other details. We need to know more than that you've already checked everything and found that all is well.

What is your system environment?

We have also checked the following system values:
QMAXJOB: 163520 (default value by IBM)
QACTJOB: 200
QTOTJOB: 200
QADLTOTJ: 30
QDYNPTYSCD: 1
QJOBMSGQMX: 64
QMAXACTLVL: *NOMAX
QPFRADJ: 2

Those probably are unrelated to the problem. The last item, though, might need some expansion.

You have the system set to attempt performance adjustments. Now, how have you configured your subsystems? What work management settings have been changed to allow performance adjustments to be helpful? The performance adjuster could be causing more trouble than helping if your subsystems have never been configured for the types of workloads you have.

How are your subsystems configured for work management?

**Imad_M2014** · May 20, 2014, 10:45 PM

Re: Job performance Issue

Thank you all for your participation.

To be more specific:
our application is a J2EE application deployed on WebSphere application servers, and it communicates with the iSeries machine through a socket server.
On the iSeries, we have C and COBOL programs that handle the XML data received from the application server (so it is not a web service), and we use a connection pool on the socket side so that connections are shared among users, just for calling the services.

Once the XML data is received by the front programs, it is processed and a chain of COBOL programs is called, performing certain tasks and then returning data or responses to the caller program on the application server.

@Michael: yes, it seems we've reached a certain max, but we can't see which one.
It is 200 ms of elapsed time, not CPU time.
That is very odd, because COBOL statements on iSeries are normally very fast.
We have also noticed that when the results were at their worst, the number of active jobs (WRKACTJOB) was stable at 1380 and not moving, whereas normally it goes up and down (up when additional users open more connections, down when we clean up unused connections). It is as if the iSeries was not able to give us more connections at that time! (Yet this was not the maximum number we have seen; we saw 1426 active jobs...)

@Tom:
yes, we don't understand what is happening, as we have gone through many possibilities trying to explain the strange behavior!
As for the hardware: it is a very powerful machine (POWER7), but I have to mention that the application servers are on this machine too (using different partitions through VMware), and the HDDs are external, in a data center, and also very powerful.

As for the workload, we have configured our servers to support the expected load (8 application server instances behind a load balancer), and I think that the iSeries (especially a POWER7) is able to support 600 simultaneous users.

On the statistics side: we didn't record any CPU overload on either the application servers or the iSeries, and the same goes for memory on both sides.

But what I really can't understand:
let's suppose there is something wrong with the sockets, or with any other layer of the application. Once we are in the native iSeries environment, these issues should be irrelevant to the COBOL programs, which have been running for years in optimal conditions!
The OPEN OUTPUT statement on iSeries works without any record-locking issue, and the READ statement should not take this long, yet it happens across all the jobs. It is as if the machine hangs at some level, goes back to normal, then hangs again, and while it is hanging we record these extreme times all over the system, though all other tasks work just fine. (For example, during a "hang" stage we ran SQL requests on other files using STRSQL and didn't see any delays. And of course it is not the SQL that is hanging the system, because with or without it these peaks occur, and they are driving us crazy.)

What can we look at on the subsystems? (By the way, we have also created 4 subsystems to distribute the workload on the iSeries.)

**arrow483** · May 21, 2014, 06:53 AM

Re: Job performance Issue

This sounds like a problem I had with some long-running SQL. It seems the performance adjustment was killing the subsystem. The SQL was optimizing for the memory available, but after a few minutes the available memory had changed. The 2-hour query went to 24+ hours.
I don't know if this relates to Java at all, but it is an easy test. In the subsystem that is running the job, set the MIN and MAX memory to the same value. Run the job and see if performance flattens out.
It may not help, but it is an easy test and you might get lucky.

**MichaelCatalani** · May 21, 2014, 10:05 AM

Re: Job performance Issue

Originally posted by Imad_M2014

@Michael: yes, it seems we've reached a certain max, but we can't see which one.
It is 200 ms of elapsed time, not CPU time.
That is very odd, because COBOL statements on iSeries are normally very fast.
We have also noticed that when the results were at their worst, the number of active jobs (WRKACTJOB) was stable at 1380 and not moving, whereas normally it goes up and down (up when additional users open more connections, down when we clean up unused connections). It is as if the iSeries was not able to give us more connections at that time! (Yet this was not the maximum number we have seen; we saw 1426 active jobs...)

200 ms is a blink of the eye to us, but it is an eternity to a POWER7 machine. If the job is not using CPU during this time, it is sitting idle waiting for something to occur (e.g., a disk read, the CPU becoming available to it, or a locked resource being released).
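The difference between elapsed time and CPU time can be sketched on any platform. The snippet below is only an illustration, not an IBM i measurement: `waiting_read` and `busy_read` are made-up stand-ins for a READ that is blocked versus one that is actually computing, and `measure` is a hypothetical helper.

```python
import time

def measure(fn):
    """Return (elapsed_seconds, cpu_seconds) for one call of fn."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu

def waiting_read():
    # Stand-in for a READ blocked on something (disk, lock, CPU queue):
    # the clock advances but the process consumes almost no CPU.
    time.sleep(0.2)

def busy_read():
    # Stand-in for a READ that really burns about 0.2 s of CPU.
    end = time.process_time() + 0.2
    while time.process_time() < end:
        pass

for name, fn in (("waiting", waiting_read), ("busy", busy_read)):
    wall, cpu = measure(fn)
    print(f"{name:8s} elapsed={wall * 1000:6.1f} ms  cpu={cpu * 1000:6.1f} ms")
```

Both calls take roughly 200 ms on the clock, but only the busy one shows matching CPU time. A job that looks like the "waiting" case is where to start looking for contention.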

This has all the markings of being a contention issue (maybe multiple issues). I would probably focus on the subsystems you have set up. What would help is to get two screen snapshots of WRKSYSSTS. (We need the view that shows the DB and non-DB faults.) The first snapshot should be taken under a low stress load. The second should be taken when the stress load is high enough that you are seeing performance issues. This would allow us to rule out memory or subsystem settings as the source of the contention.

I have always disliked the automatic performance adjustment. It can cause a machine to wobble due to a workload change, especially one to which it should not react at all. And as Arrow pointed out, the wobbling effect can affect how the SQL optimizer works.

**tomliotta** · May 21, 2014, 02:30 PM

Re: Job performance Issue

Originally posted by Imad_M2014

What can we look at on the subsystems? (By the way, we have also created 4 subsystems to distribute the workload on the iSeries.)

The first thing I'd want to see is the initial WRKSBS display. If that looks good, then WRKSYSSTS as Michael asked would be useful.

The WRKSBS display gives a very basic view of how workloads might be distributed across memory pools. That's the first needed detail in determining if the performance adjuster is actually able to help or if it's simply stealing CPU cycles without the ability to make a difference.

**Imad_M2014** · May 22, 2014, 01:03 AM

Re: Job performance Issue

Ok

I will take a snapshot when this strange behavior occurs again.

I do have something new, though; I don't know if it helps:
while tracing the logs on the iSeries yesterday, I discovered that a simple COBOL program, which receives a string and converts it to uppercase, is taking over 700 ms to complete! There are absolutely no other instructions in this program.
So, in this case, there is no disk access, no complicated algorithm...

The only thing is that the program is ILE COBOL, and it was compiled with QILE as the activation group, so I changed it to *CALLER and will run the test again.

I don't think this can be the whole solution, but it's extraordinary... this simple program takes that long while other programs are very, very fast. I wonder what is happening!

**Imad_M2014** · May 22, 2014, 06:05 AM

Re: Job performance Issue

last updates:

the *CALLER in ACTGRP did the trick, now the program is finishing in 10 ms or less.

The other things that are still consuming too much time:
- the creation of a file in QTEMP and the OVRDBF related to it (too long)
- the other statements (READ...), but I think that something is bringing the system to its knees...

We can't control the activation group of the programs creating the file in QTEMP, as they are written in OPM COBOL, not ILE.

**MichaelCatalani** · May 22, 2014, 09:06 AM

Re: Job performance Issue

Originally posted by Imad_M2014

last updates:

the *CALLER in ACTGRP did the trick, now the program is finishing in 10 ms or less.

That makes it sound like something is removing the QILE activation group at the end of the process. Changing from QILE to *CALLER should not have produced that kind of performance gain unless QILE is getting wiped out after every call. If it is, then that is likely a big part of the rest of your problem (especially if many of the other programs used in the process are activating in QILE).
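To make that reasoning concrete, here is a toy model of the effect. It is not an IBM i API: the `Job` class, its `call` method and the `reclaim_after` flag are all hypothetical, and the only thing it counts is how often a named activation group has to be created from scratch.

```python
class Job:
    """Toy model of a job that holds named activation groups (hypothetical)."""

    def __init__(self):
        self.groups = {}      # activation groups currently alive in the job
        self.activations = 0  # how many times a group had to be created

    def call(self, group, reclaim_after=False):
        # Create the group only if it is not already active in this job.
        if group not in self.groups:
            self.groups[group] = {"static_storage": "initialized"}
            self.activations += 1
        # ... the program body would run here ...
        if reclaim_after:
            # Something tears the group down once the call returns.
            del self.groups[group]

persistent = Job()
for _ in range(1000):
    persistent.call("QILE")

wiped = Job()
for _ in range(1000):
    wiped.call("QILE", reclaim_after=True)

print(persistent.activations, wiped.activations)  # prints "1 1000"
```

If the group survives between calls, its setup cost is paid once. If it is reclaimed every time, the cost is paid on every call, which would be consistent with a trivial program taking hundreds of milliseconds.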
