Tech

20120810

Node Manager Starting Weblogic Server life cycle explained


Node Manager (NM) is a Java process that helps keep WLS servers highly available, through whole server migration and automatic restart of WLS nodes after an unwanted shutdown. NM uses the ServerMBean, ServerStartMBean, and SSLMBean to set JAVA_VENDOR, JAVA_HOME, JAVA_OPTIONS, SECURITY_POLICY, CLASSPATH, and ADMIN_URL. Therefore, starting a managed server through NM does not necessarily use the startWebLogic.sh or startWebLogic.cmd script in the domain's bin directory.

This is because the StartScriptEnabled flag in the nodemanager.properties file is set to false by default. It means that any modification made to startWebLogic.sh, such as adding debug flags or adding a new jar to the classpath, will not be picked up, or better said, will not be passed to the JVM.

To solve this unpleasant situation, go to the wlserver_10.3/common/nodemanager directory and edit nodemanager.properties as follows:

StartScriptEnabled=true

Also pay attention to another setting in the same file: StartScriptName tells NM which script to use to start the WLS node. By default it uses startWebLogic.sh, but you can create a new file in the domain's bin directory with all the start-up configuration you need:

StartScriptName=startWebLogicModified.sh
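
As a quick sanity check, the two settings above can be read back from the file. This is only a minimal sketch in Python, assuming a plain key=value properties layout; the helper name read_properties and the sample text are hypothetical and not part of WLS.

```python
def read_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


# Hypothetical excerpt; a real nodemanager.properties holds more entries.
sample = """
# nodemanager.properties (excerpt)
StartScriptEnabled=true
StartScriptName=startWebLogicModified.sh
"""

props = read_properties(sample)
# Fallback values follow the defaults described in the text above.
uses_script = props.get("StartScriptEnabled", "false").lower() == "true"
script_name = props.get("StartScriptName", "startWebLogic.sh")
print("NM uses start script:", uses_script, "->", script_name)
```

If uses_script comes out false, edits made to the start script will not reach the JVM when the server is started through NM.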

Understand that in some situations you are using NM to start the WLS nodes without knowing it; administrators sometimes overlook this. Below is the list of situations in which you may be using NM to start servers:

Admin Console - You need to have the NM running to start any instance.

WLST -
1. When you use the following WLST commands:
Start NM:
offline> startNodeManager()
Connect:
offline> nmConnect('username','password','nmHost','nmPort','domainName','domainDir','nmType')
Start Server:
wls:/nm/mydomain> nmStart('AdminServer')
Stop Server:
wls:/nm/mydomain> nmKill('serverName')


2. When you use WLST to connect to the AdminServer, which is just like using the Admin Console:
Connecting to the AdminServer:
(offline)> connect('username','password')
Then you can start a single server or a whole cluster by name:
wls:/mydomain/serverConfig> start('managedServerName','Server')
wls:/mydomain/serverConfig> start('clusterName', 'Cluster')

Note that some basic rules must be followed for NM to start a server:

1. Every server must be assigned to the NM. Do not configure only part of the cluster to start with NM and hope that all the servers will start.

2. There is one NM per physical machine. WLS has the great feature of extending a domain across physical machines, so I insist again: remember to assign the WLS nodes to the NM located on the same physical machine.

3. Remember that NM by default uses the ServerStartMBean to set the environment, so if you have edited startWebLogic.sh the JVM will not pick up the changes, unless, as explained, you set StartScriptEnabled and StartScriptName in the nodemanager.properties file according to your requirements.

20120809

How to Install OpenJDK on Ubuntu

A. Installing OpenJDK on Ubuntu is very easy and a one-shot deal. First you need to search for the JDK version. Open a terminal and run the following command:

$ sudo apt-cache search openjdk-

This should give you some choices like:


openjdk-6-doc - OpenJDK Development Kit (JDK) documentation
openjdk-6-jdk - OpenJDK Development Kit (JDK)
openjdk-6-jre - OpenJDK Java runtime, using Hotspot JIT
and 
openjdk-7-doc - OpenJDK Development Kit (JDK) documentation
openjdk-7-jdk - OpenJDK Development Kit (JDK)
openjdk-7-jre - OpenJDK Java runtime, using Hotspot JIT

Many other packages should appear as well.

OR

you can just type in the following command: 

$ sudo apt-get install openjdk- 

and tab twice!

B. To install OpenJDK, run the following command, which is similar to the one above:

$ sudo apt-get install openjdk-6-jdk

Just follow the screen instructions and that's it. 

C. In case you installed both OpenJDK-6 and OpenJDK-7, you can simply switch between them by using this command:

$ sudo update-alternatives --config java

Then you should see this: 
There are 2 choices for the alternative java (providing /usr/bin/java).

  Selection    Path                                           Priority   Status
------------------------------------------------------------
  0            /usr/lib/jvm/java-6-openjdk-i386/jre/bin/java   1061      auto mode
  1            /usr/lib/jvm/java-6-openjdk-i386/jre/bin/java   1061      manual mode
* 2            /usr/lib/jvm/java-7-openjdk-i386/jre/bin/java   1051      manual mode

Press enter to keep the current choice[*], or type selection number: 

I have selected 2 to set the OpenJDK-7 as the default java. 

D. In case you want to set a program's Java home to the non-default JDK, in my case OpenJDK-6, you can just provide the path: /usr/lib/jvm/java-6-openjdk-i386




What about Service Guardian on Coherence

Service Guardian is basically a stuck-thread watchdog for a Coherence cluster. It relies on heartbeats sent from the threads that Coherence itself creates and owns; if a thread on a specific node fails to issue its heartbeat, time-out flags are triggered so that corrective action can be taken.

The time-out recovery works like this:

Soft time-out _ Coh. attempts to interrupt the thread before the hard time-out is reached. If successful, normal processing resumes.


<Error> (thread=DistributedCache, member=1): Attempting recovery (due to soft
timeout) of Daemon{Thread="Thread[WriteBehindThread:CacheStoreWrapper(com.
tangosol.examples.rwbm.TimeoutTest),5,WriteBehindThread:CacheStoreWrapper(com.
tangosol.examples.rwbm.TimeoutTest)]", State=Running}


This is possibly caused by some network delay or latency. 305000 milliseconds is the default value, and no action is required unless you see this log output frequently, which means that you might need to watch your network traffic and do some tuning. You may also change the default value to better fit your needs.

Hard time-out _ After the configured time is reached, in this case the default of 305000 milliseconds, Coh. tries to stop the thread.


<Error> (thread=DistributedCache, member=1): Terminating guarded execution (due 
to hard timeout) of Daemon{Thread="Thread[WriteBehindThread:CacheStoreWrapper
(com.tangosol.examples.rwbm.TimeoutTest),5,WriteBehindThread:CacheStoreWrapper
(com.tangosol.examples.rwbm.TimeoutTest)]", State=Running}

The Coh. thread is not behaving as expected, and investigating with thread dumps might help identify the issue. But first you need to identify the node on which to take the thread dumps. The log above gives you hints such as "thread=DistributedCache, member=1": the thread is DistributedCache and the member is 1.
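
When there are many such log lines, pulling out the thread name and member id can be scripted. The following is a minimal Python sketch, assuming the log prefix always has the "(thread=..., member=...)" shape shown above; the function name guardian_source is hypothetical and the sample line is shortened.

```python
import re

# Pattern for the "(thread=..., member=...)" prefix seen in the log above.
PREFIX = re.compile(r"\(thread=(?P<thread>[^,]+),\s*member=(?P<member>\d+)\)")


def guardian_source(log_line):
    """Return (thread_name, member_id), or None if the prefix is absent."""
    match = PREFIX.search(log_line)
    if match is None:
        return None
    return match.group("thread"), int(match.group("member"))


# Shortened copy of the hard time-out message; the Daemon details are elided.
line = ("<Error> (thread=DistributedCache, member=1): Terminating guarded "
        "execution (due to hard timeout) of Daemon{...}")
print(guardian_source(line))
```

The member id tells you which node to take the thread dumps from.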

305000 milliseconds, if I'm not mistaken, is about 5 minutes. Therefore taking about 15 thread dumps, one every 30 seconds, should help analyse why, in this case, the DistributedCache is taking too long. Do not disregard network traffic; some issues can be resolved by using Coh.'s Unicast and Coh. WKA.

Settings for Unicast:


-Dtangosol.coherence.localhost=192.168.0.1
-Dtangosol.coherence.localport=8090
-Dtangosol.coherence.localport.adjust=true


Settings for Well Known Addresses:

-Dtangosol.coherence.wka=192.168.0.100
-Dtangosol.coherence.wka.port=8088



Lastly _ The dead end: after everything fails you are done for... Naahhhh!!! At this point Coh. actually tries to follow policies like:

  • Shutting down the cluster service: 
The faulty node stops all its cluster communication in an attempt to reset all the distribution services. Depending on your logging level and the size of the cluster, this could be a pain.
  • Shutting down the JVM:
I am not really experienced with this behaviour, but one thing is for sure: we would know which node is the culprit. I understand that WLS's Node Manager can start Coh. cache servers, and also that Node Manager can restart WLS servers... hummm... But I am not sure whether Node Manager restarts them in this case; anyway, below are some interesting links.

Start Coh. Servers from the WLS's Admin Console:
http://docs.oracle.com/cd/E28271_01/apirefs.1111/e13952/taskhelp/coherence/StartCoherenceServers.html
How NM restart Managed Servers:
http://docs.oracle.com/cd/E23943_01/web.1111/e13740/overview.htm#i1074986
 
  • Performing a custom action:


This option means that you have known situations in which Coherence threads might take longer than expected, or that you would like more control over how this feature behaves. It is preferable to follow Coh.'s documentation for these settings.
ref: http://docs.oracle.com/cd/E24290_01/coh.371/e22837/api_guardian.htm

Just know that Service Guardian is a fairly new feature of Coherence, introduced in version 3.5. This service is reaching good maturity in Coh. 3.7.1.xx, therefore upgrading to the latest version of Coherence is a must to avoid defects. One last thing: in case you do not want to go so deep into this feature, you can always disable it or raise the time-out value:

Disable Guardian:
-Dtangosol.coherence.guard.timeout=0
Raise time-out in milliseconds:
-Dtangosol.coherence.guard.timeout=700000
Issuing heartbeats from code:
import com.tangosol.net.GuardSupport;
// issue a heartbeat from the guarded thread
GuardSupport.heartbeat();
// known long-running operation: heartbeat with a time-out of cMillis milliseconds
GuardSupport.heartbeat(cMillis);











20120806

Weblogic Server Thread pool, from my view.


Weblogic Server has two different ways of handling thread pools:

Weblogic 8.1 Thread Pool Model

Work was processed in multiple execute queues. Different types of work were handled in different queues, based on priority and ordering requirements, to avoid deadlocks. Tuning this model is very manual and depends on the administrator's analysis and configuration to get real performance. Thread control basically meant changing the number of threads in the default queue, or configuring custom execute queues so that particular applications have access to a fixed number of execute threads. Oracle/BEA recommends migrating to Work Managers.

How to enable

- Configuring the server to use the 8.1 model is not as trivial as configuring it from the console. As mentioned above, the recommendation is to use Work Managers, since this style is very prone to human tuning mistakes. I mean, system resources change constantly just by adding a new program or changing parameters; a system may be well tuned, and then a small change happens and you could face some performance degradation.

1. Shut down the WLS's java instance. 
2. Edit the config.xml, adding the use81-style-execute-queues element set to true.
3. Start a new WLS's java instance. 
4. Explicitly create the weblogic.kernel.Default execute queue from the Console.
5. Reboot the WLS's server java instance. 

<server>
   <name>YourServer</name>
   <ssl>
      <name>myserver</name>
      <enabled>true</enabled>
      <listen-port>7002</listen-port>
   </ssl>
   <use81-style-execute-queues>true</use81-style-execute-queues>
   <listen-address/>
</server>

*You might need to repeat the same steps for each server.

Tuning 

The ThreadCount attribute of the ExecuteQueue element set in config.xml equals the number of simultaneous operations that can be performed by the applications assigned to the execute queue. Threads consume resources, and having too many of them ties up physical resources on unnecessary work. This can decrease the performance of the application as well as of the entire system.

The default ThreadCount differs depending on the mode in which you start WLS:

Development: Mainly used for load-test and development environments, which could mean that WLS has more I/O and other types of issues while using this mode. This mode mainly stresses the application container itself, and the default number of threads is 15.

Production: As the name says, this mode should be used for production. Changes are not applied on the fly, and the application server usually needs some extra steps, Lock & Edit being one example. This mode is for performance, passing the stress on to the underlying system, and the default number of threads is 25.

*In some cases, depending on cluster size and configuration, just changing the WLS mode from Development to Production can cause some performance degradation in your environment. Simple math, with the default 8.1 configuration:
10 instances in Development mode equal 150 threads sharing the system resources.
10 instances in Production mode equal 250 threads sharing the system resources.
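
The same arithmetic can be replayed for any cluster size. This is a small Python sketch that only uses the default thread counts quoted in this post; the names DEFAULT_THREADS and total_threads are hypothetical.

```python
# Default execute-thread counts per mode, as quoted in the text above.
DEFAULT_THREADS = {"development": 15, "production": 25}


def total_threads(instances, mode):
    """Total default execute threads across all WLS instances on a host."""
    return instances * DEFAULT_THREADS[mode]


dev = total_threads(10, "development")
prod = total_threads(10, "production")
print("development:", dev, "production:", prod, "extra threads:", prod - dev)
```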

Simple tuning of the thread count could resolve this issue; I would not consider taking Production mode out of a production environment. Here are some scenarios for modifying the default thread count:

(You need to understand the nature of your business; some applications have a peak time of usage at the end of every month. Example: banking, right after you get your salary.)

Thread Count < number of CPUs:
  • Behavior _ CPUs are not fully used at peak time, but there is work to be done.
  • Possible Action _ Increase the thread count.

Thread Count == number of CPUs:
  • Behavior _ CPUs are not fully used at peak time, but there is work to be done.
  • Possible Action _ Increase the thread count.

Thread Count > number of CPUs (by a moderate number of threads)
  • Behavior _ CPUs are fully used at peak time, with a moderate amount of context switching.
  • Possible Action _ Tune the thread count, and test performance.

Thread Count > number of CPUs (by a large number of threads)
  • Behavior _ CPUs are fully used at peak time, with a lot of context switching.
  • Possible Action _ Reduce the number of threads. (remember my example of Modes above...)
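
The scenarios above can be written down as a small decision helper. This is only a rough Python sketch: the function name suggest_action is hypothetical, and so is the moderate_factor cut-off between a "moderate" and a "large" surplus of threads, since the scenarios give no number for it.

```python
def suggest_action(thread_count, cpu_count, moderate_factor=2.0):
    """Map thread count vs. CPU count to the action listed above.

    moderate_factor is an assumed cut-off between "moderate" and "large";
    pick a value that matches what you measure on your own system.
    """
    if thread_count <= cpu_count:
        return "increase the thread count"
    if thread_count <= cpu_count * moderate_factor:
        return "tune the thread count and test performance"
    return "reduce the number of threads"


# Example: an 8-CPU host with the default Production thread count of 25.
print(suggest_action(25, 8))
```

Treat the result as a starting point for testing, not as a tuning rule.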



Work Manager

It is a single thread pool in which all types of work are executed. Tuning a Work Manager is on demand, meaning that it happens automatically. The queue monitors throughput based on history and decides how to adjust the thread count. Work is prioritized based on defined rules and runtime metrics (history) to avoid deadlocks. This way WLS takes over some of the tuning responsibility from the administrators.

Administrators can manage work by configuring scheduling guidelines through the following components:

  • Fair Share Request Class
    Specifies the average thread-use time required to process requests. Default is 50. (This is based on a percentage, so the actual timing also depends on the environment's physical processing capacity.)
  • Response Time Request Class
    Specifies a response time goal in milliseconds. The goal is not applied to individual requests.
  • Min Threads Constraint
    The guaranteed number of threads the server will allocate. Default is zero.
  • Max Threads Constraint
    Limits the number of concurrent threads. Default is unlimited, -1.
  • Capacity Constraint
    Forces the server to reject requests once it has reached its capacity; requests are rejected when either the individual or the global capacity is exceeded. The default is unlimited, -1.
  • Context Request Class
    Assigns request classes depending on context information.
WLS works on a best-effort basis, which means there is no guarantee that the configured ratio will be maintained. Its behavior can change depending on demand.

ref: http://docs.oracle.com/cd/E21764_01/web.1111/e13814/appb_queues.htm
ref: http://docs.oracle.com/cd/E24329_01/web.1211/e24432/self_tuned.htm#i1068066

Important: This information is based on my personal understanding of Oracle's documentation. Please refer to Oracle's documentation for further information and tuning. Remember that you, as an administrator, are responsible for the best tuning of your own environment. In simple words, use your gut instincts! If you have a problem with this, please read the article written by Leon Watson from MailOnline:
"Researches say our first thought is often our best" 
ref: http://www.dailymail.co.uk/news/article-2031848/Why-right-trust-gut-instincts-Scientists-discover-decision-IS-right-one.html

20120802

Weblogic's console slowness performance issue.


I have noticed that after installing the latest WLS with the latest JDK 1.7 on a Red Hat based 64-bit Linux, the Admin Console takes a long time to show up in the browser. The Admin Server itself starts fine, but after I provide the username and password I have to wait about 3 - 4 minutes to get access.

Here is my system config:

WLS 12.1.1
JDK 1.7.0_5
redhat linux based x86_64

Since this strange behavior happens during authentication and authorization of my user/password, it must be related to some security performance issue. From past experience, I learned that the Linux /dev/random and /dev/urandom devices have some effect on this behavior. Please check Wikipedia for further background on random and urandom.

With some googling, I came across many links that talk about improving start-up performance by adding the following option when starting the AdminServer:

$ ./startWebLogic.sh -Djava.security.egd=file:/dev/./urandom

This did the trick, and then I found the Java bug report 6202721, which was closed as "not a java bug"...


The other, more definitive way, which changes the behavior for all servers started with the same JDK, is to edit the java.security file:

#cat /usr/java/jdk1.7.0_05/jre/lib/security/java.security|grep "urandom"
# On Solaris and Linux systems, if file:/dev/urandom is specified and it
# This "NativePRNG" reads random bytes directly from /dev/urandom.
# On Windows systems, the URLs file:/dev/random and file:/dev/urandom
securerandom.source=file:/dev/urandom
#   -Djava.security.egd=file:/dev/urandom

Open the java.security file with vi and change
from: 
securerandom.source=file:/dev/urandom

to: 
securerandom.source=file:/dev/./urandom
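
To check several JDK installations for this setting, the active value can be read programmatically. This is a minimal Python sketch, assuming the file uses plain key=value lines with # comments as in the grep output above; the function name securerandom_source and the sample text are hypothetical.

```python
def securerandom_source(java_security_text):
    """Return the active securerandom.source value, or None if it is unset."""
    value = None
    for line in java_security_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue  # skip comments and lines that are not key=value
        key, _, val = line.partition("=")
        if key.strip() == "securerandom.source":
            value = val.strip()
    return value


# Hypothetical excerpt modelled on the grep output shown earlier.
sample = """
# On Solaris and Linux systems, if file:/dev/urandom is specified and it
securerandom.source=file:/dev/urandom
#   -Djava.security.egd=file:/dev/urandom
"""

current = securerandom_source(sample)
needs_change = current != "file:/dev/./urandom"
print("current:", current, "needs change:", needs_change)
```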

Hopefully this works for you.