Java Troubleshooting 2 – [ Third-Party Tools ]

When Java applications misbehave in production, logs are often not enough – and restarting the system is usually not an option.

Sooner or later, every Java developer faces issues that cannot be reproduced locally and must be diagnosed directly on a running JVM.

Java provides a set of powerful standard troubleshooting tools such as jps, jinfo, jstat, jstack, jmap and jcmd, which are essential for understanding JVM state, threads, memory, and performance.

However, in many real-world scenarios, these tools alone may not provide enough visibility into application-level behavior.

This is where third-party Java troubleshooting tools like BTrace and Arthas shine – allowing developers to trace execution, monitor runtime behavior, and diagnose problems live, safely, and efficiently, without modifying code or restarting the application.

1. Bytecode Tracing – BTrace

BTrace is a safe, dynamic tracing tool for the Java platform.

It dynamically instruments the classes of the target application to inject tracing code.

The first way to get BTrace is to clone the source code from its GitHub repository and build the binary yourself:

https://github.com/btraceio/btrace

The second way is to download a binary distribution directly from the releases page:

https://github.com/btraceio/btrace/releases

To demonstrate BTrace, we refactored our RandomTest class so that every step is extracted into its own method:

import java.util.concurrent.TimeUnit;

/**
 * -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=9999
 */
public class RandomTest {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            process();
        }
    }

    private static void process() throws InterruptedException {
        sleep(1);
        print(random());
    }

    private static void print(double random) {
        System.out.println(random);
    }

    private static double random() {
        return Math.random();
    }

    private static void sleep(int time) throws InterruptedException {
        TimeUnit.SECONDS.sleep(time);
    }
}

Our goal is to trace every method call while RandomTest is running.

To do so, we first need to create a BTrace script.

A BTrace script is a plain Java class in which the BTrace API (annotations such as @BTrace and @OnMethod, plus helpers from BTraceUtils) describes what to trace and how.

Below is a BTrace script that traces every method call in RandomTest:

import static org.openjdk.btrace.core.BTraceUtils.println;

import org.openjdk.btrace.core.BTraceUtils.Strings;
import org.openjdk.btrace.core.annotations.BTrace;
import org.openjdk.btrace.core.annotations.OnMethod;
import org.openjdk.btrace.core.annotations.ProbeClassName;
import org.openjdk.btrace.core.annotations.ProbeMethodName;

// Marks this class as a BTrace script
@BTrace
public class AllMethods {
    // Probe every method (regex /.*/) of the classes matching /RandomTest*/,
    // at method entry (the default location)
    @OnMethod(clazz = "/RandomTest*/", method = "/.*/")
    public static void m(@ProbeClassName String probeClass,
            @ProbeMethodName String probeMethod) {
        // Print "Class.method" followed by a newline (print would not add one)
        println(Strings.strcat(probeClass, Strings.strcat(".", probeMethod)));
    }
}
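Under the hood, BTrace injects tracing code that runs the handler m at the entry of every matched method, directly in the target's bytecode. The JDK offers no agent-free way to do that, but java.lang.reflect.Proxy gives a rough analogy of an entry probe: a handler that runs before the real method and sees the class and method names. This is a different technique (a runtime proxy, not bytecode instrumentation): it only intercepts calls made through an interface on an object we wrap ourselves, so it could not trace RandomTest's static methods. RandomSteps, RealSteps and ProbeProxyDemo are hypothetical names used only for this sketch.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface for this illustration only: RandomTest uses static
// methods, which a dynamic proxy cannot intercept (BTrace has no such limit).
interface RandomSteps {
    double random();
    void print(double value);
}

class RealSteps implements RandomSteps {
    public double random() { return Math.random(); }
    public void print(double value) { System.out.println(value); }
}

class ProbeProxyDemo {
    // Every intercepted call is recorded here, similar to what the BTrace handler prints.
    static final List<String> TRACE = new ArrayList<>();

    static <T> T traced(Class<T> iface, T target) {
        InvocationHandler handler = (proxy, method, args) -> {
            // Runs before the real method, roughly like a probe at method entry.
            String probe = iface.getSimpleName() + "." + method.getName();
            TRACE.add(probe);
            System.out.println(probe);
            try {
                return method.invoke(target, args);
            } catch (InvocationTargetException e) {
                throw e.getCause(); // rethrow the target's own exception, unwrapped
            }
        };
        Object instance = Proxy.newProxyInstance(
                iface.getClassLoader(), new Class<?>[] {iface}, handler);
        return iface.cast(instance);
    }

    public static void main(String[] args) {
        RandomSteps steps = traced(RandomSteps.class, new RealSteps());
        steps.print(steps.random());
    }
}
```

Running main prints RandomSteps.random, then RandomSteps.print, then the random value, which mirrors the Class.method lines produced by the BTrace script.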

While writing the script, the BTrace API jars must be on the classpath for it to compile.

They can be found in the libs folder of the BTrace distribution:

  • btrace-agent.jar
  • btrace-boot.jar
  • btrace-client.jar

Once the script has been created, run <btrace>/bin/btrace <PID> <trace_script> to attach to the Java application with the given PID and to compile and submit the trace script.
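The <PID> argument is the process id of the target JVM (28202 in the run below); jps, one of the standard tools mentioned earlier, is the usual way to find it. For illustration, a similar list can be built with the Java 9+ ProcessHandle API. This is a sketch under an explicit assumption: it treats any process whose executable is named java as a JVM, which is cruder than jps and misses JVMs started through other launchers. JavaProcesses is a hypothetical name.

```java
import java.util.List;
import java.util.stream.Collectors;

class JavaProcesses {
    // Heuristic (assumption): a process is a JVM if its executable is named java / java.exe.
    // Unlike jps, this misses JVMs started through custom launchers.
    static boolean looksLikeJvm(ProcessHandle p) {
        return p.info().command()
                .map(cmd -> cmd.endsWith("/java") || cmd.endsWith("\\java.exe") || cmd.equals("java"))
                .orElse(false);
    }

    // One "pid commandLine" line per visible JVM, similar in spirit to jps -l.
    static List<String> list() {
        return ProcessHandle.allProcesses()
                .filter(JavaProcesses::looksLikeJvm)
                .map(p -> p.pid() + " " + p.info().commandLine().orElse("<command line unavailable>"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println("this JVM's PID: " + ProcessHandle.current().pid());
        list().forEach(System.out::println);
    }
}
```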

Every method call of RandomTest is now traced:

btrace 28202 AllMethods.java

btrace INFO: Attaching BTrace to PID: 28202
btrace INFO: Successfully started BTrace probe: AllMethods.java
RandomTest.random
RandomTest.print
RandomTest.sleep
RandomTest.random
RandomTest.print
RandomTest.sleep
RandomTest.random
RandomTest.print
RandomTest.sleep
RandomTest.random
RandomTest.print
RandomTest.sleep
RandomTest.random
RandomTest.print
RandomTest.sleep

More example scripts are available in the samples folder of the BTrace distribution:

~/apps/btrace-v2.2.0-bin/samples$ ls

AllCalls1.java          DTraceRefDemo.java    OnThrow.java
AllCalls1Sampled.java   FileTracker.java      ProbeArgs.java
AllCalls2.java          FileTrackerJfr.java   ProbeExit.java
AllCalls2Sampled.java   FinalizeTracker.java  Profiling.java
AllCalls3.java          HistogramBean.java    Sizeof.java
AllCalls3Sampled.java   Histogram.java        SocketTracker1.java
AllLines.java           HistoOnEvent.java     SocketTracker.java
AllMethods1.java        java.net.socket.xml   SubtypeTracer.java
AllMethods.java         JdbcQueries.java      syscalls.d
AllMethodsLevels.java   JInfo.java            SysProp.java
AllMethodsSampled.java  JMap.java             Test.java
AllSync.java            JStack.java           ThreadBean.java
ArgArray.java           jthread.d             ThreadCounterBean.java
AWTEventTracer.java     jurls.d               ThreadCounter.java
btracedefs.h            LogTracer.java        ThreadStart.java
classload.d             MemAlerter.java       Timers.java
Classload.java          Memory.java           URLTracker.java
CommandArg.java         MultiClass.java       WebServiceTracker.java
Deadlock.java           NewArray.java
DTraceInline.java       NewComponent.java

2. Alibaba Java Diagnostic Tool – Arthas

Arthas allows developers to troubleshoot production issues in Java applications without modifying code or restarting servers.

It can be downloaded from its GitHub releases page:

https://github.com/alibaba/arthas/releases

Go into the Arthas folder and run the following command to start it:

java -jar arthas-boot.jar

arthas-boot lists the Java processes it finds; select the one we are interested in by entering its serial number.

Once attached to the process, Arthas enters CLI mode:

java -jar arthas-boot.jar

[INFO] JAVA_HOME: /home/ovo/Apps/jdk-21.0.3
[INFO] arthas-boot version: 4.1.3
[INFO] Found existing java process, please choose one and input the serial number of the process, eg : 1. Then hit ENTER.
* [1]: 6120 /home/ovo/.vscode/extensions/redhat.java-1.50.0-linux-x64/server/plugins/org.eclipse.equinox.launcher_1.7.100.v20251111-0406.jar
  [2]: 8204 RandomTest
2
[INFO] arthas home: /home/ovo/Apps/arthas-bin
[INFO] Try to attach process 8204
Picked up JAVA_TOOL_OPTIONS: 
[INFO] Attach process 8204 success.
[INFO] arthas-client connect 127.0.0.1 3658
  ,---.  ,------. ,--------.,--.  ,--.  ,---.   ,---.                           
 /  O  \ |  .--. ''--.  .--'|  '--'  | /  O  \ '   .-'                          
|  .-.  ||  '--'.'   |  |   |  .--.  ||  .-.  |`.  `-.                          
|  | |  ||  |\  \    |  |   |  |  |  ||  | |  |.-'    |                         
`--' `--'`--' '--'   `--'   `--'  `--'`--' `--'`-----'                          

wiki         https://arthas.aliyun.com/doc                                      
tutorials    https://arthas.aliyun.com/doc/arthas-tutorials.html                
version      4.1.3                                                              
main_class   RandomTest                                                         
pid          8204                                                               
start_time   2025-12-26 10:46:53.694                                            
current_time 2025-12-26 10:49:17.011                                            

[arthas@8204]$

We can then run various commands to diagnose the process.

2.1 dashboard – real-time statistics dashboard for the current system

ID      NAME                                          GROUP                  PRIORITY       STATE          %CPU            DELTA_TIME     TIME           INTERRUPTED     DAEMON         
35      Timer-for-arthas-dashboard-403b5444-2451-403a system                 5              RUNNABLE       0.77            0.038          0:0.188        false           true           
33      arthas-NettyHttpTelnetBootstrap-3-2           system                 5              RUNNABLE       0.3             0.014          0:0.200        false           true           
1       main                                          main                   5              TIMED_WAITING  0.04            0.002          0:0.169        false           false          
9       Reference Handler                             system                 10             RUNNABLE       0.0             0.000          0:0.001        false           true           
10      Finalizer                                     system                 8              WAITING        0.0             0.000          0:0.000        false           true           
11      Signal Dispatcher                             system                 9              RUNNABLE       0.0             0.000          0:0.000        false           true           
16      Notification Thread                           system                 9              RUNNABLE       0.0             0.000          0:0.000        false           true           
17      Common-Cleaner                                InnocuousThreadGroup   8              TIMED_WAITING  0.0             0.000          0:0.002        false           true           
18      Attach Listener                               system                 9              RUNNABLE       0.0             0.000          0:0.093        false           true           
20      arthas-timer                                  system                 9              WAITING        0.0             0.000          0:0.000        false           true           
27      arthas-NettyHttpTelnetBootstrap-3-1           system                 5              RUNNABLE       0.0             0.000          0:0.023        false           true           
28      arthas-NettyWebsocketTtyBootstrap-4-1         system                 5              RUNNABLE       0.0             0.000          0:0.001        false           true           
29      arthas-NettyWebsocketTtyBootstrap-4-2         system                 5              RUNNABLE       0.0             0.000          0:0.001        false           true           
30      arthas-shell-server                           system                 9              TIMED_WAITING  0.0             0.000          0:0.002        false           true           
31      arthas-session-manager                        system                 9              TIMED_WAITING  0.0             0.000          0:0.001        false           true           
34      arthas-command-execute                        system                 5              TIMED_WAITING  0.0             0.000          0:0.005        false           true           
                                                                                                                                                                                        
Memory                                 used          total        max          usage        GC                                                                                          
heap                                   15M           48M          3988M        0.38%        gc.g1_young_generation.count                  5                                             
g1_eden_space                          4M            26M          -1           15.38%       gc.g1_young_generation.time(ms)               30                                            
g1_old_gen                             8M            18M          3988M        0.21%        gc.g1_concurrent_gc.count                     2                                             
g1_survivor_space                      2M            4M           -1           72.39%       gc.g1_concurrent_gc.time(ms)                  2                                             
nonheap                                37M           40M          -1           92.74%       gc.g1_old_generation.count                    0                                             
codeheap_'non-nmethods'                1M            2M           5M           27.54%       gc.g1_old_generation.time(ms)                 0                                             
metaspace                              27M           27M          -1           98.78%                                                                                                   
codeheap_'profiled_nmethods'           4M            4M           117M         4.16%                                                                                                    
compressed_class_space                 3M            3M           1024M        0.30%                                                                                                    
codeheap_'non-profiled_nmethods'       899K          2496K        120036K      0.75%                                                                                                    
mapped                                 0K            0K           -            0.00%                                                                                                    
direct                                 4M            4M           -            100.00%                                                                                                  
mapped - 'non-volatile memory'         0K            0K           -            0.00%                                                                                                    
Runtime                                                                                                                                                                                 
os.name                                                                                     Linux                                                                                       
os.version                                                                                  6.8.0-90-generic                                                                            
java.version                                                                                21.0.3                                                                                      
java.home                                                                                   /home/ovo/Apps/jdk-21.0.3                                                                   
systemload.average                                                                          1.37                                                                                        
processors                                                                                  3                                                                                           
timestamp/uptime                                                                            Fri Dec 26 10:51:47 CET 2025/293s
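Most of the figures above (heap and memory-pool usage, GC counts and times, load average, processor count, java.version) are exposed by the standard java.lang.management MXBeans, and the -1 in the max column matches MemoryUsage.getMax() returning -1 when no maximum is defined. The sketch below reads the same beans directly, which can help when cross-checking the dashboard; whether Arthas reads exactly these beans is an assumption, and MiniDashboard is a hypothetical name.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.lang.management.OperatingSystemMXBean;
import java.util.Locale;

class MiniDashboard {
    static final long MB = 1024 * 1024;

    // "name used committed max" in MB; max stays -1 when the JVM leaves it undefined.
    static String format(String name, MemoryUsage u) {
        long max = u.getMax() < 0 ? -1 : u.getMax() / MB;
        return String.format(Locale.ROOT, "%-36s %6dM %6dM %6dM",
                name, u.getUsed() / MB, u.getCommitted() / MB, max);
    }

    public static void main(String[] args) {
        System.out.println(format("heap", ManagementFactory.getMemoryMXBean().getHeapMemoryUsage()));
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage(); // null if the pool is no longer valid
            if (usage != null) System.out.println(format(pool.getName(), usage));
        }
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf(Locale.ROOT, "gc %s count=%d time(ms)=%d%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        System.out.printf(Locale.ROOT, "os.name=%s load=%.2f processors=%d java.version=%s%n",
                os.getName(), os.getSystemLoadAverage(), os.getAvailableProcessors(),
                System.getProperty("java.version"));
    }
}
```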

2.2 thread – check the basic info and stack trace of the target thread

Below is an example that finds the top 3 busiest threads and prints their stack traces:

[arthas@6870]$ thread -n 3

"arthas-command-execute" Id=44 cpuUsage=0.47% deltaTime=0ms time=11ms RUNNABLE
    at java.management@25.0.1/sun.management.ThreadImpl.dumpThreads0(Native Method)
    at java.management@25.0.1/sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:469)
    at com.taobao.arthas.core.command.monitor200.ThreadCommand.processTopBusyThreads(ThreadCommand.java:206)
    at com.taobao.arthas.core.command.monitor200.ThreadCommand.process(ThreadCommand.java:122)
    at com.taobao.arthas.core.shell.command.impl.AnnotatedCommandImpl.process(AnnotatedCommandImpl.java:82)
    at com.taobao.arthas.core.shell.command.impl.AnnotatedCommandImpl.access$100(AnnotatedCommandImpl.java:18)
    at com.taobao.arthas.core.shell.command.impl.AnnotatedCommandImpl$ProcessHandler.handle(AnnotatedCommandImpl.java:113)
    at com.taobao.arthas.core.shell.command.impl.AnnotatedCommandImpl$ProcessHandler.handle(AnnotatedCommandImpl.java:110)
    at com.taobao.arthas.core.shell.system.impl.ProcessImpl$CommandProcessTask.run(ProcessImpl.java:385)
    at java.base@25.0.1/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:545)
    at java.base@25.0.1/java.util.concurrent.FutureTask.run(FutureTask.java:328)
    at java.base@25.0.1/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:309)
    at java.base@25.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1090)
    at java.base@25.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:614)
    at java.base@25.0.1/java.lang.Thread.runWith(Thread.java:1487)
    at java.base@25.0.1/java.lang.Thread.run(Thread.java:1474)


"main" Id=3 cpuUsage=0.0% deltaTime=0ms time=143ms TIMED_WAITING
    at java.base@25.0.1/java.lang.Thread.sleepNanos0(Native Method)
    at java.base@25.0.1/java.lang.Thread.sleepNanos(Thread.java:509)
    at java.base@25.0.1/java.lang.Thread.sleep(Thread.java:577)
    at java.base@25.0.1/java.util.concurrent.TimeUnit.sleep(TimeUnit.java:446)
    at app//RandomTest.main(RandomTest.java:9)


"Reference Handler" Id=13 cpuUsage=0.0% deltaTime=0ms time=1ms RUNNABLE
    at java.base@25.0.1/java.lang.ref.Reference.waitForReferencePendingList(Native Method)
    at java.base@25.0.1/java.lang.ref.Reference.processPendingReferences(Reference.java:246)
    at java.base@25.0.1/java.lang.ref.Reference$ReferenceHandler.run(Reference.java:208)
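The stack of the arthas-command-execute thread shows that thread -n ends up in sun.management.ThreadImpl.getThreadInfo, i.e. the implementation behind the standard ThreadMXBean, and each entry reports a deltaTime. The sketch below reproduces the idea with ThreadMXBean alone: take two CPU-time snapshots, rank threads by the CPU consumed in between, and fetch their stacks. The sampling interval and ranking are our assumptions rather than Arthas's actual algorithm; TopThreads is a hypothetical name.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TopThreads {
    // Returns up to n threads, ordered by CPU time consumed during intervalMillis.
    static List<ThreadInfo> busiest(int n, long intervalMillis) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isThreadCpuTimeSupported()) {
            throw new UnsupportedOperationException("per-thread CPU time is not supported");
        }
        if (!mx.isThreadCpuTimeEnabled()) {
            mx.setThreadCpuTimeEnabled(true);
        }
        // First snapshot: CPU nanoseconds per live thread (-1 if the thread is gone)
        Map<Long, Long> before = new HashMap<>();
        for (long id : mx.getAllThreadIds()) {
            before.put(id, mx.getThreadCpuTime(id));
        }
        try {
            Thread.sleep(intervalMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // Second snapshot: CPU time used by each thread during the interval
        Map<Long, Long> delta = new HashMap<>();
        for (long id : mx.getAllThreadIds()) {
            long now = mx.getThreadCpuTime(id);
            if (now < 0) continue;
            long prev = before.getOrDefault(id, 0L);
            delta.put(id, now - Math.max(prev, 0L));
        }
        List<Long> ids = new ArrayList<>(delta.keySet());
        ids.sort((a, b) -> Long.compare(delta.get(b), delta.get(a)));
        List<ThreadInfo> top = new ArrayList<>();
        for (long id : ids.subList(0, Math.min(n, ids.size()))) {
            ThreadInfo info = mx.getThreadInfo(id, Integer.MAX_VALUE); // full stack
            if (info != null) top.add(info); // null if the thread ended meanwhile
        }
        return top;
    }

    public static void main(String[] args) {
        for (ThreadInfo info : busiest(3, 200)) {
            System.out.printf("\"%s\" Id=%d %s%n", info.getThreadName(), info.getThreadId(), info.getThreadState());
            for (StackTraceElement frame : info.getStackTrace()) {
                System.out.println("    at " + frame);
            }
        }
    }
}
```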

2.3 jad – decompile the specified classes

[arthas@6870]$ jad RandomTest

ClassLoader:                                                                                                                                                                                                                                            
+-jdk.internal.loader.ClassLoaders$AppClassLoader@341d43cd                                                                                                                                                                                              
  +-jdk.internal.loader.ClassLoaders$PlatformClassLoader@d792ed7                                                                                                                                                                                        

Location:                                                                                                                                                                                                                                               
/home/ovo/.config/Code/User/workspaceStorage/2dd0ba3c459d2c4a6e96b3607036ee65/redhat.java/jdt_ws/jdt.ls-java-project/bin/                                                                                                                               

       /*
        * Decompiled with CFR.
        */
       import java.util.concurrent.TimeUnit;
       
       public class RandomTest {
           public static void main(String[] args) throws InterruptedException {
               while (true) {
/* 9*/             TimeUnit.SECONDS.sleep(1L);
/*10*/             double random = Math.random();
/*11*/             System.out.println(random);
               }
           }
       }

Affect(row-cnt:1) cost in 710 ms.
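The ClassLoader and Location lines in the jad header answer two common production questions: which class loader defined the class, and which directory or jar it came from. The same answers are available through plain reflection, as in the sketch below (ClassOrigin is a hypothetical name). Core JDK classes such as String report an empty chain and no code source, because they are defined by the bootstrap loader, which Java represents as null.

```java
import java.net.URL;
import java.security.CodeSource;
import java.util.ArrayList;
import java.util.List;

class ClassOrigin {
    // Loaders from the class's own loader up to (excluding) the bootstrap loader,
    // innermost first, like the indented chain in the jad header.
    static List<String> loaderChain(Class<?> type) {
        List<String> chain = new ArrayList<>();
        for (ClassLoader cl = type.getClassLoader(); cl != null; cl = cl.getParent()) {
            chain.add(cl.toString());
        }
        return chain;
    }

    // Directory or jar the class was loaded from; null when no code source is recorded.
    static URL location(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        return source == null ? null : source.getLocation();
    }

    public static void main(String[] args) {
        System.out.println("ClassLoader: " + loaderChain(ClassOrigin.class));
        System.out.println("Location: " + location(ClassOrigin.class));
    }
}
```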

2.4 stack – print out the full call stack of the current method

[arthas@6099]$ stack RandomTest

Press Q or Ctrl+C to abort.
Affect(class count: 1 , method count: 6) cost in 216 ms, listenerId: 1
ts=2025-12-28 06:24:51.040;thread_name=main;id=3;is_daemon=false;priority=5;TCCL=jdk.internal.loader.ClassLoaders$AppClassLoader@341d43cd
    @RandomTest.random()
        at RandomTest.process(null:15)
        at RandomTest.main(null:9)

ts=2025-12-28 06:24:51.042;thread_name=main;id=3;is_daemon=false;priority=5;TCCL=jdk.internal.loader.ClassLoaders$AppClassLoader@341d43cd
    @RandomTest.print()
        at RandomTest.process(null:15)
        at RandomTest.main(null:9)

ts=2025-12-28 06:24:52.044;thread_name=main;id=3;is_daemon=false;priority=5;TCCL=jdk.internal.loader.ClassLoaders$AppClassLoader@341d43cd
    @RandomTest.sleep()
        at RandomTest.process(RandomTest.java:14)
        at RandomTest.main(null:9)
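stack shows, for each invocation of a matched method, the chain of callers that led to it. When changing the code is acceptable, the Java 9+ StackWalker API yields the same chain from inside the method; the sketch below prints it in a similar "@method / at caller" layout. CallStack and its random and process methods are hypothetical stand-ins for RandomTest, and the whole point of Arthas's stack command is that it needs no such code change.

```java
import java.util.List;
import java.util.stream.Collectors;

class CallStack {
    // Frames of whoever called current(), innermost first, formatted like stack trace lines.
    static List<String> current() {
        return StackWalker.getInstance().walk(frames -> frames
                .skip(1) // drop the frame of current() itself
                .map(f -> f.getClassName() + "." + f.getMethodName()
                        + "(" + f.getFileName() + ":" + f.getLineNumber() + ")")
                .collect(Collectors.toList()));
    }

    // Hypothetical stand-ins for RandomTest.random() and RandomTest.process()
    static List<String> random() { return current(); }
    static List<String> process() { return random(); }

    public static void main(String[] args) {
        List<String> stack = process();
        System.out.println("@" + stack.get(0));
        for (String frame : stack.subList(1, stack.size())) {
            System.out.println("    at " + frame);
        }
    }
}
```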

2.5 trace – trace method calling path, and output the time cost for each node in the path

[arthas@6099]$ trace RandomTest process

Press Q or Ctrl+C to abort.
Affect(class count: 1 , method count: 1) cost in 82 ms, listenerId: 2
`---ts=2025-12-28 06:26:41.248;thread_name=main;id=3;is_daemon=false;priority=5;TCCL=jdk.internal.loader.ClassLoaders$AppClassLoader@341d43cd

`---ts=2025-12-28 06:26:41.250;thread_name=main;id=3;is_daemon=false;priority=5;TCCL=jdk.internal.loader.ClassLoaders$AppClassLoader@341d43cd
    `---[1001.12013ms] RandomTest:process()
        +---[99.96% 1000.743913ms ] RandomTest:sleep() #14
        +---[0.00% 0.01662ms ] RandomTest:random() #15
        `---[0.02% 0.153184ms ] RandomTest:print() #15

`---ts=2025-12-28 06:26:42.263;thread_name=main;id=3;is_daemon=false;priority=5;TCCL=jdk.internal.loader.ClassLoaders$AppClassLoader@341d43cd
    `---[1000.533521ms] RandomTest:process()
        +---[99.97% 1000.254762ms ] RandomTest:sleep() #14
        +---[0.00% 0.01386ms ] RandomTest:random() #15
        `---[0.02% 0.173234ms ] RandomTest:print() #15

`---ts=2025-12-28 06:26:43.265;thread_name=main;id=3;is_daemon=false;priority=5;TCCL=jdk.internal.loader.ClassLoaders$AppClassLoader@341d43cd
    `---[1001.112942ms] RandomTest:process()
        +---[99.96% 1000.756926ms ] RandomTest:sleep() #14
        +---[0.00% 0.018639ms ] RandomTest:random() #15
        `---[0.02% 0.220952ms ] RandomTest:print() #15
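Each child line of trace reports its share of the parent's total time. In the first sample, sleep() took 1000.743913 ms of 1001.12013 ms, so 1000.743913 / 1001.12013 ≈ 99.96%, while print() took 0.153184 / 1001.12013 ≈ 0.0153%, displayed as 0.02%. The sketch below reproduces that arithmetic by timing a hand-written copy of process() with System.nanoTime, which is exactly the manual instrumentation that trace spares us. ManualTrace is a hypothetical name, and the 100 ms sleep (instead of 1 s) only keeps the sketch quick.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.TimeUnit;

class ManualTrace {
    // Share of the parent's elapsed time spent in one child call, in percent.
    static double percent(double childMs, double totalMs) {
        return totalMs == 0 ? 0 : childMs * 100.0 / totalMs;
    }

    // Times each step of a hypothetical process(): step -> elapsed ms in call order,
    // plus a "total" entry for the whole method.
    static Map<String, Double> traceProcess() {
        Map<String, Double> steps = new LinkedHashMap<>();
        long t0 = System.nanoTime();
        try {
            TimeUnit.MILLISECONDS.sleep(100); // stands in for sleep(1)
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        long t1 = System.nanoTime();
        double value = Math.random();         // random()
        long t2 = System.nanoTime();
        System.out.println(value);            // print(...)
        long t3 = System.nanoTime();
        steps.put("sleep", (t1 - t0) / 1e6);
        steps.put("random", (t2 - t1) / 1e6);
        steps.put("print", (t3 - t2) / 1e6);
        steps.put("total", (t3 - t0) / 1e6);
        return steps;
    }

    public static void main(String[] args) {
        Map<String, Double> steps = traceProcess();
        double total = steps.remove("total");
        System.out.printf(Locale.ROOT, "`---[%.6fms] process()%n", total);
        steps.forEach((name, ms) -> System.out.printf(Locale.ROOT,
                "    +---[%.2f%% %.6fms ] %s()%n", percent(ms, total), ms, name));
    }
}
```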

2.6 watch – monitor methods in data aspect including return values, exceptions and parameters …

With the help of OGNL (Object-Graph Navigation Language), it is easy to inspect variables while methods are being invoked.

Below is an example that watches the return values of the random method of the RandomTest class:

[arthas@6099]$ watch RandomTest random "{returnObj}"

Press Q or Ctrl+C to abort.
Affect(class count: 1 , method count: 1) cost in 49 ms, listenerId: 3
method=RandomTest.random location=AtExit
ts=2025-12-28 06:30:08.643; [cost=0.075991ms] result=@ArrayList[
    @Double[0.6787309192433806],
]
method=RandomTest.random location=AtExit
ts=2025-12-28 06:30:09.647; [cost=0.02542ms] result=@ArrayList[
    @Double[0.35815648410827783],
]
method=RandomTest.random location=AtExit
ts=2025-12-28 06:30:10.648; [cost=0.0272ms] result=@ArrayList[
    @Double[0.10915218803808524],
]
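The OGNL expression {returnObj} builds a list, which is why each result is printed as an @ArrayList. Conceptually, watch wraps the matched method and records its parameters, its return value or thrown exception, and the call's cost. The sketch below hand-codes that idea around a single call site; unlike Arthas it requires changing the caller. Watch, watch and report are hypothetical names, and the location labels simply mirror the style of the output above.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.function.Supplier;

class Watch {
    // Most recent observation; one static field keeps the sketch short (not thread-safe).
    static String last;

    // Calls the supplier and records params, return value or exception, and cost,
    // loosely mirroring watch's params / returnObj / throwExp fields.
    static <T> T watch(String method, Object[] params, Supplier<T> call) {
        long start = System.nanoTime();
        try {
            T result = call.get();
            report(method, "AtExit", params, "returnObj=" + result, start);
            return result;
        } catch (RuntimeException e) {
            report(method, "AtExceptionExit", params, "throwExp=" + e, start);
            throw e;
        }
    }

    private static void report(String method, String location, Object[] params, String outcome, long start) {
        double costMs = (System.nanoTime() - start) / 1e6;
        last = String.format(Locale.ROOT, "method=%s location=%s [cost=%.6fms] params=%s %s",
                method, location, costMs, Arrays.toString(params), outcome);
        System.out.println(last);
    }

    public static void main(String[] args) {
        Supplier<Double> random = Math::random;
        double value = watch("RandomTest.random", new Object[0], random);
        System.out.println("caller got " + value);
    }
}
```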

2.7 profiler – generate a flame graph using async-profiler

[arthas@6099]$ profiler start --event wall
Profiling started

[arthas@6099]$ profiler stop --format html
OK
profiler output file: /home/ovo/github/BlogTests/java.troubleshooting.thirdparty.tools/arthas-output/20251228-063206.html

[arthas@6099]$

The profiling output is a flame graph in HTML format, which can be opened in a web browser.
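With --event wall, the profiler samples thread stacks at a fixed wall-clock interval, whether the threads are running, sleeping or blocked, and the flame graph is drawn by merging identical stacks: the wider a frame, the more samples contained it. The sketch below shows that sample-and-merge idea with plain Thread.getStackTrace, producing counts keyed in the collapsed-stack layout (frames root-to-leaf, separated by ;) that flame graph tools consume. It is a much cruder mechanism than async-profiler: each sample is expensive and subject to safepoint bias, which async-profiler is designed to avoid. WallSampler, the worker thread and the sampling interval are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class WallSampler {
    // Samples the target thread's stack every intervalMillis and counts identical stacks.
    // Keys use the collapsed layout: frames root-to-leaf joined by ';'.
    static Map<String, Integer> sample(Thread target, int samples, long intervalMillis) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < samples; i++) {
            StackTraceElement[] frames = target.getStackTrace(); // leaf frame first
            if (frames.length > 0) {
                StringBuilder key = new StringBuilder();
                for (int j = frames.length - 1; j >= 0; j--) {
                    if (key.length() > 0) key.append(';');
                    key.append(frames[j].getClassName()).append('.').append(frames[j].getMethodName());
                }
                counts.merge(key.toString(), 1, Integer::sum);
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical workload: a thread that, like RandomTest, spends its time sleeping.
        Thread worker = new Thread(() -> {
            try {
                while (true) Thread.sleep(1000);
            } catch (InterruptedException e) {
                // interrupted: exit the loop and let the thread end
            }
        }, "worker");
        worker.setDaemon(true);
        worker.start();
        sample(worker, 20, 10).forEach((stack, n) -> System.out.println(stack + " " + n));
        worker.interrupt();
    }
}
```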

More details about async-profiler can be found in its GitHub repository:

https://github.com/jvm-profiling-tools/async-profiler

The full list of Arthas commands is available at:

https://arthas.aliyun.com/doc/en/commands.html

Conclusion

BTrace and Arthas are both powerful open-source third-party tools for troubleshooting Java processes in production.

Personally, I prefer Arthas over BTrace, because BTrace requires writing a script first and is less well documented.

Above all, the process has to be restarted once the diagnosis is done in order to get rid of BTrace's instrumentation.

Arthas, on the other hand, is much easier to use and well documented.

Its stop command detaches Arthas from the process and removes its instrumentation.

Whichever tool you prefer, use it with care, and validate it thoroughly in a local environment before using it in production.