Some problems, especially those that occur only sporadically, are not easy to troubleshoot and call for a deeper understanding of the matter. In the following real-world example this means: SQL*Net tracing and some knowledge of the inner workings of the server's operating system, particularly random number generation.
This case is well suited to demonstrate an approach to troubleshooting connections to Oracle databases.
Scenario
A batch job written in Java does regular data synchronizations with a remote Oracle database. It opens up to five connections one after another.
Occasionally, timeouts happen while opening those connections, leading to errors in the batch process. A closer look at the application's log files shows that almost all connections are opened in less than a second, while a few take more than 20 seconds.
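For illustration, a stripped-down sketch of what the job does might look like the following (connect string and credentials are placeholders, not the actual application code); it simply times each getConnection() call:
---
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class ConnectTiming {

    // Placeholder connect string and credentials, adjust to your environment
    private static final String URL = "jdbc:oracle:thin:@dbsvr:1521:ORCL";

    public static void main(String[] args) throws SQLException {
        for (int i = 1; i <= 5; i++) {
            long start = System.currentTimeMillis();
            // Normally returns in well under a second; the outliers took 20-30 seconds
            try (Connection con = DriverManager.getConnection(URL, "user", "password")) {
                System.out.println("connection " + i + " established after "
                        + (System.currentTimeMillis() - start) + " ms");
            }
        }
    }
}
---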
Differential Diagnosis
Our first source of information was the application’s log file. Its timestamps showed that some connections took ~30s to establish:
---
2013-11-04 10:50:45,479 register driver oracle.jdbc.driver.OracleDriver
2013-11-04 10:50:45,638 connect to jdbc:oracle:thin:@dbsvr:1521:ORCL
2013-11-04 10:51:13,045 connection established
---
If this time span exceeds 60s (assuming default settings), the DB server terminates the connection request, which shows up in the alert log:
Fatal NI connect error 12170.
VERSION INFORMATION:
TNS for Linux: Version 11.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production
Time: 30-OCT-2013 08:47:36
Tracing not turned on.
Tns error struct:
ns main err code: 12535
TNS-12535: TNS:operation timed out
ns secondary err code: 12606
nt main err code: 0
nt secondary err code: 0
nt OS err code: 0
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=10.20.30.40)(PORT=33600))
WARNING: inbound connection timed out (ORA-3136)
Insights
Here's the workflow for opening an Oracle session:
- Client sends connection request to Oracle listener
- Listener checks whether there’s an existing SID or DB service matching the request.
- Listener calls the “pmon” background process of the matching DB instance.
- pmon starts an OS process (can be found in the server’s task list).
- This process allocates a free IP port on the DB server, typically within the range 9000-64000 (33600 in the alert log above).
- pmon sends process id and IP port back to the listener.
- Listener sends this information to the client.
- At this point the listener has done its job. Further progress of connection establishment is the client’s “responsibility”!
- Client starts a new IP connection DIRECTLY to the server process using the given port.
- Server process requests client to authenticate.
If the client doesn't authenticate within a given limit (60s by default), pmon terminates the server process and closes the IP connection to the client. This is logged in the alert log of the DB instance, as seen above.
As there were no other errors in the log files, it's clear that the connection between client and server was always established, but occasionally the authentication did not complete within the timeout period.
First Workaround
A possible measure to work around timeout issues with Oracle Net is to set higher timeouts in the configuration files “listener.ora” and “sqlnet.ora” (in $ORACLE_HOME/network/admin):
sqlnet.ora: SQLNET.INBOUND_CONNECT_TIMEOUT = 120
listener.ora: INBOUND_CONNECT_TIMEOUT_<listener_name> = 110
The unit for these timeout parameters is seconds. The listener's timeout should be set shorter than the DB server's so that there is time left for handing the connection over to the OS process. This is detailed in MOS Note "Troubleshooting ORA-3136: WARNING Inbound Connection Timed Out" (Doc ID 465043.1).
Although there were no more timeout errors with these settings, the root cause of the issue remained unresolved. Furthermore, the batch jobs took much longer whenever "hanging" connection requests occurred.
Trace as further evidence
To clearly demonstrate that the issue isn't on the DB server's side, tracing Oracle Net is a good idea. In the trace files, every step of connection establishment is shown with timestamps.
Tracing can be done on both ends of the connection (client and server); on the server side, both the listener and the server process can be traced. From our first insights we already know that the listener worked fine. Thus, the only place where tracing makes sense in this case is the server process.
To trace the server process, another parameter has to be set in “sqlnet.ora”:
TRACE_LEVEL_SERVER=16
This setting takes effect immediately for every new server process. Beware that it can produce many large trace files, so switch it off as soon as possible!
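If you want the trace files written to a specific directory, sqlnet.ora can be extended along these lines (parameter names as documented for Oracle Net; note that with ADR enabled the directory setting is ignored and the files end up in the ADR, so DIAG_ADR_ENABLED is switched off here, an assumption that may not fit every environment):
---
# sqlnet.ora on the DB server
TRACE_LEVEL_SERVER = 16
TRACE_TIMESTAMP_SERVER = ON
DIAG_ADR_ENABLED = OFF
TRACE_DIRECTORY_SERVER = /u01/app/oracle/net_trace
---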
To get the trace files in a human-readable format, Oracle provides the command line tool “trcasst“. For our scenario “trcasst -o” is a good option and gives us the following output:
---
<--- Received 158 bytes - Data packet timestamp=04-NOV-2013 10:50:46:024
     Start of user function (TTIFUN)
     Get the session key (OSESSKEY)
---> Send 210 bytes - Data packet timestamp=04-NOV-2013 10:50:46:031
     Return opi parameter (TTIRPA)
<--- Received 992 bytes - Data packet timestamp=04-NOV-2013 10:51:13:023
     Start of user function (TTIFUN)
     Generic authentication call (OAUTH)
---
"Send" refers to packets sent by the server, "Received" to packets received from the client.
The above excerpt shows that the server received a request that it answered after 7 ms. It then took another 27 seconds until the next response from the client came in. The further authentication (after “OAUTH”) went quickly and successfully and therefore isn’t shown here.
To find out what happened on the network during this time, Wireshark was used. Its output showed that the 210 bytes sent by the server were delivered with the usual latency of roughly 1 ms. This was strong evidence that the client was the source of the problem!
Diagnosis
With enough evidence at hand, an SR was filed with the JDBC team at Oracle Support. The root cause of the intermittent timeouts was soon discovered:
- Beginning with Oracle 11g, the old authentication method "O3logon" was superseded by "O5logon" (which was also backported to Oracle 10.2.0.5, by the way).
- The new method requires a certain amount of random numbers for its encryption algorithms. Those random numbers are generated by the OS.
- Newer Linux systems provide these random numbers via the devices "/dev/random" and "/dev/urandom".
- The encryption algorithms require a certain amount of entropy, which is gathered from external sources such as keyboard and mouse input (which seems rather impractical for rack-mounted servers). Further reading under "References" below.
- Like many authentication methods, O5logon uses “/dev/random” as its source of randomness.
- "/dev/random" blocks (waits) when there's not enough entropy and continues to provide random numbers only once the "entropy pool" has been refilled to a certain level. This can easily take up to a minute (a short demonstration follows after this list).
- The entropy pool can dry up quickly if too many O5logon requests have to be handled in a short time span; other applications that consume random numbers can aggravate this.
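The blocking behaviour is easy to observe outside of Oracle. The following minimal Java sketch (device paths are the standard Linux ones, the byte count is arbitrary) reads from both devices and prints how long each read takes; on a machine with a drained entropy pool the read from "/dev/random" can hang for many seconds:
---
import java.io.FileInputStream;
import java.io.IOException;

public class RandomDeviceTest {
    public static void main(String[] args) throws IOException {
        for (String device : new String[] {"/dev/urandom", "/dev/random"}) {
            long start = System.currentTimeMillis();
            try (FileInputStream in = new FileInputStream(device)) {
                byte[] buf = new byte[64];
                int read = 0;
                // /dev/random may deliver fewer bytes per read and block in between
                while (read < buf.length) {
                    read += in.read(buf, read, buf.length - read);
                }
            }
            System.out.println(device + ": 64 bytes after "
                    + (System.currentTimeMillis() - start) + " ms");
        }
    }
}
---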
The final proof that the entropy pool was the source of our issue was obtained using several approaches:
- Using the old authentication method. In JDBC you can configure this with the command line argument:
-Doracle.jdbc.thinLogonCapability=o3
(Due to a security issue in the new method this would even be a good thing to do if you're not already using Oracle 12c clients!)
- Using the non-blocking device "/dev/urandom". In JDBC you can configure this with the command line arguments:
-Djava.security.egd=file:/dev/./urandom -Dsecurerandom.source=file:/dev/./urandom
- Monitoring the entropy pool; its size is usually around 3000 bits. When it drops below 500 bits, that's a sign of a drying-up pool (a small Java sketch for reading the pool size follows below). To monitor the pool every second, use the Linux command line tool "watch":
watch -n 1 cat /proc/sys/kernel/random/entropy_avail
Be aware, though, that the measurement itself influences the result; larger intervals (e.g. 3s) reduce this distortion.
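To correlate the pool level with the slow connects directly from the batch job, the kernel counter can also be read in Java right before each connection attempt. A rough sketch (the proc file is the standard Linux interface; the class and method names are made up for illustration):
---
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class EntropyProbe {

    // Returns the current size of the kernel entropy pool (in bits) as reported by /proc
    static int availableEntropy() throws IOException {
        byte[] raw = Files.readAllBytes(Paths.get("/proc/sys/kernel/random/entropy_avail"));
        return Integer.parseInt(new String(raw, StandardCharsets.US_ASCII).trim());
    }

    public static void main(String[] args) throws IOException {
        System.out.println("entropy_avail = " + availableEntropy());
    }
}
---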
Ensuring Entropy (Conclusion)
Three approaches were considered to solve the issue:
- Permanently changing the batch job so that the non-blocking "/dev/urandom" is used. This is achieved with the Java command line arguments:
-Djava.security.egd=file:/dev/./urandom -Dsecurerandom.source=file:/dev/./urandom
This became the preferred solution.
- Using "rngd" to generate random numbers faster.
- Changing the application to not open so many connections simultaneously.
This was postponed (indefinitely) because the developers argued that we couldn't be 100% sure the entropy consumption wasn't caused by other applications.
Generating random numbers in deterministic systems like computers is no trivial task. On the other hand, security algorithms have a high demand for randomness, which helps to mitigate the risk that adversaries circumvent security measures by making "educated guesses". For highly sensitive environments, commercial hardware random number generators exist.
References
- Oracle 12.2: Troubleshooting Oracle Net Services
- http://linux.die.net/man/4/random
- http://bugs.sun.com/view_bug.do?bug_id=6202721
- http://lwn.net/Articles/184925/
- http://serverfault.com/questions/303935/how-to-fix-the-entropy-pool-issue-with-rhel-5-x
- http://www.noah.org/wiki/Random_Notes
- Franck Pachot experienced a similar issue on Oracle 12.2 Server within a VM. One commenter posted a very interesting talk on the topic: https://media.ccc.de/v/32c3-7441-the_plain_simple_reality_of_entropy