The past few weeks have been rather crazy for me as a Blackboard system admin. Our biggest issue appeared after our upgrade to Blackboard 9.1 Service Pack 8. During that week I awoke to a phone call from one of our ET@MO team members letting me know that several of our Blackboard application servers were not functioning correctly.
Now I must share that every morning our application servers stop the Blackboard services and rotate our log files, then restart the services. This has been happening for years. If we didn’t do this, some of the log files would just grow and grow in one single file. I wish Blackboard could find a way to fix this, but I digress.
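For readers who have not seen this kind of job, here is a minimal sketch of a stop, rotate, restart cycle. It is not our actual script: the function names, the "*.log" pattern, and the date-stamp suffix are all assumptions for illustration, and the stop and start steps are passed in as callables because the real service commands are specific to each installation.

```python
import shutil
from pathlib import Path


def rotate_log(log_path: Path, stamp: str) -> Path:
    """Move log_path to '<name>.<stamp>' and leave an empty file in its place."""
    archived = log_path.with_name(f"{log_path.name}.{stamp}")
    shutil.move(str(log_path), str(archived))
    log_path.touch()
    return archived


def rotate_all(log_dir, stamp, stop_services, start_services):
    """Stop services, rotate every *.log file in log_dir, then restart.

    stop_services and start_services are hypothetical callables standing in
    for whatever commands control the application on a given server.
    """
    stop_services()
    try:
        # sorted() materializes the file list before any file is moved.
        logs = sorted(Path(log_dir).glob("*.log"))
        archived = [rotate_log(path, stamp) for path in logs]
    finally:
        # Attempt the restart even if a rotation step failed.
        start_services()
    return archived
```

The try/finally is the design choice worth noting: a failed rename should not leave the services stopped.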
On the morning in question, four of our six application servers would not allow the Blackboard services to restart. Worst of all, our load balancer would not stop pointing users to these four servers. The headache that welcomed me that morning wasn’t getting any better. We did get the servers restarted and fixed before the start of classes that morning, but I still didn’t have an answer to what had happened.
We started to learn more when our Oracle DBA team found that a table was locked; once we cleared the lock on the peer_sessions table, the services started and the issue was gone. We started to chalk it up to an issue with the PushConfigUpdates process run during the previous weekend. That day I opened a ticket with Blackboard Support, hoping that we could find an answer quickly. With the help of our server administrators, I submitted the log and configuration files along with the ticket.
The next morning, I was awakened again by the same issues with the same four application servers. I knew this issue was bigger, and I needed an answer before the day was done. After spending time with Blackboard Support, I got a recommendation to implement the Java Message Service (JMS) broker called ActiveMQ. If you are like me, you asked, “What is ActiveMQ, and why do I need to have it up and running on my Blackboard instance?” To my surprise, ActiveMQ has been part of the Blackboard framework since Service Pack 3, but it has mostly been used by institutions that have purchased the content management product from Blackboard. ActiveMQ runs on every Blackboard instance by default in a peer discovery configuration. However, when ActiveMQ fails to function properly in this mode, it can affect the stability of the Blackboard Learn instance.
So the server administrator team and I started work on reconfiguring the ActiveMQ service. We practiced the process in our staging environment to make sure we were ready, then applied the new configuration to our production instance. After doing this, we were supposed to restart services, but decided to wait for the next morning and use the log rotation as the chance to implement our reconfiguration. I had a funny feeling that after two days of waking up to the sound of my cell phone ringing that I might be able to sleep until my alarm went off.
I got ready for bed, and something told me to open a browser window and load the monitoring website that our team uses to see the health of our Blackboard instances. I awoke at around 5:00am, slipped my glasses on, and looked at the screen to find that one of our application servers was not coming back up. Anyone who has awoken to find this will be able to quote the barrage of expressive statements that I used. Luckily, it had only been down for five minutes, but I couldn’t get the server to restart. I called my server administrator (who I’m sure had the same expressive statements when he saw a call from me) and asked him if he could firewall off the application server so no one could reach it. He stopped all the services, but then they magically started again. By this time I started thinking about getting a priest and some holy silicon.
The server administrator and I decided to stop the log rotation process during the weekend to allow both of us to get some needed sleep and come back to the issue on Monday. The weekend came and went without an issue on our Blackboard instance. On Monday, we decided to firewall off the server that had caused issues and then run the rotate logs script to see if the server would come back. To my surprise and delight, it did. We put it back into rotation and turned our log rotation back on, but we now stagger the rotations to prevent two or more servers from being down at the same time.
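The staggering idea can be sketched in a few lines. This is an illustration rather than our actual schedule: the server names, the start time, and the ten-minute gap are all made up, and the gap only prevents overlap if it is longer than one server's stop, rotate, and restart cycle.

```python
from datetime import datetime, timedelta


def staggered_schedule(servers, first_start, gap):
    """Give each server its own rotation start time, spaced `gap` apart."""
    return {server: first_start + i * gap for i, server in enumerate(servers)}


# Hypothetical names and times, for illustration only.
servers = ["app1", "app2", "app3", "app4", "app5", "app6"]
schedule = staggered_schedule(
    servers,
    first_start=datetime(2012, 1, 2, 4, 0),
    gap=timedelta(minutes=10),
)

for server, start in schedule.items():
    print(server, start.strftime("%H:%M"))
```

With six servers and a ten-minute gap, the last rotation starts fifty minutes after the first, so the whole window has to fit before users arrive.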
Now back to the load balancer. During this process, I learned that our load balancer was only checking each application server on ports 80 and 443, while our Tomcat process runs on a completely different port. This explained why the load balancer would continue to send users to servers that had failed to start the Tomcat process. Luckily, our networking team was able to set up a new rule on the load balancer so that if it got no response on the Tomcat port, it would not send users to that server.
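The logic of that rule can be sketched as a plain TCP check. This is not our load balancer's configuration, which lives on the appliance itself; the host names and the port number below are placeholders, since the real Tomcat port depends on the installation.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def servers_in_rotation(servers, tomcat_port, timeout=2.0):
    """Keep only the servers that answer on the Tomcat port."""
    return [host for host in servers if port_is_open(host, tomcat_port, timeout)]


if __name__ == "__main__":
    # Placeholder host names and port, for illustration only.
    print(servers_in_rotation(["app1.example.edu", "app2.example.edu"], 8080))
```

A successful connection only shows that something is listening on the port. It does not prove the application behind it is healthy, so a check that requests an actual page would be stricter.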
All in all, it was a good learning experience for everyone involved and helped us improve how we serve our Blackboard customers. Let me know if you have any questions or comments.
P.S. – A special thanks to Sonja in Tier I Blackboard Support for all her help during this case.