Alerting on read-only filesystem errors

Troubleshooting filesystem issues can be a complicated task, and it is made almost impossible if the system is unable to create logs or if you're unable to access the logs.

There are a few instances in which this might occur. This article will focus on a specific one: when the filesystem is remounted as read-only.

Read-only Filesystem Issue / Errors

By default, Linux mounts filesystems as read-write, but sometimes, due to failures, they are automatically remounted as read-only. When that happens, you can only read your files, not write to them.

There are a couple of different reasons why an issue like this might occur. One of the more common is when the system boots after a crash or power loss that prevented it from shutting down cleanly. The system will try to repair the filesystem during boot, but if it is unable to do so, it will mount it in read-only mode.

A real-world example might be if your hosting provider has an infrastructure failure. Your servers come back online, but all of a sudden the filesystem is in read-only mode.



Working with Read-only Filesystem Logs

Building on the above scenario, let's assume your server filesystem is in read-only mode. What do you do? How do you identify that this is what is happening?

First, keep an eye out for "Read-only file system" warnings in the terminal. For example, the shell may print one in the middle of a command when it cannot create a temporary file:

root@test-server:~# tail /va-bash: cannot create temp file for here-document: Read-only file system
-bash: cannot create temp file for here-document: Read-only file system


This is a good indicator that a problem exists. A good way to confirm is to touch a file and see if it works:

root@test-server:~# touch test
touch: cannot touch 'test': Read-only file system


If it does not allow you to create the file, you are likely dealing with a read-only filesystem issue.
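The same touch-style probe can be scripted so that a monitoring job can run it. The following is a minimal sketch, not a definitive implementation: the function name is_read_only and the "rofs-probe-" file prefix are made up for illustration, and it assumes the caller has permission to write to the directory when the filesystem is healthy.

```python
import errno
import os
import tempfile


def is_read_only(directory: str) -> bool:
    """Return True if creating a file in `directory` fails with EROFS.

    Hypothetical helper: it mirrors the manual `touch test` check.
    """
    try:
        # Create a scratch file, much like `touch` would.
        fd, path = tempfile.mkstemp(prefix="rofs-probe-", dir=directory)
    except OSError as exc:
        if exc.errno == errno.EROFS:
            return True  # "Read-only file system"
        raise  # permission or disk-full errors are a different problem
    # The filesystem accepted the write, so clean up the scratch file.
    os.close(fd)
    os.remove(path)
    return False
```

On a server in the state described here, calling is_read_only("/") as root should report True, since it fails for the same reason touch does. Other errors are re-raised on purpose so they are not mistaken for a read-only remount.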

Another way to test is with mount. Run the "mount" command and check the options shown for each filesystem to confirm whether it is in read-only (ro) mode:

# mount
..
/dev/sda4 on / type ext4 (ro,relatime,data=ordered)


In the entry above, the ro option is the abbreviation for read-only; a writable filesystem shows rw instead.
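If you want to automate this check, the mount table can be parsed rather than read by eye. The sketch below assumes the /proc/mounts layout (device, mount point, type, options, and two numeric fields), which differs slightly from the "on / type" wording that the mount command prints. The function name read_only_mounts is hypothetical.

```python
def read_only_mounts(mounts_text: str) -> list[tuple[str, str]]:
    """Return (device, mount_point) pairs whose mount options include `ro`.

    Assumes /proc/mounts formatting:
    device mount_point fs_type options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, _fs_type, options = fields[:4]
        # Compare whole options so that `errors=remount-ro` does not match.
        if "ro" in options.split(","):
            result.append((device, mount_point))
    return result
```

To check a live system, pass it the contents of /proc/mounts. Keep in mind that some filesystems are mounted read-only on purpose, so an alert would likely need to be limited to the devices you expect to be writable.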

Solving the issue can be as simple as rebooting the server, but if that does not work you will have to investigate further.



Leveraging Logs to Detect Read-only Filesystem Errors

Once your server is stuck in read-only mode, new logs will not be written to local disk, but they can still be sent to a remote server via syslog. You can also read the local logs up to the point at which the filesystem became read-only.
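Because sending a syslog message over the network does not touch the local disk, a small script can still report the problem while the filesystem is read-only. Here is a minimal sketch using Python's standard SysLogHandler over UDP. The function name build_remote_logger, the logger name "rofs-alert", and the choice of the daemon facility are assumptions for illustration; the host and port depend on your own logging server.

```python
import logging
import logging.handlers


def build_remote_logger(host: str, port: int = 514) -> logging.Logger:
    """Return a logger that forwards records to a remote syslog server (UDP).

    Hypothetical helper: names and facility are illustrative choices.
    """
    logger = logging.getLogger("rofs-alert")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # do not also hand records to local handlers
    # Drop handlers from earlier calls so messages are not sent twice.
    for old in list(logger.handlers):
        logger.removeHandler(old)
        old.close()
    handler = logging.handlers.SysLogHandler(
        address=(host, port),
        facility=logging.handlers.SysLogHandler.LOG_DAEMON,
    )
    handler.setFormatter(logging.Formatter("%(name)s: %(message)s"))
    logger.addHandler(handler)
    return logger
```

A call such as build_remote_logger("syslog.example.com").critical("/ is mounted read-only") would send one message to that (placeholder) host. UDP gives no delivery guarantee, so treat this as a best-effort signal.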

A great tool to use in a situation like this is dmesg, a command available on most Linux machines that prints the kernel's message buffer. The following command searches that buffer for log entries associated with our filesystem (sda4 in this example):

# dmesg | grep sda4
[ 4.259651] EXT4-fs (sda4): recovery complete
[ 8.118360] EXT4-fs (sda4): Couldn't remount RDWR because of unprocessed orphan inode list. Please umount/remount instead


Here we see that a recovery occurred and that, after rebooting, the filesystem could not be remounted read-write because of an unprocessed orphan inode list.

Be aware that the kernel log file (such as kern.log) is written to disk by a syslog rule like this one:

kern.* -/var/log/kern.log

If you only look at that file, you might not see the entries provided by dmesg. dmesg reads the kernel's in-memory message buffer rather than a file on disk, so it still lets you see the errors. If you have remote syslog, you would also be able to see them on your central logging server.

As you search for your filesystem (sda4 in our example), you will find different errors about mounting or unmounting. These provide guidance on what you should look at next. In our example, the error was an unprocessed orphan inode list, which caused the remount as read-write (RDWR) to fail.
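The grep step can be narrowed so that only the lines likely to matter are flagged. The sketch below filters kernel log text for a device name plus a handful of phrases. The phrase list is an assumption chosen for illustration and is not exhaustive, and the function name suspicious_lines is hypothetical.

```python
import re

# Illustrative phrases only; extend the list to match what your kernel logs.
SUSPICIOUS = re.compile(r"remount|read-only|orphan inode|error", re.IGNORECASE)


def suspicious_lines(kernel_log: str, device: str) -> list[str]:
    """Return lines that mention `device` and contain a suspicious phrase."""
    hits = []
    for line in kernel_log.splitlines():
        # Same idea as `dmesg | grep sda4`, plus a phrase filter.
        if device in line and SUSPICIOUS.search(line):
            hits.append(line.strip())
    return hits
```

Fed the two dmesg lines shown earlier, this keeps the "Couldn't remount RDWR" entry and drops the "recovery complete" one. On a live system the input could come from running dmesg through the subprocess module, which may require root depending on how the kernel is configured.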



Solving Read-only Issues with fsck

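If you have unprocessed orphan inodes, as in the example above, you can run fsck, a system utility, to repair the filesystem. This works even on a live system as long as the filesystem is still mounted read-only; running fsck on a filesystem mounted read-write can cause further damage. Here is an example of what that might look like: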

# fsck /dev/sda4
fsck from util-linux 2.31.1
e2fsck 1.44.1 (24-Mar-2018)
/dev/sda4 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
... Deleted inode 26738700 has zero dtime. Fix? yes
Pass 2: Checking directory structure
Entry 'f83b126d6b9ac1a47067fa292b35f63c' in /etc/nginx/.../c/63 (24773501) has deleted/unused inode 24871392. Clear? yes
Fix? yes
/dev/sda4: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sda4: ***** REBOOT SYSTEM *****
/dev/sda4: 1021146/28262400 files (0.4% non-contiguous), 11715400/113048832 blocks


Once recovered, you can reboot and start working on your server again.
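If you run fsck from a script, its exit status tells you whether a reboot is needed without having to parse the output. The fsck(8) manual page documents the status as a sum of bit flags; the sketch below decodes it. The function name describe_fsck_status is hypothetical, and the script is assumed to obtain the status itself (for example from the return code of a subprocess call).

```python
# Exit-status bits as documented in fsck(8); the status is their bitwise OR.
FSCK_EXIT_BITS = {
    1: "filesystem errors corrected",
    2: "system should be rebooted",
    4: "filesystem errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "checking canceled by user request",
    128: "shared-library error",
}


def describe_fsck_status(status: int) -> list[str]:
    """Translate an fsck exit status into human-readable conditions."""
    if status == 0:
        return ["no errors"]
    # Report every flag that is set, in ascending bit order.
    return [text for bit, text in FSCK_EXIT_BITS.items() if status & bit]
```

For instance, describe_fsck_status(3) returns both "filesystem errors corrected" and "system should be rebooted", which is the combination a monitoring script would want to act on after a repair.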



Importance of Remote Logging

This experience highlights the importance of remote logging in your server maintenance and security strategy. A crash that switches your filesystem to read-only should be treated as a high-priority issue. But if the server cannot write to its logs because the filesystem is read-only, an organization can suffer serious availability issues for hours before someone identifies the cause.

In the scenario above, we had to use dmesg to pull entries from the kernel buffer because the system was no longer writing to kern.log. Had remote logging been enabled, you would have been able to use your remote log management platform to identify and alert on critical issues like this one (e.g., remount RDWR because of unprocessed orphan inode list).

This is another great example of the importance of having an effective logging strategy in place for your infrastructure.





Posted in logs, logs-to-watch by Daniel Cid (@dcid)
