Grep is the go-to command for searching content on Linux and Unix systems. Most system administrators and network/security professionals use it
on a daily basis when troubleshooting issues or looking for possible signs of compromise.
Grep is also fast. It lets you search through multiple files recursively and often returns results almost immediately. In fact, there is an interesting post
from the original author of GNU grep where he explains why it is so fast:
#1 trick: GNU grep is fast because it AVOIDS LOOKING AT EVERY INPUT BYTE.
#2 trick: GNU grep is fast because it EXECUTES VERY FEW INSTRUCTIONS FOR EACH BYTE that it *does* look at.
...
The key to making programs fast is to make them do practically nothing. ;-)
And it ends with a beautiful quote: "The key to making programs fast is to make them do practically nothing." Grep is a very simple tool that
lets you search for keywords (or regexes) in files, and because of that simplicity it can be very fast.
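As a quick refresher, this is the kind of usage we are talking about; the paths and patterns below are just illustrative examples:

# Search a single file for a keyword
$ grep "Failed password" /var/log/auth.log

# Search a whole directory of logs recursively, ignoring case
$ grep -ri "failed password" /var/log/

# Use an extended regex and show line numbers for each match
$ grep -nE "Failed password for (root|admin)" /var/log/auth.log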
But how fast is fast? We care a lot about logs at Trunc, and sometimes we deal with gigabytes, or even hundreds of gigabytes, of logs. And even though grep
is fast, sometimes it can take a while to get a response. How quickly do you think grep can parse through data?
We deal with a lot of text files (logs) at Trunc, so we decided to check how long it takes grep to parse through files of different sizes. We picked a few real log files with different data: one with 1.1G of logs, one with 4.0G and another with 17G, to see how quick (or slow) it would be:
1.1G Jun 3 23:59 ./testing/1.1G.log
$ time grep test123 ./testing/1.1G.log | wc -l
0
real 0m0.755s
user 0m0.405s
sys 0m0.350s

4.0G Jun 4 05:32 ./testing/4G.log
$ time grep test123 ./testing/4G.log | wc -l
4
real 0m3.096s
user 0m1.554s
sys 0m1.532s

17G Jun 3 23:59 ./testing/17G.log
$ time grep test123 ./testing/17G.log | wc -l
6
real 0m10.484s
user 0m5.922s
sys 0m4.553s
Based on these tests, on a high-performance server with an SSD, it took 0.75 seconds to parse 1.1G of data, a rate of about 1.4 GB/s. On the 4G log file, it took 3 seconds, about 1.3 GB/s. On the 17G log file, it took 10.4 seconds, about 1.6 GB/s. We repeated the same test on different servers and with different log files, and we always got a rate in the range of 1.2 GB/s to 1.7 GB/s.
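The rate is simply the file size divided by the real (wall clock) time reported by time; using the numbers above:

$ echo "1.1 / 0.755" | bc -l    # ~1.46 GB/s
$ echo "4.0 / 3.096" | bc -l    # ~1.29 GB/s
$ echo "17 / 10.484" | bc -l    # ~1.62 GB/s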
That gives us a good indication of how fast grep can load data off the SSD (non-SSD drives would take a lot longer), parse it and print out the results. Like Mike Haertel (the author of GNU grep) said, when you don't do much, your code runs a lot faster. Most of the time is spent reading the data off the disk; even when you add more complex searches (like -E for extended regex), the execution time barely increases.
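You can verify that yourself by timing a fixed-string search against an extended regex on the same file; the pattern below is just a placeholder:

# Fixed-string search (no regex engine involved)
$ time grep -F "test123" ./testing/4G.log | wc -l

# Extended regex on the same file; in our tests the runtime was barely different
$ time grep -E "test12[0-9]" ./testing/4G.log | wc -l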
Even though grep is fast, for large datasets it can still take a long time. If you are analyzing web server logs, for example, which can easily grow to hundreds of GB, it can take minutes to get a simple response. I remember restoring hacked servers where the investigation took forever, mostly because of the time it takes to parse through the data.
That is in fact why Trunc was born. For example, for that 17G log file that takes 10 seconds with grep, the results come back almost immediately in the Trunc terminal:
> search test123
... results ...
found 6 logs in 0.68 seconds.
In fact, even when we search through much larger datasets, the results come back right away:
> search test123
... results ...
found 180 logs in 0.89 seconds.
In this one, it searched through over 800GB in 0.89 seconds. That would have taken minutes in the terminal with grep. The main difference is that grep has to go through all the data every time, while we store the logs in a specialized database where all the keywords are indexed (think Google, but for logs), which lets us skip re-parsing all the data on every search request.
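To make the idea concrete, here is a toy version of keyword indexing built with standard Unix tools. This is only an illustration of the concept, not how Trunc actually works: the index is built once up front, and each query then becomes a binary search over a sorted file instead of a full scan of the raw logs.

# Build a crude "keyword -> line number" index once (slow, but done only one time)
$ grep -onE '[[:alnum:]]+' ./testing/17G.log | awk -F: '{print $2" "$1}' | LC_ALL=C sort -u > 17G.idx

# Answer a query with a binary search over the sorted index instead of rescanning 17G of logs
$ look "test123 " 17G.idx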
In summary, grep is a great tool: it is fast and can easily search through roughly 1-2GB of data per second on a fast disk. But if you have logs in the hundreds of GBs, you might need a specialized log storage that indexes the data to optimize your queries, instead of waiting on every search.