Attempts to speed up gethostbyaddr

Published on Wednesday, 02 October 2013 in Python, Trick; tagged with trick, python, gethostbyaddr, threading, internship, timeout, semaphore, gil, gethostbyname_ex

The context

One of my key tasks at Société Générale is to collect information about websites in order to pentest them.

The Société Générale group has more than 150,000 employees.
It has a really complex organization, like every big company I guess.

Therefore it becomes really hard to get an overall view of all its servers, since they are spread across all the group's branches and their sectors.
That is to say, I have to deal with hundreds or even thousands of domain names and IP addresses, and gathering information about them takes a lot of time.

By collecting information about a website, we mean identifying several key pieces of information like:

In this post, I assume that I only have IP addresses.

While testing how to get the host name of an IP address, I had to use gethostbyaddr.
If you have ever tried gethostbyaddr, you must have seen that it can take a long time to answer.
The problem with this function is that it takes a long time before giving up on the name resolution.

Among the thousands of IP addresses, a bunch of them come from ranges reserved by the SG group.
Not all of them point to a running machine, therefore a lot of them don't have a host name.

While you can wait 5 to 10 seconds for 1 or 2 addresses, it is not viable to wait hours and hours for thousands of them.

What is gethostbyaddr?

The function gethostbyaddr retrieves the primary host name for a given IP address.
It also returns the alternative host names (aliases), if any.

Using gethostbyaddr is really straightforward.
It expects a single parameter, a string containing an IP address.
It returns a tuple which holds the host name, the list of aliases and a list of IP addresses for the same interface on the same host.

For instance, calling gethostbyaddr on 8.8.8.8 (one of Google's public DNS servers) returns the following result:

>>> import socket
>>> socket.gethostbyaddr('8.8.8.8')
('google-public-dns-a.google.com', [], ['8.8.8.8'])

It can be really useful but, as I said above, it times out too slowly :/
The last example only took an instant to run.
Now let's see how long it takes when it cannot retrieve the host name:

Host not found for 192.168.0.0
(elapsed time: 10.01 seconds)
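
The script behind this measurement is not reproduced here. A minimal sketch of how such a timing could be taken follows (written for Python 3). The helper name timed_lookup and its resolver parameter are hypothetical: the parameter defaults to socket.gethostbyaddr and only exists so that the function can be tried without touching the network.

```python
import socket
import time


def timed_lookup(ip, resolver=socket.gethostbyaddr):
    """Return (host, elapsed_seconds); host is None when no host is found.

    `resolver` is a hypothetical hook added for this sketch. It must behave
    like socket.gethostbyaddr: return (host, aliases, addresses) or raise
    socket.herror.
    """
    start = time.time()
    try:
        host, _aliases, _addresses = resolver(ip)
    except socket.herror:
        # No host name could be found for this address
        host = None
    return host, time.time() - start


def format_report(ip, host, elapsed):
    """Format the outcome in the same style as the output shown above."""
    first_line = host if host is not None else 'Host not found for %s' % ip
    return '%s\n(elapsed time: %.2f seconds)' % (first_line, elapsed)
```

Calling timed_lookup('192.168.0.0') with the default resolver and passing the result to format_report would then print a report of the same shape as the one above. The elapsed time itself depends on the resolver configuration of the machine.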

Can you believe it?! 10 seconds for a single address!
Let's say we have a thousand IP addresses and half of them don't have a host name:
500 * 10 seconds = 5000 seconds = 83.3 minutes = 1 hour 23 minutes

Can you imagine how long it would take to find the host names of several thousand IP addresses?
This can happen when you have to check several ranges of addresses which belong to a company.
As you know, time is money.
I'm not sure my supervisor would like the idea of me waiting a day for a bunch of host names...

First attempt to speed up the results

I checked the Python documentation to find a way of changing the timeout value of gethostbyaddr.
Well, there is none :/
The closest function to what I am aiming for is setdefaulttimeout:

# Set the socket default timeout value to 2 seconds
socket.setdefaulttimeout(2)

Checking the timeout value with getdefaulttimeout, I do get 2 seconds, but gethostbyaddr still ends after 10 seconds. Argh...
This default only applies to socket objects created afterwards, whereas the lookup is handed to the system resolver.
So gethostbyaddr doesn't really have a 10-second timeout; it might just take 10 seconds to realize that it can't find the host name.
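
To see what setdefaulttimeout does change, here is a small sketch (written for Python 3) that needs no network access. It only shows that the default value is picked up by socket objects created afterwards; it says nothing about how long a lookup takes. The variable names are mine.

```python
import socket

# Remember the current default (usually None, meaning "no timeout")
previous = socket.getdefaulttimeout()

# Set the socket default timeout value to 2 seconds
socket.setdefaulttimeout(2)
default_timeout = socket.getdefaulttimeout()

# A socket object created after the call inherits the default timeout
new_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
socket_timeout = new_socket.gettimeout()
new_socket.close()

# Restore the previous default so the rest of the program is unaffected
socket.setdefaulttimeout(previous)

print(default_timeout)  # 2.0
print(socket_timeout)   # 2.0
```

Since gethostbyaddr is a module-level function that does not go through such a socket object, this is consistent with the 10 seconds observed above.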

I then thought that I should find a way of keeping track of the execution time of the function and stopping it when it exceeds a predefined value.
The only plausible way to do that was to use a thread.

I mean, how hard can it be to start a thread, wait a certain amount of time and then terminate it?

Timeout on a thread

First of all, let us take a moment to describe the threading module and how it is commonly used.
The threading module is built on top of the low-level thread module.
It gives us handier tools for writing threaded programs in Python.

Although you can pass a function to the Thread class as a parameter, I prefer to define a class which inherits from threading.Thread.

I find the code better looking, even if it is not more straightforward.

import threading
import logging


logging.basicConfig(
    level=logging.DEBUG,
    format='[%(levelname)s] (%(threadName)-10s) %(message)s',
)


class MyThread(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        logging.debug('Starting')
        logging.debug('Doing my stuff')
        logging.debug('Exiting')


if __name__ == '__main__':
    my_thread = MyThread()
    my_thread.start()
    my_thread.join()
    logging.debug('Exiting')

The logging module is thread-safe and allows us to print debug information without having problems with the display.

With the code above, we have the following result:

[DEBUG] (Thread-1  ) Starting
[DEBUG] (Thread-1  ) Doing my stuff
[DEBUG] (Thread-1  ) Exiting
[DEBUG] (MainThread) Exiting

So let's take a closer look:

  1. In order to define our thread, we make our class inherit from threading.Thread.
  2. Then, we have to define the run method, which will hold our code.
  3. Finally, we run the thread using the start method and join it to make the main thread wait for it.
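
For comparison, the other form (giving a function to the Thread class instead of inheriting from it) could look like the sketch below, written for Python 3. The function name do_my_stuff and the results list are made up for the example.

```python
import logging
import threading


logging.basicConfig(
    level=logging.DEBUG,
    format='[%(levelname)s] (%(threadName)-10s) %(message)s',
)


def do_my_stuff(results):
    """Code executed inside the thread (hypothetical example function)."""
    logging.debug('Starting')
    results.append('done')
    logging.debug('Exiting')


results = []
# The function and its arguments are handed to the Thread constructor
my_thread = threading.Thread(target=do_my_stuff, args=(results,))
my_thread.start()
my_thread.join()
logging.debug('Exiting')
```

Both forms behave the same way here. The subclass form mostly pays off when the thread has to carry its own state, as the lookup threads do later in this post.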

If we use join's parameter to set a timeout value, we can have the main thread finish before the other thread does.
This leads us to the following code:

# [. . .] (same imports and logging configuration as before)
import time


class MyThread(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        logging.debug('Starting')
        logging.debug('Waiting for 5 seconds')
        time.sleep(5)
        logging.debug('Exiting')


if __name__ == '__main__':
    my_thread = MyThread()
    my_thread.start()
    my_thread.join(2)  # Times out after 2 seconds
    logging.debug('Exiting')

And it gives us the following output:

[DEBUG] (Thread-1  ) Starting
[DEBUG] (Thread-1  ) Waiting for 5 seconds
[DEBUG] (MainThread) Exiting
[DEBUG] (Thread-1  ) Exiting

Our workaround works fine.
The main thread continues and doesn't wait more than 2 seconds for the thread.

But there is a problem, and I am convinced that you see it too.
Well, for one thread, it is not a real problem, but can you extrapolate this situation to hundreds of them?
Right, the threads are still running! And they still consume resources!
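
This is easy to check with is_alive. The sketch below (Python 3, with shorter delays than in the example above, and a class name of my own) shows that a join with a timeout gives control back to the main thread without stopping the other one.

```python
import threading
import time


class SleepingThread(threading.Thread):
    def run(self):
        # Stands in for a slow operation such as a failing lookup
        time.sleep(1)


my_thread = SleepingThread()
my_thread.start()

my_thread.join(0.2)  # Give up waiting after 0.2 seconds
still_running = my_thread.is_alive()
print(still_running)  # True: the thread keeps running in the background

my_thread.join()  # Without a timeout, join waits until the thread ends
finished = not my_thread.is_alive()
```

In other words, the timeout of join is only a limit on how long the caller waits; it is not a limit on how long the thread runs.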

On the Internet, I didn't find a nice way to kill a thread after a certain amount of time (here or here).
And there is a good reason for that: you should never kill a thread!

Why?
Well, first of all, Python doesn't provide a way to terminate a thread.
You can find one in the pthread library, but even there we are advised not to use it.
Second, killing a thread might cause problems when it holds Python's GIL. In fact, you can end up in a deadlock situation.

You may find a quick answer here or some good explanations in this post by Eli Bendersky.

Now we are at the point where we can neither decrease the execution time of gethostbyaddr nor work around the problem by using a thread with a given timeout value.

Therefore, I now focus on the basic use of multiple threads, where each of them calls gethostbyaddr on a single IP.
This speeds up the whole process, though it does not speed up the individual lookup.

Multiple threads for multiple lookups

The main idea here is to give each lookup operation its own thread.
This can easily be done with Python's threading module.
I will use a dictionary to hold the results since it is thread-safe in Python, as pointed out in this post.

Note: The dictionary is thread-safe in our context since we are dealing with a set of IP addresses.
Therefore there will be no situation where two concurrent threads modify the value of the same key.
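
A minimal sketch of that pattern (Python 3): every thread stores exactly one entry under a key that no other thread uses. The store function is hypothetical and the addresses are placeholders taken from a documentation range; no lookup is performed.

```python
import threading


def store(result, key):
    """Write a single entry under a key owned by this thread only."""
    result[key] = 'looked up %s' % key


result = {}
# Placeholder addresses; being distinct is what matters here
keys = ['192.0.2.%d' % i for i in range(1, 51)]

threads = [threading.Thread(target=store, args=(result, key)) for key in keys]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(result))  # 50
```

If several threads had to update the same key (a counter, for instance), a threading.Lock around the update would be needed.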

Let us see the new version, where gethostbyaddr is wrapped in a Thread subclass.

import sys
import csv
import time
import socket
import logging
import threading


logging.basicConfig(
    level=logging.DEBUG,
    format='[%(levelname)s] (%(threadName)-10s) %(message)s',
)


def read_csv_ip(filename='./ip.txt'):
    """Read a CSV file containing the IP addresses.

    There might be one or more IP addresses per line (';' separated)
    Return the set of them.

    """

    ip_addresses = set()
    with open(filename, 'rb') as csv_file:
        data = csv.reader(csv_file, delimiter=';', quotechar='"')
        for line in data:
            ip_addresses.update([ip for ip in line if ip])
    return ip_addresses


def write_csv_host(data, filename='./result.txt'):
    """Write the results of gethostbyaddr on the IP addresses into a file.

    The results are written like: 'IP;Host;Aliases'.

    """

    with open(filename, 'wb') as csv_file:
        output = csv.writer(csv_file, delimiter=';', quotechar='"')
        for ip, lookup in data.items():
            output.writerow([ip, lookup['host'], lookup['aliases']])


class LookupThread(threading.Thread):
    def __init__(self, ip, result):
        self.ip = ip
        self.result = result
        threading.Thread.__init__(self)

    def run(self):
        logging.debug('Starting')
        self.lookup(self.ip)
        logging.debug('Exiting')

    def lookup(self, ip):
        """Try to find the host of IP.

        Returns a dict:
            {ip: {'host': host, 'aliases': aliases}}
        If host is not found, then the dict will hold:
            {ip: {'host': 'No host found', 'aliases': ''}}

        """

        try:
            host, aliases, _ = socket.gethostbyaddr(ip)
            self.result[ip] = {
                'host': host,
                'aliases': aliases if aliases else ''
            }
        except socket.herror:
            self.result[ip] = {'host': 'No host found', 'aliases': ''}


if __name__ == '__main__':
    ip_addresses = read_csv_ip()

    start = time.time()
    result = {}

    lookup_threads = [LookupThread(ip, result) for ip in ip_addresses]
    # Start the threads
    for t in lookup_threads:
        t.start()

    # Tell main to wait for all of them
    main_thread = threading.currentThread()
    for thread in threading.enumerate():
        if thread is main_thread:
            continue
        logging.debug('Joining %s', thread.getName())
        thread.join()

    elapsed = time.time() - start
    print '(elapsed time: %.2f seconds)' % elapsed

    write_csv_host(result)

Here, for each IP address in our list, we instantiate our homemade thread class, which holds the lookup function.
Then, we start each thread and join them all in order to wait for every result.
Finally, we save the results into a CSV-formatted file.

Solution benchmarking

In order to give you an idea of the performance difference between the sequential code and the threaded one, I tested both implementations on the following IP addresses:

With the sequential method, we reach ~120 seconds.
Meanwhile, the threaded one only takes ~10.1 seconds to complete the process :)
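
These figures make sense: the sequential run costs roughly the sum of all the waiting times, while the threaded run costs roughly the longest single wait. The sketch below (Python 3) reproduces that shape with a simulated lookup. It is not the benchmark above: slow_lookup is a stand-in that just sleeps, the delay is arbitrary, and the addresses are placeholders.

```python
import threading
import time


def slow_lookup(ip, result, delay=0.2):
    """Stand-in for a lookup that takes `delay` seconds to give up."""
    time.sleep(delay)
    result[ip] = 'No host found'


def run_sequential(ips):
    result = {}
    start = time.time()
    for ip in ips:
        slow_lookup(ip, result)
    return result, time.time() - start


def run_threaded(ips):
    result = {}
    start = time.time()
    threads = [threading.Thread(target=slow_lookup, args=(ip, result))
               for ip in ips]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result, time.time() - start


ips = ['192.0.2.%d' % i for i in range(1, 6)]  # placeholder addresses
seq_result, seq_elapsed = run_sequential(ips)
thr_result, thr_elapsed = run_threaded(ips)
print('sequential: %.2f s, threaded: %.2f s' % (seq_elapsed, thr_elapsed))
```

Sleeping threads release the GIL, just like threads blocked on a name resolution, which is why the waits overlap.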

Let's say that we succeeded in our noble quest!

Extra

This post didn't aim to explain in depth how Python's threads work, nor how to use the threading module.
I can only advise you to check out this post about the threading module. It covers the basics of threading and it is well written.

If you want to know more about the GIL and how Python manages its threads, you really should watch this video from David Beazley, which covers the whole topic.
It is really mind-blowing...

One last thing: if you want to limit the number of concurrent threads, which you should do, you might want to check out that post, which explains how to use threading's semaphores and more, such as locks.

Using a basic pool to limit the number of concurrent threads might look like this:

class LookupThread(threading.Thread):
    def __init__(self, ip, result, pool):
        self.ip = ip
        self.result = result
        self.pool = pool
        threading.Thread.__init__(self)

    def run(self):
        self.pool.acquire()
        try:
            logging.debug('Starting')
            self.lookup(self.ip)
        finally:
            self.pool.release()
            logging.debug('Exiting')

    def lookup(self, ip):
        """Try to find the host of IP.

        Returns a dict:
            {ip: {'host': host, 'aliases': aliases}}
        If host is not found, then the dict will hold:
            {ip: {'host': 'No host found', 'aliases': ''}}

        """

        try:
            host, aliases, _ = socket.gethostbyaddr(ip)
            self.result[ip] = {
                'host': host,
                'aliases': aliases if aliases else ''
            }
        except socket.herror:
            self.result[ip] = {'host': 'No host found', 'aliases': ''}


if __name__ == '__main__':
    ip_addresses = read_csv_ip()

    start = time.time()
    result = {}

    # Limit the number of concurrent threads to 8
    pool = threading.BoundedSemaphore(8)

    lookup_threads = [LookupThread(ip, result, pool) for ip in ip_addresses]
    # Start the threads
    for t in lookup_threads:
        t.start()

    # Tell main to wait for all of them
    main_thread = threading.currentThread()
    for thread in threading.enumerate():
        if thread is main_thread:
            continue
        logging.debug('Joining %s', thread.getName())
        thread.join()

    elapsed = time.time() - start

    print '(elapsed time: %.2f seconds)' % elapsed

    write_csv_host(result)
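
Note that the semaphore version still creates one thread per IP address; it only limits how many of them perform a lookup at the same time. As an alternative, and not the approach used in this post, the standard library's concurrent.futures.ThreadPoolExecutor (Python 3) keeps the number of threads itself bounded. A minimal sketch, where lookup_all and the resolver parameter are hypothetical names (the parameter defaults to socket.gethostbyaddr and is only there to allow trying the code offline):

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def lookup(ip, resolver=socket.gethostbyaddr):
    """Return (ip, {'host': ..., 'aliases': ...}), as in LookupThread.lookup."""
    try:
        host, aliases, _ = resolver(ip)
        return ip, {'host': host, 'aliases': aliases if aliases else ''}
    except socket.herror:
        return ip, {'host': 'No host found', 'aliases': ''}


def lookup_all(ip_addresses, max_workers=8, resolver=socket.gethostbyaddr):
    """Resolve every address using at most `max_workers` threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map() hands each address to a free worker thread
        pairs = executor.map(lambda ip: lookup(ip, resolver), ip_addresses)
        # Consume the results before the pool is shut down
        return dict(pairs)
```

The returned dictionary has the same layout as the result dictionary used above, so result = lookup_all(read_csv_ip()) could be passed to write_csv_host unchanged.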

Conclusion

Throughout this post, we have seen that there is no real way to speed up the individual gethostbyaddr operation.
But we have seen how to use Python threads to improve performance by running lookups in parallel, using the threading module.

Plus, we have learned more about the inner workings of Python threads and why there is no viable way to kill a specific thread.

I might soon publish another post which applies this technique to gethostbyname_ex, since I used this function in a script which attempts to gather information on domains this time (and not IP addresses).


contactdepier.re License WTFPL2