The context
One of my missions at Société Générale is to collect information about websites in order to pentest them.
The Société Générale group counts more than 150,000 employees.
It has a really complex organization, like every big company I guess.
Therefore it becomes really hard to get an overall view of all its servers, since they are spread across all the group's branches and their sectors.
That is to say that I have to deal with hundreds, even thousands, of domain names and IP addresses, and gathering information about them takes a lot of time.
What I mean by collecting information about a website is identifying several key pieces of information, like:
- The IP address
- The domain name
- The service running on the machine (web, ssh, dns, etc.)
- Whether there is any load balancing
- The sector it is affiliated with
- etc.
In this post, I assume that I only have IP addresses.
During some tests to get the hostname of an IP address, I had to use gethostbyaddr.
If you have ever tried to use gethostbyaddr, you must have seen that it can take a long time to answer.
The problem with this function is that it takes a long time before giving up on the domain name resolution.
Among the thousands of IP addresses, a bunch of them come from ranges reserved by the SG group.
Not all of them point to a running machine, so a lot of them don't have a host name.
While you can wait 5 to 10 seconds for 1 or 2 addresses, it is not viable to wait hours and hours for thousands of them.
What is gethostbyaddr?
The function gethostbyaddr retrieves the primary host name for a given IP address.
It also returns the aliases of the host, if any.
Using the gethostbyaddr function is really straightforward.
It expects a single parameter, a string containing an IP address, and it returns a tuple which holds the host name, the aliases, and a list of IP addresses for the same interface on the same host.
For instance, trying gethostbyaddr on 8.8.8.8 (a DNS of Google) returns the following result:
```python
>>> import socket
>>> socket.gethostbyaddr('8.8.8.8')
('google-public-dns-a.google.com', [], ['8.8.8.8'])
```
It can be really useful but, as I just told you above, it times out too slowly :/
The last example only took an instant to run.
Now let's see how long it takes when it cannot retrieve the host name:
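For illustration, here is a minimal timing harness (the timed_lookup helper name is mine, not part of the original script):

```python
import socket
import time

def timed_lookup(ip):
    """Reverse-resolve ip and report how long the call took."""
    start = time.time()
    try:
        host = socket.gethostbyaddr(ip)[0]
        outcome = 'Host found: %s' % host
    except (socket.herror, socket.gaierror):
        outcome = 'Host not found for %s' % ip
    elapsed = time.time() - start
    print('%s (elapsed time: %.2f seconds)' % (outcome, elapsed))
    return outcome, elapsed
```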
```
Host not found for 192.168.0.0 (elapsed time: 10.01 seconds)
```
Do you believe it?! 10 seconds for a single address!
Let's say we have a thousand IP addresses and half of them don't have a host name:

500 * 10 seconds = 5000 seconds = 83.3 minutes = 1 hour 23 minutes
Could you imagine how long it can take to find the host names of several
thousands of IP addresses?
This can happen when you have to check several ranges of addresses which belong
to a company.
As you know, time is money.
I'm not sure my supervisor would like the idea of me waiting a day for a handful of host names...
First attempt to speed up the results
I checked the Python docs for a way of resetting the timeout value of gethostbyaddr.
Well, there is none :/
The function that comes closest to what I am aiming for is setdefaulttimeout:
```python
# Set the socket default timeout value to 2 seconds
socket.setdefaulttimeout(2)
```
Checking the timeout value of socket with getdefaulttimeout, I do get 2 seconds, but gethostbyaddr still ends after 10 seconds. Argh...
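A quick sketch of that check: the default timeout really is set, it just doesn't apply to gethostbyaddr:

```python
import socket

# Set the module-wide default timeout to 2 seconds
socket.setdefaulttimeout(2)

# The value sticks for regular socket operations...
print(socket.getdefaulttimeout())  # 2.0

# ...but gethostbyaddr relies on the resolver's own timing, not this value
```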
So gethostbyaddr doesn't really have a 10-second timeout; it might just take 10 seconds to realize that it can't find the host name.
I then thought I should find a way of keeping track of the execution time of the function and stopping the process when it exceeds a pre-defined time value.
The only plausible way to do that was to use a thread.
I mean, how hard can it be to start a thread, wait a certain amount of time and then terminate it?
Timeout on a thread
First of all, let us take a moment to describe the threading module and how to use it in a common way.
The threading module is built on the low-level thread one.
It gives us handier means to write threaded programs in Python.
While you can give a function as a parameter to the Thread class, I rather choose to define a class which inherits from threading.Thread.
I find the code better looking, even if it is not more straightforward.
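For comparison, the function-based style (passing a target function to Thread) would look something like this; the work function and worker-1 name are just illustrations:

```python
import threading

results = []

def work(name):
    # This function runs in its own thread
    results.append('hello from %s' % name)

t = threading.Thread(target=work, args=('worker-1',))
t.start()
t.join()
print(results)  # ['hello from worker-1']
```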
```python
import threading
import logging

logging.basicConfig(
    level=logging.DEBUG,
    format='[%(levelname)s] (%(threadName)-10s) %(message)s',
)

class MyThread(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        logging.debug('Starting')
        logging.debug('Doing my stuff')
        logging.debug('Exiting')

if __name__ == '__main__':
    my_thread = MyThread()
    my_thread.start()
    my_thread.join()
    logging.debug('Exiting')
```
The logging module is thread-safe and allows us to print debug information without garbling the display.
With the code above, we have the following result:
```
[DEBUG] (Thread-1  ) Starting
[DEBUG] (Thread-1  ) Doing my stuff
[DEBUG] (Thread-1  ) Exiting
[DEBUG] (MainThread) Exiting
```
So let's take a closer look:
- In order to define our thread, we make our class inherit from threading.Thread.
- Then, we have to define the run method, which holds our code.
- We finally run the thread using the start method and join it to make the main thread wait for it.
If we use join's timeout parameter, we can manage to have the main thread finish before the worker does.
It leads us to the following code:
```python
# [. . .]

class MyThread(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        logging.debug('Starting')
        logging.debug('Waiting for 5 seconds')
        time.sleep(5)
        logging.debug('Exiting')

if __name__ == '__main__':
    my_thread = MyThread()
    my_thread.start()
    my_thread.join(2)  # Times out after 2 seconds
    logging.debug('Exiting')
```
And gives us the following output:
```
[DEBUG] (Thread-1  ) Starting
[DEBUG] (Thread-1  ) Waiting for 5 seconds
[DEBUG] (MainThread) Exiting
[DEBUG] (Thread-1  ) Exiting
```
Our workaround works fine.
The main thread continues and doesn't wait more than 2 seconds for the worker.
But there is a problem, and I am convinced that you see it too.
Well, for one thread it is not a real problem, but can you extrapolate this situation to hundreds of them?
Right, the threads are still running! And they still consume resources!
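You can check that an abandoned thread really does linger with a small sketch like this one (mine, not from the original script):

```python
import threading
import time

def slow():
    time.sleep(2)  # stands in for a slow gethostbyaddr call

t = threading.Thread(target=slow)
t.start()
t.join(0.1)  # give up waiting after 0.1 second

# The thread we stopped waiting for is still alive and still counted
print(t.is_alive())                   # True
print(threading.active_count() >= 2)  # True: the main thread plus our worker
```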
On the Internet, I didn't find a nice way to kill a thread after a certain
amount of time (here or
here).
And there is a good reason for that: you should never kill a thread!
Why?
Well, first of all, Python doesn't provide a way to terminate a thread.
You can find that in the pthread framework, but even they advise us not to do it.
Second, killing a thread might cause problems when it holds Python's GIL. In fact, you can end up in a deadlock situation.
You may find a quick answer here, or some good explanations in this post by Eli Bendersky.
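As an aside, a standard compromise is to mark such threads as daemons: you still cannot kill them, but they no longer block interpreter exit. A quick sketch of mine, under the assumption that losing in-flight lookups at shutdown is acceptable:

```python
import threading
import time

def slow():
    time.sleep(60)  # a lookup that never seems to return

t = threading.Thread(target=slow)
t.daemon = True  # daemon threads are abandoned when the main thread exits
t.start()
t.join(0.1)      # wait briefly, then move on

print(t.is_alive())  # True: still running, but it won't block interpreter exit
```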
Now we are at the point where we can neither decrease the execution time of gethostbyaddr nor work around the problem by using a thread with a given timeout value.
Therefore, I will now focus on using multiple threads, where each of them runs gethostbyaddr on a single IP.
This speeds up the whole process, even though it does not speed up each individual lookup.
Multiple threads for multiple lookups
The main idea here is to run each lookup operation in its own thread.
This can be easily done with the threading Python module.
I will use a dictionary for holding the result, since it is thread-safe in Python, as pointed out in this post.
Note: The dictionary is thread-safe in our context since we are dealing with a set of IP addresses.
Therefore there will be no situation where two concurrent threads modify the value of the same key.
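The per-key pattern described above can be sketched like this (the store function is just an illustration):

```python
import threading

result = {}

def store(key):
    # Each thread writes to its own key, so two threads never
    # update the same entry concurrently
    result[key] = key.upper()

threads = [threading.Thread(target=store, args=(k,)) for k in ('a', 'b', 'c')]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(result == {'a': 'A', 'b': 'B', 'c': 'C'})  # True
```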
Let us see the new version, where gethostbyaddr is wrapped in a threaded class.
```python
import sys
import csv
import time
import socket
import logging
import threading

logging.basicConfig(
    level=logging.DEBUG,
    format='[%(levelname)s] (%(threadName)-10s) %(message)s',
)

def read_csv_ip(filename='./ip.txt'):
    """Read a CSV file containing the IP addresses.

    There might be one or more IP addresses per line (';' separated).
    Return the set of them.
    """
    ip_addresses = set()
    with open(filename, 'rb') as csv_file:
        data = csv.reader(csv_file, delimiter=';', quotechar='"')
        for line in data:
            ip_addresses.update([ip for ip in line if ip])
    return ip_addresses

def write_csv_host(data, filename='./result.txt'):
    """Write the results of gethostbyaddr on the IP addresses into a file.

    The results are written like: 'IP;Host;Aliases'.
    """
    with open(filename, 'wb') as csv_file:
        output = csv.writer(csv_file, delimiter=';', quotechar='"')
        for ip, lookup in data.items():
            output.writerow([ip, lookup['host'], lookup['aliases']])

class LookupThread(threading.Thread):
    def __init__(self, ip, result):
        self.ip = ip
        self.result = result
        threading.Thread.__init__(self)

    def run(self):
        logging.debug('Starting')
        self.lookup(self.ip)
        logging.debug('Exiting')

    def lookup(self, ip):
        """Try to find the host of IP.

        Returns a dict: {ip: {'host': host, 'aliases': aliases}}
        If host is not found, then the dict will hold:
        {ip: {'host': 'No host found', 'aliases': ''}}
        """
        try:
            host, aliases, _ = socket.gethostbyaddr(ip)
            self.result[ip] = {
                'host': host,
                'aliases': aliases if aliases else ''
            }
        except socket.herror:
            self.result[ip] = {'host': 'No host found', 'aliases': ''}

if __name__ == '__main__':
    ip_addresses = read_csv_ip()
    start = time.time()
    result = {}
    lookup_threads = [LookupThread(ip, result) for ip in ip_addresses]
    # Start the threads
    for t in lookup_threads:
        t.start()
    # Tell main to wait for all of them
    main_thread = threading.currentThread()
    for thread in threading.enumerate():
        if thread is main_thread:
            continue
        logging.debug('Joining %s', thread.getName())
        thread.join()
    elapsed = time.time() - start
    print '(elapsed time: %.2f seconds)' % elapsed
    write_csv_host(result)
```
Here, for each IP address in our list, we instantiate our homemade thread class, which holds the lookup function.
Then, we start each thread and join them all in order to wait for all the results.
Finally, we save the result into a CSV formatted file.
Solution benchmarking
In order to give you an idea of the performance difference between the sequential code and the threaded one, I tested both implementations on the following IP addresses:
- 8.8.8.8
- 8.8.4.4
- 192.175.150.159
- 192.175.150.158
- 192.175.150.157
- 192.175.150.156
- 192.175.150.155
- 192.175.150.154
- 192.175.150.153
- 192.175.150.152
- 192.175.150.151
- 192.175.150.150
- 192.175.150.149
- 192.175.150.148
With the sequential method, we reach ~120 seconds.
Meanwhile, the threaded one only takes ~10.1 seconds to complete the whole process :)
Let's say that we succeeded in our noble quest!
Extra
This post didn't aim to explain in depth how Python's threads work, nor how to use the threading module.
I can only advise you to check out this post about the threading module. It covers the threading 101 and it is well written.
If you want to know more about that GIL stuff and how Python manages its threads, you really should watch this video from David Beazley, which presents the whole thing.
It is really mind-blowing...
Last thing: if you want to limit the number of concurrent threads, which you should do, you might want to check out that post, which explains how to use threading's semaphores and more, like the lock system and co.
Using a basic pool for limiting the number of concurrent threads might end up like this:
```python
class LookupThread(threading.Thread):
    def __init__(self, ip, result, pool):
        self.ip = ip
        self.result = result
        self.pool = pool
        threading.Thread.__init__(self)

    def run(self):
        self.pool.acquire()
        try:
            logging.debug('Starting')
            self.lookup(self.ip)
        finally:
            self.pool.release()
        logging.debug('Exiting')

    def lookup(self, ip):
        """Try to find the host of IP.

        Returns a dict: {ip: {'host': host, 'aliases': aliases}}
        If host is not found, then the dict will hold:
        {ip: {'host': 'No host found', 'aliases': ''}}
        """
        try:
            host, aliases, _ = socket.gethostbyaddr(ip)
            self.result[ip] = {
                'host': host,
                'aliases': aliases if aliases else ''
            }
        except socket.herror:
            self.result[ip] = {'host': 'No host found', 'aliases': ''}

if __name__ == '__main__':
    ip_addresses = read_csv_ip()
    start = time.time()
    result = {}
    # Limit the number of concurrent threads to 8
    pool = threading.BoundedSemaphore(8)
    lookup_threads = [LookupThread(ip, result, pool) for ip in ip_addresses]
    # Start the threads
    for t in lookup_threads:
        t.start()
    # Tell main to wait for all of them
    main_thread = threading.currentThread()
    for thread in threading.enumerate():
        if thread is main_thread:
            continue
        logging.debug('Joining %s', thread.getName())
        thread.join()
    elapsed = time.time() - start
    print '(elapsed time: %.2f seconds)' % elapsed
    write_csv_host(result)
```
Conclusion
All along this post, we have seen that there is no real way to speed up the individual operation that is gethostbyaddr.
But we have seen how to use Python threads to increase the performance of parallel lookups, using the threading module.
Plus, we have learned more about the gears of Python threads and why there is no viable way to kill a specific thread.
I might soon post another paper which applies this technique to gethostbyname_ex, since I used that function in a script which attempts to gather information on domains this time (and not IP addresses).
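For reference, gethostbyname_ex is the forward counterpart: it takes a host name and returns the same kind of (hostname, aliases, ip_addresses) tuple. A quick sketch, using localhost so it resolves without external DNS:

```python
import socket

# Forward lookup: name -> (canonical name, alias list, IP address list)
name, aliases, addrs = socket.gethostbyname_ex('localhost')
print(name, addrs)
```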