GSoC 2014 - 3rd month

Published on Wednesday, 02 July 2014 in Gsoc 2014, Python ; tagged with gsoc, 2014, project, owasp, owtf, security, pentest, python ; text version

The third month of GSoC 2014 is now over. Sadly I had a lot of exams these last couple of weeks (school projects, oral presentations, theoretical exams, etc.) which means that this post will be brief.

Nevertheless, I managed to enhance PTP's architecture and I think this could be interesting. I also completed the documentation of the project using Sphinx not that anyone cares.

Thought on PTP's architecture

The limits of the previous architecture

As you may have read in my previous post about my GSoC project, I found that the previous PTP's architecture would be hard to maintain as the project would grow.

Below is the previous architecture:

PTP UML Diag v1

Since the report and the parser of one tool are defined in a unique class, the more PTP would have to support different versions and report formats of a same tool, the harder it would become to have a clean and stable code.

That is why I thought it would be better for PTP to follow a new architecture in which the report and the parsing functions would be split into two distinct classes like in the following diagram:

PTP UML Diag v2

When implementing such architecture, I thought it would be even better to factorize the code dealing with the opening/closing of the report files into another layer. Therefore, each parser dealing with a XML report would inherit from a specialized XML abstract parser class. The same for JSON, HTML. etc.

PTP's new architecture

First, PTP's supports was not enhanced at all. It still supports only four different tools (and only a specific version of these tools).

But I think that with the new architecture I implemented, it will be easier for me to add new supports, which is the main mission for the next month ;)

Anyway, PTP's new architecture implements the different extra layers I presented in the previous post, plus the one in the previous section.

The complete architecture is shown below:

PTP UML Diag v3

With the AbstractReport class comes the AbstractParser one. The functions like is_mine and check_version have been moved to the parser. They still exist in the report but now act as proxy-functions for their siblings.

The AbtractParser defines a new function, handle_file, which creates a stream on the report data. I use the word stream because it might be anything, from some XML handles to list containing each line of the report.

Of course, since AbstractParser is an abstract layer and handle_file is a specialized function, it has to be overridden.

Speaking about XML, I implemented a XMLParser class that overrides the handle_file abstract function:

class XMLParser(AbstractParser):
    """Specialized parser for XML formatted report."""

    #: str -- XMLParser only supports XML files.
    __format__ = 'xml'

    def __init__(self, pathname):
        """Initialize XMLParser.

        :param str pathname: path to the report file.

        AbstractParser.__init__(self, pathname)

    def handle_file(cls, pathname):
        """Specialized file handler for XML files.

        :param str pathname: path to the report file.
        :raises ValueError: if the report file has not the right extension.
        :raises LxmlError: if Lxml cannot parse the XML file.

        if not pathname.endswith(cls.__format__):
            raise ValueError(
                "This parser only supports '%s' files" % cls.__format__)
        return etree.parse(pathname).getroot()

As you can see, in this context, a stream is a lxml tree object.

I have an exception for Skipfish because its parsing is unique.

Instead of having one report file, it generates a tree structure of directories. The simplest way I found to retrieve the ranking values is to only read the issue_samples file.

But this file is a js one and I had to write an ugly hack to read it. That is why there is no specialized parser class between the abstract and the Skipfish ones.


Useless Info class

Apart from the modification of the architecture, I tried to have a honest look on the current implementation of PTP. I was trying to answer question like Is this pertinent?, Is it mandatory to have this?, Could it be simpler?, etc.

I read a couple of articles about python, what to do, what to avoid, and I stumbled on a conference about the classes and how often we could avoid them.

Then I took a look at my Info class and realized that it was exactly a case shown in the video.

Jack Diederich explains that instead of writing a class that inherits from a standard python type (dict in my case) because maybe it might offer something more later, I should use the standard type instead and implement that class when I will need it.

Therefore I removed the Info class, which was in fact just a dictionary, and replaced each occurrence by a simple dict.

Use iterators whenever possible

Also, I realized that I was using copy-based functions too often.

For instance, in python 2.x, I used the dict.values when looping over the dictionary values. But before python 3.x, this function creates a copy of the values before iterating over them, which can become memory inefficient when dealing with big dictionaries.

Therefore, I replaced each function of this kind with its iterator version (e.g. values to itervalues).

But since python 3.x, this has been modified and functions like itervalues have been moved to values (i.e. the default values function's behavior is to generate iterators and not to copy the values anymore).

So, in order to keep the compatibility between 2.x and 3.x and force the use of iterators, I modified PTP's code to wisely decide which one to use.

This wisdom is given by the following snippet for instance:

def check_version(cls, metadata, key='version'):
    """Checks the version from the metadata against the supported ones.

    :param dct metadata: The metadata in which to find the version.
    :param str key: The :attr:`metadata` key containing the version value.
    :return: `True` if it support that version, `False` otherwise.
    :rtype: :class:`bool`

        parsers = cls.__parsers__.itervalues()
    except AttributeError:  # Python3 then.
        parsers = cls.__parsers__.values()
    if metadata[key] in parsers:
        return True
    return False

Documentation using sphinx

The last point I worked on during this month is creating the documentation for PTP.

A couple of weeks earlier, I discussed the fact that OWTF should use a tool that automatically generates the documentation from the docstrings. We were hesitating between doxygen and sphinx.

Since I already used doxygen in other projects, I thought I could use sphinx for PTP as a beta test, before deciding which one to use in OWTF.

Therefore I spent a day or two writing the full technical documentation for PTP, as you may have guessed based on the snippets I presented in this post.

I was glad to see that the configuration of sphinx for a project is pretty easy and the syntax is really simple but so is doxygen. For now, I think both are equals and this beta test did not give me enough elements to decide on which one to use for OWTF.

Another point I would like to briefly discussed is that I really want to have something clean for PTP. It is for now the cleanest project I have created in my opinion and I want to keep it as clean as possible.

That is why, after a friend of mine linked me the repository of SyncThing, which is one of the cleanest repo I have ever see, I decided to follow the same release system as it uses. The release I have pushed and every next ones will therefore follow the Semantic Versioning Guidelines


I cannot really conclude on anything after this month. With my exams I did not really have a lot of free times to fully work on my GSoC project.

Nevertheless I managed to updated PTP's architecture like I aimed to do in my previous post. I feel confident about the robustness of the new one and I think that it will ease any further work.

Also I simplified the code base and ensured the inter-compatibility between python 2.x and 3.x versions and I was able to configure sphinx and write the technical documentation of the project.

I hope I will have more to say for the next monthly GSoC post :) License WTFPL2