title: GSoC 2014 - 3rd month
author: depierre
published: 2014-07-02
categories: Gsoc 2014, Python
keywords: gsoc, 2014, project, owasp, owtf, security, pentest, python


The third month of GSoC 2014 is now over. Sadly I had a lot of exams these last
couple of weeks (school projects, oral presentations, theoretical exams, etc.)
which means that this post will be brief.

Nevertheless, I managed to improve [PTP](https://github.com/owtf/ptp)'s
architecture, which I think is worth describing. I also completed the
project's documentation using [Sphinx](http://sphinx-doc.org/), not that
anyone cares.

<!---summary-->

# Thoughts on PTP's architecture

## The limits of the previous architecture

As you may have read in my previous post about my GSoC project, I found that
PTP's previous architecture would become hard to maintain as the project
grew.

Below is the previous architecture:

![PTP UML Diag v1](/static/images/gsoc2k14/ptp_uml_v1.png)

Since the report and the parser of a tool are defined in a single class, the
more versions and report formats of the same tool PTP has to support, the
harder it becomes to keep the code clean and stable.

That is why I thought it would be better for PTP to follow a new architecture
in which the report and the parsing functions would be split into two distinct
classes like in the following diagram:

![PTP UML Diag v2](/static/images/gsoc2k14/ptp_uml_v2.png)

While implementing this architecture, I thought it would be even better to
factor the code that opens and closes the report files out into another
layer. Therefore, each parser dealing with an XML report inherits from a
specialized abstract XML parser class. The same goes for JSON, HTML, etc.
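To make the layering more concrete, here is a rough sketch of the three levels for a JSON report. It is only an illustration under assumptions, not PTP's actual code: `SomeToolJSONParser`, its `parse` method and the report layout are hypothetical, and the bodies of `AbstractParser` and `JSONParser` are my guess at what such layers could look like. The hook is named `handle_file`, as in PTP.

```python
import json


class AbstractParser(object):
    """Base layer: knows nothing about the report format."""

    def __init__(self, pathname):
        # Every parser keeps a format-specific handle on the report data.
        self.stream = self.handle_file(pathname)

    @classmethod
    def handle_file(cls, pathname):
        # Each format layer must provide its own way of loading a report.
        raise NotImplementedError("Format layers must override handle_file().")


class JSONParser(AbstractParser):
    """Format layer: opening and closing JSON reports happens only here."""

    @classmethod
    def handle_file(cls, pathname):
        with open(pathname) as report_file:
            return json.load(report_file)


class SomeToolJSONParser(JSONParser):
    """Tool layer (hypothetical): extracts data, never touches the file."""

    def parse(self):
        # Assumed report layout: a list of findings with a 'severity' key.
        return [finding['severity'] for finding in self.stream]
```

With this split, supporting another JSON-based tool would only mean writing a new tool layer, since the file handling is inherited.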

## PTP's new architecture

First, PTP's tool support was not extended at all. It still supports only
four different tools (and only a specific version of each).

But I think that with the new architecture I implemented, it will be easier for
me to add support for new tools, which is the main mission for the next month ;)

Anyway, PTP's new architecture implements the different extra layers I
presented in the previous post, plus the one in the previous section.

The complete architecture is shown below:

![PTP UML Diag v3](/static/images/gsoc2k14/ptp_uml_v3.png)

With the `AbstractReport` class comes the `AbstractParser` one. The functions
like `is_mine` and `check_version` have been moved to the parser. They still
exist in the report but now act as proxy-functions for their siblings.
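The proxy idea can be sketched as follows. This is a simplified illustration rather than PTP's real implementation: the `__tool__` and `__versions__` attributes, the `is_mine` signature, the `ToolParserV1` and `ToolReport` classes, and the use of a tuple for `__parsers__` are all assumptions made for the example.

```python
class AbstractParser(object):
    #: Hypothetical attributes describing what a parser supports.
    __tool__ = None
    __versions__ = ()

    @classmethod
    def is_mine(cls, metadata):
        """Check whether the report was generated by the supported tool."""
        return metadata.get('tool') == cls.__tool__

    @classmethod
    def check_version(cls, metadata, key='version'):
        """Check whether the report version is one the parser supports."""
        return metadata.get(key) in cls.__versions__


class ToolParserV1(AbstractParser):
    __tool__ = 'sometool'
    __versions__ = ('1.0', '1.1')


class AbstractReport(object):
    #: Parser classes the report can delegate to (a tuple in this sketch).
    __parsers__ = ()

    @classmethod
    def is_mine(cls, metadata):
        # Proxy: the report only forwards the question to its parsers.
        return any(parser.is_mine(metadata) for parser in cls.__parsers__)

    @classmethod
    def check_version(cls, metadata, key='version'):
        # Proxy: a version is supported if one matching parser accepts it.
        return any(
            parser.is_mine(metadata) and parser.check_version(metadata, key)
            for parser in cls.__parsers__)


class ToolReport(AbstractReport):
    __parsers__ = (ToolParserV1,)
```

Callers keep talking to the report, while the knowledge about tools and versions lives in the parsers.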

The `AbstractParser` defines a new function, `handle_file`, which creates a
*stream* on the report data. I use the word *stream* because it might be
anything, from an XML handle to a list containing each line of the report.

Of course, since `AbstractParser` is an abstract layer and `handle_file` is a
specialized function, it has to be overridden.

Speaking of XML, I implemented an `XMLParser` class that overrides the
abstract `handle_file` function:

    :::python
    from lxml import etree


    class XMLParser(AbstractParser):
        """Specialized parser for XML formatted report."""

        #: str -- XMLParser only supports XML files.
        __format__ = 'xml'

        def __init__(self, pathname):
            """Initialize XMLParser.

            :param str pathname: path to the report file.

            """
            AbstractParser.__init__(self, pathname)

        @classmethod
        def handle_file(cls, pathname):
            """Specialized file handler for XML files.

            :param str pathname: path to the report file.
            :raises ValueError: if the report file has the wrong extension.
            :raises LxmlError: if Lxml cannot parse the XML file.

            """
            if not pathname.endswith(cls.__format__):
                raise ValueError(
                    "This parser only supports '%s' files" % cls.__format__)
            return etree.parse(pathname).getroot()

As you can see, in this context, a *stream* is the root element of the
[*lxml*](http://lxml.de/) tree built from the report.

I made an exception for *Skipfish* because its report is unlike the others.

Instead of having one report file, it generates a tree structure of
directories. The simplest way I found to retrieve the ranking values is to only
read the *issue_samples* file.

But this is a JavaScript file and I had to write an ugly hack to read it. That
is why there is no specialized parser class between the abstract and the
*Skipfish* ones.
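To give an idea of what such a hack can look like, here is a minimal sketch. It is not the code used in PTP. It assumes the file holds `var name = <literal>;` assignments whose literal also happens to be valid Python syntax, and the sample content and the `read_js_variable` helper are made up for the illustration.

```python
import ast
import re

# Made-up content: a real Skipfish file may be laid out differently.
SAMPLE_JS = """var mime_samples = [
  { 'mime': 'text/html' }
];
var issue_samples = [
  { 'severity': 3, 'type': 40101 },
  { 'severity': 0, 'type': 10205 }
];"""


def read_js_variable(source, name):
    """Return the literal assigned to the JavaScript variable `name`."""
    pattern = r'var\s+%s\s*=\s*(.*?);\s*$' % re.escape(name)
    # DOTALL lets the literal span several lines; MULTILINE lets `$` match at
    # the end of the line closing the assignment.
    match = re.search(pattern, source, re.DOTALL | re.MULTILINE)
    if match is None:
        raise ValueError("Variable '%s' not found" % name)
    # Works only because this literal is also a valid Python literal.
    return ast.literal_eval(match.group(1))


issues = read_js_variable(SAMPLE_JS, 'issue_samples')
highest_severity = max(issue['severity'] for issue in issues)
```

Using `ast.literal_eval` rather than `eval` means that only plain literals are accepted, so nothing from the report file gets executed.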

# Simplification

## Useless `Info` class

Apart from the modification of the architecture, I tried to take an honest look
at the current implementation of PTP. I was trying to answer questions like *Is
this pertinent?*, *Is it mandatory to have this?*, *Could it be simpler?*, etc.

I read a couple of articles about Python, what to do, what to avoid, and I
stumbled on [a conference talk about classes](http://youtu.be/o9pEzgHorH0) and
how often we could avoid them.

Then I took a look at my [`Info`
class](https://github.com/owtf/ptp/blob/33f3f42afb3d051f6f6c4828d167fa49b1fb8fff/libptp/info.py)
and realized that it was exactly a case shown in the video.

Jack Diederich explains that instead of writing a class that inherits from a
standard Python type (`dict` in my case) just because it might offer something
more later, I should use the standard type and only implement that class
when I actually need it.

Therefore I removed the `Info` class, which was in fact just a dictionary, and
replaced each occurrence by a simple `dict`.
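The change boils down to something like the snippet below. The `Info` class shown here is a hypothetical reconstruction of a subclass that adds no behaviour, and the keys and values are invented for the example.

```python
# Before: a subclass that adds nothing to the built-in type.
class Info(dict):
    """Hypothetical dict subclass without any extra behaviour."""


before = Info(name='example-issue', ranking=3)

# After: the built-in type carries exactly the same data.
after = dict(name='example-issue', ranking=3)

# Both objects compare equal and support the same operations.
assert before == after
assert before['ranking'] == after['ranking']
```

If extra behaviour is needed one day, the class can be reintroduced at that point without changing how the data is accessed.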

## Use iterators whenever possible

Also, I realized that I was using copy-based functions too often.

For instance, in Python 2.x, I used `dict.values` when looping over the
dictionary values. But before Python 3.x, this function builds a new list
holding a copy of the values before iterating over them, which is memory
inefficient when dealing with big dictionaries.

Therefore, I replaced each function of this kind with its iterator version
(e.g. `values` became `itervalues`).

But in Python 3.x, functions like `itervalues` have been removed and `values`
itself no longer copies anything: it returns a lightweight view on the
dictionary values.

So, in order to stay compatible with both 2.x and 3.x while still avoiding
copies, I modified PTP's code to wisely decide which one to use.

This wisdom is given by the following snippet for instance:

    :::python
    @classmethod
    def check_version(cls, metadata, key='version'):
        """Checks the version from the metadata against the supported ones.

        :param dict metadata: The metadata in which to find the version.
        :param str key: The :attr:`metadata` key containing the version value.
        :return: `True` if it supports that version, `False` otherwise.
        :rtype: :class:`bool`

        """
        try:
            parsers = cls.__parsers__.itervalues()
        except AttributeError:  # Python3 then.
            parsers = cls.__parsers__.values()
        return metadata[key] in parsers
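Since the same `try`/`except` would otherwise be repeated wherever values are iterated, it could be factored into a small helper. The sketch below is one possible way to do it, not what PTP ships. The `itervalues` helper and the sample dictionary are hypothetical.

```python
def itervalues(mapping):
    """Iterate over the values of `mapping` without copying them."""
    try:
        # Python 2: dedicated iterator method.
        return mapping.itervalues()
    except AttributeError:
        # Python 3: values() returns a view, iter() turns it into an iterator.
        return iter(mapping.values())


# Invented data for the example.
supported = {'xml': '0.4.6', 'json': '0.4.7'}
print('0.4.6' in itervalues(supported))
```

One thing to keep in mind is that an iterator can only be consumed once, so the helper has to be called again for each new loop or membership test.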

# Documentation using Sphinx

The last point I worked on during this month is creating the documentation for
PTP.

A couple of weeks earlier, I discussed the fact that
[OWTF](https://www.owasp.org/index.php/OWASP_OWTF) should use a tool that
automatically generates the documentation from the docstrings. We were
hesitating between [doxygen](http://www.stack.nl/~dimitri/doxygen/) and
[sphinx](http://sphinx-doc.org/).

Since I already used doxygen in other projects, I thought I could use sphinx
for PTP as a beta test, before deciding which one to use in OWTF.

Therefore I spent a day or two writing the full technical documentation for
PTP, as you may have guessed based on the snippets I presented in this post.

I was glad to see that configuring sphinx for a project is pretty easy and
that the syntax is really simple, but the same is true of doxygen. For now, I
think both are equal, and this beta test did not give me enough elements to
decide which one to use for OWTF.
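For reference, a minimal Sphinx configuration that reads docstrings could look like the sketch below. It is not PTP's actual `conf.py`: the `docs/` location and the relative path to the package are assumptions, and only the `sphinx.ext.autodoc` extension is enabled.

```python
# docs/conf.py -- minimal Sphinx configuration (sketch).
import os
import sys

# Assumption: the package lives one directory above docs/, so it is added to
# the import path for autodoc to find it and read its docstrings.
sys.path.insert(0, os.path.abspath('..'))

project = 'PTP'

# autodoc generates documentation from the docstrings of imported modules.
extensions = ['sphinx.ext.autodoc']

# Name of the root document, without the file extension.
master_doc = 'index'
```

The `.rst` pages can then pull the docstrings in with autodoc directives such as `automodule` or `autoclass`.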

Another point I would like to briefly discuss is that I really want to have
something clean for PTP. It is for now the cleanest project I have created in
my opinion and I want to keep it as clean as possible.

That is why, after a friend of mine sent me a link to the
[SyncThing](https://github.com/calmh/syncthing) repository, which is one of the
cleanest repos I have ever seen, I decided to follow the same release system.
[The release I have pushed](https://github.com/owtf/ptp/releases/tag/v0.1.0)
and every subsequent one will therefore follow the [Semantic Versioning
Guidelines](http://semver.org/).
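As a reminder of what these guidelines imply, the major number changes on incompatible API changes, the minor number on backward-compatible additions, and the patch number on backward-compatible bug fixes. The sketch below illustrates the rule with two hypothetical helpers, `parse_version` and `bump`. It ignores pre-release and build metadata.

```python
def parse_version(tag):
    """Turn a tag such as 'v0.1.0' into a (major, minor, patch) tuple."""
    major, minor, patch = tag.lstrip('v').split('.')
    return int(major), int(minor), int(patch)


def bump(version, part):
    """Return the version following a 'major', 'minor' or 'patch' change."""
    major, minor, patch = version
    if part == 'major':
        # Incompatible API change: reset minor and patch.
        return (major + 1, 0, 0)
    if part == 'minor':
        # Backward-compatible new functionality: reset patch.
        return (major, minor + 1, 0)
    if part == 'patch':
        # Backward-compatible bug fix.
        return (major, minor, patch + 1)
    raise ValueError("Unknown part: %r" % part)


current = parse_version('v0.1.0')
next_minor = bump(current, 'minor')
```

Comparing versions then comes for free, since Python compares tuples element by element.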

# Conclusion

I cannot really draw any conclusion after this month. With my exams, I did not
have much free time to fully work on my GSoC project.

Nevertheless, I managed to update PTP's architecture as I had planned in my
previous post. I feel confident about the robustness of the new one and I think
that it will ease any further work.

I also simplified the code base, ensured compatibility with both Python 2.x
and 3.x, configured sphinx and wrote the technical documentation of the
project.

I hope I will have more to say for the next monthly GSoC post :)