26 December 2015

Keeping up to date with a forked Github repository

Ok, you have made your first pull request and it has been accepted along other pull requests from others contributers or maybe you're still working on your fork but forked repository is receiving updates you want to take to your fork. In those two case you want to import all new commits received in forked repository to your fork one.

Removing your fork repository and recloning the forked one is not a solution because you may have code unfinished that you'd lose if you remove your repository. Working in that way is messy and prone to errors.

There's another way far better than that but to be honest it is not evident looking only to the Github web interface. The trick is to import commits to forked repository into your local copy of your forked repository and merging there the changes and pushing them to your Github repository afterwards.

At first glance it might look overcomplicated but it is logical if you consider your Github repository as a definitive one leaving your local copy as the place where you mess with things.

As I said before, to proceed that way I haven't found any evident way through Github web interface. Even Pycharm lacks a feature needed to do the process entirely through GUI, so you'll have to type something in console. Thankfully the thing is solved with just a few commands.

First step involves adding forked Github repository as a remote source of your local repository. As an example I'll show the output for my local copy of my fork version of vdist. This very step is the one you cannot do visually through GUI in Pycharm. To see which sources your git repository has type:

dante@Camelot:~/vdist$ git remote -v origin https://github.com/dante-signal31/vdist.git (fetch) origin https://github.com/dante-signal31/vdist.git (push)

You can see I call origin to my fork repository in Github. Now lets add original forked repository to my git sources. Say we want call original forked repository as upstream:

dante@Camelot:~/vdist$ git remote add upstream https://github.com/objectified/vdist.git dante@Camelot:~/vdist$ git remote -v origin https://github.com/dante-signal31/vdist.git (fetch) origin https://github.com/dante-signal31/vdist.git (push) upstream https://github.com/objectified/vdist.git (fetch) upstream https://github.com/objectified/vdist.git (push)

You can see that forked repository is now among my local git repository like upstream.

To download any change from forked repository without actually applying them in local repository you should use git fetch:

dante@Camelot:~/vdist$ git fetch upstream remote: Counting objects: 1, done. remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 1 De https://github.com/objectified/vdist * [new branch] master -> upstream/master

You can do this last one step from Pycharm GUI using "VCS-> Git-> Fetch" menu.

Downloaded changes are stored by default in upstream/master branch. You can review them and if you agree with them you can apply to your local master branch doing:

dante@Camelot:~/vdist$ git checkout master Switched to branch 'master' dante@Camelot:~/vdist$ git merge upstream/master

Now, your local master branch and the one from forked repository are synced. What you need to do is update your Github fork repository with a push:

dante@Camelot:~/vdist$ git push origin master

The good news are that you only need to add upstream repository just once and afterwards you can operate over it as usual even through Pycharm GUI as it detects the new remote branches as soon as you ask her to do a fetch from all sources.

12 July 2015

Effective Python

When you learn a language there is a point where beginners books don't give you anything useful any longer, where you can develop almost anything you want with what you know so far but that level of knowledge is not enough, you want to master the language and improve your skills a little more everyday.

"Effective Python" is the kind of book to read when you get that point. It's not a book for beginners but a book for developers who want to be really pythonic.

Written by a Google engineer, it covers several developments areas like functions, classes, metaclasses, concurrency, collaboration, production, etc, through many recipes and examples. You can read this book sequentially or not. It has many similarities with books like "Python Cookbook". Some topics will be known for you, some others will be new and interesting. In the end you'll use this book as a reference when you come across situations like the depicted ones in the book.

I'm my humble opinion the money to get this book is well invested. I got examples and tricks about topics and possibilities really useful for my developments and many months after reading it I keep coming back to this book for references.

26 April 2015

Google Code shutdown

It is not fresh news, but last month Google announced they are going to turn down Google Code after 9 years of existence.

That service started in 2006 with the goal of providing a scalable and reliable way of hosting open source projects. Since that time, millions of people have contributed to open source projects hosted on that site. Nevertheless time have passed and other hosting options, like GitHub or Bitbucket, have gained greater popularity than Google Code.

No one can deny that Google likes to try new trends and technologies but the fact is that they have to keep happy their stakeholders so they end every service that doesn't stay high in popularity. Many high quality service have been shutdown by Google before Code's one: Notes or Wave are the first that come to my mind. Problem is that Google shutdowns services because they are not popular but some argue that they are not so popular because people and companies are afraid of Google changing its mind and closing service as time passes, so they don't use them. Who knows...

The facts are that since March no new project can be added to Google Code and that already existing project will keep their functionality only until August of this year. From there on, project data will be read only, an next year the site will be definitively shutdown although project data will be available for download in an archive format.

Those like me who have projects in Google Code have a Google Code Exporter to migrate project to GitHub and documentation to migrate to other services manually.

Many of my projects are rather outdated but I guess I'll import them to my private Bitbucket repositories to refactor them heavily and afterwards make them publicly available though GitHub.

17 April 2015

Packaging Python programs - PyPI packages

Once you have finished your program you probably will want to share it. There are few ways to share a Python program:
  • Zip your program and send it throught email: I'd only use this method for very short scripts that only use the built-in libraries in standard Python distribution. 
  • Place it in a web repository like GitHub or BitBucket: It's a good option if you want to share your program with other developers in order to let them make code contributions. We reviewed that option in my tutorial about Mercurial
  • Place it in Python Package Index (PyPI): It's the obvious choice if you want to share your program with other python developers not to let them modify your code but to use it as a library in their own python developments. The good point with this option is that you can install dependencies automatically.
  • Bundle your program in a native package for Linux distributions: In another article I'm going to explain ways to bundle our python program into a RPM package (for Fedora/RedHat/Suse distributions) or a DEB package (for Ubuntu/Debian distributions).
In this article I'm going to explain the third method: using Python Package Index (PyPI).

PyPI is the official python package repository and it's maintained by Python Software Foundation. So it's the central repository of reference for nearly all python packages publicly available. PyPi has a webpage from where you can download any package manually but usually you'd better use pip or easy_install to install packages from your console prompt. I usually prefer pip, because it's included by default in later Python 3.x versions and besides it's the tool recommended by Python Packaging Authority (PyPA), the main source about best practices about python packaging. Using pip you can download and install any python package from PyPI repositories, along any needed dependency.

You can upload two main types of python packages to PyPI:
  • Wheel packages: It's a binary package format intended to replace older Egg format. It's the recommended format for python packages uploaded to PyPI.
  • Source distribution packages: It's a package that has to be compiled in installer's end. Wheel format is faster and easier to install because it's precompiled in packager's end. A good reason to choose sources packages instead of wheels ones could be to include C extensions, because those should be compiled in users end.
Wheel packages are the recommended format for main users but source distribution are useful to build native packages for popular linux distros (deb packages for Debian/Ubuntu, rpm packages for Red Hat/Fedora/Suse, etc.). We are going to see both of them in this article.

First thing you should get to make sure your application is ready to be packaged is to structure it in an standard folder tree structure. You can order your files inside your project in the way you want but if you review some of the more popular python projects in Bitbucket or GitHub you'll that they follow a similar way to place their files across their folder structure. That way is a best practice that Python people has been adopting as time passed. To see and example of that structure you can check a sample project structure developed by PyPA, Following that best practice you are supposed to put your installation script and all files describing your project at the project root folder. Your application files (the ones you actually develop) should be inside a folder called as the project. Other files related to your development, but not the developed application itself, should be in their own folders at the same level than your application folder but not inside it.

Let's see another example: check Geolocate GitHub repository. There you can see that I put compiling, packaging and installation scripts at root level, along with files like README.rst or REQUIREMENTS.txt that describe the application. Development files are inside application folder (geolocate folder) instead. Some people prefer to place their unittest files in their own folder outside application folder and other prefer to put them inside it. If tests are not intended to be distributed to final users I think it's better to keep them off application folder. In case of Geolocate they are inside to solve some problems with imports, but now I know the causes so I guess in my next project I will keep my tests appart in their own folder.

Once you have structured your project in an standard folder tree, it is a good idea to create a virtualenv to run your application. If you don't know what a virtualenv is, take a look of this tutorial. That virtualenv would let you define the list of python packages your application needs as dependencies and export that list to a REQUIREMENTS.txt file, as is explained here. That file will be really useful when you write your setup.py script, as I'm going to explain.

Setup.py script is the most important file to create PyPI compatible packages. It serves two primary functions:
  • It's the file where various aspects of your project are configured. This scripts contains a global function: setup(). The arguments you pass to that function defines details of your application: namely application author, version, description, license, etc.
  • It's the command line interface for running various commands that relate to packaging tasks.
You can get a listing of commands available through setup.py running:
dante@Camelot:~/project-directory$ python setup.py --help-commands

Setup.py depends on setuptools python package, so you have to be sure to have it installed. 

In this article I'm going to explain you how to write a functional setup.py script using as a guideline the setup.py file at geolocate. As you can see in that example, file is essentially simple: just import setup function from setuptools package and call it. The real customization comes with parameters we pass to setup(). Let's see those parameters:
  • "name": It is the package name as it is going to be identified in PyPI repository. You'd better check if your desired name is already used in PyPI before deciding your final application name.  I developed geolocate just to find that name was already used by another package in PyPI, so I had to name the package glocate although its inside executable was still named as geolocate. It was a dirty solution but next time I'll do better. 
  • "version": Package version. Try to keep it updated to let your user upgrade their package instance downloading them from PyPI.
  • "description": It's the short description that will be shown in your package page at PyPI. Try to keep it short and descriptive.
  • "long description": This is the long version of your description. Here you can put more detail. Two or three paragraphs is right.
  • "author": Your name (real or nickname).
  • "author_email": Email where you want to be contacted for things related with this application.
  • "license": Name of the license you have chosen.
  • "url": Website's URL for this application.
  • "download_url": I put here the website's url where you can find linux distro dependant versions of this package.
  • "classifiers": Categories to classify this application. Its important to set them because they help users to find the application they need when they search in PyPI database. You can find a full listing of available classifiers here.
  • "keywords": List of keywords that describe your project.
  • "install_requires": Here you place the list of dependencies you exported to REQUIREMENTS.txt.
  • "zip_safe": As an optimization, PyPI packages can be installed in a compressed format so they consume less hard disk space. Problem is that some applications don't work well that way, so I prefer set this to false.
  • "packages": It's required to list packages to be included in your project. Although they can be listed manually I prefer setuptools.find_packages() to find them automatically. "Exclude keyword is supposed to allow you omit packages that are not intended to be released and installed. Problem is that that keyword doesn't actually work because of a bug. We'll see the workaround in this article.
  • "entry_points": Here you define which function, inside your scripts, will be called by user. I use it to define console_scripts. With console-scripts setuptool "compiles" the called script making it a linux executable. For instance, if you define the main function inside your_script.py you can get a your_script executable with no py extension that can be executed directly.
  • "package_data": By default, setup only includes python files in your package. If your application contains other file types that are called by your python packages you shoud use this keyword to make they are included to. You set package_data to a python dictionary. Each key is one of you application package a it value is a list of relative path names that should be copied into the package. The paths are interpreted as relative to the directory containing the package. Setup.py is not able to create empty folders to place there files after installation, so the workaround is to create a dummy empty file in that folder and include that file in installble with this keyword.
Some people use "data_files" keyword to include files that are not placed inside any of their application python packages. Problem I found with this approach is that those files end installed in platform dependent paths so it's really hard for your scripts to find them when they are run after installation. That's why I prefer to put my files inside my packages and use "package_data" keyword instead.

Once you have written your setup.py file you can compile your packages, but you might want to check the sanity of your setup.py before. Pyroma analyzes if your setup.py complies with recognized good packaging practices, alerting you if doesn't.

If you are happy with your setup.py configuration you can create a source package just doing:
dante@Camelot:~/project-directory$ python setup.py sdist

While to create a wheel package you are only supposed to do:
dante@Camelot:~/project-directory$ python setup.py bdist_wheel

When you use setup.py to create your packages, it will create a dist folder (in the same folder as setup.py) and place packages there.

Problem arises when you try to use find_packages function, for your packages keyword, along with exclude argument. I've found that in that particular case, exclude argument doesn't work and your undesired files get included into the package. It happens this behavior is a bug, and while they fix it the workaround involves first creating source package and afterwards creating wheel package from source one with this command:

dante@Camelot:~/project-directory$ pip wheel --no-index --no-deps --wheel-dir dist dist/*.tar.gz

Your packages can be locally installed with pip:
dante@Camelot:~/project-directory$ pip install dist/your_package.whl

Trying to install locally your package in a freshly created virtualenv is a good way to check installation really works as expected.

To share your package you can send it through email or make it available in a web server, but the pythonic preferred way is make it publicly available through PyPI.

PyPI has two main web sites:

  • PyPI test site: This site is cleaned on a semiregular basis.Before releasing your package on main PyPI site you might prefer training on test site. You can try to download your package from PyPI test site but it can happen your installation fails because dependencies cannot be downloaded from the same site. That is because PyPI test site doesn't have the entire packages database, it only has packages that people have uploaded to test them. 
  • PyPI main site: It has the entire package database. If your dependencies cannot be downloaded from here them you are doing somethig wrong.
To use any of the two interfaces you have to register. Be aware that user database is not shared between two sites so you'll have to register twice. In register page you only have to fill the submission form with your project details. Don't be stressed you can modify any of the fields afterwards.

After registering you can submit your packages files through PyPI web interface but you might want a higher level of automation. To be able to submit files from console (or from any script), you'll need to create a .pypirc in your home folder (notice the dot before the file name), with this content:

repository = https://pypi.python.org/pypi
username = <username>
password = <password>

Afterwards you can run twine command to upload your packages to PyPI:
dante@Camelot:~/project-directory$ twine upload dist/*

You may need to install twine with pip before using it if your system has not installed yet. Twine will use your credentials stored in ~/.pypirc to connect to PyPI and upload your packages. Twine uses SSL to cipher its connections so it's a safer alternative to other options like using: "python setup.py upload".

After that your packages will be available through PyPI and anyone will be able to install it just doing:

dante@Camelot:~/project-directory$ pip install your_package

14 March 2015

Clean Code

Along your life there are not many books that really change your way of thinking or doing things. In my case I can count with my fingers of one hand the books like those that I've met: Kurose & Ross's "Computer Networking: a Top-Down Approach", Ross J.Anderson's "Security Engineering: A Guide to Building Dependable Distributed Systems", and the book this article is talking about Robert C. Martin's "Clean Code: a Handbook of Agile Software Craftsmanship".

I met this book in one of the PyConEs-2013 conferences. In that conference they talked about how to write code sustainable along time. The topic was very interesting to me because I was worried about a phenomenon every programmer know sooner or later: even in Python, when your code grows it gets harder to be maintenable. I had programmed applications that some months later where hard to understand when I had to make a revision over them. Many years before that I had switched to Python to avoid that same problem in Java and Perl, but then it was there again. In the conference they promised that principles explained in that book helped to prevent the problem. So I read the book and I have to admit that they were right.

Reading this book is shocking. There are so many practices that we think that are right that actually are terribly wrong that you first read some passages with a mixture of surprise and incredulity. Things like saying that code comments are a recognition of your failure to make your code readable sounds strange in the first read but afterwards you really get that author is really right.

Book examples are not in Python but in Java, nevertheless I think that no Python programmer would have any problem to grasp concepts explained there. Few of the concepts are too Java-ish but many others are useful to Python developers. Some of the main concepts are:

  • Your function names should explain clearly what the function do. No abbreviations allowed in function names.
  • Function should do one thing and one thing only. A function should have only one purpose. Of course, a function can have many steps but all of them should be focused to get function's goal, and every step should be actually implemented in it's own function. That lead to functions easier to test.
  • Functions should be short: 2 lines is great, 5 lines is good, 10 lines average, 15 poor.
  • Code comments should be restricted only to explain design decisions instead of what code does.
  • Don't mix levels of abstraction in the same function, meaning that you should not call directly python API while other steps of your function call to your own custom functions. Instead of that wrap your call to API inside another custom function.
  • Order your implementations so you can read your code from top to down.
  • Reduce as far as possible the number of arguments you pass into functions. Functions with 1 argument are good, 2  are average and 3 is likely poor.
  • Don't Repeat Yourself (well, at least this concept was known to me before reading this book).
  • Classes should be small.
  • Classes should have only one reason to change (Single Responsibility Principle). IMHO I think this principle is a logic extension of "single purpose" for functions.
  • Class attributes should be ideally used for all class methods.If you find attributes just used by an small subset of methods you should ask yourself if those attributes and methods could go in a separate class.
  • Classes should be open for extension but close to modifications. That means that we incorporate new features by subclassing existing classes not modifying them. That way we reduce the risk of breaking things when we include new features.
  • TDD or condemn yourself to hell of include further modifications in your code fearing you are going to break the whole thing.
There are many more concepts, all fully explained with examples, but those are the ones I keep in my head when a write code.

To test if principles of this book were right, I developed an application called Geolocate following these concepts and TDD ones. In the beginning it was hard to change my behaviour about writing code but as my code was getting bigger I realized it was easier than in my previous projects to find errors and fix them. Besides, when my application got a respectable size I let it rest for five months to see how easy was to retake development after so much time without reading the code. I was amazed. Although with my previous projects I would have needed some days to understand a code so big, this time I had fully recovered control of how my code worked in just an hour.

My conclusion is that this book is a "must read" that will let you improve dramatically your code quality and your peace of mind to maintain that same code afterwards.

08 January 2015

Python test coverage tools

I guess there are many metrics out there to know how effective are your unittest proofs to cover all possible cases of your developments. In the future I'm going to formalize my knowledge in TDD but nowadays I'm just playing with the concepts so I follow a very simple metric: If my tests ejecutes all my code program then I'm doing OK.

How to know if your tests executes all your code?. When code grows so your tests amount do, then it's easy to miss a fragment of code and leave it untested. There is where test coverage tools come to help you.

Those tools follow your tests while they execute your code and take note of visited code lines. That way, after tests execution you can see statistics about which percentage of your code is actually tested and which not.

We are going to see two ways to analyze your test coverage: from console and from your IDE.

If your are in an Ubuntu box you should install "python3-coverage" to get coverage analysis in the console:
dante@Camelot:~/project-directory$ sudo aptitude install python3-coverage

Once installed, python3-coverage has to be called to run your unitests:

dante@Camelot:~/project-directory$ python3-coverage run --omit "/usr/*" ./run_tests.py

I use "--omit" flag to keep "/usr" out of coverage reports. Before using that flag calls of my program to external libraries were included in the report. As I don't test external libraries because they are not developed by me, getting their coverage statistics would make my reports harder to read. The script "run_tests.py" is the same I explained in my article about unittest.

Suppose all your test run correctly, you can generate a report about your coverage in html or xml format. To get a report in html format, just run:

dante@Camelot:~/project-directory$ python3-coverage html

This command generates a folder called "htmlcov" where your html report is stored (index.html is its entry page). The only thing a bit a annoying is that htmlcov has to be removed manually before generating a new one. Besides it's boring to search generated index page to open report. That's why I prefer to use an script to automate all those boring things:

#!/usr/bin/env bash

python3-coverage run --omit "/usr/*" ./run_tests.py
echo "Removing previous reports..."
rm -rf htmlcov/*
rmdir htmlcov
echo "Removed."
echo "Building coverage report..."
python3-coverage html
echo "Coverage report built."
echo "Let's show report."
(firefox htmlcov/index.html &> /dev/null &)
echo "Bye!"

Running that script (I call it: "run_test_and_coverage.sh") I get an automatically opened firefox browser showing your just created coverage report.

If you use and IDE, chances are that it includes some sort of coverage tool. PyCharm includes coverage support in its professional version. Actually with PyCharm you get more or less the same than from console tools but integrated with your editor in a more confortable way.

At application level, default configuration should be enough:

I guess if you have not installed system package "python3-coverage" you should check "Use bundled coverage.py" option to use native coverage tool included with  PyCharm. In my case I haven't notice any difference either checking or unchecking that option (obviously I have "python3-coverage" installed).

The only tricky thing is to remember that running test with coverage support has its own button in PyCharm interface. That button is located next to "Run" and "Debug" ones:

After using that button, you get a summary panel right your main editor with percentages showing coverages in each folder. Outliner at left hand side marks coverage by each source file. Besides, your main editor window will get colored to mark lines covered with green color and not covered in red.

Click image to enlarge

Keep your test coverage as near as possible to 100% is one of your best indicator your tests are well designed. To control it you can use console or your IDE tool, its your choice, but both of then are easy enough to use them often.