Wednesday, December 22, 2010

PyPy 1.4.1

Here is PyPy 1.4.1 :-)

Update: Win32 binaries available.

Enjoy!

Release announcement

We're pleased to announce the 1.4.1 release of PyPy. This release consolidates all the bug fixes that occurred since the previous release. To everyone that took the trouble to report them, we want to say thank you.

What is PyPy

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython. Note that it still only emulates Python 2.5 by default; the fast-forward branch with Python 2.7 support is slowly getting ready but will only be integrated in the next release.

In short, the advantage of trying out PyPy instead of CPython (the default implementation of Python) is, for now, performance. Not all programs are faster in PyPy, but we are confident that any CPU-intensive task will be much faster, at least if it runs for long enough (the JIT has a slow warm-up phase, which can take several seconds or even a minute on the largest programs).

Note again that we do support compiling and using C extension modules from CPython (pypy setup.py install). However, this is still an alpha feature, and the most complex modules typically fail for various reasons; others work (e.g. PIL) but take a serious performance hit. Also, for Mac OS X see below.

Please note also that PyPy's performance was optimized almost exclusively on Linux. It seems from some reports that on Windows as well as Mac OS X (probably for different reasons) the performance might be lower. We have not investigated this much so far.

More highlights

  • We migrated to Mercurial (thanks to Ronny Pfannschmidt and Antonio Cuni for the effort) and moved to Bitbucket. The new command to check out a copy of PyPy is:
    hg clone http://bitbucket.org/pypy/pypy

  • In long-running processes, the assembler generated by old JIT-compilations is now freed. There should be no more leaks, however long the process runs.

  • Greatly improved the performance of the binascii module, and of hashlib.md5 and hashlib.sha.

  • Made sys.setrecursionlimit() a no-op. Instead, we rely purely on the built-in stack overflow detection mechanism, which also gives you a RuntimeError -- just not at some exact recursion level (see the short example after this list).

  • Fixed argument processing (now e.g. pypy -OScpass works like it does on CPython --- if you have a clue what it does there :-)).

  • cpyext on Mac OS X: it still does not seem to work. I systematically get a segfault in dlopen(). Contributions welcome.

  • Fixed two corner cases in the GC (one in minimark, one in asmgcc+JIT). These notably caused pypy translate.py -Ojit to crash on Windows.

  • Fixed a corner case in the JIT's optimizer, leading to Fatal RPython error: AssertionError.

  • Added some missing built-in functions into the 'os' module.

  • Fixed ctypes (it was not propagating keepalive information from c_void_p).
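
As an illustration of the new recursion-limit behaviour, the following snippet (a small example written for this summary, not taken from the release itself) now ends with a RuntimeError on PyPy regardless of the limit passed to sys.setrecursionlimit():

import sys
sys.setrecursionlimit(100000)   # accepted, but effectively ignored

def recurse(n):
    return recurse(n + 1)

try:
    recurse(0)
except RuntimeError:
    print "stack overflow detected (not at an exact recursion depth)"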

Tuesday, December 14, 2010

PyPy migrates to Mercurial

The assiduous readers of this blog surely remember that during the last Düsseldorf sprint in October, we started the process for migrating our main development repository from Subversion to Mercurial. Today, after more than two months, the process has finally been completed :-).

The new official PyPy repository is hosted on BitBucket.

The migration has been painful because the SVN history of PyPy was a mess and none of the existing conversion tools could handle it correctly. This was partly because PyPy started when Subversion was still at version 0.9, before some best practices had been established, and partly because we probably managed to invent all the possible ways to do branches (and even some of the impossible ones: there is at least one commit which you cannot make with the plain SVN client -- you have to talk to the server directly :-)).

The actual conversion was possible thanks to the enormous work done by Ronny Pfannschmidt and his hackbeil tool. I would like to personally thank Ronny for his patience in handling all the various requests we made.

We hope that PyPy development becomes even more approachable now, at least from a version control point of view.

Friday, December 10, 2010

Oh, and btw: PyPy gets funding through "Eurostars"

There is a supporting reason why we made so many advances in the last year: funding through Eurostars, a European research funding program. The title of our proposal (accepted in 2009) is: "PYJIT - a fast and flexible toolkit for dynamic programming languages based on PyPy". And the participants are Open End AB, the Heinrich-Heine-Universität Düsseldorf (HHU), and merlinux GmbH.

It's not hard to guess what PYJIT is actually about, is it? Quoting: "The PYJIT project will deliver a fast and flexible Just-In-Time Compiler toolkit based on PyPy to the market of dynamic languages. Our main aim is to showcase our project's results for the Open Source language Python, providing unprecedented levels of flexibility and with speed hitherto only available using statically typed languages." (Details in German or in Swedish :-)

A subgoal is to improve our development and testing infrastructure, mainly showcased by Holger's recent py.test releases (the testing tool used by PyPy for its 16K tests) and by the speed.pypy.org infrastructure (a web app programmed by Miquel Torres in his own time).

The overall scope of this project is smaller than that of the previous EU project from 2004 to 2007. The persons that are (or were) getting money to work on PyPy are Samuele Pedroni (at Open End), Maciej Fijalkowski (as a subcontractor), Carl Friedrich Bolz, Armin Rigo, Antonio Cuni (all at HHU), and Holger Krekel (at merlinux) as well as Ronny Pfannschmidt (as a subcontractor).

The Eurostars funding lasts until August 2011. What comes afterwards? Well, for one, many of the currently funded people have done work without getting funding in previous years. This will probably continue. We also have non-funded people in the core group right now and we hope to enlarge it further. But of course there are still large tasks ahead which may greatly benefit from funding. We have set up a donation infrastructure, and maybe we can win one or more larger organisations over to providing larger or regular sums of money to fund future development work. Another possibility for companies is to pay PyPy developers to help and improve PyPy for their particular use cases.

And finally, your help, donations and suggestions are always welcome, and overall we hope to convince more and more people that it's worthwhile to invest in PyPy's future.

Wednesday, December 8, 2010

Leysin Winter sprint

Hi all,

The next sprint will be in Leysin, Switzerland, during the week of the 16th-22nd of January 2011.

Now that we have released 1.4, and plan to release 1.4.1 soon, the sprint is going to be mainly working on fixing issues reported by various users. Of course this does not prevent people from showing up with a more precise interest in mind.

As usual, the break day on the sprint will likely be a day of skiing :-)

Hoping to see you there.

Update: there are actually a number of branches that we want to polish and merge into trunk: at least fast-forward, jit-unroll-loops, arm-backend and jitypes2. For more details, see the announcement.

Wednesday, December 1, 2010

PyPy 1.4 release aftermath

A couple of days have passed since the announcement of the 1.4 release, and this is a short summary of what happened afterwards. Let's start with numbers:

  • 16k visits to the release announcement on our blog
  • we don't have download statistics unfortunately
  • 10k visits to speed center
  • most traffic comes from referring sites, with reddit alone generating over a third of our traffic

Not too bad for a project that doesn't have a well-established user base.

Lessons learned:

  • Releases are very important. They're still the major way projects communicate with the community, even if we have nightly builds that are mostly stable.
  • No segfaults were reported, no incompatibilities between JIT and normal interpretation. We think that proves (or at least provides a lot of experimental evidence) that our write-once-and-then-transform method is effective.
  • A lot of people complained about their favorite C extension module not working; we should have made it clearer that CPyExt is in an alpha state. Indeed, we would like to know which C extension modules do work :-).
  • Some people reported massive speedups, others reported slowdowns compared to CPython. Most of those slowdowns relate to modules being inefficient (or doing happy nonsense), like ctypes. This is expected, given that not all modules are even jitted (although having them jitted is usually a matter of a couple of minutes).
  • Nobody complained about a lack of some stdlib module. We implemented the ones which are used more often, but this makes us wonder if less used stdlib modules have any users at all.

In general, feedback has been overwhelmingly positive, and we would like to thank everyone who tried PyPy (and especially those who reported problems).

Cheers,
fijal

We are not heroes, just very patient

Inspired by some of the comments to the release that said "You are heroes", I thought a bit about the longish history of PyPy and hunted around for some of the mailing list posts that started the project. Then I put all this information together into the following timeline:

There is also a larger version of the timeline. Try to click on some of the events; the links usually go to the sprint descriptions. I also tried to find pictures for the sprints but succeeded for only half of them; if anybody still has some, I would be interested. It's kind of fun to browse around in some of the old sprint descriptions to see how PyPy evolved. Some of the current ideas have been around for a long time, some are new. In the descriptions of the releases I put estimates of their speed.

Friday, November 26, 2010

PyPy 1.4: Ouroboros in practice

We're pleased to announce the 1.4 release of PyPy. This is a major breakthrough in our long journey, as PyPy 1.4 is the first PyPy release that can translate itself faster than CPython. Starting today, we are using PyPy more for our every-day development. So may you :) You can download it here:

http://pypy.org/download.html

What is PyPy

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython. It is fast (pypy 1.4 and cpython 2.6 comparison).

New Features

Among its new features, this release includes numerous performance improvements (which made fast self-hosting possible), a 64-bit JIT backend, as well as serious stabilization. As of now, we can consider the 32-bit and 64-bit linux versions of PyPy stable enough to run in production.

Numerous speed achievements are described on our blog. Normalized speed charts comparing pypy 1.4 and pypy 1.3 as well as pypy 1.4 and cpython 2.6 are available on the benchmark website. For the impatient: yes, we got a lot faster!

More highlights

  • PyPy's built-in Just-in-Time compiler is fully transparent and automatically generated; it now also has very reasonable memory requirements. The total memory used by a very complex and long-running process (translating PyPy itself) is within 1.5x to at most 2x the memory needed by CPython, for a speed-up of 2x.
  • More compact instances. All instances are as compact as if they had __slots__. This can give programs a big gain in memory. (In the example of translation above, we already have carefully placed __slots__, so there is no extra win.)
  • Virtualenv support: PyPy is now fully compatible with virtualenv; note that to use it, you need a recent version of virtualenv (>= 1.5). See the short example after this list.
  • Faster (and JITted) regular expressions - huge boost in speeding up the re module.
  • Other speed improvements, like JITted calls to functions like map().
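
A quick illustration of the virtualenv support (the paths below are placeholders for wherever you unpacked PyPy, and the exact invocation may vary slightly with your virtualenv version):

$ virtualenv -p /path/to/pypy-1.4/bin/pypy my-pypy-env
$ source my-pypy-env/bin/activate
(my-pypy-env)$ python    # this now starts PyPy instead of CPython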

Cheers,
Carl Friedrich Bolz, Antonio Cuni, Maciej Fijalkowski, Amaury Forgeot d'Arc, Armin Rigo and the PyPy team

Improving Memory Behaviour to Make Self-Hosted PyPy Translations Practical

In our previous blog post, we talked about how fast PyPy can translate itself compared to CPython. However, the price to pay for the 2x speedup was a huge amount of memory: actually, it was so huge that a standard -Ojit compilation could not be completed on 32-bit because it required more than the 4 GB of RAM that are addressable on that platform. On 64-bit, it consumed 8.3 GB of RAM instead of the 2.3 GB needed by CPython.

This behavior was mainly caused by the JIT, because at the time we wrote the blog post the generated assembler was kept alive forever, together with some big data structure needed to execute it.

In the past two weeks Anto and Armin attacked the issue in the jit-free branch, which has been recently merged to trunk. The branch solves several issues. The main idea of the branch is that if a loop has not been executed for a certain amount of time (controlled by the new loop_longevity JIT parameter) we consider it "old" and no longer needed, thus we deallocate it.
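
For example, the parameter can be tuned when starting PyPy via the generic --jit command-line option (the value shown below is purely illustrative, not a recommendation):

$ pypy --jit loop_longevity=1000 my_long_running_program.py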

(In the process of doing this, we also discovered and fixed an oversight in the implementation of generators, which led to generators being freed only very slowly.)

To understand the freeing of loops some more, let's look at how many loops are actually created during a translation. The purple line in the following graph shows how many loops (and bridges) are alive at any point in time with an infinite longevity, which is equivalent to the situation we had before the jit-free branch. By contrast, the blue line shows the number of loops that you get in the current trunk: the difference is evident, as now we never have more than 10000 loops alive, while previously the number grew to about 37000. The time on the X axis is expressed in "Giga Ticks", where a tick is the value read out of the Time Stamp Counter of the CPU.

The grey vertical bars represent the beginning of each phase of the translation:

  • annotate performs control flow graph construction and type inference.
  • rtype lowers the abstraction level of the control flow graphs with types to that of C.
  • pyjitpl constructs the JIT.
  • backendopt optimizes the control flow graphs.
  • stackcheckinsertion finds the places in the call graph that can overflow the C stack and inserts checks that raise an exception instead.
  • database_c produces a database of all the objects the C code will have to know about.
  • source_c produces the C source code.
  • compile_c calls the compiler to produce the executable.

You can nicely see how the number of alive graphs drops shortly after the beginning of a new phase.

Those two fixes, freeing loops and generators, improve the memory usage greatly: now, translating PyPy on PyPy on 32-bit consumes 2 GB of RAM, while on CPython it consumes 1.1 GB. This result can even be improved somewhat, because we are not actually freeing the assembler code itself, but only the large data structures around it; we can consider it as a residual memory leak of around 150 MB in this case. This will be fixed in the jit-free-asm branch.

The following graph shows the memory usage in more detail:

  • the blue line (cpython-scaled) shows the total amount of RAM that the OS allocates for CPython. Note that the X axis (the time) has been scaled down so that it spans as much as the PyPy one, to ease the comparison. Actually, CPython took more than twice as much time as PyPy to complete the translation
  • the red line (VmRss) shows the total amount of RAM that the OS allocates for PyPy: it includes both the memory directly handled by our GC and the "raw memory" that we need to allocate for other tasks, such as the assembly code generated by the JIT
  • the brown line (gc-before) shows how much memory is used by the GC before each major collection
  • the yellow line (gc-after) shows how much memory is used by the GC after each major collection: this represents the amount of memory which is actually needed to hold our Python objects. The difference between gc-before and gc-after (the GC delta) is the amount of memory that the GC uses before triggering a new major collection

By comparing gc-after and cpython-scaled, we can see that PyPy uses mostly the same amount of memory as CPython for storing the application objects (due to reference counting, the memory usage in CPython is always very close to the memory actually needed). The extra memory used by PyPy is due to the GC delta, to the machine code generated by the JIT and probably to some other external effects (such as memory fragmentation).

Note that the GC delta can be set arbitrarily low (another recent addition -- the default value depends on the actual RAM on your computer; it probably works to translate if your computer has precisely 2 GB, because in this case the GC delta and thus the total memory usage will be somewhat lower than reported here), but the cost is to have more frequent major collections and thus a higher run-time overhead. The same is true for the memory needed by the JIT, which can be reduced by telling the JIT to compile less often or to discard old loops more frequently. As often happens in computer science, there is a trade-off between space and time, and currently for this particular example PyPy runs twice as fast as CPython by doubling the memory usage. We hope to improve even more on this trade-off.
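
For example, both knobs can be turned when starting a translation. The variable name and the values below are only illustrative of the minimark GC's tuning options, so check the documentation before relying on them:

$ PYPY_GC_MAX_DELTA=200MB pypy --jit loop_longevity=300 ./translate.py -Ojit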

On 64-bit, things are even better, as shown by the following graph:

The general shape of the lines is similar to the 32-bit graph. However, the relative difference to CPython is much better: we need about 3 GB of RAM, just 24% more than the 2.4 GB needed by CPython. And we are still more than 2x faster!

The memory saving is due (partly?) to the vtable ptr optimization, which is enabled by default on 64-bit because it has no speed penalty (see Unifying the vtable ptr with the GC header).

The net result of our work is that now translating PyPy on PyPy is practical and takes less than 30 minutes. It's impressive how quickly you get used to translation taking half the time -- now we cannot use CPython any more for that because it feels too slow :-).

Monday, November 15, 2010

Running large radio telescope software on top of PyPy and twisted

Hello.

As some of you already know, I've recently started working on a very large radio telescope at SKA South Africa. This telescope's operating software runs almost exclusively on Python (several high throughput pieces are in C or CUDA or directly executed by FPGAs). Some cool telescope pictures:


(photos courtesy of SKA South Africa)

Most of the operation software uses the KatCP protocol to talk between devices. The currently used implementation is Open Source software with a custom home-built server and client. As part of the experiments, I've implemented a Twisted-based version and ran both the default implementation and the Twisted-based one on top of CPython and PyPy to see how they perform.

There are two testing scenarios: the first one tries to saturate the connection by setting up multiple sensors that report state every 10 ms; the second one measures the round trip between sending a request and receiving the response. Both scenarios measure the number of requests per 0.2 s, so higher is better. The X axis shows the number of simultaneously connected clients.
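
To give an idea of what the round-trip scenario looks like, here is a minimal Twisted sketch. This is not the actual benchmark code (which lives in the KatCP repository); the "?watchdog" request and the port number are placeholders:

import time
from twisted.internet import reactor
from twisted.internet.protocol import ClientFactory
from twisted.protocols.basic import LineReceiver

class RoundTripClient(LineReceiver):
    delimiter = '\n'

    def connectionMade(self):
        self.completed = 0
        self.window_start = time.time()
        self.sendLine('?watchdog')            # issue the first request

    def lineReceived(self, line):
        if line.startswith('!'):              # KatCP replies start with '!'
            self.completed += 1
            now = time.time()
            if now - self.window_start >= 0.2:
                print '%d round trips in the last 0.2 s' % self.completed
                self.completed = 0
                self.window_start = now
            self.sendLine('?watchdog')        # immediately send the next request

class RoundTripFactory(ClientFactory):
    protocol = RoundTripClient

reactor.connectTCP('localhost', 1235, RoundTripFactory())
reactor.run()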

All benchmark code is available in the KatCP repository.

The results are as follows:


As you can see, in general Twisted has a larger overhead for a single client but scales better as the number of clients increases. That is, I think, expected, since Twisted has extra layers of indirection. The round-trip degradation of Twisted has to be investigated, but for us scenario 1 is by far the more important one.

Across the board, PyPy performs much better than CPython for both Twisted and the home-made solution, which I think is a pretty good result.

Note: we haven't rolled this setup into production yet, but there is a good chance that both Twisted and PyPy will be used there in the near future.

Cheers, fijal

Saturday, November 13, 2010

Efficiently Implementing Python Objects With Maps

As could be foreseen from my Call for Memory Benchmarks post a while ago, I am currently working on improving the memory behaviour of PyPy's Python interpreter. In this blog post I want to describe the various data a Python instance can store. Then I want to describe how a branch that I worked on, and which was recently merged, implements the various features of instances in a very memory-efficient way.

Python's Object Model

All "normal" new-style Python instances (i.e. instances of subclasses of object without added declarations) store two (or possibly three) kinds of information.

Storing the Class

Every instance knows which class it belongs to. This information is accessible via the .__class__ attribute. It can also be changed to other (compatible enough) classes by writing to that attribute.

Instance Variables

Every instance also has an arbitrary number of attributes stored (also called instance variables). The instance variables used can vary per instance, which is not the case in most other class-based languages: traditionally (e.g. in Smalltalk or Java) the class describes the shape of its instances, which means that the set of admissible instance variable names is the same for all instances of a class.

In Python on the other hand, it is possible to add arbitrary attributes to an instance at any point. The instance behaves like a dictionary mapping attribute names (as strings) to the attribute values.

This is actually how CPython implements instances. Every instance has a reference to a dictionary that stores all the attributes of the instance. This dictionary can be reached via the .__dict__ attribute. To make things more fun, the dictionary can also be changed by writing to that attribute.
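
A small interactive illustration (plain Python; it behaves the same on CPython and PyPy):

class Point(object):
    pass

p = Point()
p.x = 4
print p.__class__              # <class '__main__.Point'>
print p.__dict__               # {'x': 4}
p.__dict__ = {"x": 1, "y": 2}  # the dictionary can be replaced wholesale
print p.y                      # 2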

Example

As an example, consider the following code:

class A(object):
    pass

instance1 = A()
instance1.x = 4
instance1.y = 6
instance1.z = -1

instance2 = A()
instance2.x = 1
instance2.y = 2
instance2.z = 3

These two instances would look something like this in memory:

(The picture glosses over a number of details, but it still shows the essential issues.)

This way of storing things is simple, but unfortunately rather inefficient. Most instances of the same class have the same shape, i.e. the same set of instance attribute names. That means that the key part of all the dictionaries is identical (shown in grey here). Therefore storing that part repeatedly in all instances is a waste. In addition, dictionaries are themselves rather large. Since they are typically implemented as hashmaps, which must not be too full to be efficient, a dictionary will use something like 6 words on average per key.

Slots

Since normal instances are rather large, CPython 2.2 introduced slots, to make instances consume less memory. Slots are a way to fix the set of attributes an instance can have. This is achieved by adding a declaration to a class, like this:

class B(object):
    __slots__ = ["x", "y", "z"]

Now the instances of B can only have x, y and z as attributes and don't have a dictionary at all. Instead, the instances of B get allocated with enough size to hold exactly the number of instance variables that the class permits. This clearly saves a lot of memory over the dictionary approach, but has a number of disadvantages. It is obviously less flexible, as you cannot add additional instance variables to an instance if you happen to need to do that. It also introduces a set of rules and corner cases that can be surprising sometimes (e.g. instances of a subclass of a class with slots will have a dict again if the subclass does not have a __slots__ declaration of its own).
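
The restrictions look like this in practice (standard Python behaviour, shown here just for illustration):

class B(object):
    __slots__ = ["x", "y", "z"]

b = B()
b.x = 1
try:
    b.w = 2                       # not listed in __slots__
except AttributeError:
    print "cannot add attributes outside __slots__"
print hasattr(b, '__dict__')      # False: no per-instance dictionary

class C(B):                       # subclass without its own __slots__
    pass

c = C()
c.w = 2                           # works again: C instances have a __dict__
print hasattr(c, '__dict__')      # True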

Using Maps for Memory-Efficient Instances

As we have seen in the diagram above, the dictionaries of instances of the same class tend to look very similar and share all the keys. The central idea to use less memory is to "factor out" the common parts of the instance dictionaries into a new object, called a "map" (because it is a guide to the landscape of the object, or something). After that factoring out, the representation of the instances above looks something like this:

Every instance now has a reference to its map, which describes what the instance looks like. The actual instance variables are stored in an array (called storage in the diagram). In the example here, the map describes that the instances have three attributes x, y and z. The numbers after the attributes are indexes into the storage array.

If somebody adds a new attribute to one of the instances, the map for that instance will be changed to another map that also contains the new attribute, and the storage will have to grow a field with the new attribute. The maps are immutable, immortal and reused as much as possible. This means that two instances of the same class with the same set of attributes will have the same map. This also means that the memory the map itself uses is not too important, because it will potentially be amortized over many instances.
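
To make the idea concrete, here is a toy pure-Python sketch of the approach. It is not PyPy's actual RPython implementation, just an illustration of shared, immutable maps together with per-instance storage lists; all the names are invented for this example:

class Map(object):
    def __init__(self, indexes):
        self.indexes = indexes            # attribute name -> index in the storage
        self.transitions = {}             # attribute name -> next Map

    def getindex(self, name):
        return self.indexes.get(name, -1)

    def add_attribute(self, name):
        # maps are never modified: adding an attribute yields a new, cached map
        if name not in self.transitions:
            indexes = dict(self.indexes)
            indexes[name] = len(indexes)
            self.transitions[name] = Map(indexes)
        return self.transitions[name]

EMPTY_MAP = Map({})

class Instance(object):
    def __init__(self):
        self.map = EMPTY_MAP
        self.storage = []

    def write(self, name, value):
        index = self.map.getindex(name)
        if index < 0:                     # new attribute: switch to the next map
            self.map = self.map.add_attribute(name)
            self.storage.append(value)
        else:
            self.storage[index] = value

    def read(self, name):
        index = self.map.getindex(name)
        if index < 0:
            raise AttributeError(name)
        return self.storage[index]

instance1 = Instance()
instance1.write("x", 4); instance1.write("y", 6); instance1.write("z", -1)
instance2 = Instance()
instance2.write("x", 1); instance2.write("y", 2); instance2.write("z", 3)
assert instance1.map is instance2.map     # the map is shared, only the storage differs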

Note that using maps makes instances nearly as small as if the correct slots had been declared in the class. The only overhead needed is the indirection to the storage array, which is necessary because, unlike with slots, new instance variables can still be added at any time.

The concept of a "map" that describes instances is kind of old and comes from the virtual machine for the Self programming language. The optimization was first described in 1989 in a paper by Chambers, Ungar and Lee with the title An Efficient Implementation of Self, a Dynamically-Typed Object-Oriented Language Based on Prototypes. A similar technique is used in Google's V8 JavaScript engine, where the maps are called hidden classes and in the Rhino JavaScript engine.

The rest of the post describes a number of further details that occur if instances are implemented using maps.

Supporting Dictionaries with Maps

The default instance representation with maps as shown above works without actually having a dictionary as part of each instance. If a dictionary is actually requested, by accessing the .__dict__ attribute, it needs to be created and cached. The dictionary is not a normal Python dictionary, but a thin wrapper around the object that forwards all operations to it. From the user's point of view it behaves like a normal dictionary though (it even has the correct type).

The dictionary needs to be cached, because accessing .__dict__ several times should always return the same dictionary. The caching happens by using a different map that knows about the dictionary and putting the dictionary into the storage array:

Things become really complex if the fake dict is used in strange ways. As long as the keys are strings, everything is fine. If somebody adds other keys to the dict, they cannot be represented by the map any more (which supports only attributes, i.e. string keys in the __dict__). If that happens, all the information of the instance will move into the fake dictionary, like this:

In this picture, the key -1 was added to the instance's dictionary. Since using the dictionary in arbitrary ways should be rare, we are fine with the additional time and memory that the approach takes.
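
The devolution can be triggered with a few lines of ordinary Python (the behaviour observable from the program is the same on CPython, where the __dict__ is a real dictionary from the start):

class A(object):
    pass

a = A()
a.x = 1
d = a.__dict__           # on PyPy: a thin dict-like wrapper around the instance
print d["x"]             # 1 -- string keys are served from the map and storage
d[-1] = "surprise"       # non-string key: the instance falls back to a real dict
print sorted(d.keys())   # [-1, 'x']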

Slots and Maps

Maps work perfectly together with slots, because the slots can just be stored into the storage array used by the maps as well (in practice there are some refinements to that scheme). This means that putting a __slots__ on a class has mostly no effect, because the instance only stores the values of the attributes (and not the names), which is equivalent to the way slots are stored in CPython.

Implementation Details

In the diagrams above, I represented the maps as flat objects. In practice this is a bit more complex, because it needs to be efficient to go from one map to the next when new attributes are added. Thus the maps are organized in a tree.

The instances with their maps from above look a bit more like this in practice:

Every map just describes one attribute of the object, with a name and an index. Every map also has a back field that points to another map describing what the rest of the object looks like. This chain ends with a terminator, which also stores the class of the object.

The maps also contain the information necessary for making a new object of class A. Immediately after the new object has been created, its map is the terminator. If the x attribute is added, its map is changed to the second-lowest map, and so on. The blue arrows show the sequence of maps that the new object goes through when the attributes x, y, z are added.

This representation of maps as chains of objects sounds very inefficient if an object has many attributes. The whole chain has to be walked to find the index. This is true to some extent. The problem goes away in the presence of the JIT, which knows that the chain of maps is an immutable structure, and will thus optimize away all the chain-walking. If the JIT is not used, there are a few caches that try to speed up the walking of this chain (similar to the method cache in CPython and PyPy).
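
A toy sketch of this chained representation (again purely illustrative, with invented names) might look as follows:

class Terminator(object):
    # end of the chain; it also remembers the class of the object
    def __init__(self, cls):
        self.cls = cls
    def length(self):
        return 0
    def getindex(self, name):
        return -1

class AttributeMap(object):
    # describes exactly one attribute and points back to the rest of the object
    def __init__(self, name, back):
        self.name = name
        self.back = back
        self.index = back.length()        # position in the storage array
    def length(self):
        return self.index + 1
    def getindex(self, name):
        if name == self.name:
            return self.index
        return self.back.getindex(name)   # walk the chain towards the terminator

# the chain for objects of class A with attributes x, y, z:
terminator = Terminator("A")
map_x = AttributeMap("x", terminator)
map_y = AttributeMap("y", map_x)
map_z = AttributeMap("z", map_y)
print map_z.getindex("x")                 # 0, found by walking the chain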

Results

It's hard to compare the improvements of this optimization in a fair way, as the trade-offs are just very different. Just to give an impression, a million objects of the same class with three fields on a 32bit system takes:

without slots:

  • 182 MiB memory in CPython
  • 177 MiB memory in PyPy without maps
  • 40 MiB memory in PyPy with maps

with slots:

  • 45 MiB memory in CPython
  • 50 MiB memory in PyPy without maps
  • 40 MiB memory in PyPy with maps

Note how maps make the objects about as compact as CPython instances that use slots (in this measurement even slightly more so). Also, using slots has no additional effect in PyPy.
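
For the curious, here is a crude way to reproduce such a measurement yourself (illustrative only: ru_maxrss is reported in KiB on Linux, and the exact numbers depend heavily on the system):

import resource

class A(object):
    pass
    # add __slots__ = ["x", "y", "z"] here to measure the slots variant

objects = [A() for i in xrange(1000000)]
for obj in objects:
    obj.x = 1
    obj.y = 2
    obj.z = 3

print resource.getrusage(resource.RUSAGE_SELF).ru_maxrss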

Conclusion

Maps are a powerful approach to shrinking the memory used by many similar instances. I think they can be pushed even further (e.g. by adding information about the types of the attributes) and plan to do so in the following months. Details will be forthcoming.

Wednesday, November 10, 2010

Speeding up PyPy by donations

PyPy joins the Software Freedom Conservancy

Good news. PyPy is now a member of the Software Freedom Conservancy (SFC), see the SFC blog post. This allows us to manage non-profit monetary aspects of the project independently from a company or particular persons. So we can now officially receive donations both from people preferring the right or the left side -- see the Donate buttons on our home page and our blog. You can use PayPal or Google Checkout; donations are tax-exempt in the USA and hopefully soon in Europe as well.

What's it going to get used for? For the immediate future we intend to use the donations for funding travels of core contributors to PyPy sprints who otherwise can't afford to come. So if you have no time but some money you can help to encourage coding contributors to care for PyPy. If we end up with bigger sums we'll see and take suggestions. Money spending decisions will be done by core PyPy people according to non-profit guidelines. And we'll post information from time to time about how much we got and where the money went.

If you have any questions regarding the SFC membership or donations you may send email to sfc at pypy.org which will be observed by Carl Friedrich Bolz, Jacob Hallen and Holger Krekel - the initial PyPy SFC representatives on behalf of the PyPy team. Many thanks go out to Bradley M. Kuhn for helping to implement the PyPy SFC membership.

cheers,

Holger & Carl Friedrich

Tuesday, November 9, 2010

A snake which bites its tail: PyPy JITting itself

We have to admit: even if we have been writing for years about the fantastic speedups that the PyPy JIT gives, we, the PyPy developers, still don't use it for our daily routine. Until today :-).

Readers brave enough to run translate.py to translate PyPy by themselves surely know that the process takes quite a long time to complete, about an hour on super-fast hardware and even more on average computers. Unfortunately, it happened that translate.py was a bad match for our JIT and thus ran much slower on PyPy than on CPython.

One of the main reasons is that the PyPy translation toolchain makes heavy use of custom metaclasses, and until a few weeks ago metaclasses disabled some of the central optimizations which make PyPy so fast. During the recent Düsseldorf sprint, Armin and Carl Friedrich fixed this problem and re-enabled all the optimizations even in the presence of metaclasses.

So, today we decided that it was time to benchmark PyPy against itself again. First, we tried to translate PyPy using CPython as usual, with the following command line (on a machine with an "Intel(R) Xeon(R) CPU W3580 @ 3.33GHz" and 12 GB of RAM, running a 32-bit Ubuntu):

$ python ./translate.py -Ojit targetpypystandalone --no-allworkingmodules

... lots of output, fractals included ...

[Timer] Timings:
[Timer] annotate                       ---  252.0 s
[Timer] rtype_lltype                   ---  199.3 s
[Timer] pyjitpl_lltype                 ---  565.2 s
[Timer] backendopt_lltype              ---  217.4 s
[Timer] stackcheckinsertion_lltype     ---   26.8 s
[Timer] database_c                     ---  234.4 s
[Timer] source_c                       ---  480.7 s
[Timer] compile_c                      ---  258.4 s
[Timer] ===========================================
[Timer] Total:                         --- 2234.2 s

Then, we tried the same command line with PyPy (SVN revision 78903, x86-32 JIT backend, downloaded from the nightly build page):

$ pypy-c-78903 ./translate.py -Ojit targetpypystandalone --no-allworkingmodules

... lots of output, fractals included ...

[Timer] Timings:
[Timer] annotate                       ---  165.3 s
[Timer] rtype_lltype                   ---  121.9 s
[Timer] pyjitpl_lltype                 ---  224.0 s
[Timer] backendopt_lltype              ---   72.1 s
[Timer] stackcheckinsertion_lltype     ---    7.0 s
[Timer] database_c                     ---  104.4 s
[Timer] source_c                       ---  167.9 s
[Timer] compile_c                      ---  320.3 s
[Timer] ===========================================
[Timer] Total:                         --- 1182.8 s

Yes, it's not a typo: PyPy is almost two times faster than CPython! Moreover, we can see that PyPy is faster in each of the individual steps apart from compile_c, which consists of just a call to make to invoke gcc. The slowdown comes from the fact that the Makefile also contains a lot of calls to the trackgcroot.py script, which happens to perform badly on PyPy; we have not investigated why yet.

However, there is also a drawback: on this specific benchmark, PyPy consumes much more memory than CPython. The reason why the command line above contains --no-allworkingmodules is that if we include all the modules, the translation crashes when it is 99% complete, because it consumes all of the 4 GB of memory addressable by a 32-bit process.

A partial explanation is that so far the assembler generated by the PyPy JIT is immortal, and the memory allocated for it is never reclaimed. This is clearly bad for a program like translate.py, which is divided into several independent steps, and for which most of the code generated in each step could safely be thrown away once the step is completed.

If we switch to 64-bit we can address the whole 12 GB of RAM that we have, and thus translating with all working modules is no longer an issue. This is the time taken with CPython (note that it does not make sense to compare with the 32-bit CPython translation above, because that one does not include all the modules):

$ python ./translate.py -Ojit

[Timer] Timings:
[Timer] annotate                       ---  782.7 s
[Timer] rtype_lltype                   ---  445.2 s
[Timer] pyjitpl_lltype                 ---  955.8 s
[Timer] backendopt_lltype              ---  457.0 s
[Timer] stackcheckinsertion_lltype     ---   63.0 s
[Timer] database_c                     ---  505.0 s
[Timer] source_c                       ---  939.4 s
[Timer] compile_c                      ---  465.1 s
[Timer] ===========================================
[Timer] Total:                         --- 4613.2 s

And this is for PyPy:

$ pypy-c-78924-64 ./translate.py -Ojit

[Timer] Timings:
[Timer] annotate                       ---  505.8 s
[Timer] rtype_lltype                   ---  279.4 s
[Timer] pyjitpl_lltype                 ---  338.2 s
[Timer] backendopt_lltype              ---  125.1 s
[Timer] stackcheckinsertion_lltype     ---   21.7 s
[Timer] database_c                     ---  187.9 s
[Timer] source_c                       ---  298.8 s
[Timer] compile_c                      ---  650.7 s
[Timer] ===========================================
[Timer] Total:                         --- 2407.6 s

The results are comparable with the 32-bit case: PyPy is still almost 2 times faster than CPython. And it also shows that our 64-bit JIT backend is as good as the 32-bit one. Again, the drawback is in the consumed memory: CPython used 2.3 GB while PyPy took 8.3 GB.

Overall, the results are impressive: we knew that PyPy can be good at optimizing small benchmarks and even middle-sized programs, but as far as we know this is the first example in which it heavily optimizes a huge, real world application. And, believe us, the PyPy translation toolchain is complex enough to contain all kinds of dirty tricks and black magic that make Python lovable and hard to optimize :-).

Sunday, October 31, 2010

Düsseldorf Sprint Report 2010

This year's installment of the yearly PyPy Düsseldorf Sprint is drawing to a close. As usual, we worked in the seminar room of the programming language group at the University of Düsseldorf. The sprint was different from previous ones in that we had fewer people than usual and many of them actually live in Düsseldorf all the time.

David spent the sprint working on the arm-backend branch, which is adding an ARM backend to the JIT. With the help of Armin he added support for bridges in the JIT and generally implemented missing operations, mostly for handling integers so far.

Ronny and Anto worked the whole week trying to come up with a scheme for importing PyPy's SVN history into a Mercurial repository without losing too much information. This is a non-trivial task, because PyPy's history is gnarly. We are nearly at revision 79000 and when we started using it, Subversion was at version 0.1. All possible and impossible ways to mangle and mistreat a Subversion repository have been applied to PyPy's repo, so most of the importing tools just give up. Ronny and Anto came up with a new plan and new helper scripts every day, only to then discover another corner case that they hadn't thought of. Now they might actually have a final plan (but they said that every day, so who knows?).

The branch history of PyPy's repository (every box is a branch)

Carl Friedrich and Lukas started working in earnest on memory benchmarks to understand the memory behaviour of Python code better. They have now implemented a generic memory benchmark runner and a simple analysis that walks all objects and collects size information about them. They also added some benchmarks that were proposed in the comments of the recent call for benchmarks. As soon as some results from that work are there, we will post about them.

There were also some minor tasks performed during the sprint. Armin implemented the _bisect module and the dict.popitem method in RPython. Armin and Carl Friedrich made the new memory-saving mapdict implementation more suitable to use without the JIT (blog post should come about that too, at some point). They also made classes with custom metaclasses a lot faster when the JIT is used.

The last three days of the sprint were spent working on Håkan's jit-unroll-loops branch. The branch is meant to move loop invariants out of the loop, using techniques very similar to what is described in the recent post on escape analysis across loop boundaries (see? it will soon stop being science-fiction). Some of the ideas of this approach also come from LuaJIT which also uses very aggressive loop invariant code motion in its optimizers. Moving loop invariants outside of the loop is very useful, because many of the lookups that Python programs do in loops are loop invariants. An example is if you call a function in a loop: The global lookup can often be done only once.

This branch fundamentally changes some of the core assumptions of the JIT, so it is a huge amount of work to make it fit with all the other parts and to adapt all tests. That work is now nearly done; some failing tests remain. The next steps are to fix them and then do additional tests with the translated executable and look at the benchmarks.

Monday, October 25, 2010

The peace of green

No, we are not going to talk about the environment (i.e., the set of variables as printed by /usr/bin/env. What else? :-)).

After months in which we had a couple of tests failing every day, we finally managed to turn (almost) everything green today, at least on Linux. Enjoy this screenshot taken from the nightly build page:

As usual, the full buildbot results can be seen from the summary page.

cheers, Anto

Friday, October 22, 2010

PhD Thesis about PyPy's CLI JIT Backend

Hi all,

A few months ago I finished my PhD studies and now my thesis is available, just in case someone does not have anything better to do than read it :-).

The title of the thesis is High performance implementation of Python for CLI/.NET with JIT compiler generation for dynamic languages, and it's mainly based on my work on the CLI backend for the PyPy JIT (note that the CLI JIT backend is currently broken on trunk, but it's still working in the cli-jit branch).

The thesis might be useful also for people that are not directly interested in the CLI JIT backend, as it also contains general information about the inner workings of PyPy which are independent of the backend: in particular, chapters 5 and 6 explain how the JIT frontend works.

Here is the summary of chapters:
  1. Introduction
  2. The problem
  3. Enter PyPy
  4. Characterization of the target platform
  5. Tracing JITs in a nutshell
  6. The PyPy JIT compiler generator
  7. The CLI JIT backend
  8. Benchmarks
  9. Conclusion and Future Work

cheers, Anto

Friday, October 1, 2010

Next PyPy sprint

Hi all,

The next PyPy sprint is scheduled for the end of the month, from the 25th to the 31st of October 2010. It will be held at the University of Düsseldorf, Germany, where three of us are working.

Please see this link for more information.

Thursday, September 23, 2010

PyPy in Google's Summer of Code 2010

Hello.

This year we had a record of two and a half applications (one was on a cross section of PyPy and numpy) accepted for the Google SoC program. Since it ended a couple of weeks ago, we wanted to present the results that were achieved. All three projects were completed successfully, although the rate of success varied quite a bit.

The Numpy proposal progressed significantly on making numpy compatible with PyPy's CPython extension module support, but did not manage to bring PyPy's numpy implementation into a usable shape (which is a somewhat ambitious goal, one might argue). The experiments done during the project live on the micronumpy branch.

The Fast ctypes proposal did some useful experiments on how to JIT external calls from PyPy to C; however, the actual code as of now is not very interesting, and it's quite far from providing a full ctypes replacement (or equivalent).

Definitely the most successful proposal was a 64-bit (x86_64) backend for PyPy's JIT. It includes not only a working 64-bit JIT (merged into PyPy trunk), but also a working asmgcc for the x86_64 Linux platform, which makes it possible to run the JIT on this architecture with our advanced garbage collectors. One can say that x86_64 is now no longer a second-class citizen for PyPy, although it definitely didn't receive as much testing as the x86 platform. Expect this to be a major selling point for the next PyPy release :-)

Cheers, fijal & the PyPy team

Wednesday, September 22, 2010

Using Escape Analysis Across Loop Boundaries for Specialization

This blog post is a successor to the one about escape analysis in PyPy's JIT. The examples from there will be continued here. This post is a bit science-fictiony. The algorithm that PyPy currently uses is significantly more complex and much harder to follow than the one described here. The resulting behaviour is very similar, however, so we will use the simpler version (and we might switch to it at some point in the actual implementation).

In the last blog post we described how escape analysis can be used to remove many of the allocations of short-lived objects and many of the type dispatches that are present in a non-optimized trace. In this post we will improve the optimization to also handle more cases.

To understand some more what the optimization described in the last blog post can achieve, look at the following figure:

new lifetimes

The figure shows a trace before optimization, together with the lifetime of various kinds of objects created in the trace. It is executed from top to bottom. At the bottom, a jump is used to execute the same loop another time. For clarity, the figure shows two iterations of the loop. The loop is executed until one of the guards in the trace fails, and the execution is aborted.

Some of the operations within this trace are new operations, which each create a new instance of some class. These instances are used for a while, e.g. by calling methods on them, reading and writing their fields. Some of these instances escape, which means that they are stored in some globally accessible place or are passed into a function.

Together with the new operations, the figure shows the lifetimes of the created objects. Objects in category 1 live for a while, and are then just not used any more. The creation of these objects is removed by the optimization described in the last blog post.

Objects in category 2 live for a while and then escape. The optimization of the last post deals with them too: the new that creates them and the field accesses are deferred, until the point where the object escapes.

The objects in category 3 and 4 are in principle like the objects in category 1 and 2. They are created, live for a while, but are then passed as an argument to the jump operation. In the next iteration they can either die (category 3) or escape (category 4).

The optimization of the last post considered the passing of an object along a jump to be equivalent to escaping. It was thus treating objects in category 3 and 4 like those in category 2.

The improved optimization described in this post will make it possible to deal better with objects in category 3 and 4. This will have two consequences: on the one hand, more allocations are removed from the trace (which is clearly good). As a side-effect of this, the traces will also be type-specialized.

Optimizing Across the Jump

Let's look at the final trace obtained in the last post for the example loop. The final trace was much better than the original one, because many allocations were removed from it. However, it also still contained allocations:

step 1

The two new BoxedIntegers stored in p15 and p10 are passed into the next iteration of the loop. The next iteration will check that they are indeed BoxedIntegers, read their intval fields and then not use them any more. Thus those instances are in category 3.

In its current state the loop allocates two BoxedIntegers at the end of every iteration, which then die very quickly in the next iteration. In addition, the type checks at the start of the loop are superfluous, at least after the first iteration.

The reason why we cannot optimize the remaining allocations away is that their lifetime crosses the jump. To improve the situation, a little trick is needed. The trace above represents a loop, i.e. the jump at the end jumps to the beginning. Where in the loop the jump occurs is arbitrary, since the loop can only be left via failing guards anyway. Therefore it does not change the semantics of the loop to put the jump at another point in the trace, and we can move the jump operation just above the allocation of the objects that appear in the current jump. This needs some care, because the arguments to jump are all currently live variables, thus they need to be adapted.

If we do that for our example trace above, the trace looks like this:

step 2

Now the lifetime of the remaining allocations no longer crosses the jump, and we can run our escape analysis a second time, to get the following trace:

step 3

This result is now really good. The code performs the same operations as the original code, but using direct CPU arithmetic and no boxing, as opposed to the original version which used dynamic dispatching and boxing.

Looking at the final trace it is also completely clear that specialization has happened. The trace corresponds to the situation in which the trace was originally recorded, which happened to be a loop where BoxedIntegers were used. The now resulting loop does not refer to the BoxedInteger class at all any more, but it still has the same behaviour. If the original loop had used BoxedFloats, the final loop would use float_* operations everywhere instead (or even be very different, if the object model had user-defined classes).
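
To make this concrete, the fully optimized loop body might look roughly like the following. This is a schematic sketch in the trace notation of the previous post, not the exact trace from the figure; i0 and i1 stand for the unboxed intval fields of the two BoxedIntegers that used to be passed along the jump:

# arguments to the loop: i0 (intval of y), i1 (intval of res)
i2 = int_add(i1, i0)        # res.add(y)
i3 = int_add(i2, -100)      # .add(BoxedInteger(-100))
i4 = int_add(i0, -1)        # y.add(BoxedInteger(-1))
i5 = int_gt(i4, 0)          # y.is_positive()
guard_true(i5)
jump(i4, i3)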

Entering the Loop

The approach of placing the jump at some other point in the loop leads to one additional complication that we glossed over so far. The beginning of the original loop corresponds to a point in the original program, namely the while loop in the function f from the last post.

Now recall that in a VM that uses a tracing JIT, all programs start by being interpreted. This means that when f is executed by the interpreter, it is easy to go from the interpreter to the first version of the compiled loop. After the jump is moved and the escape analysis optimization is applied a second time, this is no longer easily possible. In particular, the new loop expects two integers as input arguments, while the old one expected two instances.

To make it possible to enter the loop directly from the interpreter, there needs to be some additional code that enters the loop by taking as input arguments what is available to the interpreter, i.e. two instances. This additional code corresponds to one iteration of the loop, which is thus peeled off:

step 4

Summary

The optimization described in this post can be used to optimize away allocations in category 3 and improve allocations in category 4, by deferring them until they are no longer avoidable. A side-effect of these optimizations is also that the optimized loops are specialized for the types of the variables that are used inside them.

Monday, September 13, 2010

Escape Analysis in PyPy's JIT

The goal of a just-in-time compiler for a dynamic language is obviously to improve the speed of the language over an implementation of the language that uses interpretation. The first goal of a JIT is thus to remove the interpretation overhead, i.e. the overhead of bytecode (or AST) dispatch and the overhead of the interpreter's data structures, such as operand stack etc. The second important problem that any JIT for a dynamic language needs to solve is how to deal with the overhead of boxing of primitive types and of type dispatching. Those are problems that are usually not present in statically typed languages.

Boxing of primitive types means that dynamic languages need to be able to handle all objects, even integers, floats, etc. in the same way as user-defined instances. Thus those primitive types are usually boxed, i.e. a small heap-structure is allocated for them, that contains the actual value.

Type dispatching is the process of finding the concrete implementation that is applicable to the objects at hand when performing a generic operation. An example would be the addition of two objects: the addition needs to check what the concrete objects to be added are, and choose the implementation that fits them.

Last year, we wrote a blog post and a paper about how PyPy's meta-JIT approach works. These explain how the meta-tracing JIT can remove the overhead of bytecode dispatch. In this post (and probably a followup) we want to explain how the traces that are produced by our meta-tracing JIT are then optimized to also remove some of the overhead more closely associated to dynamic languages, such as boxing overhead and type dispatching. The most important technique to achieve this is a form of escape analysis that we call virtual objects. This is best explained via an example.

Running Example

For the purpose of this blog post, we are going to use a very simple object model, that just supports an integer and a float type. The objects support only two operations, add, which adds two objects (promoting ints to floats in a mixed addition) and is_positive, which returns whether the number is greater than zero. The implementation of add uses classical Smalltalk-like double-dispatching. These classes could be part of the implementation of a very simple interpreter written in RPython.

class Base(object):
    def add(self, other):
        """ add self to other """
        raise NotImplementedError("abstract base")
    def add__int(self, intother):
        """ add intother to self, where intother is a Python integer """
        raise NotImplementedError("abstract base")
    def add__float(self, floatother):
        """ add floatother to self, where floatother is a Python float """
        raise NotImplementedError("abstract base")
    def is_positive(self):
        """ returns whether self is positive """
        raise NotImplementedError("abstract base")

class BoxedInteger(Base):
    def __init__(self, intval):
        self.intval = intval
    def add(self, other):
        return other.add__int(self.intval)
    def add__int(self, intother):
        return BoxedInteger(intother + self.intval)
    def add__float(self, floatother):
        return BoxedFloat(floatother + float(self.intval))
    def is_positive(self):
        return self.intval > 0

class BoxedFloat(Base):
    def __init__(self, floatval):
        self.floatval = floatval
    def add(self, other):
        return other.add__float(self.floatval)
    def add__int(self, intother):
        return BoxedFloat(float(intother) + self.floatval)
    def add__float(self, floatother):
        return BoxedFloat(floatother + self.floatval)
    def is_positive(self):
        return self.floatval > 0.0

Using these classes to implement arithmetic shows the basic problem that a dynamic language implementation has. All the numbers are instances of either BoxedInteger or BoxedFloat, thus they consume space on the heap. Performing many arithmetic operations produces lots of garbage quickly, thus putting pressure on the garbage collector. Using double dispatching to implement the numeric tower needs two method calls per arithmetic operation, which is costly due to the method dispatch.

To understand the problems more directly, let us consider a simple function that uses the object model:

def f(y):
    res = BoxedInteger(0)
    while y.is_positive():
        res = res.add(y).add(BoxedInteger(-100))
        y = y.add(BoxedInteger(-1))
    return res

The loop iterates y times, and computes something in the process. To understand the reason why executing this function is slow, here is the trace that is produced by the tracing JIT when executing the function with y being a BoxedInteger:

# arguments to the trace: p0, p1
# inside f: res.add(y)
guard_class(p1, BoxedInteger)
    # inside BoxedInteger.add
    i2 = getfield_gc(p1, intval)
    guard_class(p0, BoxedInteger)
        # inside BoxedInteger.add__int
        i3 = getfield_gc(p0, intval)
        i4 = int_add(i2, i3)
        p5 = new(BoxedInteger)
            # inside BoxedInteger.__init__
            setfield_gc(p5, i4, intval)
# inside f: BoxedInteger(-100)
p6 = new(BoxedInteger)
    # inside BoxedInteger.__init__
    setfield_gc(p6, -100, intval)

# inside f: .add(BoxedInteger(-100))
guard_class(p5, BoxedInteger)
    # inside BoxedInteger.add
    i7 = getfield_gc(p5, intval)
    guard_class(p6, BoxedInteger)
        # inside BoxedInteger.add__int
        i8 = getfield_gc(p6, intval)
        i9 = int_add(i7, i8)
        p10 = new(BoxedInteger)
            # inside BoxedInteger.__init__
            setfield_gc(p10, i9, intval)

# inside f: BoxedInteger(-1)
p11 = new(BoxedInteger)
    # inside BoxedInteger.__init__
    setfield_gc(p11, -1, intval)

# inside f: y.add(BoxedInteger(-1))
guard_class(p0, BoxedInteger)
    # inside BoxedInteger.add
    i12 = getfield_gc(p0, intval)
    guard_class(p11, BoxedInteger)
        # inside BoxedInteger.add__int
        i13 = getfield_gc(p11, intval)
        i14 = int_add(i12, i13)
        p15 = new(BoxedInteger)
            # inside BoxedInteger.__init__
            setfield_gc(p15, i14, intval)

# inside f: y.is_positive()
guard_class(p15, BoxedInteger)
    # inside BoxedInteger.is_positive
    i16 = getfield_gc(p15, intval)
    i17 = int_gt(i16, 0)
# inside f
guard_true(i17)
jump(p15, p10)

(indentation corresponds to the stack level of the traced functions).

The trace is inefficient for a couple of reasons. One problem is that it repeatedly and redundantly checks the classes of the objects involved, using guard_class instructions. In addition, some new BoxedInteger instances are constructed with the new operation, only to be used once and forgotten shortly afterwards. In the next section, we will see how this can be improved upon, using escape analysis.

Virtual Objects

The main insight for improving the code shown in the last section is that some of the objects created in the trace with a new operation don't survive very long and are collected by the garbage collector soon after their allocation. Moreover, they are used only inside the loop, so it is easy to prove that nothing else in the program stores a reference to them. The idea for improving the code is therefore to analyze which objects never escape the loop; those need not be allocated at all.

This process is called escape analysis. The escape analysis of our tracing JIT works by using virtual objects: The trace is walked from beginning to end and whenever a new operation is seen, the operation is removed and a virtual object is constructed. The virtual object summarizes the shape of the object that is allocated at this position in the original trace, and is used by the escape analysis to improve the trace. The shape describes where the values that would be stored in the fields of the allocated objects come from. Whenever the optimizer sees a setfield that writes into a virtual object, that shape summary is thus updated and the operation can be removed. When the optimizer encounters a getfield from a virtual, the result is read from the virtual object, and the operation is also removed.

In the example from the last section, the following operations would produce two virtual objects and be completely removed from the optimized trace:

p5 = new(BoxedInteger)
setfield_gc(p5, i4, intval)
p6 = new(BoxedInteger)
setfield_gc(p6, -100, intval)

The virtual object stored in p5 would know that it is a BoxedInteger and that its intval field contains i4; the one stored in p6 would know that its intval field contains the constant -100.

The following operations, which use p5 and p6, could then be optimized using that knowledge:

guard_class(p5, BoxedInteger)
i7 = getfield_gc(p5, intval)
# inside BoxedInteger.add
guard_class(p6, BoxedInteger)
# inside BoxedInteger.add__int
i8 = getfield_gc(p6, intval)
i9 = int_add(i7, i8)

The guard_class operations can be removed, because the classes of p5 and p6 are known to be BoxedInteger. The getfield_gc operations can be removed and i7 and i8 are just replaced by i4 and -100. Thus the only remaining operation in the optimized trace would be:

i9 = int_add(i4, -100)

The rest of the trace is optimized similarly.
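
To make this process more concrete, here is a heavily simplified sketch of such an optimization pass written in plain Python. This is not PyPy's actual optimizer: the tuple-based operation format, the VirtualObject class and the optimize function are made up for illustration, and many details (guards that can fail, escaping operands, aliasing) are ignored.

# Illustrative sketch only -- not PyPy's real optimizer or operation format.
class VirtualObject(object):
    """Shape summary for one removed 'new' operation."""
    def __init__(self, cls):
        self.cls = cls        # class name given to 'new'
        self.fields = {}      # field name -> variable name or constant

def optimize(trace):
    virtuals = {}   # variable name -> VirtualObject
    replace = {}    # results of removed getfields -> known values
    out = []
    def lookup(v):
        return replace.get(v, v)
    for op in trace:
        kind = op[0]
        if kind == 'new':
            _, result, cls = op
            virtuals[result] = VirtualObject(cls)           # drop the allocation
        elif kind == 'setfield_gc' and op[1] in virtuals:
            _, obj, value, field = op
            virtuals[obj].fields[field] = lookup(value)     # only update the shape
        elif kind == 'getfield_gc' and op[2] in virtuals:
            _, result, obj, field = op
            replace[result] = virtuals[obj].fields[field]   # read from the shape
        elif kind == 'guard_class' and op[1] in virtuals:
            assert virtuals[op[1]].cls == op[2]             # statically known: drop it
        else:
            out.append((kind,) + tuple(lookup(a) for a in op[1:]))
    return out

Running this sketch on the operations discussed above, written as tuples, indeed collapses them to the single int_add:

trace = [('new', 'p5', 'BoxedInteger'),
         ('setfield_gc', 'p5', 'i4', 'intval'),
         ('new', 'p6', 'BoxedInteger'),
         ('setfield_gc', 'p6', -100, 'intval'),
         ('guard_class', 'p5', 'BoxedInteger'),
         ('getfield_gc', 'i7', 'p5', 'intval'),
         ('guard_class', 'p6', 'BoxedInteger'),
         ('getfield_gc', 'i8', 'p6', 'intval'),
         ('int_add', 'i9', 'i7', 'i8')]
print(optimize(trace))   # [('int_add', 'i9', 'i4', -100)]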

So far we have only described what happens when virtual objects are used in operations that read and write their fields. When the virtual object is used in any other operation, it cannot stay virtual. For example, when a virtual object is stored in a globally accessible place, the object needs to actually be allocated, as it will live longer than one iteration of the loop.

This is what happens at the end of the trace above, when the jump operation is hit. At this point the arguments of the jump are virtual objects. Before the jump is emitted, they are forced: the optimizer produces code that allocates a new object of the right type and sets its fields to the values recorded in the virtual object. Thus, instead of just the jump, the following operations are emitted:

p15 = new(BoxedInteger)
setfield_gc(p15, i14, intval)
p10 = new(BoxedInteger)
setfield_gc(p10, i9, intval)
jump(p15, p10)

Note how the operations for creating these two instances have been moved down the trace. At first glance it looks like we did not gain much for these operations, because the objects are still allocated at the end. However, the optimization is still worthwhile even in this case, because some operations that were performed on the forced virtual objects have been removed (several getfield_gc and guard_class operations).
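
In the sketch from the previous section, an operation such as jump that consumes a still-virtual object would trigger exactly this forcing step. A minimal version of it, using the same made-up operation format and VirtualObject class as before, could look like this:

# illustrative helper, matching the made-up format of the sketch above
def force(varname, virtuals, out):
    """Emit allocation code for a virtual object that is about to escape."""
    vobj = virtuals.pop(varname)
    out.append(('new', varname, vobj.cls))
    for field, value in vobj.fields.items():
        out.append(('setfield_gc', varname, value, field))

Calling force on p15 and p10 just before emitting the jump reproduces the new/setfield_gc pairs shown above.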

The final optimized trace of the example looks like this:

# arguments to the trace: p0, p1
guard_class(p1, BoxedInteger)
i2 = getfield_gc(p1, intval)
guard_class(p0, BoxedInteger)
i3 = getfield_gc(p0, intval)
i4 = int_add(i2, i3)
i9 = int_add(i4, -100)

guard_class(p0, BoxedInteger)
i12 = getfield_gc(p0, intval)
i14 = int_add(i12, -1)

i17 = int_gt(i14, 0)
guard_true(i17)
p15 = new(BoxedInteger)
setfield_gc(p15, i14, intval)
p10 = new(BoxedInteger)
setfield_gc(p10, i9, intval)
jump(p15, p10)

The optimized trace contains only two allocations, instead of the original five, and only three guard_class operations, down from the original seven.

Summary

In this blog post we described how simple escape analysis within the scope of one loop works. This optimization reduces the allocation of many intermediate data structures that quickly become garbage in an interpreter. It also removes a lot of the type-dispatching overhead. In a later post, we will explain how this optimization can be improved further.

Tuesday, August 17, 2010

EuroPython 2010 Videos available

Hi all,

the videos of the talks from EuroPython 2010 are now available on blip.tv: in particular, there are the three videos of the PyPy talk.

Part 1: What's new in PyPy 1.2 and 1.3 (by Antonio Cuni)

Part 2: Just in Time compilation (by Armin Rigo)

Part 3: cpyext (by Amaury Forgeot d'Arc)

Moreover, Mark Shannon's talk, which compares HotPy, Unladen Swallow and PyPy, is also available.

Call for Benchmarks

As you know, a lot of PyPy's recent development effort has gone into speeding up execution of Python programs. However, an additional good property of PyPy's Python interpreter is that most objects are represented in a much more compact way than in CPython. We would like to investigate some more advanced techniques to reduce the memory usage of Python programs further.

To do this it is necessary to investigate the memory behaviour of real programs with large heaps. For speed measurements there are standard benchmarks, but for memory improvements there is nothing comparable; the memory behaviour of large programs is not that well understood. Therefore we are looking for programs that we can study and use as benchmarks.

Specifically we are looking for Python programs with the following properties:

  • large heaps of about 10MB-1GB
  • should have non-trivial runtime as well (in the range of a few seconds), to judge the speed impact of optimizations
  • ideally pure-Python programs that don't use extension modules so that they run under both CPython and PyPy (this is optional, but makes my life much easier).

We are also rather interested in programs that do a lot of string/unicode processing.

We would be grateful for all ideas. Telling us about a program also has the advantage that we will work on optimizing PyPy for it :-).

Monday, August 2, 2010

PyOhio

This weekend I delivered a talk at PyOhio (an annual conference in Columbus, OH, USA) on PyPy and Unladen Swallow. The talk covered the reasons why Python, the language, is hard to optimize, why CPython is slow, and a few optimizations that PyPy and Unladen Swallow have implemented. The slides from my talk are online, and the talk was recorded, so a video will follow. I gave a similar talk to ChiPy (the Chicago Python user group), which was also recorded; that video is already available. Both audiences were excited about the future of PyPy and Unladen Swallow, and about the prospect of a faster Python.

Alex

Using virtualenv with PyPy

Thanks to the work that was recently done on the sys-prefix branch, it is now possible to use virtualenv with PyPy.

To try it, you need:

  • a recent version of PyPy: PyPy 1.3 does not contain the necessary logic to work with virtualenv, so you need a more recent PyPy from subversion trunk. You can either build it by yourself or download one of our precompiled nightly builds
  • a copy of virtualenv-pypy: this is a fork of virtualenv that contains all the patches needed to work with PyPy, and hopefully will be merged back at some point. It should be totally compatible with the official version of virtualenv, so it is safe to use it even to create non-PyPy environments. If you notice some weird behavior that does not happen with the standard virtualenv, please let us know.

The directory layout has been redesigned so that it is possible to use virtualenv to install PyPy either from a precompiled tarball or from an svn checkout:

# from a tarball
$ virtualenv -p /opt/pypy-c-jit-76426-linux/bin/pypy my-pypy-env

# from the svn checkout
$ virtualenv -p /path/to/pypy-trunk/pypy/translator/goal/pypy-c my-pypy-env

Once the environment has been created, you can enter it as usual. Note that bin/python is now a symlink to bin/pypy.

Enjoy it :-)

Tuesday, July 27, 2010

A Play on Regular Expressions

The paper from which the algorithms described in the recent blog posts were taken is now available. It is written as a play in three Acts with a cast of three, and is very readable and funny. The Haskell code is at Sebastian Fischer's github pages.

Monday, July 26, 2010

EuroPython 2010 report

So, EuroPython 2010 is over; I am flying home, and it's time to write a report about the conference from the PyPy point of view.

As usual, the conference was very interesting and went very well. The quality of the talks I attended was high on average, and most importantly I could meet a lot of interesting people to discuss various things with.

On the first day, Armin, Amaury and I presented the usual PyPy status talk (here are the slides). The talk is an extended version of the one that Armin and I presented at Pycon Italia in May and is divided into three parts: first I talked about the current status of the project and the content of the recent 1.2 and 1.3 releases, and showed a demo of a simple Django application that renders a Mandelbrot fractal and is measurably faster on PyPy than on CPython. In the second part of the talk, Armin gave an introduction to the ideas behind the JIT. Finally, in the third part Amaury explained how the new cpyext module lets PyPy compile and load existing CPython extensions written in C.

I think that the talk was well received; the only drawback is that there was no time to answer questions at the end of the presentation. However, we received a lot of "offline" questions after the talk finished and throughout the whole conference. It is always great to see that people are interested in our work, and I'd like to thank everybody for the feedback they gave us.

PyPy was also mentioned in Mark Shannon's interesting talk, where he compared the optimization techniques used by PyPy, Unladen Swallow and HotPy, which is Mark's own PhD project. Moreover, Henrik Vendelbo gave a talk about how to tweak PyPy to produce a standalone executable that embeds a whole Python application to make deployment easier, while Andrew Francis explained his implementation of the Go select statement based on the stackless.py module implemented in PyPy. Personally, I am glad to see that people are starting to think of PyPy as a useful starting point to experiment with new features and use cases that we did not think about: after all, one of PyPy's explicit goals is to be "flexible and easy to experiment with".

After the conference there were the usual post-EuroPython sprints: this year we had not planned a PyPy sprint, but some people showed interest in it, and since Armin and I happened to still be around the day after the conference, we decided to do a mini 1-day sprint, with 6 or 7 people present. Since there were only two core developers it was impossible to use our usual pairing scheme, in which every newcomer pairs with someone who is experienced with the source code to gain knowledge of it. However, I think it was still a successful day of work, and we managed to fix a couple of bugs that were sitting in our issue tracker. Again, I'd like to thank all the people who came and worked with us during the sprint.

In conclusion, I really enjoyed the EuroPython 2010 experience: the fact that I managed to find a place in Birmingham where I could eat a good Italian-style "gelato" helped a lot :-).