
Wednesday, March 14, 2012


Google App Engine, Python 2.7, and lxml

Google App Engine (GAE) lets you build and run applications on Google's infrastructure. I have built two applications with GAE and Python. The previous version of my tool for fetching Blogger avatars (avafavico) used regular expressions to parse the Blogger profile page for the profile photo or, if no profile photo is found, the favicon of the first blog the user contributes to. As you might know, regular expressions are not the best tool for parsing HTML. It is better to parse the page into a document object model (DOM) tree and look for the results there, so that the code is less sensitive to changes in the page's HTML. Of course, I had written the regular expressions so that they should tolerate some variation in the HTML.

Previously I used the GAE Python 2.5 environment, which was the default and supported version. On February 27th Python 2.7 became fully supported. The GAE Python 2.7 runtime bundles more external libraries than the 2.5 runtime, and one of those libraries is lxml.

I searched for different solutions and examples, but there were not many. With GAE Python 2.5, one can use BeautifulSoup, but there are some issues (the problematic 3.1.0 version, the uncertain development future of BeautifulSoup, etc.). There is also minidom, but it may not handle broken HTML well. The Blogger profile page should not have broken HTML, but you never know. For this job lxml is the better fit: it is fast, its HTML parser tolerates broken markup, and it supports XPath.
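To illustrate the point about broken HTML, here is a minimal sketch. The markup and the example.com URL are made up for the illustration and are not taken from a real Blogger page; the only lxml calls used are etree.HTML and xpath.

```python
from lxml import etree

# Hypothetical, deliberately broken markup: <div> and <p> are never closed.
broken_html = (
    '<html><body><div>'
    '<img id="profile-photo" src="http://example.com/a.png">'
    '<p>no closing tags anywhere'
)

# etree.HTML uses lxml's forgiving HTML parser and returns the root element.
tree = etree.HTML(broken_html)

# xpath() returns a list of matches; here, the src attribute values.
photo_urls = tree.xpath("//img[@id='profile-photo']/@src")
print(photo_urls)
```

Even though the tags are left open, the parser still builds a tree that the XPath query can search.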

The day before yesterday I updated the GAE app to use Python 2.7 and lxml. I found few, if any, examples of using python27, lxml and GAE together, so I'll show you a working one here. I started with the file app.yaml: I changed runtime to python27, added "threadsafe" (set to false), and added lxml (version "latest") to the libraries section. I also increased the version number to 5. Now app.yaml looks like this:


In blogava.py I added "from lxml import etree" and then used etree functions instead of regular expressions to find things in the HTML. Here is how the DOM tree is constructed (the variable "result" contains the page HTML as a string) and how XPath is then applied to the tree:

>>> from lxml import etree
>>> tree = etree.HTML(result)
>>> r = tree.xpath("//img[@id='profile-photo']/@src")

This looks up the img tag whose id is set to "profile-photo" and gets that tag's src attribute (xpath() returns the matches as a list). In the full script, if no image with id='profile-photo' is found, the code searches for the first image that has the class "photo". If both fail, it searches for the first "contributed-to" blog and uses its favicon; if that is not found either, it uses the Blogger favicon. And here is the blogava.py source file:
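As a rough illustration of the fallback order described above (a sketch, not the original blogava.py), the lookups could be chained like this. The function name find_avatar_url, the default favicon URL, the assumption that the "contributed-to" element carries that name as a class and contains a link to the blog, and the "/favicon.ico" path are all assumptions made for this example.

```python
from lxml import etree

# Assumed default; the real script may use a different URL.
BLOGGER_FAVICON = "http://www.blogger.com/favicon.ico"


def _has_class(name):
    # XPath 1.0 idiom for "the class attribute contains this whole word".
    return "contains(concat(' ', normalize-space(@class), ' '), ' %s ')" % name


def find_avatar_url(page_html):
    """Return an avatar URL for a profile page, trying each fallback in turn."""
    if not page_html.strip():
        return BLOGGER_FAVICON
    tree = etree.HTML(page_html)

    # 1. Profile photo, identified by its id.
    hits = tree.xpath("//img[@id='profile-photo']/@src")
    if hits:
        return hits[0]

    # 2. First image that has the class "photo".
    hits = tree.xpath("//img[%s]/@src" % _has_class("photo"))
    if hits:
        return hits[0]

    # 3. First "contributed-to" blog link (assumed markup); use its favicon.
    hits = tree.xpath("//*[%s]//a/@href" % _has_class("contributed-to"))
    if hits:
        return hits[0].rstrip("/") + "/favicon.ico"

    # 4. Nothing found: fall back to the Blogger favicon.
    return BLOGGER_FAVICON
```

Each step only runs when the previous one returned an empty list, so the first match in the order above wins.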


This new version of avafavico has been up and running for two days. I'm very pleased that I got lxml working with GAE; all in all it was quite easy. I hope this example is useful to someone. If it helped you, please leave a comment. :)

3 comments:

bubuli said...

I found this very useful.

Being a n00b myself on GAE, it took me a while to figure out on Eclipse that you need to reference the lxml location in your GAE PyDev project otherwise you see that nagging red line in your .py source...and naturally Python interpreter won't be able to resolve the lxml library.

bubuli said...

BTW, according to this: http://lxml.de/FAQ.html#can-i-use-threads-to-concurrently-access-the-lxml-api lxml is thread-safe as of 2.2...and since GAE uses 2.3, you can use threadsafe: true for your app.

MS-potilas said...

@bubuli Thanks for the info. To benefit from multithreading, though, I think I would need to build some kind of synchronization, such as a semaphore, around avatar loading and memcache handling. Otherwise two threads (requests) could both check whether the avatar of the same "user 123" is in memcache and, if it is not, both would start loading it. So I think handling the requests sequentially, one after another (threadsafe: false), works best for this code.
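The kind of synchronization meant here could look roughly like the sketch below. It is only an illustration using the standard threading module: a plain dict stands in for memcache, and get_avatar, slow_load_avatar and the example.com URL are hypothetical names invented for the example. A lock like this only coordinates threads inside one process, so it would not prevent two separate instances from loading the same avatar.

```python
import threading
import time

cache = {}                       # stand-in for memcache (hypothetical)
load_count = {}                  # how many times each avatar was really loaded
_locks = {}                      # one lock per user id
_locks_guard = threading.Lock()  # protects the _locks dict itself


def _lock_for(user_id):
    # Create the per-user lock exactly once, even under concurrent calls.
    with _locks_guard:
        return _locks.setdefault(user_id, threading.Lock())


def slow_load_avatar(user_id):
    # Placeholder for fetching and parsing the profile page.
    load_count[user_id] = load_count.get(user_id, 0) + 1
    time.sleep(0.01)
    return "http://example.com/avatars/%s.png" % user_id


def get_avatar(user_id):
    url = cache.get(user_id)
    if url is not None:
        return url
    with _lock_for(user_id):
        # Re-check: another thread may have filled the cache while we waited.
        url = cache.get(user_id)
        if url is None:
            url = slow_load_avatar(user_id)
            cache[user_id] = url
        return url


# Eight concurrent "requests" for the same user.
threads = [threading.Thread(target=get_avatar, args=("123",)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the per-user lock, only the first request does the slow load; the others wait and then read the cached value.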
