Sunday, June 8, 2008

X.Org 7.3: The Good, The Bad and The Ugly (3)

It was Good, it was Bad, and finally it went horribly Ugly. The so-called "User Experience", that is.

The story so far: after upgrading X.Org to version 7.3, my laptop would lock up completely at startup, when switching from the console display to the graphical display. After some futzing around, I isolated the problem to the ATI display driver.

What now? My options seemed clear:
  1. downgrade the driver (and, due to dependencies, all of X.Org) to the previous, working, version, file a bug report, and then wait for a fix...
  2. apply one of the workarounds that I found, file a bug report, and then wait for a fix...
Being me, I started exploring a third option: try to debug and fix it myself, file a bug report containing a patch that fixes the problem, and then wait for it to be included upstream...

I went over to the ATI driver page on the Debian PTS, and found out that the package source code repository is managed with Git. This was great news.

In brief, Git provides a tool called git bisect that (in theory) allows anyone (including non-programmers - again, in theory) to find the cause of a software bug by isolating the single bad commit (i.e. a single batch of source code modifications) that introduced it. It does this with a binary search between a known good version and a known bad version. There's no guarantee, of course, that the problem is caused by a single commit. I decided to play the optimist (for a change) and dived in - head first.

First things first: install Git, like this:
aptitude install git-core gitk
If you're running a firewall, you'd better open port 9418 for outgoing TCP connections. I use shorewall:
  1. add the following line to /etc/shorewall/rules:
    ACCEPT      $FW      net        tcp     9418
  2. restart the firewall
    invoke-rc.d shorewall restart
Next, clone the source code repository:
git clone git://
Now figure out how to build, install and test it, which, in this case, is as simple as:
cd xserver-xorg-video-ati; dpkg-buildpackage -rfakeroot -b -tc -uc
dpkg -i ../xserver-xorg-video-ati_6.8.0-1_i386.deb
... and then reboot.

This is where the fun starts. You start out by telling Git that a bisection process has started and marking the current version as bad:
cd xserver-xorg-video-ati
git bisect start
git bisect bad
We now need to mark the previous version as good:
git checkout -f xserver-xorg-video-ati-1_6.6.3-4; git clean -d -f
git bisect good
Git responds by selecting a commit halfway between the bad and good commits:
Bisecting: 426 revisions left to test after this
[2f87bff293a343b40c1be096933a5ae126632468] RADEON: Fix subtle change in crtc reg init
At this point we need to build this halfway snapshot, test it and tell Git if it works or not with git bisect good or git bisect bad, respectively.
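
For readers who want to try the mechanics without rebooting anything, here is a minimal sketch on a throwaway repository. Everything in it is an assumption made for illustration (the file names status.txt and counter.txt, the ten commits, the "bug" planted in commit 7); it has nothing to do with the driver itself. It also uses git bisect run, which lets a command decide good or bad through its exit status, instead of the manual build-and-reboot cycle described here.

```shell
#!/bin/sh
set -e

# Hypothetical identity so the commits work on an unconfigured machine.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.invalid

# Throwaway repository in a temporary directory.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Ten commits; from commit 7 onwards status.txt says "fails" (the planted bug).
i=1
while [ "$i" -le 10 ]; do
    if [ "$i" -ge 7 ]; then echo fails > status.txt; else echo works > status.txt; fi
    echo "$i" > counter.txt
    git add status.txt counter.txt
    git commit -q -m "commit $i"
    i=$((i + 1))
done

# Mark the newest commit as bad and the very first commit as good.
git bisect start >/dev/null
git bisect bad HEAD >/dev/null
git bisect good "$(git rev-list --max-parents=0 HEAD)" >/dev/null

# Let Git drive the search: exit status 0 means good, 1 means bad.
git bisect run grep -q works status.txt >/dev/null

# While the bisection is still active, refs/bisect/bad names the first bad commit.
first_bad=$(git log -1 --format=%s refs/bisect/bad)
echo "first bad commit: $first_bad"

git bisect reset >/dev/null 2>&1
```

Run as written, the script should report "commit 7" as the first bad commit. When the test is "does the laptop survive a reboot", there is no command to hand to git bisect run, so each step has to be marked by hand.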

So much for theory. I couldn't build the halfway snapshot that I got! The problem was rather odd - there was no debian sub-directory. I figured out what had happened by using gitk to inspect the commit history in the repository.

It turns out that the Debian package Git repository contains both the downstream-only files (i.e. the debian directory and its contents) and the upstream source code. Occasionally, when a new version of the driver's package is being prepared, upstream commits are pulled into the downstream repository and merged. The debian directory is missing from the upstream history (and this is as it should be), so whenever Git bisects the downstream repository it is very likely to check out a tree without this directory.

My solution to this was to have two clones of the package Git repository - one used only for bisection, and the other for actual package building and testing. After each bisection step I pulled from the first repository into the second, which I had reset beforehand to the previous working version. This way I got a tree that included both the debian directory and the commits up to the current bisection point.
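
The two-clone trick can be sketched on a toy repository. All names here are hypothetical (the branches upstream and packaging, the tag good-version, the clones bisect-clone and build-clone, the file driver.c), and the sketch uses git fetch followed by git merge where I used a plain pull, because recent Git versions may refuse to pull divergent histories without extra configuration.

```shell
#!/bin/sh
set -e

# Hypothetical identity so the commits and merges work anywhere.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.invalid

work=$(mktemp -d)
cd "$work"

# Toy package repository: "upstream" carries only source code,
# "packaging" adds debian/ and merges upstream from time to time.
git init -q pkg
cd pkg
git symbolic-ref HEAD refs/heads/upstream
echo v1 > driver.c
git add driver.c
git commit -q -m "upstream 1"

git checkout -q -b packaging
mkdir debian
echo rules > debian/rules
git add debian
git commit -q -m "add debian/"
git tag good-version            # stand-in for the last working package version

git checkout -q upstream
echo v2 > driver.c
git commit -q -am "upstream 2"
mid=$(git rev-parse HEAD)       # stand-in for a commit picked by git bisect
echo v3 > driver.c
git commit -q -am "upstream 3"

git checkout -q packaging
git merge -q --no-edit upstream
cd ..

# One clone for bisecting, one for building.
git clone -q pkg bisect-clone
git clone -q pkg build-clone

# The bisection lands on an upstream commit: no debian/ directory there.
git -C bisect-clone checkout -q "$mid"
[ -d bisect-clone/debian ] || echo "bisect-clone: no debian directory"

# Reset the build clone to the known good version, then merge in
# whatever the bisect clone currently has checked out.
git -C build-clone checkout -q -f good-version
git -C build-clone clean -q -d -f
git -C build-clone fetch -q ../bisect-clone HEAD
git -C build-clone merge -q --no-edit FETCH_HEAD

ls build-clone                  # lists both debian and driver.c
```

The build clone ends up with the debian directory from the good version plus the source code as of the bisection point, which is exactly what dpkg-buildpackage needs.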

It took around 13 iterations (read: around 13 reboots) before I hit the jackpot (did I mention that this is the Ugly part of my story?).
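
As a rough sanity check on that number, a bisection halves the set of candidate commits at every step. The sketch below counts the halvings for a purely linear history. The helper name bisect_steps is hypothetical, and the figure of 853 candidates is an assumption derived from the "426 revisions left" message (roughly twice that, plus the commit being tested). Real histories with merges do not split this evenly, so treat the result as an estimate only.

```shell
#!/bin/sh

# Count how many halvings it takes to narrow n candidates down to one.
bisect_steps() {
    n=$1
    steps=0
    while [ "$n" -gt 1 ]; do
        n=$(( (n + 1) / 2 ))    # round up: the larger half may be the one left
        steps=$((steps + 1))
    done
    echo "$steps"
}

bisect_steps 853
```

For 853 candidates this prints 10, the same order of magnitude as the number of reboots it actually took.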

Eventually Git informed me that
80eee856938756e1222526b6c39cee8b5252b409 is first bad commit
RADEON: fix console restore on netbsd
This looked very relevant, but after inspecting the source code I was stumped: it was obvious that some hardware registers/modes were being saved and restored, but to what end? What did this "fix" actually fix? And, more importantly, what did this NetBSD-related fix break on my box?

The only fix I could come up with was to revert the effect of this modification - but only under Linux. And what do you know? It solved my problem! I incorporated my fix into the current version, and it worked fine (in case you're keeping count: two more reboots).

I reported the bug on the Debian BTS, complete with a patch (see bug #480312). My patch was committed to the upstream Git repository a few days later.

A happy ending?

I later spent some time browsing through more of the code, and my fix seemed right at home: the driver's code contains quite a few fragments that are enabled or disabled depending on both hardware type and target platform. It's quite obvious that the upstream author(s) of the driver need all the help they can get - the task they took upon themselves isn't easy.

I have a strong suspicion that it will break again - I just hope that I'll upgrade my hardware by then...
