Determining which entries the user has already seen

Historical perspective

Earlier versions of this program used the following approach to determine which changelog or NEWS entries (hereafter "entries") are new and should be displayed to the user:

This approach was based on two assumptions, neither of which is always true:

  1. Assume that the version numbers of all binary packages built from the same source package are in the same series.
  2. Assume that the version numbers of entries always match those package version numbers.

For an example of where these assumptions break down, look at the dmsetup package:

This approach was also limited in that it only looked at NEWS.Debian[.gz], changelog.Debian[.gz], changelog.Debian.arch[.gz], and changelog[.gz]. For an example of where this fails, again look at dmsetup, which has changelog.Debian.devmapper.gz.

Another technique used in earlier versions of this program was to use heuristics to ignore version number suffixes that should not be considered when evaluating whether a particular entry was new. These heuristics were brittle and could lead to missed entries or to entries being displayed multiple times.

Current approach

The current approach abandons the dependency on version numbers and relies instead on entry checksums.

The program maintains a persistent database of previously seen changelog entries containing the following data:

We index content by source package because the same changelog entries frequently appear in multiple binary packages built from the same source package, and we only want the user to see those once.

We remove the header line of each entry in the second set of checksums because sometimes a package version uploaded to stable and a different version uploaded to unstable use different header lines for the same changelog entry.
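As a minimal sketch of the two kinds of checksum described above, the following computes one checksum over the whole entry and one over the entry with its header line removed. The function name entry_checksums, the use of SHA-256, and the sample entries are assumptions made for illustration; this note does not say which hash function the program actually uses.

```python
import hashlib
from typing import Tuple


def entry_checksums(entry_text: str) -> Tuple[str, str]:
    """Return (checksum of the full entry, checksum without the header line).

    Hypothetical helper: SHA-256 is an assumption, not necessarily the
    hash the program uses.
    """
    full = hashlib.sha256(entry_text.encode("utf-8")).hexdigest()
    # The header is the first line of the entry; hash everything after it.
    _header, _, body = entry_text.partition("\n")
    headerless = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return full, headerless


# Two made-up uploads of the same change with different header lines.
BODY = "\n  * Fix crash on startup.\n\n -- A <a@example.org>  Mon, 01 Jan 2024 00:00:00 +0000\n"
stable_entry = "foo (1.0-1+deb12u1) stable; urgency=medium" + BODY
unstable_entry = "foo (1.0-2) unstable; urgency=medium" + BODY

stable_sums = entry_checksums(stable_entry)
unstable_sums = entry_checksums(unstable_entry)
print(stable_sums[0] == unstable_sums[0])  # full checksums differ
print(stable_sums[1] == unstable_sums[1])  # headerless checksums match
```

Because the headerless checksums match, the two uploads can be recognized as the same entry even though their header lines differ.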

Given this stored data, the filtering algorithm is simple: Ignore any entry whose content checksum is in the database, and stop reading a file when we hit a complete entry that's already in the database.

The database used by the current approach is significantly larger than the database required for the historical approach -- a few megabytes vs. a few kilobytes -- but it is still relatively small, and we consider this an acceptable amount of space to use for a significantly better-performing algorithm.

Because this approach uses entry checksums, it is able to include entries from files like changelog.Debian.devmapper that the historical approach ignored.

Edge case: no database, or no data for a file in the database

When the persistent database is not being used in a particular invocation of the program, or when there is no data for a particular file in the database, then the above approach requires modification.

In this case, we read and calculate checksums for the same path on disk to seed the database before we parse the file in the package.
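A rough sketch of this seeding step is shown below. It assumes that entries follow the Debian changelog layout, in which only the header line of an entry starts in the first column, and that SHA-256 is the hash. The names read_text, split_entries and seed_from_disk are hypothetical, and the two sets stand in for the database.

```python
import gzip
import hashlib
import os
from typing import Iterator, Set


def _digest(text: str) -> str:
    # Assumed hash function; the real program may use a different one.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def read_text(path: str) -> str:
    """Read a changelog or NEWS file, transparently handling .gz files."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as fh:
        return fh.read()


def split_entries(text: str) -> Iterator[str]:
    """Split text into entries, assuming headers start in column 0."""
    current = []
    for line in text.splitlines(keepends=True):
        starts_entry = bool(line.strip()) and not line[0].isspace()
        if starts_entry and current:
            yield "".join(current)
            current = []
        # Lines before the first header are skipped.
        if current or starts_entry:
            current.append(line)
    if current:
        yield "".join(current)


def seed_from_disk(path: str, seen_full: Set[str], seen_content: Set[str]) -> int:
    """Record checksums for every entry in the installed copy of a file.

    Returns the number of entries read; 0 if the file is not installed.
    """
    if not os.path.exists(path):
        return 0
    count = 0
    for entry in split_entries(read_text(path)):
        _header, _, body = entry.partition("\n")
        seen_full.add(_digest(entry))
        seen_content.add(_digest(body))
        count += 1
    return count
```

After seeding, the file from the new package can be filtered against the same sets, so only entries that are absent from the installed copy are treated as new.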

Edge case: no database, changelog data from network

When the persistent database is not being used in a particular invocation of the program, and the changelog data for a package is being fetched over the network because it is not present in the package, there is no reliable way to determine which changelog entries have been displayed already, so the program displays all of them.

This is sufficiently rare, both because the program is usually used with a persistent database and because there are relatively few packages without embedded changelogs, that it is considered an acceptable performance degradation to exchange for better overall performance.

It is also preferable to the historical approach because it errs by displaying extra information to the user rather than by failing to display data that it should have.