A database within pyrpm is a set of rpms. Basic operations supported by databases are:
open, close, read, clear - NOPs for some classes
clearPkgs - remove tags from rpms to reduce memory usage
isFilelistImported, importFilelist - NOPs for non repo dbs
reloadDependencies - needed after loading filelist
adding and removing rpms - some do that in memory others directly write to disk
in operator
getMemoryCopy - a copy of the database that can be modified in memory
iterate over Provides, Requires, Conflicts, Obsoletes, Triggers and Files (PRCOTFs)
search for name and PRCOTFs
getFileRequires, getPkgsFileRequires
Most features are implemented in separate classes. Those features are brought together either by inheritance or by using instances of other classes.
RpmDatabase - abstract super class
RpmDB - the on-disk rpm db(4)
RpmDiskShadowDB - allows virtual removals from the db that are not written to disk but instead are just filtered from all results
RpmMemoryDB - in-memory db that builds hashes for searching; works with all kinds of rpms
RpmRepoDB - Yum repository, reads data into memory
SqliteRepoDB - uses the yum sqlite db
RhnChannelRepoDB - deals with RHN channels, which are very similar to Yum repositories
RpmExternalSearchDB - uses another db (sqlite) for searching while maintaining its own list of rpms. All rpms must be contained in the external db!
JointDB - treats several dbs as one
RhnRepoDB - RHN repository. Work is done by RhnChannelRepoDB instances
RpmShadowDB - current state during resolving - see RpmYum.pydb use case below
Although databases are used in more or less every script, two use cases within pyrpmyum cover all database classes.
"->" means holding a pointer to an/several instance(s) of another class
Database containing all rpms that are used to resolve dependencies. After creation this database is read only.
JointDB
  -> SqliteRepoDB - one per repository
  -> RhnRepoDB - optional
    -> RhnChannelRepoDB - one per channel
  -> RpmMemoryDB - containing rpms given at the command line (optional)
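The idea of treating several databases as one can be pictured with a minimal sketch. The class and method names used here (`MemoryDB`, `JointDB`, `searchProvides`) are illustrative assumptions and not necessarily pyrpm's actual API; the sketch only shows a joint database forwarding a search to every database it holds and merging the results.

```python
class MemoryDB:
    """Hypothetical stand-in for a database that keeps its data in memory."""

    def __init__(self, provides):
        # provides: dict mapping a provide name to a list of package names
        self.provides = provides

    def searchProvides(self, name):
        return list(self.provides.get(name, []))


class JointDB:
    """Hypothetical sketch: treat several databases as one."""

    def __init__(self, dbs):
        # "->" in the diagram: hold references to other database instances
        self.dbs = list(dbs)

    def searchProvides(self, name):
        # Ask every contained database and concatenate the answers.
        result = []
        for db in self.dbs:
            result.extend(db.searchProvides(name))
        return result


repo = MemoryDB({"libfoo.so.1": ["foo-libs-1.0-1"]})
cmdline = MemoryDB({"libfoo.so.1": ["foo-libs-1.1-1"], "bar": ["bar-2.0-1"]})
joint = JointDB([repo, cmdline])
matches = joint.searchProvides("libfoo.so.1")
```

The caller never needs to know how many repositories or channels sit behind the joint database.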
Database used for resolving. Rpms are added to and removed from that db, and the searches for resolving dependencies are performed on it. All modifications are kept in memory. It uses the RpmYum.repos and the RpmDB for searching and filters the results down to the rpms that have not yet been removed or that have been added. That way neither linear search nor building additional hashes is needed.
RpmShadowDB
  -> RpmExternalSearchDB - keeps track of rpms installed from the repos
    -> RpmYum.repos - used for searches. See above for details
  -> RpmDiskShadowDB - keeps track of the rpms deleted in the RpmDB
    -> RpmDB - used for searches
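The filtering approach described for this use case can be sketched as follows. This is a simplified illustration under stated assumptions: the two dicts stand in for the searchable RpmDB and RpmYum.repos, and the names `removed`, `added` and `search_name` are hypothetical, not pyrpm identifiers.

```python
# Stand-ins for the two read-only, searchable databases.
installed = {"bash": ["bash-3.0-1"], "zsh": ["zsh-4.2-1"]}  # plays the RpmDB
repos = {"bash": ["bash-3.1-1"], "vim": ["vim-7.0-1"]}      # plays RpmYum.repos

removed = set()  # rpms virtually deleted from the installed set
added = set()    # rpms virtually installed from the repos


def search_name(name):
    """Search both databases and filter the results in memory."""
    still_installed = [p for p in installed.get(name, []) if p not in removed]
    newly_added = [p for p in repos.get(name, []) if p in added]
    return still_installed + newly_added


# Simulate an update of bash during resolving; nothing is written to disk.
removed.add("bash-3.0-1")
added.add("bash-3.1-1")
bash_result = search_name("bash")
vim_result = search_name("vim")
```

Only two small sets change while resolving, so the underlying databases stay untouched and their existing search structures are reused.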
Nowadays there are several "formats" in which information about rpm packages can be found. The most typical one is of course the binary rpm header, which is part of every binary rpm package. A typical binary rpm package looks like this:
+------+-----------+--------+--------------+
| Lead | Signature | Header | Gzipped CPIO |
+------+-----------+--------+--------------+
The lead has a fixed size of 96 bytes and contains some very basic information about the binary rpm. It can also generally be used to determine whether a file is a binary rpm or not (e.g. using file), as it contains some very specific values that make such files easy to identify.
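A check of this kind can be sketched in a few lines. The 96 byte size comes from the text above; the four magic bytes are the ones found at the start of the rpm lead and are not stated in this document. `looks_like_rpm` is a hypothetical helper name.

```python
RPM_LEAD_SIZE = 96                   # fixed size of the lead in bytes
RPM_LEAD_MAGIC = b"\xed\xab\xee\xdb"  # first four bytes of the lead


def looks_like_rpm(data):
    """Return True if data starts with something shaped like an rpm lead."""
    return len(data) >= RPM_LEAD_SIZE and data[:4] == RPM_LEAD_MAGIC


# A synthetic lead: the magic followed by zero padding up to 96 bytes.
fake_lead = RPM_LEAD_MAGIC + bytes(RPM_LEAD_SIZE - 4)
is_rpm = looks_like_rpm(fake_lead)
```

For a real file, passing the first 96 bytes read in binary mode would be enough for this test.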
The signature and the header are stored as rpm header structures. Rpm header structures look like this:
+-------+---------+-----------+-----------+-----------+
| Magic | IndexNr | StoreSize | Indexdata | Storedata |
+-------+---------+-----------+-----------+-----------+
The Magic is a hardcoded value, IndexNr the number of index entries and StoreSize the size in bytes of the store data.
Indexdata consists of IndexNr index entries each of which is 16 bytes. Each index entry looks like this:
+-----+------+--------+-------+
| Tag | Type | Offset | Count |
+-----+------+--------+-------+
Tag specifies which tag this entry is about. Type specifies the type of the tag. Offset specifies the offset in the Storedata at which the data for this tag begins. Count gives the size of the entry; its exact meaning depends on the type.
Storedata finally contains the real tag information. As mentioned in the previous paragraph, by using an index entry from the Indexdata you can find and parse all data relevant to a specific tag. The format depends of course on the type of the tag.
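The header structure described above can be walked with the `struct` module. The following is a minimal sketch, assuming big-endian 4 byte integers, the 8 byte magic `8e ad e8 01` followed by four zero bytes, and the type numbers 4 (32 bit integer) and 6 (NUL terminated string); these values are not given in this document. `parse_header` is a hypothetical helper that handles only those two types, and the tag numbers in the synthetic header are chosen for illustration.

```python
import struct

HEADER_MAGIC = b"\x8e\xad\xe8\x01\x00\x00\x00\x00"  # 8 byte magic (assumed)
RPM_INT32_TYPE = 4
RPM_STRING_TYPE = 6


def parse_header(data):
    """Parse an rpm header structure into a {tag: value} dict."""
    if data[:8] != HEADER_MAGIC:
        raise ValueError("bad header magic")
    index_nr, store_size = struct.unpack(">II", data[8:16])
    index_end = 16 + 16 * index_nr          # each index entry is 16 bytes
    store = data[index_end:index_end + store_size]
    tags = {}
    for i in range(index_nr):
        tag, typ, offset, count = struct.unpack_from(">IIII", data, 16 + 16 * i)
        if typ == RPM_STRING_TYPE:
            end = store.index(b"\x00", offset)
            tags[tag] = store[offset:end].decode("ascii")
        elif typ == RPM_INT32_TYPE:
            # For integers, Count is the number of values stored.
            tags[tag] = list(struct.unpack_from(">%dI" % count, store, offset))
        # Other types are skipped in this sketch.
    return tags


# Build a tiny synthetic header: one string tag and one int32 tag.
store = b"bash\x00" + b"\x00" * 3 + struct.pack(">I", 42)
index = struct.pack(">IIII", 1000, RPM_STRING_TYPE, 0, 1)
index += struct.pack(">IIII", 1009, RPM_INT32_TYPE, 8, 1)
header = HEADER_MAGIC + struct.pack(">II", 2, len(store)) + index + store
parsed = parse_header(header)
```

The padding before the integer keeps it on a 4 byte boundary inside the store, which is why its index entry uses offset 8.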
More detailed information about the binary rpm format can be found here: http://www.rpm.org/max-rpm/s1-rpm-file-format-rpm-file-format.html
The rpm binary format can be partially found in the rpmdb as well. The file /var/lib/rpm/Packages contains the complete headers of the original binary rpms in rpm header structure format, without the 8 byte magic and with some additional installation-relevant indexes appended.
Another format that is common nowadays for reduced rpm header data is the repo metadata format used by yum. It is a split-up and reduced version of the original rpm header information using XML. It is mainly useful to determine and resolve dependencies of rpm packages. More information about the metadata can be found here:
http://linux.duke.edu/projects/metadata/
Other, less common storage formats include databases like SQLite or MySQL, which e.g. yum uses to convert the repodata format to a more usable form locally.
Apart from that, rpm itself extracts quite a bit of the information from rpm binary headers and writes it to various db4 files in /var/lib/rpm.
This section describes the structure of the various files in /var/lib/rpm. All files are db4 files, either hash or btree based. With the exception of Packages, all files have the corresponding rpmtag based value as key. The data consists of integer pairs which contain the package id and the index at which this entry can be found in the rpm header of that tag. The values are 4 byte integers in host byte order. For some tags the index doesn't make any sense. In those cases the index value will always be set to 0.
key: Basename (string)
values: list of 2-tuples: installid (4 byte int), basenameindex (4 byte int)
key: Conflictname (string)
values: list of 2-tuples: installid (4 byte int), conflictindex (4 byte int)
key: Dirname (string)
values: list of 2-tuples: installid (4 byte int), dirindex (4 byte int)
key: md5sum (4 * 4 byte int, no hex string!)
values: list of 2-tuples: installid (4 byte int), filemd5sindex (4 byte int)
Only stored if file md5sum exists and if the file is a regular file (usually equivalent)
key: Groupname (string)
values: list of 2-tuples: installid (4 byte int), index (4 byte int) (always 0)
key: Installtime of transaction (4 byte int, time() value)
values: list of 2-tuples: installid (4 byte int), index (4 byte int) (always 0)
key: Packagename (string)
values: list of 2-tuples: installid (4 byte int), index (4 byte int) (always 0)
key: Installid (4 byte int)
values: Complete binary rpm header with some additional information from signature without lead.
key: Providename (string)
values: list of 2-tuples: installid (4 byte int), providenameindex (4 byte int)
key: Provideversion (string)
values: list of 2-tuples: installid (4 byte int), provideversionindex (4 byte int)
key: unknown yet
values: unknown yet
key: Requirename (string)
values: list of 2-tuples: installid (4 byte int), requirenameindex (4 byte int)
Only contains the requirenames that are not install prereqs
key: Requireversion (string)
values: list of 2-tuples: installid (4 byte int), requireversionindex (4 byte int)
key: Sha1header (string) (just as the value from the header)
values: list of 2-tuples: installid (4 byte int), index (4 byte int) (always 0)
key: md5sum from header (4 * 4 byte int)
values: list of 2-tuples: installid (4 byte int), index (4 byte int) (always 0)
key: Triggername (string)
values: list of 2-tuples: installid (4 byte int), triggerindex (4 byte int)
Only contains the first entry for each name from a package
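The value layout shared by these files (pairs of 4 byte integers in host byte order) can be decoded with a few lines of Python. This is a minimal sketch; `decode_pairs` is a hypothetical helper, not part of pyrpm, and it works on a raw value that has already been read from one of the db files.

```python
import struct


def decode_pairs(value):
    """Split a raw db value into (installid, index) pairs of 4 byte ints."""
    if len(value) % 8:
        raise ValueError("value length is not a multiple of 8 bytes")
    # "=" selects host byte order with standard (4 byte) integer sizes.
    return list(struct.iter_unpack("=II", value))


# A synthetic value holding two pairs: (5, 0) and (8, 1).
raw = struct.pack("=IIII", 5, 0, 8, 1)
pairs = decode_pairs(raw)
```

Because the integers are stored in host byte order, a value written on one architecture may need a different byte order prefix when read on another.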
Now an example of the connection between the package headers which are stored in Packages and the rest of the files.
The connection between /var/lib/rpm/Packages and the other files looks like this:
Package id | Requirename | Index |
---|---|---|
5 | a | 0 |
5 | b | 1 |
8 | c | 0 |
8 | a | 1 |
8 | b | 2 |

Requirename | Package Id | Index |
---|---|---|
a | 5 | 0 |
a | 8 | 1 |
b | 5 | 1 |
b | 8 | 2 |
c | 8 | 0 |
That means the complete /var/lib/rpm files can be cross checked with /var/lib/rpm/Packages and can be regenerated from that file as well.
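Regenerating such an index from the package headers can be sketched as below, using the Requirename example from the tables above. `build_requirename_index` is a hypothetical helper, and processing packages in ascending id order is an assumption made for the illustration.

```python
def build_requirename_index(packages):
    """Map each requirename to a list of (package id, index) pairs."""
    index = {}
    for pkgid in sorted(packages):
        # The position of a name inside the header is its index.
        for idx, name in enumerate(packages[pkgid]):
            index.setdefault(name, []).append((pkgid, idx))
    return index


# Requirenames as stored in the headers of packages 5 and 8.
packages = {5: ["a", "b"], 8: ["c", "a", "b"]}
requirename = build_requirename_index(packages)
```

Comparing the regenerated dict with the content of the existing db file gives the cross check mentioned above.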
An exception is Installtid. This db file has as keys the TID, a unique time in seconds since 1970 that identifies a complete transaction. Every header in Packages contains that TID as the "installtid" tag. The values of the Installtid db file are again pairs of integers, with a package id as the first value and the second value always 0. Here is a small example:
Package id | Install Tid |
---|---|
5 | 1000000 |
8 | 1000000 |
6 | 1234567 |
9 | 1234567 |
7 | 2345678 |

Install Tid | Package ID | Index |
---|---|---|
1000000 | 5 | 0 |
1000000 | 8 | 0 |
1234567 | 6 | 0 |
1234567 | 9 | 0 |
2345678 | 7 | 0 |
As you can see, package IDs can get reused, in our example 6. This can happen if a package gets deleted and its ID is "dropped". So there is unfortunately no autoincrementing ID for the packages.
The following things should be noted about the repo metadata. yum uses the repodata only within the resolver part to determine a set of rpms that should be updated and/or installed. Then the complete rpm headers are downloaded and another dependency check from librpm is run, in addition to determining the ordering of the rpm packages.
Here are a few limitations you should be aware of if you want to work with the repodata for more than the resolver or want to understand the limits of the resolver:
Repodata has evolved over time. Until now no version information has been added to the created data; adding it might make sense for future changes.
Even if no epoch is specified in the rpm header, the metadata will specify this as "0". That's the correct way for version and dependency checks.
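The epoch handling can be illustrated with a small sketch that splits a version string of the common `[epoch:]version[-release]` form and fills in "0" when no epoch is given. `split_evr` is a hypothetical helper name and the string form is an assumption; the repodata itself stores the three parts separately.

```python
def split_evr(evr):
    """Split "[epoch:]version[-release]" into (epoch, version, release)."""
    epoch, sep, rest = evr.partition(":")
    if not sep:
        # No epoch given: treat it as "0", as the metadata does.
        epoch, rest = "0", evr
    version, _, release = rest.partition("-")
    return epoch, version, release


without_epoch = split_evr("3.0-1")
with_epoch = split_evr("1:3.0-1")
```

With the default filled in, both tuples can be compared field by field without a special case for a missing epoch.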
Dependency information is often specified like bash >= 3.0 and consists of a (name, flag, version) triple. The flag part is specified as an integer within the rpm header and is only partially copied over into the repodata. Installation ordering of rpm packages is not possible with the currently available data (or only based on reduced data). Future repodata could make the data more complete or just copy the integer into the output to provide an exact copy. (Repodata adds a "pre" flag if the RPMSENSE_PREREQ flag is set. That information is actually not sufficient to distinguish an install prereq from an erase prereq.)
The primary.xml.gz file contains only a subset of the files included in each package. Because of this, some operations cannot be performed equivalently with binary rpms and with repodata headers.
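The loss of information in the flag part can be made concrete with a sketch. The bit values below are the usual RPMSENSE_* values from rpm and the strings are the comparison names used in the repodata; neither is spelled out in this document, and `reduce_flags` is a hypothetical helper that mimics the reduction rather than the actual createrepo code.

```python
RPMSENSE_LESS = 1 << 1
RPMSENSE_GREATER = 1 << 2
RPMSENSE_EQUAL = 1 << 3
RPMSENSE_PREREQ = 1 << 6

COMPARE_NAMES = {
    RPMSENSE_LESS: "LT",
    RPMSENSE_GREATER: "GT",
    RPMSENSE_EQUAL: "EQ",
    RPMSENSE_LESS | RPMSENSE_EQUAL: "LE",
    RPMSENSE_GREATER | RPMSENSE_EQUAL: "GE",
}


def reduce_flags(flags):
    """Return (comparison, pre): the part of the flags the repodata keeps."""
    mask = RPMSENSE_LESS | RPMSENSE_GREATER | RPMSENSE_EQUAL
    return COMPARE_NAMES.get(flags & mask), bool(flags & RPMSENSE_PREREQ)


ge_flags = RPMSENSE_GREATER | RPMSENSE_EQUAL   # as in "bash >= 3.0"
plain = reduce_flags(ge_flags)
# Any further bit set in the header integer disappears in the reduction.
with_extra_bit = reduce_flags(ge_flags | (1 << 9))
```

Two different header integers end up as the same reduced pair, which is why the original flags cannot be reconstructed from the repodata.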
The data eating up RAM in rpm headers are descriptions, changelogs and filelists.
The dependency data we operate with is extremely large. In addition to the Provides: data, which contains shared libs, rpm versions and entries explicitly listed in .spec files, dependency data can also use file requires such as Requires: /usr/bin/foo to reference any file in any other rpm package. That means we potentially have to look at the filelists of all rpm packages. That data is huge, as the current Fedora Core development tree contains more than 350000 files.
As the dependency data is worked with on each client to update the machine, it must be a goal to reduce this data to a smaller subset.
The current repo metadata has a fixed file regex of (.*bin/.*|/etc/.*|/usr/lib/sendmail)$ and a directory regex of (.*bin/.*|/etc/.*)$. That regex specifies the data given in the repodata/primary.xml.gz file, and you have to fall back to the complete filelists available in repodata/filelists.xml.gz if any dependency request falls outside of that data. (The regex gives a deterministic way to know when to load the full filelist.) The regex used to be pretty complete for Fedora Core in the past, but additional file requires are present in newer Fedora Core and Fedora Extras rpm packages, which require a reload of the complete lists.
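The deterministic fallback decision can be sketched with the file regex quoted above. `needs_full_filelist` is a hypothetical helper, and matching from the start of the path (as `re.match` does) is an assumption about how the regex is applied.

```python
import re

# File regex quoted in the text for entries kept in primary.xml.gz.
FILE_RE = re.compile(r"(.*bin/.*|/etc/.*|/usr/lib/sendmail)$")


def needs_full_filelist(path):
    """True if a file requirement cannot be answered from primary.xml.gz."""
    return FILE_RE.match(path) is None


covered = needs_full_filelist("/usr/bin/foo")
not_covered = needs_full_filelist("/usr/share/doc/foo")
```

A resolver could run this test on every file requirement and load filelists.xml.gz only once the first uncovered path shows up.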
In addition to the completeness problems above, it was also noted that the regex lists contain 100 times more data than is actually used in current repositories. Conary is thus maintaining explicit lists of possible file requires. Maybe new ways to add autogenerated, small filelists can be worked out that would work for most common use cases, still with the fallback to the complete lists as yum / createrepo implement right now.
It would also be possible to store dependency graphs that contain data for the resolver to select the right rpm packages plus data for the orderer to specify the right sequence to install them. But many machines do have further packages installed outside of that package set, so this would then mostly be used for new installs. Optimizing the general update path for running machines should be more important than improving the install path for new installs, so this is currently not a goal, but it would very well be possible to do.
Last updated 25-Apr-2007 17:57:12 CEST