Shared MIME-info Database
1. Introduction
1.1. Version
This is version 0.20 of the Shared MIME-info Database specification, last updated 8 October 2010.
2. Unified system
In discussions about the previous systems used by GNOME, KDE and ROX (see the "History and related systems" document), it was clear that the differences between the databases were simply a result of them being separate, and not due to any fundamental disagreements between developers. Everyone is keen to see them merged. This specification proposes:
    A standard way for applications to install new MIME related information.
    A standard way of getting the MIME type for a file.
    A standard way of getting information about a MIME type.
    Standard locations for all the files, and methods of resolving conflicts.
Further, the existing databases have been merged into a single package [SharedMIME].
Applications must be able to extend the database in any way when they are installed, to add both new rules for determining type, and new information about specific types. It must be possible to install applications in /usr, /usr/local and the user's home directory (in the normal Unix way) and have the MIME information used.
This specification uses the XDG Base Directory Specification [BaseDir] to define the prefixes below which the database is stored. In the rest of this document, paths shown with the prefix <MIME> indicate the files should be loaded from the mime subdirectory of every directory in XDG_DATA_HOME:XDG_DATA_DIRS. For example, when using the default paths, "Load all the <MIME>/text/html.xml files" means to load /usr/share/mime/text/html.xml, /usr/local/share/mime/text/html.xml, and ~/.local/share/mime/text/html.xml (if they exist, and in this order). Information found in a directory is added to the information found in previous directories, except when glob-deleteall or magic-deleteall is used to overwrite parts of a mimetype definition.
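The directory order described above can be sketched as follows. This is a minimal illustration, not part of the specification: the function name mime_dirs is hypothetical, and the fallback values for XDG_DATA_HOME and XDG_DATA_DIRS are the defaults from the XDG Base Directory Specification. The XDG variables list the most important directory first, so the sketch reverses them to obtain the load order shown in the example (later directories take precedence).

```python
import os

def mime_dirs(environ=None):
    """Return the <MIME> directories in load order (lowest precedence first)."""
    env = os.environ if environ is None else environ
    home = env.get("HOME") or os.path.expanduser("~")
    # Defaults taken from the XDG Base Directory Specification.
    data_home = env.get("XDG_DATA_HOME") or os.path.join(home, ".local", "share")
    data_dirs = env.get("XDG_DATA_DIRS") or "/usr/local/share/:/usr/share/"
    # XDG lists the most important directory first; loading happens in the
    # opposite order so that later files can extend or override earlier ones.
    ordered = [d for d in data_dirs.split(":") if d][::-1] + [data_home]
    return [os.path.join(d.rstrip("/") or "/", "mime") for d in ordered]
```

With only HOME set to /home/u, this yields /usr/share/mime, /usr/local/share/mime and /home/u/.local/share/mime, matching the order of the text/html.xml example.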
Each application that wishes to contribute to the MIME database will install a single XML file, named after the application, into one of the three <MIME>/packages/ directories (depending on where the user requested the application be installed). After installing, uninstalling or modifying this file, the application MUST run the update-mime-database command, which is provided by the freedesktop.org shared database [SharedMIME].
update-mime-database is passed the mime directory containing the packages subdirectory which was modified as its only argument. It scans all the XML files in the packages subdirectory, combines the information in them, and creates a number of output files. Where the information from these files is conflicting, information from directories lower in the list takes precedence. Any file named Override.xml takes precedence over all other files in the same packages directory. This can be used by tools which let the user edit the database to ensure that the user's changes take effect.
The files created by update-mime-database are:
    <MIME>/globs (contains a mapping from names to MIME types) [deprecated for globs2]
    <MIME>/globs2 (contains a mapping from names to MIME types and glob weight)
    <MIME>/magic (contains a mapping from file contents to MIME types)
    <MIME>/subclasses (contains a mapping from MIME types to types they inherit from)
    <MIME>/icons (contains a mapping from MIME types to icons)
    <MIME>/generic-icons (contains a mapping from MIME types to generic icons)
    <MIME>/XMLnamespaces (contains a mapping from XML (namespaceURI, localName) pairs to MIME types)
    <MIME>/MEDIA/SUBTYPE.xml (one file for each MIME type, giving details about the type, including comment, icon and generic-icon)
    <MIME>/mime.cache (contains the same information as the globs2, magic, subclasses, aliases, icons, generic-icons and XMLnamespaces files, in a binary, mmappable format)
The format of these generated files and the source files in packages are explained in the following sections.
This step serves several purposes. First, it allows applications to quickly get the data they need without parsing all the source XML files (the base package alone is over 700K). Second, it allows the database to be used for other purposes (such as creating the /etc/mime.types file if desired). Third, it allows validation to be performed on the input data, and removes the need for other applications to carefully check the input for errors themselves.
glob elements have a pattern attribute. Any file whose name matches this pattern will be given this MIME type (subject to conflicting rules in other files, of course). There is also an optional weight attribute which is used when resolving conflicts with other glob matches. The default weight value is 50, and the maximum is 100. KDE's glob system replaces GNOME's and ROX's ext/regex fields, since it is trivial to detect a pattern in the form *.ext and store it in an extension hash table internally. The full power of regular expressions was not being used by either desktop, and glob patterns are more suitable for filename matching anyway.
A glob-deleteall element, which indicates that patterns from previously parsed directories must be discarded. The patterns defined in this file (if any) are used instead.
magic elements contain a list of match elements, any of which may match, and an optional priority attribute for all of the contained rules. Low numbers should be used for more generic types (such as gzip compressed data) and higher values for specific subtypes (such as a word processor format that happens to use gzip to compress the file). The default priority value is 50, and the maximum is 100. Each match element has a number of attributes:
    Attribute   Required?   Value
    type        Yes         string, host16, host32, big16, big32, little16, little32 or byte.
    offset      Yes         The byte offset(s) in the file to check. This may be a single number or a range in the form start:end, indicating that all offsets in the range should be checked. The range is inclusive.
    value       Yes         The value to compare the file contents with, in the format indicated by the type attribute.
    mask        No          The number to AND the value in the file with before comparing it to value. Masks for numerical types can be any number, while masks for strings must be in base 16, and start with 0x.
Each element corresponds to one line of file(1)'s magic.mime file. They can be nested in the same way to provide the equivalent of continuation lines. That is, <a><b/><c/></a> means "a and (b or c)".
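The nesting rule and the inclusive offset range can be illustrated with a small evaluator. This is a sketch under stated assumptions: the Match class and both function names are hypothetical, only string-typed matches are handled, and the mask is taken to be already decoded to raw bytes of the same length as the value.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Match:
    # Hypothetical in-memory form of a <match> element (string type only).
    start: int                      # first offset to try
    end: int                        # last offset to try (inclusive)
    value: bytes
    mask: Optional[bytes] = None    # same length as value when present
    children: List["Match"] = field(default_factory=list)

def matches_at(data: bytes, offset: int, value: bytes, mask: Optional[bytes]) -> bool:
    chunk = data[offset:offset + len(value)]
    if len(chunk) < len(value):
        return False                # not enough data at this offset
    if mask is not None:
        # AND the file contents with the mask before comparing.
        chunk = bytes(c & m for c, m in zip(chunk, mask))
    return chunk == value

def evaluate(rule: Match, data: bytes) -> bool:
    """True if rule matches and, when it has children, any child matches."""
    here = any(matches_at(data, off, rule.value, rule.mask)
               for off in range(rule.start, rule.end + 1))
    if not here:
        return False
    return not rule.children or any(evaluate(c, data) for c in rule.children)
```

A rule with two children therefore behaves as "a and (b or c)": the parent must match, and at least one child must match as well.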
A magic-deleteall element, which indicates that magic matches from previously parsed directories must be discarded. The magic defined in this file (if any) is used instead.
alias elements indicate that the type is also sometimes known by another name, given by the type attribute. For example, audio/midi has an alias of audio/x-midi. Note that there should not be a mime-type element defining each alias; a single element defines the canonical name for the type and lists all its aliases.
sub-class-of elements indicate that any data of this type is also some other type, given by the type attribute. See Section 2.11.
comment elements give a human-readable textual description of the MIME type, usually composed of an acronym of the file name extension and a short description, like "ODS spreadsheet". There may be many of these elements with different xml:lang attributes to provide the text in multiple languages.
acronym elements give experienced users a terse idea of the document contents, for example "ODS", "GEDCOM", "JPEG" and "XML". There may be many of these elements with different xml:lang attributes to provide the text in multiple languages, although these should only be used if absolutely necessary.
expanded-acronym elements are the expanded versions of the acronym elements, for example "OpenDocument Spreadsheet", "GEnealogical Data COMmunication", and "eXtensible Markup Language". The purpose of these elements is to provide users a way to look up information on various MIME types or file formats in third-party resources. There may be many of these elements with different xml:lang attributes to provide the text in multiple languages, although these should only be used if absolutely necessary.
icon elements specify the icon to be used for this particular mime-type, given by the name attribute. Generally the icon used for a mimetype is created based on the mime-type by mapping "/" characters to "-", but users can override this by using the icon element to customize the icon for a particular mimetype. This element is not used in the system database, but only used in the user overridden database. Only one icon element is allowed.
generic-icon elements specify the icon to use as a generic icon for this particular mime-type, given by the name attribute. This is used if there is no specific icon (see icon for how these are found). These are used for categories of similar types (like spreadsheets or archives) that can use a common icon. The Icon Naming Specification lists a set of such icon names. If this element is not specified then the mimetype is used to generate the generic icon by using the top-level media type (e.g. "video" in "video/ogg") and appending "-x-generic" (i.e. "video-x-generic" in the previous example). Only one generic-icon element is allowed.
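The two fallback naming rules (used when no icon or generic-icon element is given) are simple string transformations. The function names below are hypothetical and serve only to illustrate the rules stated above.

```python
def default_icon_name(mime_type: str) -> str:
    """Icon name derived from the MIME type by mapping "/" to "-"."""
    return mime_type.replace("/", "-")

def default_generic_icon_name(mime_type: str) -> str:
    """Generic icon name: top-level media type plus "-x-generic"."""
    media = mime_type.split("/", 1)[0]
    return media + "-x-generic"
```

For "video/ogg" these give "video-ogg" and "video-x-generic" respectively.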
root-XML elements have namespaceURI and localName attributes. If a file is identified as being an XML file, these rules allow a more specific MIME type to be chosen based on the namespace and localname of the document element. If localName is present but empty then the document element may have any name, but the namespace must still match.
treemagic elements contain a list of treematch elements, any of which may match, and an optional priority attribute for all of the contained rules. The default priority value is 50, and the maximum is 100. Each treematch element has a number of attributes:
    Attribute    Required?   Value
    path         Yes         A path that must be present on the mounted volume/filesystem. The path is interpreted as a relative path starting at the root of the tested volume/filesystem.
    type         No          The type of path. Possible values: file, directory, link
    match-case   No          Whether path should be matched case-sensitively. Possible values: true, false
    executable   No          Whether the file must be executable. Possible values: true, false
    non-empty    No          Whether the directory must be non-empty. Possible values: true, false
    mimetype     No          The mimetype for the file at path
treematch elements can be nested, meaning that both the outer and the inner treematch must be satisfied for a "match".
Applications may also define their own elements, provided they are namespaced to prevent collisions. Unknown elements are copied directly to the output XML files like comment elements. A typical use for this would be to indicate the default handler application for a particular desktop ("Galeon is the GNOME default text/html browser"). Note that this doesn't indicate the user's preferred application, only the (fixed) default.
In practice, common types such as text/x-diff are provided by the freedesktop.org shared database. Also, only new information needs to be provided, since this information will be merged with other information about the same type.
<?xml version="1.0" encoding="utf-8"?>
<mime-type xmlns="http://www.freedesktop.org/standards/shared-mime-info" type="text/x-diff">
  <!--Created automatically by update-mime-database. DO NOT EDIT!-->
  <comment>Differences between files</comment>
  <comment xml:lang="af">verskille tussen lêers</comment>
  ...
</mime-type>
The glob file is a simple list of lines containing a MIME type and pattern, separated by a colon. It is deprecated in favour of the globs2 file which also lists the weight of the glob rule. The lines are ordered by glob weight. For example:
    # This file was automatically generated by the
    # update-mime-database command. DO NOT EDIT!
    ...
    text/x-diff:*.patch
    text/x-diff:*.diff
    ...
Applications MUST match globs case-insensitively, except when the case-sensitive attribute is set to true. This is so that e.g. main.C will be seen as a C++ file, but IMAGE.GIF will still use the *.gif pattern.
If several patterns of the same weight match then the longest pattern SHOULD be used. In particular, files with multiple extensions (such as Data.tar.gz) MUST match the longest sequence of extensions (eg *.tar.gz in preference to *.gz). Literal patterns (eg, Makefile) must be matched before all others. It is suggested that patterns beginning with *. and containing no other special characters (*?[) should be placed in a hash table for efficient lookup, since this covers the majority of the patterns. Thus, patterns of this form should be matched before other wildcarded patterns.
If a matching pattern is provided by two or more MIME types, applications SHOULD NOT rely on one of them. They are instead supposed to use magic data (see below) to detect the actual MIME type. This is for instance required to deal with container formats like Ogg or AVI, that map various video and/or audio-encoded data to one extension.
There may be several rules mapping to the same type. They should all be merged. If the same pattern is defined twice, then they MUST be ordered by the directory the rule came from, as described above.
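The weight and pattern-length rules can be sketched as below. This is an illustration only: glob_candidates and its tuple layout are hypothetical, Python's fnmatch module is used as an approximation of fnmatch(3), and the literal-first ordering and hash-table optimisation suggested above are left out for brevity.

```python
from fnmatch import fnmatchcase

def glob_candidates(filename, globs):
    """Return the MIME types of the best glob matches for filename.

    globs is an iterable of (weight, mime_type, pattern, case_sensitive).
    """
    hits = []
    for weight, mime, pattern, case_sensitive in globs:
        if case_sensitive:
            ok = fnmatchcase(filename, pattern)
        else:
            # Case-insensitive match: compare lower-cased name and pattern.
            ok = fnmatchcase(filename.lower(), pattern.lower())
        if ok:
            hits.append((weight, len(pattern), mime))
    if not hits:
        return []
    best_weight = max(w for w, _, _ in hits)
    hits = [h for h in hits if h[0] == best_weight]
    longest = max(n for _, n, _ in hits)
    # More than one type in the result means the caller should consult magic.
    return sorted({m for _, n, m in hits if n == longest})
```

A result with a single entry can be used directly; an empty result or several entries means the caller should fall back to the magic data.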
The glob-deleteall element, which means that implementations SHOULD discard information from previous directories, is written out into the globs2 file using __NOGLOBS__ as the pattern. For instance:
    0:text/x-diff:__NOGLOBS__
    50:text/x-diff:*.diff
    ...
In the above example, the mimetype text/x-diff is redefined (for instance in a user's ~/.local/share/mime) to only be associated with the pattern *.diff, so the other patterns like *.patch were removed. The weight in front of the __NOGLOBS__ line is ignored. In a given globs2 file, the __NOGLOBS__ line for a given mimetype is always written out before any other globs for this mimetype.
Lines beginning with # are comments and should be ignored. Everything from the : character to the newline is part of the pattern; spaces should not be stripped. The file is in the UTF-8 encoding. The format of the glob pattern is as for fnmatch(3). The format does not allow a pattern to contain a literal newline character, but this is not expected to be a problem.
Common types (such as MS Word Documents) will be provided in the X Desktop Group's package, which MUST be required by all applications using this specification. Since each application will then only be providing information about its own types, conflicts should be rare.
The fourth field ("cs" in the first globs2 example) contains a list of comma-separated flags. The flags currently defined are: cs (for case-sensitive). Implementations should ignore unknown flags. Implementations should also ignore further fields, so that the syntax of the globs2 file can be extended in the future. Example: "50:text/x-c++src:*.C:cs,newflag:newfeature:somethingelse" should currently be parsed as "50:text/x-c++src:*.C:cs".
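A globs2 line can be parsed along the following lines. This is a minimal sketch: parse_globs2_line and the dictionary keys are hypothetical names, and the sketch assumes that the pattern itself contains no colon. Unknown flags and any fields after the fourth are ignored, as the text requires.

```python
def parse_globs2_line(line):
    """Parse one globs2 line; return a dict, or None for comments and blanks."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split(":")
    if len(fields) < 3:
        return None                     # malformed line, skipped in this sketch
    flags = set(fields[3].split(",")) if len(fields) > 3 and fields[3] else set()
    return {
        "weight": int(fields[0]),
        "mime": fields[1],
        "pattern": fields[2],
        "case_sensitive": "cs" in flags,          # unknown flags are ignored
        "delete_all": fields[2] == "__NOGLOBS__",  # glob-deleteall marker
    }
```

Entries with delete_all set tell the reader to drop globs for that type collected from earlier directories before adding the ones that follow.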
    Part               Example                Meaning
    indent             1                      The nesting depth of the rule, corresponding to the number of > characters in the traditional file format.
    ">" start-offset   >4                     The offset into the file to look for a match.
    "=" value          =\0x0\0x2\0x55\0x40    Two bytes giving the (big-endian) length of the value, followed by the value itself.
    "&" mask           &\0xff\0xf0            The mask, which (if present) is exactly the same length as the value.
    "~" word-size      ~2                     On little-endian machines, the size of each group to byte-swap.
    "+" range-length   +8                     The length of the region in the file to check.
Note that the value, value length and mask are all binary, whereas everything else is textual. Each of the elements begins with a single character to identify it, except for the indent level. The word size is used for byte-swapping. Little-endian systems should reverse the order of groups of bytes in the value and mask if this is greater than one. This only affects host matches (big32 entries still have a word size of 1, for example, because no swapping is necessary, whereas host32 has a word size of 4). The indent, range-length, word-size and mask components are optional. If missing, indent defaults to 0, range-length to 1, the word-size to 1, and the mask to all one bits. Indent corresponds to the nesting depth of the rule. Top-level rules have an indent of zero. The parent of an entry is the preceding entry with an indent one less than the entry. If an unknown character is found where a newline is expected then the whole line should be ignored (there will be no binary data after the new character, so the next line starts after the next "\n" character). This is for future extensions.
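The mixed textual and binary layout of a rule line can be read as sketched below. The function names and the dictionary keys are hypothetical, and the component order (indent, start offset, value, mask, word size, range length) is taken from the descriptions above. Byte-swapping for host types on little-endian machines is not performed here.

```python
import re
import struct

_DIGITS = re.compile(rb"[0-9]+")

def _number(buf, pos):
    """Read an ASCII decimal number at pos; return (value or None, new pos)."""
    m = _DIGITS.match(buf, pos)
    return (int(m.group()), m.end()) if m else (None, pos)

def parse_magic_rule(buf, pos=0):
    """Parse one rule line; return (rule dict or None, position after it)."""
    indent, pos = _number(buf, pos)
    indent = 0 if indent is None else indent        # indent defaults to 0
    if buf[pos:pos + 1] != b">":
        raise ValueError("expected '>' at byte %d" % pos)
    offset, pos = _number(buf, pos + 1)
    if offset is None or buf[pos:pos + 1] != b"=":
        raise ValueError("expected start offset and '=' at byte %d" % pos)
    (length,) = struct.unpack_from(">H", buf, pos + 1)  # big-endian length
    pos += 3
    value = buf[pos:pos + length]                   # binary, may contain "\n"
    pos += length
    mask = None                                     # None stands for all one bits
    if buf[pos:pos + 1] == b"&":
        mask = buf[pos + 1:pos + 1 + length]
        pos += 1 + length
    word_size, range_length = 1, 1
    if buf[pos:pos + 1] == b"~":
        word_size, pos = _number(buf, pos + 1)
    if buf[pos:pos + 1] == b"+":
        range_length, pos = _number(buf, pos + 1)
    if buf[pos:pos + 1] != b"\n":
        # Unknown extension: ignore the whole line, resume after the next "\n".
        end = buf.find(b"\n", pos)
        return None, (len(buf) if end < 0 else end + 1)
    rule = {"indent": indent, "offset": offset, "value": value, "mask": mask,
            "word_size": word_size, "range_length": range_length}
    return rule, pos + 1
```

Because the value is read by its declared length rather than by scanning for a delimiter, a value containing a newline or a ">" byte does not confuse the parser.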
The text/x-diff above example would (on its own) create this magic file:
    00000000  4d 49 4d 45 2d 4d 61 67  69 63 00 0a 5b 35 30 3a  |MIME-Magic..[50:|
    00000010  74 65 78 74 2f 78 2d 64  69 66 66 5d 0a 3e 30 3d  |text/x-diff].>0=|
    00000020  00 05 64 69 66 66 09 0a  3e 30 3d 00 04 2a 2a 2a  |..diff..>0=..***|
    00000030  09 0a 3e 30 3d 00 17 43  6f 6d 6d 6f 6e 20 73 75  |..>0=..Common su|
    00000040  62 64 69 72 65 63 74 6f  72 69 65 73 3a 20 0a     |bdirectories: .|
The magic-deleteall element, which means that implementations SHOULD discard information from previous directories, is written out into the magic file using __NOMAGIC__ as the value:
>0=__NOMAGIC__\n
For example:
http://www.w3.org/1999/xhtml html application/xhtml+xml
The lines are sorted (using strcmp in the C locale) and there are no lines with the same namespaceURI and localName in one file. If the localName was empty then there will be two spaces following the namespaceURI.
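Reading this file and looking up a document element might look like the sketch below. The function names are hypothetical, and the line layout (namespaceURI, localName and MIME type separated by single spaces) is inferred from the example and from the remark about two spaces for an empty localName. The fallback to the empty localName follows the root-XML rule that an empty localName matches any document element in that namespace.

```python
def load_xml_namespaces(lines):
    """Build a {(namespaceURI, localName): mime_type} table from file lines."""
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # An empty localName shows up as two consecutive spaces.
        namespace_uri, local_name, mime_type = line.split(" ")
        table[(namespace_uri, local_name)] = mime_type
    return table

def lookup_xml_type(table, namespace_uri, local_name):
    """Prefer an exact (namespace, name) entry, then a namespace-only entry."""
    exact = table.get((namespace_uri, local_name))
    return exact if exact is not None else table.get((namespace_uri, ""))
```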
For example:
application/msword:x-office-document
The file starts with the magic string "MIME-TreeMagic\0\n". There is no version number in the file. Incompatible changes will be handled by creating both the current treemagic and a newer treemagic2 in the new format. Where possible, changes will be made in a compatible fashion. The rest of the file is made up of a sequence of small sections. Each section is introduced by giving the priority and type in brackets, followed by a newline character. Higher priority entries come first. Example:
[50:x-content/image-dcf]\n
    Meaning
    The nesting depth of the rule.
    The path to match.
    The required file type, one of "file", "directory", "link" or "any"
    Optional for the optional attributes of treematch elements. Possible values are "executable", "match-case", "non-empty", or a MIME type
    MAJOR_VERSION 1
    MINOR_VERSION 2
    ALIAS_LIST_OFFSET
    PARENT_LIST_OFFSET
    LITERAL_LIST_OFFSET
    REVERSE_SUFFIX_TREE_OFFSET
    GLOB_LIST_OFFSET
    MAGIC_LIST_OFFSET
    NAMESPACE_LIST_OFFSET
    ICONS_LIST_OFFSET
    GENERIC_ICONS_LIST_OFFSET
AliasListEntry:
    4 CARD32 ALIAS_OFFSET
    4 CARD32 MIME_TYPE_OFFSET
ParentList:
    4 CARD32 N_ENTRIES
    8*N_ENTRIES ParentListEntry
ParentListEntry:
    4 CARD32 MIME_TYPE_OFFSET
    4 CARD32 PARENTS_OFFSET
Parents:
    4 CARD32 N_PARENTS
    4*N_PARENTS CARD32 MIME_TYPE_OFFSET
LiteralList:
    4 CARD32 N_LITERALS
    12*N_LITERALS LiteralEntry
LiteralEntry:
    4 CARD32 LITERAL_OFFSET
    4 CARD32 MIME_TYPE_OFFSET
    4 CARD32 WEIGHT in lower 8 bits, FLAGS in rest: 0x100 = case-sensitive
N_GLOBS GlobEntry
ReverseSuffixTree:
    4 CARD32 N_ROOTS
    4 CARD32 FIRST_ROOT_OFFSET
ReverseSuffixTreeNode:
    4 CARD32 CHARACTER
    4 CARD32 N_CHILDREN
    4 CARD32 FIRST_CHILD_OFFSET
ReverseSuffixTreeLeafNode:
    4 CARD32 0
    4 CARD32 MIME_TYPE_OFFSET
    4 CARD32 WEIGHT in lower 8 bits
NamespaceList:
    4 CARD32 N_NAMESPACES
    12*N_NAMESPACES NamespaceEntry
NamespaceEntry:
    4 CARD32 NAMESPACE_URI_OFFSET
    4 CARD32 LOCAL_NAME_OFFSET
    4 CARD32 MIME_TYPE_OFFSET
GenericIconsList:
IconsList:
    4 CARD32 N_ICONS
    8*N_ICONS IconListEntry
IconListEntry:
    4 CARD32 MIME_TYPE_OFFSET
    4 CARD32 ICON_NAME_OFFSET
Lists in the file are sorted, to enable binary searching. The list of aliases is sorted by alias, the list of literal globs is sorted by the literal. The SuffixTreeNode siblings are sorted by character. The list of namespaces is sorted by namespace uri. The list of icons is sorted by mimetype. Mimetypes are stored in the suffix tree by appending suffix tree leaf nodes with \0 as character. These nodes appear at the beginning of the list of children. All offsets are in bytes from the beginning of the file.
Strings are zero-terminated. All numbers are in network (big-endian) order. This is necessary because the data will be stored in arch-independent directories like /usr/share/mime or even in users' home directories. Cache files have to be written atomically - write to a temporary name, then move over the old file - so that clients that have the old cache file open and mmapped won't get corrupt data.
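Reading the cache header and a string from the cache can be sketched as follows. Note the assumption: the field sizes of the header are not visible in the listing above, so this sketch assumes two 16-bit version numbers followed by nine 32-bit offsets in the listed order. The names read_header, read_cstring and HEADER_FIELDS are hypothetical.

```python
import struct

# Assumed layout: CARD16 major, CARD16 minor, then nine CARD32 offsets,
# all big-endian (">" also disables alignment padding).
HEADER = struct.Struct(">HH9I")
HEADER_FIELDS = ("alias_list", "parent_list", "literal_list",
                 "reverse_suffix_tree", "glob_list", "magic_list",
                 "namespace_list", "icons_list", "generic_icons_list")

def read_header(buf):
    """Decode the header at the start of a mime.cache buffer."""
    major, minor, *offsets = HEADER.unpack_from(buf, 0)
    if major != 1:
        raise ValueError("unsupported mime.cache major version %d" % major)
    header = {"major": major, "minor": minor}
    header.update(zip(HEADER_FIELDS, offsets))
    return header

def read_cstring(buf, offset):
    """Read a zero-terminated UTF-8 string at a byte offset from file start."""
    end = buf.index(b"\0", offset)
    return buf[offset:end].decode("utf-8")
```

The same functions work on a bytes object or on an mmap of the cache file, since both support unpack_from-style buffer access and slicing.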
2.11. Subclassing
A type is a subclass of another type if any instance of the first type is also an instance of the second. For example, all image/svg files are also text/xml, text/plain and application/octet-stream files. Subclassing is about the format, rather than the category of the data (for example, there is no generic spreadsheet class that all spreadsheets inherit from). Some subclass rules are implicit:
    All text/* types are subclasses of text/plain.
    All streamable types (ie, everything except the inode/* types) are subclasses of application/octet-stream.
In addition to these rules, explicit subclass information may be given using the sub-class-of element. Note that some file formats are also compressed files (application/x-jar files are also application/zip files). However, this is different to a case such as a compressed postscript file, which is not a valid postscript file itself (so application/x-gzpostscript does not inherit from application/postscript, because an application that can handle the latter may not cope with the former).
Some types may or may not be instances of other types. For example, a spreadsheet file may be compressed or not. It is a valid spreadsheet file either way, but only inherits from application/x-gzip in one case. This information cannot be represented statically; instead an application interested in this information should run all of the magic rules, and use the list of types returned as the subclasses.
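Combining the implicit rules with explicit sub-class-of information can be sketched like this. The function name is_a and the explicit_parents mapping are hypothetical; the mapping stands for whatever an implementation has loaded from the subclasses data. A type is treated as matching itself, which is convenient for "equal to or a subclass of" checks.

```python
def is_a(mime, parent, explicit_parents):
    """True if mime equals parent or is a (transitive) subclass of it.

    explicit_parents maps a MIME type to the list of its direct parents
    as declared with sub-class-of.
    """
    seen = set()
    stack = [mime]
    while stack:
        current = stack.pop()
        if current == parent:
            return True
        if current in seen:
            continue                      # guards against cyclic declarations
        seen.add(current)
        stack.extend(explicit_parents.get(current, ()))
        # Implicit rule: every text/* type is a subclass of text/plain.
        if current.startswith("text/") and current != "text/plain":
            stack.append("text/plain")
        # Implicit rule: everything except inode/* is an octet stream.
        if not current.startswith("inode/") and current != "application/octet-stream":
            stack.append("application/octet-stream")
    return False
```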
If a MIME type is provided explicitly (eg, by a Content-Type HTTP header, a MIME email attachment, an extended attribute or some other means) then that should be used instead of guessing. Otherwise, start by doing a glob match of the filename. Keep only globs with the biggest weight. If the patterns are different, keep only globs with the longest pattern, as previously discussed. If after this, there is one or more matching glob, and all the matching globs result in the same mimetype, use that mimetype as the result. If the glob matching fails or results in multiple conflicting mimetypes, read the contents of the file and do magic sniffing on it. If no magic rule matches the data (or if the content is not available), use the default type of application/octet-stream for binary data, or text/plain for textual data. If there was no glob match, use the magic match as the result.
Note: Checking the first 32 bytes of the file for ASCII control characters is a good way to guess whether a file is binary or text, but note that files with high-bit-set characters should still be treated as text since these can appear in UTF-8 text, unlike control characters.
If any of the mimetypes resulting from a glob match is equal to or a subclass of the result from the magic sniffing, use this as the result. This allows us for example to distinguish text files called "foo.doc" from MS-Word files with the same name, as the magic match for the MS-Word file would be application/x-ole-storage which the MS-Word type inherits. Otherwise use the result of the glob match that has the highest weight.
There are several reasons for checking the glob patterns before the magic. First, magic sniffing is very expensive, as reading the contents of the files causes a lot of seeks. Secondly, some applications don't check the magic at all (sometimes the content is not available or too slow to read), and this makes it more likely that both will get the same type. Also, users can easily understand why calling their text file README.mp3 makes the system think it's an MP3, whereas they have trouble understanding why their computer thinks README.txt is a PostScript file. If the system guesses wrongly, the user can often rename the file to fix the problem.
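The checking order can be summarised in code. This is a sketch, not a reference implementation: choose_type, looks_like_text and their parameters are hypothetical. The glob results are assumed to be already reduced to the best matches and ordered by weight, magic sniffing is passed in as a callable so that the file is read only when needed, and the set of bytes counted as "control characters" (below 0x20 except tab, newline, carriage return and form feed) is an assumption.

```python
def looks_like_text(data):
    """Guess text vs binary from the first 32 bytes (assumed control set)."""
    return not any(b < 0x20 and b not in b"\t\n\r\f" for b in data[:32])

def choose_type(glob_types, sniff, is_a, explicit_type=None):
    """Pick a MIME type following the recommended order.

    glob_types: MIME types from glob matching, highest weight first.
    sniff: callable returning (magic_type or None, is_text); it reads the file.
    is_a: callable (mime, parent) -> bool for "equal to or subclass of".
    """
    if explicit_type is not None:
        return explicit_type               # never guess when told the type
    if glob_types and len(set(glob_types)) == 1:
        return glob_types[0]               # unambiguous glob, no file access
    magic_type, is_text = sniff()
    if magic_type is None:
        return "text/plain" if is_text else "application/octet-stream"
    if not glob_types:
        return magic_type
    for candidate in glob_types:
        if is_a(candidate, magic_type):
            return candidate               # glob result consistent with magic
    return glob_types[0]                   # fall back to highest-weight glob
```

Passing sniff as a callable keeps the common case cheap: when the glob result is unambiguous the file contents are never read.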
textual description of one of these objects. The media type inode is provided for this purpose, with the following types corresponding to the standard types of object found in a Unix filesystem:
    inode/blockdevice
    inode/chardevice
    inode/directory
    inode/fifo
    inode/mount-point
    inode/socket
    inode/symlink
An inode/mount-point is a subclass of inode/directory. It can be useful when adding extra actions for these directories, such as mount or eject. Mounted directories can be detected by comparing the st_dev of a directory with that of its parent. If they differ, they are from different devices and the directory is a mount point.
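The st_dev comparison can be written directly with os.stat. The function name is hypothetical and the sketch implements only the check described above; Python's standard os.path.ismount performs a similar test and additionally treats the filesystem root as a mount point.

```python
import os

def is_mount_point(path):
    """True if path is on a different device than its parent directory."""
    own = os.stat(path)
    parent = os.stat(os.path.join(path, os.pardir))
    # Different st_dev values mean the directory lives on another device.
    return own.st_dev != parent.st_dev
```

Note that for the root directory the parent is the directory itself, so this plain comparison reports False there.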
having the same type. This is to help interoperability. The type determined in this way is only a guess, and an application MUST NOT trust a file based simply on its MIME type. For example, a downloader should not pass a file directly to a launcher application without confirmation simply because the type looks harmless (eg, text/plain). Do not rely on two applications getting the same type for the same file, even if they both use this system. The spec allows some leeway in implementation, and in any case the programs may be following different versions of the spec.
3. Contributors
    Thomas Leonard <tal197 at users.sf.net>
    David Faure <faure at kde.org>
    Alex Larsson <alexl at redhat.com>
    Seth Nickell <snickell at stanford.edu>
    Keith Packard <keithp at keithp.com>
    Filip Van Raemdonck <mechanix at debian.org>
    Christos Zoulas <christos at zoulas.com>
    Matthias Clasen <mclasen at redhat.com>
    Bastien Nocera <hadess at hadess.net>
References
    [GNOME] The GNOME desktop, http://www.gnome.org
    [KDE] The KDE desktop, http://www.kde.org
    [ROX] The ROX desktop, http://rox.sourceforge.net
    [DesktopEntries] Desktop Entry Specification, http://www.freedesktop.org/standards/desktop-entry-spec.html
    [SharedMIME] Shared MIME-info Database, http://www.freedesktop.org/standards/shared-mime-info.html
    [RFC-2119] Key words for use in RFCs to Indicate Requirement Levels, http://www.ietf.org/rfc/rfc2119.txt?number=2119
    [BaseDir] XDG Base Directory Specification, http://www.freedesktop.org/standards/basedir/draft/basedir-spec/basedir-spec.html
    [ACAP] ACAP Media Type Dataset Class, ftp://ftp.ietf.org/internet-drafts/draft-ietf-acap-mediatype-01.txt