Open Index Concordance : RFC

Updated 23 July 2023

Status: WORKING DRAFT

This is the first rough cut of a bright idea I had while wrestling with the index for a long book I have just written. I have no idea how to approach the LibreOffice development community to find out whether it might have any mileage, but I am happy to be contacted by anybody who thinks it might. The RFC bit is just a bit of hubris and my idea of a joke. Hey-ho.

Indexing

When you are indexing a book in, say, LibreOffice Writer, you can use the Find & Replace > Find All feature to highlight every instance of a given word or phrase, then open the Insert Index Entry dialog to add index tags to each instance and list them in the alphabetical index. You can then go back and edit individual tags to change how they appear in the index. But this is not always practicable or, if there is much post-editing to do, desirable.

Concordance files

A prepared list of words for indexing, called a concordance file, can make life a lot easier. Concordance files are used for all sorts of purposes; indexing is only one of them, and every wordprocessor has its own format for them. An indexing concordance file for LibreOffice Writer is a simple CSV data file, which can be edited or even hand-built in anything from a text editor to an Oracle database. You just enter each text word or phrase you want to index and how you want to present it. When done, you will need to change its dot extension to .sdi. Offer the sdi file to Writer, and it will whoosh through your draft adding index tags and building the index itself. You can even have standard concordance files for different topics, and apply as many of them as appropriate to each book, with each file adding to the Index; just make sure the same word does not appear in more than one file.

It all sounds wonderful, and it can help enormously. But the LibreOffice sdi format is limited and there are problems. Sometimes you may want to present a phrase twice, with different word orders. For example, the same tag may be required to generate multiple entries for, say, "concordance file" and "file, concordance. See concordance file." Or you may or may not want to preserve italic formatting, such as contrasting the journal "Nature" with Nature as in "Nature red in tooth and claw".

And of course if you move to another wordprocessor, then the file format will be all wrong and some facilities may not even be transferable.

LibreOffice Writer sets a kind of default open standard for index concordance files, but it is pretty limited and there is no roadmap to a fully-featured standard format.

What is missing is an open specification which wordprocessor creators can aspire to meet. One such specification is what this note sets out to explore.

Open Index Concordance

The starting point is the .sdi open format used by LibreOffice. Backwards compatibility is maintained in the internal format, but not necessarily with the file type. It therefore makes sense to define a new file type, rather than risk breaking the existing one for some users.

File format

The format is called Open Index Concordance (OIC).

OIC files should be given the dot extension .oic.

The internal format is plain-text CSV, using the semicolon ( ; ) separator.

The character encoding is UTF-8. No other encoding is acceptable in the file itself. Developers whose software works in other encodings should ensure that, when loading or saving an OIC file, it not only understands the OIC formatting and functionality, but also converts seamlessly between UTF-8 and any alternative offered. The reason is that other encodings cannot be guaranteed to be available, whereas any system capable of working with OIC is highly unlikely not to understand UTF-8 when asked to do so.
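As a rough illustration of the UTF-8 rule, a loader might decode the raw bytes strictly and refuse anything else. This is only a sketch: the function name decode_oic is hypothetical, and whether a leading byte-order mark should be tolerated is not addressed by this note, so none is stripped here.

```python
def decode_oic(raw: bytes) -> list[str]:
    """Decode the raw bytes of an OIC file and split them into lines.

    Decoding is strict, so bytes that are not valid UTF-8 raise
    UnicodeDecodeError instead of being silently replaced.
    """
    text = raw.decode("utf-8", errors="strict")
    # splitlines() drops the line terminators themselves.
    return text.splitlines()
```

Software that works internally in another encoding would convert after this step, rather than writing that encoding into the file.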

Each line must contain one of:

White space, comprising any number (including none) of spaces and/or tabs.
A comment, starting with doubled gate characters, as ##.
A single index entry in .sdi format.
A continuation line to an index entry, starting with a single gate character, as #.

Using gate characters in this way means that an OIC file still behaves sensibly when it is treated as an sdi file (e.g. by renaming the dot suffix):

The # prefix to the second line ensures that the incompatible OIC fields will be treated as a comment and ignored.

The ## ensures that a comment will still be treated as such. It is also the fastest and most convenient way of distinguishing a comment from a second entry line.
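The four line types can be told apart by looking only at the start of each line. The sketch below is one possible reading, not a reference implementation: the name classify_line is hypothetical, and it assumes the gate characters must sit in the very first column, which this note does not state explicitly.

```python
def classify_line(line: str) -> str:
    """Return 'blank', 'comment', 'continuation' or 'entry' for one OIC line."""
    text = line.rstrip("\r\n")
    # White space: any number (including none) of spaces and/or tabs.
    if text.strip(" \t") == "":
        return "blank"
    # Test the doubled gate first, because '##' also starts with '#'.
    if text.startswith("##"):
        return "comment"
    if text.startswith("#"):
        return "continuation"
    # Assumption: anything else is an index entry in .sdi format.
    return "entry"
```

Checking for ## before # is the only ordering that matters here.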

Entry format

Each entry may be specified in either one or two consecutive lines. The entry data comprises a semicolon-separated list of fields.

Every entry must have a first line in .sdi format. Not all fields need be populated, but all must be delimited.

Optionally, an entry may have a second line comprising the additional OIC fields. This line must be prefixed by a # character, before the continuation of the entry fields.
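A reader of the file has to pair each first line with its optional second line. The following is a minimal sketch of that pairing, under the assumption that "consecutive" means nothing, not even a blank line or comment, may come between the two lines. The name group_entries is hypothetical.

```python
def group_entries(lines):
    """Pair each entry line with its optional '#' continuation line.

    Returns a list of (first_line, second_line_or_None) tuples.
    """
    entries = []
    can_continue = False  # True only directly after a first line
    for line in lines:
        text = line.rstrip("\r\n")
        if text.strip(" \t") == "" or text.startswith("##"):
            # Blank lines and comments break the pairing (an assumption).
            can_continue = False
        elif text.startswith("#"):
            if not can_continue:
                raise ValueError("continuation line without a preceding entry: " + text)
            entries[-1] = (entries[-1][0], text)
            can_continue = False  # at most one continuation line per entry
        else:
            entries.append((text, None))
            can_continue = True
    return entries
```

Entries without a second line simply carry None in the second slot, so plain .sdi content passes through unchanged.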

The first six fields, comprising the first line, are compatible with the LibreOffice .sdi format. Their names, as defined here, have been changed because I can. This does not matter in practice, because the names do not appear in the file itself. A line looks like this:

Content term;Index entry;1st key;2nd key;Case sensitivity;Word only

Content term

The document text that you want to mark in the document. (Labeled as "Search term" in LibreOffice Writer)

Index entry

The text entry to appear in the index. (Labeled as "Alternative entry" in LibreOffice Writer)

Where omitted, the content term is used.

1st key

The 1st parent index entry. The index entry is presented as a sub-entry under the 1st key.

2nd key

The 2nd parent index entry. The 1st key is presented as a sub-entry under the 2nd key, and the actual index entry as a sub-sub-entry under the 1st key.

Case sensitivity

When set, uppercase and lowercase letters are considered distinct. (Labeled as "Match case" in LibreOffice Writer)

Values:

Empty or 0 : case-insensitive
Anything else : case-sensitive

Word only

When set, indexes only instances where the term occurs as a single, unattached word.

Values:

Empty or 0 : finds all matching strings
Anything else : finds only whole, unattached words
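To make the six fields concrete, here is a sketch of parsing a first line. The names SdiFields, is_set and parse_first_line are hypothetical. The sketch assumes that a semicolon can never appear inside a field, since no quoting or escaping is defined in this note, and that flag values are compared exactly, without trimming spaces.

```python
from dataclasses import dataclass


@dataclass
class SdiFields:
    content_term: str
    index_entry: str
    key1: str
    key2: str
    case_sensitive: bool
    word_only: bool


def is_set(value: str) -> bool:
    """Empty or '0' means off; anything else means on."""
    return value not in ("", "0")


def parse_first_line(line: str) -> SdiFields:
    """Split a first entry line into its six semicolon-delimited fields."""
    parts = line.rstrip("\r\n").split(";")
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, found {len(parts)}")
    term, entry, key1, key2, case, word = parts
    # Where the index entry is omitted, the content term is used instead.
    return SdiFields(term, entry or term, key1, key2, is_set(case), is_set(word))
```

For instance, parse_first_line("cat;;;;0;") would give an index entry of "cat" with both flags off.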

The remaining fields, comprising the second line, are additional to the .sdi fields. This line is optional. Where it is present, all of these fields must be delimited. A line looks like this:

#Italic;See entry;Free text

Italic

Specifies any text in the index entry that is to be set in italic.

See entry

Specifies the destination index entry to be given in a "See..." note. Replaces any page numbers.

There will normally be a second entry for the same content term, specifying the destination entry itself.

Free text

Free text string. Replaces any page numbers.
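The second line can be read in much the same way as the first. This sketch takes the rule that all three fields must be delimited at face value and rejects anything else. The names OicExtras and parse_continuation are hypothetical, and as before it assumes no semicolons inside a field.

```python
from typing import NamedTuple


class OicExtras(NamedTuple):
    italic: str
    see_entry: str
    free_text: str


def parse_continuation(line: str) -> OicExtras:
    """Split a '#'-prefixed second line into the three additional OIC fields."""
    text = line.rstrip("\r\n")
    if not text.startswith("#") or text.startswith("##"):
        raise ValueError("not a continuation line")
    # Drop the single gate character, then split the remaining fields.
    parts = text[1:].split(";")
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, found {len(parts)}")
    return OicExtras(*parts)
```

A more forgiving reader could pad missing trailing fields with empty strings, but that would be a relaxation of the rule as written.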

Examples

The appearance shown is of course dependent on the document styling; the examples shown are therefore nominal.

Entry:

Times correspondent;Times, The;newspaper;;1;
#Times, The;;

Appearance:

newspaper:
Times, The

Entry:

Times correspondent;Times, The (newspaper);;;1;
#Times, The;See newspaper;

Appearance:

Times, The (newspaper): See newspaper