What are all the files in the Findings library?

cparnot · May 11, 2018, 1:07pm

Documents

Here is how to get to your Findings documents on disk:

In Findings, choose the menu Findings > Preferences
In the preferences, choose the ‘Library’ tab
Press the button ‘Reveal in Finder’ below the text field labeled ‘Documents:’
This will select a folder called ‘Default’ in the Finder

Note that if you use Dropbox sync, this folder will be stored on Dropbox.

The ‘Default’ directory contains your actual documents, which are organized in the subdirectories ‘Experiments’ and ‘Protocols’. Each document is a file package that contains your data, using the open-source key-value store implementation PARStore (more on this below). The attachments are also stored inside the file package corresponding to the experiment or protocol, inside a subdirectory ‘attachments’. Each experiment, protocol and attachment is uniquely identified by a universally unique identifier (aka UUID), which is a random, but unique, string of digits and letters representing 16 bytes in hexadecimal, for instance: 387700B5-8B21-463D-B898-5068FE4A5327
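As a quick illustration of that identifier format, here is a minimal sketch using Python's standard uuid module (nothing in it is specific to Findings). It parses the example above and shows the 16 bytes behind it:

```python
import uuid

# The example identifier from the text above.
doc_id = uuid.UUID("387700B5-8B21-463D-B898-5068FE4A5327")

raw = doc_id.bytes               # the 16 underlying bytes
hex_digits = doc_id.hex          # the same bytes as 32 hexadecimal digits
canonical = str(doc_id).upper()  # back to the dashed, upper-case form shown above
```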

Derived Data

Here is how to get to the derived data for your Findings installation on your Mac, which contains various files that are used to display your documents, but are not part of your actual documents:

In Findings, choose the menu Findings > Preferences
In the preferences, choose the ‘Library’ tab
Press the arrow button in the ‘Derived Data’ box
This will select a folder called ‘Findings’ in the Finder

The ‘Persistent’ directory contains some non-critical, basic metadata about your library.

The ‘Derived’ directory contains non-critical derived information about your library, that is used to display your experiments and protocols, and search them.

The ‘Local’ directory contains your actual documents, if they are stored only on your machine. If you use sync, your documents will be stored in Dropbox (see above).

More about PARStore

Each of your documents is a file package that contains your data, using the open-source PARStore. In short, PARStore is a key-value store, where the keys and values are stored in a bunch of SQLite databases. The source code can be found on GitHub (cparnot/PARStore: Open source Dropbox-resistant key-value store for local storage, written in Objective-C for Mac OS X and iOS). Now for the longer version…

To allow multiple apps, users or devices to write at the same time to the same PARStore, the implementation uses a simple concept which I call ‘file ownership’. This means that each individual device/app is allowed to read/write the files that it created, but it can only read (and not modify) other files. This is done by contract, not by using r/w filesystem settings. Because a given file can only ever be modified by one device, it is easy to avoid corrupting or overwriting content when it goes through a file syncing system like Dropbox. More on this topic here: https://vimeo.com/90417635 and NSConf: Rethinking Syncing - The Guinea Pig in the Cocoa Mine. The first half of the video explains the concept, and the second half gives examples, one of which is basically PARStore.

To implement file ownership in PARStore, the first time an app launches on a device, it creates a unique identifier (UUID) and stores it in the app settings (the preference file or, better, ‘~/Library/Application Support’). This UUID is what I will call the device identifier. For instance, Findings will use a different device identifier on 2 different Macs. But note that 2 different apps on the same device (for example Findings and a hypothetical app ‘Results’) would also each use a different device identifier. So maybe you can think of the ‘device’ identifier as a ‘participant’ identifier. Anyway. The device identifier is then used in all the PARStore files that a particular instance of the app manipulates on a particular device. In the video linked above, I talk of device ‘A’ and device ‘B’, and those names could be used as device identifiers, but there is a risk they would not be unique enough. So instead I use a UUID, such as ‘0C6EEE94-7291-4097-A894-910D90AB6190’. These can be generated using the uuidgen tool on the command line (man uuidgen!).
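The ‘create once, then reuse’ behaviour of the device identifier can be sketched in a few lines of Python. This is only an illustration of the idea, not how Findings or PARStore actually store the value: the function name, the settings directory argument and the file name ‘device-identifier.txt’ are all made up for the example.

```python
import uuid
from pathlib import Path

def device_identifier(settings_dir: Path) -> str:
    """Return this participant's identifier, creating it on first use."""
    id_file = settings_dir / "device-identifier.txt"  # hypothetical file name
    if id_file.exists():
        # Later launches: reuse the identifier stored the first time.
        return id_file.read_text().strip()
    # First launch: generate a random UUID and remember it.
    settings_dir.mkdir(parents=True, exist_ok=True)
    new_id = str(uuid.uuid4()).upper()
    id_file.write_text(new_id)
    return new_id
```

Calling the function twice with the same settings directory returns the same identifier, while a different directory (another app, or another machine) gets a different one.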

With the concept of device identifier in place, here is how PARStore files are organized and where these identifiers are used. Each experiment or protocol is a different PARStore. A PARStore is not a single file, it’s a file package directory that contains 2 subdirectories:

directory ‘devices’ contains SQLite databases, each inside its own subdirectory named with the ‘device identifier’; an individual device can read all the databases, but it can only modify its own; this is how file ownership is enforced; the same app on a given device will always use the same device identifier across multiple launches
directory ‘blobs’ can contain arbitrary files and directories and is used for larger assets that don’t really fit in a database; in the case of Findings, there is a subdirectory ‘attachments’, which is used to store attachments for the experiment/protocol; each attachment is stored under its original name inside yet another subdirectory named with a UUID corresponding to its location inside the document (see more below: the UUID is the paragraph identifier); the ‘blobs’ directory does not enforce file ownership and it’s up to each device to do the right thing
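To make the layout concrete, here is a small read-only Python sketch that lists the device identifiers and the attachments found in one file package. It only relies on the directory names given above (‘devices’, ‘blobs’, ‘attachments’); the function names are mine, and skipping hidden files such as .DS_Store is my own assumption.

```python
from pathlib import Path

def _visible(directory: Path) -> list:
    """Sorted non-hidden entries of a directory, or [] if it does not exist."""
    if not directory.is_dir():
        return []
    return sorted(p for p in directory.iterdir() if not p.name.startswith("."))

def describe_store(package: Path) -> dict:
    """List device identifiers and attachment files in one PARStore package."""
    # One subdirectory per device identifier under 'devices'.
    devices = [p.name for p in _visible(package / "devices") if p.is_dir()]
    # One subdirectory per paragraph UUID under 'blobs/attachments'.
    attachments = {
        para_dir.name: [f.name for f in _visible(para_dir)]
        for para_dir in _visible(package / "blobs" / "attachments")
        if para_dir.is_dir()
    }
    return {"devices": devices, "attachments": attachments}
```

The sketch never writes anything, so it respects the file ownership contract whichever device it runs on.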

The SQLite databases are then very simple, with just one table, and 4 columns:

key: a string
value: a data blob corresponding to a property list stored in binary format (for Python, see here: Reading Binary Plist files with Python - Stack Overflow)
timestamp: 64-bit integer corresponding to the number of milliseconds since the reference date of January 1, 2001, at midnight (00:00) GMT
parentTimestamp: timestamp of the previous value for that key, from which it was modified to the new value (this can be null); this column can be used to improve merging and provide some kind of history tree for a given key/value pair

To get the final key-value pairs, you simply read all the databases and collect the most recent key/value pairs for each key, based on timestamp. With PARStore, you can use any key you’d like. PARStore could be used for other apps beyond Findings.
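Here is a minimal Python sketch of that merge, using only the standard sqlite3 and plistlib modules. The table name ‘logs’ and the literal column names used in the SQL are assumptions made for the example; inspect the actual schema of a database before relying on them. The demo databases are built in memory so the example runs on its own.

```python
import plistlib
import sqlite3

TABLE = "logs"  # hypothetical table name; check the real schema first

def latest_values(connections):
    """Merge several device databases into one dict of current key/value pairs."""
    newest = {}  # key -> (timestamp, raw blob)
    for conn in connections:
        query = f"SELECT key, value, timestamp FROM {TABLE}"
        for key, blob, timestamp in conn.execute(query):
            # Keep only the most recent row seen so far for each key.
            if key not in newest or timestamp > newest[key][0]:
                newest[key] = (timestamp, blob)
    # Each value is a binary property list; plistlib.loads detects the format.
    return {key: plistlib.loads(blob) for key, (_, blob) in newest.items()}

def make_demo_db(rows):
    """Build an in-memory database with the four columns described above."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        f"CREATE TABLE {TABLE} "
        "(key TEXT, value BLOB, timestamp INTEGER, parentTimestamp INTEGER)"
    )
    for key, value, timestamp, parent in rows:
        blob = plistlib.dumps(value, fmt=plistlib.FMT_BINARY)
        conn.execute(
            f"INSERT INTO {TABLE} VALUES (?, ?, ?, ?)",
            (key, blob, timestamp, parent),
        )
    return conn

# Two made-up devices: B edits the title that A wrote earlier.
device_a = make_demo_db([("title", "Draft", 100, None), ("done", False, 110, None)])
device_b = make_demo_db([("title", "Final title", 200, 100)])
merged = latest_values([device_a, device_b])
```

In this demo, the title written by the second device wins because its timestamp is more recent, while ‘done’ only exists on the first device and is kept as is.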

In the case of Findings, each experiment or protocol (more generally a ‘document’) is a separate PARStore. The key/value pairs used by Findings are:

uuid: the uuid of the document, string
author: author name, string
title: document title, string
rating: float
done: boolean
aim: string
summary: string
projectTitle: string
projectUUID: string
categories: list of protocol subfields, array of strings
layout: list of paragraph UUIDs, array of strings
doneParagraphs: list of paragraph UUIDs, array of strings
(paragraph UUID): each paragraph is stored under its own identifier, and the content is stored as a dictionary/hash, in particular with the ‘content’ key (this UUID is also used for the attachment identification within the blobs directory); note that there might be paragraphs that are not in the experiment any more and have been deleted, in which case the UUID won’t be in the layout array

The list is not meant to be exhaustive, and more keys may be added in the future as well (which would be backward-compatible).
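Once the key/value pairs have been collected into a dictionary, the paragraphs currently in a document can be recovered by following the ‘layout’ array. The Python sketch below assumes such a dictionary is already available; the function name, the identifiers and the plain-string contents in the sample data are made up for illustration, and the real ‘content’ values may be richer than a string.

```python
def visible_paragraphs(store: dict) -> list:
    """Return (uuid, content, is_done) for each paragraph listed in 'layout'."""
    done = set(store.get("doneParagraphs", []))
    rows = []
    for paragraph_id in store.get("layout", []):
        paragraph = store.get(paragraph_id)
        if isinstance(paragraph, dict):  # skip ids with no stored dictionary
            rows.append((paragraph_id, paragraph.get("content"), paragraph_id in done))
    return rows

# Made-up data: the identifiers and the string contents are illustrative only.
P1 = "11111111-1111-4111-8111-111111111111"
P2 = "22222222-2222-4222-8222-222222222222"
P3 = "33333333-3333-4333-8333-333333333333"  # deleted: not in 'layout'

store = {
    "title": "Demo",
    "layout": [P2, P1],
    "doneParagraphs": [P1],
    P1: {"content": "Run the gel"},
    P2: {"content": "Prepare samples"},
    P3: {"content": "Deleted step"},
}
rows = visible_paragraphs(store)
```

The deleted paragraph is still present in the store, but it is skipped because its UUID is no longer in the layout array.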

777 · May 17, 2018, 4:06pm

Could you explain a bit about PARStore’s deletion protocol?

I’m most interested in the intersection of attachment handling and deletion. It looks like PARStore uses some git concepts, which I like. Are identical attachments stored only once (like in a git db)?

Consider this scenario: I have a folder with several huge results files that I insert into an experiment. But, oops!, one of the files is redundant, so I (perhaps foolishly?) remove the entire folder attachment from the experiment, fix the source folder, and then re-insert it.
But, oops again!, I missed another file that should be removed, and so repeat that process of deleting the attachment folder from the experiment, and then reinserting another folder with identical content except one file is omitted.

Is the final attached folder’s contents, which have now been inserted three times, stored three times in Findings? Or only once?

In the case of the files that were removed and not reinserted, how long will Findings keep them around in the background? To borrow git terminology, how/when does Findings prune unreachable content?

cparnot · May 22, 2018, 8:31pm

Good question! The handling of attachments is indeed also non-destructive: attachment files are never deleted. There is no automatic pruning at the moment. The rationale is that I simply want to avoid having any code that deletes a file (there is just one place in the code that does it, which is when you delete an entry or protocol). In Findings 2, it’s fairly easy to prune large attachments manually, by using the menu Entry > Show in Finder, which lists all the attachments. I plan to make it even easier in the future.

Attachments are not stored based on a git-like hash. The reason is simple: the file can be further edited, in place, and it would be very tricky to then keep track of that new hash in a reliable way. If you keep inserting then deleting an attachment, you will get as many copies as you dropped in.

I hope that answers your question!

Charles

777 · May 23, 2018, 8:12pm

Thanks for the answer!

OK, I understand. I will have to be more careful it seems!

Entry > Show in Finder is cool, but is there a way to tell which ones correspond to attachments currently showing in the Entry, as opposed to historic ones that are not otherwise viewable?
If I insert and delete and replace a folder called job_logs three times, not sure how I delete the two now-unused ones.

I definitely appreciate no auto-pruning, though. I wouldn’t know where to begin suggesting rules that would not infuriate at least someone!
But, perhaps an option to prune all unreachable attachments? Maybe also allow a specifiable threshold: Prune all unreachable attachments greater than ## size.
I think I read that history review is a long-term feature goal for Findings? Could replace any such pruned files with a dummy placeholder with the same filename but no contents, just to keep the historical view consistent.

Doesn’t seem high priority to me right now, really. But I like the idea of keeping most of my file-based findings in Findings, and eventually, without a way to systematically see or purge stuff I’m not actually using, I’ll have an issue.

cparnot · May 31, 2018, 9:01pm

Ah yes, it can be a bit tricky. A workaround for now is to find out which one is in use by the document: hover over the attachment preview and click the ‘Show in Finder’ button (magnifying glass icon). Then you’ll know which one to keep, and thus which ones to delete.

I agree we need a special menu item to prune unused attachments… It’s on my list!