TSV format of Shelfcheck

Home » Shelfcheck » Here

Intended audience: database administrators, developers

Shelfcheck is a On-Shelf-Analytics (OSA) webapp for retail networks that delivers Out-Of-Shelf (OOS) alerts.It can import data from a remote file repository of TSV files through FTP or through cloud storage.

See also our BigFiles, our file hosting service that supports FTP. Push your files directly to Lokad.

This document details how to setup the data in the expected format that will allow Shelfcheck to auto-import the data directly.

Big picture

The data to be imported by Shelfcheck falls into 3 categories:

Stores which provide high level informations about each store (for example the store name).
Items which provide generic information about every single item sold accross the network.
Receipts which provide the sales history, with flexibility offered concerning the data granularity.

Shelfcheck relies on statistical analysis; it therefore works better with more data. In particular, we suggest including as much sales history as possible. For example, if you have 3 years of daily sales data available please include the 3 years. Pushing a longer sales history to Shelfcheck is the easiest way to improve the precision of the OOS alerts.

   Big picture
   Convention over configuration
   Flat text files and files hierarchy
   FTP repository or Windows Azure Blob Storage
   Tab Separated Value format (TSV)
      GZip compression
   Files hierarchy
      Format of stores.tsv
      Format of items.tsv
      Format of receipts-yyyy-MM-dd.tsv
      Format of backorders-yyyy-MM-dd.tsv
      Format of deliveries-yyyy-MM-dd.tsv
      Format of status-yyyy-MM-dd.tsv
      Format of uploading.tsv
      Format of shelfcheck.config
   Annex: Terminology

Convention over configuration

Instead of following a rigid data format that might not be appropriate for your business, Shelfcheck relies on a flexible approach based on file naming conventions and column naming conventions. This approach is similar to the Salescast SQL Intermediate Schema, but tailored for the specific context of retail networks.

Shelfcheck can potentially import a lot of data, down to each receipt with its full details, or fall back to pre-aggregated daily data. Shelfcheck lets you choose which data you want to include in the data repository. Nearly all fields are optional. If a field cannot be found, Shelfcheck will continue to process your data and skip the missing field.

Caution: Shelfcheck is case-sensitive, make sure to respect the specified capitalization for fields and file names.

Flat text files and files hierarchy

Dealing with a lot of data is one of the challenges faced by software apps when working with store-level data for any larger retail network. At Lokad, our experience indicates that while SQL databases are great for warehouse data, at the store level, SQL itself becomes a problem as it becomes increasingly hard to tune a SQL database containing millions of rows to deliver the desired performance.

Instead, Shelfcheck relies on a repository of flat tab-separated files. Futhermore, Shelfcheck expects the data to be partitioned according to a set of precise guidelines (detailed below). The files are either hosted on a classic filesystem (reachable through FTP), or hosted in a cloud storage.

FTP repository or Windows Azure Blob Storage

At this time, Shelfcheck supports two repository types (where the data files can be stored):

FTP repository (hosted by Lokad)
Blob Storage on Windows Azure

An FTP repository can be made available on client request by Lokad. The client is expected to push the data toward her FTP account. The main drawback of FTP is the lack of scalability: FTP is a statefull protocol that is basically attached to a single server. Hence, FTP does not easily scale over multiple machines to distribute the workload.

Once the data to be uploaded toward Shelfcheck exceeds 10GB we suggest switching to cloud storage. Shelfcheck can pull the data from a Blob Container hosted on Windows Azure. A storage key for the Windows Azure Storage can be made available by Lokad upon client request. Unlike FTP, the Blob Storage is stateless and supports a very large number of concurrent uploads.

Tab Separated Value format (TSV)

TSV is a popular format, and Shelfcheck follows the most usual patterns:

The file name extension is .tsv.
The encoding of strings is UTF-8.
The delimiter is the TAB character (Ascii code: 9).
Fields can contain any string, excluding TABs.
Fields can be left blank (zero characters), but TABs cannot be omitted.
Each line contains the same number of fields.
The first line contains the names of the fields (headers).

File names and (relative) folder or header names must start with a letter and contain only letters, numbers and hyphen signs (-). These names must be between 3 and 63 characters long.

Headers must start with a letter, and contain only letters and numbers. Headers must be from 3 to 63 characters long.

When a field contains numbers, the number must be formatted as 123456.789 with the dot . used as decimal separator and without separators for thousand blocks.

When a field contains dates, dates should be formatted as ISO 8601, that is 2012-05-16T12:38Z (combined date and time in UTC). Dates with week numbers are not supported.

GZip compression

If the file name extension is .tsv.gz, then Shelfcheck expects the file to be compressed with GZip. This option is offered to facilitate data transfer, as TSV file can typically be compressed from 80% to 90% using GZip.

Files hierarchy

The block below outlines a file hiearchy as it should be exposed to Shelfcheck.

    /items.tsv
    /shelfcheck.config
    /stores.tsv
    /store-123/ ...
              /backorders-2012-05-17.tsv
              /backorders-2012-05-18.tsv
              /backorders-2012-05-19.tsv
              / ...
              /deliveries-2012-05-17.tsv
              /deliveries-2012-05-18.tsv
              /deliveries-2012-05-19.tsv
              / ...
              /items.tsv
              / ...
              /receipts-2012-05-17.tsv
              /receipts-2012-05-18.tsv
              /receipts-2012-05-19.tsv
              / ...
              /status-2012-05-17.tsv
              /status-2012-05-18.tsv
              /status-2012-05-19.tsv
              / ...
              /uploading.tsv
    /store-124/ ...
    /store-125/ ...

The root folder / contains two files /stores.tsv and /items.tsv that contain respectively the list of stores and the list of items. Shelfcheck.config is a special file to configure Shelfcheck itself.

As explained here above, you can compress any TSV files with GZIP. When doing so, you need to replace the file extension .tsv by .tsv.gz. For the sake of concision, we will not repeat this option in the following sections.

Format of stores.tsv

The purpose of stores.tsv is to list the stores contained in the retail network, along with their meta-data.

`stores.tsv`
`StoreId`	string	Required. The `StoreId` represents the identifier used by Shelfcheck to locate the store data in the repository, located at `/store-StoreId/`. All lines are supposed to contain distinct `StoreId` identifiers. `StoreId` must contain only letters and numbers and be between 1 and 57 characters long.
`Name`	string	A human-readable identifier used for reporting purposes.
`AdminEmail`	string	An email address associated to the administrative contact - within the retail network - for managing Shelfcheck for this particular store.
`OosAlertEmail`	string	An email address where the daily OOS (out-of-shelf) alert gets delivered. Leave blank if no email notification is intended for this store.

Format of items.tsv

The file items.tsv contains the description of the items sold in the retail network. This file is found in two locations: at the root / and/or within a store folder /store-123/.

`items.tsv`
`ItemId`	string	Required. Uniquely identifies items from a data analytics viewpoint. It means that all the sales performed under the same `ItemId` will be consolidated as a single unit of need. In particular, OOS alerts are produced at the `ItemId` level.
`GTINs`	string	Required. A list of comma-separated GTIN identifiers (aka separated by the `,` symbol). GITNs (Global Trade Item Numbers) include both UPC (North America) and EAN (Europe). The first GTIN of the list is intended as the primary GTIN associated to the item, i.e. the GTIN to be reported within web UIs.
`Name`	string	A human-readable name for the item. When omitted, the `ItemId` is used instead for reporting.
`ReorderCode`	string	An operating code typically used by store employees to replenish the item. This field is supported to make the reports more practical when iterating over OOS alerts. This value is only intended at the primary reorder code associated to the item. In the receipt's files however, multiple reoder codes might appear associated to the same `ItemId`.
`ShelfCode`	string	An in-store proximity code. When this field is present, OOS alerts are grouped by `ShelfCode`, and listed in lexical order. This field is provided to facilate the task of store employees. By grouping alerts according to the location of the items, employees can save time while iterating over OOS alerts through the store.
`OosAlertEnabled`	bool	This flag is boolean and should only be associated the strings `true` or `false`. When the column is omitted, or when the value is left blank, then `true` is used by default. This flag indicates whether an OOS alert can be produced for the item. A typical situation where alerts should not be produced is when the stockout is intentional. In this case, there is no need to notify employees about the stockout.
`TagFoo`	string	The fields prefixed with `Tag` (aka `TagFoo` where `Foo` is an ad-hoc string) will be used by Shelfcheck as extra information about the underlying time-series. See our guide about tags and events to learn more about which data should be used for tags. In practice, tags will be used to represent product family, sub-family, category ... that is, to reflect the catalog hierarchy.
`LabelFoo`	string	The fields prefixed with `Label` (`LabelFoo` where `Foo` is an ad-hoc string) will be used in the Shelfcheck reports. For example, having a column named `LabelBrand` will enable Shelfcheck to display the brand within the reports. The convention allows adding any arbitrary set of extra fields for reporting purposes (the `Label` prefix will be removed from the final reports). Those fields are for display only and they do not impact the OSA process itself.
`TagLabelFoo`	string	If you wish to combine both `Label` and `Tag` behaviors into a single field, then you need to prefix the column with `TagLabel`.

The file items.tsv located in the store folder complements and overwrites the information available in /items.tsv :

if a header Foo is found in /store-123/items.tsv which does not exist in /items.tsv, then the local Foo information is added to items for the store 123 as an extra field. The matching is operated based on the ItemId identifier.
if a header Bar is found in /store-123/items.tsv, and the same header already exists in /items.tsv, then the local Bar information overwrites the global Bar information. The matching is operated based on the ItemId identifier.

The GTINs list represents all the barcodes that, in the end, should be merged as a single item. There are multiple reasons to operate such a merge, such as grouping items sold by the unit or in packs, group promoted and non-promoted items, establishing a link between the old and the new item when the supplier is only operating a minor packaging evolution.

Format of receipts-yyyy-MM-dd.tsv

The receipts represent the most heavyweight information to be processed by Shelfcheck. Hence, the intermediate format enforces a daily partitioning of the data files. This structure is particularly convenient to enable the quick delivery of small incremental data files (i.e. no need to resend the whole sales history on a daily basis).

The suffix of the file indicates the date for all the receipts listed within the file. Except for the initial receipt file, the dates listed within the file must strictly match the day as indicated by the suffix.

Caution: the suffix does not represent the date when the data file is pushed. The suffix matches the date of its file content.

`receipts-yyyy-MM-dd.tsv`
`ReceiptId`	string	Identifies all the lines that are part of the same market basket (i.e. same receipt). This field can be omitted.
`DateTime`	datetime	Required. Identifies the purchase date associated to the article. This value is typically produced by the barcode scanner itself when items are processed at the checkout. Within each file, lines should be sorted by this column from oldest to newest.
`GTIN`	string	Required. Typically read by the barcode scanner.
`Quantity`	float	Required. As the name suggest, the quantity of items being purchased. Note that Shelfcheck does not require to pre-aggregate lines within a single receipt base on the `ItemId`.
`UnitPrice`	float	The price per item of the goods being acquired. Dollar cents or Euro cents (or any equivalent fractional currency value) should be represented with a fractional number as well. Fractional numbers are supported, by convention, we suggest not expressing prices in cents here.
`Discount`	float	An optional discount (expressed in currency unit) that might have been offered to the client. The price that the client is paying is the end must match the equation: `Total = Quantity x UnitPrice - Discount`. The `Discount` value is optional. If the field is present but the value is left blank, then it counts as zero.
`IsPromoted`	bool	A optional Boolean field indicating whether the purchased item was promoted at the time of the purchase. When present, its value should be `true` or `false`. When omitted, it's `false` by default.
`LoyaltyId`	string	Represents the client identifier when a loyalty card is used. When the `ReceiptId` is present, all the lines associated to a particular `ReceiptId` identifier are expected to be associated with the same `LoyaltyId` (if this field is present).

Within the file, lines must be ordered against DateTime, starting from the early history and moving toward the recent one.

If the store is closed, then you should include an empty file (properly named). A missing data file for a particular date is interpreted as, well, missing sales data, which is not equivalent to zero sales data. Since this aspect does impact the analytics performed by Shelfcheck, we suggest being careful about this aspect.

The first file (earliest date) may contain the entire sales history observed so far. This feature is supported in order to facilitate the initialization of the Shelfcheck data repository. Indeed, the first data transfer is likely to include as much as 1000x more data (i.e. 3 years) than the routine daily updates that happen afterward.

For a given store, all receipt files must contain the exact same fields (headers). If you need to change the columns of those files, then you need to revise all the files. Yet, you do not need to revise all stores at once, you can iterate from one store to the next. The receipt format is not required to be the same across the entire network.

The data format of the receipt file is not normalized in the relational sense. Indeed, the association between ReceiptId and LoyaltyId is repeated across each line of the receipt. However, our experience indicates that it's easier to tolerate a bit of redundancy rather than to deal with a greater number of files.

Format of backorders-yyyy-MM-dd.tsv

Backorders represent the replenishment orders passed from the store to the suppliers. Like it is done for receipts, the intermediate format also enforces a daily partitioning of the data files. Providing the backorders to Shelfcheck is optional.

The suffix of the file indicates the date for all the backorders listed within the file. Except for the initial backorder file, the dates listed within the file must strictly match the day as indicated by the suffix.

Caution: the suffix does not represent the date when the data file is pushed. The suffix matches the date of its file content.

`backorders-yyyy-MM-dd.tsv`
`BackorderId`	string	An optional field to specify a unique identifier of the backorder. This field can be leveraged to match backorders with deliveries made to the store later on.
`DateTime`	datetime	Required. Identifies the date when this line occured. Within each file, lines should be sorted by this column from oldest to newest.
`ReorderCode`	string	An operating code typically used by store employees to replenish the item.
`GTIN`	string	Required, unless a `BackorderId` was specified on the same line.
`Quantity`	float	Required. The quantity to be backordered.

Format of deliveries-yyyy-MM-dd.tsv

Deliveries represent the flow of goods being delivered to the stores, once backorders have been made. Like it is done for receipts, the intermediate format also enforces a daily partitioning of the data files. Providing the deliveries to Shelfcheck is optional.

The suffix of the file indicates the date for each delivery listed within the file. Except for the initial delivery file, the dates listed within the file must strictly match the day as indicated by the suffix.

Caution: the suffix does not represent the date when the data file is pushed. The suffix matches the date of its file content.

`deliveries-yyyy-MM-dd.tsv`
`BackorderId`	string	Optional field that should match its original backorder. This identifier allows to track the lead time, among other uses.
`DateTime`	datetime	Required. Identifies the date when this line occured. Within each file, lines should be sorted by this column from oldest to newest.
`ReorderCode`	string	An operating code typically used by store employees to replenish the item.
`GTIN`	string	Required. The GTIN of the delivered product.
`Quantity`	float	Required. The quantity that was delivered.

Format of status-yyyy-MM-dd.tsv

Status files contain a list of misc information, represented as key-value pairs, about the data processing itself and/or the store context. Providing the status file to Shelfcheck is optional.

This file is primarily intended as a flexible mechanism to provide some ad-hoc extra information for each store on a daily basis.

The suffix of the file indicates the date for each status listed within the file. Unlike for receipts, backorders and deliveries, status do not come with any exceptional behavior with respect of the first file found in history.

`status-yyyy-MM-dd.tsv`
`Key`	string	Required. Expected to start with a letter, and contains only letter and numbers, and have less than 64 characters.
`Value`	string	Required.

The following keys are reserved with special meaning (all of them are optional):

PreIntermediateDataAvailabilityTime: This key should be attached to a value representing a date-time information. It represents the time when the data was made available to the process actually populating the intermediate TSV format.

Format of uploading.tsv

Because of the nature of the TSV files that could be read line-by-line, it is complicated for Shelfcheck to determine if the daily upload of the data, for a given store, is still in progress or completed already.

Thus, to remove this ambiguity, Shelfcheck supports an upload flag named uploading.tsv. The process goes as follow:

Before starting the upload, create the file /store-123/uploading.tsv.
Upload the data for the store 123.
After completing the upload, remove the file /store-123/uploading.tsv.

Then, once the file uploading.tsv has been removed, Shelfcheck will start its process of the newly added data. The lock is defined per store, thus, it's possible for Shelfcheck to start working on the first stores while data delivery for the others is still in progress.

Format of shelfcheck.config

The file name /shelfcheck.config is reserved. It contains the configuration settings of the Shelfcheck instance.

Annex: Terminology

Id vs Code: you might have noticed that sometime we use FooId, and sometime it's FooCode. Fields with Id as suffix represent identifiers that can be used by Shelfcheck to perform data aggregations. On the other hand, when it's FooCode, it means that the field is typically intended for reporting only. Enforcing unicity on FooCode fields is not required - but typically better as far reporting goes.

Choosing Shelfcheck

Getting started

User guide

coming soon