Common Format and Media Type for Control-Character-Separated Values (CCSV) Files
draft-rankin-ccsv-02

Document Type Active Internet-Draft (individual)
Author Michael Rankin
Last updated 2024-09-16
Network Working Group                                          M. Rankin
Internet-Draft                                         16 September 2024
Intended status: Informational                                          
Expires: 20 March 2025

  Common Format and Media Type for Control-Character-Separated Values
                              (CCSV) Files
                          draft-rankin-ccsv-02

Abstract

   This document establishes the format used for Control-Character-
   Separated Values (CCSV) files and registers the associated media type
   "text/ccsv".

About This Document

   This note is to be removed before publishing as an RFC.

   The latest revision of this draft can be found at
   https://oldgrognard.github.io/ccsv-id/draft-rankin-ccsv.html.  Status
   information for this document may be found at
   https://datatracker.ietf.org/doc/draft-rankin-ccsv/.

   Source for this draft and an issue tracker can be found at
   https://github.com/oldgrognard/ccsv-id.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 20 March 2025.

Rankin                    Expires 20 March 2025                 [Page 1]
Internet-Draft                    CCSV                    September 2024

Copyright Notice

   Copyright (c) 2024 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction
   2.  Conventions and Definitions
     2.1.  Definition of the CCSV format
       2.1.1.  Formatting Rules
   3.  Encoding Considerations
     3.1.  Why UTF-8?
       3.1.1.  Compatibility
       3.1.2.  Internationalization
       3.1.3.  Efficiency
       3.1.4.  Standardization
       3.1.5.  Future-Proofing
   4.  Security Considerations
   5.  Interoperability Considerations
   6.  IANA Considerations
   7.  Normative References
   Acknowledgments
   Author's Address

1.  Introduction

   Control-Character-Separated Values (CCSV) is a file format that
   enables moving data between spreadsheets, statistical analysis
   programs, databases, and any other program that works with
   rectangular data.  It is very similar to Comma-Separated Values (CSV)
   files [RFC4180], Tab-Separated Values (TSV) files, and their
   derivatives.  Unlike those file types, CCSV minimizes ambiguity by
   using non-printable control characters as delimiters.  The two
   delimiter characters are not permitted within a field (see
   Section 2.1.1), which makes escaping characters or quoting certain
   strings unnecessary.  This document defines the format of CCSV files
   and formally registers the "text/ccsv" media type in accordance with
   [RFC6838].
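
   The difference from CSV can be illustrated with a short Python
   sketch.  It uses the standard "csv" module for the CSV side; the
   sample field values are invented for illustration and are not taken
   from this document.

```python
import csv
import io

# Illustrative field values containing a comma and double quotes.
fields = ["Smith, J.", 'said "hi"']

# CSV: the writer has to quote both fields and double the inner quotes.
buf = io.StringIO()
csv.writer(buf).writerow(fields)
csv_line = buf.getvalue()

# CCSV: fields are joined with US (U+001F) and written unchanged.
ccsv_line = "\x1f".join(fields)

print(repr(csv_line))
print(repr(ccsv_line))
```

   The CSV line carries extra quoting that a reader has to undo, while
   the CCSV line contains the field text exactly as supplied.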

2.  Conventions and Definitions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

2.1.  Definition of the CCSV format

   In order for a file to be a CCSV, it MUST adhere to the following
   formatting rules:

2.1.1.  Formatting Rules

   1.  The file MUST use UTF-8 encoding.  Because US-ASCII is a subset
       of UTF-8, a file that contains only US-ASCII characters is
       already valid.  A consuming program that handles only US-ASCII
       may be unable to interpret all characters; a consumer SHOULD
       decode its input as UTF-8 when the source is unknown.

   2.  The file MUST NOT begin with a Byte Order Mark (U+FEFF).

   3.  A CCSV MUST begin with a header.  The header consists of the
       names of the columns separated with US (U+001F) entities.

   4.  A Unit Separator US (U+001F) is used between each field in a
       record.  Note that carriage returns and line feeds are not part
       of the delimiter and are valid characters in the body of a field.

   5.  A Record Separator RS (U+001E) is used between each record in the
       file including the header.

   6.  A Record Separator RS (U+001E) MAY appear after the last record
       in the file, but this is NOT RECOMMENDED.

   7.  The header and each record, if any, MUST contain the same number
       of US (U+001F) entities.  Consequently, the header and every
       record will have the same number of fields.

   8.  An empty field contains no characters, so it appears as two
       consecutive delimiters or as a delimiter at the start or end of
       a record.

   9.  The US (U+001F) entity and the RS (U+001E) entity MUST NOT appear
       in the body of a field.
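
   The rules above can be sketched as a minimal serializer in Python.
   The function name "dump_ccsv" and its error handling are assumptions
   made for illustration; they are not defined by this document.

```python
US = "\x1f"  # Unit Separator, placed between fields (rules 3 and 4)
RS = "\x1e"  # Record Separator, placed between records (rule 5)


def dump_ccsv(header, records):
    """Serialize a header and records to CCSV bytes (hypothetical helper)."""
    rows = [list(header)] + [list(record) for record in records]
    width = len(rows[0])
    for row in rows:
        # Rule 7: every row has the same number of fields as the header.
        if len(row) != width:
            raise ValueError("row has %d fields, expected %d" % (len(row), width))
        for field in row:
            # Rule 9: a delimiter cannot appear inside a field.
            if US in field or RS in field:
                raise ValueError("field contains a delimiter character")
    # Rule 6: no RS is written after the last record.
    text = RS.join(US.join(row) for row in rows)
    # Rule 2: the output cannot start with U+FEFF.
    if text.startswith("\ufeff"):
        raise ValueError("output would begin with a byte order mark")
    # Rule 1: encode as UTF-8 (this codec does not prepend a BOM).
    return text.encode("utf-8")


# Line feeds inside a field and an empty field need no special handling.
data = dump_ccsv(["name", "note"], [["Ann", "line one\nline two"], ["Bob", ""]])
```

   Rejecting a field that contains a delimiter is one possible policy;
   an application could instead strip or replace such characters before
   serializing.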

   The ABNF grammar [STD68] appears as follows:

file = header RS *(record RS) [record]
header = name *( US name )
record = field *( US field )
name = field
field = *VCHAR
VCHAR = %x00-1D / %x20-D7FF / %xE000-10FFFF
        ; all characters except the designated delimiters
        ; and surrogates
RS = %x1E ; record separator
US = %x1F ; unit separator
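
   A reader that follows this grammar can be sketched in a few lines of
   Python.  The function name "parse_ccsv" and the choice to raise
   ValueError are assumptions for illustration only.

```python
US = "\x1f"  # unit separator
RS = "\x1e"  # record separator


def parse_ccsv(data):
    """Parse CCSV bytes into (header, records); hypothetical helper."""
    # Strict decoding raises UnicodeDecodeError on invalid UTF-8,
    # including encoded surrogates, which the grammar excludes.
    text = data.decode("utf-8")
    if text.startswith("\ufeff"):
        raise ValueError("file begins with a byte order mark")
    # A single trailing RS is permitted; drop it before splitting.
    if text.endswith(RS):
        text = text[:-1]
    rows = [row.split(US) for row in text.split(RS)]
    header, records = rows[0], rows[1:]
    for record in records:
        if len(record) != len(header):
            raise ValueError("record field count differs from header")
    return header, records


header, records = parse_ccsv(b"id\x1fname\x1e1\x1fAnn\x1e2\x1f")
```

   This sketch treats a final RS as a terminator rather than as the
   start of an empty last record.  For single-column data the grammar
   appears to allow either reading, so that choice is a design decision
   of the sketch.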

3.  Encoding Considerations

   CCSV files MUST be encoded using UTF-8 [RFC3629].

   Implementations MUST NOT add a byte order mark (U+FEFF) to the
   beginning of the file or of text transmitted over a network.
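
   As an illustration of the byte order mark requirement, the following
   Python sketch checks for the UTF-8 encoding of U+FEFF at the start of
   a byte string.  The helper name "has_utf8_bom" is hypothetical.
   Python's "utf-8-sig" codec is shown only because it prepends a BOM
   when encoding, which makes it unsuitable for producing CCSV.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if data starts with the UTF-8 form of U+FEFF."""
    return data.startswith(codecs.BOM_UTF8)


plain = "id\x1fname".encode("utf-8")       # no BOM is written
marked = "id\x1fname".encode("utf-8-sig")  # BOM is prepended

print(has_utf8_bom(plain))
print(has_utf8_bom(marked))
```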

3.1.  Why UTF-8?

3.1.1.  Compatibility

   UTF-8 is widely supported across different platforms, operating
   systems, and languages.  This ensures that CCSV files can be opened
   and read correctly regardless of the environment they are used in.

3.1.2.  Internationalization

   UTF-8 supports a vast range of characters from various languages,
   including those that use non-Latin scripts.  This is crucial for data
   that might include international names, addresses, or other text in
   multiple languages, ensuring that all characters are preserved and
   displayed correctly.

3.1.3.  Efficiency

   UTF-8 is a variable-width encoding scheme that uses 1 to 4 bytes for
   each character.  It is efficient for encoding text that is primarily
   in English, as it uses only one byte for the most common characters,
   but can still accommodate characters from other languages when
   needed.
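
   The variable width can be observed directly.  The Python sketch
   below reports the encoded length of a few sample characters; the
   helper name "utf8_width" and the sample characters are chosen for
   illustration.

```python
def utf8_width(ch: str) -> int:
    """Return the number of bytes UTF-8 uses for one character."""
    return len(ch.encode("utf-8"))


# Samples: a Latin letter, an accented letter, the euro sign, an emoji.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print("U+%04X" % ord(ch), utf8_width(ch))

# Both CCSV delimiters fall in the one-byte range.
delimiter_widths = [utf8_width("\x1e"), utf8_width("\x1f")]
```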

3.1.4.  Standardization

   By requiring UTF-8, the CCSV format ensures a standard way of
   encoding text, which simplifies processing, parsing, and exchanging
   files.  It helps in avoiding the complexities and potential errors
   that can arise from dealing with multiple encodings.

3.1.5.  Future-Proofing

   As the Internet and related technologies continue to evolve, UTF-8
   remains a robust and forward-compatible choice, ensuring that CCSV
   files remain accessible and usable in the long term.

4.  Security Considerations

   CCSV files alone are considered relatively harmless because the
   format prescribes no additional processing.  However, a recipient
   may parse and further process the file.  A receiving application
   that executes system-level commands built from strings contained in
   a CCSV file may be at risk.
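
   One way to reduce that risk, sketched here in Python under the
   assumption of a POSIX shell, is to quote any field before it is
   placed in a command string.  The helper name "shell_safe" and the
   hostile sample value are hypothetical.  Avoiding the shell
   altogether, for example by passing arguments as a list, is generally
   the safer design.

```python
import shlex


def shell_safe(field: str) -> str:
    """Quote a field so a POSIX shell treats it as one literal word."""
    return shlex.quote(field)


# A field value crafted to smuggle a second command into a shell line.
hostile = "report.txt; rm -rf ~"

# Unquoted, the shell would see two commands; quoted, it sees one word.
command = "cat " + shell_safe(hostile)
print(command)
```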

5.  Interoperability Considerations

   Adherence to the Formatting Rules (Section 2.1.1) and the Encoding
   Considerations (Section 3) ensures a high level of interoperability.

6.  IANA Considerations

   This section provides the media-type registration application (as per
   [RFC6838]).

   Type name: text

   Subtype name: ccsv

   Required parameters: N/A

   Optional parameters: N/A

   Encoding considerations: See Section 3

   Security considerations: See Section 4

   Interoperability considerations: See Section 5

   Published specification: TBD

   Applications that use this media type:

       Databases, spreadsheets, statistical programs, and data
       conversion utilities

   Fragment identifier considerations: N/A

   Additional information:

       Deprecated alias names for this type: N/A
       Magic number(s): N/A
       File extension(s): CCSV
       Macintosh file type code(s): TEXT

   Person & email address to contact for further information:

       Mike Rankin
       2108 Independence Dr
       Chambersburg, PA  17201
       USA

       mrankin@icf.com

   Intended usage: COMMON

   Restrictions on usage: N/A

   Author/Change controller: Mike Rankin

   Provisional registration?
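
   An application could associate the media type with a file extension
   as in the Python sketch below, which uses the standard "mimetypes"
   module.  The lowercase ".ccsv" spelling of the extension and the
   sample file name are assumptions for illustration.

```python
import mimetypes

# Assumption: ".ccsv" as the lowercase form of the extension listed above.
mimetypes.add_type("text/ccsv", ".ccsv")

# Look up the media type for a hypothetical file name.
media_type, _encoding = mimetypes.guess_type("export.ccsv")
print(media_type)
```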

7.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/rfc/rfc2119>.

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November
              2003, <https://www.rfc-editor.org/rfc/rfc3629>.

   [RFC4180]  Shafranovich, Y., "Common Format and MIME Type for Comma-
              Separated Values (CSV) Files", RFC 4180,
              DOI 10.17487/RFC4180, October 2005,
              <https://www.rfc-editor.org/rfc/rfc4180>.

   [RFC6838]  Freed, N., Klensin, J., and T. Hansen, "Media Type
              Specifications and Registration Procedures", BCP 13,
              RFC 6838, DOI 10.17487/RFC6838, January 2013,
              <https://www.rfc-editor.org/rfc/rfc6838>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/rfc/rfc8174>.

   [STD68]    Internet Standard 68,
              <https://www.rfc-editor.org/info/std68>.
              At the time of writing, this STD comprises the following:

              Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
              Specifications: ABNF", STD 68, RFC 5234,
              DOI 10.17487/RFC5234, January 2008,
              <https://www.rfc-editor.org/info/rfc5234>.

Acknowledgments

   TODO acknowledge.

Author's Address

   Mike Rankin
   Email: mrankin@oldgrognard.pub
