Network Working Group                                             S. Rao
Internet-Draft                                                      Grab
Intended status: Experimental                                   S. Sahib
Expires: May 7, 2020                                            R. Guest
                                                              Salesforce
                                                        November 4, 2019


             Personal Information Tagging for Logs (PITFoL)
                          draft-rao-pitfol-00

Abstract

   Software applications typically generate a large amount of log data
   in the course of their operation to help with monitoring,
   troubleshooting, and similar tasks.  However, like all data
   generated and operated upon by software systems, logs can contain
   information sensitive to users.  Identifying and anonymizing
   personal data in logs is thus crucial to ensure that no personal
   data is inadvertently logged and retained, which could cause the
   logging application to run afoul of laws on storing private
   information.  This document explores mechanisms for marking
   personal or sensitive data in logs, so that any server collecting,
   processing, or analyzing logs can identify personal data and then
   apply redaction where required.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 7, 2020.



Rao, et al.                Expires May 7, 2020                  [Page 1]


Internet-Draft                   PITFoL                    November 2019


Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
   2.  Terminology . . . . . . . . . . . . . . . . . . . . . . . . .   3
   3.  Motivation and Use Cases  . . . . . . . . . . . . . . . . . .   3
   4.  Techniques  . . . . . . . . . . . . . . . . . . . . . . . . .   4
     4.1.  Field Level Tagging . . . . . . . . . . . . . . . . . . .   4
     4.2.  Log Level Tagging . . . . . . . . . . . . . . . . . . . .   4
   5.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .   5
   6.  Security Considerations . . . . . . . . . . . . . . . . . . .   5
   7.  Acknowledgements  . . . . . . . . . . . . . . . . . . . . . .   5
   8.  Normative References  . . . . . . . . . . . . . . . . . . . .   5
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .   6

1.  Introduction

   Personal data identification and redaction is crucial to make sure
   that a logging application is not storing and potentially leaking
   users' private information.  Known techniques can help discover
   and extract sensitive data.  For example, a regular expression or
   a set of lookup rules can be defined to match a person's name,
   credit card number, email address and so on.  In addition, models
   trained on data dictionaries and datasets can predict the presence
   of sensitive data.  However, what data is considered personal and
   sensitive is often subjective, provisional and contextual to the
   data source or the application processing the data, which makes it
   hard to rely on automated techniques to identify personal data.
   The challenges are summarized as follows:

   - What comprises personal data is often subjective and use-case
   specific.

   - There are many disparate types of personal data, and detecting
   them often requires a multitude of approaches.

   - There are no standards that govern the formats of sensitive data,
   which makes automation difficult for most common use cases.
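
   As a rough illustration of the regular-expression approach
   mentioned above, the following Python sketch scans a log line for
   candidate email addresses and IPv4 addresses.  The pattern table
   PII_PATTERNS and the function find_pii_candidates are hypothetical
   names introduced here, and the patterns are deliberately simple.
   As the challenges listed above suggest, real deployments would
   likely need broader, use-case specific rules.

```python
import re

# Hypothetical, deliberately simple patterns (not defined by this
# document). OCTET matches a decimal number in the range 0-255.
OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
PII_PATTERNS = {
    "email": re.compile(
        r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
    ),
    "ipv4": re.compile(r"\b" + OCTET + r"(?:\." + OCTET + r"){3}\b"),
}


def find_pii_candidates(text):
    """Return a sorted list of (kind, matched_text) pairs."""
    hits = []
    for kind, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((kind, match.group(0)))
    return sorted(hits)
```

   A match only marks a candidate.  Whether the value is actually
   personal data remains a decision for the logging application.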

   Once personal information is identified, it has to be
   appropriately tagged.  Personal data tagging is especially important
   in cases where log data is flowing in from disparate sources.  In
   cases where tagging at the source is not possible (e.g., log data
   generated by a legacy IoT device, web server or firewall), a
   centralized logging server can be tasked with making sure the log
   data is tagged before it is passed downstream.  Once the logs are
   tagged, the logging application can use anonymization techniques to
   redact the fields appropriately.  This document focuses on the
   tagging aspect of log redaction.

2.  Terminology

   Personal data: RFC 6973 [RFC6973] defines personal data as "any
   information relating to an individual who can be identified, directly
   or indirectly."  This typically includes information such as IP
   addresses and usernames.  However, what counts as personal data
   depends heavily on what other information is available, the
   jurisdiction of operation and other such factors.  Hence, this
   document does not prescriptively list which log fields contain
   personal data.  It focuses instead on what a tagging mechanism
   would look like once a logging application has determined which
   fields it considers to hold personal data.

3.  Motivation and Use Cases

   Most systems, such as network devices, web servers and application
   services, record information about user activity, transactions,
   network flows, etc., as log data.  Logs are useful for various
   purposes such as security monitoring, application debugging and
   operational maintenance.  In addition, organizations export or
   share logs with third-party log analyzers for purposes such as
   security incident response, monitoring and business analytics,
   where logs can be a valuable source of information.  In such
   cases, there are concerns about potential exposure of personal
   data to unintended systems or recipients.  This document explores
   techniques for tagging logs to aid identification of personal data.

4.  Techniques

   Once personal data is identified via manual detection, dictionaries
   or models trained on datasets, the log is annotated with tag
   information at either the field level or the log level.

   This is an example of a log message in RFC 3164 [RFC3164] format.  We
   can imagine that a logging application determines that user_name,
   err_user and ip_addr are fields that can contain sensitive personal
   data.

   <120> Nov 16 16:00:00 10.0.1.11 ABCDEFG: [AF@0 event="AF-Authority
   failure" violation="A-Not authorized to object" actual_type="AF-A"
   jrn_seq="1001363" timestamp="20120418163258988000"
   job_name="QPADEV000B" user_name="XYZZY" job_number="256937"
   err_user="TESTFORAF" ip_addr="10.0.1.21" port="55875"
   action="Undefined(x00)" val_job="QPADEV000B" val_user="XYZZY"
   val_jobno="256937" object="TEST" object_library="CUS9242"
   object_type="*FILE" pgm_name="" pgm_libr="" workstation=""]
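
   To make the structure of this message concrete, the following
   Python sketch splits the name="value" pairs into a dictionary and
   picks out the three fields named above.  The helper names and the
   pattern are assumptions made for illustration.  The pattern simply
   mirrors the quoting used in this example and is not a complete
   syslog parser.

```python
import re

# Matches name="value" pairs as they are quoted in the example.
PAIR_RE = re.compile(r'(\w+)="([^"]*)"')

# Fields the (hypothetical) application has deemed personal.
PII_FIELDS = ("user_name", "err_user", "ip_addr")


def parse_fields(message):
    """Return a dict mapping each attribute name to its value."""
    return {name: value for name, value in PAIR_RE.findall(message)}


def personal_fields(message, pii_fields=PII_FIELDS):
    """Return only the attributes that are listed in pii_fields."""
    fields = parse_fields(message)
    return {n: fields[n] for n in pii_fields if n in fields}
```

   Fields that are absent from a given message are simply skipped, so
   the same field list can be applied to heterogeneous log sources.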

4.1.  Field Level Tagging

   In the field-level tagging method, each identified <attribute,
   value> field is tagged with a "pii_data=true" attribute marking the
   field as sensitive or personal.  In the log-level tagging approach
   (Section 4.2), the personal fields are instead listed in a single
   metadata attribute (for example "pii" or "pii_name") whose value is
   a list of one or more field names deemed sensitive or personal.

   Using the field-level tagging technique, a log message of this kind
   can be transformed as follows:

   <120> Apr 18 16:32:58 10.0.1.11 QAUDJRN: [AF@0 event="AF-Authority
   failure" violation="A-Not authorized to object" actual_type="AF-A"
   jrn_seq="1001363" timestamp="20120418163258988000"
   job_name="QPADEV000B" {user_name="XYZZY" pii_data="true"}
   job_number="256937" {err_user="XYZZY" pii_data="true"}
   {ip_addr="10.0.1.21" pii_data="true"} port="55875"
   action="Undefined(x00)" val_job="QPADEV000B" val_jobno="256937"
   object="TEST" object_library="CUS9242" object_type="*FILE"
   pgm_name="" pgm_libr="" workstation=""]
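
   A field-level tagger along these lines could be sketched in Python
   as follows.  The brace-delimited grouping follows the way user_name
   is tagged in the example above.  The helpers tag_fields and
   redact_tagged are hypothetical, and the consumer is assumed to
   replace every tagged value with a fixed mask.

```python
import re

PAIR_RE = re.compile(r'(\w+)="([^"]*)"')
TAGGED_RE = re.compile(r'\{(\w+)="[^"]*" pii_data="true"\}')


def tag_fields(message, pii_fields):
    """Wrap each listed field as {name="value" pii_data="true"}."""
    def wrap(match):
        if match.group(1) in pii_fields:
            return "{" + match.group(0) + ' pii_data="true"}'
        return match.group(0)
    return PAIR_RE.sub(wrap, message)


def redact_tagged(message, mask="REDACTED"):
    """Replace every tagged value with mask and drop the tag."""
    def unwrap(match):
        return match.group(1) + '="' + mask + '"'
    return TAGGED_RE.sub(unwrap, message)
```

   Note that tag_fields is meant for untagged input.  Running it
   twice on the same message would wrap the tagged fields again.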

4.2.  Log Level Tagging

   Using the log-level tagging technique, the same log message can
   be tagged as follows:

   <120> Apr 18 16:32:58 10.0.1.11 QAUDJRN: [AF@0 event="AF-Authority
   failure" violation="A-Not authorized to object" actual_type="AF-A"
   jrn_seq="1001363" timestamp="20120418163258988000"
   job_name="QPADEV000B" user_name="XYZZY" job_number="256937"
   err_user="XYZZY" ip_addr="10.0.1.21" port="55875"
   action="Undefined(x00)" val_job="QPADEV000B" val_jobno="256937"
   object="TEST" object_library="CUS9242" object_type="*FILE"
   pgm_name="" pgm_libr="" workstation="", pii="user_name,err_user,
   ip_addr"]

   Here, a new metadata field, "pii", is added to the MSG part of the
   syslog message.  Its value lists the names of the fields that hold
   personal data.
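
   As a sketch of how such a tag might be produced and consumed, the
   Python functions below append a comma-separated "pii" attribute
   before the closing bracket, as in the example above, and read it
   back into a list of field names.  Both function names are
   hypothetical, and the sketch assumes the message ends with "]".

```python
import re

PAIR_RE = re.compile(r'(\w+)="([^"]*)"')
PII_TAG_RE = re.compile(r'\bpii="([^"]*)"')


def add_log_level_tag(message, pii_fields):
    """Append pii="f1,f2" listing the given fields that are present."""
    if not message.endswith("]"):
        raise ValueError("expected the message to end with ']'")
    present = {name for name, _ in PAIR_RE.findall(message)}
    tagged = [name for name in pii_fields if name in present]
    return message[:-1] + ', pii="' + ",".join(tagged) + '"]'


def read_log_level_tag(message):
    """Return the field names listed in the pii attribute, if any."""
    match = PII_TAG_RE.search(message)
    if match is None:
        return []
    names = match.group(1).split(",")
    return [n.strip() for n in names if n.strip()]
```

   Stripping whitespace when reading the tag lets a consumer tolerate
   a value that has been wrapped across lines, as in the example.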

   A more complicated example follows.  It supports redacting
   different fields in different ways, as required by the privacy
   preservation policy.

   <120> Apr 18 16:32:58 10.0.1.11 QAUDJRN: [AF@0 event="AF-Authority
   failure" violation="A-Not authorized to object" actual_type="AF-A"
   jrn_seq="1001363" timestamp="20120418163258988000"
   job_name="QPADEV000B" user_name="XYZZY" job_number="256937"
   err_user="XYZZY" ip_addr="10.0.1.21" port="55875"
   action="Undefined(x00)" val_job="QPADEV000B" val_jobno="256937"
   object="TEST" object_library="CUS9242" object_type="*FILE"
   pgm_name="" pgm_libr="" workstation="",
   pii_name="user_name,err_user", pii_ipaddr="ip_addr"]

   Here, the log data is tagged with "pii_name" and "pii_ipaddr"
   attributes that identify the sensitive data in the log at a more
   granular level.
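
   One way a consumer might act on these granular tags is sketched
   below in Python.  The policy table and both redaction functions
   are assumptions made for illustration: names are replaced with a
   truncated SHA-256 digest and IP addresses have their last octet
   zeroed.  This document does not prescribe either technique, and an
   unsalted digest is shown only to keep the sketch short.

```python
import hashlib
import re

PAIR_RE = re.compile(r'(\w+)="([^"]*)"')


def mask_ipv4(value):
    """Keep the first three octets and zero the last one."""
    parts = value.split(".")
    if len(parts) != 4:
        return "REDACTED"
    return ".".join(parts[:3] + ["0"])


def pseudonymize(value):
    """Replace a value with a short, stable SHA-256 based token."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


# Hypothetical policy: tag attribute name -> redaction function.
POLICY = {"pii_name": pseudonymize, "pii_ipaddr": mask_ipv4}


def redact_by_policy(message, policy=POLICY):
    """Redact each field listed under a tag with that tag's function."""
    pairs = dict(PAIR_RE.findall(message))
    actions = {}
    for tag, redact in policy.items():
        for field in pairs.get(tag, "").split(","):
            if field.strip():
                actions[field.strip()] = redact

    def apply(match):
        name, value = match.group(1), match.group(2)
        if name in actions:
            return name + '="' + actions[name](value) + '"'
        return match.group(0)

    return PAIR_RE.sub(apply, message)
```

   The tag attributes themselves are left in place, so a downstream
   system can still see which fields were treated as personal.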

5.  IANA Considerations

   We can consider defining a Structured Data ID for PII to specify
   various structured parameters.

6.  Security Considerations

   TBD

7.  Acknowledgements

   TBD

8.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC3164]  Lonvick, C., "The BSD Syslog Protocol", RFC 3164,
              DOI 10.17487/RFC3164, August 2001,
              <https://www.rfc-editor.org/info/rfc3164>.

   [RFC6973]  Cooper, A., Tschofenig, H., Aboba, B., Peterson, J.,
              Morris, J., Hansen, M., and R. Smith, "Privacy
              Considerations for Internet Protocols", RFC 6973,
              DOI 10.17487/RFC6973, July 2013,
              <https://www.rfc-editor.org/info/rfc6973>.

Authors' Addresses

   Sandeep Rao
   Grab
   Bangalore
   India

   Email: sandeeprao.ietf@gmail.com


   Shivan Sahib
   Salesforce

   Email: shivankaulsahib@gmail.com


   Ryan Guest
   Salesforce

   Email: rguest@salesforce.com