[Search] [txt|html|xml|pdf|bibtex] [Tracker] [Email] [Nits]

Versions: 00                                                            
Network Working Group                                         C. Bormann
Internet-Draft                                    Universit├Ąt Bremen TZI
Intended status: Standards Track                             12 May 2021
Expires: 13 November 2021


                I-Regexp: An Interoperable Regexp Format
                   draft-bormann-jsonpath-iregexp-00

Abstract

   "Regular expressions" (regexps) are a set of related, widely
   implemented pattern languages used in data modeling formats and query
   languages that is available in many dialects.  This specification
   defines an interoperable flavor of regexps, I-Regexp.

   The present version -00 of this document is a trial balloon, meant to
   determine whether this approach is useful for the JSONPath WG.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 13 November 2021.

Copyright Notice

   Copyright (c) 2021 IETF Trust and the persons identified as the
   document authors.  All rights reserved.











Bormann                 Expires 13 November 2021                [Page 1]


Internet-Draft                  I-Regexp                        May 2021


   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Simplified BSD License text
   as described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
   2.  Requirements  . . . . . . . . . . . . . . . . . . . . . . . .   3
   3.  Subsetting XSD Regexps  . . . . . . . . . . . . . . . . . . .   4
   4.  Formal definition of I-Regexp . . . . . . . . . . . . . . . .   4
   5.  Mapping I-Regexp to Regexp Dialects . . . . . . . . . . . . .   6
     5.1.  XSD Regexps . . . . . . . . . . . . . . . . . . . . . . .   6
     5.2.  ECMAScript Regexps  . . . . . . . . . . . . . . . . . . .   6
     5.3.  PCRE, RE2, Ruby Regexps . . . . . . . . . . . . . . . . .   6
     5.4.  << Your kind of Regexp here >>  . . . . . . . . . . . . .   7
   6.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .   7
   7.  Security considerations . . . . . . . . . . . . . . . . . . .   7
   8.  References  . . . . . . . . . . . . . . . . . . . . . . . . .   7
     8.1.  Normative References  . . . . . . . . . . . . . . . . . .   7
     8.2.  Informative References  . . . . . . . . . . . . . . . . .   7
   Appendix A.  Regexps and Similar Constructs in Published RFCs . .   8
   Acknowledgements  . . . . . . . . . . . . . . . . . . . . . . . .   9
   Author's Address  . . . . . . . . . . . . . . . . . . . . . . . .   9

1.  Introduction

   Data modeling formats (YANG, CDDL) as well as query languages
   (jsonpath) often need a regular expression (regexp) sublanguage.
   There are many dialects of regular expressions in use in platforms,
   programming languages, and data modeling formats.

   While regular expressions originally were intended to provide a
   Boolean matching function, they have turned into parsing functions
   for many applications, with capture groups, greedy/lazy/possessive
   variants, etc.  Language features such as backreferences allow
   specifying languages that actually are context-free (Chomsky type 2)
   instead of the regular languages (Chomsky type 3) that regular
   expressions are named for.

   YANG (Section 9.4.5 of [RFC7950]) and CDDL (Section 3.8.3 of
   [RFC8610]) have adopted the regexp language from W3C Schema [XSD2].
   XSD regexp is a pure matching language, i.e., XSD regexps can be used
   to match a string against them and yield a simple true or false



Bormann                 Expires 13 November 2021                [Page 2]


Internet-Draft                  I-Regexp                        May 2021


   result.  XSD regexps are not as widely implemented as programming
   language regexp dialects such as those of Perl, Python, Ruby, Go
   [RE2], or JavaScript (ECMAScript) [ECMA-262].  The latter are often
   in a state of continuous development; in the best case (ECMAScript)
   there is a complete specification which however is highly complex
   (Section 21.2 of [ECMA-262] comprises 62 pages) and evolves on a
   yearly timeline, with significant additions.  Regexp dialects such as
   PCRE [PCRE2] have evolved to cover a common set of functions
   available in parsing regexp dialects, offered in a widely available
   library.

   With continuing accretion of complex features, parsing regexp
   libraries have become susceptible to bugs, in particular those that
   can be exploited in DoS attacks.  The library RE2 that is compatible
   with Go language regexps strives to be immune to DoS attacks, making
   it attractive to applications such as query languages where an
   attacker could control the input.  The problem remains that other
   bugs in such libraries can lead to exploitable vulnerabilities; at
   the time of writing, the Common Vulnerabilities and Exposures (CVE)
   system has 131 entries that mention the word "regex" [REGEX-CVE] (not
   all, but many of which are such bugs, with 23 matches for arbitrary
   code execution).

   Implementations of YANG and CDDL often struggle with providing true
   XSD regexps; some instead cheat by providing one of the parsing
   regexp varieties, sometime without even advertising this fact.

   A matching regexp that does not use the more complex XSD features
   (Section 3) can usually be converted into a parsing regexp of many
   dialects by surrounding it with anchors of that dialect (e.g., "^" or
   "\A" and "$" or "\z").  If the original matching regexps exceed the
   envelope of compatibility between dialects, this can lead to
   interoperability problems, or, worse, security vulnerabilities.
   Also, features of the target dialect such as capture groups may be
   triggered inadvertently, reducing performance.

   The present specification defines an interoperable regexp flavor for
   matching, I-Regexp.  This flavor is a subset of XSD regexps.  It also
   comes with defined rules for converting the regexp into common
   parsing regexp dialects.

2.  Requirements

   I-Regexps should handle the vast majority of practical cases where a
   matching regexp is needed in a data model specification or a query
   language expression.





Bormann                 Expires 13 November 2021                [Page 3]


Internet-Draft                  I-Regexp                        May 2021


   A brief survey of published RFCs yielded the regexp patterns in
   Appendix A (with no attempt at completeness).  These should be
   covered by I-Regexps, both syntactically and with their intended
   semantics.

3.  Subsetting XSD Regexps

   XSD Regexps are relatively easy to implement or map to widely
   implemented parsing regexp dialects, with a small number of notable
   exceptions:

   *  Character class subtraction.  This is a very useful feature in
      many specifications, but it is unfortunately mostly absent from
      parsing regexp dialects.

      -  *Issue*: This absence can be addressed by translating character
         class subtraction into positive character classes, or by
         leaving out subtraction.  The current draft opts for the
         latter, but that decision is up for discussion.

   *  Unicode.  While there is no doubt that a regexp flavor meant to
      last needs to be Unicode enabled, there are a number of aspects of
      this that need discussion.  First of all, predefined character
      classes such as "\w" may be meant to be ASCII only, or they may
      encompass all letters and digits defined in Unicode.  The latter
      is usually of interest in query languages, while the former is of
      interest to a subset of applications in data model specifications.
      Second, not all regexp implementations that one might want to map
      I-Regexps to will support accesses to Unicode tables that enable
      executing on constructs such as "\p{IsCoptic}".

      -  *Issue*: The ASCII focus can partially be addressed by adding a
         constraint that the matched text has to be ASCII in the first
         place.  This often is all that is needed where regexps are used
         to define lexical elements of a computer language.  The access
         to Unicode tables can simply be ruled out.  (Note that RFC 6643
         contains a lone instance of "\p{IsBasicLatin}{0,255}", which is
         needed to describe a transition from a legacy character set to
         Unicode.  The author believes that this would be a rare
         application and can be left out.  RFC2622 contans
         "[[:digit:]]", "[[:alpha:]]", "[[:alnum:]]", albeit in a
         specification for the "flex" tool; this is intended to be close
         to "\d", "\p{L}", "\w" in an ASCII subset.)

4.  Formal definition of I-Regexp

   I-Regexp is defined by the ABNF specification in Figure 1.




Bormann                 Expires 13 November 2021                [Page 4]


Internet-Draft                  I-Regexp                        May 2021


   i-regexp = branch *( "|" branch )
   branch = *piece
   piece = atom [ quantifier ]
   quantifier = ( %x2A-2B ; '*'-'+'
    / "?" ) / ( "{" quantity "}" )
   quantity = quantRange / quantMin / QuantExact
   quantRange = QuantExact "," QuantExact
   quantMin = QuantExact ","
   QuantExact = 1*%x30-39 ; '0'-'9'

   atom = NormalChar / charClass / ( "(" i-regexp ")" )
   NormalChar = ( %x00-27 / %x2C-2D ; ','-'-'
    / %x2F-3E ; '/'-'>'
    / %x40-5A ; '@'-'Z'
    / %x5E-7A ; '^'-'z'
    / %x7E-10FFFF )
   charClass = SingleCharEsc / charClassEsc / charClassExpr / WildcardEsc
   charClassExpr = "[" charGroup "]"
   charGroup = negCharGroup / posCharGroup
   posCharGroup = 1*charGroupPart [ bareHyphen ]
   negCharGroup = "^" posCharGroup
   charGroupPart = singleCharSeq / charRange / charClassEsc
   bareHyphen = "-"
   singleCharSeq = ( bareHyphen / singleCharNonHyphen ) *singleCharNonHyphen
   singleCharNonHyphen = SingleCharEsc / SingleCharNoEscHyphen
   singleChar = SingleCharEsc / SingleCharNoEsc
   charRange = singleChar "-" singleChar
   SingleCharNoEsc = ( %x00-5A / "\" / %x5E-10FFFF )
   SingleCharNoEscHyphen = ( %x00-2C / %x2E-5A ; '.'-'Z'
    / "\" / %x5E-10FFFF )
   SingleCharEsc = "\" ( %x28-2B ; '('-'+'
    / %x2D-2E ; '-'-'.'
    / "?" / "[" / %x5D-5E ; ']'-'^'
    / %s"n" / %s"r" / %s"t" / %x7B-7D ; '{'-'}'
    )
   charClassEsc = ( MultiCharEsc / catEsc / complEsc )
   catEsc = %s"\p{" charProp "}"
   complEsc = %s"\P{" charProp "}"
   charProp = IsCategory / IsBlock
   IsCategory = Letters / Marks / Numbers / Punctuation / Separators / Symbols / Others
   Letters = %s"L" [ ( %x6C-6D ; 'l'-'m'
    / %s"o" / %x74-75 ; 't'-'u'
    ) ]
   Marks = %s"M" [ ( %s"c" / %s"e" / %s"n" ) ]
   Numbers = %s"N" [ ( %s"d" / %s"l" / %s"o" ) ]
   Punctuation = %s"P" [ ( %x63-66 ; 'c'-'f'
    / %s"i" / %s"o" / %s"s" ) ]
   Separators = %s"Z" [ ( %s"l" / %s"p" / %s"s" ) ]



Bormann                 Expires 13 November 2021                [Page 5]


Internet-Draft                  I-Regexp                        May 2021


   Symbols = %s"S" [ ( %s"c" / %s"k" / %s"m" / %s"o" ) ]
   Others = %s"C" [ ( %s"c" / %s"f" / %x6E-6F ; 'n'-'o'
    ) ]
   IsBlock = %s"Is" 1*( "-" / %x30-39 ; '0'-'9'
    / %x41-5A ; 'A'-'Z'
    / %x61-7A ; 'a'-'z'
    )
   MultiCharEsc = "\" ( %x43-44 ; 'C'-'D'
    / %s"I" / %s"S" / %s"W" / %x63-64 ; 'c'-'d'
    / %s"i" / %s"s" / %s"w" )
   WildcardEsc = "."

                                  Figure 1

   *  *Issue*: This is essentially XSD regexp without character class
      subtraction.  There is probably potential for simplification in
      IsBlock (leave out) and possibly in the rather large part for
      IsCategory as well.  The ABNF has been automatically generated and
      maybe could use some polishing.  The ABNF has been verified
      against Appendix A, but a wider corpus of regular expressions
      should be examined.

5.  Mapping I-Regexp to Regexp Dialects

   (TBD; these mappings need to be thoroughly verified.)

5.1.  XSD Regexps

   Any I-Regexp also is an XSD Regexp [XSD2], so the mapping is an
   identify function.

5.2.  ECMAScript Regexps

   An I-Regexp, enveloped in "^" and "$", is an ECMAScript regexp
   [ECMA-262].  The performance can be increased by turning
   parenthesized regexps (production "atom") into "(?:...)"
   constructions.

5.3.  PCRE, RE2, Ruby Regexps

   An I-Regexp, enveloped in "\A" and "\z", is a valid regexp in PCRE
   [PCRE2], the Go programming language [RE2], and the Ruby programming
   language.  The performance can be increased by turning parenthesized
   regexps (production "atom") into "(?:...)" constructions.







Bormann                 Expires 13 November 2021                [Page 6]


Internet-Draft                  I-Regexp                        May 2021


5.4.  << Your kind of Regexp here >>

   (Please submit the mapping needed for your favorite kind of regexp.)

6.  IANA Considerations

   This document makes no requests of IANA.

7.  Security considerations

   TBD

   (Discuss security issues of regexp implementations, both DoS and RCE;
   this is covered in part in Section 1.)

8.  References

8.1.  Normative References

   [XSD2]     Biron, P. and A. Malhotra, "XML Schema Part 2: Datatypes
              Second Edition", World Wide Web Consortium Recommendation
              REC-xmlschema-2-20041028, 28 October 2004,
              <https://www.w3.org/TR/2004/REC-xmlschema-2-20041028>.

8.2.  Informative References

   [ECMA-262] Ecma International, "ECMAScript 2020 Language
              Specification", ECMA Standard ECMA-262, 11th Edition, June
              2020, <https://www.ecma-international.org/wp-
              content/uploads/ECMA-262.pdf>.

   [PCRE2]    "Perl-compatible Regular Expressions (revised API:
              PCRE2)", n.d., <http://pcre.org/current/doc/html/>.

   [RE2]      "RE2 is a fast, safe, thread-friendly alternative to
              backtracking regular expression engines like those used in
              PCRE, Perl, and Python. It is a C++ library.", n.d.,
              <https://github.com/google/re2>.

   [REGEX-CVE]
              "CVE - Search Results", n.d.,
              <https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=regex>.

   [RFC7493]  Bray, T., Ed., "The I-JSON Message Format", RFC 7493,
              DOI 10.17487/RFC7493, March 2015,
              <https://www.rfc-editor.org/info/rfc7493>.





Bormann                 Expires 13 November 2021                [Page 7]


Internet-Draft                  I-Regexp                        May 2021


   [RFC7950]  Bjorklund, M., Ed., "The YANG 1.1 Data Modeling Language",
              RFC 7950, DOI 10.17487/RFC7950, August 2016,
              <https://www.rfc-editor.org/info/rfc7950>.

   [RFC8610]  Birkholz, H., Vigano, C., and C. Bormann, "Concise Data
              Definition Language (CDDL): A Notational Convention to
              Express Concise Binary Object Representation (CBOR) and
              JSON Data Structures", RFC 8610, DOI 10.17487/RFC8610,
              June 2019, <https://www.rfc-editor.org/info/rfc8610>.

Appendix A.  Regexps and Similar Constructs in Published RFCs

   This appendix contains a number of regular expressions that have been
   extracted from published RFCs based on some ad-hoc matching.  Multi-
   line constructions were not included.  All regular expressions
   validate against the ABNF in Figure 1.

   rfc6021.txt  459 (([0-1](\.[1-3]?[0-9]))|(2\.(0|([1-9]\d*))))
   rfc6021.txt  513 \d*(\.\d*){1,127}
   rfc6021.txt  529 \d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?
   rfc6021.txt  631 ([0-9a-fA-F]{2}(:[0-9a-fA-F]{2})*)?
   rfc6021.txt  647 [0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5}
   rfc6021.txt  933 ((:|[0-9a-fA-F]{0,4}):)([0-9a-fA-F]{0,4}:){0,5}
   rfc6021.txt  938 (([^:]+:){6}(([^:]+:[^:]+)|(.*\..*)))|
   rfc6021.txt 1026 ((:|[0-9a-fA-F]{0,4}):)([0-9a-fA-F]{0,4}:){0,5}
   rfc6021.txt 1031 (([^:]+:){6}(([^:]+:[^:]+)|(.*\..*)))|
   rfc6020.txt 6647 [0-9a-fA-F]*
   rfc6095.txt 2544 \S(.*\S)?
   rfc6110.txt 1583 [aeiouy]*
   rfc6110.txt 3222 [A-Z][a-z]*
   rfc6536.txt 1583 \*
   rfc6536.txt 1632 [^\*].*
   rfc6643.txt  524 \p{IsBasicLatin}{0,255}
   rfc6728.txt 3480 \S+
   rfc6728.txt 3500 \S(.*\S)?
   rfc6991.txt  477 (([0-1](\.[1-3]?[0-9]))|(2\.(0|([1-9]\d*))))
   rfc6991.txt  525 \d*(\.\d*){1,127}
   rfc6991.txt  541 [a-zA-Z_][a-zA-Z0-9\-_.]*
   rfc6991.txt  542 .|..|[^xX].*|.[^mM].*|..[^lL].*
   rfc6991.txt  571 \d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?
   rfc6991.txt  665 ([0-9a-fA-F]{2}(:[0-9a-fA-F]{2})*)?
   rfc6991.txt  693 [0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5}
   rfc6991.txt  725 ([0-9a-fA-F]{2}(:[0-9a-fA-F]{2})*)?
   rfc6991.txt  743 [0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-
   rfc6991.txt 1041 ((:|[0-9a-fA-F]{0,4}):)([0-9a-fA-F]{0,4}:){0,5}
   rfc6991.txt 1046 (([^:]+:){6}(([^:]+:[^:]+)|(.*\..*)))|
   rfc6991.txt 1099 [0-9\.]*
   rfc6991.txt 1109 [0-9a-fA-F:\.]*



Bormann                 Expires 13 November 2021                [Page 8]


Internet-Draft                  I-Regexp                        May 2021


   rfc6991.txt 1164 ((:|[0-9a-fA-F]{0,4}):)([0-9a-fA-F]{0,4}:){0,5}
   rfc6991.txt 1169 (([^:]+:){6}(([^:]+:[^:]+)|(.*\..*)))|
   rfc7407.txt  933 ([0-9a-fA-F]){2}(:([0-9a-fA-F]){2}){0,254}
   rfc7407.txt 1494 ([0-9a-fA-F]){2}(:([0-9a-fA-F]){2}){4,31}
   rfc7758.txt  703 \d{2}:\d{2}:\d{2}(\.\d+)?
   rfc7758.txt 1358 \d{2}:\d{2}:\d{2}(\.\d+)?
   rfc7895.txt  349 \d{4}-\d{2}-\d{2}
   rfc7950.txt 8323 [0-9a-fA-F]*
   rfc7950.txt 8355 [a-zA-Z_][a-zA-Z0-9\-_.]*
   rfc7950.txt 8356 [xX][mM][lL].*
   rfc8040.txt 4713 \d{4}-\d{2}-\d{2}
   rfc8049.txt 6704 [A-Z]{2}
   rfc8194.txt  629 \*
   rfc8194.txt  637 [0-9]{8}\.[0-9]{6}
   rfc8194.txt  905 Z|[\+\-]\d{2}:\d{2}
   rfc8194.txt  963 (2((2[4-9])|(3[0-9]))\.).*
   rfc8194.txt  974 (([fF]{2}[0-9a-fA-F]{2}):).*
   rfc8299.txt 7986 [A-Z]{2}
   rfc8341.txt 1878 \*
   rfc8341.txt 1927 [^\*].*
   rfc8407.txt 1723 [0-9\.]*
   rfc8407.txt 1749 [a-zA-Z_][a-zA-Z0-9\-_.]*
   rfc8407.txt 1750 .|..|[^xX].*|.[^mM].*|..[^lL].*
   rfc8525.txt  550 \d{4}-\d{2}-\d{2}
   rfc8776.txt  838 /?([a-zA-Z0-9\-_.]+)(/[a-zA-Z0-9\-_.]+)*
   rfc8776.txt  874 ([a-zA-Z0-9\-_.]+:)*
   rfc8819.txt  311 [\S ]+
   rfc8944.txt  596 [0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){7}

         Figure 2: Example regular expressions extracted from RFCs

Acknowledgements

   This draft has been motivated by the discussion in the IETF JSONPATH
   WG about whether to include a regexp mechanism into the JSONPath
   query expression specification, as well as by previous discussions
   about the YANG "pattern" and CDDL ".regexp" features.

   The basic approach for this draft was inspired by The I-JSON Message
   Format [RFC7493].

Author's Address









Bormann                 Expires 13 November 2021                [Page 9]


Internet-Draft                  I-Regexp                        May 2021


   Carsten Bormann
   Universit├Ąt Bremen TZI
   Postfach 330440
   D-28359 Bremen
   Germany

   Phone: +49-421-218-63921
   Email: cabo@tzi.org











































Bormann                 Expires 13 November 2021               [Page 10]