A deep dive into deeply recursive Go
In our previous blog post on Go security, we wrote about security issues fixed in 2024. One point we highlighted was related to the September security release of Go 1.22.7 and 1.23.1, containing fixes for three Denial-of-Service vulnerabilities. Stack exhaustion, the root cause of all three issues, is extremely common in code that is not explicitly written in a defensive manner. And in Go in particular, stack exhaustion is more dangerous than in most memory-safe languages because of the way it is handled by the runtime. When stack exhaustion happens, a crash is unavoidable.
Further investigation is warranted into why deep recursion is so dangerous in Go, how you typically end up with deeply recursive Go code, and what can be done to mitigate stack exhaustion issues.
Defining Deep Recursion
Some problems are inherently recursive. That’s not to say that such problems could not be solved iteratively, but that the recursive solution is much more intuitive, and often more readable. A simple example of this is the Fibonacci sequence, whose very definition is recursive: each element of the sequence is the sum of the two previous elements. It would be easy to write an iterative Go program that prints the sequence, but a recursive implementation is more intuitive because it maps to the definition so directly:
```go
// Fibonacci prints the Fibonacci sequence 1, 1, 2, 3, 5, ...
func Fibonacci(f ...uint64) {
	var next uint64 = 1
	if len(f) > 1 {
		next = f[len(f)-2] + f[len(f)-1]
	}
	fmt.Printf("%d, ", next)
	Fibonacci(append(f, next)...)
}
```
This function is an example of infinite recursion: it keeps printing numbers from the Fibonacci sequence until the end of time – or until some external force stops it. It is also an example of a tail-recursive function, where the recursive step is the very last statement in the function body. Such functions are easily converted to an iterative form, even automatically by compilers in a process called tail call optimization, or TCO. Unfortunately, Go does not support TCO, meaning the recursion does not get optimized away, which leads to the stack growing on every step of the sequence.
Detour: the stack and the heap
Go programs, like computer programs more generally, have access to two types of memory: the stack and the heap. A simple program that uses only local variables and primitive types might only access the stack, which is where all local variables are stored whenever possible. But for any more complex program, the heap is where most of your allocations happen in practice: everything allocated with make or new, the contents of all slices, and everything that cannot be stored on the stack safely. To determine exactly what can be stored on the stack and what needs to be placed on the heap, the Go compiler uses a process called escape analysis. The details are somewhat involved and can make it difficult to determine where a variable is stored simply by looking at the code.
All this matters because stack and heap memory are managed completely differently. The heap is the domain of the garbage collector, whereas the stack grows and shrinks using a much simpler mechanism.
Values on the stack are stored in a very orderly fashion, as the name suggests: Each goroutine has its own stack consisting of frames. Calling a function pushes a frame onto the stack and reserves space for all its arguments and local variables. Returning from a function pops the corresponding frame from the stack along with all associated values. This means that values are automatically discarded when they are no longer accessible from the scope in which they were created.
In the Fibonacci example, calling the function writes a stack frame containing the argument f, reserving some bytes of the stack. The local variable next escapes to the heap. The recursive step writes another frame of identical layout but new values, taking up some more space, and so forth, ad infinitum.
The stack is, of course, not infinite. It is bounded by the amount of physical memory and the operating system, but most importantly, it is bounded by the Go runtime that allocates the stack. A fresh goroutine always has an initial stack of fixed size, 2 KiB to be specific. Whenever a new frame no longer fits the allocated stack, a new stack of double the size is allocated and the old stack copied over. This happens until one of two limits is reached: the configurable maximum stack size, or the fixed architecture-dependent maximum stack ceiling. Reaching one of these limits is called stack exhaustion.
Finite deep recursion
By now, it should be obvious that the Fibonacci example does not run for an infinite amount of time – in fact, it crashes rather quickly. This is then clearly one form of deep recursion that leads to stack exhaustion. But recursion does not need to be infinite for it to be deep enough to be harmful.
With a slight modification of the example, limits to recursion can be introduced:
```go
// FibonacciUpToN prints the Fibonacci sequence 1, 1, 2, 3, 5, ...
// It stops after n elements.
func FibonacciUpToN(n int, f ...uint64) {
	if len(f) >= n {
		fmt.Println("...")
		return
	}
	var next uint64 = 1
	if len(f) > 1 {
		next = f[len(f)-2] + f[len(f)-1]
	}
	fmt.Printf("%d, ", next)
	FibonacciUpToN(n, append(f, next)...)
}
```
As long as the parameter n is fixed to a low enough value, this code works just fine. “Low enough” is, of course, tricky to determine: the value should be less than any system or runtime limits, but the units are different – the runtime measures stack depth in bytes, whereas the code only limits the number of recursive calls – and both system and runtime limits vary by platform.
In practice, exact limits should be established case by case. For example, the callers of our Fibonacci function might not have any reasonable need for values beyond the 50th element in the sequence, so just to be safe, we might fix n to 75, which is still well below anything that would cause issues. In fact, the 100th element would already overflow uint64, so the printed values would be incorrect by that point anyway. In this case, we’re lucky, since it’s easy to justify a low limit.
Often, the justification for limits is much trickier to find. One option is to leave the decision to the caller. In our Fibonacci example, we might fix the integer overflows and decide that the caller gets to choose how many elements they need, whether it’s 10, 100, or 10,000,000. The value n becomes part of the input, but the recursion is no longer infinite, because an explicit limit is in place. Offloading the decision to the caller does not remove the need for a decision to be made, however, and the caller is likely less equipped to understand the implications of the limit, or even that the function being called is recursive in the first place. The caller might also be reading the input from a file provided by a third party. Is it then the third party’s responsibility? What if the third party is malicious?
Calling FibonacciUpToN with the parameter value 10,000,000 will crash a Go program. But what about calling it with n set to 5,000,000 or 2,000,000? Just by reading the code, it’s difficult to tell. What’s clear is that if n is considered user input, FibonacciUpToN is still deeply recursive in the sense that it may produce harmful outcomes.
The Fibonacci example is still an easy one to handle, because the caller can establish through trial and error that 2,000,000 is a safe limit for the particular target platform and use case, then validate the input accordingly, and fail gracefully if the limit is exceeded. In the real world, measuring the size of the input itself can be challenging. A recursive parser might need to handle input files of several gigabytes in length but only tolerate recursion in the structure of the input file up to a point. The relevant “size” of the input is hence unknown until the actual moment of parsing each structure. We’ll dive into such examples next.
Real-world Deep Recursion
The example discussed so far has been somewhat artificial: why would anyone want to know the 2,000,000th Fibonacci number anyway? There’s no shortage of realistic examples, however. Often, recursion is used to parse recursive file formats, protocol wire formats, and general data serialization formats, of which we’ve hand-picked a few cases below.
Skipping XML in Go 1.18.3
XML is one prominent example of an inherently recursive data serialization format: XML elements can contain other XML elements, which, in turn, can contain more XML elements:
```xml
<Element>
  <AnotherElement>
    <YetAnother>
      <Element/>
    </YetAnother>
  </AnotherElement>
</Element>
```
The Go XML decoder, encoding/xml.Decoder, provides the method Skip, which can be used to discard an arbitrary element during parsing. In low-level usage, it can be called directly, but higher-level constructs like the xml.Unmarshal function also call it internally.
Prior to Go versions 1.18.4 and 1.17.12, a 20-megabyte input could crash any program that called Skip, including ones that only called it implicitly through Unmarshal. This was because Skip would be called for any unrecognized XML structures, and then recursively for every StartElement token encountered while skipping:
```go
func (d *Decoder) Skip() error {
	for {
		tok, err := d.Token()
		if err != nil {
			return err
		}
		switch tok.(type) {
		case StartElement:
			if err := d.Skip(); err != nil {
				return err
			}
		case EndElement:
			return nil
		}
	}
}
```
The recursion was only bounded by the length of the input and the depth of the recursive structures within it, i.e., the number of nested XML elements. The method was not tail-recursive, but was nevertheless easily converted to an iterative form, which fixed the issue:
```go
func (d *Decoder) Skip() error {
	var depth int64
	for {
		tok, err := d.Token()
		if err != nil {
			return err
		}
		switch tok.(type) {
		case StartElement:
			depth++
		case EndElement:
			if depth == 0 {
				return nil
			}
			depth--
		}
	}
}
```
Instead of implicitly using the stack to track the nesting depth of elements, the corrected code tracks the nesting depth explicitly in a local variable and never recurses – an elegant solution for this particular case.
Recursion limits in google.golang.org/protobuf
A more complex example can be found in the package google.golang.org/protobuf, where recursion limits were introduced in version 1.28.0 in response to our reports of crashes on large inputs. Like XML, the unmarshaling of the Protocol Buffer wire format is an inherently recursive process, and the Go unmarshaler is written in a deeply recursive manner. Up to version 1.27.1, a Go program unmarshaling a Protocol Buffer could be crashed by a long stream – about eight megabytes – of the byte 0x0B. One culprit was this code:
```go
func ConsumeFieldValue(num Number, typ Type, b []byte) (n int) {
	switch typ {
	/* ... */
	case StartGroupType:
		n0 := len(b)
		for {
			num2, typ2, n := ConsumeTag(b)
			/* ... */
			n = ConsumeFieldValue(num2, typ2, b)
			if n < 0 {
				return n // forward error code
			}
			b = b[n:]
		}
	/* ... */
	}
}
```
The value 0x0B would be read from b by ConsumeTag and processed such that typ2 would always be set to StartGroupType. The function ConsumeFieldValue would then recurse, reading another byte from b the same way as before, continuing for as long as stack was available.
As for the fix, the solution was to fall back to the behavior implemented in other languages. Java and C++ implementations already limited nesting depth to 100, so enforcing a similar limit was relatively safe in Go as well:
```diff
 func ConsumeFieldValue(num Number, typ Type, b []byte) (n int) {
+	return consumeFieldValueD(num, typ, b, DefaultRecursionLimit)
+}
+
+func consumeFieldValueD(num Number, typ Type, b []byte, depth int) (n int) {
 	switch typ {
 	/* ... */
 	case StartGroupType:
+		if depth < 0 {
+			return errCodeRecursionDepth
+		}
 		n0 := len(b)
 		for {
 			num2, typ2, n := ConsumeTag(b)
 			/* ... */
-			n = ConsumeFieldValue(num2, typ2, b)
+			n = consumeFieldValueD(num2, typ2, b, depth-1)
 			if n < 0 {
 				return n // forward error code
 			}
 			b = b[n:]
 		}
 	/* ... */
 	}
 }
```
The real challenge with this case was not managing the depth of the recursion, but the depth at which the library appears in a typical software project’s dependency hierarchy. Unlike the standard library or high-level libraries, Protocol Buffers are not something you often need to interface with directly. In our case, the library was used as part of another library we relied on to extract text from Apple Pages files. Propagating the fix through the transitive dependency to the direct dependency and finally to our product turned out to be time-consuming, so we chose to protect our customers by disabling the affected feature instead. Limiting the size of the input was not an option: word processing documents commonly exceed the eight megabytes required for the attack, and it was not possible to tell whether a document contained a malicious Protocol Buffer stream without parsing it.
This is a common challenge with deep recursion: because the parsers most prone to crashes are often low-level libraries, and mitigating the crash from the outside by limiting the input is often infeasible, application developers are not well-positioned to fix such issues even when they identify them.
Missing recursion depth check in go/parser
Even when recursion depth has been accounted for, consistently checking for limits can be tricky in code where recursive patterns are abundant. One example comes from CVE-2024-34155, a crash in go/parser, and its fix: the addition of a single missed check to code that was otherwise enforcing limits correctly:
```diff
 func (p *parser) parseLiteralValue(typ ast.Expr) ast.Expr {
+	defer decNestLev(incNestLev(p))
```
Here incNestLev increases the value of an internal counter by one, and once the function returns, decNestLev decreases it correspondingly. If the value ever goes over a pre-determined threshold, incNestLev panics. Why is it so easy to miss such an obvious check? Because the parser’s internal call graph looks like this:

(Figure: the call graph of go/parser, a dense web of mutually recursive parse functions.)
Anatomy of a Stack Exhaustion Crash
While Go didn’t always have a maximum stack limit – an explicit threshold that triggers a fatal error – one has existed since version 1.2, released in 2013. Since then, little has changed: the same limits deemed “implausibly large” in 2013 still remain. The larger maximum stack ceiling was introduced in 2020, but it only has a practical impact on error messages.
The newstack function, defined in the runtime’s stack.go, is responsible for enforcing stack limits. Whenever new stack needs to be allocated, the function first checks whether the new size is within both limits. If either limit is exceeded, the check throws.
```go
if newsize > maxstacksize || newsize > maxstackceiling {
	if maxstacksize < maxstackceiling {
		print("runtime: goroutine stack exceeds ", maxstacksize, "-byte limit\n")
	} else {
		print("runtime: goroutine stack exceeds ", maxstackceiling, "-byte limit\n")
	}
	print("runtime: sp=", hex(sp), " stack=[", hex(gp.stack.lo), ", ", hex(gp.stack.hi), "]\n")
	throw("stack overflow")
}
```
In Go, throw is not a keyword or a language construct like it is in many other languages. Instead, it is an error handling pattern used by some parts of the runtime internals.
Fatal errors
In panic.go, the Go runtime defines several types of errors, the most familiar of which are likely the standard panic and its various flavors: division by zero, out-of-bounds errors, nil pointer dereferences, and so forth. A panic is not what happens on stack exhaustion – instead, a fatal error is thrown. Fatal errors also come in flavors, but the particular flavor here is the one triggered by the function throw.
One of the main reasons for using fatal errors is that panicking is complicated and may itself grow the stack, which is, of course, not possible when the stack limit has already been reached. The throw function, on the other hand, is a nosplit function, meaning that it will never grow the stack.
The user-facing difference between panics and fatal errors is evident from the word “fatal”: unlike panics, fatal errors cannot be recovered from:
```go
if dopanic_m(gp, pc, sp) {
	// crash uses a decent amount of nosplit stack and we're already
	// low on stack in throw, so crash on the system stack (unlike
	// fatalpanic).
	crash()
}
```
Implications
When a Go program panics, the runtime unwinds the stack, executes every defer statement on the way, and crashes the program only if no recover call is encountered before reaching the end. When the runtime throws a fatal error, none of this happens. Not only is recovery impossible, but no other code in defer statements is executed either. Because deferred code is not run, data loss is highly likely: even well-guarded code might fail to save its state because it simply stops running.
Deep recursion is not only a temporary Denial-of-Service issue but can lead to files not being written to disk and even the system ending up in a corrupted state, which may enable other attacks beyond DoS.
Guarding against stack exhaustion
From the real-world examples above, two general approaches to mitigating deep recursion are easy to identify: eliminating recursion altogether by converting algorithms to iterative forms or explicitly tracking recursion depth and limiting it to a predetermined maximum. Both approaches have their pros and cons as well as different implementation considerations.
A third option is to use architectural design patterns that reduce the impact of crashes.
Preferring iteration
Many recursive algorithms, especially tail-recursive ones such as our original Fibonacci example, are easily converted to an iterative form:
```go
// IterativeFibonacci prints the Fibonacci sequence 1, 1, 2, 3, 5, ...
func IterativeFibonacci() {
	f := []uint64{}
	var next uint64 = 1
	for {
		if len(f) > 1 {
			next = f[len(f)-2] + f[len(f)-1]
		}
		fmt.Printf("%d, ", next)
		f = append(f, next)
	}
}
```
When this is possible without the loss of readability, it is often the best choice, as the iterative approach leaves no room for error – no risk of accidentally missing a check or implementing one incorrectly.
Sometimes achieving the iterative form requires tricks like explicitly constructing a “stack” using slices and storing it on the heap – an approach taken by many parsers, such as golang.org/x/net/html, which explicitly tracks several stacks and does its parsing by iteratively invoking a reference to a function instead of recursing.
Such tricks can, however, make the code difficult to read and difficult for static analysis tools to reason about.
Limiting recursion depth
There are several approaches to implementing recursion depth limits. Perhaps the most obvious one involves introducing an additional function parameter to track the depth and fail gracefully when a constant limit is exceeded:
```go
const maxDepth = 10_000

func DoRecursiveThing(a, b string, depth int) error {
	if depth > maxDepth {
		return errors.New("max depth exceeded")
	}
	/* ... */
	return DoRecursiveThing(a, b, depth+1)
}
```
An alternative to this is to let the caller specify a limit and count downwards. In such a case, the caller should be well-informed of the implications of setting the limit too high:
```go
func DoRecursiveThing(a, b string, maxDepth int) error {
	if maxDepth < 0 {
		return errors.New("max depth exceeded")
	}
	/* ... */
	return DoRecursiveThing(a, b, maxDepth-1)
}
```
When the caller does not need to be able to control the recursion depth, it is usually good practice to hide the depth parameter from the public API:
```go
const maxDepth = 10_000

func DoRecursiveThing(a, b string) error {
	return doRecursiveThingWithLimit(a, b, 0)
}

func doRecursiveThingWithLimit(a, b string, depth int) error {
	if depth > maxDepth {
		return errors.New("max depth exceeded")
	}
	/* ... */
	return doRecursiveThingWithLimit(a, b, depth+1)
}
```
When dealing with recursive methods and struct receivers, it’s also possible to track the depth on the struct instead:
```go
const maxDepth = 10_000

type Thing struct {
	depth int
}

func (t *Thing) DoRecursiveThing(a, b string) error {
	if t.depth > maxDepth {
		return errors.New("max depth exceeded")
	}
	/* ... */
	t.depth++
	return t.DoRecursiveThing(a, b)
}
```

With this approach, it’s important to notice that if the recursion consists of anything other than tail calls, depth needs to be tracked in both directions: increased when entering a function and decreased when leaving it. If there are multiple recursive paths, including the check in each function can become a lot of copy-paste, which is error-prone. The defer-based approach taken by go/parser is one solution:
```go
const maxDepth = 10_000

type Thing struct {
	depth int
}

func increaseDepth(t *Thing) *Thing {
	t.depth++
	if t.depth > maxDepth {
		panic("max depth exceeded")
	}
	return t
}
```
Regardless of the style chosen, the biggest risk with all of these approaches is forgetting to add the recursion check in some path where it’s required.
Architecting for crash tolerance and recovery
By choosing an appropriate software architecture, it’s also possible to reduce the impact of crashes caused by stack exhaustion. At Mattermost, we implement high-availability clusters where a single process crashing does not take down the entire system – instead, there’s always another process available to take over. With Kubernetes, a crash is also automatically recovered from in a timely manner, meaning that an attacker can’t take the entire cluster down one by one. Things like database writes are guarded with transactions to avoid corruption.
Nevertheless, such protections can only be a defense-in-depth measure and a reduction in impact – even the best possible architecture does not fully protect against crashes. A crash can still be visible to users, for example individual HTTP errors seen by those whose requests were being processed at the time of the crash by the same server.
Fixing the problem at its source
There is no technical reason why the Go runtime has to crash a program on stack exhaustion. A recoverable panic would be equally possible. That does not mean implementing the latter would be straightforward. On the contrary, managing available stack while already low on stack and panicking is obviously non-trivial. Yet making stack exhaustion recoverable would have significant security and reliability benefits to the whole Go ecosystem.
The question then becomes whether it’s worth implementing such a change. On one hand, there is the complexity of managing limited stack availability during panicking. On the other hand, there are numerous security issues already caused by the current behavior. Our answer is a solid maybe, which is why we want to open the topic for wider discussion. With that in mind, we have written a proposal on the Go issue tracker and hope to see comments there.
Appendix: Stack exhaustion bugs in the Go standard library
For a better understanding of the scale of the issue, below is a non-exhaustive sampling of stack exhaustion vulnerabilities fixed in the Go standard library in the past. At the time of writing, the 12 CVEs account for 9% of all CVEs ever published for Go.
| ID | Package | Description |
|---|---|---|
| Issue #10415 | encoding/gob | stack overflow |
| Issue #15879 | path/filepath | Glob with UNC path causes stack overflow |
| CVE-2022-1962 | go/parser | stack exhaustion in all Parse* functions |
| CVE-2022-24675 | encoding/pem | stack overflow in Decode |
| CVE-2022-24921 | regexp | stack exhaustion compiling deeply nested expressions |
| CVE-2022-28131 | encoding/xml | stack exhaustion in Decoder.Skip |
| CVE-2022-30630 | io/fs | stack exhaustion in Glob |
| CVE-2022-30631 | compress/gzip | stack exhaustion in Reader.Read |
| CVE-2022-30632 | path/filepath | stack exhaustion in Glob |
| CVE-2022-30633 | encoding/xml | stack exhaustion in Unmarshal |
| CVE-2022-30635 | encoding/gob | stack exhaustion in Decoder.Decode |
| CVE-2024-34155 | go/parser | stack exhaustion in all Parse* functions |
| CVE-2024-34156 | encoding/gob | stack exhaustion in Decoder.Decode |
| CVE-2024-34158 | go/build/constraint | stack exhaustion in Parse |
In third-party libraries with functionality similar to the affected packages here, stack exhaustion is, anecdotally, even more common.