Once upon a time, I was going to take what I learned writing the custom syntax-highlighting editing component of Grail Diary and turn it into a stand-alone tutorial on text editing. I ran out of time to work on that after writing one item: An overview of how you can take a custom data structure for text editing (a piece table) and give it a natural API by conforming to Swift Collection
. I don’t want this material to die, so I’ve moved it over here.
The Theory
What’s so hard about editing text? Let’s ignore for the moment the problems with even storing Unicode text, with its encodings, multi-byte characters, etc. If you put those considerations aside, the abstract model for a text file is an array of characters. An array is about as simple a data structure as you can get. What’s the problem?
The answer, of course, is that changing things in an array can be expensive. Appending to or removing from the end of an array is cheap. Any other operation, though, means copying elements to make room for the new element (or to remove existing elements). And of course in a text editor, you want to make changes all throughout the text, not just at the end. That’s kind of the point. If your editor’s main data structure for text is “an array of characters”, it’s doing a ton of memory copying on every keystroke whenever the cursor is anywhere but the very end of the file.
So we need something better. But what?
One option is to store the file as a linked list of lines, and each line is an array of characters. You still need to do copying as you insert and remove characters, but you’re now only copying characters on the same line instead of all characters to the end of the file. If you’re implementing a source code editor, where you can assume that lines are all of a reasonable maximum length, you can get far with this approach.
Next up in sophistication is a data structure known as a gap buffer. The main idea behind a gap buffer is that edits to a text file aren’t randomly distributed throughout the text file — they exhibit a lot of locality. If you insert a character at offset 42 in the file, the next insertion is much more likely to be at offset 43 than any other location, and the next deletion is likely to be at offset 42 than any other location. Basically, where the cursor is is where edits are likely to be. A gap buffer makes edits at the cursor really cheap, but you pay a cost to move the cursor.
A gap buffer does this by storing the text in an array that’s much larger than what’s needed to store the text. This gives you a lot of free space inside the array (the “gap”), and the key insight is you can pay a cost to move the gap to the location of the cursor to make insertions and deletions at the cursor really cheap.
While you can implement world-class editors with a gap buffer, for Scrap Paper we’re going to use a third approach, called a Piece Table. Remember how we said that appending to the end of an array is cheap? The piece table exploits that by keeping two arrays. One read-only array contains the original file contents. The second append-only array contains all characters inserted at any time during the editing session. Finally, the piece table tells you how to build the file as a sequence of “pieces” from the different arrays.
Just as with the gap buffer, a piece table works efficiently because most edits to a text file are localized. When you insert character after character into the same spot, you’ll end up with a pretty compact representation of the “pieces” constructed from the two arrays. For example, I edited this file in a version of Scrap Paper that recorded all of the changes that I made to the text file (backspaces and all). At the end of my editing session of 2276 individual edits, I had 48 pieces representing the contents of the file.
One more bit of theory: String
, NSString
, and unicode
I glossed over the challenges of representing text earlier. It’s now time to pay a little attention to that.
- The Swift
String
struct and the Objective-C NSString
class made different engineering choices about how to store and model strings. Swift models its strings as an array of “characters” and encodes those characters in UTF-8. The NSString
class, in contrast, does not expose individual Unicode characters, and it uses UTF-16 encoding internally.
- The TextKit classes are from the NSString era.
- Since we will be interfacing a lot with TextKit, we’re going to use the NSString convention and model our text as an array of UTF-16 code points.
Let’s build a Piece Table!
With the theory out of the way, it’s time to do some building.
public struct PieceTable {
private let originalContents: [unichar]
private var addedContents: [unichar]
private enum PieceSource {
case original
case added
}
private struct Piece {
let source: PieceSource
var startIndex: Int
var endIndex: Int
}
private var pieces: [Piece]
public init(_ string: String) {
self.originalContents = Array(string.utf16)
self.addedContents = []
self.pieces = [Piece(source: .original, startIndex: 0, endIndex: originalContents.count)]
}
}
This code defines the stored properties we need for a piece table:
originalContents
is the read-only copy of the characters from the file we are trying to edit.
addedContents
is an append-only array of all characters added during an edit session.
pieces
describes the logical contents of the file as a series of contiguous characters from either originalContents
or addedContents
.
Conforming to Collection
To make PieceTable
feel Swift-y, we’re going to make it conform to a few standard protocols. First: Collection
— this will let users read characters from a piece table as easily reading characters from an array. In Swift, a Collection
is a data structure that contains elements that can be accessed by an index. If you’ve used arrays in Swift, you’ve used a collection.
The Collection
protocol is big. While it contains over 30 methods, most of those have default implementations. To create a custom Collection
, this is all you need to implement:
protocol Collection {
associatedtype Element
associatedtype Index: Comparable
var startIndex: { get }
var endIndex: { get }
func index(after position: Index) -> Index
subscript(position: Index) -> Element
}
If your only exposure to Collection
has been through arrays, you may have assumed that the index needs to be an integer. Not so! The Collection
protocol gives implementations a ton of flexibility about the index type. You can use any type so long as:
- You can efficiently return the index of the first element of the collection
- You can efficiently return the index that means “you’ve moved past the last element of the collection”
- Given an index, you can efficiently return the next index in the collection.
For our piece table, we are going to need a custom index type. To find a character in the piece table, we will use two values: The index of the piece in the pieces
table, and then the index of the character within the contents array. With this information, we can easily figure out the character at an index (use the piece index find the correct contents array, then return the character at the correct character index). It’s a tiny bit more complicated to figure out the index that comes after the next index, because you have to consider two cases: If the current index represents a character at the end of a piece, you have to move to the next piece; otherwise you move to the next character in the current piece.
With this overview, here is the minimal code to have a piece table conform to Collection
:
extension PieceTable: Collection {
public struct Index: Comparable {
let pieceIndex: Int
let contentIndex: Int
public static func < (lhs: PieceTable.Index, rhs: PieceTable.Index) -> Bool {
if lhs.pieceIndex != rhs.pieceIndex {
return lhs.pieceIndex < rhs.pieceIndex
}
return lhs.contentIndex < rhs.contentIndex
}
}
public var startIndex: Index { Index(pieceIndex: 0, contentIndex: pieces.first?.startIndex ?? 0) }
public var endIndex: Index { Index(pieceIndex: pieces.endIndex, contentIndex: 0) }
public func index(after i: Index) -> Index {
let piece = pieces[i.pieceIndex]
if i.contentIndex + 1 < piece.endIndex {
return Index(pieceIndex: i.pieceIndex, contentIndex: i.contentIndex + 1)
}
let nextPieceIndex = i.pieceIndex + 1
if nextPieceIndex < pieces.endIndex {
return Index(pieceIndex: nextPieceIndex, contentIndex: pieces[nextPieceIndex].startIndex)
} else {
return Index(pieceIndex: nextPieceIndex, contentIndex: 0)
}
}
private func sourceArray(for source: PieceSource) -> [unichar] {
switch source {
case .original:
return originalContents
case .added:
return addedContents
}
}
public subscript(position: Index) -> unichar {
let sourceArray = self.sourceArray(for: pieces[position.pieceIndex].source)
return sourceArray[position.contentIndex]
}
}
Conforming to RangeReplaceableCollection
We can now iterate through the contents of a PieceTable
. However, we don’t have a way to modify the contents of the PieceTable
. To add this capability, we are going to make PieceTable
conform to RangeReplaceableCollection
. This protocol has a single required method, replaceSubrange(_:with:)
. If you implement this method, you get a ton of other APIs for free.
For our implementation of replaceSubrange
, we have to do two high-level jobs:
- Append the new characters to the end of
addedContents
. Remember, in a piece table, we only ever add characters — never delete — and they always get added to the end of the array. This is the easy part.
- The hard part: Update
pieces
to reflect the new contents of the file. The performance of the piece table will depend on how many entries are in pieces
, so we need to take care to avoid creating unneeded items.
This implementation manages the complexity of updating the pieces
array by creating a stand-alone change description that contains the new piece table entries. When constructing the change description, the implementation adheres to two rules to minimize the size of the pieces
array:
- No empty pieces! If an edit creates a
Piece
with no characters, it’s removed.
- If it is possible to coalesce two neighboring pieces into one, do it.
Here is the code that adds conformance to RangeReplaceableCollection
:
extension PieceTable: RangeReplaceableCollection {
private struct ChangeDescription {
private(set) var values: [Piece] = []
var lowerBound: Int?
var upperBound: Int?
mutating func appendPiece(_ piece: Piece) {
guard !piece.isEmpty else { return }
if let last = values.last, last.source == piece.source, last.endIndex == piece.startIndex {
values[values.count - 1].endIndex = piece.endIndex
} else {
values.append(piece)
}
}
}
private func safelyAddToDescription(
_ description: inout ChangeDescription,
modifyPieceAt index: Int,
modificationBlock: (inout Piece) -> Void
) {
guard pieces.indices.contains(index) else { return }
var piece = pieces[index]
modificationBlock(&piece)
description.lowerBound = description.lowerBound.map { Swift.min($0, index) } ?? index
description.upperBound = description.upperBound.map { Swift.max($0, index) } ?? index
description.appendPiece(piece)
}
mutating private func applyChangeDescription(_ changeDescription: ChangeDescription) {
let range: Range<Int>
if let minIndex = changeDescription.lowerBound, let maxIndex = changeDescription.upperBound {
range = minIndex ..< maxIndex + 1
} else {
range = pieces.endIndex ..< pieces.endIndex
}
pieces.replaceSubrange(range, with: changeDescription.values)
}
public mutating func replaceSubrange<C, R>(
_ subrange: R,
with newElements: C
) where C: Collection, R: RangeExpression, unichar == C.Element, Index == R.Bound {
let range = subrange.relative(to: self)
var changeDescription = ChangeDescription()
safelyAddToDescription(&changeDescription, modifyPieceAt: range.lowerBound.pieceIndex - 1) { _ in
}
safelyAddToDescription(&changeDescription, modifyPieceAt: range.lowerBound.pieceIndex) { piece in
piece.endIndex = range.lowerBound.contentIndex
}
if !newElements.isEmpty {
let index = addedContents.endIndex
addedContents.append(contentsOf: newElements)
let addedPiece = Piece(source: .added, startIndex: index, endIndex: addedContents.endIndex)
changeDescription.appendPiece(addedPiece)
}
safelyAddToDescription(&changeDescription, modifyPieceAt: range.upperBound.pieceIndex) { piece in
piece.startIndex = range.upperBound.contentIndex
}
applyChangeDescription(changeDescription)
}
}
Does it make a difference?
For large file sizes, yes!
I gathered a trace of all of the edits I made to a text buffer for a couple of minutes of editing a file. I then replayed that trace on an NSTextStorage
object and on a PieceTable
, timing how long it took to perform all of the edits on files of different sizes. This is the result:
For the NSTextStorage
, the time to perform the edits increases linearly with the file size. For PieceTable
, however, the time to perform the edits is independent of file size; PieceTable
operations will get slower as the complexity of the edit history increases.