Sourcebot

In Zoekt, shards are managed in memory by the shardedSearcher struct, which holds a collection of rankedShard pointers. Each rankedShard represents a single shard and contains a zoekt.Searcher instance, which is responsible for the actual search operations on that shard's data.

Here's a breakdown of how shards are laid out in memory:

Shard Management: `shardedSearcher`

The shardedSearcher struct, defined in shards/shards.go (shards.go:201-213), is the primary component for managing shards in memory. It contains:

- A map of rankedShard pointers, shards map[string]*rankedShard (shards.go:209), keyed by the file path of the shard.
- A sorted slice of rankedShard pointers, held in ranked atomic.Value (shards.go:212). The slice is sorted by priority to optimize search performance.
- A scheduler that limits the number of parallel queries.
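The map-plus-sorted-slice layout described above can be sketched in a few lines of Go. This is a simplified illustration, not Zoekt's actual code: the shardSet type, the Searcher stand-in interface, the namedSearcher helper, the priority field, and the replace and snapshot methods are all hypothetical, and the descending sort order is an assumption.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"sync/atomic"
)

// Searcher is a hypothetical stand-in for zoekt.Searcher.
type Searcher interface {
	Name() string
}

// namedSearcher is a toy Searcher used only for this sketch.
type namedSearcher string

func (n namedSearcher) Name() string { return string(n) }

// rankedShard pairs a searcher with a priority used for ordering.
// The priority field is an assumption made for illustration.
type rankedShard struct {
	Searcher
	priority float64
}

// shardSet is a simplified, hypothetical analogue of shardedSearcher.
type shardSet struct {
	mu     sync.Mutex
	shards map[string]*rankedShard // keyed by shard file path
	ranked atomic.Value            // holds a []*rankedShard sorted by priority
}

func newShardSet() *shardSet {
	s := &shardSet{shards: map[string]*rankedShard{}}
	s.ranked.Store([]*rankedShard{})
	return s
}

// replace adds or swaps the shard stored under path, then publishes a
// freshly sorted slice so readers never observe a half-updated ordering.
func (s *shardSet) replace(path string, shard *rankedShard) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.shards[path] = shard

	ranked := make([]*rankedShard, 0, len(s.shards))
	for _, sh := range s.shards {
		ranked = append(ranked, sh)
	}
	// Assumed ordering: highest priority first, name as a tie-breaker.
	sort.Slice(ranked, func(i, j int) bool {
		if ranked[i].priority != ranked[j].priority {
			return ranked[i].priority > ranked[j].priority
		}
		return ranked[i].Name() < ranked[j].Name()
	})
	s.ranked.Store(ranked)
}

// snapshot returns the current sorted slice without taking the lock.
func (s *shardSet) snapshot() []*rankedShard {
	return s.ranked.Load().([]*rankedShard)
}

func main() {
	s := newShardSet()
	s.replace("/index/a.zoekt", &rankedShard{namedSearcher("a"), 1})
	s.replace("/index/b.zoekt", &rankedShard{namedSearcher("b"), 5})
	for _, sh := range s.snapshot() {
		fmt.Println(sh.Name(), sh.priority)
	}
}
```

Keeping the sorted slice behind an atomic.Value lets searches read a consistent snapshot without holding the mutex, while updates rebuild and swap the slice as a whole.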

Shards are loaded from disk via the loadShard function (shards.go:1200-1217), which in turn calls zoekt.NewSearcher to create a Searcher for the shard's index file.

In-Memory Shard Representation: `indexData`

The zoekt.Searcher interface is implemented by the indexData struct, defined in read.go. The NewSearcher function (read.go:564-577) reads an IndexFile and populates an indexData struct with the shard's data.

The indexData struct is the in-memory representation of a single shard. It includes:

File Contents: The raw content of all files in the shard is stored in a byte slice. The boundaries field (read.go:279) stores the start and end offsets for each file's content.
N-grams and Postings Lists: Zoekt uses an n-gram index for fast substring searches.
- contentNgrams: A btreeIndex for the n-grams found in file contents (read.go:314).
- fileNameNgrams: A btreeIndex for n-grams in file names (read.go:331).
- The btreeIndex itself doesn't store the postings lists directly in memory but reads them from the IndexFile on demand.
File and Branch Metadata:
- fileNameContent and fileNameIndex: The names of all files in the shard (read.go:324-329).
- fileBranchMasks: Bitmasks that associate files with their respective branches (read.go:319).
- repoMetaData: A slice of Repository structs containing metadata for each repository in the shard (read.go:265).
Symbols and Sections:
- symbols: Holds data related to symbol definitions and their kinds (read.go:285).
- docSectionsIndex: An index for document sections, used for structured data (read.go:283).
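To make the role of boundaries concrete, here is a minimal sketch of addressing individual files inside one concatenated byte slice. It assumes a layout in which file i spans boundaries[i] to boundaries[i+1]; the docStore type and its methods are hypothetical and only illustrate the idea, not Zoekt's actual field layout.

```go
package main

import "fmt"

// docStore is a hypothetical, simplified view of concatenated file
// contents addressed through a boundaries slice.
type docStore struct {
	content    []byte
	boundaries []uint32 // assumed layout: one entry per file, plus a final end offset
}

// newDocStore concatenates the files and records where each one ends.
func newDocStore(files [][]byte) *docStore {
	d := &docStore{boundaries: []uint32{0}}
	for _, f := range files {
		d.content = append(d.content, f...)
		d.boundaries = append(d.boundaries, uint32(len(d.content)))
	}
	return d
}

// numFiles reports how many files the store holds.
func (d *docStore) numFiles() int {
	return len(d.boundaries) - 1
}

// file returns the bytes of file i by slicing between adjacent boundaries.
func (d *docStore) file(i int) ([]byte, error) {
	if i < 0 || i >= d.numFiles() {
		return nil, fmt.Errorf("file index %d out of range", i)
	}
	return d.content[d.boundaries[i]:d.boundaries[i+1]], nil
}

func main() {
	d := newDocStore([][]byte{
		[]byte("package a\n"),
		[]byte("func main() {}\n"),
	})
	for i := 0; i < d.numFiles(); i++ {
		b, _ := d.file(i)
		fmt.Printf("file %d: %q\n", i, b)
	}
}
```

Storing one offset table instead of one slice header per file keeps per-document overhead small, which matters when a shard contains many files.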

The IndexFile that indexData reads from is often a memory-mapped file, which allows for efficient access to the shard's data without loading the entire file into RAM at once. The readSectionBlob function (read.go:198-200) is used to read chunks of data from the IndexFile as needed.
