madhadron

Where is my database?

In the end, it’s all bits on a disk, but when you connect to a database, where is the thing you’re connecting to and where is it writing to disk?

The question boils down to:

  1. What process are the computations for the database running in?
  2. Where are the files that process is reading and writing?

For familiar programs like text editors or word processors, we open a particular file to edit it and save our changes back to disk, and the computations for doing so are happening in the process we’re directly interacting with (our text editor or word processor).

Files are fairly straightforward when the use case is a human explicitly saving a file that they’re the only one working on. It gets harder quickly, though. What happens if you have parallel writes to different parts of a file, while you are trying to read from others? What happens if the machine loses power while you’re writing? Plus, what format do you write data to the file in?

(There are a couple more I really wanted to link, but the search engines seem to have eaten them.)

The next step is to use a library that hides these difficulties with files and handles how to encode data on disk. We call these embedded databases. The best known is SQLite, but there are many more, ranging from the ancient DBM and BerkeleyDB to modern key-value stores like Kyoto Cabinet and RocksDB. We’ll focus on SQLite for a moment since it demonstrates the important points.

People are accustomed to opening a SQLite database with the SQLite command line tool. That is, they open a shell and type

$ sqlite3 mydatabase.db

and get a prompt where they can run SQL commands. The sqlite3 command is a program that listens for commands from the user and calls the SQLite library. It also provides a number of convenience commands for working interactively with the database such as adjusting the format the program prints rows in, but those are part of sqlite3 the command line utility, not SQLite the library.

When you write your own program and use the SQLite library, the sqlite3 command is not involved at all and those convenience commands don’t exist. We call these databases “embedded” because they get run inside the same process as the rest of our program.

In both cases, the actual data is stored in the file you open. SQLite keeps all its data in one file. This is true for many of the embedded databases, though some, like RocksDB, use a large number of files.

When more than one program wants to access a database, embedded databases become a hindrance because each program will need to open the file and somehow coordinate their access via that file. File systems simply do not provide adequate support for such a task. To work around this, databases like PostgreSQL or SQL Server that are meant for many concurrent users run in their own process. The database’s process listens for connections on a network socket, and each program that wants to use the database opens a connection to the database’s process. Your program sends commands in a format the database understands over that connection and the database sends results back. You usually run one copy of these databases on a single machine, and it stores its files in a single location that nothing else uses.

So, if you’re using PostgreSQL on the same machine as your program, you would usually use a library in your program that understands how to send commands to PostgreSQL in a format it understands. Using that, your program opens a connection to a socket and sends a command. PostgreSQL’s process accepts and runs the command and makes changes to the files where it stores its data.