Curated learning notes on computer science topics (2/4).
- Data independence: applications are insulated from how data is logically structured and physically stored.
- Impedance mismatch: the disconnect between the object model in the application code and the relational model in the database system.
- Driving forces behind NoSQL:
- Need for greater scalability.
- Widespread preference for free and open source software.
- Specialized query operations.
- More dynamic and expressive data models.
- Compared with the relational model:
- The document model is better for one-to-many relationships with more flexible schemas and performance gains due to storage locality.
- The graph model is targeted at many-to-many relationships.
- Motivation of normalization: human-readable information is prone to change, and duplicating it would incur write overheads and risk inconsistencies.
- Motivation of denormalization: creating redundant data to improve read performance at the expense of write overheads.
- Normal forms of the relational model:
- Second normal form (2NF): 1NF with no partial dependency.
- Third normal form (3NF): 2NF with no transitive dependency.
- Boyce-Codd normal form (BCNF): a non-trivial functional dependency must be a key constraint of the relation.
- For any relation, a lossless-join dependency-preserving decomposition into 3NF is always possible.
- Decomposition into BCNF is not always dependency-preserving.
- Schema-on-read vs schema-on-write:
- Schema-on-read burdens consumers with data interpretation, while schema-on-write forces producers to put data in a standardized format.
- Schema-on-read is similar to runtime type checking, whereas schema-on-write is similar to compile-time type checking.
- Advantages of declarative query languages (vs imperative query languages):
- Conciseness and ease of use.
- Potential for performance optimizations and parallel execution.
- Independent (primitive) operators of relational algebra:
- renaming (ρ)
- projection (π)
- selection (σ)
- product (×)
- union (∪)
- difference (-)
- Some intermediate SQL features:
- aggregates with HAVING clauses
- nested queries
- common table expressions (CTE)
- window functions (ex: LC 512)
- In SQL, a subquery is correlated if it references attributes from an enclosing query.
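Both features can be exercised with Python's built-in sqlite3 module (window functions need SQLite 3.25+, bundled with recent Pythons); the `sales` table and its rows are made up for illustration.

```python
import sqlite3

# Toy table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (dept TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("a", 10), ("a", 30), ("b", 20), ("b", 40)])

# Window function: rank rows within each department without collapsing them.
ranked = conn.execute("""
    SELECT dept, amount,
           RANK() OVER (PARTITION BY dept ORDER BY amount DESC) AS rnk
    FROM sales
""").fetchall()

# Correlated subquery: the inner query references s1.dept from the
# enclosing query, so it is logically re-evaluated for each outer row.
top = conn.execute("""
    SELECT dept, amount
    FROM sales AS s1
    WHERE amount = (SELECT MAX(amount) FROM sales AS s2
                    WHERE s2.dept = s1.dept)
""").fetchall()
```

Note how the window query keeps every row while still computing a per-group aggregate, which a plain GROUP BY cannot do.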
- Indexes speed up read queries but slow down writes.
- Two main schools of thought for OLTP systems:
- Page-oriented storage engines (e.g. B-trees) write blocks in place to achieve consistent performance (no merging or compaction) and strong support for transactional semantics.
- Log-structured storage engines (e.g. LSM-trees) leverage sequential writes to contain write amplification and disk fragmentation.
- Advantages of SSTables over hash indexes:
- No need to keep an index of all the keys in memory.
- Simple and efficient segment merge.
- Potential for compression of consecutive records.
- B+ trees only store keys and values in leaf nodes, whereas B-trees store them in all nodes.
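The simple and efficient SSTable segment merge can be sketched with `heapq.merge`; the (key, value) tuple format and the newest-segment-wins rule are assumptions of this toy version.

```python
import heapq

def merge_sstables(*segments):
    """Merge sorted (key, value) segments into one sorted, deduplicated segment.

    Later arguments are newer segments; on duplicate keys the newest value
    wins, mirroring how compaction discards shadowed entries.
    """
    # Tag entries with negated segment age so ties on a key sort newest-first.
    tagged = [[(key, -age, value) for key, value in seg]
              for age, seg in enumerate(segments)]
    merged, last_key = [], object()
    for key, _, value in heapq.merge(*tagged):
        if key != last_key:  # first occurrence of a key is its newest version
            merged.append((key, value))
            last_key = key
    return merged

older = [("a", 1), ("c", 1)]
newer = [("a", 2), ("b", 2)]
merged = merge_sstables(older, newer)
```

Because every input segment is already sorted, the merge is a single streaming pass and never needs all keys in memory at once.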
- Latch crabbing/coupling on B-trees:
- For search, unlatch the parent node on latching the child node.
- For mutation, unlatch an internal node when it’s guaranteed to not be split or merged.
- To reduce contention on the root, optimistically take the read-path latching approach down to the leaf, and restart with write-path latching only if the leaf does require a split or merge.
- As a rule of thumb, LSM-trees are typically faster for blind writes, whereas B-trees are thought to be faster for reads.
- Hash-based indexes are good for equality selections, whereas B+ trees are good for range selections and sequential scans.
- Prefix key compression increases the fanout of a B+ tree and reduces the height.
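A minimal sketch of the prefix compression idea, assuming string keys: a node stores the shared prefix once plus short suffixes, so more keys fit per page. The function name is made up; `os.path.commonprefix` does the prefix extraction.

```python
import os

def prefix_compress(keys):
    """Store a sorted node's keys as one shared prefix plus short suffixes.

    Because the keys are sorted, the common prefix of the first and last
    key is shared by every key in between, so one comparison suffices.
    Shorter entries mean more keys per page, hence higher fanout and a
    lower tree height.
    """
    prefix = os.path.commonprefix([keys[0], keys[-1]])
    return prefix, [k[len(prefix):] for k in keys]

prefix, suffixes = prefix_compress(["users:100", "users:123", "users:140"])
```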
- ISAM is a static tree structure that is typically used when inserts and deletes are rare.
- Cuckoo hashing uses multiple hash tables with different hash functions.
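A toy cuckoo hash set under stated assumptions (integer keys, two tables, illustrative hash functions): each key has one candidate slot per table, and an insert into a full slot evicts the resident key into its other table.

```python
class CuckooHash:
    """Toy cuckoo hash set with two tables and two hash functions.

    Every key has exactly one candidate slot per table; inserting into an
    occupied slot evicts the resident key, which is then re-homed in its
    other table, possibly evicting another key in turn.  Assumes integer keys.
    """

    def __init__(self, size=11):
        self.size = size
        self.tables = [[None] * size, [None] * size]

    def _slot(self, which, key):
        # Two simple, independent hash functions (illustrative only).
        if which == 0:
            return key % self.size
        return (key * 2654435761) % self.size  # Knuth multiplicative hashing

    def insert(self, key, max_kicks=50):
        for _ in range(max_kicks):
            for which in (0, 1):
                i = self._slot(which, key)
                if self.tables[which][i] in (None, key):
                    self.tables[which][i] = key
                    return True
                # Evict the resident key and try to re-home it next.
                self.tables[which][i], key = key, self.tables[which][i]
        return False  # likely a cycle: a real implementation would rehash

    def __contains__(self, key):
        return any(self.tables[w][self._slot(w, key)] == key for w in (0, 1))

ch = CuckooHash()
for k in (0, 1, 2, 3, 11):
    ch.insert(k)
```

Lookups are worst-case O(1): a key can only ever live in its two candidate slots.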
- Extendible hashing and linear hashing enable resizing the hash table on demand without rebuilding the entire table.
- Extendible hashing splits a bucket when it’s full and doubles the slot array when the global depth increases.
- Linear hashing splits the bucket at the split pointer and incrementally expands the slot array.
- R-trees are balanced in the face of changing data, whereas kd-trees are typically bulk loaded.
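The extendible hashing mechanics (split only the overflowing bucket, double the directory only when local depth catches up with global depth) can be sketched as follows; the class and field names are made up for this toy version.

```python
class ExtendibleHash:
    """Toy extendible hash table backed by a directory of bucket pointers.

    Overflow splits only the affected bucket; the directory (slot array)
    doubles only when the splitting bucket's local depth already equals
    the global depth.  Assumes integer keys (Python hashes small ints to
    themselves, so the behavior is deterministic).
    """

    def __init__(self, bucket_size=2):
        self.bucket_size = bucket_size
        self.global_depth = 1
        self.directory = [{"depth": 1, "items": {}}, {"depth": 1, "items": {}}]

    def _index(self, key):
        return hash(key) & ((1 << self.global_depth) - 1)

    def put(self, key, value):
        self.directory[self._index(key)]["items"][key] = value
        while len(self.directory[self._index(key)]["items"]) > self.bucket_size:
            self._split(self.directory[self._index(key)])

    def _split(self, bucket):
        if bucket["depth"] == self.global_depth:
            self.directory = self.directory + self.directory  # double slot array
            self.global_depth += 1
        bucket["depth"] += 1
        new = {"depth": bucket["depth"], "items": {}}
        mask = 1 << (bucket["depth"] - 1)  # the newly significant hash bit
        for k in list(bucket["items"]):
            if hash(k) & mask:
                new["items"][k] = bucket["items"].pop(k)
        # Re-point the directory slots that now address the new bucket.
        for i, b in enumerate(self.directory):
            if b is bucket and i & mask:
                self.directory[i] = new

    def get(self, key):
        return self.directory[self._index(key)]["items"].get(key)

eh = ExtendibleHash(bucket_size=2)
for k in range(8):
    eh.put(k, k * 10)
```

Doubling the directory only copies pointers; no records move except those in the one bucket being split, which is what makes resizing on demand cheap.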
- Common memory-efficient probabilistic data structures:
- A Bloom filter approximates set membership with zero false negatives.
- A count-min sketch approximates occurrences of elements.
- A HyperLogLog approximates the number of distinct elements.
- A Levenshtein automaton supports efficient search for words within a given edit distance.
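A minimal Bloom filter sketch illustrating the zero-false-negatives property; the sizes and the SHA-256-based probe derivation are arbitrary choices for this toy version.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k deterministic hash probes into a bit array.

    May report false positives; never reports false negatives.
    """

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _probes(self, item):
        # Derive k probe positions by salting a cryptographic hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._probes(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._probes(item))

bf = BloomFilter()
for word in ("alice", "bob"):
    bf.add(word)
```

Every added item sets all of its probe bits, so a membership query for it can never come back negative; only items that collide on all k bits yield false positives.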
- The performance advantage of in-memory databases comes mostly from avoiding the overhead of encoding in-memory data structures into a disk-friendly format, not from avoiding disk reads: operating systems actively cache recently used disk blocks in memory anyway.
- Transaction processing vs analytics:
- The bottleneck of OLTP systems is usually disk seek time, whereas the bottleneck of OLAP systems is usually disk bandwidth.
- OLTP databases are usually row-oriented, whereas OLAP databases are usually column-oriented.
- Storage models:
- N-ary storage model (row stores, typically for OLTP)
- Decomposition storage model (DSM, column stores, typically for OLAP)
- Partition attributes across (PAX, column-oriented within a row group)
- In the n-ary storage model:
- A file can either be a data file or an index file.
- A file consists of fixed-sized pages, which are either data pages or directory pages.
- A page is a collection of records, either of a fixed length or of variable lengths.
- Column-oriented storage enables efficient compression techniques, such as bitmap encoding (for small value spaces) and run-length encoding (for consecutive duplicate values).
- Dictionary compression enables storing fixed-length values for variable-length data in column stores.
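Run-length and dictionary encoding can both be sketched in a few lines; the column values are made up, and real column stores operate on packed binary layouts rather than Python lists.

```python
from itertools import groupby

def rle_encode(column):
    """Run-length encode consecutive duplicates as (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]

def dict_encode(column):
    """Map variable-length values to fixed-width integer codes.

    Assigning codes in sorted value order keeps the encoding
    order-preserving, so range predicates can run on the codes directly.
    """
    codes = {v: i for i, v in enumerate(sorted(set(column)))}
    return codes, [codes[v] for v in column]

runs = rle_encode(["us", "us", "us", "de", "de"])
codes, encoded = dict_encode(["de", "us", "de"])
```

RLE pays off on sorted columns, where identical values cluster into long runs; dictionary codes give every row a fixed width regardless of string length.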
- Delta store: stage updates in a row store and asynchronously migrate them to a column store in batches.
- Pointer swizzling translates disk addresses into memory addresses.
- Atomicity, isolation and durability are properties of the database, whereas consistency (in ACID) is a property of the application.
- Atomicity has nothing to do with concurrency; it means “all or nothing”.
- Anomalies of weak isolation levels:
- Dirty read: a transaction reads uncommitted data of another transaction.
- Dirty write: a transaction overwrites uncommitted data of another transaction.
- Read skew (unrepeatable read): a transaction reads different objects at different points in time, which can be solved by MVCC.
- Lost update: two transactions concurrently perform a read-modify-write cycle and one overwrites another’s write without incorporating its changes.
- Write skew: a transaction makes a write based on a premise it read, and by the time it commits, that premise has changed. Snapshot isolation is susceptible to this anomaly.
- Phantom read: the result of a search query in one transaction is changed by writes in another transaction. Predicate locking is the theoretical antidote to this.
- Multi-version concurrency control (MVCC) for snapshot isolation:
- Readers never block writers and writers never block readers.
- Keep multiple versions of an object written by different transactions.
- List all active transactions at the start of each transaction to determine which versions are visible.
- Used for backups and analytic queries.
- Approaches for serializable isolation:
- Two-phase locking (2PL) + predicate locks (practically index-range locks).
- Serializable snapshot isolation (SSI).
- Two-phase locking (2PL):
- In the growing phase, only acquiring or upgrading locks is permitted.
- In the shrinking phase, only releasing locks is permitted.
- The operations on all the objects could be considered to happen at a moment between the two phases.
- Basic 2PL is subject to cascading aborts: if a transaction aborts after releasing some of its exclusive locks, every transaction that read its data must also abort, otherwise dirty reads would occur.
- In comparison, strict 2PL does not allow a transaction to release a lock until commit or abort.
- To avoid starvation, a shared lock shouldn’t be granted if another transaction is already waiting for the exclusive lock.
- Advantages of a buffer manager (vs OS memory management):
- ability to pin a page in the buffer pool or force it back to the disk;
- ability to adjust the replacement policy;
- ability to pre-fetch pages based on access patterns.
- Buffer management policies:
- NO STEAL policy: a dirty page must not be evicted from the buffer pool while it contains updates of an uncommitted transaction.
- FORCE policy: a transaction should write all dirty pages back to the disk on commit.
- NO STEAL policy causes poor throughput but ensures atomicity with simple rollback.
- FORCE policy causes poor response time but provides durability with simple recovery.
- Write-ahead logging (WAL) guarantees atomicity and durability when STEAL and NO FORCE are allowed.
- A WAL record contains (1) the log sequence number (LSN), (2) the transaction ID, (3) the object ID, (4) the before value, and (5) the after value.
- WAL records could be flushed to disk in batches, as long as all the records for the updates on a dirty page are flushed to disk before the page itself is flushed to disk.
- An application gets notified of a commit only after the COMMIT record gets flushed to disk, but may get notified of an abort even if log records have not yet been flushed.
- ARIES crash recovery algorithm:
- Analysis phase: identify all the active transactions and dirty pages (at the time of the crash) by scanning forward from the most recent checkpoint.
- Redo phase: scan forward to redo all updates on the dirty pages in the buffer pool.
- Undo phase: scan backward to undo all updates of uncommitted transactions in the reverse order.
- System catalogs are a collection of tables maintaining metadata and statistics of other tables for estimating costs and optimizing query execution.
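The three ARIES recovery phases can be sketched on a toy in-memory log; the record format is an assumption of this sketch, and real ARIES additionally tracks LSNs, a dirty page table, checkpoints, and compensation log records.

```python
def recover(log):
    """Toy ARIES-style recovery over an in-memory list of WAL records.

    Update records are dicts {"txid", "obj", "before", "after"}; commits
    are {"txid", "commit": True}.
    """
    # Analysis: scan forward to find loser transactions (no COMMIT record).
    committed = {r["txid"] for r in log if r.get("commit")}
    # Redo: scan forward and repeat history, replaying every update.
    state = {}
    for r in log:
        if "obj" in r:
            state[r["obj"]] = r["after"]
    # Undo: scan backward, rolling back losers' updates in reverse order.
    for r in reversed(log):
        if "obj" in r and r["txid"] not in committed:
            state[r["obj"]] = r["before"]
    return state

log = [
    {"txid": 1, "obj": "x", "before": 0, "after": 10},
    {"txid": 2, "obj": "y", "before": 0, "after": 5},
    {"txid": 1, "commit": True},
    {"txid": 2, "obj": "x", "before": 10, "after": 99},  # tx 2 never commits
]
state = recover(log)
```

Note that redo replays even the losers' updates ("repeating history") before undo rolls them back; that ordering is what lets recovery start from any crash point.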
- Steps of query compilation:
- The SQL query is parsed into an abstract syntax tree (AST).
- The AST is algebraized into a logical query plan based on system catalogs.
- The logical query plan may be optimized based on both system catalogs and cost models.
- The logical query plan is transformed into a physical query plan.
- Query processing models:
- Iterator model (top-down)
- Materialization model (bottom-up, suitable for OLTP)
- Vectorization model (hybrid, typically for OLAP)
- Index nested loop join: if there is an index on the join column of one relation, make that relation the inner loop and exploit the index.
- Hash join: build a hash table for the smaller table, and then run the inner loop on the larger table.
- If the hash table doesn’t fit in the memory, use the same hash function to partition the two tables into buckets first.
- Use a Bloom filter to speed up the probe.
- Sort-merge join: sort both relations on the join column and scan them in lockstep to merge matching rows.
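The build/probe structure of a hash join can be sketched in a few lines; the relation contents are made up, and the grace-style partitioning used when the build side exceeds memory is omitted.

```python
from collections import defaultdict

def hash_join(left, right, left_key, right_key):
    """Toy in-memory hash join: build on the smaller input, probe with the larger.

    Yields (left_row, right_row) pairs for every match.
    """
    swapped = len(right) < len(left)
    build, probe = (right, left) if swapped else (left, right)
    build_key, probe_key = (right_key, left_key) if swapped else (left_key, right_key)
    # Build phase: hash table from join key to matching build-side rows.
    table = defaultdict(list)
    for row in build:
        table[build_key(row)].append(row)
    # Probe phase: stream the larger input and emit matching pairs.
    for row in probe:
        for match in table.get(probe_key(row), []):
            yield (row, match) if swapped else (match, row)

users = [("u1", "ann"), ("u2", "bob")]
orders = [("o1", "u1"), ("o2", "u1"), ("o3", "u3")]
joined = list(hash_join(users, orders,
                        left_key=lambda u: u[0], right_key=lambda o: o[1]))
```

Building on the smaller side keeps the hash table compact; the larger side is only streamed, never materialized.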
- Index-only plans: retrieving data records from the disk file can be avoided for some queries given a suitable index.
- Clustered index: the indexed row is stored directly within the index.
- Zone maps: pre-compute columnar aggregates for each block to enable skipping a block on a query.
- Column sketches: build a bitmap index by mapping value ranges to codes in an order-preserving manner, taking frequency distributions into account.
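The zone map idea is small enough to sketch directly; the block size and values are arbitrary, and real systems store the (min, max) pairs alongside each on-disk block.

```python
def build_zone_maps(values, block_size=4):
    """Pre-compute a (min, max) zone map for each fixed-size block."""
    blocks = (values[i:i + block_size]
              for i in range(0, len(values), block_size))
    return [(min(b), max(b)) for b in blocks]

def blocks_to_scan(zone_maps, lo, hi):
    """Indices of blocks whose [min, max] overlaps the query range [lo, hi];
    all other blocks can be skipped without being read."""
    return [i for i, (bmin, bmax) in enumerate(zone_maps)
            if bmin <= hi and bmax >= lo]

values = [1, 2, 3, 4, 10, 11, 12, 13, 5, 6, 7, 8]
zmaps = build_zone_maps(values)
```

Zone maps help most when data is clustered or sorted, so each block covers a narrow value range.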
- To keep data systems in sync:
- Change data capture (CDC) is done at the database level by extracting writes in the system-of-record database and replicating them in derived data systems (in contrast with dual writes, which may have inconsistent event ordering).
- Event sourcing is done at the application level by keeping an immutable log of change events (for write throughput and auditability).
- CQRS (command query responsibility segregation) further transforms write-optimized event logs into read-optimized views.
- The derived dataset is the place where the read path and the write path meet, representing a tradeoff between the two types of workloads.
- Lambda architecture:
- A stream processor consumes events and updates an approximate view.
- A batch processor reprocesses the events and creates a more accurate view.
- Complex event processing (CEP) reverses the relationship between queries and data: events continuously flow past standing queries in search of queries that match them.
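The query/data reversal in CEP can be sketched as below: queries are registered up front as predicates, then events stream past them. Class and field names are made up; real CEP engines also support pattern matching across multiple events and time windows.

```python
class StandingQueries:
    """Toy CEP engine: queries are long-lived, events are transient."""

    def __init__(self):
        self.queries = {}   # query name -> predicate over one event
        self.matches = []   # (query name, event) pairs emitted so far

    def register(self, name, predicate):
        self.queries[name] = predicate

    def process(self, event):
        # Each incoming event flows past every standing query.
        for name, predicate in self.queries.items():
            if predicate(event):
                self.matches.append((name, event))

engine = StandingQueries()
engine.register("high_temp", lambda e: e["temp"] > 30)
for event in ({"temp": 20}, {"temp": 35}):
    engine.process(event)
```

This is the mirror image of a database: here the queries are stored persistently and the data flows through, rather than the other way around.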