Evolution of Data Analytics (1990-2025)
November 2025
Abstract
This tutorial provides a rigorous, example-driven exposition of the scientific and engineering evolution of data analytics from 1990 to 2025. Through a unified e-commerce scenario that evolves across four
decades, we present: (1) formal problem statements grounded in foundational research; (2) algorithmic
solutions with complexity analysis; (3) theorem statements with detailed proofs; (4) concrete illustrative
examples for each algorithm and definition; (5) empirical evaluation protocols; and (6) comprehensive
citations to seminal works. This paper serves both academic researchers seeking theoretical foundations
and industrial practitioners implementing production systems.
Contents
1 Introduction and Motivating Scenario
1.1 The E-Commerce Platform: A 35-Year Journey
4 2020–2025: Lakehouse, MLOps, and AI-Driven Analytics
4.1 The Lakehouse Paradigm
4.2 Delta Lake: Transactional Storage Layer
4.3 MLOps: Systematic ML Engineering
4.4 Reproducibility Guarantees
4.5 AI-Driven Analytics: LLMs and Agentic Systems
4.6 Governance and Compliance
4.7 Performance Evaluation
9 Conclusion
1 Introduction and Motivating Scenario
1.1 The E-Commerce Platform: A 35-Year Journey
We introduce a canonical e-commerce platform, GlobalMart, operating continuously from 1990 to 2025.
This running example illustrates how each technological era addressed specific data analytics challenges.
• 10,000 customers
Evolution Timeline.
2.2 Formal Problem Statement
Definition 2.1 (Data Warehouse Query Optimization Problem). Given:
• A star schema S = (F, D1 , . . . , Dk ) with fact table F and dimension tables Di
• A query workload W = {Q1 , . . . , Qn } with frequencies f (Qi )
• A storage budget B ∈ R+
select a set of materialized views V minimizing the total workload cost
Σi f (Qi ) · cost(Qi , V) (1)
subject to:
ΣV ∈V size(V ) ≤ B (2)
staleness(V ) ≤ ∆tmax ∀V ∈ V (3)
1. Dimensional Modeling [2, 1]. Ralph Kimball introduced the star schema design methodology, emphasizing business-process-centric dimensional models. Bill Inmon advocated for enterprise-wide normalized data warehouses with a top-down approach.
2. OLAP and Multidimensional Analysis [3]. Gray et al. formalized the data cube operator, enabling
efficient roll-up, drill-down, slice, and dice operations.
3. View Materialization Theory [4, 5]. Gupta and Mumick provided algorithmic frameworks for selecting which views to materialize. Halevy et al. developed the theory of answering queries using views, establishing soundness and completeness conditions.
Dimension tables:
SELECT d.month, c.region, p.category,
SUM(f.revenue) AS total_revenue
FROM FactSales f
JOIN DimDate d ON f.date_key = d.date_key
JOIN DimCustomer c ON f.customer_key = c.customer_key
JOIN DimProduct p ON f.product_key = p.product_key
WHERE d.year = 1995
GROUP BY d.month, c.region, p.category;
Without materialization, this query scans 500K fact rows and performs 3 joins. Query time: 12.3 seconds
(measured on a 1995 Sun SPARCstation).
Definition 2.3 (Cost-Per-Byte Heuristic). The greedy selection criterion [4] prioritizes views with maximum:
score(V ) = benefit(V, W) / size(V ) (5)
Algorithm 1 GreedyMaterializedViewSelection
Require: Workload W = {Q1 , . . . , Qn } with frequencies f (Qi )
Require: Candidate views C = {V1 , . . . , Vp }
Require: Storage budget B
Ensure: Selected views V
1: V ← ∅
2: used storage ← 0
3: while C ̸= ∅ and used storage < B do
4: V ∗ ← arg maxV ∈C benefit(V, W)/size(V )
5: if used storage + size(V ∗ ) > B then
6: break
7: end if
8: V ← V ∪ {V ∗ }
9: C ← C \ {V ∗ }
10: used storage ← used storage + size(V ∗ )
11: Update benefit(V, W) for all V ∈ C ▷ Account for V ∗
12: end while
13: return V
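Algorithm 1 can be sketched as a short, self-contained Python routine. The candidate views, sizes, and the decaying benefit function below are illustrative stand-ins, not GlobalMart's actual workload:

```python
# Minimal sketch of Algorithm 1 (greedy view selection under a storage budget).
# View names, sizes, and benefits are toy values for illustration.

def greedy_select(candidates, benefit, size, budget):
    """candidates: list of view names; benefit(view, selected) -> float;
    size: dict view -> storage units; budget: storage budget."""
    selected, used = [], 0
    remaining = set(candidates)
    while remaining:
        # Pick the view with the best benefit-per-byte given the current selection.
        best = max(remaining, key=lambda v: benefit(v, selected) / size[v])
        if used + size[best] > budget or benefit(best, selected) <= 0:
            break
        selected.append(best)
        remaining.discard(best)
        used += size[best]
    return selected, used

# Toy workload: marginal benefits shrink as overlapping views get selected.
sizes = {"by_month": 40, "by_region": 30, "by_product": 50}
base = {"by_month": 100.0, "by_region": 90.0, "by_product": 60.0}

def benefit(v, selected):
    # Crude submodular-style decay: each prior pick halves the marginal benefit.
    return base[v] / (2 ** len(selected))

views, used = greedy_select(list(sizes), benefit, sizes, budget=80)
print(views, used)  # ['by_region', 'by_month'] 70
```

The third candidate is skipped because adding it would exceed the budget, mirroring the break on line 6 of the algorithm.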
Theorem 2.1 (Greedy Approximation Bound). Let V ∗ be the optimal view set and Vg be the greedy solution.
Under submodular benefit functions, the greedy algorithm achieves:
benefit(Vg , W) ≥ (1 − 1/e) · benefit(V ∗ , W) (6)
Sketch. The benefit function satisfies submodularity: adding a view to a smaller set yields at least as much marginal benefit as adding it to a larger set. By the classical result of Nemhauser et al. [18], greedy maximization of submodular functions under cardinality constraints achieves a (1 − 1/e)-approximation. The storage constraint can be converted to a cardinality constraint by discretizing view sizes.
Example 2.2 (Applying Greedy Selection at GlobalMart). Candidate views for the 1995 workload:
Definition 2.5 (Chase Algorithm [8]). Given source instance I and TGDs Σ, the chase constructs a universal
solution J by iteratively applying TGDs until a fixed point.
Algorithm 2 StandardChase
Require: Source instance I
Require: TGD set Σ
Ensure: Target instance J
1: J ← I ▷ Initialize with source data
2: repeat
3: for each TGD σ : ϕ(x̄) → ∃ȳ.ψ(x̄, ȳ) in Σ do
4: for each homomorphism h : ϕ → J do
5: if no extension of h satisfies ψ then
6: Add new tuples to J to satisfy ψ with fresh nulls for ȳ
7: end if
8: end for
9: end for
10: until no new tuples added
11: return J
Example 2.3 (Chase for GlobalMart Integration). TGD expressing "each order must reference an existing customer":
Orders(o, c, p, . . .) → ∃ȳ. Customers(c, ȳ)
Source XML contains order (o1, c100, p50, ...) but customer c100 is missing.
Chase adds: Customers(c100, ⊥1 , ⊥2 , ...) where ⊥i are labeled nulls.
This ensures referential integrity while marking unknown values.
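The chase step in this example can be sketched in Python. The TGD (orders must reference an existing customer) follows Example 2.3; the tuple layouts are illustrative:

```python
# Sketch of one chase rule: if an order references a missing customer,
# add a Customers tuple padded with fresh labeled nulls for the
# existential variables. Relation shapes are illustrative.

def chase_fk(orders, customers, null_width=2):
    """Apply the referential-integrity TGD until fixpoint."""
    counter = 0
    cust_ids = {c[0] for c in customers}
    for (order_id, cust_id, product_id) in orders:
        if cust_id not in cust_ids:
            # Fresh labeled nulls (one per existential variable).
            nulls = tuple(f"_N{counter + i}" for i in range(null_width))
            counter += null_width
            customers.append((cust_id,) + nulls)
            cust_ids.add(cust_id)
    return customers

orders = [("o1", "c100", "p50")]
customers = []
print(chase_fk(orders, customers))
# [('c100', '_N0', '_N1')]
```

The labeled nulls `_N0`, `_N1` play the role of ⊥1, ⊥2 in the example: referential integrity holds, while the unknown name and e-mail remain marked as unknown.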
Example 2.4 (GlobalMart Product Ontology). Representing product p123 in RDF (Turtle syntax):
@prefix gm: <http://globalmart.example/ontology#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
gm:product_p123 a gm:Product ;
gm:hasName "Wireless Mouse"^^xsd:string ;
gm:hasPrice "29.99"^^xsd:decimal ;
gm:inCategory gm:Electronics ;
gm:hasBrand gm:TechCorp .
Definition 2.7 (SPARQL Query). SPARQL (SPARQL Protocol and RDF Query Language) provides pattern matching over RDF graphs:
SELECT x̄ WHERE {graph pattern} (9)
Example 2.5 (SPARQL Query at GlobalMart). Find all electronics products with price under $50:
PREFIX gm: <http://globalmart.example/ontology#>
SELECT ?product ?price WHERE {
  ?product a gm:Product ;
           gm:inCategory gm:Electronics ;
           gm:hasPrice ?price .
  FILTER (?price < 50)
}
2.8 Evaluation and Industrial Impact
Experimental Validation.
• Schema Matching: Evaluated on real e-commerce schemas (Amazon, eBay, Alibaba product catalogs)
• Results: Hybrid approaches (combining string + structure + semantics) achieved F1 ≈ 0.82–0.89 [7]
Example 2.6 (2005 GlobalMart Integration Project). Integration of 15 supplier XML feeds:
Traditional data warehouses could not handle this volume, velocity, and variety—motivating the Big
Data revolution.
Algorithm 3 MapReduce Execution
Require: Input data partitioned into splits {S1 , . . . , Sn }
Require: User-defined map and reduce functions
Ensure: Output results
1: Map Phase:
2: for each split Si in parallel do
3: for each record (k1 , v1 ) in Si do
4: Emit map(k1 , v1 ) producing intermediate (k2 , v2 ) pairs
5: end for
6: end for
7: Shuffle Phase:
8: Group all intermediate pairs by key k2 : (k2 , [v2,1 , v2,2 , . . .])
9: Reduce Phase:
10: for each key k2 in parallel do
11: Emit reduce(k2 , [v2,1 , . . .]) producing (k2 , v3 )
12: end for
13: Write results to distributed file system
Example 3.1 (Daily Revenue Calculation at GlobalMart). Problem: Compute total revenue per day from
10TB order history.
Map Function:
def map(order_id, order_record):
date = order_record[’date’]
revenue = order_record[’quantity’] * order_record[’price’]
emit(date, revenue)
Reduce Function:
def reduce(date, revenue_list):
total = sum(revenue_list)
emit(date, total)
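The two functions above can be exercised with a minimal in-memory simulation of the map, shuffle, and reduce phases (the toy orders below stand in for the 10TB history):

```python
from collections import defaultdict

# In-memory simulation of the MapReduce flow in Example 3.1.
# Order records are toy data; the real input is a 10TB order history.

def map_fn(order_id, order_record):
    date = order_record['date']
    revenue = order_record['quantity'] * order_record['price']
    return [(date, revenue)]

def reduce_fn(date, revenue_list):
    return (date, sum(revenue_list))

def run_mapreduce(records, map_fn, reduce_fn):
    shuffle = defaultdict(list)              # Shuffle: group by intermediate key
    for key, rec in records:
        for k2, v2 in map_fn(key, rec):
            shuffle[k2].append(v2)
    return dict(reduce_fn(k, vs) for k, vs in shuffle.items())

orders = [
    ("o1", {'date': '2012-07-01', 'quantity': 2, 'price': 10.0}),
    ("o2", {'date': '2012-07-01', 'quantity': 1, 'price': 5.0}),
    ("o3", {'date': '2012-07-02', 'quantity': 3, 'price': 4.0}),
]
print(run_mapreduce(orders, map_fn, reduce_fn))
# {'2012-07-01': 25.0, '2012-07-02': 12.0}
```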
Execution:
Theorem 3.1 (MapReduce Fault Tolerance). If a worker node fails during the map or reduce phase, the MapReduce framework automatically re-executes the failed tasks on healthy nodes, ensuring correctness under the assumption of deterministic map and reduce functions.
• Lineage: Chain of transformations from stable storage
Example 3.2 (Spark RDD for Active User Analysis). Problem: Count active users per day from clickstream
logs (500GB/day).
# Load clickstream data
logs = sc.textFile("hdfs://clickstream/2015/*/*")
Performance:
3.4 Stream Processing and Event-Time Semantics
Definition 3.4 (Data Stream). A stream S is an unbounded sequence of timestamped events S = ⟨(e1 , t1 ), (e2 , t2 ), . . .⟩, where ei is an event payload and ti its event time:
Algorithm 4 EventTimeWindowedAggregation
Require: Stream S, window size w, aggregate function f
Ensure: Windowed results
1: Initialize state: windows ← {}, watermark ← 0
2: for each event (e, t) from S do
3: win id ← ⌊t/w⌋
4: Update windows[win id] with f (e)
5: watermark ← UpdateWatermark(t)
6: for each window wi where end(wi ) < watermark do
7: if wi not yet emitted then
8: Emit f (windows[wi ])
9: Mark wi as complete
10: end if
11: end for
12: end for
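Algorithm 4 can be sketched on toy events. The 5-minute window and 2-minute delay bound follow Example 3.3; the watermark is modeled simply as the maximum observed event time minus the delay bound:

```python
# Sketch of Algorithm 4 on toy (user, event_time_seconds) events,
# computing distinct active users per window. The watermark model
# (max event time minus allowed delay) is a simplifying assumption.

def windowed_counts(events, w=300, max_delay=120):
    windows, emitted, out = {}, set(), []
    watermark = 0
    for user, t in events:
        win_id = t // w
        windows.setdefault(win_id, set()).add(user)   # f = distinct-user count
        watermark = max(watermark, t - max_delay)     # UpdateWatermark(t)
        for wid in sorted(windows):
            if (wid + 1) * w < watermark and wid not in emitted:
                out.append((wid * w, len(windows[wid])))
                emitted.add(wid)
    return out

events = [("u1", 10), ("u2", 250), ("u1", 290),   # window [0, 300)
          ("u3", 320),                            # window [300, 600)
          ("u2", 295),                            # late arrival, still in [0, 300)
          ("u4", 430)]
print(windowed_counts(events))
# [(0, 2)] -- the late u2@295 is correctly counted in window [0, 300)
```

Note that the late event at t = 295 arrives after the t = 320 event, yet is assigned to the first window because the watermark (320 − 120 = 200) has not yet passed that window's end.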
Example 3.3 (Real-Time Active Users at GlobalMart). Scenario: Compute active users per 5-minute window from the clickstream. Events arrive with network delays of up to 2 minutes.
Configuration:
• Window size: w = 300 seconds (5 minutes)
Windowing:
Without event-time semantics, late events would be assigned to the wrong window.
• Retention: 7 days
Throughput:
• Sustained throughput: 1 GB/sec
Consumer Groups:
• Real-time dashboard (6 consumers, 4 partitions each)
• Fraud detection ML (12 consumers, 2 partitions each)
• Data lake ingestion (4 consumers, 6 partitions each)
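The partition arithmetic above can be checked with a small range-assignment sketch, assuming a 24-partition topic (consistent with all three consumer counts):

```python
# Range-style partition assignment: 24 partitions (assumed topic size)
# divided evenly across each consumer group's members.

def assign(partitions, consumers):
    per = partitions // consumers
    return {c: list(range(c * per, (c + 1) * per)) for c in range(consumers)}

for group, n in [("dashboard", 6), ("fraud-ml", 12), ("lake-ingest", 4)]:
    a = assign(24, n)
    print(group, len(a), "consumers,", len(a[0]), "partitions each")
# dashboard 6 consumers, 4 partitions each
# fraud-ml 12 consumers, 2 partitions each
# lake-ingest 4 consumers, 6 partitions each
```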
Checkpointing truncates lineage: trade-off between checkpoint frequency and recovery cost.
4 2020–2025: Lakehouse, MLOps, and AI-Driven Analytics
4.1 The Lakehouse Paradigm
Motivation. By 2020, organizations maintained dual systems:
Algorithm 5 DeltaLakeRead
Require: Table path, optional version v
Ensure: Dataset snapshot
1: if version v specified then
2: Read transaction log up to version v
3: else
4: Read latest transaction log
5: end if
6: active files ← {}
7: for each log entry logi in order do
8: for each file in logi .add do
9: active files ← active files ∪ {file}
10: end for
11: for each file in logi .remove do
12: active files ← active files \ {file}
13: end for
14: end for
15: return Dataset from active files
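Algorithm 5 amounts to replaying the transaction log. A minimal sketch, with an assumed, simplified log format standing in for Delta Lake's JSON actions:

```python
# Sketch of Algorithm 5: replay a Delta-style transaction log to
# reconstruct the active file set at a given version. The log format
# is a simplified stand-in for Delta Lake's add/remove actions.

def snapshot(log, version=None):
    entries = log if version is None else log[:version + 1]
    active = set()
    for entry in entries:
        active |= set(entry.get("add", []))      # files added in this version
        active -= set(entry.get("remove", []))   # files logically deleted
    return active

log = [
    {"add": ["part-0.parquet", "part-1.parquet"]},             # version 0
    {"add": ["part-2.parquet"]},                               # version 1
    {"add": ["part-3.parquet"], "remove": ["part-0.parquet"]}  # version 2
]
print(sorted(snapshot(log, version=1)))  # time travel to version 1
# ['part-0.parquet', 'part-1.parquet', 'part-2.parquet']
print(sorted(snapshot(log)))             # latest snapshot
# ['part-1.parquet', 'part-2.parquet', 'part-3.parquet']
```

Because old log entries and data files are never mutated, reading at a fixed version always yields the same file set, which is exactly the time-travel guarantee used in Example 4.1.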
Example 4.1 (Time Travel at GlobalMart). Scenario: Churn prediction model training requires consistent
customer snapshot.
Problem: Without versioning, customer table changes daily (updates, deletes). Training on Monday vs
Tuesday produces different features, breaking reproducibility.
Solution: Delta Lake time travel.
-- Create snapshot reference
CREATE TABLE customers_training_snapshot AS
SELECT * FROM customers VERSION AS OF 205;
Guarantee: Query returns identical results regardless of when executed (version 205 is immutable).
Theorem 4.1 (Snapshot Isolation in Delta Lake). Under optimistic concurrency control with transaction
log, Delta Lake provides snapshot isolation: each transaction reads a consistent snapshot determined by
log version, and conflicts are detected via version checks before commit.
• Glue code: Generic packages require extensive wrapping
Example 4.2 (Feature Store at GlobalMart). Challenge: ”Average customer spend last 30 days” computed
differently in:
Offline (Training):
features_df = feature_store.get_historical_features(
entity_df=customers_snapshot,
features=["customer_avg_spend_30d"],
timestamp_column="snapshot_date"
)
Online (Serving):
feature_value = feature_store.get_online_features(
features=["customer_avg_spend_30d"],
entity_keys={"customer_id": "c12345"}
)
Backend: Offline features stored in Delta Lake; online features cached in Redis with TTL-based refresh.
3. Fixed random seeds S = {s1 , . . . , sk }
Algorithm 6 ReproducibleMLTraining
Require: Experiment configuration config
Ensure: Model artifact with metadata
1: D ← LoadDataSnapshot(config.data_version) ▷ Delta Lake version
2: SetRandomSeeds(config.seeds) ▷ NumPy, TensorFlow, PyTorch
3: features ← ExtractFeatures(D, config.feature_spec)
4: model ← TrainModel(features, config.hyperparams)
5: metrics ← Evaluate(model, validation set)
6: return (model, metrics, config) ▷ Artifact with full metadata
Example 4.3 (Reproducibility Validation at GlobalMart). Experiment: Train churn prediction XGBoost
model.
Setup:
config = {
’data_version’: 205, # Delta Lake snapshot
’code_commit’: ’a7f3d29’,
’seeds’: {’numpy’: 42, ’xgboost’: 123},
’xgboost_version’: ’1.7.3’,
’features’: [’avg_spend_30d’, ’order_frequency’, ...]
}
Bit-identical model artifacts confirmed. Training time difference: 0.3% (CPU scheduling variance).
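The bit-identity check can be sketched by fingerprinting the serialized artifact together with its config and comparing across reruns; `train` below is a deterministic stand-in for the real XGBoost job:

```python
import hashlib
import json

# Sketch of the bit-identity check: hash the serialized model artifact
# together with its experiment config, then compare across reruns.
# `train` is a deterministic stand-in for the real training job.

def train(config):
    # Pseudo-training: the artifact depends only on the (seeded) config.
    seed = config['seeds']['numpy']
    return bytes((seed * i) % 256 for i in range(32))

def artifact_fingerprint(config):
    artifact = train(config)
    payload = artifact + json.dumps(config, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

config = {'data_version': 205, 'code_commit': 'a7f3d29',
          'seeds': {'numpy': 42, 'xgboost': 123}}
run1 = artifact_fingerprint(config)
run2 = artifact_fingerprint(config)
print(run1 == run2)  # True: same config + deterministic training
```

In production the same pattern applies with the real serialized booster: any divergence in data version, code commit, seeds, or library version changes the fingerprint.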
Example 4.4 (Text-to-SQL at GlobalMart). User Query: ”Show me the top 10 products by revenue in Q4
2024 for customers in Europe.”
LLM-Generated SQL:
SELECT p.product_id, p.product_name, SUM(o.quantity * o.price) AS total_revenue
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products p ON o.product_id = p.product_id
WHERE c.region = 'Europe'
AND o.order_date BETWEEN '2024-10-01' AND '2024-12-31'
GROUP BY p.product_id, p.product_name
ORDER BY total_revenue DESC
LIMIT 10;
Accuracy: 87% of business analyst queries correctly translated (evaluated on 500-query benchmark).
Example 4.5 (Automated Root Cause Analysis). Alert: ”Revenue dropped 15% today compared to last
week.”
Agent Workflow:
1. Query daily revenue: SELECT date, SUM(revenue) FROM orders GROUP BY date
Example 4.6 (GDPR Compliance via Lineage). Request: Customer c12345 requests data deletion (GDPR
”right to be forgotten”).
Lineage Query: Find all downstream artifacts derived from customer c12345.
WITH RECURSIVE downstream AS (
SELECT dataset_id, 'customers' AS source
FROM lineage_graph
WHERE source_table = 'customers' AND entity_id = 'c12345'
UNION ALL
SELECT l.target_dataset, d.source
FROM lineage_graph l
JOIN downstream d ON l.source_dataset = d.dataset_id
)
SELECT * FROM downstream;
Result: Identifies:
• 12 derived tables (aggregates, features)
• 7 cached reports
Automated deletion cascade ensures compliance.
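The recursive lineage walk behind this query can be sketched as a breadth-first traversal over an in-memory lineage graph (the edges below are illustrative):

```python
from collections import deque

# Sketch of the GDPR lineage walk: breadth-first traversal over a
# dataset-level lineage graph starting from the customers table.
# Node names and edges are illustrative.

def downstream(graph, start):
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for target in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen - {start}

lineage = {
    "customers": ["features_customer", "agg_daily_spend"],
    "features_customer": ["churn_training_set"],
    "agg_daily_spend": ["revenue_report"],
}
print(sorted(downstream(lineage, "customers")))
# ['agg_daily_spend', 'churn_training_set', 'features_customer', 'revenue_report']
```

Every dataset returned by the traversal must then be visited by the deletion cascade, exactly as the recursive SQL query does at the row level.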
Example 4.7 (2024 Lakehouse Migration). GlobalMart migrated from dual Lake+Warehouse to Delta
Lake-based Lakehouse:
• Data volume: 500TB raw, 150TB curated
• S: snapshot versions retained
Open Problem: Characterize the Pareto-optimal configurations:
RQ2: Schema Evolution Complexity. Analyze mapping maintenance cost under stochastic schema evolution models [7].
Definition 5.2 (Schema Evolution Process). Schema St evolves via operations: rename, add attribute,
change type, nest/unnest. Model as Markov process with transition probabilities learned from real schema
history.
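A Monte-Carlo sketch of this process, with assumed transition probabilities and per-operation maintenance costs (illustrative values, not measurements):

```python
import random

# Monte-Carlo sketch of Definition 5.2: schema operations drawn from an
# assumed distribution, with illustrative per-operation mapping
# maintenance costs. Probabilities and costs are assumptions.

OPS = ["rename", "add_attribute", "change_type", "nest_unnest"]
PROB = [0.4, 0.3, 0.2, 0.1]
COST = {"rename": 1.0, "add_attribute": 0.5,
        "change_type": 2.0, "nest_unnest": 4.0}

def expected_maintenance_cost(steps=50, trials=10_000, seed=7):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += sum(COST[rng.choices(OPS, weights=PROB)[0]]
                     for _ in range(steps))
    return total / trials

# Analytical per-step expectation: 0.4*1 + 0.3*0.5 + 0.2*2 + 0.1*4 = 1.35,
# so 50 steps should cost about 67.5; the simulation should land close.
print(expected_maintenance_cost())
```

Proving tight bounds on this expectation for realistic, history-dependent transition models (rather than the i.i.d. simplification above) is exactly the open problem.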
Open Problem: Prove bounds on expected mapping maintenance cost:
2. Privacy-Preserving Analytics. Integrate differential privacy, federated learning, and secure multi-party
computation into lakehouse architectures.
3. Neuromorphic Data Processing. Explore spiking neural networks for ultra-low-power streaming analytics at edge devices.
4. Autonomous Data Systems. Self-tuning databases that automatically select indexes, materialized
views, and partitioning strategies using reinforcement learning.
• Gray, J. et al. (1997). ”Data Cube: A Relational Aggregation Operator.” Data Mining and Knowledge
Discovery. [OLAP theory]
• Gupta, H. & Mumick, I.S. (1997). ”Selection of Views to Materialize.” ICDE. [View selection
algorithms]
• Halevy, A.Y. (2001). ”Answering Queries Using Views.” VLDB Journal. [Query rewriting theory]
• Rahm, E. & Bernstein, P.A. (2001). ”A Survey of Approaches to Automatic Schema Matching.”
VLDB Journal. [Schema matching taxonomy]
• Fagin, R. et al. (2005). ”Data Exchange: Semantics and Query Answering.” Theoretical Computer
Science. [Chase/backchase]
• Doan, A. et al. (2012). Principles of Data Integration. Morgan Kaufmann. [Comprehensive textbook]
• Berners-Lee, T. et al. (2001). ”The Semantic Web.” Scientific American. [Vision paper]
• Bizer, C. et al. (2009). ”Linked Data - The Story So Far.” Int’l Journal on Semantic Web and
Information Systems. [RDF/LOD]
• Dean, J. & Ghemawat, S. (2004). ”MapReduce: Simplified Data Processing on Large Clusters.”
OSDI. [MapReduce model]
• Kreps, J. et al. (2011). ”Kafka: A Distributed Messaging System for Log Processing.” NetDB.
[Kafka architecture]
• Akidau, T. et al. (2015). ”The Dataflow Model: A Practical Approach to Balancing Correctness,
Latency, and Cost.” VLDB. [Event-time semantics]
• Carbone, P. et al. (2015). ”Apache Flink: Stream and Batch Processing in a Single Engine.” IEEE
Data Eng. Bull.
• Armbrust, M. et al. (2020). ”Delta Lake: High-Performance ACID Table Storage over Cloud Object
Stores.” PVLDB. [Delta Lake architecture]
• Sculley, D. et al. (2015). ”Hidden Technical Debt in Machine Learning Systems.” NeurIPS. [ML
engineering principles]
• Brown, T. et al. (2020). ”Language Models are Few-Shot Learners.” NeurIPS. [GPT-3 and LLM
capabilities]
• Ratner, A. et al. (2020). ”MLOps: Continuous Delivery and Automation Pipelines in ML.” Google
Cloud white paper.
6.2 Additional Theoretical Foundations
• Nemhauser, G.L. et al. (1978). "Analysis of Approximations for Maximizing Submodular Set Functions." Math. Programming. [Greedy approximation bounds]
• Pottinger, R. & Levy, A.Y. (2000). ”A Scalable Algorithm for Answering Queries Using Views.”
VLDB. [MiniCon algorithm]
• Madhavan, J. et al. (2001). ”Generic Schema Matching with Cupid.” VLDB. [Automated schema
matching]
Code Repository.
[Link]
Contents:
Docker Environment.
docker pull mokhtar/data-analytics-tutorial:2025
docker run -p 8888:8888 -v $(pwd):/workspace mokhtar/data-analytics-tutorial:2025
1. Week 1–2: Study Sections 2–3 (DWH, OLAP). Implement greedy view selection.
3. Week 5–6: Study Section 5 (Big Data). Implement MapReduce word count and RDD transformations.
4. Week 7–8: Study Section 6 (Lakehouse). Implement Delta Lake time travel.
5. Week 9–10: Research project: Extend one algorithm or address open problem.
For Practitioners.
1. Review problem statements and illustrative examples
2. Assess current architecture against evolution timeline
3. Identify technical debt patterns (e.g., dual lake+warehouse)
4. Plan migration path (prioritize reproducibility and governance)
[Star schema diagram: FactSales connected to DimDate, DimCustomer, DimProduct]
Figure 1: Star schema design for GlobalMart (1990s). Fact table contains measures (revenue, quantity);
dimensions provide context (date, customer, product hierarchies).
[Mapping paradigms diagram: global schema ↔ source schemas with different mapping directions]
Figure 2: Data integration mapping paradigms (2000s). GAV: global defined via sources; LAV: sources
described via global; GLAV: bidirectional TGDs.
[Streaming architecture diagram: Kafka topics → Flink/Spark → event-time windows → sinks]
Figure 3: Stream processing architecture (2010s). Events flow through Kafka topics, processed by stateful
operators with watermark-based windowing.
9 Conclusion
This tutorial has traced the 35-year evolution of data analytics through the lens of a single e-commerce
platform, GlobalMart. We have demonstrated how foundational research in data warehousing (1990s), heterogeneous integration (2000s), distributed processing (2010s), and lakehouse architectures (2020s) systematically addressed escalating challenges in scale, heterogeneity, velocity, and reproducibility.
Key Takeaways.
1. Theoretical Rigor: Every practical system rests on formal foundations—view selection theory, query
containment, submodular optimization, snapshot isolation.
2. Evolution, Not Revolution: Modern lakehouses inherit concepts from 1990s views (materialization),
2000s integration (schema mapping), and 2010s streaming (event-time semantics).
3. Reproducibility Imperative: As ML becomes central to analytics, reproducibility transitions from
”nice-to-have” to fundamental requirement.
4. Open Challenges: The field remains vibrant with unsolved problems in adaptive systems, privacy-
preserving analytics, and quantum-ready architectures.
[Lakehouse architecture diagram: object storage (S3/ADLS) + transaction log = ACID + time travel]
Figure 4: Lakehouse architecture (2020s). Delta Lake provides transactional guarantees and versioning over
cloud object storage, unifying lake and warehouse capabilities.
For Researchers. This tutorial provides a roadmap for Ph.D. students entering data management: master
the classics (Lenzerini, Halevy, Fagin), understand industrial systems (MapReduce, Spark, Delta Lake), and
tackle open problems at the intersection of theory and practice.
For Practitioners. This tutorial offers a principled framework for evaluating technology choices: understand the problem each era solved, recognize when legacy patterns cause technical debt, and adopt modern architectures (lakehouse, MLOps) with awareness of their theoretical underpinnings.
The Road Ahead. As we enter the era of AI-driven, autonomous data systems, the next decade will likely
see:
• Self-optimizing systems that automatically tune physical design
References
[1] W.H. Inmon. Building the Data Warehouse. Wiley, 1992.
[2] R. Kimball and M. Ross. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling.
Wiley, 1996.
[3] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh.
Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data
Mining and Knowledge Discovery, 1(1):29–53, 1997.
[4] H. Gupta and I.S. Mumick. Selection of views to materialize in a data warehouse. IEEE Transactions on Knowledge and Data Engineering, 17(1):24–43, 2005.
[5] A.Y. Halevy. Answering queries using views: A survey. VLDB Journal, 10(4):270–294, 2001.
[6] M. Lenzerini. Data integration: A theoretical perspective. In Proceedings of PODS, pages 233–246,
2002.
[7] E. Rahm and P.A. Bernstein. A survey of approaches to automatic schema matching. VLDB Journal,
10(4):334–350, 2001.
[8] R. Fagin, P.G. Kolaitis, R.J. Miller, and L. Popa. Data exchange: Semantics and query answering.
Theoretical Computer Science, 336(1):89–124, 2005.
[9] A. Doan, A. Halevy, and Z. Ives. Principles of Data Integration. Morgan Kaufmann, 2012.
[10] C. Bizer, T. Heath, and T. Berners-Lee. Linked data - the story so far. Int’l Journal on Semantic Web
and Information Systems, 5(3):1–22, 2009.
[11] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings
of OSDI, pages 137–150, 2004.
[12] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M.J. Franklin, S. Shenker, and I.
Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In
Proceedings of NSDI, pages 15–28, 2012.
[13] J. Kreps, N. Narkhede, and J. Rao. Kafka: A distributed messaging system for log processing. In
Proceedings of NetDB, 2011.
[15] M. Armbrust, A. Ghodsi, R. Xin, and M. Zaharia. Delta Lake: High-performance ACID table storage over cloud object stores. Proceedings of the VLDB Endowment, 13(12):3411–3424, 2020.
[16] D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, M. Young, J.-F.
Crespo, and D. Dennison. Hidden technical debt in machine learning systems. In Advances in Neural
Information Processing Systems, pages 2503–2511, 2015.
[17] T. Brown, B. Mann, N. Ryder, M. Subbiah, J.D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G.
Sastry, A. Askell, et al. Language models are few-shot learners. In Advances in Neural Information
Processing Systems, volume 33, pages 1877–1901, 2020.
[18] G.L. Nemhauser, L.A. Wolsey, and M.L. Fisher. An analysis of approximations for maximizing submodular set functions. Mathematical Programming, 14(1):265–294, 1978.
[19] R. Pottinger and A.Y. Levy. A scalable algorithm for answering queries using views. VLDB Journal,
10(2-3):182–198, 2001.
Table 3: Mathematical Notation Reference
Symbol Meaning
S Schema (set of relations)
Q Query
V View (materialized or virtual)
V Set of views
D Database instance
⊑ Query containment
ϕ, ψ Logical formulas (conjunctions)
x̄, ȳ Tuples of variables
⊥ Null value (labeled)
f (Q) Frequency of query Q in workload
cost(Q, V) Execution cost of Q given views V
W Query workload
B Storage budget
t Timestamp (event-time or processing-time)
w Window size
δ Watermark delay bound
device_id VARCHAR(50),
temperature DECIMAL(5,2),
event_type VARCHAR(50),
timestamp TIMESTAMP,
PRIMARY KEY (device_id, timestamp)
);