Implement a database's two major modules: the Optimizer and the Execution Engine.
Requirement
In this assignment, you’ll be creating an optimizer and execution engine for a
relational database management system. Your system must take a restricted
subset of SQL as input, evaluate those queries against a dataset, and
produce the correct results. You may write your system in any language you
choose, although five extra credit points will be awarded to anyone who does
not use Java.
Inputs
Your program will communicate through standard input and standard output. Each
time your program is invoked, the following will occur:
- A comma-separated line of paths to CSV files, each corresponding to a relation (e.g., A.csv is the data for the relation A), will be sent to your program.
- There will be a pause of a pre-determined amount of time, in which your program can prepare for query processing (e.g., convert each CSV file into a different format, build index structures).
- After the pause, the next line of input will contain the number of queries you are to process.
- The rest of the input will contain SQL queries which you should process.
After processing each SQL query, you should output the result to standard
output (e.g., System.out.println in Java).
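For illustration only, a minimal sketch of this input loop in Java might look
like the following (buildIndexes and execute are hypothetical placeholders, not
part of the assignment):

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class Main {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));

        // First line: comma-separated paths to the CSV files, one per relation.
        String[] csvPaths = in.readLine().split(",");

        // Use the indexing-time pause to prepare; the next readLine() simply
        // blocks until the query count arrives. buildIndexes is a hypothetical helper.
        buildIndexes(csvPaths);

        // Next line of input: the number of queries to process.
        int numQueries = Integer.parseInt(in.readLine().trim());

        StringBuilder query = new StringBuilder();
        int processed = 0;
        String line;
        while (processed < numQueries && (line = in.readLine()) != null) {
            if (line.isEmpty()) continue;              // queries are separated by blank lines
            query.append(line).append(' ');
            if (line.endsWith(";")) {                  // the last line of each query ends in ';'
                System.out.println(execute(query.toString()));  // hypothetical executor
                query.setLength(0);
                processed++;
            }
        }
    }

    static void buildIndexes(String[] csvPaths) { /* omitted */ }
    static String execute(String sql) { /* omitted */ return ""; }
}

Note that the pause after the first line happens on the sender's side, so the
second readLine() simply blocks until indexing time is over.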
Relations
The first line of input sent to your program is a comma-separated list of
paths to CSV files. For example, this line might look like:
data/xxxs/A.csv,data/xxxs/B.csv,data/xxxs/C.csv
This line would correspond to a database with three relations (A, B, and C),
each stored in the respective file. This line is not guaranteed to be in any
particular order.
Each CSV file will contain data, for example:
123,43,45,31,809
54,-623,693,1942,123
which corresponds to a table with five columns and two rows. CSV files will never
have a header row. All values in each CSV file are (potentially negative)
integers. All values are within the range of Java’s int primitive (i.e., a
32-bit signed integer).
All relation names will be a single uppercase English ASCII character (i.e., A
through Z). There are thus a maximum of 26 relations.
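As a sketch of one possible (not required) in-memory layout, each CSV file
could be loaded into column-major int arrays; the class name here is
hypothetical:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: loads one relation's CSV file into column-major int arrays.
class CsvRelation {
    final int[][] columns;   // columns[c][r] is the value of column c in row r

    CsvRelation(String path) throws Exception {
        List<int[]> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",");
                int[] row = new int[parts.length];
                for (int i = 0; i < parts.length; i++) {
                    row[i] = Integer.parseInt(parts[i]);   // every value fits in a 32-bit int
                }
                rows.add(row);
            }
        }
        int numCols = rows.isEmpty() ? 0 : rows.get(0).length;
        columns = new int[numCols][rows.size()];
        for (int r = 0; r < rows.size(); r++) {
            for (int c = 0; c < numCols; c++) {
                columns[c][r] = rows.get(r)[c];
            }
        }
    }
}

Keep in mind that holding every relation fully in memory may not fit within the
1GB limit described under Environment for the larger datasets, so a disk-backed
layout built during indexing time may be preferable.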
Indexing time
After the first line of input, your program will receive no input for a fixed
period of time, called indexing time. During this time, you can read each CSV
file and write files to disk. You should use this time to prepare for query
processing.
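One possible use of indexing time, shown only as a sketch, is to rewrite each
CSV file into fixed-width binary column files so that queries later read only
the columns they touch; the class and the file-naming scheme below are
assumptions, not requirements:

import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.FileReader;

// Hypothetical indexing step: rewrites relation A's CSV into A_c0.bin, A_c1.bin, ...
// in the working directory, each holding that column as consecutive 4-byte ints.
class ColumnWriter {
    static void writeColumns(String csvPath, String relationName, int numCols) throws Exception {
        DataOutputStream[] outs = new DataOutputStream[numCols];
        for (int c = 0; c < numCols; c++) {
            outs[c] = new DataOutputStream(new BufferedOutputStream(
                    new FileOutputStream(relationName + "_c" + c + ".bin")));
        }
        try (BufferedReader reader = new BufferedReader(new FileReader(csvPath))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",");
                for (int c = 0; c < numCols; c++) {
                    outs[c].writeInt(Integer.parseInt(parts[c]));   // one pass, constant memory
                }
            }
        }
        for (DataOutputStream out : outs) {
            out.close();
        }
    }
}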
Query processing
After the indexing time delay, you will receive the total number of queries to
process, followed by the queries themselves. For example:
SELECT SUM(D.c0), SUM(D.c4), SUM(C.c1)
FROM A, B, C, D
WHERE A.c1 = B.c0 AND A.c3 = D.c0 AND C.c2 = D.c2
AND D.c3 > -9496;
SELECT SUM(A.c1)
FROM A, C, D
WHERE A.c2 = C.c0 AND A.c3 = D.c0 AND C.c2 = D.c2
AND C.c2 < 2247;
SELECT SUM(A.c6), SUM(A.c2)
FROM A, B, C, D
WHERE A.c2 = C.c0 AND A.c1 = B.c0 AND C.c2 = D.c2
AND A.c4 = -9868;
Each SQL query will occupy four lines.
- The first line will contain several SUM aggregates, specifying a number of columns that should be totaled for each query. You may assume that SUM is the only aggregate used, and that all columns in the SELECT clause are aggregated. In other words, each SQL query will output only a single row, and each column of that output row will be the sum of some column. There will never be a GROUP BY clause.
- The second line will contain the FROM clause, which will always contain at least one relation.
- The third line will contain the first part of the WHERE clause. Each predicate on this line will be separated by an AND, and will represent a simple equijoin predicate.
- The final line will contain all non-join predicates. Each predicate on this line will contain a column on the left hand side and a constant value on the right hand side. Only the operators =, >, and < will be used. There may be zero or more of these predicates. This line will always end in a semicolon.
Each query will be separated by a blank line.
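Because the grammar is this restricted, a full SQL parser is unnecessary; a
minimal sketch of splitting one query's four lines into its pieces might look
like the following (the class and field names are hypothetical):

import java.util.ArrayList;
import java.util.List;

// Hypothetical container for one parsed query in the restricted grammar.
class ParsedQuery {
    List<String> sumColumns = new ArrayList<>();   // e.g., "D.c0"
    List<String> relations  = new ArrayList<>();   // e.g., "A", "B"
    List<String[]> joins    = new ArrayList<>();   // pairs such as {"A.c1", "B.c0"}
    List<String[]> filters  = new ArrayList<>();   // triples such as {"D.c3", ">", "-9496"}

    // Expects exactly four lines: SELECT, FROM, join predicates, non-join predicates.
    static ParsedQuery parse(String[] lines) {
        ParsedQuery q = new ParsedQuery();
        // Line 1: SELECT SUM(D.c0), SUM(D.c4), ...
        for (String s : lines[0].replace("SELECT", "").split(",")) {
            q.sumColumns.add(s.trim().replace("SUM(", "").replace(")", ""));
        }
        // Line 2: FROM A, B, C, D
        for (String r : lines[1].replace("FROM", "").split(",")) {
            q.relations.add(r.trim());
        }
        // Line 3: WHERE A.c1 = B.c0 AND ...
        String joinLine = lines[2].replace("WHERE", "").trim();
        if (!joinLine.isEmpty()) {
            for (String p : joinLine.split("AND")) {
                String[] sides = p.split("=");
                q.joins.add(new String[] { sides[0].trim(), sides[1].trim() });
            }
        }
        // Line 4: AND D.c3 > -9496;   (zero or more comparisons against constants)
        String filterLine = lines[3].replace(";", "").trim();
        if (filterLine.startsWith("AND")) filterLine = filterLine.substring(3).trim();
        if (!filterLine.isEmpty()) {
            for (String p : filterLine.split("AND")) {
                String op = p.contains("=") ? "=" : (p.contains("<") ? "<" : ">");
                String[] sides = p.split("\\s*[=<>]\\s*");
                q.filters.add(new String[] { sides[0].trim(), op, sides[1].trim() });
            }
        }
        return q;
    }
}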
Your program must output the correct result for each query, in the order the
queries are given. Each query result should be formatted as a comma-separated
list, with no spacing. For example:
132,5432191
may be the output for a particular query. Each query result should end in a
newline.
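As a small sketch, the sums for one query, kept in long values since totals of
many 32-bit ints can overflow an int, could be joined into the required format
like this (the class name is hypothetical):

// Hypothetical helper: format one query result as a comma-separated line with no spaces.
class ResultFormatter {
    static String format(long[] sums) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < sums.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(sums[i]);
        }
        return sb.toString();   // println supplies the trailing newline
    }
}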
Data
Several datasets are provided, ranging in size from XXXS (small enough to
compute by hand) to L (multiple gigabytes). These datasets are provided in
the data/ directory. Each dataset also comes with a queries.sql file
containing the queries that will be sent to your program, and an answers.txt
file containing the correct results for each query.
You may not read the answers.txt file from your program. Assume the
answers.txt file is not available during query processing.
Program setup
Your code should include two shell scripts:
- compile.sh: this script should invoke all the commands required to compile your code, e.g., call javac. If your code does not need to be compiled, you may leave this file blank. The compile.sh script may not access any data in the data directory.
- run.sh: this script should run your program, accepting input and producing query results as output. For example, this script may contain java MyProgram or python3 main.py.
To run your program while you are developing, you may use our run_with_data.py
script. For convenience, this script waits only 1 second before sending the
query workload. This script will call run.sh and feed your program the proper
input, printing out the results produced by your program. For example, to run
using the XXXS dataset:
python3 run_with_data.py ../data/xxxs
To check the correctness of your program, you may use our grade.py script.
This script will call run.sh in much the same way as run_with_data.py, but
will also enforce a particular amount of indexing time, a timeout, and will
automatically check your query results against the correct answer. For
example, to run your code with 5 seconds of indexing time and a 30 second
timeout on the XXXS dataset:
python3 grade.py 5 30 ../data/xxxs
The grade.py script will record your program’s standard error and standard out
into proc_stderr and proc_stdout.
Finally, the package.sh script will pack up everything in your working
directory into a single dist.tar.gz package, which you can submit for grading.
Please note that the maximum file size for this package is 5MB (so be sure not
to keep the data directory in the same folder).
Data
The dataset is too large to distribute over LATTE. You can download it.
Uncompressed, the data is a little over 5GB.
Environment
The performance of your code will be measured in the same environment as every
other student. Your code will execute on a machine with 1GB of RAM and
approximately 50GB of hard disk space. You should pay special attention to how
your program uses memory. If you use over 1GB of memory, your program will
crash. This machine will not be connected to the Internet. You may safely
write data to the working directory your program is executed in.
Several compilers and programming languages are installed, including gcc, g++,
javac, rust, cargo, python3. For python3, numpy and pandas are installed.
You can create a virtual machine of the exact execution environment we will
test your code with using vagrant. To SSH into such a machine, install
Vagrant, create a new directory, and execute:
vagrant init RyanMarcus/arch-dev
vagrant up
vagrant ssh
You can edit the resulting Vagrantfile to limit the amount of memory the VM
gets (set it to 1024 for 1GB).
Submitting to the server
Within the first two weeks of the assignment being out, we will provide an
evaluation server. You will be able to submit your code to the evaluation
server, which will execute your code and time its performance. The results will
then be recorded, and your submitted code will be saved.