编写应用程序,实现数据挖掘中的 ETL 和 OLAP 流程。
![ETL](https://upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Conventional_ETL_Diagram.jpg/220px-
Conventional_ETL_Diagram.jpg)
Assignment Overview
In this assignment, you need to write a program to accomplish a few ETL and
OLAP operations. The major objective of this assignment is to get yourself
familiar with the process of constructing and using a data warehouse.
Detailed Requirements
Overview
A data warehouse is a subject-oriented, integrated, time-variant, and
nonvolatile collection of data in support of management’s decision-making. The
following figure illustrates the process of constructing and using a data
warehouse.
ETL
The process of extracting data from source systems and bringing it into the
data warehouse is commonly called ETL, which stands for Extraction,
Transformation, and Loading. Note that ETL refers to a broad process, and not
three well-defined steps. Normally, ETL includes extraction, cleaning,
transformation, loading, and refreshing. Nevertheless, the entire process is
known as ETL.
OLAP
Data warehouses provide online analytical processing (OLAP) tools for the
interactive analysis of multidimensional data of varied granularities, which
facilitates effective data generalization and data mining. Typical OLAP
operations include roll up, drill down, slice, dice, and pivot.
Sample Data Set
A simple data set file titled “Car_Sales_Data_Set.csv” is used as the original
data set for our data warehouse. Specifically:
- a) CSV stands for Comma-Separated Values. A CSV file is a text file that uses a comma to separate values. Often, the first record in a CSV file is a header line including a list of field names. Therefore, it is very easy to dig into a CSV file and look for useful information. You can use any text editor to open a CSV file and view its content. More details about CSV can be found here: https://en.wikipedia.org/wiki/Comma-separated_values
- b) The sample data set is provided via brightspace.
- c) There are 100 records (not including the header line) in the data set.
- d) The data set includes 6 fields: Record_ID, Country, Time_Year, Time_Quarter, Car_Manufacturer, Sales_Units.
- e) In this data warehouse, there are 3 dimensions (i.e. Country, Time, and Car_Manufacturer) and 1 measure (i.e. Sales_Units). In addition, the dimension of Time is associated with a two-level concept hierarchy:
Time_Quarter < Time_Year
.
Required ETL Operation
In practice, the ETL involves many different operations. In this assignment,
we focus on the operation of sorting. Specifically, your program needs to sort
the records three times.
- a) The first sorting deals with the field of “Country”. Once it is done, records associated with Canada are in front of records associated with United States. Now we have two sequential groups of records: Canada_Set and US_Set. Note that both groups are still in the same CSV file.
- b) The second sorting is based on the result of the first sorting. The second sorting deals with the field of “Time_Year”. Once it is done, records associated with Canada are still in front of records associated with United States; however, within the Canada_Set, records associated with 2017 are in front of records associated with 2018 (this also applies the US_Set). Now we have four sequential groups of records in the same CSV file: Canada_2017_Set, Canada_2018_Set, US_2017_Set, US_2018_Set.
- c) The third sorting is based on the result of the second sorting. The third sorting deals with the field of “Time_Quarter” and it does not change the order of the data groups mentioned previously. However, within each data group (e.g. Canada_2017_Set), records are sorted according to the quarter values (in ascending order, i.e. 1, 2, 3, 4). Now we have sixteen sequential groups of records: Canada_2017_1_Set, Canada_2017_2_Set, …, US_2018_3_ Set, US_2018_4_ Set.
- d) Once these sorting operations are finished, the field of Record_ID will be reset so that the first record’s Record_ID is 1, the second record’s Record_ID is 2, …, the 100-th record’s Record_ID is 100.
Required OLAP Operations
For an n-dimensional data cube with each dimension being associated with Li
levels, the number of available cuboids.
The data warehouse in this assignment involves 3 dimensions (i.e. Country,
Time, and Car_ Manufacturer). In addition, the dimension of Time is associated
with a two-level concept hierarchy: Time_Quarter < Time_Year
. Therefore,
the number of cuboids is (1+1) x (2+1) x (1+1) = 12. Specifically, the cuboids
correspond to the following tuples:
- ()
- (Country)
- (Time_Year)
- (Time_Quarter - Time_Year)
- (Car_ Manufacturer)
- (Country, Time_Year)
- (Country, Time_Quarter - Time_Year)
- (Country, Car_ Manufacturer)
- (Time_Year, Car_ Manufacturer)
- (Time_Quarter - Time_Year, Car_ Manufacturer)
- (Country, Time_Year, Car_ Manufacturer)
- (Country, Time_Quarter - Time_Year, Car_ Manufacturer)
Note that “Time_Quarter - Time_Year” means the combo of quarter and year. One
example value of this concept is 1-2017 (i.e. the first quarter of 2017).
These cuboids could be utilized to perform varied OLAP operations.
Required Program
You need to write a program that sorts the sample data set and responds to
varied OLAP queries. Here are the detailed requirements:
- a) Bluenose is used as the testing/marking platform. Therefore, the TA should be able to compile and execute your program via the command-based interface. Note that you need to have a CSID in order to access bluenose via SSH. If you do not have a CSID or you have a problem with your CSID, please contact the CS help desk, which is located on the first floor of Mona Campbell Building.
- b) You should place the sample data set file titled “Car_Sales_Data_Set.csv” in the directory where your program file is located.
- c) The name of your program should be “ETL-OALP”. After running ETL-OALP via the command-based interface, ETL-OALP should read “Car_Sales_Data_Set.csv”, sort the records according to the instructions in Section 2.5, then save the sorting result in a file named “Car_Sales_Data_Set_Sorted.csv”, which is in the same directory as ETL-OALP. Note that “Car_Sales_Data_Set_Sorted.csv” should use the CSV format.
- d) After saving the sorting result, ETL-OALP should display a list of 12 tuples (i.e. the tuples specified in Section 2.6) on the screen. Note that the sequence number associated with each tuple should also be displayed. Each of these tuples corresponds to a cuboid (i.e. a possible OLAP query) in the data cube.
- e) Then ETL-OALP should prompt the user to enter a number in the range of 1-12. Namely, ETL-OALP should prompt the user to select an OLAP query.
- f) Once the user enters a valid number (i.e. a number in the range of 1-12), ETL-OALP will process the corresponding OLAP query and display the result (including a header line, the corresponding dimension values, and aggregated Sales_Units values) on the screen. For example, if the user enters 2, then the first line of the result should be the header line. Then the detailed Country values and aggregated Sales_Units values should be displayed. Here, “XXX” represents the corresponding aggregated value. Note that when the user enters 11 or 12, the corresponding cuboids are 3D. In this case, ETL-OALP will use two 2D tables to display the results: the first one corresponds to Canada and the second one corresponds to United States.
Required Programming Language
You can use Java, C, C++, or Python as the programming language because
bluenose supports these languages (note that you need to have a CSID to access
bluenose via SSH).
Readme File
You need to complete a readme file named “Readme.txt”, which includes the
instructions that the TA could use to complete and execute your program.
Submission: Please pay attention to the following submission
requirements:
- a) You should place “Readme.txt” in the directory where your program file is located.
- b) You should place “Car_Sales_Data_Set.csv” in the directory where your program file is located.
- c) Your program file, “Readme.txt”, and “Car_Sales_Data_Set.csv” should be compressed into a zip file named “YourFirstName-YourLastName-ASN2.zip”. For example, my zip file should be called “Qiang-Ye-ASN2.zip”. Finally, you need to submit your zip file for this assignment via brightspace.
Grading Criteria
The marker will use your submitted zip file to evaluate your assignment. The
full grade is 22 points. The details of the grading criteria are presented as
follows.
- Does “Readme.txt” include enough information so that the TA can easily compile and execute the program on bluenose?
- ETL Function: After running ETL-OALP, “Car_Sales_Data_Set_Sorted.csv” (the file contains the sorted records) is generated and placed in the directory where the compiled program is located.
- OLAP Function: After sorting, ETL-OALP should display a list of 12 tuples. Then ETL-OALP should prompt the user to enter a number in the range of 1-12. Once the user enters a valid number, ETL-OALP will process the OLAP query and display the correct result on the screen.