Machine Learning Assignment Writing: Algorithm Design


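The assignment to be written this time is divided into five questions, each of which requires designing and implementing a relevant algorithm.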

Question 1

Design a dataset for training a machine learning model such that the model is
100% accurate on the training data but 0% accurate on the test data. Show an
example of the data set.
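One possible construction, sketched under the assumption that the learner is a 1-nearest-neighbour classifier (this choice of model is ours, not part of the question): every training point is its own nearest neighbour, so training accuracy is 100%, while each test point sits next to a training point of the opposite class.

```python
def predict_1nn(train, x):
    """Return the label of the training point closest to x."""
    _, nearest_label = min(train, key=lambda point: abs(point[0] - x))
    return nearest_label


def accuracy(train, data):
    """Fraction of (x, label) pairs in data that 1-NN on train gets right."""
    hits = sum(predict_1nn(train, x) == label for x, label in data)
    return hits / len(data)


# Toy one-dimensional data set: (feature, label)
train = [(0.0, "A"), (10.0, "B")]
# Each test point lies close to a training point with the opposite label
test = [(1.0, "B"), (9.0, "A")]

train_acc = accuracy(train, train)
test_acc = accuracy(train, test)
print(train_acc, test_acc)  # 1.0 0.0
```

The same effect can be obtained with any model that memorises its training data, as long as the test distribution is chosen to contradict what was memorised.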

Question 2

Write an algorithm that uses deep learning to do co-training. Describe the
algorithm and design an example for illustration.

Note

Co-training has not been introduced in the lectures; you may first search for
relevant information and study it yourself before writing the algorithm.
For deep learning, you may treat it as a “black box” function Y = D(x) for an
input vector x.
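A minimal sketch of a co-training loop, assuming each example comes with two views of its features. The class `CentroidModel` is a hypothetical stand-in for the black box D (a real solution would plug in a deep network with the same `fit`/`predict` interface), and the toy data is invented for illustration.

```python
class CentroidModel:
    """Stand-in for the black box Y = D(x): nearest centroid on one scalar view."""

    def fit(self, xs, ys):
        self.centroids = {}
        for label in set(ys):
            values = [x for x, y in zip(xs, ys) if y == label]
            self.centroids[label] = sum(values) / len(values)
        return self

    def predict(self, x):
        """Return (label, confidence); confidence is the gap between the two nearest centroids."""
        dists = sorted((abs(x - c), label) for label, c in self.centroids.items())
        margin = dists[1][0] - dists[0][0] if len(dists) > 1 else 0.0
        return dists[0][1], margin


def train_views(labeled):
    """Train one model per view on the current labeled set."""
    ys = [y for _, y in labeled]
    return [CentroidModel().fit([x[v] for x, _ in labeled], ys) for v in (0, 1)]


def co_train(labeled, unlabeled, rounds=10):
    """labeled: list of ((view0, view1), y); unlabeled: list of (view0, view1)."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        models = train_views(labeled)
        for view, model in enumerate(models):
            if not pool:
                break
            # This view's model pseudo-labels the pool example it is most sure about;
            # the example then becomes training data for the other view as well.
            best = max(pool, key=lambda x: model.predict(x[view])[1])
            label, _ = model.predict(best[view])
            pool.remove(best)
            labeled.append((best, label))
    return train_views(labeled), labeled


seed_labeled = [((0.0, 0.0), 0), ((10.0, 10.0), 1)]
pool = [(1.0, 2.0), (9.0, 8.0), (2.0, 1.0), (8.0, 9.0), (4.0, 1.0), (6.0, 9.0)]
models, final_labeled = co_train(seed_labeled, pool)
print(len(final_labeled), models[0].predict(1.0)[0], models[1].predict(9.0)[0])
```

Picking only the single most confident example per view and per round keeps the sketch short; adding several examples per class in each round is a common variation.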

Question 3

Question 3.1

Design a distance function to evaluate the similarity between two customers in
the domain of online purchases, e.g. Amazon.com. Assume the database records
the following attributes:
Customer_id
User_name (composed of at most 10 characters)
Purchased_items (the set of items the customer bought last month)
Payment_methods (a nominal attribute of 3 values: visa, paypal, on_delivery)
Amount_spend (average amount spent per purchase in dollars and cents; it has a mean of 200.00 and a standard deviation of 50; the minimum is 0.02 and the maximum is 980)
Age groups (an ordinal attribute of 4 values: <=17; 18 – 29; 30 – 49; >=50)
Purchase_reviews (the set of customer reviews submitted)
Explain your design.
Note: you can choose which attributes to be included in the distance function.
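One way such a distance could look, as a sketch only: a weighted average of per-attribute distances, each scaled to [0, 1]. The weights, the dictionary keys, the 3-standard-deviation cap on Amount_spend, and the two sample customers are all assumptions made for illustration. Customer_id and User_name are left out because they identify a customer rather than describe behaviour.

```python
AGE_GROUPS = ["<=17", "18-29", "30-49", ">=50"]  # ordinal order from the attribute list
AMOUNT_STD = 50.0  # standard deviation stated in the question


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)


def customer_distance(c1, c2, weights=(0.4, 0.1, 0.3, 0.2)):
    """Weighted average of four per-attribute distances (weights are illustrative)."""
    # Set-valued attribute: Jaccard distance
    d_items = jaccard_distance(c1["items"], c2["items"])
    # Nominal attribute: 0 if equal, 1 otherwise
    d_pay = 0.0 if c1["payment"] == c2["payment"] else 1.0
    # Numeric attribute: scaled by the standard deviation rather than the range,
    # because the maximum (980) is an outlier; differences beyond 3 std are capped
    d_amount = min(abs(c1["amount"] - c2["amount"]) / (3 * AMOUNT_STD), 1.0)
    # Ordinal attribute: rank difference divided by the largest possible difference
    rank1, rank2 = AGE_GROUPS.index(c1["age"]), AGE_GROUPS.index(c2["age"])
    d_age = abs(rank1 - rank2) / (len(AGE_GROUPS) - 1)
    parts = (d_items, d_pay, d_amount, d_age)
    return sum(w * d for w, d in zip(weights, parts)) / sum(weights)


alice = {"items": {"book", "pen"}, "payment": "visa", "amount": 200.0, "age": "18-29"}
bob = {"items": {"book", "lamp"}, "payment": "paypal", "amount": 250.0, "age": "30-49"}
d_ab = customer_distance(alice, bob)
print(round(d_ab, 4))
```

Scaling Amount_spend by the standard deviation instead of the full range avoids letting the single extreme value of 980 compress every typical difference towards zero.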

Question 3.2

In this question, you are asked to compare the results of MinHashing with
those of the exact Jaccard similarity. You have to use the BBCSport
data set from http://mlg.ucd.ie/datasets/bbc.html . You can download and use the
pre-processed dataset.

  1. Compute the exact Jaccard similarities for all pairs of articles. List the pairs of documents with similarity at least 0.5.
  2. Using MinHashing, generate the signature matrix with 50 hash functions. From the signature matrix, compute a similarity matrix S in which every S(i,j) is the estimated similarity of articles i and j. List the pairs of documents with similarity at least 0.5.
    You may use the following settings:
    * Number of hash functions: 50
    * Hash function: h_i(r) = (a_i * r + b_i) % c
    where a_i and b_i are randomly chosen integers less than the maximum value of r, and
    c is a prime number slightly bigger than the maximum value of r. List the
    values of a_i, b_i and c used.
  3. Compare the results in steps (1) and (2) and evaluate the approximate Jaccard similarity obtained in step (2). Report the number of false positives and false negatives.
  4. Repeat steps (2) and (3) using 100 hash functions.
  5. Report your observations.
    Note: You may use any tool or programming language.
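The steps above can be sketched as follows. This is an illustrative outline on four tiny made-up documents (sets of term row numbers r), not on the BBCSport data; the helper names and the fixed random seed are assumptions. The hash family follows the suggested form (a_i * r + b_i) % c.

```python
import random
from itertools import combinations


def next_prime(n):
    """Smallest prime strictly greater than n."""
    def is_prime(k):
        return k >= 2 and all(k % i for i in range(2, int(k ** 0.5) + 1))
    k = n + 1
    while not is_prime(k):
        k += 1
    return k


def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def minhash_signatures(docs, num_hashes, max_row, seed=0):
    """Return (signatures, [(a_i, b_i), ...], c) for h_i(r) = (a_i * r + b_i) % c."""
    rng = random.Random(seed)
    c = next_prime(max_row)
    # a_i and b_i are random integers below the maximum value of r (a_i is non-zero)
    params = [(rng.randrange(1, max_row), rng.randrange(max_row)) for _ in range(num_hashes)]
    sigs = [[min((a * r + b) % c for r in doc) for a, b in params] for doc in docs]
    return sigs, params, c


def estimated_similarity(sig1, sig2):
    """Fraction of hash functions on which the two signatures agree."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)


# Toy documents: each is the set of row numbers r of the terms it contains
docs = [set(range(0, 10)), set(range(2, 12)), set(range(15, 20)), {0, 1, 2, 15}]
max_row = max(max(d) for d in docs)
pairs = list(combinations(range(len(docs)), 2))

exact = {(i, j) for i, j in pairs if jaccard(docs[i], docs[j]) >= 0.5}
sigs, params, c = minhash_signatures(docs, num_hashes=50, max_row=max_row)
approx = {(i, j) for i, j in pairs if estimated_similarity(sigs[i], sigs[j]) >= 0.5}

false_positives = approx - exact  # reported by MinHash but not truly similar
false_negatives = exact - approx  # truly similar but missed by MinHash
print("c =", c, "exact:", sorted(exact), "approx:", sorted(approx))
```

Rerunning with `num_hashes=100` gives the comparison asked for in step 4; the `params` list and `c` are the values the question asks to be reported.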

Question 4

Let’s look at data integration for the music industry. Identify at least
three sources of music information, including:

  • One providing music metadata (e.g. about musical artists, music albums, labels, and genres)
  • One providing music streaming services
  • One providing music-related information, e.g. popularity rankings, music reviews, or information about concerts

Based on the topics introduced in this course, discuss with examples
how the following aspects of data integration of the above sources of music
information can be performed. For each aspect, also describe with examples
whether there are any challenges in the “V” dimensions (variety, volume,
veracity and velocity), and what they are.

  • a) Schema alignment, such as the construction of the mediated schema, attribute matching and schema mapping
  • b) Record linkage
  • c) Data fusion

You may find a list of music databases for your reference here:
https://en.wikipedia.org/wiki/List_of_online_music_databases
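As a rough illustration of record linkage and data fusion, the sketch below links album records by string similarity of a normalised artist-plus-title key and resolves conflicting values by majority vote. The records, field names and the 0.85 threshold are invented for the example and do not come from any real music source.

```python
from collections import Counter
from difflib import SequenceMatcher


def normalize(text):
    """Lower-case, expand '&', and collapse repeated whitespace."""
    return " ".join(text.lower().replace("&", "and").split())


def same_entity(r1, r2, threshold=0.85):
    """Record linkage: compare normalised 'artist title' keys by similarity ratio."""
    key1 = normalize(r1["artist"] + " " + r1["title"])
    key2 = normalize(r2["artist"] + " " + r2["title"])
    return SequenceMatcher(None, key1, key2).ratio() >= threshold


def fuse(values):
    """Data fusion by majority vote; on a tie the first-seen value wins."""
    return Counter(values).most_common(1)[0][0]


# Invented records standing in for a metadata source, a streaming source and a review source
rec_a = {"artist": "The Example Band", "title": "First Album", "year": 2001}
rec_b = {"artist": "the example  band", "title": "First album", "year": 2001}
rec_c = {"artist": "Another Artist", "title": "Other Record", "year": 1999}

linked_ab = same_entity(rec_a, rec_b)
linked_ac = same_entity(rec_a, rec_c)
fused_year = fuse([2001, 2001, 2002])  # e.g. three sources disagreeing on a release year
print(linked_ab, linked_ac, fused_year)  # True False 2001
```

Majority voting treats all sources as equally reliable; weighting votes by an estimate of each source's accuracy is one way to address the veracity dimension.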

Question 5

Suppose the university would like to release the data of student records as
“open data” to help researchers and public communities to investigate and
discover useful information about education. What are the privacy concerns about
publishing and releasing the data? What techniques or approaches are
required to preserve privacy? Explain with examples.
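One technique that could be discussed here is k-anonymity through generalisation of quasi-identifiers. The sketch below is an assumption-laden illustration: the student records, the field names, and the choice of age and postcode as quasi-identifiers are all invented.

```python
from collections import Counter


def raw_key(record):
    """Quasi-identifiers exactly as stored."""
    return (record["age"], record["postcode"])


def generalized_key(record):
    """Coarsen quasi-identifiers: age to a 10-year band, postcode to its first 2 characters."""
    low = record["age"] // 10 * 10
    return (f"{low}-{low + 9}", record["postcode"][:2] + "**")


def is_k_anonymous(records, k, key):
    """True if every combination of quasi-identifier values is shared by at least k records."""
    groups = Counter(key(r) for r in records)
    return all(count >= k for count in groups.values())


# Invented student records; 'grade' is the sensitive attribute to be released
students = [
    {"age": 21, "postcode": "2006", "grade": "A"},
    {"age": 23, "postcode": "2008", "grade": "C"},
    {"age": 34, "postcode": "3050", "grade": "B"},
    {"age": 37, "postcode": "3052", "grade": "A"},
]

raw_ok = is_k_anonymous(students, 2, raw_key)
generalized_ok = is_k_anonymous(students, 2, generalized_key)
print(raw_ok, generalized_ok)  # False True
```

Removing names and student IDs alone leaves every record unique on (age, postcode) in this toy data, which is why the raw check fails; k-anonymity by itself does not prevent attribute disclosure when all records in a group share the same sensitive value.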


Author: SafePoker
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit SafePoker when reposting!