SEARCH WITHIN CONTENT
Citation Information : International Journal of Advanced Network, Monitoring and Controls. Volume 3, Issue 2, Pages 126-132, DOI: https://doi.org/10.21307/ijanmc-2018-045
License : (CC-BY-NC-ND-4.0)
Published Online: 07-May-2018
In order to solve the problem of information overload of music system under large data background, this paper studies the design scheme of distributed music recommendation system based on Hadoop. The proposed algorithm is based on the MapReduce distributed computing framework, which has high scalability and performance, and can be applied to the calculation and analysis of off-line data efficiently. The music recommendation system designed in this paper also includes client, server interface, database and ETL operation, which can calculate a set of complete recommendation system from user operation end to server and data calculation. In order to improve the accuracy of the recommendation algorithm, this paper introduces k-means clustering algorithm to improve the recommendation algorithm based on user-based collaborative filtering.The experimental results show that the accuracy of the proposed algorithm has been significantly improved after the introduction of k-means.
With the development of the mobile Internet, the amount of data generated by mobile APP has increased rapidly in recent years. On July 27, 2017, Trustdata, a well-known mobile big data monitoring platform in China released the “Analysis Report of China Mobile Internet Development in the First Half of 2017” . Among them, mobile music as a high-frequency application, the number of the user be showed a steady growth in the first half of 2017, peak DAU (daily active users) nearly 150 million. Taking Netease cloud music as an example, since its client APP was launched in April 2013,the number of users has reached 300 million, song orders 400 million. Such a huge amount of data makes the traditional single-server data storage and processing becomes more and more clumsy. Moreover, it is more and more difficult for users to find their favorite songs in huge amounts of data. After all, favorite songs are only a few, and finding them one by one can be very difficult. The past practice is that when users want to listen to some songs, they search for them by engines, but only find songs they have known. A lot of songs that users do not know, but will probably like very much, will never be heard. If there is a system specifically for users to push songs, users will spend less time searching for songs, and the stickiness from users to the system will increase. Based on this, this paper uses Hadoop, a big data computing and storage framework, to store and calculate data, so as to solve the problem in finding songs, and also improve user’s activity and stickiness. The Hadoop-based recommendation system used in this paper has the following important implications in today’s Internet context:
1) To effectively solve the problem of “information overload”, to provide users with interesting content by exploring the relationship between users and songs;
2) Hadoop cluster parallel computing and distributed storage technology has good scalability, can effectively handle massive data storage and computing problems;
3) For the enterprise, the system with the recommended function can enhance the user experience and increase the user’s activity and stickiness.
The traditional user-based collaborative filtering recommendation algorithm is based on several users that are most similar to this user’s interest, and then recommends songs that the user has not heard from these similar users. The specific method is to find each user by user similarity algorithm some of the most similar users, and then summarize the similar user listening records, from which the target user to find out the songs have not been heard, and then similar users to sort the similarity to obtain The preference of each song, the order of these songs, so as to target users recommend . Specific algorithm steps are as follows:
1) Construct users - song data representation matrix. The row vector represents the user, and the column vector represents each song. The matrix value indicates whether the user has heard of the song, 0 means that it has not been heard, and 1 means it has been heard. It is a 0-1 matrix.
2) Generate the nearest neighbor set. According to the user-song matrix, the similarity algorithm is used to calculate the user’s similarity so as to find the user set closest to the target user. Formula (1) calculates the similarity of two users.
3) Produce recommendations. The recommendation value of a song that a similar user has heard while the target user has not heard is determined by the similarity of the user with the target user. A user set must have many users that have heard the same song, and this song’s recommended value is the sum of the similarities of these users. In this way, songs that have not been heard by the target user can be sorted according to the recommended values, and songs with high recommendation values will be preferentially pushed to the target user. Equation (2)  calculates the user’s preference for song i.
Clustering is an unsupervised learning method which aggregates data with similar attributes on the basis of nomanual labeling. It combines data for similarities and the data in the same group have similarities. The data in the different group is different from each other. The improved collaborative filtering recommendation algorithm in this paper is based on the user base with high similarity, so that the algorithm has a better recommendation effect. Similar calculation directly restricts the effectiveness of the clustering effect and thus affects the final recommendation result. The principle of the algorithm is as follows: Suppose there is a group of users User, the total number of users is m, remember to be U (U1, U2, U3, …, Um), each user Ux has n attributes, recorded as Cx (Cx1, Cx2, Cx3, …, Cxn), the principle of clustering is to compare each attribute on the basis of the set U and divide into the groups of similar users . The core idea of the k-means algorithm is to divide a given group of users into k groups. Within each k group, a cluster center is set to calculate the distance of each data from the center. The minimum distance is attributed to this group.
1) The removal of free points . Of all the data points, the free points are those that are far away from all other points, and their existence will cause the deviation of the center point in the belonging class and thus the classification effect. The process of removing free points in this paper is as follows:
Let the total number of users be m, the total number of paths of all users to other users be calculated according to the formula (3):
Then the sum of all users is:
Formula (6) is the average of all user distances. For each user U, compute the distance U from all other users Lu = (Ui, Uj). If all Lu> = EMV, the user is classified as a free point, a separate category. If the number of free points is small, they can’t be classified into a category for collaborative filtering, so you can classify it to a class that is closest to a type of center point.
2) The selection of random points will also have an impact on the classification results, and fall into the local optimal solution. In order to solve this problem, the method adopted in this paper is to adopt the improved algorithm of clustering: dichotomous clustering . The idea of dichotomous clustering is to first classify all the points as a cluster and then divide it into two (k = 2 k-means clustering), and then select the class that can minimize the clustering cost function again and divide it into two formula until the number of clusters equals k. The cluster cost function is defined as the sum of squared error of the cluster, as shown in formula (7). The largest square error sum of a class, this kind of point in the distance from the center of the maximum distance, you need to be divided once again.
This section describes the implementation and testing of the entire system and what frameworks are included in each of the system’s features. First of all, introduce the top-level design; then analyze the overall framework of the system, what technologies are needed, and the overall process of the system; finally, we evaluate the recommended results from the accuracy and recall rate of recommendations.
The K-means clustering-based collaborative filtering recommendation algorithm proposed in the previous section is mainly divided into two steps, one is to use k-means to implement user clustering, and the other is to perform a recommendation algorithm based on user collaborative filtering thus generating the recommended result.
Algorithm parallelization of the flow chart is shown in figure 1. Clustering algorithm is divided into three steps. The first step is to create a user tag model: user log table and song list of commonly used tags through a step MapReduce process to generate user-tag model, the tag file as a cache file for each user’s song recording tag statistics in the tag vector Increase the number of each position to generate a user label matrix. The second step is to use k-means algorithm to calculate the cluster center point of user label matrix. After several iterations, the relatively stable center point of each cluster is determined. The third step reads the central point file as a cache, in order to classify users, by calculating the user from which one of the nearest center, put this user into which cluster. The user-based collaborative filtering algorithm will recommend collaborative filtering for users in each cluster.This step is divided into a number of steps, including counting the number of a song being listened to, counting the number of a user listening to music, calculating user similarity, generating recommended results.
The recommended system is divided into the recommended algorithm layer, server-side and client. The recommended algorithm layer uses Hadoop cluster for distributed recommendation, the server side uses the Java Servlet and MySQL database for development, and the client side uses Android for display. Log collection is collected using the ETL tool Sqoop. System overall framework is shown in Figure 2.
It can be seen from the figure that the recommended system is divided into 4 major parts: Top-level client, server, database and Hadoop layer. The client uses the Android system for display. Server development uses Java EE development. Database uses MySQL. MySQL is the intermediate data link between the client and the recommendation system. Data is stored in the database so that front-end servers and Hadoop are decoupled. The transfer of data between the database and Hadoop takes the Sqoop. Sqoop is a distributed log collection system based on Hadoop, which requires Hadoop to run, which can also speed up data transfer. The collected data is stored in HDFS. The Hadoop layer is divided into two parts, one is HDFS data storage, and the other is MapReduce distributed computing. MapReduce reads the data from HDFS and calculates it, and then saves the results back to HDFS. Sqoop reads data from the HDFS and transfers to MySQL, the client requests the server at irregular intervals, the server reads data from the database and returns it to the client, and the user’s operation of songs is fed back to the server and saved to the database.
When users use the recommendation system, the system recommends different songs to different users, which requires users to log in the system with different user names. In order to log in, the system also has to provide the user registration function. To listen to different songs, users must also be able to search for songs and add tags to the songs. Because we have no songs to play, we also need a collection or a favorite feature for users. The server side designs the corresponding interfaces based on these requirements, and the database also designs different tables to store the content. Accordingly, we have designed the function diagram of the system, as shown in Figure 3.
As can be seen from the figure, the system function is decomposed into four modules. Clients have user registration, login, search songs, tag songs and play features. Play actually refers to favorites or likes. Because the song involves copyright issues, it can only be replaced by favorites or likes. These functions correspond to the server-side interface to provide data. Data is read from the database, so the database is designed with three tables: the user table, used to register and log in; the table for recording users’ listening to songs, with the largest amount of data; song table storage song’s basic information, including tag information, search songs and add tags.
Algorithm parallelization is based on the k-means clustering algorithm introduced in the previous chapter and user-based collaborative filtering recommendation algorithm is implemented in the distributed system, the distributed recommendation algorithm is divided into two modules, one is distributed k-means Clustering algorithm, one is the distributed collaborative filtering recommendation algorithm, and finally the two algorithms connected to achieve the work of the entire distributed recommendation algorithm.
The data input of the K-means algorithm is a matrix, because here the user is clustered, so the user-label matrix is first formed by data preprocessing. User listening records are large files and need to be designed in parallel. Listener recorded data into the HDFS behind the addition of each song’s label, the formation of (user id, song id, song time, label) this format records. Then use MapReduce to parallelize the user-label matrices in one step. Then the user-label matrix is clustered. The clustering and parallelization algorithm is designed as follows: The first step scans all the original points and randomly selects k points as the center point of the first step cluster. The second step is to calculate the cluster of all the points to each center point, and to point each point to the closest center point cluster. The third step to repeat the second step to meet the termination conditions. The fourth step is to calculate all the user points, the user assigned to their respective clusters. Algorithm architecture is shown in Figure 4.
According to the formula (1) and (2) in the first section, combined with the rules of Hadoop distributed design, the following steps are designed for the recommendation algorithm: The first step is to count how many songs each user listens to; the second step is to count the number of times each song is heard; the third step is to calculate the similarity of every two users, but only the similarity of the two users who have heard the same song need to be computed; the fourth step, is to calculate each user’s recommending value for each song to form a recommend list; The final step is to sum up the recommending values for a user listening to the same song. The above steps are implemented based on MapReduce, and the entire work flow requires a number of Map and Reduce to complete. Figure 5 shows the MapReduce architecture based on the user collaborative filtering algorithm.
In the same situation as the data-set, many times of repeated experiments were done on the traditional user-based collaborative filtering algorithm, the recommendation algorithm after introducing K-means clustering and the recommendation algorithm after the improved clustering. And the stable results were selected to analyze .
Environment construction and configuration include Hadoop cluster, Sqoop, development environment and Web server. The cluster environment is built using VMware virtual machines.
The cluster is built using three nodes, one Master and two Slaves.
Host configuration Hardware environment: CPU Intel i5-4590, quad-core, 3.30 GHz, RAM 8 GB;
Software Environment: OS Centos7, java Environment jdk1.8, server tomcat8.0, hadoop 2.7.4;
Development environment: Eclipse, windows10, hadoop.plugin;
Music data used in this experiment are from the network, including more than 50,000 users, more than 1,700,000 user operations, and 730 tags.
Figure 6 shows a screenshot of the recommended results for the Android phone.
The Precision and Recall  are used as evaluation indices.Each user’s song date is sorted in descending order, with the top 80% as the training set, and the remaining 20% as the test set for experimental evaluation.
After repeated experiments that the precision of three algorithms changes with the K-value is shown in line chart as Figure 7:
As shown in Figure 7, the precision after introducing the K-means clustering algorithm is better than the traditional collaborative filtering algorithm, and when the K value is 4, the precision is increased by about 0.65%, which is the best classification. After the clustering algorithm is improved, the precision is nearly 0.15% higher than unimproved when the K value is 5.
Figure 8 is the Recall change chart for three algorithms. The Recall of K-means clustering algorithm is better than that of the traditional collaborative filtering algorithm, and when the K value is 4, the Recall can increase about 0.65%. When the K value is 5, the Recall increases nearly 0.15% after the clustering algorithm is improved; but as the K value increases, the improved clustering is lower than the unimproved clustering algorithm. Because the user number is fixed, after removing the free point, each classification will be affected. Some classification number is very few, then the recommended result is inaccurate, thus the whole effect of the recommendation algorithm is affected.
This paper puts forward the design and implementation of the Hadoop-based music recommendation system, and improves the traditional user-based collaborative filtering recommendation algorithm, and improves the precision of the recommendation algorithm by using the song tag for users clustering. Hadoop, a scalable and high-performance distributed computing platform, provides a reference for the design of a music recommendation system in the background of large data. The recommendation algorithm of K-means clustering and collaborative filtering is designed based on MapReduce distributed framework, which has some reference significance for the distributed design of the recommendation algorithm.