With the rapid development of China’s economy and the growth of tourism consumption, the number of domestic travelers has increased rapidly each year, and more travelers choose privately customized travel routes. Generating reasonable travel routes from users’ actual needs has therefore become a research hotspot in both industry and academia. In practice, however, planning a travel route is a comprehensive and complex task. A reasonable travel route must combine appropriate travel cities, travel time, transportation methods, and itinerary arrangements. At present, the traditional approach is for a customer manager to collect the user’s needs, manually plan a suitable travel route, and then revise it through communication with the customer. The problem is that the customer manager must compare users’ needs, travel prices, travel time, transportation, and scenic spot arrangements across numerous candidate routes. The traditional method is therefore inefficient and time-consuming, places a great burden on staff, and is incompatible with the current development of the industry.

To solve the above problems, this paper uses collected historical travel routes as the data set and designs a travel route recommendation and generation algorithm based on LDA and collaborative filtering. A reasonable list of recommended cities and reasonable playing times are the basis and focus of route planning. Starting from the shortcomings of the traditional travel route planning method, this paper takes city recommendation and time planning as its main focus. Different recommendation algorithms were designed, including one based on Latent Dirichlet Allocation (LDA) and one based on collaborative filtering. By analyzing the performance of these algorithms on the data sets, we improved and optimized them, yielding an LDA algorithm based on KDE (Kernel Density Estimation) and classification and a collaborative filtering algorithm based on KDE and classification. The final experimental results show that the city list and travel time generated by the optimized algorithms are more reasonable and satisfy users’ actual needs.

The tourism industry has become an important part of the national economy during the rapid development of China’s economy in recent years, and the number of travelers has been gradually increasing. According to data from the National Bureau of Statistics, tourism consumption has also increased year by year. The tourism industry has shown an accelerated convergence of online and offline services, and traditional travel agencies have been unable to meet consumers’ needs and the development of modern tourism. As a result, the online development mode of tourism has become a research hotspot in academia and industry. At present, online tourism is mainly carried by online travel websites (such as tuniu.com and tongcheng.com). The traditional travel website was designed in B/S mode [

In recent years, algorithms for travel route generation, LDA, and collaborative filtering have been reported many times. Ma Zhangbao et al. [

Research and application of recommendation algorithms such as collaborative filtering and LDA (Latent Dirichlet allocation) are reported at home and abroad. Yajun L [

Judging from the current state of research at home and abroad, no scholars or enterprises have yet provided a feasible and accurate method that meets actual requirements. Current research focuses on recommending travel routes given user information and historical route data, that is, recommending a route from the historical routes based on the user’s historical information. For a new user’s demand, such methods cannot generate a new route that meets the user’s needs. At the same time, our study of collaborative filtering and the LDA algorithm shows that these algorithms are feasible for this task, and they are applied in this paper. Accordingly, we present below a method for the recommendation and generation of travel routes based on LDA and collaborative filtering.

The planning of a travel route is a complex and comprehensive process that requires consideration of many factors, such as the user’s demand, the price of the route, the arrangement of points of interest, and transportation. The basic theory of route planning and generation involves multiple disciplines, including data mining, statistical machine learning, network search, pattern recognition, and spatial data mining. A well-designed travel route shows visitors as many tourist attractions and landscapes as possible, thereby improving the satisfaction and happiness of tourists and promoting the long-term development of the tourism industry. In recent years, with the rapid development of artificial intelligence technology, route planning algorithms such as the genetic algorithm, particle swarm optimization, simulated annealing, the ant colony algorithm, and the immune algorithm have emerged. The planning and generation of a travel route mainly involves generating recommended cities according to the user’s needs and reasonably planning the playing time in each recommended city.

This paper takes a collected data set of historical travel routes in Japan as its research object and mainly studies the recommendation and generation of the time-space list of travel cities in travel route planning. We propose to use LDA and collaborative filtering to design the travel city recommendation algorithm and to use the KDE algorithm to optimize the playing time in each city, and then generate a time-space list of the cities the user will visit. In the experimental part, the results of the LDA-based topic city model and of the different travel route recommendation algorithms are presented in detail. The relevant city error rate of the LDA-based topic city model is compared under different parameters, and the optimal model parameters are obtained. Finally, the performance of the different recommendation algorithms is evaluated and analyzed.

The travel route recommendation algorithm based on KDE and classification mainly includes the following modules: the data preprocessing and feature extraction module, the playing time estimation module based on KDE, the topic city generation module based on LDA or the recommended city generation module based on collaborative filtering, and the travel route generation module. The data preprocessing and feature extraction module transforms the original data set into a travel route text set through data cleaning, classification, and feature extraction, so that it conforms to the input format of the LDA model, such as the document-content distribution format. The original data set is a historical travel data set of Japan containing about 5,000 travel routes. The playing time estimation module based on KDE uses the KDE algorithm to calculate the user’s total playing time and the playing time of the input cities, improving the accuracy of the playing time and the quality of the recommendation algorithm. The topic city generation module based on LDA is the core module of the LDA variant of the algorithm. In this module, the topic probability distribution of each travel route text and the characteristic city probability distribution under each topic are calculated through the LDA-based travel city topic model, and the probability distribution of characteristic cities is then converted into a list of recommended cities. The recommended city generation module based on collaborative filtering is the core module of the collaborative filtering variant. In this module, the list of recommended cities satisfying the conditions is calculated through the collaborative filtering algorithm. The travel route generation module is the final output module of the algorithm. After processing the output of the previous module, it forms a complete travel route, including the user’s total playing time, the list of travel cities, and the list of playing times for each city.
The system structure of algorithm is shown in Figure

LDA travel route recommendation algorithm based on KDE and classification

Collaborative filtering travel route recommendation algorithm based on KDE and Classification

Data preprocessing is the basis for good training and output results. The data cleaning and preprocessing module mainly performs feature extraction and data classification. The original data set consists of travel route data in JSON format. Each route is an ordered list of city entries. The attributes of each city entry include the city ID (id), the trip name or city name (name), the type, and the travel time (travel_times or transit_time). Data cleaning extracts the useful feature data from the data set and fills in missing data. The extracted data is then classified according to specific rules; here we classify routes by the number of cities they contain. Finally, the preprocessed data is organized into a data set that can be used as input to the LDA model, such as a document-content distribution format. The specific data preprocessing steps are:

Read the JSON data file with Python code, extracting the city name (name), travel time (travel_times), and route number (plan_id) of each route;

Calculate the number of times each city is written, where the number of writes = the total playing time (hours) / 4;

Write the extracted features into different output files in a specific format ([number of lines, route id, city name]);

Classify the output by the number of cities in each route into the classes fewer than 4, 4-5, 6-7, 8-10, and more than 10, and store each class in the corresponding file;

Finish writing the data and save the files.
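The preprocessing steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the field names follow the route attribute table (`name`, `plan_id`, `hours`), while the class labels, the class boundaries for fewer than 4 and more than 10 cities, the rule of writing every city at least once, and the second sample record are assumptions made for the example.

```python
import json
from collections import defaultdict

HOURS_PER_WRITE = 4  # from the text: number of writes = playing hours / 4


def city_class(n_cities):
    """Map a route's city count to a class label (boundaries are assumed)."""
    if n_cities < 4:
        return "lt4"
    if n_cities <= 5:
        return "4-5"
    if n_cities <= 7:
        return "6-7"
    if n_cities <= 10:
        return "8-10"
    return "gt10"


def preprocess(records):
    """Group city records by route and build one list of words per route."""
    routes = defaultdict(list)  # plan_id -> [(city name, total hours), ...]
    for rec in records:
        hours = sum(rec.get("hours") or [])  # missing playing time counts as 0
        routes[rec["plan_id"]].append((rec["name"], hours))

    classes = defaultdict(list)  # class label -> [(plan_id, words), ...]
    for plan_id, cities in routes.items():
        words = []
        for name, hours in cities:
            # Assumption: write each city at least once so short stays survive.
            words.extend([name] * max(1, round(hours / HOURS_PER_WRITE)))
        classes[city_class(len(cities))].append((plan_id, words))
    return classes


# Hypothetical sample input in the shape of the route attribute table.
sample = json.loads("""[
  {"id": "263", "name": "Osaka", "plan_id": "3799", "type": "place",
   "hours": [4.0, 8.0], "daysep": true},
  {"id": "264", "name": "Kyoto", "plan_id": "3799", "type": "place",
   "hours": [8.0], "daysep": true}
]""")
classes = preprocess(sample)
```

Repeating a city name in proportion to its playing time lets a bag-of-words model such as LDA see longer stays as more frequent "words" in the route text.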

ROUTE BASIC ATTRIBUTE TABLE

id | name | plan_id | type | hours | daysep |
---|---|---|---|---|---|
City id | City name | Route id | Route type | Playing time | End-of-day flag |
string | string | string | string | list | bool |
‘263’ | ‘Osaka’ | ‘3799’ | ‘place’ | [4.0,8.0] | true |

In general, the user’s total playing time is calculated by rules formulated from experience, for example, total playing time (hours) = total playing time (days) × daily playing time (8 hours). The playing time of a city that the user wants to visit is then calculated by multiplying the topic distribution probability obtained by LDA by the total playing time. In practical applications, this method proves inflexible and cannot adapt to all user inputs. The resulting time error is relatively large, which leads to poor recommendation quality. Therefore, in this paper we use the KDE (Kernel Density Estimation) algorithm to estimate both the user’s total playing time and the playing time of the cities the user wants to visit, improving the recommendation quality. Assuming that t1, t2, ..., tn are n samples of the total playing time t, the probability density function of the total playing time is estimated as

$$\hat{f}_h(t) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{t - t_i}{h}\right)$$

where h is the bandwidth, n is the number of samples, and K(·) is the kernel function.

The algorithm steps for playing time estimation based on KDE are:

Categorize the historical routes by number of playing days into five categories: 1-3 days, 4-5 days, 6-7 days, 8-10 days, and more than 10 days;

Determine the corresponding route data category from the number of playing days input by the user;

Read the playing time of each route in that category and save the values as a list A;

Use list A as the input data of the kernel density estimation to obtain the kernel density function;

Randomly sample a value from the obtained kernel density function as the user’s total playing time, denoted H;

Repeat the above steps to obtain the list G of playing times of the input cities.

The above algorithm yields the user’s total playing time H and the list G of playing times for the input cities. These two values are used later in the topic city generation module based on LDA.
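A minimal sketch of the playing time estimation, using only the Python standard library, is given below. It assumes a Gaussian kernel and Silverman's rule of thumb for the bandwidth h; the paper does not state which kernel or bandwidth was used, and the function names and the sample list `A` are hypothetical.

```python
import math
import random
import statistics


def day_category(days):
    """Map the number of playing days to one of the five route categories."""
    if days <= 3:
        return "1-3"
    if days <= 5:
        return "4-5"
    if days <= 7:
        return "6-7"
    if days <= 10:
        return "8-10"
    return "10+"


def silverman_bandwidth(samples):
    """Rule-of-thumb bandwidth h for a Gaussian kernel (an assumption)."""
    return 1.06 * statistics.stdev(samples) * len(samples) ** -0.2


def kde_pdf(t, samples, h):
    """Gaussian kernel density estimate: (1 / (n h)) * sum K((t - t_i) / h)."""
    norm = len(samples) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((t - ti) / h) ** 2) for ti in samples) / norm


def kde_sample(samples, h, rng):
    """Draw one value from the Gaussian KDE.

    Sampling from the estimate is equivalent to picking a historical
    sample uniformly at random and adding Gaussian noise of scale h.
    """
    return rng.gauss(rng.choice(samples), h)


# Hypothetical list A: total playing hours of historical 6-7 day routes.
A = [40.0, 48.0, 52.0, 56.0, 60.0]
h = silverman_bandwidth(A)
H = kde_sample(A, h, random.Random(0))  # total playing time for the user
```

The same two functions can be reused per city: collecting the historical playing times of an input city as the sample list and calling `kde_sample` once per city gives the list G.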

The LDA model is a probabilistic topic model for modeling discrete data sets (such as document sets). LDA is essentially an unsupervised machine learning model that maps the high-dimensional word space of texts to a low-dimensional topic space, ignoring category information attached to the texts. By topic modeling of a document set, the LDA model obtains a brief description of each document that retains the essential statistical information and helps to process large-scale document sets efficiently [

Travel city topic model based on LDA

For each topic c, draw a multinomial word distribution vector φ for the topic from the Dirichlet distribution Dir(β);

Draw the number of words N from the Poisson distribution P;

Draw a topic distribution probability vector θ for the text from the Dirichlet distribution Dir(α);

For each word w_n of the N words in the text:

Randomly select a topic z from the multinomial distribution Multinomial(θ);

Select a word as w_n from the multinomial conditional probability distribution Multinomial(φ) of topic z.
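The generative process above can be illustrated with a short standard-library sketch in which a text is a travel route and a word is a city. The vocabulary, the hyperparameter values, and the helper names are hypothetical and chosen only to make the example runnable; symmetric Dirichlet priors are assumed.

```python
import math
import random


def sample_dirichlet(alphas, rng):
    """Draw from Dir(alphas) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def sample_poisson(lam, rng):
    """Draw from Poisson(lam) with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def generate_route_text(vocab, phi, alpha, mean_length, rng):
    """Generate one route text: theta ~ Dir(alpha), then z and w per word."""
    n_words = sample_poisson(mean_length, rng)  # N ~ Poisson
    theta = sample_dirichlet(alpha, rng)        # topic distribution of the text
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(phi)), weights=theta)[0]  # z ~ Multinomial(theta)
        words.append(rng.choices(vocab, weights=phi[z])[0])  # w ~ Multinomial(phi_z)
    return words


rng = random.Random(0)
VOCAB = ["Osaka", "Kyoto", "Nara", "Tokyo", "Yokohama"]  # hypothetical cities
K = 2                                                    # number of topics
# One city distribution phi_k ~ Dir(beta) per topic, with symmetric beta = 0.5.
phi = [sample_dirichlet([0.5] * len(VOCAB), rng) for _ in range(K)]
route_text = generate_route_text(VOCAB, phi, [1.0] * K, 6, rng)
```

Training the model inverts this process: given only the observed route texts, it infers the θ and φ that most plausibly generated them.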

To obtain the probability distribution of characteristic cities, we need a model parameter estimation method to estimate the word probability distribution under each topic and the topic probability distribution of each text. The most commonly used parameter estimation methods are expectation propagation, variational Bayesian inference, and Gibbs sampling [

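Using the Gibbs sampling method, the topic of each word is sampled, and the parameter estimation problem is converted into calculating the conditional probability of the topic sequence given the word sequence. In the standard collapsed Gibbs sampler for LDA, this conditional probability is

$$p(z_i = k \mid \vec{z}_{\neg i}, \vec{w}) \propto \frac{n_{k,\neg i}^{(t)} + \beta_t}{\sum_{t=1}^{V} \left( n_{k,\neg i}^{(t)} + \beta_t \right)} \cdot \left( n_{m,\neg i}^{(k)} + \alpha_k \right)$$

In the above expression, z_i = k means that the i-th word (word t in text m) is assigned to topic k, ¬i means that the i-th word is excluded from the counts, n_k^{(t)} is the number of times word t is assigned to topic k, n_m^{(k)} is the number of words in text m assigned to topic k, β_t is the Dirichlet prior of word t, and α_k is the Dirichlet prior of topic k. Based on the above calculation results and the topic number obtained for each word, the parameters can be calculated by the following equations:

$$\varphi_{k,t} = \frac{n_k^{(t)} + \beta_t}{\sum_{t=1}^{V} \left( n_k^{(t)} + \beta_t \right)}$$

$$\theta_{m,k} = \frac{n_m^{(k)} + \alpha_k}{\sum_{k=1}^{K} \left( n_m^{(k)} + \alpha_k \right)}$$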
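For illustration, a compact collapsed Gibbs sampler for LDA can be written with the standard library alone. This is a sketch of the standard technique under the assumption of symmetric priors α and β, not the authors' code; the function name, the tiny corpus, and the hyperparameter values are hypothetical.

```python
import random


def gibbs_lda(docs, n_topics, alpha, beta, n_iter, seed=0):
    """Collapsed Gibbs sampling for LDA with symmetric priors alpha and beta."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: t for t, w in enumerate(vocab)}
    n_vocab = len(vocab)
    n_kt = [[0] * n_vocab for _ in range(n_topics)]  # topic-word counts
    n_mk = [[0] * n_topics for _ in docs]            # text-topic counts
    n_k = [0] * n_topics                             # words per topic
    z = []                                           # topic of every word

    # Random initial topic assignment.
    for m, doc in enumerate(docs):
        z.append([rng.randrange(n_topics) for _ in doc])
        for w, k in zip(doc, z[m]):
            n_kt[k][index[w]] += 1
            n_mk[m][k] += 1
            n_k[k] += 1

    for _ in range(n_iter):
        for m, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t, k = index[w], z[m][i]
                # Remove the current word from the counts (the "not i" counts).
                n_kt[k][t] -= 1
                n_mk[m][k] -= 1
                n_k[k] -= 1
                # Conditional probability of each topic for this word.
                weights = [
                    (n_kt[j][t] + beta) / (n_k[j] + n_vocab * beta)
                    * (n_mk[m][j] + alpha)
                    for j in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[m][i] = k
                n_kt[k][t] += 1
                n_mk[m][k] += 1
                n_k[k] += 1

    # Point estimates of the word and topic distributions from the counts.
    phi = [[(n_kt[k][t] + beta) / (n_k[k] + n_vocab * beta)
            for t in range(n_vocab)] for k in range(n_topics)]
    theta = [[(n_mk[m][k] + alpha) / (len(doc) + n_topics * alpha)
              for k in range(n_topics)] for m, doc in enumerate(docs)]
    return vocab, phi, theta


# Hypothetical route texts (one list of city "words" per route).
docs = [["Osaka", "Osaka", "Kyoto", "Nara"],
        ["Tokyo", "Tokyo", "Yokohama"],
        ["Osaka", "Kyoto", "Kyoto"]]
vocab, phi, theta = gibbs_lda(docs, n_topics=2, alpha=0.5, beta=0.1, n_iter=50)
```

Sorting each row of `phi` in descending order gives the characteristic cities of a topic, which is the form a recommended city list is derived from.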

INPUT AND OUTPUT OF TRAVEL CITY TOPIC MODEL BASED ON LDA

input: preprocessed and classified travel route text set (one route per line); the number of topics K; hyperparameters α and β |
---|
output: |

Because the data set of this research contains a large number of travel routes, before applying the collaborative filtering recommendation algorithm we regard each travel route as a user and each travel city in a route as an item. In this research the number of items is much larger than the number of users, so we use an item-based collaborative filtering recommendation algorithm. The algorithm of the travel city generation module based on collaborative filtering is divided into the following three steps:

Dic = {‘route-1’: {‘city-1’: playing time-1, ‘city-2’: playing time-2, ..., ‘city-n’: playing time-n}, ‘route-2’: {‘city-1’: playing time-1, ‘city-2’: playing time-2, ..., ‘city-n’: playing time-n}, ..., ‘route-n’: {‘city-1’: playing time-1, ‘city-2’: playing time-2, ..., ‘city-n’: playing time-n}}
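To make the role of this route dictionary concrete, the following sketch shows one common form of item-based collaborative filtering over it. It is an illustration under stated assumptions rather than the authors' procedure: cosine similarity between city vectors is assumed as the similarity measure, candidate cities are scored by summing their similarity to the input cities, and all names and sample data are hypothetical.

```python
import math
from collections import defaultdict


def item_vectors(routes):
    """Invert {route: {city: hours}} into {city: {route: hours}}."""
    vectors = defaultdict(dict)
    for route, cities in routes.items():
        for city, hours in cities.items():
            vectors[city][route] = hours
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[r] * b[r] for r in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend_cities(routes, input_cities, top_n=3):
    """Rank cities by their summed similarity to the user's input cities."""
    vectors = item_vectors(routes)
    scores = defaultdict(float)
    for city in input_cities:
        if city not in vectors:
            continue  # unknown input city contributes nothing
        for other, vec in vectors.items():
            if other not in input_cities:
                scores[other] += cosine(vectors[city], vec)
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(city, score) for city, score in ranked if score > 0][:top_n]


# Hypothetical data in the Dic format.
dic = {
    "route-1": {"Osaka": 8.0, "Kyoto": 8.0},
    "route-2": {"Osaka": 8.0, "Kyoto": 4.0, "Nara": 4.0},
    "route-3": {"Sapporo": 16.0, "Otaru": 8.0},
}
recommended = recommend_cities(dic, ["Osaka"])
```

Cities that never share a route with any input city receive a score of zero and are dropped, which keeps unrelated cities out of the recommended list.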

The travel route generation module is the integration and output module of the entire algorithm. The playing time estimation module based on KDE provides the user’s total playing time H and the playing time list G for the input cities. The topic city generation module based on LDA provides the probability distribution of characteristic cities, that is, the list of recommended cities. We normalize the probabilities of the extracted topic cities, derive the playing time of each recommended city from its normalized probability, and finally form a complete travel route. The main process of the travel route generation module is as follows:

a. rest ← H − sum(G) # Total playing time available for the recommended cities

b. sum_prop ← 0 # Initialize the total probability value to 0

c. recom_list ← get_recom() # Get the list of recommended cities, in the form [[city-1, probability value-1], [city-2, probability value-2], ...]

d. trip_list ← null # Initialize the route list as empty

e. for i ← 0 to size(recom_list) do sum_prop ← sum_prop + recom_list[i][1] # Sum the probability values

f. for i ← 0 to size(recom_list) do recom_list[i][1] ← rest × recom_list[i][1] / sum_prop # Normalize each probability and convert it into a playing time

g. for i ← 0 to size(recom_list) do trip_list[i] ← recom_list[i] # Copy the recommended cities into the route

h. Add the list of cities entered by the user and their playing times to trip_list

i. return trip_list

The travel route generation module outputs a complete travel route in the format [[city-1, playing time-1], [city-2, playing time-2], ..., [city-n, playing time-n]].
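The route generation procedure can be sketched in Python as below. The function name and argument layout are hypothetical, and the sketch assumes that the remaining time `rest` is split among the recommended cities in proportion to their normalized probabilities.

```python
def generate_trip(total_hours, input_cities, recommended):
    """Build a route [[city, playing time], ...].

    total_hours  -- H, the user's total playing time
    input_cities -- [(city, playing time), ...], i.e. the list G with names
    recommended  -- [(city, probability), ...] from the recommendation module
    """
    rest = total_hours - sum(hours for _, hours in input_cities)
    sum_prob = sum(prob for _, prob in recommended)

    trip = []
    if rest > 0 and sum_prob > 0:
        # Normalize each probability and scale it by the remaining time.
        trip = [[city, rest * prob / sum_prob] for city, prob in recommended]
    # Append the user's own cities with their estimated playing times.
    trip.extend([city, hours] for city, hours in input_cities)
    return trip


# Hypothetical inputs: H = 56 hours, two user cities, two recommended cities.
trip = generate_trip(
    56.0,
    [("Osaka", 15.5), ("Nagoya", 16.5)],
    [("Kyoto", 0.3), ("Tokyo", 0.1)],
)
```

With these inputs, 24 hours remain for the recommended cities, so Kyoto receives three quarters of that time and Tokyo one quarter.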

Evaluating the experimental results is an important part of this work. This chapter presents and evaluates the experimental results of the different recommendation algorithms, including the results of topic city generation based on LDA, the results of the LDA travel route recommendation algorithm based on KDE and classification, and the results of the collaborative filtering travel route recommendation algorithm based on KDE and classification. The performance of the different travel route recommendation and generation algorithms and the relevant city error rate of the LDA-based model are compared under different parameters. In the recommendation field, commonly used evaluation indicators include the recall rate and the precision rate [

Precision rate = the number of recommended items that the user likes / the number of items recommended by the system;

Recall rate = the number of the user’s favorite items in the recommended list / the number of all the user’s favorite items in the system.
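As a small worked example of these two definitions, the following sketch computes both rates from a recommended list and a set of liked items. The function name and the sample cities are hypothetical.

```python
def precision_recall(recommended, liked):
    """Return (precision, recall) for one user's recommendation list."""
    recommended, liked = set(recommended), set(liked)
    hits = len(recommended & liked)  # recommended items the user likes
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(liked) if liked else 0.0
    return precision, recall


precision, recall = precision_recall(
    ["Kyoto", "Tokyo", "Nara", "Kobe"],  # recommended by the system
    ["Kyoto", "Nara", "Sapporo"],        # liked by the user
)
```

Here two of the four recommended cities are liked, giving a precision of 2/4, and two of the user's three liked cities were recommended, giving a recall of 2/3.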

Based on the concepts and calculation methods of the precision rate and recall rate, and in line with the research content of this paper, we propose two evaluation indicators, the relevant city error rate and the route correlation rate, which are used to evaluate the output of the LDA-based topic city generation model and the routes generated by the recommendation algorithms, respectively. Put simply, the relevant city error rate is the probability that a tourist city is assigned to a wrong topic (route). It is denoted P(e) and can be calculated by the following formula:

In the above formula, Ci represents the number of tourist cities assigned to the wrong topic in the probability distribution of the i-th topic, that is, cities that do not share any historical route with any other city of that topic. Ai represents the total number of cities in the probability distribution of the i-th topic. Therefore, the lower the relevant city error rate, the higher the quality of the model output and the more readily it is accepted. In practical applications, the relevant city error rate should generally not exceed 0.2.
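The definitions of Ci and Ai can be turned into a short computation. Because the exact aggregation over topics is not recoverable from the text, the sketch below assumes that P(e) is the average of Ci / Ai over all topics; this averaging, the function name, and the sample data are assumptions.

```python
def relevant_city_error_rate(topics, historical_routes):
    """Average over topics of C_i / A_i (the averaging is an assumption).

    topics            -- list of city lists, one per topic
    historical_routes -- list of city lists, one per historical route
    """
    routes = [set(route) for route in historical_routes]
    ratios = []
    for cities in topics:
        if not cities:
            continue  # skip empty topics to avoid division by zero
        wrong = 0  # C_i: cities sharing no route with any other topic city
        for city in cities:
            others = set(cities) - {city}
            related = any(city in route and route & others for route in routes)
            if not related:
                wrong += 1
        ratios.append(wrong / len(cities))  # C_i / A_i
    return sum(ratios) / len(ratios) if ratios else 0.0


# Hypothetical topics and historical routes.
topics = [["Osaka", "Kyoto", "Sapporo"], ["Tokyo", "Yokohama"]]
history = [["Osaka", "Kyoto", "Nara"], ["Tokyo", "Yokohama"], ["Sapporo", "Otaru"]]
error_rate = relevant_city_error_rate(topics, history)
```

In this example Sapporo never appears in a historical route together with Osaka or Kyoto, so the first topic contributes 1/3 and the second contributes 0.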

Analogously to the relevant city error rate above, we define the route correlation rate, denoted R(t):

In the above formula, _{i}_{i} represents the total number of cities in the i-th generated route. A recommended route that contains cities with no relevance to the other cities in the route is often unacceptable to users. Therefore, the higher the route correlation rate, the better the performance of the route recommendation and generation algorithm and the more consistent the result is with the user’s expectations. In practical applications, the route correlation rate should generally be no less than 80%.

The number of topics K of the LDA model, the number of iterations, and the hyperparameters α and β all affect the final probability distribution of topic cities. Therefore, to obtain the optimal topic city probability distribution, we examine the probability distribution of topic cities under different parameters. To keep the experimental conditions uniform, all experimental results below use the 8-10 class of travel route texts as the sample set.

THE EXPERIMENTAL RESULTS OF DIFFERENT VALUES OF HYPERPARAMETER α

5 | 10 | 15 | 20 | 25 | 30 | 35 | 40 | 45 | 50 |
---|---|---|---|---|---|---|---|---|---|
4.72 | 4.16 | 4.02 | 3.38 | 3.21 | 3.16 | 3.82 | 4.12 | 4.68 | 5.16 |
0.282 | 0.254 | 0.236 | 0.192 | 0.166 | 0.171 | 0.179 | 0.216 | 0.249 | 0.288 |

The experimental results of different values of hyperparameter α

We set the initial number of topics K = 50, the number of iterations niter = 500, and the hyperparameter α = 25, and let the hyperparameter β take the values 0.01, 0.05, 0.10, and so on up to 0.50.

THE EXPERIMENTAL RESULTS OF DIFFERENT VALUES OF HYPERPARAMETER β

0.01 | 0.05 | 0.10 | 0.15 | 0.20 | 0.25 | 0.30 | 0.35 | 0.40 | 0.50 |
---|---|---|---|---|---|---|---|---|---|
5.62 | 4.42 | 4.02 | 3.32 | 4.23 | 5.10 | 5.82 | 5.92 | 6.58 | 7.21 |
0.282 | 0.254 | 0.236 | 0.172 | 0.198 | 0.216 | 0.232 | 0.299 | 0.328 | 0.356 |

The experimental results of different values of hyperparameter β

We set the number of iterations niter = 500 and the hyperparameters α = 25 and β = 0.15, and let the number of topics K take the values 4, 6, 8, and so on up to 22.

THE EXPERIMENTAL RESULTS OF DIFFERENT NUMBERS OF TOPICS K

4 | 6 | 8 | 10 | 12 | 14 | 16 | 18 | 20 | 22 |
---|---|---|---|---|---|---|---|---|---|
5.62 | 4.42 | 4.02 | 3.32 | 3.26 | 5.10 | 5.82 | 5.92 | 6.58 | 7.21 |
0.223 | 0.214 | 0.205 | 0.196 | 0.182 | 0.226 | 0.265 | 0.314 | 0.408 | 0.516 |

The experimental results of different number of topic K

We set the number of topics K = 12 and the hyperparameters α = 25 and β = 0.15, and let the number of iterations take the values 300, 400, 500, and so on up to 1200.

THE EXPERIMENTAL RESULTS OF DIFFERENT NUMBERS OF ITERATIONS N

300 | 400 | 500 | 600 | 700 | 800 | 900 | 1000 | 1100 | 1200 |
---|---|---|---|---|---|---|---|---|---|
6.15 | 5.86 | 4.02 | 3.98 | 3.82 | 3.64 | 3.12 | 3.31 | 3.41 | 3.53 |
0.332 | 0.308 | 0.275 | 0.262 | 0.236 | 0.214 | 0.161 | 0.172 | 0.181 | 0.194 |

The experimental results of different number of iterations n

From the above experimental results, we conclude that the optimal parameters of the LDA-based topic city generation model are K = 12, α = 25, β = 0.15, and n = 900 iterations. Under these parameters, the relevant city error rate is 0.161, which is acceptable in practical applications.

To reduce the influence of chance on the experimental results and improve their confidence, the experimental evaluation in this section follows these steps:

(1) Randomly generate 50 groups of input city lists and playing times;

(2) Use the randomly generated input city lists and playing times as input to the recommendation algorithm;

(3) Record the output of the algorithm for the 50 groups of input data, and take the average of the relevant city error rate over the 50 groups, denoted Ei;

(4) Repeat steps (1)-(3) 10 times to obtain the value of Ei for each repetition.
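The evaluation protocol above amounts to a small repeated-sampling harness, sketched below. The `recommend` and `metric` callables stand in for the recommendation algorithm and the evaluation indicator; they, the way random inputs are drawn (one to three cities with playing times chosen from a fixed set), and the city list are hypothetical placeholders.

```python
import random
import statistics


def run_evaluation(recommend, metric, cities, n_groups=50, n_repeats=10, seed=0):
    """Return one averaged metric value E_i per repetition."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_repeats):
        scores = []
        for _ in range(n_groups):
            # Step (1): draw a random input city list with playing times.
            chosen = rng.sample(cities, rng.randint(1, 3))
            inputs = [(city, rng.choice([4.0, 8.0, 12.0, 16.0])) for city in chosen]
            # Steps (2)-(3): run the algorithm and score its output.
            scores.append(metric(recommend(inputs)))
        results.append(statistics.mean(scores))  # E_i for this repetition
    return results


# Placeholder algorithm and metric, used only to exercise the harness.
CITIES = ["Osaka", "Nagoya", "Kyoto", "Tokyo", "Nara"]
averages = run_evaluation(
    recommend=lambda inputs: [city for city, _ in inputs],
    metric=lambda route: 1.0 if route else 0.0,
    cities=CITIES,
)
```

Fixing the random seed makes the 10 repetitions reproducible, which is useful when comparing two algorithms on identical random inputs.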

The route correlation rate of LDA travel route recommendation algorithm based on KDE and classification

To reduce the influence of chance on the experimental results and improve their confidence, the experimental evaluation in this section also follows these steps:

(1) Randomly generate 50 groups of input city lists and playing times;

(2) Use the randomly generated input city lists and playing times as input to the recommendation algorithm;

(3) Record the output of the algorithm for the 50 groups of input data, and take the average of the relevant city error rate over the 50 groups, denoted Ei;

(4) Repeat steps (1)-(3) 10 times to obtain the value of Ei for each repetition.

The route correlation rate of collaborative filtering travel route recommendation algorithm based on KDE and classification

THE OUTPUT RESULTS OF DIFFERENT ALGORITHMS

total days of travel | 7 |
---|---|
cities that the user wants to visit | Osaka, Nagoya |
Unimproved LDA recommendation algorithm | [Naoshima: 2.5, Yamanashi: 1.8, Osaka: 56.4, Nagoya: 29.8] |
Unimproved collaborative filtering recommendation algorithm | [Yakushima: 12.5, Naoshima: 8.6, Osaka: 42.8, Nagoya: 26.2] |
LDA travel route recommendation algorithm based on KDE and classification | [Kyoto: 42.4, Nakafurano-cho: 3.9, Osaka: 15.5, Nagoya: 16.5] |
Collaborative filtering travel route recommendation algorithm based on KDE and classification | [Kyoto: 24.2, Tokyo: 20.3, Osaka: 15.5, Nagoya: 16.5] |

To evaluate the model results and the recommendation algorithms, this chapter first proposed two new evaluation indicators, the relevant city error rate and the route correlation rate, based on the principles of the recall rate and precision rate. Then, the influence of the number of topics K, the number of iterations, and the hyperparameters α and β on the LDA topic city generation model was examined. After many experiments, the optimal model parameters were determined to be K = 12, α = 25, β = 0.15, and niter = 900. Finally, the performance of the different recommendation algorithms was evaluated. The experimental results show that the route correlation rate of the collaborative filtering travel route recommendation algorithm based on KDE and classification is about 5% higher than that of the LDA travel route recommendation algorithm based on KDE and classification. In actual applications, different recommendation algorithms can be selected according to users’ actual demands. The final experimental results show that optimizing the proposed algorithms with the classification method and the KDE algorithm has a clear effect. The LDA and collaborative filtering algorithms optimized by the classification method improve the route correlation rate, bringing it above 90%. The KDE algorithm is used to optimize playing time, which makes the playing time of cities more reasonable. These results indicate that the method of this paper has considerable reference value.

This paper proposed a travel route recommendation and generation algorithm based on LDA and collaborative filtering. The core of the algorithm is the LDA topic model and collaborative filtering. On this basis, an LDA travel route recommendation algorithm and a collaborative filtering travel route recommendation algorithm, both based on KDE and classification, were proposed. Although the optimized algorithms achieve good performance, much work remains to be done, including:

The recommendation algorithm based on the LDA topic model has a certain degree of randomness in generating the recommended city list, so some cities in the resulting travel routes are unrelated to the historical routes, although this stays within the acceptable error rate. The output of the recommendation algorithm based on collaborative filtering is relatively fixed and does not generate new feasible routes. Although the collaborative filtering algorithm does not suffer from randomness, a certain error rate still occurs because the cities in a user’s input list may be unrelated to one another. Therefore, a method that combines the LDA topic model and the collaborative filtering algorithm could be studied to improve the performance of the recommendation algorithm.

So far, the hyperparameters of the LDA model, such as the number of topics K, α, and β, are mainly adjusted manually by empirical rules, which results in a huge amount of experimental work. In future work, we can consider methods that incorporate reinforcement learning and self-play and propose a method that learns the optimal parameters. This has also been a research hotspot in the field of machine learning in recent years.

The evaluation method for travel routes needs further study. Because the evaluation of a travel route is partly subjective, actual assessment is difficult. At present, only quantifiable indicators can be extracted to evaluate part of the reasonableness of a travel route, so the evaluation indicators may not be comprehensive. In future work, we can study and propose a comprehensive and reasonable evaluation method for travel routes.

This paper is partially supported by Qinghai University Student Science and Technology Innovation Fund Project (No. 2017-QX-12);