Keywords: Online travel reviews; Semantic association analysis; Opinion mining; Social network analysis

Online tourism reviews provide a crucial source of information for the tourism industry, and determining whether they can be effectively identified is key to influencing tourism decision-making. The purpose of this paper is to identify themes and compare differences in online travel reviews. A semantic association analysis was applied to extract thematic words and construct a semantic association network from 165,429 reviews obtained from three major online travel agencies (OTAs) in China. The findings show that there are apparent discrepancies on these platforms in terms of thematic words, the distribution of topics, structural properties and community relationships. In particular, the results of network visualization can clearly identify hot topics and the social network relationships of thematic words. The proposed analytical framework expands our understanding of the methodological challenges and offers novel insights for mining the opinions for the benefit of tourists, hotels and tourism enterprises and OTAs.
Received 11 November 2018; Received in revised form 31 January 2019; Accepted 18 March 2019

1. Introduction

With the rapid development of Web 2.0 technologies, online user-generated content (UGC), such as online travel reviews, has been widely used in the tourism and hospitality industry. Tourists tend to share their travel experiences through online travel agencies (OTAs) such as TripAdvisor (Guo, Barnes, & Jia, 2017; Liu, Schuckert, & Law, 2018), Expedia (Xiang, Schwartz, Gerdes, & Uysal, 2015), Yelp (Papathanassis & Knolle, 2011), Lvmama (Lian & Yu, 2017), Ctrip (Ye, Law, & Gu, 2009), and Qunar (Zhang et al., 2016a, b). Such descriptions of tourists' experiences, which are actively shared by users, are widely regarded as a typical type of online travel review. Online travel reviews include hotel reviews, restaurant reviews, and attraction reviews (Xiang et al., 2015), which are the most popular sources of information for tourists in obtaining travel information and making travel plans as well as booking tickets and hotels (Li & Yi, 2014; Zhang et al., 2016a, b).

Unlike ordinary consumer products, the adoption of travel involves more than a simple purchase decision, and tourism consumption is based on public willingness. Consumers are easily influenced by other people's opinions, and their willingness to search for the opinions and experiences of peer consumers before purchasing a product has been found to be relevant (García-Pablos, Cuadros, & Linaza, 2016; Guo et al., 2017; Hu, Chen, & Chou, 2017; Schuckert, Liu, & Law, 2015a). Industry data indicate that approximately 77% of prospective travellers will “always” or “usually” not make decisions until they have read online reviews (Ye et al., 2009). Travellers can minimize travel costs and obtain indirect purchasing experiences by browsing information nodes (e.g., government tourism websites, online travel portals), thereby reducing their perceived uncertainty and achieving an enjoyable psychological experience (Lian & Yu, 2017; Xiang et al., 2015; Ye, Law, Gu, & Chen, 2011).

Online reviews are noisy. Tourists post deceptive reviews when they want to achieve some purpose (Min, Lim, & Magnini, 2015); for example, visitors might sometimes post a fake positive review to avoid unnecessary trouble or to receive kickbacks (Schuckert, Liu, & Law, 2015b), resulting in consumers' inability to quickly gain access to useful information (Liu & Park, 2015). Therefore, the question of whether online travel reviews can be identified and adopted by latent consumers is the key to influencing tourism decisions. Online reviews not only provide convenience to consumers in the search for information but also increase consumers' cognitive costs. Consumers can become confused and lost when faced with a massive quantity of online reviews, resulting in poorer decision-making and intangible pressure (Bellman, Johnson, Lohse, & Mandel, 2010;
Schuckert et al., 2015a). The quality of online review information affects consumers’ attitudes towards adopting information sources and making tourism decisions (Filieri & Mcleay, 2014; Schuckert et al., 2015a). The length, social characteristics, readability, accuracy of information, perceived value, and language style of reviews are the main factors that affect the quality of information acquisition (Filieri & Mcleay, 2014; Hodac, Carson, & Moore, 2013; Li, Xu, Tang, Wang, & Li, 2018; Papathanassis & Knolle, 2011; Racherla & Friske, 2012; Schuckert et al., 2015b). Papathanassis and Knolle (2011) argued that when the readability of online reviews is too low, consumers will show lower reservation intention, which affects the competitiveness of tourism enterprises (e.g., in terms of reputation and revenue). Therefore, quickly obtaining high-quality information from a large number of online reviews has become a critical issue in both practice and academia.

Prior studies on online travel reviews have been conducted using questionnaire surveys (Min et al., 2015), mathematical models (Liu & Park, 2015), statistical analysis (Racherla & Friske, 2012; Schuckert et al., 2015b), and grounded theory (Filieri, 2016; Papathanassis & Knolle, 2011). Online reviews are text-based and often comprise large information repositories that go beyond the analytical capabilities of traditional econometric and statistical methods (Guo et al., 2017). It is difficult to mine statistically meaningful differences among different types of groups within the review data (Boo & Busser, 2018; Xiang, Du, Ma, & Fan, 2017). Text analysis technology based on natural language processing (NLP) can automatically address large amounts of customer-generated reviews and comments from the perspective of word granularity, and this method is widely used in topic identification and opinion mining (García-Pablos et al., 2016). By utilizing the rich concept hierarchy structure and semantic knowledge provided by ontology, semantic association analysis can quickly extract meaningful topic words from a large number of texts and achieve business intelligence association analysis at a semantic level (Boo & Busser, 2018; Xiang et al., 2017). The approach is more reliable than traditional text analysis technology (e.g., latent Dirichlet allocation, fuzzy domain ontology, and support vector machine), and the analysis results have more reference value and applicability (Filieri, 2016). From a semantic point of view, the association of two documents in online travel reviews is often decided not only by literal repetition but also by the semantic logic behind the words. The combination of semantic ideas and association analysis methods can eliminate sampling bias and refine the observation granularity of topic identification to the word level in the research on online travel reviews. As a result, it can significantly improve the efficiency and quality of online travel review analysis by accurately mining the centre of different topic groups and the degree of association of keywords, and it can achieve value added from information to knowledge to intelligence (He, 2013; Li et al., 2018).

With this in mind, we attempt to mine the opinions within online travel reviews on the three platforms by using semantic association analysis, providing practical help for understanding tourists' behaviour and leading to improvements in tourism industry marketing. Specifically, the main aim of this study is to answer the following questions:

(1) What topics are discussed in online travel reviews?
(2) How can bigrams of co-occurrence phrases of semantic association be constructed?
(3) What are the structural properties of semantic association networks?
(4) How can we identify the focus of a discussion by visualizing semantic association networks?
(5) What are the differences among different OTAs?

The remainder of this paper is organized as follows. Section 2 provides background for the research, reviewing the status of research on online travel reviews, opinion mining in online travel, and semantic association analysis and social network analysis (SNA). Section 3 provides a methodological description of our study, accompanied by an introduction to how the data were collected and analysed. Section 4 analyses the results of the current study, including statistical analysis of thematic words, the construction of bigrams of the co-occurrence of semantic association, analysis of the structural properties of semantic association networks, and visualizations. Section 5 presents discussion, conclusions, limitations and future directions.

2. Background

2.1. Online travel reviews

Online reviews are posted by consumers who have purchased and used a product, and they include consumers' experiences, evaluations, and opinions (Litvin, Goldsmith, & Pan, 2008). Online reviews are a typical type of eWOM (electronic word-of-mouth) and have important value for consumers, enterprises and sellers (Litvin et al., 2008). For consumers, online reviews are not only a channel for obtaining product information but also the basis for making tourism decisions. The opinions expressed by commentators on products or services affect other consumers’ purchase intention (Wang & Wang, 2018). For enterprises and sellers, the features of online reviews, such as comprehensiveness (Zhao, Liang, Xiao, & Law, 2015), professionalism (Ladhari & Michaud, 2015), quality (Wenchin, Mingtsang, Liwen, & Lin, 2015), and reputation (Casaló, Flavián, Guinalíu, & Ekinci, 2015), have a significant impact on product sales (Hodac et al., 2013).

Online travel reviews are rich, complex bundles of information that reflect travellers' experiences and evaluations of products (Duverger, 2013; Hemmatian & Sohrabi, 2017; Litvin et al., 2008; Xiang et al., 2015), which are important sources of information for tourists to facilitate the making of travel arrangements (Bucur, 2015). Many studies have found significant effects of such online travel reviews on the tourism industry, such as consumer travel decision behaviour, tourism product sales, and tourism destination image. For instance, Ye et al. (2011), Xiang et al. (2015), and Lian and Yu (2017) explored the relationship between online travel reviews and consumer purchase intention and found that consumers can reduce the cost of information search and indirectly obtain travel experience, which may promote their willingness to purchase. Li and Yi (2014) showed that online travel reviews are closely related to the financial status of tourism products, and hotels with a higher number of positive reviews obtain more reservations. Online travel reviews, which reflect the reputation and satisfaction of tourism destinations, are an important part of a tourism destination network image and have a direct impact on tourists' perceived quality, satisfaction, and behavioural intention (Lian & Yu, 2017; Yuan & Wu, 2016). Somabhai, Varma, and Somabhai (2015) argued that consumers can obtain an intuitive impression of tourism destinations by reading other travellers’ online travel reviews, helping them to reduce risk uncertainty and effectively make travel plans. Although online travel reviews provide convenience for consumers' travel decisions, massive reviews have also exacerbated confusion with regard to information overload in the big data era (Fang, Ye, Kucukusta, & Law, 2016; Ren & Hong, 2017; Zhang, 2014, pp. 1–46). Extracting key points from online textual reviews is complex and challenging, but it is crucial for predicting, interpreting and responding to customer behaviour (Wang, Yang, Sun, & Jiang, 2017).

2.2. Opinion mining of online travel reviews

Opinion mining is a technology that automatically extracts online comment information by using textual analysis, including computer language and natural language processing. It analyses people's opinions, appraisals, attitudes, and emotions towards organizations, entities, persons, issues, actions, topics and their attributes (García-Pablos et al., 2016; Hemmatian & Sohrabi, 2017). The main task of opinion
mining can be divided into six categories: sentiment analysis, opinion extraction, sentiment mining, subjectivity analysis, affect or emotion analysis, and review mining (Kim & Park, 2017; Rahimi & Kozak, 2016; Sirgy, 2010; Xu & Mcgehee, 2016). All techniques used for opinion mining can be categorized into two main classes: lexicon-based approaches (Brandes, 2001; Daud, Khan, & Che, 2017; Telesford, Joyce, Hayasaka, Burdette, & Laurienti, 2011) and machine learning (Leclerc & Martin, 2004; Weiler, 2002; Weiler & Walker, 2014). The lexicon-based method classifies text sentiment polarity by relying on a sentiment dictionary and linguistic knowledge, and it includes a corpus-based approach and a dictionary-based approach. The machine learning approach, which benefits from machine learning algorithms, can be divided into three groups: supervised learning, semi-supervised learning and unsupervised learning (García-Pablos et al., 2016; Hemmatian & Sohrabi, 2017). The approach extracts the text information from product reviews by feature construction technology (such as the bag-of-words) and uses a classification method to analyse online reviews (Raisi, Baggio, Barratt-Pugh, & Willson, 2017). Within the online travel review context, the proposed opinion mining method has significant advantages of acceptable accuracy and resource savings in dealing with unstructured review texts (Bucur, 2015). Guo et al. (2017) analysed 266,544 online reviews extracted from 25,670 hotels located in 16 countries to identify key dimensions of customer service based on latent Dirichlet allocation (LDA) data mining methods and uncovered 19 controllable dimensions that are key for hotels to manage their interactions with visitors. Ali, Kwak, and Kim (2016) proposed a fuzzy domain ontology (FDO) and support vector machine (SVM) opinion mining system to automatically classify online reviews.

2.3. Semantic association analysis and SNA

Semantic association analysis, which is an important text analysis method (Alemanmeza, Halaschekwiener, Sahoo, Sheth, & Arpinar, 2005; Xiang, Tian, & Huang, 2007), was first mentioned in the study of the brain's response to expected words (Kutas & Hillyard, 1984). The core idea of semantic association analysis is that semantics are defined by the co-occurrence of two words in a sentence, with high-frequency words as nodes and the frequency of high-frequency phrase co-occurrence as the links between nodes (Alemanmeza et al., 2005; García-Pablos et al., 2016; Schuckert et al., 2015a; Xiang et al., 2007). Common methods include natural language processing technology, topic models, ontology, etc. Bigram co-occurrence can prevent information loss and distortion in linguistic evaluation information aggregation and computation and make the calculation results more accurate (Herrera & Martinez, 2000). In particular, semantic association analysis is based on external knowledge and a semantic lexicon to construct a model of feature words, which enables better text classification than that found in prior studies. This approach has been widely developed in research pertaining to business intelligence, medical image association, and human behaviour. For example, Yin and Peng (2010) designed a method that builds semantic associations between product features and sentiment words to identify the sentiment expressed regarding each product feature from product reviews in Chinese. Zhang et al. (2016a, b) presented a CCA-PairLDA feature representation method for similarity computation between medical images with high-level semantics. The image similarity can be calculated based on local feature sets, word frequency histograms, latent topic distributions, and semantic association coefficients. Kuhlmann, Hofmann, and Jacobs (2017) proposed that emotion valence can be regarded as a semantic super-feature in human forced-choice evaluations, and semantic association networks can be constructed to judge the polarity of words.

The semantic web formed by semantic association is a type of representation of social networks. Therefore, the opinion mining of online travel reviews from the perspective of social network theory is one focus of this study. Social network analysis emphasizes that each individual has ties to other individuals (Wasserman & Faust, 1994), and such ties are a means of identifying potential relationships in the data (Asiedu, 2014). Density, modularity, and network diameter are often regarded as the main analysis indicators. Raisi et al. (2017) conducted research on a hyperlink network for the Australian tourism industry using indicators such as the density, modularity, network diameter, and number of groups. They found that the hyperlink network of a destination is extremely sparse. This finding has critical implications for improving the effectiveness of information flow between tourism organizations and enterprises on the Internet. Wehbe, Hattab, and Hamzeh (2016) used modularity and density to indicate that a company's cultural security requires more connectivity and frequent quality communication. These indicators also have a strong explanatory role in mining the potential information of tourism texts (Casanueva, Gallego, & Garcíasánchez, 2014; Chen, Liang, Hong, & Gu, 2015). Using the social network analysis method, Ji, Li, and Chen (2016) revealed that the spatial structure of self-service travel in Yunnan province is characterized by “closely contacting with each other locally, although the overall link is loose”. However, there are still some gaps in the research on identifying the potential needs of consumers based on word granularity. Therefore, this study conducts a semantic association analysis of online travel reviews based on the perspective of social network theory to explore the potential needs of tourists and discover the connections among these needs.

3. Method

3.1. Research design

To identify tourists’ potential demand from online travel reviews and improve customer satisfaction, we propose a semantic association analysis approach for practical guidelines in fields related to opinion mining, as shown in Fig. 1, which summarizes the process used in this study. In this paper, web crawlers are first used to extract online travel reviews from three major OTA websites in China. Then, the data are pre-processed to build a review dataset using natural language processing (NLP) tools, such as Jieba and NLTK, in the Python programming language, including data cleaning and tokenization. Finally, statistical analysis of thematic words, semantic association analysis and visualization are performed on the data.

Fig. 1. Framework for semantic association analysis based on online travel reviews.

3.2. Data collection

In previous studies, most articles only retrieved online travel review data from a single platform (Banerjee & Chua, 2016; Cheng, Fu, Sun, Bilgihan, & Okumus, 2019; Fang et al., 2016; Guo et al., 2017; Kim, Park, Yun, & Yun, 2017; Liu et al., 2018; Schuckert et al., 2015b; Ye, Li, & Wang, 2014; Zhang et al., 2016a, b) or even from only a single destination (Lui, Bartosiak, Piccoli, & Sadhya, 2018). To obtain the most representative data in this study, we chose three major OTA platforms (Ctrip, Tuniu and Tongcheng) as the data source and extracted online travel reviews using the Python programming language. Ctrip (http://www.ctrip.com/), Tuniu (http://www.tuniu.com) and Tongcheng (http://www.ly.com) are the top three Chinese OTAs and are leading new business-to-customer (B2C) tourism e-commerce websites. As of the end of December 2017, their market shares were 43.6%, 22.7% and 11.1%, respectively. Ctrip, which was founded in 1999, is one of the largest integrated online travel service companies. By the end of May 2018, it had more than 300 million registered users. Tuniu provides more than one million tourism products for consumers, covering self-help, self-driving, cruise, hotels, visas, tickets for scenic spots, company tours, etc. Tongcheng is the most professional online service platform for leisure tourism in China. According to a report from iResearch.cn, in May 2018, the active monthly users of Ctrip, Tuniu and Tongcheng were 68.55 million, 8.3 million and 22.2 million, respectively (iResearch, 2018). The platforms of Ctrip (Lu & Liu, 2016; Qi, Li, Zhu, &
Shi, 2017), Tuniu (Lian & Yu, 2017; Zhang & Zhou, 2018) and Tongcheng (Li & Yi, 2014; Wang & Wang, 2018) have been used many times as samples in prior studies; therefore, the selection of these data platforms for online travel reviews is reasonable.

The data collection took place in June 2018. We collected review data on the top ten tourist cities in China released by the 2018 Global Destination Marketing Summit and World Culture and Tourism Forum (ShaanxiDaily, 2018), including Shanghai, Beijing, Xi'an, Chengdu, Hangzhou, Sanya, Hong Kong, Guangzhou, Xiamen and Nanjing. Web crawlers (Xiang et al., 2017) designed in the Python programming language were used to mimic a user's access to the three OTA websites for all the tourist products of the ten cities, and the user ID, product name, review text and review time of the search results were downloaded, as shown in Fig. 2. To maintain the timeliness of the data, we only crawled online review information from January 2015 to June 2018.

3.3. Data pre-processing

All reviews collected from the three OTA platforms were pre-processed by four operations: data cleaning, tokenization, stop word removal and the translation of Chinese words to English. Data cleaning was used to detect and remove inaccurate or useless records from online textual data, such as misspellings and non-target language (Guo et al., 2017; Xiang et al., 2017), leaving only valuable tourism-related information. In this study, we first deleted reviews with a textual length of less than 15 words. The length of the textual portion of the review positively affects “helpfulness” perceptions (Racherla & Friske, 2012). Cai, Xu, and Wu (2014) also found that information value was poor when the length of Chinese language-based online reviews was less than 15 words. Second, because multiple repeated reviews posted by a single user may lead to statistical bias, we kept only one record from each user's duplicate records. Finally, we deleted the reviews that were advertisements to ensure the authenticity and accuracy of the data samples. Table 1 presents the results after the data cleaning of the reviews in Ctrip, Tuniu and Tongcheng. We collected approximately 47,000 reviews from Ctrip, 51,000 reviews from Tuniu, and 67,000 reviews from Tongcheng. Although Tongcheng had the largest number of reviews, Tuniu had the highest number of Chinese words (approx. 4,084,000) and the highest average length of reviews (80.32 words).

The purpose of tokenization is to divide travel review texts into keywords, phrases or other meaningful elements, such as tourist spots and travel feelings (Guo et al., 2017; Xiang et al., 2017). For this study, we applied the existing open source tool LTP (see http://ir.hit.edu.cn/ltp/) provided by the Research Centre for Social Computing and Information Retrieval of Harbin Institute of Technology (Che, Li, & Liu, 2010) and Python 3.6.4 (see https://www.python.org) to implement tokenization and stop word removal for all effective reviews. The stop word list, which consisted of 1893 Chinese words, came from Harbin Institute of Technology and has been widely applied in opinion mining and analytics.

Because the reviews were posted in Chinese, we adopted the following translation method to ensure the accuracy of the translation. First, we divided the translators into two groups, with two persons in each group. The first group was led by the first author of this article, and the second group was led by an English teacher. All participants possessed an excellent level of English and were able to use English at a native level. After forming the groups, we selected the top 2000 words in the thematic word tables for each platform for translation. The translation work of each group was performed independently and simultaneously, and no discussion was held until the formal translation was completed. If uncertain thematic words were found in the translation, the participants marked them as “uncertain” thematic words and discussed them in the next step. After completing the first translation, we compared the translation results of the two groups one by one to avoid translation bias, especially with regard to the “uncertain” thematic words.
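The three cleaning rules in Section 3.3 (a 15-word minimum, one record per user for repeated reviews, and removal of advertisements) can be illustrated with a minimal sketch. It is not the study's code: the record layout (`user_id`, `text`), the function name `clean_reviews`, the keyword-based advertisement check, and the whitespace tokenizer in the example are all assumptions made for illustration. For Chinese text, a word segmenter would be passed as `tokenize` instead.

```python
from typing import Callable, Iterable

MIN_WORDS = 15  # minimum review length stated in Section 3.3


def clean_reviews(
    reviews: Iterable[dict],
    tokenize: Callable[[str], list],
    ad_markers: tuple = (),
) -> list:
    """Apply the three cleaning rules to raw review records.

    Each record is assumed to be a dict with "user_id" and "text" keys.
    """
    seen = set()  # (user_id, text) pairs that have already been kept
    kept = []
    for review in reviews:
        text = review["text"].strip()
        # Rule 1: drop reviews shorter than 15 words.
        if len(tokenize(text)) < MIN_WORDS:
            continue
        # Rule 2: keep only one copy of a user's repeated review.
        key = (review["user_id"], text)
        if key in seen:
            continue
        # Rule 3: drop advertisements (hypothetical keyword heuristic).
        if any(marker in text for marker in ad_markers):
            continue
        seen.add(key)
        kept.append(review)
    return kept


# Toy example with a whitespace tokenizer standing in for a Chinese segmenter.
raw = [
    {"user_id": "u1", "text": " ".join(["good"] * 20)},
    {"user_id": "u1", "text": " ".join(["good"] * 20)},  # duplicate by same user
    {"user_id": "u2", "text": "too short"},
    {"user_id": "u3", "text": " ".join(["buy"] * 20) + " promo-code"},
]
cleaned = clean_reviews(raw, tokenize=str.split, ad_markers=("promo-code",))
```

Passing the tokenizer as a parameter keeps the length rule independent of the segmentation tool, so the same filter works whichever segmenter is used.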
Table 1
The results after data cleaning.

Review platform   No. of reviews   No. of Chinese words   Avg. review length (words)
Ctrip             47,446           3,428,609              72.26
Tuniu             50,844           4,083,948              80.32
Tongcheng         67,139           3,484,343              51.90

3.4. Data analysis

3.4.1. Analysis of thematic words

A thematic word is a word with a definite meaning and the characteristics of conciseness and timeliness as well as a large amount of information. An analysis of thematic words aims to remove meaningless words that would otherwise affect the results for the keywords extracted from the online review text through data pre-processing. It also aims to calculate the frequency of keywords in the text (Yuan & Wu, 2016). The Jieba toolkit was used for the tokenization and stop word removal of all the review data. Jieba is a Chinese word segmentation module developed by Chinese programmers in Python. It is mostly used in text mining and in Chinese word segmentation for search engines. Additionally, the NLTK was used to calculate the frequency of the thematic words. The calculation formula is shown in Equation (1):

$$F_i = \frac{R_i \times L_i}{L_t} \qquad (1)$$

where $F_i$ is the frequency of a thematic word $i$ in each platform, $R_i$ is the occurrence number of $i$ in the review text, $L_i$ is the length of $i$, and $L_t$ is the length of all words in each platform. Therefore, in this study, the range of $t$ is from 1 to 3, representing the three OTA platforms, respectively.

整体不错。住宿和餐饮挺好的,行程安排也好。就是导游服务能再提高就更好了。

The translation is as follows:

Overall good. The accommodations and dining are fine, and the schedule is also good. It would be even better if the tour guide service could be further improved.

After pre-processing of the review, the text was as follows:

务/能/再/提高/就/更好/了/。

The results of the semantic association bigram co-occurrence of thematic words are shown in Table 2.

Then, we accumulated the frequency of the bigram co-occurrence phrases to generate co-occurrence phrase lists for the three platforms for social network analysis. Gephi was used in this study for network structure feature analysis and visualization. It is a popular open-source software package for graph and network analysis developed by the research institutions of SciencesPo and Linkfluence in France, and it is widely applied in the fields of social network analysis, biology, and genomics (Bastian, Heymann, & Jacomy, 2009; Jacomy, Venturini, Heymann, & Bastian, 2014).

4. Analysis of the results

4.1. Statistical analysis of thematic words

If only simple word frequency analysis is performed on thematic words, the meaning of the context of the words itself will not be explained. For example, Boo and Busser (2018) argued that hotel topics include thematic words such as guestroom (e.g., room amenities, view,

Table 2
A sample of semantic association bigram co-occurrence.

Thematic word   Thematic word   Frequency
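As a rough illustration of how bigram co-occurrence frequencies might be accumulated into an undirected phrase list, the sketch below counts adjacent pairs of thematic words and merges A-B with B-A into a single entry. It is a plain-Python sketch under stated assumptions, not the NLTK-based pipeline used in the study: the function name `bigram_cooccurrence`, the rule of pairing words that are adjacent once non-thematic tokens have been dropped, and the sample tokens are all hypothetical.

```python
from collections import Counter


def bigram_cooccurrence(tokenized_reviews, thematic_words):
    """Count undirected co-occurrences of adjacent thematic words."""
    counts = Counter()
    for tokens in tokenized_reviews:
        # Keep only thematic words, preserving their order in the review.
        words = [t for t in tokens if t in thematic_words]
        for first, second in zip(words, words[1:]):
            if first == second:
                continue  # a word paired with itself is not a link
            # Sorting the pair merges A-B and B-A into one undirected key.
            counts[tuple(sorted((first, second)))] += 1
    return counts


# Toy example with already tokenized (and translated) reviews.
reviews = [
    ["guide", "explanation", "hotel"],
    ["explanation", "guide"],
    ["hotel", "room", "clean"],
]
thematic = {"guide", "explanation", "hotel", "room", "clean"}
counts = bigram_cooccurrence(reviews, thematic)
top_phrases = counts.most_common(2)  # highest-weight phrases first
```

Each key of `counts` corresponds to one row of a co-occurrence phrase list (two thematic words and a frequency), which is the shape of edge list that a network tool can import.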
Table 3
Topic identification of thematic words.

Topic 1: Tour guide    Topic 2: Hotel         Topic 3: Service       Topic 4: Scenic spot   Topic 5: Experience
Guide          1.22%   Hotel          0.65%   Enthusiasm     0.16%   Ticket         0.43%   Satisfaction   0.43%
Itinerary      0.63%   Accommodation  0.19%   Enjoy          0.14%   Spot           0.35%   Happy          0.35%
Scheduling     0.55%   Overall        0.17%   Driver         0.14%   Location       0.22%   Feeling        0.27%
Explanation    0.36%   Environment    0.12%   Recommendation 0.13%   Queue          0.21%   Kids           0.21%
Shopping       0.14%   Room           0.11%   Gratitude      0.12%   Entertainment  0.17%   Time           0.16%
Always         0.12%   Breakfast      0.09%   Thoughtful     0.09%   Attraction     0.15%   Play           0.14%
Responsible    0.10%   Big            0.09%   Consultation   0.09%   View           0.13%   Travel         0.14%
Patience       0.10%   Clean          0.07%   Free           0.09%   Weather        0.11%   Suggest        0.12%
Humour         0.08%   Comfortable    0.06%   ID card        0.13%   Fun            0.11%   Pleasant       0.11%
Considerate    0.07%   Aircraft       0.06%   Care for       0.08%   Scenery        0.11%   Whole          0.10%
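Reading Equation (1) as the occurrence count of a word multiplied by its length and divided by the total length of all words on a platform, a thematic-word frequency can be sketched as follows. Several points are assumptions rather than statements from the paper: that word length is measured in characters, that the percentages in Table 3 were computed in this way, and the function name `thematic_word_frequency` and the toy tokens, which are purely illustrative.

```python
from collections import Counter


def thematic_word_frequency(tokens):
    """Return F_i = R_i * L_i / L_t for every distinct word of one platform."""
    occurrences = Counter(tokens)               # R_i: occurrences of word i
    total_length = sum(len(t) for t in tokens)  # L_t: length of all words (characters)
    return {
        word: count * len(word) / total_length  # len(word) plays the role of L_i
        for word, count in occurrences.items()
    }


# Toy token list for a single platform (after stop word removal).
tokens = ["导游", "住宿", "导游", "服务"]
freq = thematic_word_frequency(tokens)
```

Under this reading the frequencies of all distinct words on a platform sum to 1, so each value can be read as the share of the platform's text that a word accounts for.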
level (such as responsibility, professionalism and humour) was an important factor affecting tourist satisfaction. Compared to Ctrip (15.7%) and Tuniu (17.6%), Topic 2 (i.e., hotel) was much more prominent in Tongcheng (35.6%), which may be related to its strategy of leisure travel. The leisure travellers were solo groups, and they were more sensitive to the environment around the hotel (Kim & Park, 2017; Mccain, Jang, & Hu, 2005; Radojevic, Stanisic, & Stanic, 2015). For Topic 4 (i.e., scenic spot), Ctrip (11.6%) was lower compared to Tuniu (14.0%) and Tongcheng (19.2%).

4.2. Constructing bigram co-occurrence of semantic associations

To explore the relationship between thematic words, we constructed a bigram co-occurrence of semantic association using bigram tools in the NLTK of the Chinese corpus (Bird, Klein, & Loper, 2009). Because the bigram phrases generated in this paper were undirected data, we summed the frequencies of bigram phrases that contained the same two thematic words in different positions, such as A-B and B-A. Consequently, we obtained more than 200 thousand bigram co-occurrence phrases from each platform. According to Equation (2) formulated in the above statistical analysis of thematic words, the value of I1 is over 150 thousand, and the value of T is more than 550; therefore, the number of phrases that can be used is not more than 100. Therefore, to better reveal the relationship between thematic words, we chose the top 2000 bigram co-occurrence phrases for semantic association analysis. The value of co-occurrence frequency on the three platforms is approximately 10, covering most high-frequency words.

Table 5 presents the top 10 bigram co-occurrence phrases of each OTA. The weight represents the accumulated value of the co-occurrence of two thematic words. As shown in Table 5, the top 10 bigram co-occurrence phrases on Ctrip and Tuniu are very similar, and itinerary–scheduling, guide–explanation, guide–satisfaction, guide–scheduling are most frequently mentioned by users, meaning that the travel itinerary and the service quality of the guide were extremely important factors for the users of Ctrip and Tuniu. The overall service quality of tour guides or leaders is very important to the satisfaction of tourists and influences consumers’ decisions in the selection of all-inclusive tours (Caber & Albayrak, 2016; Heung, 2008; Mossberg, 1995). However, the top 10 bigram co-occurrence phrases shown on Tongcheng are obviously different compared to Ctrip and Tuniu, and the hotel-related topics are the focus of attention, such as room–clean, hotel–location, hotel–room, hotel–environment. To some extent, this result further verifies the distribution of the five topics mentioned above.

4.3. Structural properties of the semantic association network

To analyse the network structural properties of each platform, we imported the top 2000 bigram co-occurrence phrases into Gephi in a CSV file format to construct an undirected network. The results of the social network analysis show that the three networks were not fully connected, and there were a certain number of isolated nodes with no links to others. Isolated nodes were excluded from our analysis (Raisi et al., 2017), including the visualizations in the next section, unless otherwise stated. The numbers of isolated nodes in the Ctrip, Tuniu, and Tongcheng networks were 18, 46, and 298, respectively.

Table 6 shows the results of the measurement in this paper, including nodes, density, average degree, etc. The three networks are obviously divided into two categories: Ctrip and Tuniu are quite similar, and Tongcheng is quite different. There are 1,462 nodes in the Tongcheng network, which is the largest of the three platforms; Ctrip and Tuniu have almost the same number of nodes. The density of the three networks is less than 0.02, which indicates that the networks are extremely sparse. In particular, the network density of Tongcheng is only 0.002, which means that out of 1,000 possible links, only 2 of them actually exist in the network. Compared to those of Ctrip and Tuniu, the average degree of Tongcheng is the smallest at 2.488. The three platforms, especially Tongcheng, are low-density networks characterized by more scattered topics, which may reduce the impact of reviews on potential tourists. After excluding the isolated nodes, all the bigram co-occurrence phrases of the three platforms were connected together in one component. The network diameter refers to the maximum distance between any two nodes in the network, representing the closeness of connection among the nodes. The diameter of Tongcheng is 13, which is far greater than the values for Ctrip and Tuniu, meaning that information is passed from one node to another 13 times at most in the Tongcheng network.

Modularity is one of the most common methods for community detection and can help to illuminate the intermediate structure of the network (Raisi et al., 2017). The value of the modularity index ranges from 0 to 1. It is generally acknowledged that when the index of
Table 5
Bigram co-occurrence phrases on the three platforms.
Ctrip Tuniu Tongcheng
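The counting step behind Table 5 and the basic indices reported in Table 6 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes that reviews are already segmented into word lists, that a bigram is a pair of adjacent words, and that a pair is kept only when both words are thematic words. The function names and the toy reviews are hypothetical, and plain `zip` stands in for the NLTK bigram tools cited in the text. Density uses the standard undirected definition 2E / (N(N - 1)), which is also an assumption about how the reported values were computed.

```python
from collections import Counter


def undirected_bigram_counts(reviews, thematic_words):
    """Count adjacent word pairs, treating A-B and B-A as the same phrase."""
    counts = Counter()
    for tokens in reviews:
        for left, right in zip(tokens, tokens[1:]):
            # Assumption: keep a pair only if both words are thematic words.
            if left in thematic_words and right in thematic_words and left != right:
                # Sorting the pair merges A-B and B-A into one undirected key.
                counts[tuple(sorted((left, right)))] += 1
    return counts


def network_summary(edges):
    """Basic indices for an undirected graph given as a list of word pairs."""
    nodes = {word for pair in edges for word in pair}
    n, m = len(nodes), len(edges)
    return {
        "nodes": n,
        "edges": m,
        "avg_degree": 2 * m / n if n else 0.0,
        "density": 2 * m / (n * (n - 1)) if n > 1 else 0.0,
    }


# Hypothetical, already-segmented toy reviews (not data from the paper).
reviews = [
    ["guide", "explanation", "patient"],
    ["explanation", "guide", "satisfaction"],
    ["hotel", "breakfast", "guide", "satisfaction"],
]
thematic = {"guide", "explanation", "satisfaction", "hotel", "breakfast"}

counts = undirected_bigram_counts(reviews, thematic)
top = counts.most_common(2)            # analogue of keeping the top-k phrases
summary = network_summary(list(counts))
```

Applied to the Ctrip figures in Table 6 (428 nodes, 1,991 edges), the average-degree formula gives 2 × 1991 / 428 ≈ 9.304, which agrees with the reported value.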
Table 6
Network statistics for bigram co-occurrence phrases.
Index                          Ctrip    Tuniu    Tongcheng
Nodes                          428      452      1462
Edges                          1991     1976     1819
Density                        0.017    0.015    0.002
Avg. degree                    9.304    8.743    2.488
No. of connected components    1        1        1
Diameter                       7        7        13
Modularity                     0.289    0.329    0.514
No. of communities             18       13       53
Avg. path length               2.688    2.674    4.233
Avg. clustering coefficient    0.647    0.659    0.113

modularity is greater than 0.44 (Liu & Du, 2017), the independence of the network community will be relatively high. Using the algorithm provided by Blondel, Guillaume, Lambiotte, and Lefebvre (2008), we obtained the modularity and number of communities of the three platforms. As the results in Table 6 indicate, the modularity indexes of Ctrip and Tuniu are relatively low. It is worth noting that Tongcheng has a modularity index of 0.518 and a community number of 53. These results indicate that the topics within each community are relatively concentrated; however, the relationship among the communities is relatively loose.

In addition, the three platforms show meaningful phenomena in the properties of small-world networks, which are characterized by high clustering and small characteristic path lengths (Watts & Strogatz, 1998). Telesford et al. (2011) proposed that a small-world network can be identified by its clustering coefficient and average path length. Using the algorithms of Brandes (2001) and Latapy (2008), we obtain the average path lengths of Ctrip, Tuniu and Tongcheng as 2.688, 2.674, and 4.233, with average clustering coefficients of 0.647, 0.659, and 0.113, respectively. According to Raisi et al. (2017), Ctrip and Tuniu show discernible but not excessive small-world network properties. However, Tongcheng, which is characterized by low clustering and large characteristic path lengths, does not have small-world network properties, resulting in a lower efficiency of information dissemination than the previous two networks.

4.4. Semantic association network visualization

Visualization is the transformation of text data into a network graph that shows the links among thematic words in the form of nodes and lines (Zhao, Gao, Guo, & Tao, 2009). To make the semantic relationship of thematic words more intuitive, Gephi is used for the network visualization in this paper. We use the ForceAtlas2 layout algorithm to draw a network graph. Compared to other layout algorithms, ForceAtlas2 has a better measured quality (Jacomy et al., 2014). The balanced state graphs of Ctrip, Tuniu and Tongcheng are shown in Fig. 4, Fig. 5 and Fig. 6, respectively. The nodes represent the corresponding thematic words. Larger nodes indicate greater concern with the thematic words. The links represent the association between thematic words, and the thickness of the connection lines represents the strength of the relationship.

As shown in Fig. 4, the core nodes in the Ctrip network can be divided into two major communities: guides and hotels. The hotel community is relatively small in scale, and the main association phrases include hotel–breakfast, hotel–environment, hotel–room, hotel–comfortable, hotel–clean, hotel–location, and hotel–shuttle, which represent the main factors that tourists consider when choosing a hotel. The internal relationship within the guide community is quite close and includes six core nodes: guide, hotel, satisfaction, itinerary, scheduling, and explanation. The internal links of the guide community are close, including six core nodes: tour guide, satisfaction, travel, schedule, arrangement and explanation, among which scheduling–itinerary, guide–explanation, guide–satisfaction, guide–itinerary, guide–scheduling, and guide–happy have strong associations.

The guide community can be further divided into five sub-communities: guide, itinerary, satisfaction, explanation, and happiness. The guide-centred sub-community is mainly concerned with the personal characteristics of the guide, who is described by thematic words such as “expectation,” “handsome,” “hard,” and “excellent.” The guide's organization is the focus of the itinerary-centred sub-community, including thematic words such as “scheduling,” “spot,” “play,” “route,” and “tight.” The third sub-community, which is centred on satisfaction, represents the effect of guide services, such as “enthusiasm,” “overall,” “gratitude,” and “considerate.” The sub-community that is centred on explanation refers to the quality of the evaluation of guides at scenic spots and comprises core thematic words such as “knowledge,” “humour,” “patience,” “fun,” “professional,” and “history.” The happiness-centred sub-community reflects travel experiences with words such as “pleasant,” “feeling,” “thoughtful,” “family,” “parents,” “kids,” and “friend.”

Apart from slight differences in link strength, the manifestations of Figs. 4 and 5 are almost the same, indicating that users of Ctrip and Tuniu have no essential differences and that they have common concerns. However, Fig. 6, which shows the visualization of Tongcheng at a preview ratio of 80%, looks entirely different from the graphs of Ctrip and Tuniu. As shown in Fig. 6, the hotel is the most important node. Contrary to Figs. 4 and 5, the hotel-centred community occupies more than half of the whole network, with high-intensity association phrases: hotel–room, hotel–location, hotel–reception, hotel–satisfaction, hotel–environment, hotel–breakfast, room–clean, etc. However, despite the larger scale of the hotel community, there is no clear distinction between its sub-communities, which can be roughly divided into hotels and rooms. The distribution of other nodes is very scattered and includes words such as time, queue, entertainment, ticket, performance, kids, happy, animal, etc., and the relationship between each word is quite loose, which indicates that the topics of online travel reviews are not clear and that consumers have no common concerns. The emergence of this phenomenon may be related to the platform strategy and user characteristics of Tongcheng, which make it difficult
Z. Hou, et al. Tourism Management 74 (2019) 276–289
for consumers to obtain valuable information from online travel reviews.

5. Discussion and conclusion

5.1. Main findings

First, this study found that there are differences in the core concerns of users on the three platforms studied. The users of Ctrip and Tuniu focus on guides and experience, while Tongcheng users are more concerned about hotels and experience. The results of the classification statistical analysis of thematic words show that the users of Ctrip and Tuniu devote 37.8% and 31.3% of their attention to guides, respectively, far exceeding Tongcheng's 8.3%. This finding shows that the service level of tour guides has a great influence on the satisfaction of tourists on the two platforms, which supports the results of recent research (Tsai et al., 2015). However, Tongcheng users pay more attention to hotel-related information such as the environment and facilities (Guo et al., 2017; Xiang et al., 2017), with attention reaching 35.6%. We argue that the thematic tendencies discussed by users reflect the core competitiveness and marketing strategy of each platform. For example, Ctrip and Tuniu employ mature tourism product development and design systems, which can provide personalized travel itineraries to tourists. In personalized tourism, the service of tour guides is key to ensuring tourists' satisfaction. Leisure tourism is the main business of Tongcheng, and leisure travellers are more sensitive to hotels and their environment (Kim & Park, 2017). It is not surprising that such users are mainly concerned about hotel information.

Second, we obtained meaningful discoveries on network structure properties. On the one hand, we found that the modularity indexes of Ctrip and Tuniu are very low, 0.289 and 0.329, respectively, indicating that the network community is scattered. It is noteworthy that the modularity index value of Tongcheng exceeds the threshold of 0.44 but the number of modular groups is large, indicating that although the topics within the Tongcheng group are relatively centralized, communication among the groups is very loose. This loose layout can lead to low connectivity among networks and a low degree of association among groups. This may be due to low user acceptance of a tourism product, which affects the quality of the reviews they post (Chatterjee, 2001). On the other hand, we find that not all of the platforms have the characteristics of small-world networks. In our study, Tongcheng has a larger average path length (4.233) and a smaller clustering coefficient (0.113); thus, it does not have small-world network characteristics (Telesford et al., 2011). In a network, a short average path length allows information to be transmitted quickly and reduces costs (Zhang & Guo, 2014). Obviously, in this study, the efficiency of information transmission in Tongcheng is lower than in Ctrip and Tuniu. This inefficient transmission of information reduces Tongcheng users' perception of tourism products, which may reduce their willingness to make purchase decisions.

Third, our research can accurately identify the hot topics of online travel reviews and the social network relationships formed by hot topics. Our study found that the Ctrip and Tuniu platforms are composed of two communities: guides and hotels. There are several sub-communities in the guide community, such as the sub-community of satisfaction, the sub-community of explanation, and the sub-community of travel itinerary. Among these, the connection strength between the guide community and the explanation sub-community is greater, indicating that users usually consider the information of the guide and the guide's explanation at the same time. Therefore, this study suggests that improving the service level of tour guides based on the guide's explanation ability and itinerary is the key to increasing the satisfaction of consumers on Ctrip and Tuniu. On the Tongcheng platform, the hotel community is the largest. It is worth noting that the hotel community occupies more than half of the whole network; however, the strength of its associated sub-communities, such as the room sub-community, the location sub-community, and the reception sub-community, is not very high. Other communities in the network have a lower degree of
association with the hotel community. This low-association network indicates that Tongcheng's reviews are not sufficiently focused, which leads to the deviation of information obtained by potential consumers. Accordingly, there is important practical value in identifying ways to guide users to post reviews with high information quality.

5.2. Implications for research

First, this study provides a unique research perspective on online travel reviews. We move past the limitations of prior studies and introduce social network theory into our research. This study holds that online travel reviews reflect the social interaction between reviewers and tourists, which represents a type of social relationship. Social networks, which are understood as social relationships, refer to all formal and informal social relationships within a group of specific people, including the indirect social relationships linked by the physical environment, cultural sharing and direct social relationships (Mitchell, 2010). Therefore, the thematic words of online travel reviews can be regarded as nodes, and the semantic associations among them can be regarded as connections. Together, nodes and connections construct the social networks of online travel reviews. This approach not only offers a new research perspective on online travel reviews but also expands the scope of the application of social network theory.

Second, this study proposes a novel analytical framework to extract important topics from a large number of online travel reviews. This analytical framework integrates various analysis methods, including web crawling, semantic association analysis, social network analysis, and visual analysis. Compared to the previous research methods of online travel reviews, such as questionnaires, LDA, sentiment analysis, and statistical analysis (Guo et al., 2017; Hu et al., 2017; Min et al., 2015; Ren & Hong, 2017; Schuckert et al., 2015b), we can obtain more reliable conclusions. On the one hand, tourists' needs can be identified based on word granularity by using semantic association analysis, which provides strong support for mining potential customer value in online travel reviews. Objective data can be obtained through the use of web crawlers to eliminate the influence of data deviation in subjective surveys. On the other hand, social network analysis provides a quantitative representation of the relationship between subject words in online travel reviews, reveals the structural characteristics of the relationship, and builds a bridge between macro- and micro-level research on online travel reviews. In particular, visualization turns boring textual information into interesting pictures to help us deepen our understanding and quickly identify important information in online travel reviews.

Third, the results of this study contribute to the theoretical development of online travel review-related research by using semantic association analysis. The method of semantic association analysis can reveal the hidden logical relationship behind the text, promote the understanding of the content of online travel reviews, and more effectively identify valuable topics. For example, the results of the bigram co-occurrence phrases of semantic association show that itinerary–scheduling, guide–explanation, guide–satisfaction, and guide–scheduling are the core topics on Ctrip and Tuniu, indicating that tour guide service quality is an important factor of tourist satisfaction, which is consistent with previous studies (Caber & Albayrak, 2016; Heung, 2008; Mossberg, 1995). In the analysis of the structural properties of the semantic association network, we reached a conclusion that differs from that of Raisi et al. (2017). For example, the phenomenon of a small-world network on the Internet is not universal, and the semantic association network of online travel reviews constructed for Tongcheng does not have small-world network properties. Generally, semantic association analysis is not only an effective tool for the opinion mining of online travel reviews but also provides a new perspective for value discovery.

Finally, this study accurately reveals the intricate network relationship through visualization. We extracted thematic words from the texts of online travel reviews and then constructed bigram co-occurrence phrases using semantic association and formed a visual network graph using Gephi software. The graph depicts the intricate relationship among the core topics from the word granularity perspective, effectively resolving the accuracy shortcomings of previous studies (Boo & Busser, 2018; Xiang et al., 2017). For example, as seen from the visualization graph of Ctrip, the satisfaction sub-community includes many thematic words, such as attitude, consideration, consultation, and enthusiasm, which indicate the origin of tourist satisfaction. Furthermore, there is a complex relationship between the satisfaction sub-community and other sub-communities, such as the guide sub-community and the hotel sub-community. Thicker connection lines indicate greater association strength.

5.3. Implications for practice

The results of this study have important practical implications for consumer travel decision-making, for the improvement of hotel and tourism enterprises’ service quality, and for the strategic development of online tourism platforms.

For consumers, online travel reviews are an important source of information for obtaining travel services (Blomberg-Nygard & Anderson, 2016). Using our method, consumers can improve the accuracy and relevance of the information they obtain from these reviews, thereby enhancing their willingness to make purchase decisions. Online travel reviews allow quick access to content and reduce the risk of making travel decisions (Lian & Yu, 2017; Xiang et al., 2015; Ye et al., 2011). For example, before choosing a specific tourism product, according to this research method, consumers can obtain tourism-related information such as travel schedules, tourists' satisfaction with tourism products or destinations, hotels and surrounding environments. Additionally, by analysing the differences among the core businesses of different tourism platforms, consumers can purchase tourism products or services from the appropriate OTA depending on their needs.

The findings of this study can help hotels and tourism enterprises quickly identify the hot topics of online tourism reviews and find correlations among topics. The potential information extracted from online reviews is important for improving the service management and competitive advantage of hotels and tourism firms (Fang et al., 2016; Lui et al., 2018; Miguéis & Nóvoa, 2017; Yang, Shin, Joun, & Koo, 2016). Using this approach, enterprises can better and more accurately understand the needs of users and improve their quality of service. Helpful topic information will attract consumers, thus enhancing their willingness to purchase. Therefore, managers should focus on topics that are precise or easy to understand because these topics are more influential than fuzzy reviews. For example, “guide” in Ctrip and Tuniu is the core thematic word, and thematic words with high relevance include itinerary, scheduling, explanation, satisfaction, enthusiasm, responsible, patience, etc. Thus, tourism enterprise managers should improve the service quality of tour guides and increase professional training based on aspects such as tour guides’ explanatory ability, itinerary scheduling, and service attitude (Alani, Khan, & Manuel, 2017; Weiler & Walker, 2014). In particular, there is a high degree of correlation between itinerary and scheduling. Therefore, the results of this paper also provide a reference for managers to conduct differentiation strategies (Lui et al., 2018). In addition, opinion mining is a process of knowledge discovery with which managers can design brand advertisements and develop themes that meet consumer demand using our research method, which will provide a better experience to tourists.

For online travel platforms, there are three aspects of practical significance. They can be summarized as follows.

First, through the mining of online tourism reviews, the platform's value positioning is clearly defined from the perspective of consumers, which provides a reference for the platform to develop strategic planning and operational strategies. As our research findings reveal, not all review websites have the same quality of service and focus. For example,
