Data and Information Management

Open Access

An investigation on the evolution of diabetes data in social Q&A logs

Yiming Zhao / Baitong Chen / Jin Zhang / Ying Ding / Jin Mao / Lihong Zhou
Published Online: 2018-04-25 | DOI: https://doi.org/10.2478/dim-2018-0002


This study investigates the evolution of diabetics’ concerns based on the analysis of terms in the Diabetes category logs on the Yahoo! Answers website. Two sets of question-and-answer (Q&A) log data were collected: one from December 2, 2005 to December 1, 2006; the other from April 1, 2013 to March 31, 2014. Network analysis and a t-test were performed to analyze the differences in diabetics’ concerns between these two data sets. Community detection and topic evolution were used to reveal detailed changes in diabetics’ concerns in the examined period. Increases in average node degree and graph density imply that the vocabulary diabetics use to post questions has shrunk while the scope of their questions has become more focused. The networks of key terms in the Q&A log data of 2005–2006 and 2013–2014 are significantly different according to the t-test analysis of the degree centrality and betweenness centrality. Specifically, there is a shift in diabetics’ focus in that they have become more concerned about daily life and other nonmedical issues, including diet, food, and nutrients. The recent changes and the evolution paths of diabetics’ concerns were visualized using an alluvial diagram. The food- and diet-related terms have become prominent, as deduced from the visualization results.

Keywords: Diabetes; diabetic; health users; user needs; social Q&A log; visualization analysis

1 Introduction

Diabetes is a common disease that can result in many complications, such as heart disease, stroke, obesity, high blood pressure, blindness, kidney disease, and nervous-system-related diseases (Zhang & Zhao, 2013). An estimated 366 million people are living with diabetes worldwide, and it is predicted that this number will increase to 552 million by 2030 (Barnard, Peyrot, & Holt, 2012). According to a statistics report by the American Diabetes Association (2017), 29.1 million people in the United States (9.3% of the US population) had diabetes in 2012, at a total cost of US$245 billion.

Medical research on diabetes has thrived for many years; however, few studies have examined either the terms that diabetics use in social media or the concerns reflected by those terms, leaving a substantial gap between diabetics and medical professionals (Zhang, Wolfram, Wang, Hong, & Gillis, 2008; Milewski & Chen, 2010; Zhang & Zhao, 2013). In most cases, practitioners provide medical services without fully understanding diabetics’ concerns, which may adversely affect clinical effectiveness and the patients’ experience (Sacristán, 2013). Therefore, it is very important to examine the terms that diabetics choose and use in social media, which reflect their real concerns.

In previous studies, we examined diabetes-related terms on the Yahoo! Answers social question-and-answer (Q&A) website and coded those terms into 12 categories. Furthermore, we undertook a visualization-based analysis to ascertain term-use patterns and the relationships among diabetes-related terms from a consumer perspective (Zhang & Zhao, 2013; Zhang, Zhao, & Dimitroff, 2014).

The primary aims of the current study are to ascertain the characteristics of diabetes-related terms in social Q&A logs, reveal diabetics’ concerns and the evolution of those concerns over time, and, finally, provide insights into diabetics’ concerns for perusal by medical professionals, practitioners on social Q&A websites, and social Q&A users.

The findings of this study can be used to (a) help medical professionals to understand diabetics’ information needs better and to discover what kind of information diabetics really care about and (b) guide the optimization and improvement of social Q&A services for health users.

2 Literature review

2.1 Analysis of diabetes-related terms on social media websites

Diabetics generate large volumes of information-rich text on social media websites because diabetes can rarely be cured and diabetics have to rely on online information for long-term self-care and self-management. Furthermore, diabetics, as well as a large number of nondiabetics who have a family history of diabetes, have diabetic relatives, are overweight, or have similar signs, often seek information and request help online, especially on social Q&A websites (Nordfeldt, Johansson, Carlsson, & Hammersjo, 2005).

Diabetes-related terms that people generated in various social media websites were investigated in previous studies. For instance, texts and terms from the 15 largest Facebook groups focused on diabetes management were examined to analyze recent discussion topics, which suggested that patients with diabetes, family members, and their friends use Facebook to share personal clinical information, to request disease-specific guidance and feedback, as well as to receive emotional support (Greene, Choudhry, Kilabuk, & Shrank, 2011). The sources of raw data for similar investigations vary, from blogs and Facebook, to mobile diabetes applications.

Currently, there are generally two research approaches to study diabetes-related terms on social network websites.

One is to generate a list of terms based on the analysis of diabetics’ discourses posted on social network websites. For instance, Akay, Dragomir, and Erlandsson (2013) analyzed social network websites numerically to gauge diabetics’ experiences in using medical devices and drugs. To obtain a better understanding of patients’ opinions, they used self-organizing maps to generate a compiled word list that correlates certain positive and negative word cluster groups related to medical drugs and devices.

The other approach is to use visualization methods to cluster terms in the visual space. For instance, Zhang and Zhao (2013) used the multidimensional scaling method to cluster diabetes-related terms in visual spaces to ascertain the characteristics of and relationships among terms related to diabetes.

2.2 Term co-occurrence network analysis

In the field of information science, analyses of terms concerning a certain field are widely conducted to review hot topics and their change over time, as well as to support the prediction of future trends in development and innovation (Ronda-Pupo & Guerras-Martin, 2012; Schiebel, Hörlesberger, Roche, François, & Besagni, 2010; Ding & Stirling, 2016; Lansdall-Welfare et al., 2017; Zhang, Xu, & Zhao, 2016). Studying the structure and evolution of a term co-occurrence network is particularly important for identifying hot topics and concern trends (Moerchen, Fradkin, Dejori, & Wachmann, 2008).

In an earlier study, a term co-occurrence method was developed to examine development trends in the fields of science and technology, politics, and economy (Callon, Courtial, Turner, & Bauin, 1983). The term networks and communities in the network could be constructed based on the co-occurrence relationships between terms over a specific time period.

The evolution of diabetics’ concerns will be reflected by the changes in the networks’ metrics and by the visualization of communities within the term co-occurrence network in adjacent time periods. First, it has been suggested, with regard to change of term network metrics, that the metrics of a network, such as network diameter, degree centrality, betweenness centrality, and closeness centrality, are suitable indicators for revealing hot spots and research trends (Lee, 2008). Second, the visualization of communities of a term co-occurrence network represented by terms, especially high-frequency terms, and their relationships in adjacent time periods could help in the intuitive understanding of the current status and growing trends of a topic within a range of time (Li, Shao, & Zheng, 2012).

2.3 Social Q&A websites

Social Q&A websites, also known as online Q&A or Q&A community websites, form a community whose members can post questions, answer other members’ questions, and rate or rank other members’ answers to the questions (Kim & Oh, 2009; Zhao, Detlor, & Connelly, 2016; Liu, Lin, Zheng, & Chen, 2016; Liu & Jansen, 2017). A social Q&A log is an archive of such a website, which consists of words or terms from users. At present, most academic research studies on social Q&A websites focus attention on question categorization, classification and quality assessment of answers, user satisfaction, reward structures, and motivation for participation (Gazan, 2011; Choi & Shah, 2015). Only a few researchers have utilized this superior corpus to study diabetes-related terms and term-usage patterns (Zhang & Zhao, 2013).

Social Q&A sites are effective channels for acquiring authentic content that is particularly useful in analyzing the behavior of a specific user group. Open access to a social Q&A site allows health/medical information seekers to reach those who have undergone similar symptoms/diseases and receive quick answers for immediate problem-solving (Kim, Oh, & Oh, 2008). Furthermore, a Q&A record consists of multiple sentences and paragraphs that provide users with enough space to present details. Essentially, user concerns, as well as their information needs, are closely correlated with the content they generate (Kaplan & Haenlein, 2010; Kim et al., 2012).

2.4 Summary

In summary, studies on diabetes data in the context of social media are numerous and diverse. The sources of raw data for similar investigations vary, from blogs and Facebook, to mobile diabetes applications. These research studies have different focuses, which include recent discussion topics on diabetes management, information-seeking behavior, and evaluation of content quality. However, the existing research rarely focuses on diabetics’ concerns and the evolution of these concerns based on free texts in social Q&A logs. We investigate diabetics’ concerns and their evolution by analyzing the terms that diabetics use in social Q&A logs.

3 Methods

The study was conducted at two levels. On the first level, log data under the Diabetes category on Yahoo! Answers over two periods – from December 2, 2005 to December 1, 2006, and from April 1, 2013 to March 31, 2014 – were collected. The evolution of diabetics’ concerns was analyzed through comparison of the metrics of the term co-occurrence networks in those two time periods. On the second level, communities in the term co-occurrence network of 2013–2014 and their evolution paths were visualized to reveal recent changes in diabetics’ concerns.

3.1 Social Q&A log analysis

The social Q&A log used in this study was from Yahoo! Answers. Yahoo! Answers is a website covering 26 main categories, in which people ask each other questions on a topic of interest, obtain answers, as well as share facts, opinions, and personal experiences. The Health category contains 10 subcategories, including Alternative Medicine, Dental, Diet & Fitness, Diseases & Conditions, General Health Care, Men’s Health, Mental Health, Optical, Other Health, and Women’s Health. Diabetes is a section under the Diseases & Conditions subcategory in the main category of Health.

Based on existing studies, Yahoo! Answers is regarded as one of the best social Q&A websites, and it has a large base of users (Harper, Raban, Rafaeli, & Konstan, 2008). It is also worth noting that Yahoo! Answers is the second most popular Internet reference site in the world, and the most visited community Q&A site in the United States; 16.64% of Yahoo users use Yahoo! Answers (Alexa, 2017).

As a specific user group, diabetics on social Q&A sites express their concerns sufficiently in terms of text quantity, but their concerns are not static and need to be mined and analyzed dynamically. Logs on Yahoo! Answers are good data sources for this study because Yahoo! Answers encourages users to ask about anything they care about and to voice their concerns without restriction. In contrast, some professional online communities for patients, such as PatientsLikeMe and WebMD.com, guide users to ask questions about symptoms, medicine, and treatments. For instance, PatientsLikeMe presupposes three cue words (conditions, symptoms, and treatments) in the search box. Comparatively, data from Yahoo! Answers, which is a general social Q&A website, can help us grasp users’ concerns more widely and get insights from a large base of users, including diabetics, prediabetics, and nondiabetics.

3.2 Data collection and preprocessing

Since the question and the best answer together provide the crucial context of the social Q&A log, a question and its corresponding best answer form one data record in our study. The reason to reserve a best answer in a record is that the quality of answers varies and the best answer represents the highest-quality part of all the answers (Kim & Oh, 2009). The best answer is selected in two ways: the answers that the questioners chose as the best answers to their questions and the answers that Yahoo automatically assigned based on the most votes by the Yahoo! Answers community. Clearly, the best answers are more credible and suitable for this research study than other unevaluated answers.

For data collection, an automatic crawler based on the Yahoo! Application Programming Interface and Yahoo Query Language was developed to harvest resolved questions and the corresponding best answers under the Diabetes topic.

Several rules were introduced into the crawler in advance. All the collected questions were raised in the United States, to narrow the questions down to those in English. All questions brought into the raw data set were marked as “resolved” on the website since each “resolved” question had a best answer.

After the data were harvested from the Yahoo Q&A log, data-cleansing processes followed. A “stop words” list from LEMUR was used to filter out words that carry little content (Strohman, Metzler, Turtle, & Croft, 2005). The list contains prepositions, conjunctions, auxiliaries, articles, numerals, interjections, and other function words. The Porter stemming algorithm was used to remove common morphological and inflectional endings from words and to bring variant forms of a word together (Porter, 1980).
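The cleansing steps above can be sketched roughly as follows. This is a minimal illustration, not the study's pipeline: the `STOP_WORDS` set is a hypothetical miniature list standing in for the LEMUR list, and the `stem` parameter defaults to the identity function where a Porter stemmer would be plugged in.

```python
import re

# Hypothetical miniature stop-word list; the study used the LEMUR list.
STOP_WORDS = {"a", "an", "the", "is", "of", "to", "and", "in", "for", "i", "my"}


def preprocess(text, stop_words=STOP_WORDS, stem=lambda word: word):
    """Lowercase, tokenize, drop stop words, then apply a stemmer.

    `stem` is the identity by default (an assumption for illustration);
    a Porter stemmer implementation would be passed in here.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(token) for token in tokens if token not in stop_words]


tokens = preprocess("Is the blood sugar of a diabetic high in the morning?")
```

Keeping the stemmer as a parameter lets the same function be reused with any stemming implementation.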

The log data in Yahoo! Answers from April 1, 2013 to March 31, 2014 were collected, comprising 8,570 Q&A records, wherein every record contains a question and its corresponding best answer. The total number of words in the log data was 1,486,696, and the average number of words per record was 173.5.

The log data in Yahoo! Answers from December 2, 2005 to December 1, 2006 contained 5,222 Q&A records. The total number of words in the log data was 872,029, and the average number of words per record was 167.4.

3.3 Network analysis

To reveal the overall change trend of diabetics’ concerns, the data in the Q&A log from December 2, 2005 to December 1, 2006 were compared with the log data from April 1, 2013 to March 31, 2014. Note that the first question under the Diabetes topic in Yahoo! Answers was raised on December 2, 2005; the long time span between the two data sets therefore reflects the availability of data.

3.4 Construction of term co-occurrence network

Relationships between terms in the Q&A log of each time period were represented by term co-occurrence frequencies (Callon, Courtial, Turner, & Bauin, 1983). Term co-occurrence networks or their communities can be constructed based on the co-occurrence relationships between terms. In the term co-occurrence network, the node is the term and the edge is the co-occurrence relationship between two terms. If two terms co-occur in one Q&A record, an edge is established in the network. The term co-occurrence frequency is used to reveal the term association strength of the network edge.
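A minimal sketch of this construction, assuming each Q&A record has already been reduced to a list of preprocessed terms. The function name `build_cooccurrence` and the toy records are illustrative, not taken from the study.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(records):
    """Return {(term_a, term_b): number of records containing both terms}."""
    weights = Counter()
    for terms in records:
        # Each unordered pair is counted once per record, however often
        # the terms repeat inside that record.
        for a, b in combinations(sorted(set(terms)), 2):
            weights[(a, b)] += 1
    return weights


# Toy records (hypothetical): one list of terms per Q&A record.
records = [
    ["blood", "sugar", "diet"],
    ["blood", "sugar"],
    ["diet", "food"],
]
edges = build_cooccurrence(records)
```

Sorting the terms before pairing gives each undirected edge a single canonical key, so `("blood", "sugar")` and `("sugar", "blood")` are never counted separately.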

In this study, network metrics including average node degree, graph density, network diameter, and average distance of the term co-occurrence networks, which indicate their development tendency, were adopted to analyze changes in user concerns (Albert & Barabási, 2002). Moreover, the degree centrality and betweenness centrality of each term were also calculated and analyzed.

3.5 Comparison of the average node degree and graph density of the two networks

The value of the average node degree and the graph density of the networks in 2005–2006 and 2013–2014 were calculated and compared. “Average node degree” is the arithmetic average of the co-occurrence frequencies between any pair of terms in the network (Wasserman & Faust, 1994). “Graph density” is the ratio between the actual number of edges and the maximum possible number of edges in the network (Wasserman & Faust, 1994). If two terms co-occur, an edge is generated between them in the network, and the weight of the edge is the number of records in which these two terms co-occur. The value of the average node degree and the graph density are positively correlated.
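As a rough illustration of these two metrics, the sketch below uses the standard unweighted forms for an undirected graph. This is a simplifying assumption: the definition above averages co-occurrence frequencies, whereas the sketch counts edges only.

```python
def graph_density(n_nodes, n_edges):
    """Actual edges divided by the n(n-1)/2 possible undirected edges."""
    return n_edges / (n_nodes * (n_nodes - 1) / 2)


def average_degree(n_nodes, n_edges):
    """Unweighted average degree: every edge adds one to two nodes."""
    return 2 * n_edges / n_nodes


# Toy network (hypothetical): 4 terms joined by 4 edges.
density = graph_density(4, 4)
avg_deg = average_degree(4, 4)
```

In this unweighted form the average degree equals the density multiplied by (n - 1), which is one way to see why the two metrics rise together for a fixed number of terms.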

3.6 Comparison of the diameter and average distance of the two networks

The diameter and the average distance of the networks in 2005–2006 and in 2013–2014 were calculated and compared. The “average distance” of a network is the arithmetic average of distances over all pairs of terms, and the “diameter” is the longest distance between any pair of terms in the network (Wasserman & Faust, 1994). The distance between two terms is equal to the number of edges on the shortest path from one term to the other in the term co-occurrence network.
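Both quantities can be obtained with a breadth-first search from every term. The following is a minimal sketch that assumes a connected, undirected, unweighted network stored as an adjacency dictionary; the toy graph is hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path length (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def diameter_and_average_distance(adj):
    """Assumes the graph is connected; pairs are visited in both directions."""
    lengths = []
    for source in adj:
        dist = bfs_distances(adj, source)
        lengths.extend(d for node, d in dist.items() if node != source)
    return max(lengths), sum(lengths) / len(lengths)


# Toy path graph: blood - sugar - diet - food
adj = {
    "blood": ["sugar"],
    "sugar": ["blood", "diet"],
    "diet": ["sugar", "food"],
    "food": ["diet"],
}
diameter, avg_distance = diameter_and_average_distance(adj)
```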

3.7 Comparison of the lists of top 20 terms with the highest degree centrality and betweenness centrality

The values of degree centrality and betweenness centrality of every term were calculated, and we compared the lists of the top 20 terms with the highest values of the two parameters. “Degree centrality” is the number of edges that a term possesses, and “betweenness centrality” equals the number of shortest paths from all terms to all others that pass through a specific term (Wasserman & Faust, 1994). Terms that obtain a high betweenness centrality have a high probability of occurrence on a randomly chosen shortest path between two randomly chosen terms, which indicates the importance of a term as an intermediary.
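The two centrality measures can be sketched as below. Betweenness is computed with Brandes' algorithm, a standard choice that is assumed here rather than taken from the study; when several shortest paths tie, each contributes a fractional share, which is the usual convention.

```python
from collections import deque


def degree_centrality(adj):
    """Number of edges attached to each term."""
    return {v: len(neighbors) for v, neighbors in adj.items()}


def betweenness_centrality(adj):
    """Brandes' algorithm for an unweighted, undirected graph."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back to s.
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each undirected pair was counted from both endpoints.
    return {v: b / 2 for v, b in bc.items()}


# Toy star graph (hypothetical): every shortest path between two outer
# terms passes through "diabetes".
adj = {
    "diabetes": ["diet", "insulin", "blood"],
    "diet": ["diabetes"],
    "insulin": ["diabetes"],
    "blood": ["diabetes"],
}
deg = degree_centrality(adj)
btw = betweenness_centrality(adj)
```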

3.8 Community evolution based on term co-occurrence networks

To explore recent changes and the evolution of diabetics’ concerns, the log data from 2013 to 2014 were equally divided into four subsets by time, i.e., from April 1, 2013 to June 30, 2013; July 1, 2013 to September 30, 2013; October 1, 2013 to December 31, 2013; and January 1, 2014 to March 31, 2014.
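Assigning a record to one of the four periods is a simple date lookup. The sketch below uses the boundaries stated above; the function name `period_index` and the idea that each record carries a posting date are assumptions for illustration.

```python
from datetime import date

# Period boundaries as stated in the text (inclusive on both ends).
PERIODS = [
    (date(2013, 4, 1), date(2013, 6, 30)),
    (date(2013, 7, 1), date(2013, 9, 30)),
    (date(2013, 10, 1), date(2013, 12, 31)),
    (date(2014, 1, 1), date(2014, 3, 31)),
]


def period_index(posted_on):
    """Return 0..3 for the four periods, or None outside the study window."""
    for i, (start, end) in enumerate(PERIODS):
        if start <= posted_on <= end:
            return i
    return None
```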

The NEViewer toolkit was used to visualize the evolution of term communities in the term co-occurrence networks. It is designed for visualizing real-time topic evolution at both micro and macro levels, based on the co-occurrence of words (Wang, Cheng, & Lu, 2014; Cheng, Wang, Lu, & Han, 2013).

3.9 Community detection in term co-occurrence networks

Four term co-occurrence networks were constructed, one for the Q&A log of each of the four time periods. Each network was decomposed into several communities in subsequent steps. Partition of the network, namely community detection, is essential for the visualization of the constitution of diabetics’ concerns in the four time periods.

The Louvain algorithm was used to partition the term co-occurrence network of the Q&A logs in each time period into communities of densely connected nodes, with the nodes belonging to different communities being only sparsely connected. The Louvain algorithm is recognized as one of the best methods for community detection and has been shown to outperform all other known community detection methods in terms of computational time (Lancichinetti & Fortunato, 2009; Blondel, Guillaume, Lambiotte, & Lefebvre, 2008).
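The Louvain algorithm searches for a partition that maximizes modularity. The sketch below does not implement Louvain itself; it only evaluates the modularity of a given partition of a weighted, undirected network, which is the objective the algorithm tries to increase. The function name and toy graph are illustrative assumptions.

```python
def modularity(edges, community_of):
    """Modularity of a partition.

    edges: {(u, v): weight} for an undirected graph without self-loops.
    community_of: {node: community label}.
    """
    m = sum(edges.values())  # total edge weight
    internal = {}    # edge weight inside each community
    degree_sum = {}  # summed weighted degree of each community
    for (u, v), w in edges.items():
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + w
        degree_sum[cv] = degree_sum.get(cv, 0) + w
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + w
    return sum(
        internal.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )


# Toy graph (hypothetical): two triangles joined by a single edge.
edges = {
    ("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 1,
    ("x", "y"): 1, ("y", "z"): 1, ("x", "z"): 1,
    ("c", "x"): 1,
}
two_groups = {"a": 0, "b": 0, "c": 0, "x": 1, "y": 1, "z": 1}
one_group = {node: 0 for node in two_groups}
q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
```

Splitting the toy graph into its two triangles scores higher than lumping all terms together, which is the kind of improvement Louvain looks for at each step.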

3.10 Selection of the core term to label a community

To simplify the display of communities, a core term was selected to label a community in the visualization step after all the communities were detected. The most representative term could be found by several criteria, such as centrality value and PageRank value. In this study, the term with the highest Z-value was used to represent the community in the visual environment. The Z-value measures the closeness of a specific term and other terms in the same community; the higher the Z-value of a term is, the more important this term becomes (Guimerà, Sales-Pardo, & Amaral, 2006; Rosvall & Bergstrom, 2010). Compared with other global measurements, the Z-value has a distinct advantage in finding the core term in the local community of a whole network. It is defined in Equation (1) (Guimerà et al., 2006), where $k_i$ represents the sum of the co-occurrence frequencies between term $i$ and other terms in a community, $\bar{k}$ is the average of $k_i$ ($1 \le i \le n$), and $n$ is the number of terms in this community. With regard to the denominator in Equation (1), $\overline{k^2}$ denotes the average of the squares of $k_i$, and $\bar{k}^2$ denotes the square of the average of $k_i$.

$$Z_i = \frac{k_i - \bar{k}}{\sqrt{\overline{k^2} - \bar{k}^2}} \qquad (1)$$
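One reading of the Z-value described here is a standard score: each term's summed within-community co-occurrence frequency, minus the community mean, divided by the community standard deviation. The sketch below follows that reading; the frequencies are made up for illustration, and a community whose terms all have the same frequency would need special handling because the denominator becomes zero.

```python
from math import sqrt


def z_values(k):
    """k: {term: summed co-occurrence frequency within the community}."""
    n = len(k)
    mean_k = sum(k.values()) / n
    mean_k2 = sum(value * value for value in k.values()) / n
    std = sqrt(mean_k2 - mean_k ** 2)  # zero if all k are equal
    return {term: (value - mean_k) / std for term, value in k.items()}


# Hypothetical within-community frequencies.
k = {"blood": 10, "sugar": 6, "glucose": 2}
z = z_values(k)
core_term = max(z, key=z.get)  # term used to label the community
```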


3.11 Community evolution

For the purpose of exploring communities’ evolution, this study applied a similarity measurement to find the predecessors for community $C_{t+1}$ in time period t+1. If the value of the similarity is more than a threshold, an evolution relation exists between a community in time period t and a community in time period t+1. The similarity between community $C_{t+1}$ and its predecessor community $C_t$ can be measured using Equation (2). Here, $C_x$ and $C_y$ are communities in time periods t and t+1, respectively; $V_x$ and $V_y$ are the vocabularies of $C_x$ and $C_y$; and $F(v)$ is the frequency of term $v$. The numerator is the sum of the frequencies of the terms shared by community x and community y. The denominator is the smaller of the total term frequency in community x and the total term frequency in community y.

$$\mathrm{Sim}(C_x, C_y) = \frac{\sum_{v \in V_x \cap V_y} F(v)}{\min\left(\sum_{v \in V_x} F(v),\ \sum_{v \in V_y} F(v)\right)} \qquad (2)$$
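A minimal sketch of this similarity follows. The text does not say which period's frequency an overlapping term contributes to the numerator, so the sketch assumes the smaller of its two frequencies, which keeps the score between 0 and 1. That choice, the function name, and the toy frequencies are all assumptions.

```python
def community_similarity(freq_x, freq_y):
    """freq_x, freq_y: {term: frequency} for communities in periods t and t+1."""
    shared = freq_x.keys() & freq_y.keys()
    # Assumption: a shared term contributes the smaller of its two frequencies.
    overlap = sum(min(freq_x[v], freq_y[v]) for v in shared)
    return overlap / min(sum(freq_x.values()), sum(freq_y.values()))


# Hypothetical communities in two adjacent periods.
c_t = {"blood": 5, "sugar": 3, "insulin": 2}
c_t1 = {"blood": 4, "sugar": 4, "diet": 8}
sim = community_similarity(c_t, c_t1)
is_predecessor = sim > 0.5  # 0.5 is an illustrative threshold only
```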


The visualization of communities and evolution relationships is explained below.

3.12 Visualization of community evolution

An alluvial diagram was used to visualize the evolution (Rosvall & Bergstrom, 2010; Wikipedia, 2016). In the alluvial diagram, the rectangular colored blocks represent each community, and thus the structural changes between communities can be highlighted.

The curves between the colored blocks in two time periods denote the evolution process. If the similarity between community Ct+1 and community Ct is more than a threshold, a linkage will be visualized as a colored curve between community Ct+1 and community Ct. Furthermore, if community Ct divides into two colored blocks in time period t+1, it implies that Ct divides into two communities; if two or more colored blocks in time period t merge into community Ct+1, it implies that two or more communities merge into one community (Wang et al., 2014; Cheng et al., 2013). If community Ct+1 has no predecessor, Ct+1 is a newly created community in time period t+1, and if community Ct has no successor, Ct dies out in time period t+1.
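The rules in this paragraph can be expressed as a small classification over the links that pass the similarity threshold. The sketch below is illustrative only; the function name and community labels are hypothetical.

```python
def classify_events(links, communities_t, communities_t1):
    """Classify community events between periods t and t+1.

    links: set of (community in t, community in t+1) pairs whose
    similarity exceeded the threshold.
    """
    successors = {c: [y for x, y in links if x == c] for c in communities_t}
    predecessors = {c: [x for x, y in links if y == c] for c in communities_t1}
    return {
        "died": sorted(c for c, s in successors.items() if not s),
        "split": sorted(c for c, s in successors.items() if len(s) > 1),
        "born": sorted(c for c, p in predecessors.items() if not p),
        "merged": sorted(c for c, p in predecessors.items() if len(p) > 1),
    }


# Hypothetical labels: "a" splits in two, "b" and "c" die out, "d" is new.
events = classify_events(
    links={("a", "a1"), ("a", "a2")},
    communities_t=["a", "b", "c"],
    communities_t1=["a1", "a2", "d"],
)
```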

Communities in one time period were ranked by the sum of co-occurring frequencies of all terms. The higher a community is ranked, the closer it is to the top in the area of its corresponding time period in the alluvial diagram, and vice versa (Wang et al., 2014; Cheng et al., 2013). The position of a community in each period also reflects its importance as a user concern.

4 Results and discussion

4.1 The size of vocabulary that diabetic questioners and answerers use is shrinking

Users who ask and answer questions about diabetes on a social Q&A site are using a smaller vocabulary, and the scope of questions and answers is becoming more focused. Although the number of questions increased from 5,222 in 2005–2006 to 8,570 in 2013–2014, the number of core terms decreased from 3,547 to 2,701, as shown in Table 1. This implies that the size of the vocabulary that diabetics used to post questions has decreased, while the frequencies of specific terms have increased.

Table 1

The number of words remaining for analysis after preprocessing.

Intuitively, the number of core terms should increase because the number of questions in the Q&A logs increased between the two periods. However, the data in our study show that users tend to use fewer terms in diabetes-related questions and answers over time. This suggests that the content discussed under the topic of diabetes on social Q&A sites has become more coherent, and that users have become more sophisticated when seeking and sharing information on the social Q&A website.

Further evidence comes from the comparison of term co-occurrence network metrics between 2005–2006 and 2013–2014. The results of statistical analysis of the Q&A log data from 2005–2006 and 2013–2014 are presented in Table 2.

Table 2

Term co-occurrence network metrics.

As shown in Table 2, both average node degree and graph density increase sharply over time, especially for graph density, which grows almost fourfold, from 0.008 to 0.03. The simultaneous increase of these two metrics demonstrates that the 2013–2014 network was relatively more connected and coherent than that of 2005–2006. That is to say, connections between these core terms increased, while the number of core terms that diabetics used in 2013–2014 was less than in 2005–2006.

4.2 Diabetics are concerned more about daily life than before

Term use changed markedly from 2005–2006 to 2013–2014, as reflected in the term co-occurrence network metrics. An independent-samples t-test was applied to compare degree centrality and betweenness centrality between the term co-occurrence networks of 2005–2006 and 2013–2014. The degree centrality of terms in the 2005–2006 network (mean = 29.05) and in the 2013–2014 network (mean = 40.04) differed significantly (t = -4.723, P < 0.001). The betweenness centrality of terms in the 2005–2006 network (mean = 413.74) and in the 2013–2014 network (mean = 673.98) also differed significantly (t = -4.148, P < 0.001). The group statistics of the t-test analysis are displayed in Table 3. The results of the t-test are shown in Table 4.
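For reference, the t statistic of an independent-samples test can be computed as sketched below. The sketch assumes the pooled-variance (equal variances) form; the text does not state which variant was used, and the sample values are made up. Obtaining the P value additionally requires the t distribution, which is left out here.

```python
from math import sqrt
from statistics import mean, variance


def independent_t(sample_a, sample_b):
    """Student's t statistic with pooled variance, plus degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    pooled = (
        (na - 1) * variance(sample_a) + (nb - 1) * variance(sample_b)
    ) / (na + nb - 2)
    t = (mean(sample_a) - mean(sample_b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2


# Hypothetical centrality values for an earlier and a later network.
earlier = [2, 4, 6]
later = [5, 7, 9]
t_stat, dof = independent_t(earlier, later)
```

With the earlier period as the first sample, a negative t indicates that the later network has the higher mean, matching the sign of the values reported above.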

Table 3

Group statistics of the t-test.

Table 4

Independent-samples t-test.

Specifically, users were more interested in medical issues in the earlier period. For example, “condition,” “patient,” and “treatment” appear in the list of the top 20 terms with the highest degree centrality in 2005–2006 but disappear from the top 20 list in 2013–2014. The betweenness centrality data show the same pattern. The top 20 terms with the highest betweenness centrality in 2005–2006 also contained “condition,” “patient,” and “treatment”; however, these terms were not present in the list of top 20 terms in 2013–2014.

In 2005–2006, other terms such as “drug” and “pain” appeared on the list of the top 20 terms with the highest betweenness centrality. This suggests that a certain proportion of users’ attention was centered on medical questions at that time. In contrast, in 2013–2014, terms including “life,” “fat,” and “meal” appeared on the top 20 betweenness centrality list. These terms are more casual and informal than the medical terms on the 2005–2006 list. Here are some examples of questions that include “life”:

Question 1: Can you live a normal life with diabetic retinopathy in your early thirties?

Question 2: How does a doctor check for prediabetes/diabetes and what happens if I get it and choose no treatment? I’m going in for a checkup; how will the doctor know if I have Type 2 diabetes or prediabetes and if I do have it, what happens if I choose to pass on treatment and live my life like I normally do

The above changes reveal a shift of diabetics’ focus. They have begun to be concerned more about daily life and other nonmedical issues, including diet, food, and nutrients. In order to confirm this finding with more data, the ranks of all terms in 2005–2006 and 2013–2014 were examined and compared. Table 5 provides some terms whose degree centralities go up sharply, along with the increase in the rank of degree centrality. It can be seen that the rank of “food” rises from 18 to 4 and the rank of “diet” rises from 51 to 5. The ranks of degree centrality of other food-related terms listed in Table 5 also show big increases. These changes indicate that food-related terms are becoming more important in diabetes-related questions than before.

Table 5

Terms with sharp increase of degree centrality value.

Table 6 illustrates some terms whose rank of degree centrality decreases steeply over the period ranging from 2005–2006 to 2013–2014. As shown in Table 6, the ranks of typical and common medically related terms decreased sharply. The ranks of treatment, symptom, medic, and disease decreased from 3 to 34, 4 to 7, 6 to 9, and 8 to 18, respectively. The ranks of common medicines for diabetes, such as Lantus, Humalog, Novolog, and Levemir, dropped dramatically.

Table 6

Terms with sharp decrease of degree centrality value.

These results reflect a noticeable trend: users tend to focus more attention on questions related to daily life, such as self-management of diabetes. That is, diabetics are using more everyday concepts instead of medical jargon to ask for information.

One possible explanation is that at the beginning of social Q&A services, people were eager for information about the diagnosis of diabetes and treatments. Along with the development of the Internet and the popularity of the mobile Internet, medical information about diabetes is no longer scarce and people have turned to discussion of topics about self-care and daily life related to diabetes.

These conclusions are confirmed in the recent literature. For instance, a schema was formed to describe diabetics’ term use in a Q&A site, and Social & Culture, Lifestyle, and Nutrients are shown to be important components in the schema (Zhang & Zhao, 2013; Zhang et al., 2014). Gooden and Winefield (2007) also confirmed that health users are not only seeking answers but also expressing feelings, seeking emotional support, and so on.

4.3 Recent changes and evolution path of diabetics’ concerns

In order to provide insights into how diabetics’ concerns change over contiguous time periods in recent history, the log data of 12 months from April 1 of 2013 to March 31 of 2014 were equally divided into four subsets by time. Communities from term co-occurrence networks in each time period were detected and visualized, and the relationships between communities from adjacent time periods were drawn. An alluvial diagram was used to visualize the community evolution in the four time periods {Ti| i =1, 2, 3, 4}.

In Figure 1, the numbers 1, 2, 3, 4 refer to the respective time periods. A core word with the highest Z-value is used to represent a community. The terms within three communities in T1 were split into five communities in T2, then those five communities formed four communities in T3, and the number of identified communities was five in T4. The numbers of visible paths between adjacent periods (T1 to T2, T2 to T3, and T3 to T4) were six, four, and seven, respectively.

Figure 1

Community evolution in the four time periods.

In T1, “blood” is the largest community containing 1,323 terms (70.79% of a total of 1,869 words) and the top 10 words with the highest Z-values are blood, sugar, symptom, medic, glucose, weight, disease, life, water, and effect. This large community is the predecessor of the following four communities in T2: symptom, blood, weight, and treatment.

From T2 to T3, according to Figure 1, the “weight” and “treatment” communities die out. This indicates that these words were not used frequently enough to form a new community, and users in T3 used the terms in these two communities less frequently than other terms.

The blood community in T2 was split into the blood and risk communities in T3, while the risk community did not contribute terms to T4. From T3 to T4, the terms in the blood community of T3 were split into three new communities, namely, medicine, sugar, and basal.
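Relations such as the splitting and dying out described above could, in principle, be derived from the terms that communities in adjacent periods share. The Python sketch below is one simple way to do this and is not the procedure used to produce Figure 1; the overlap threshold `min_shared`, the function names, and the term sets are all assumptions made for illustration.

```python
def flows(prev, curr, min_shared=1):
    """Return {(prev_label, curr_label): number_of_shared_terms}.

    prev, curr: dicts mapping a community label to its set of terms.
    Only pairs sharing at least min_shared terms are kept (assumed threshold).
    """
    links = {}
    for p_label, p_terms in prev.items():
        for c_label, c_terms in curr.items():
            shared = len(p_terms & c_terms)
            if shared >= min_shared:
                links[(p_label, c_label)] = shared
    return links


def classify(prev, curr, min_shared=1):
    """Label simple evolution events between two adjacent periods."""
    links = flows(prev, curr, min_shared)
    events = {}
    for p in prev:
        successors = [c for (q, c) in links if q == p]
        if not successors:
            events[("prev", p)] = "death"   # no community inherits its terms
        elif len(successors) > 1:
            events[("prev", p)] = "split"   # terms flow to several communities
    for c in curr:
        predecessors = [p for (p, d) in links if d == c]
        if not predecessors:
            events[("curr", c)] = "birth"   # formed only from new terms
        elif len(predecessors) > 1:
            events[("curr", c)] = "merge"   # formed from several communities
    return links, events


# Made-up term sets, for illustration only.
t2 = {
    "blood": {"blood", "sugar", "glucose", "risk", "heart"},
    "weight": {"weight", "loss"},
}
t3 = {
    "blood": {"blood", "sugar", "glucose"},
    "risk": {"risk", "heart"},
}
links, events = classify(t2, t3)
```

In this toy input, the "blood" community is reported as a split and the "weight" community as a death, which mirrors the kind of paths an alluvial diagram draws between adjacent periods.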

The evolution paths of the vegetable and food communities in T1 and their derivative communities are elaborated later.

Dominant communities, such as the symptom and treatment communities, reflect the primary diabetes-related concerns of general health consumers. According to the visualization of term communities, topics regarding diet have emerged in addition to disease-related topics. In another analysis of diabetes data collected from social Q&A logs, Zhang and Zhao (2013) found that the category “Nutrient” ranked at the top in terms of both the total number of unique terms in that category and the percentage of that category relative to the total number of unique terms in all 12 discovered categories. Their study suggested that food-related topics are popular and closely related to diabetes, which confirms our results.

Previous studies used data from various Web 2.0 platforms, such as blogs, Facebook, and mobile diabetes applications, to determine the topics that are of concern to diabetes users. Greene, Choudhry, Kilabuk, and Shrank (2011) studied 15 Facebook groups and showed that diabetes-related users not only requested disease-specific information but also sought emotional support. Nordfeldt, Hanberger, and Berterö (2010), as well as Nordfeldt and Berterö (2012), based on a comprehensive analysis of social networking message boards and blogs, found that diabetics expressed positive attitudes toward Web 2.0 diabetes portals. However, such attitudes were not evident in our study.

Moreover, our study presents an evolution path of users’ concerns, whereas most previous studies have only investigated diabetes users’ data from a single time period. Furthermore, in our study, the relations between adjacent term communities, such as birth, growth, merging, contraction, splitting, and death, are visualized. In contrast, previous studies visualized diabetes-related terms by generating term lists or by clustering terms in the visual space (Akay, Dragomir, & Erlandsson, 2013; Zhang & Zhao, 2013).

4.4 Food- and diet-related terms have become more important in diabetics’ concerns

The evolution path related to food and diet was extracted and is displayed in Figure 2. Notably, the food community persisted through T1, T2, and T3 and then evolved into the diet community. Its position rose from the bottom to second place in terms of the attention it attracted from diabetics. The vegetable and food communities in T1 were combined into one community in T2, and the food community in T3 inherited all terms from the food community in T2. Finally, the terms in the food community of T3 formed two new communities, namely, diet and basal.

Figure 2. Evolution path of the community in terms of food and diet terms.

In summary, the alluvial diagram was used to examine the evolution patterns and the changing path of diabetics’ concerns in a visual space. Every colored block in the alluvial diagram represents a community, with its predecessors and successors visualized. The diagram thus shows where the terms in a community come from and where they go. The evolution path of any community can be extracted to view its details; the evolution path of food was extracted as an example. These visualization results effectively demonstrate the changes in the terms that diabetics chose and used.

However, one weakness of the information visualization method is that the number of displayed objects is limited because the size of a visual space is limited. In other words, an overcrowded display makes effective visual analysis impossible and even meaningless. Thus, the communities in the alluvial diagram are labeled by the core words with the highest Z-values, which simplifies the representation of communities in the visual display effectively. The alluvial diagram also ranks the communities by placing them at different levels.

5 Conclusion

This paper explored the evolution of diabetics’ concerns based on the logs taken in 2005–2006 and 2013–2014. Term co-occurrence network metrics are provided to articulate the overall changing tendency of diabetics’ concerns from 2005–2006 to 2013–2014, which contributes to the prediction of future trends in diabetics’ concerns.

The alluvial diagram was used to visualize communities and the evolution patterns between communities in two adjacent time periods, including growth, contraction, merging, splitting, and death.

Several findings need to be highlighted. For instance, diabetics tend to use targeted words to express their concerns, and they are more concerned with questions related to daily life than with medical treatments. These findings have implications in the following respects. Firstly, this study could be used to inform medical professionals of diabetics’ information needs, information services provision, and the patterns of change in the information needs. This paper is useful not only for informing medical professionals about the emerging requirements of patients but also for improving diabetic patients’ experience of medical services.

Secondly, the trend of change in diabetics’ concerns can guide the optimization and improvement of social Q&A services. For instance, because users are becoming increasingly interested in raising questions related to daily life, social Q&A providers should take measures to encourage more experienced diabetics, emotional experts, or nutritionists to answer health-related questions, as well as to mediate interaction among diabetics.

Thirdly, the methods of community detection, evolution, and visualization can be integrated into social Q&A websites. Users can access visualization results, which can help them to optimize their retrieval strategy. The combination of visual display and social Q&A websites can provide end users with diverse questioning experiences and higher search efficiency. For example, the visualization results of this study can help a prediabetic, who knows little about diabetes, to identify and comprehend what is important in the process of searching or questioning online.

Fourthly, the methods adopted in this study can be applied to the examination of the evolution of user concerns in other domains. Specifically, the method of comparison of network metrics can be used to analyze the evolution of user concerns in noncontiguous time periods. Moreover, the method of community evolution and visualization can be used to explore recent changes in user concerns in contiguous time periods.

One limitation of this paper is the bias of its data source. The data were collected only from Yahoo! Answers. In the future, we will investigate diabetics’ concerns and their evolution using hybrid data from general social Q&A websites such as Yahoo! Answers, professional social Q&A websites such as PatientsLikeMe, diabetes-related academic articles, and electronic medical records from hospitals.


This work was supported by the National Natural Science Foundation of China (grant nos. 71403190, 71420107026, and 71573197). The authors thank the Academic Team of Young Scholars at Wuhan University (Whu2016013).


  • Akay, A., Dragomir, A., & Erlandsson, B. E. (2013). A novel data-mining approach leveraging social media to monitor and respond to outcomes of diabetes drugs and treatment. In Proceedings of 1st IEEE-EMBS Special Topic Conference on Point-of-Care (POCT) Healthcare Technologies (PHT) (pp. 264-266), Bangalore, India.

  • Albert, R., & Barabási, A. (2002). Statistical mechanics of complex networks. Reviews of Modern Physics, 74(1), 47-97.

  • Alexa. (2017). The top ranked sites in references category. Retrieved October 10, 2017, from http://www.alexa.com/topsites/category/Top/Reference

  • American Diabetes Association. (2017, July 19). The national diabetes statistics report. Retrieved July 19, 2017, from http://www.diabetes.org/diabetes-basics/statistics/

  • Barnard, K. D., Peyrot, M., & Holt, R. I. (2012). Psychosocial support for people with diabetes: past, present and future. Diabetic Medicine, 29(11), 1358-1360.

  • Blondel, V. D., Guillaume, J. L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10), 155-168.

  • Callon, M., Courtial, J., Turner, W., & Bauin, S. (1983). From translations to problematic networks: An introduction to co-word analysis. Social Science Information, 22(2), 191-235.

  • Cheng, Q., Wang, X., Lu, W., & Han, S. (2013). NEViewer: A new software for analyzing the evolution of research topics. In Proceedings of the 14th International Conference of the International Society of Scientometrics and Informetrics (pp. 1307-1320), Wien, Austria: Facultas Verlags- und Buchhandels AG.

  • Choi, E., & Shah, C. (2015). User motivations for asking questions in online Q&A services. Journal of the Association for Information Science and Technology, 67(5), 1182-1197.

  • Ding, Y., & Stirling, K. (2016). Data-driven discovery: A new era of exploiting the literature and data. Journal of Data and Information Science, 1(4), 1-9.

  • Gooden, R., & Winefield, H. (2007). Breast and prostate cancer online discussion boards: A thematic analysis of gender differences and similarities. Journal of Health Psychology, 12(1), 103-114.

  • Greene, J., Choudhry, N., Kilabuk, E., & Shrank, W. (2011). Online social networking by patients with diabetes: A qualitative evaluation of communication with Facebook. Journal of General Internal Medicine, 26(3), 287-292.

  • Guimerà, R., Sales-Pardo, M., & Amaral, L. A. N. (2007). Classes of complex networks defined by role-to-role connectivity profiles. Nature Physics, 3(1), 63-69.

  • Harper, M., Raban, D., Rafaeli, S., & Konstan, J. (2008). Predictors of answer quality in online Q&A sites. In Proceedings of the 26th Annual SIGCHI Conference on Human Factors in Computing Systems (pp. 865-874). New York: ACM.

  • Kaplan, A., & Haenlein, M. (2010). Users of the world, unite! The challenges and opportunities of Social Media. Business Horizons, 53(1), 59-68.

  • Kim, S., & Oh, S. (2009). Users’ relevance criteria for evaluating answers in a social Q&A site. Journal of the American Society for Information Science and Technology, 60(4), 716-727.

  • Kim, S., Oh, S., & Oh, J. (2008). Evaluating health answers in a social Q&A site. In Proceedings of the American Society for Information Science and Technology (ASIST’08) (pp. 1-6). Columbus, Ohio: Information Today.

  • Kim, J., Yang, M., Hwang, Y., Jeon, S., Kim, K., Jung, I. S., Choi, C., Cho, W., & Na, J. (2012). Customer preference analysis based on SNS data. In Proceedings of Second International Conference on Cloud and Green Computing/Second International Conference on Social Computing and its Applications (CGC/SCA 2012) (pp. 609-613). New York: IEEE.

  • Lancichinetti, A., & Fortunato, S. (2009). Community detection algorithms: A comparative analysis. Physical Review E, 80(5), 056117.

  • Lansdall-Welfare, T., Sudhahar, S., Thompson, J., Lewis, J., Team, F. N., Cristianini, N., & Callison, R. (2017). Content analysis of 150 years of British periodicals. Proceedings of the National Academy of Sciences, 114(4), E457-E465.

  • Lee, W. H. (2008). How to identify emerging research fields using scientometrics: An example in the field of Information Security. Scientometrics, 76(3), 503-525.

  • Li, X., Shao, Z., & Zheng, C. (2012). Visualization of domestic PIS research of library and information science based on co-word analysis. Journal of Intelligence, 31(8), 109-113.

  • Liu, Y., Lin, Z., Zheng, X., & Chen, D. (2016). Incorporating social information to perform diverse replier recommendation in question and answer communities. Journal of Information Science, 42(4), 449-464.

  • Liu, Z., & Jansen, B. J. (2017). Identifying and predicting the desire to help in social question and answering. Information Processing and Management, 53(2), 490-504.

  • Milewski, J., & Chen, Y. (2010). Barriers of obtaining health information among diabetes patients. Studies in Health Technology and Informatics, 160(1), 18-22.

  • Moerchen, F., Fradkin, D., Dejori, M., & Wachmann, B. (2008). Emerging trend prediction in biomedical literature. In AMIA Annual Symposium Proceedings (pp. 485-489). Washington, DC: American Medical Informatics Association.

  • Nordfeldt, S., Johansson, C., Carlsson, E., & Hammersjo, J. (2005). Use of the Internet to search for information in type 1 diabetes children and adolescents: A cross-sectional study. Technology and Health Care, 13(1), 67-74.

  • Nordfeldt, S., Hanberger, L., & Berterö, C. (2010). Patient and parent views on a web 2.0 diabetes portal—The management tool, the generator, and the gatekeeper: Qualitative study. Journal of Medical Internet Research, 12(2), e17.

  • Nordfeldt, S., & Berterö, C. (2012). Young patients’ views on the open Web 2.0 childhood diabetes patient portal: A qualitative study. Future Internet, 4(2), 514-527.

  • Porter, M. (1980). An algorithm for suffix stripping. Program: Electronic Library and Information Systems, 14(3), 130-137.

  • Ronda-Pupo, G., & Guerras-Martin, L. (2012). Dynamics of the evolution of the strategy concept 1962–2008: a co-word analysis. Strategic Management Journal, 33(2), 162-188.

  • Rosvall, M., & Bergstrom, C. (2010). Mapping change in large networks. PLoS ONE, 5(1), e8694.

  • Sacristán, J. (2013). Patient-centered medicine and patient-oriented research: improving health outcomes for individual patients. BMC Medical Informatics and Decision Making, 13(1), 1-8.

  • Schiebel, E., Hörlesberger, M., Roche, I., François, C., & Besagni, D. (2010). An advanced diffusion model to identify emergent research issues: the case of optoelectronic devices. Scientometrics, 83(3), 765-781.

  • Strohman, T., Metzler, D., Turtle, H., & Croft, W. (2005). Indri: A language model-based search engine for complex queries. Proceedings of the International Conference on Intelligent Analysis.

  • Wasserman, S., & Faust, K. (1994). Social network analysis: methods and applications. New York: Cambridge University Press.

  • Wang, X., Cheng, Q., & Lu, W. (2014). Analyzing evolution of research topics with NEViewer: a new method based on dynamic co-word networks. Scientometrics, 101(2), 1253-1271.

  • Wikipedia. (2016). Alluvial diagram. Retrieved September 9, 2016, from http://en.wikipedia.org/wiki/Alluvial_diagram

  • Zhang, J., Wolfram, D., Wang, P., Hong, Y., & Gillis, R. (2008). Visualization of health-subject analysis based on query term co-occurrences. Journal of the American Society for Information Science and Technology, 59(12), 1933-1947.

  • Zhang, J., & Zhao, Y. (2013). A user term visualization analysis based on a social question and answer log. Information Processing and Management, 49(3), 1019-1048.

  • Zhang, J., Zhao, Y., & Dimitroff, A. (2014). A study on health care consumers’ diabetes term usage across identified categories. Aslib Journal of Information Management, 66(4), 443-463.

  • Zhang, L., Xu, K., & Zhao, J. (2016). Sleeping beauties in meme diffusion. Scientometrics, 112(5), 1-20.

  • Zhao, L., Detlor, B., & Connelly, C. E. (2016). Sharing knowledge in social Q&A sites: The unintended consequences of extrinsic motivation. Journal of Management Information Systems, 33(1), 70-100.

About the article

Received: 2017-07-03

Accepted: 2018-01-11

Published Online: 2018-04-25

Citation Information: Data and Information Management, ISSN (Online) 2543-9251, DOI: https://doi.org/10.2478/dim-2018-0002.


© 2018 Yiming Zhao et al. This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License (BY-NC-ND 3.0).
