Information Filtering in Social Media During Disasters
2011 Tohoku disaster - Ishinomaki - Miyagi prefecture, Japan

Over the last 25 years, the world has seen a rise in the frequency of natural disasters in rich and poor countries alike. Today, there are more people at risk from natural hazards than ever before, with those in developing countries particularly at risk.

This essay series is intended to explore measures that have been taken, and could be taken, in order to improve responses to the threat or occurrence of natural disasters in the MENA and Indo-Pacific regions.


 

Disaster relief is the process of providing humanitarian assistance to people affected by a disaster.[1] Disasters may be man-made, such as a bombing or mass shooting, or natural, such as hurricanes, earthquakes, or floods. Here we focus on natural disasters, which can have destructive consequences in terms of area, severity, and number of people affected. The type, size, and location of harm are hard to predict because our understanding of natural disasters is limited. During relief delivery, it is extremely challenging for first responders to make fast decisions about distributing resources with a limited budget in such a dynamic and uncertain environment.

Relief delivery tasks are nowadays facilitated by information technology systems that are powered by volunteers. Zook et al.[2] studied the systems used during the 2010 earthquake that hit Haiti, an area with minimal access to the Internet and online maps. Volunteers around the globe (with a concentration in the U.S.) collaborated in CrisisCamps, using social media tools to help with the recovery efforts. They created portable offline crisis maps of Haiti on the OpenStreetMap platform, with data on roads and medical facilities. The maps suited an environment that had lost its connection to the Internet. Other CrisisCamp efforts included two systems, “We Have We Need” and “HaitiVoiceNeeds,” that were developed to connect affected people with volunteers who were not on site but were willing to help with relief delivery. Another volunteer-powered system is Ushahidi, in which data is collected via SMS, MMS, email, and web interfaces and then geotagged on a map. Accepting SMS and MMS allows people with basic cellphones to contribute, which made Ushahidi more successful.

Following the success of crowdsourcing systems such as Ushahidi, much attention has been paid in recent years to disaster relief using social media, especially Twitter. People tweet extensively about disasters. For example, Twitter has reported that in the aftermath of Hurricane Sandy in 2012, more than 20 million event-related tweets were published. These tweets included valuable actionable information: details about the situation, issues, required resources, and the location of available humanitarian assistance. Systems such as Artificial Intelligence for Disaster Response (AIDR) have been developed to automate the extraction of informative tweets. To perform this task, Imran et al.[3] extract crisis-related tweets, ask the crowd to label a subset of them (as informative or not), and use this data to train a crisis-specific classifier, which detects the remaining informative tweets.
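
The label-then-classify workflow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny labeled set is invented, the function names `train` and `classify` are hypothetical, and a simple Naive Bayes model stands in for whatever classifier a real system such as AIDR uses.

```python
import math
from collections import Counter
from typing import Dict, List, Set, Tuple

Model = Tuple[Dict[str, Counter], Counter, Set[str]]

def train(labeled: List[Tuple[str, str]]) -> Model:
    """Count words per label from crowd-labeled tweets."""
    word_counts: Dict[str, Counter] = {}
    label_counts: Counter = Counter()
    vocab: Set[str] = set()
    for text, label in labeled:
        label_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in text.lower().split():
            counts[word] += 1
            vocab.add(word)
    return word_counts, label_counts, vocab

def classify(text: str, model: Model) -> str:
    """Return the label with the highest Naive Bayes log-score."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = "", float("-inf")
    for label, n in label_counts.items():
        score = math.log(n / total)
        # Laplace (add-one) smoothing over the shared vocabulary.
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical crowd-labeled subset (invented for illustration).
labeled = [
    ("shelter open at city hall need blankets", "informative"),
    ("road closed near bridge need water", "informative"),
    ("win free phone click here", "not_informative"),
    ("click here for free followers", "not_informative"),
]
model = train(labeled)
first = classify("need water and blankets at shelter", model)
second = classify("click here free phone", model)
```

In practice the labeled subset would be far larger and would be refreshed for each crisis, since the vocabulary of informative tweets changes from one event to the next.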

A major issue that systems such as AIDR face is that informative tweets are, unfortunately, buried under unwanted content such as scam messages that try to steal money from people who make donations. Unwanted content on social media can take many forms: spam, disinformation, rumors, misinformation, and bot-generated content. Because accounts and content are easy to create automatically on social media, unwanted content overwhelms normal tweets. For example, three million spam tweets are published every day,[4] which drastically decreases the quality of the data. Bots and bot-generated content are new phenomena in social media. Bots can influence discussions through collective attention: a large number of accounts is harnessed to publish on the same topic so that it appears to be the opinion of the crowd. Misinformation, disinformation, and rumors can cause panic in the aftermath of a crisis, when people are looking for the latest updates on the situation.

Disaster relief systems rest on the assumption that information provided by volunteers is accurate. Hence, the existence of unwanted content in social media data not only affects its quality but also challenges relief delivery efforts that exploit it. Here we introduce different types of unwanted data, their characteristics, and efforts toward eliminating them in social media.

What Information to Filter and How?

1. Spam
Spam is “content designed to mislead or content that the site’s legitimate users do not wish to receive.”[5] Social media sites publish lists of activities that they treat as signs of spamming. On Twitter this includes posting harmful links, aggressively following and unfollowing users, and creating multiple accounts. On Facebook, “pages, groups or events that confuse, mislead, surprise or defraud people” are considered abusive. These sites have mechanisms to detect and suspend users who show signs of spamming. For example, on Twitter, 50% of the accounts created in 2014 have been suspended;[6] this shows how widespread spamming is on social media.

Spammers degrade the credibility of social media platforms by publishing unwanted content in large volumes. Researchers have extensively studied spammers to extract features that discriminate them from normal users and to build classifiers that detect them in action. It has been observed that spammers have specific characteristics due to their malicious intent. For example, promotional spammers advertise a specific product, so they publish on the same topic, usually with similar URLs, keywords, and hashtags. Therefore, if we cluster users based on the topics of their posts, spammers will form a dense cluster distinct from those of normal users. Another feature of spammers is their network connections. Spammers attempt to make numerous connections to normal users in order to appear legitimate. However, normal users do not connect to spammers intentionally, and spammers’ connections are rarely reciprocated.[7] Even the usernames of spammers differ from those of normal users: they have higher complexity and variety and do not reflect age and gender.[8]
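
The observation above that spammers’ connections are rarely reciprocated can be turned into a simple feature. The sketch below is a minimal illustration, not the method of the cited work: the follow graph is invented, and both the function name `reciprocity_ratio` and the 0.5 cut-off are assumptions chosen for the example.

```python
from typing import Dict, Set

def reciprocity_ratio(user: str, following: Dict[str, Set[str]]) -> float:
    """Fraction of the accounts `user` follows that follow `user` back."""
    followed = following.get(user, set())
    if not followed:
        return 0.0
    reciprocated = sum(
        1 for other in followed if user in following.get(other, set())
    )
    return reciprocated / len(followed)

# Hypothetical toy follow graph: each key follows every account in its set.
following = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "carol"},
    "carol": {"alice", "bob"},
    "promo_bot": {"alice", "bob", "carol", "dave"},
    "dave": {"promo_bot"},
}

ratios = {user: reciprocity_ratio(user, following) for user in following}

# Assumed cut-off: flag accounts whose follows are mostly unreciprocated.
suspects = sorted(user for user, ratio in ratios.items() if ratio < 0.5)
```

A single feature like this would normally be combined with content and username features before any account is flagged, since new legitimate users also start with few reciprocated connections.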

As new methods are proposed, spammers adopt new behaviors to escape detection, and this dynamic behavior causes detection methods to become outdated rapidly. At the same time, spam detection methods are expensive to create: they are usually supervised and require training datasets in which users are labeled as normal or spammer. Researchers therefore face a recurring challenge: as spammers evolve, new training datasets must be labeled and new methods formulated.

2. Misinformation, Disinformation, and Rumors
Misinformation is fake or inaccurate information that is spread unintentionally.[9] Users, relying on trust, share information without verifying its correctness. People enjoy sharing information, and social media helps it spread even faster than it does in traditional networks. An example of misinformation concerns Ebola in 2014, when the number of tweets mentioning the virus rose to 6,000 per minute. This happened even though only a few cases of Ebola had been observed in Newark, Miami Beach, and Washington DC and there was no major outbreak. The tweets contained the inaccurate information that Ebola can spread via water and food.[10]

Disinformation is misleading and deliberately deceptive false information.[11] Its main distinction from misinformation is the intention behind spreading it. Differentiating misinformation from disinformation is a hard task because the intention of the authors is unknown. If disinformation goes viral (and becomes a rumor), it can cause distress among users, especially during the critical time of disasters. The distress caused by misinformation and disinformation can go beyond the effects of spam because such content can offer intriguing information that cannot be verified. Moreover, the timing of rumors aids their diffusion, especially when they are published in times of crisis and panic, when people are actively seeking information and authorities are occupied with the relief process.

In the aftermath of Hurricane Sandy in 2012, 10,350 tweets with fake images were circulated on Twitter by 10,215 users. The top thirty users were responsible for 90% of the posts. These findings imply the existence of a relatively small group of rumor propagators that could influence a large group of normal users to spread the disinformation. Moreover, only 11 percent of users received the content that they retweeted from their followees. This suggests that users mainly relied on Twitter search to acquire information. Rumors can be detected through manual examination by experts; however, they also exhibit properties that can be used to create automatic detection methods. For example, the diffusion process of rumors differs from that of normal posts. The originality of posts is lower (most of the posts are retweets). The number of users involved in the cascade is low in comparison to its virality. Also, the depth of the resulting cascades is lower than that of normal posts, because users receive the information by search and the posts mostly do not diffuse through the friendship network.[12]
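
The diffusion properties listed above (originality, number of users involved, and cascade depth) can be computed from a list of posts once each retweet is linked to the post it repeats. The following is a minimal sketch: the record layout (`id`, `user`, `parent`) and the function name `cascade_features` are assumptions made for illustration, and the sample cascade is invented.

```python
from typing import Dict, List

def cascade_features(posts: List[dict]) -> Dict[str, float]:
    """Summarize a cascade; `parent` is None for an original post."""
    if not posts:
        raise ValueError("cascade must contain at least one post")
    parent = {p["id"]: p["parent"] for p in posts}

    def depth(post_id) -> int:
        # Number of retweet hops back to the original post.
        # Assumes the parent links contain no cycles.
        hops = 0
        while parent.get(post_id) is not None:
            post_id = parent[post_id]
            hops += 1
        return hops

    originals = sum(1 for p in posts if p["parent"] is None)
    return {
        "originality": originals / len(posts),  # share of non-retweets
        "max_depth": max(depth(p["id"]) for p in posts),
        "unique_users": len({p["user"] for p in posts}),
    }

# Hypothetical rumor-like cascade: one original post, retweeted directly
# by users who found it through search rather than through friends.
rumor = [
    {"id": 1, "user": "a", "parent": None},
    {"id": 2, "user": "b", "parent": 1},
    {"id": 3, "user": "c", "parent": 1},
    {"id": 4, "user": "b", "parent": 1},
    {"id": 5, "user": "d", "parent": 1},
]
features = cascade_features(rumor)
```

Here low originality combined with a shallow cascade matches the rumor pattern described above; which thresholds separate rumors from normal posts would have to be learned from labeled data.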

3. Bot-Generated Content 
A malicious bot is a piece of software that controls a hijacked or adversary-owned account.[13] These bots disguise themselves as normal users and perform fraudulent tasks, such as aggressively following and unfollowing accounts on Twitter or clicking on advertisements for profit on Facebook. Benign bots, on the other hand, are mostly self-declared and perform useful activities. For example, they periodically acquire and automatically repost content from major sources and news agencies such as the United States Geological Survey (USGS) and Cable News Network (CNN).

The ability to quickly and easily create and automate bots provides fertile ground for performing malicious activities on a large scale, as in the 2010 Massachusetts Senate election, in which a candidate gained 60,000 fake followers on social media.[14] To avoid such situations, researchers have studied the behaviors that can discriminate bots from normal users. Key to all insights about bot behaviors is understanding that automation is the primary feature of all bots. Bots create posts in regular patterns throughout the week, with no major change during nights or weekends. In contrast, humans exhibit sporadic posting behaviors. Moreover, bots have a larger number of followers than normal users.[15] To automate the creation of bots, usernames are chosen randomly from a dictionary, a set of keywords, or popular names. As a result, these usernames are of low quality, and their text and length make them easy to differentiate from real usernames.[16] Bots also use social media as a communication channel, and the traffic they generate differs from the traffic generated by normal users. By creating a profile for each user and comparing it with the user's subsequent activity, anomalous behavior can be detected.[17] These features are used individually or in combination in bot detection methods.
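
One way to quantify the contrast above between regular bot posting and sporadic human posting is the coefficient of variation of the gaps between consecutive posts: values near zero indicate clockwork regularity. This is a minimal sketch of that single feature, not a method taken from the cited studies; the function name `interval_cv` and the timestamps (in seconds) are invented for illustration.

```python
from statistics import mean, pstdev
from typing import List

def interval_cv(timestamps: List[float]) -> float:
    """Coefficient of variation of the gaps between consecutive posts."""
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if len(gaps) < 2 or mean(gaps) == 0:
        raise ValueError("need at least three posts at distinct times")
    return pstdev(gaps) / mean(gaps)

# Hypothetical account that posts exactly once per hour.
bot_times = [0, 3600, 7200, 10800, 14400]

# Hypothetical account that posts in a burst, pauses, then posts again.
human_times = [0, 300, 900, 30000, 30600]

bot_cv = interval_cv(bot_times)
human_cv = interval_cv(human_times)
```

A sophisticated bot can add random jitter to defeat this feature on its own, which is one reason timing is combined with follower counts, username quality, and traffic profiles.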

Japan Earthquake and Tsunami 2011

The earthquake and tsunami that struck Japan in 2011 illustrate the power and potential of social media as a disaster relief tool, as well as the limitations and problems associated with its use.

On March 11, 2011, a magnitude 9.0 earthquake struck 45 miles east of Tohoku, Japan. An hour later, the first tsunami wave triggered by the earthquake hit Japan's coastline, and the tsunami caused a cooling system failure at the Fukushima Daiichi Nuclear Power Plant. As a consequence of the earthquake and tsunami, more than 15,000 people died and 2,500 remained missing.[18]

Disaster relief efforts for the Japan earthquake were widely reflected on social media. “Disaster Response on Facebook” is a Facebook page whose goal is to facilitate preparation for and recovery from natural disasters. In the aftermath of the earthquake, the page provided links to US Government information pages, requests for donations, encouragement for volunteers to help campaigns supporting the Japanese victims via the “Causes” Facebook application, and news on nuclear safety. On Twitter, the US State Department published emergency numbers and information that Japanese residents of the USA could use to contact their families in Japan. Volunteers also published train schedules, the locations of shelters, and the Red Cross SMS number for donations.[19]

Along with all the informative posts on social media after the earthquake, there were also rumors that caused distress. Following the earthquake, a fire and explosion occurred at the Cosmo Oil refinery; the fire was extinguished after 10 days. Rumors spread on Twitter that hazardous substances were floating in the air and would come down with rain. An example is “A fire occurred at Plant of Cosmo Oil will cause of rain with toxic substance. Pay attentions to rain. Have to carry Umbrellas.” The rumor was later reported to be baseless by both the Japanese Government and Cosmo, and this was also reflected in tweets such as “Oil co. Cosmo says rumors spreading about toxic rain from tank fire in Chiba are false http://www.cosmo-oil.co.jp/ ‘There is no such fact’”.

To prevent the large-scale spread of such rumors, social media users need to be guided towards “Information Leaders.”[20] By following Information Leaders, users obtain more information of higher quality. Two main characteristics of this subset of users are their location and the topics of their posts: they are located at the site of the disaster, and their posts cover the majority of the topics being discussed about the disaster. The topics of posts can be easily extracted using machine learning methods. However, only 1% of posts carry location information, so discovering where a post was published is a challenge. To overcome this challenge, methods have been proposed to infer the location of tweets from their characteristics. Tweets published from crisis regions after the Japan earthquake were twice as likely to be sent from mobile devices. They were also less likely to seek visibility by using multiple hashtags and more likely to include URLs directing users to related photos and videos.[21]

Looking Ahead

Social media is a powerful platform for communication and information dissemination. A huge number of posts are published every day on a wide variety of topics. In times of crisis, people talk about what they have witnessed, what they need, and what they can offer. This valuable information is usually lost in the flood of content. To gain useful information, spam, misinformation, and rumors should be filtered out. We have introduced these forms of unwanted content, emphasized the importance of removing them, and given an overview of methods for doing so. Several challenges remain:

• Social media administrators want to detect and eliminate as much unwanted content as possible (high recall). On the other hand, they want very high confidence in their detection to avoid marking any normal content as malicious (high precision). Unfortunately, current methods sacrifice recall for precision to prevent misclassification. This has led to a huge amount of spam on social media. Designing new methods that achieve high recall without drastically decreasing precision remains a challenge in this field.

• Large populations of bots are active on major social networks. Current detection methods are implemented based on the assumption that malicious users’ behaviors are consistent over time. However, bots change their behaviors quickly to avoid suspension. To overcome this challenge, detection methods need to have the ability to be retrained frequently. Incremental Learning is a scalable approach for updating the learned model with the arrival of each new example and this technique may be useful in improving the performance of bot detection models.

• Methods of acquiring ground truth automatically and continuously are also important to overcome the challenge of changing behaviors of malicious users. An approach for continuously collecting ground truth is using honeypots. Honeypots are bots which are created to lure malicious users to follow them. As they do not show normal behavior, there is a low probability that legitimate users would follow them. Hence, all the followers of a honeypot are considered malicious.

• To detect rumors and misinformation on social media, content and diffusion patterns are used. Besides detecting the unwanted content, it is also important to find its source. A user might receive a single piece of information along several paths, and it is not trivial to find the neighbor who influenced the user to spread that information. A malicious user (or group of users) can spread many posts containing rumors and misinformation, and it is of significant importance to find and eliminate those sources.
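
The precision and recall trade-off raised in the first challenge above can be made concrete with a small calculation. The sketch below assumes a hypothetical detector that assigns each post a spam score between 0 and 1; the scores, the labels (1 for spam, 0 for normal), and the two thresholds are invented for illustration.

```python
from typing import List, Tuple

def precision_recall(
    scores: List[float], labels: List[int], threshold: float
) -> Tuple[float, float]:
    """Precision and recall when posts scoring >= threshold are flagged."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0  # nothing flagged
    recall = tp / (tp + fn) if tp + fn else 0.0     # no spam present
    return precision, recall

# Hypothetical detector output: spam score per post and the true label.
scores = [0.95, 0.90, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

# A strict threshold flags only near-certain spam.
strict = precision_recall(scores, labels, 0.8)

# A lenient threshold catches all the spam but flags one normal post.
lenient = precision_recall(scores, labels, 0.35)
```

With these invented numbers, the strict setting never flags a normal post but misses half of the spam, while the lenient setting catches all the spam at the cost of one false alarm. A platform that cannot tolerate false alarms is pushed toward the strict end, which is the behavior the first challenge describes.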

Nazer, Liu, and Xue are all affiliated with Arizona State University, Tempe, AZ 85287. This research was supported, in part, by NSF grant 1461886. The information reported here does not reflect the position or the policy of the funding agency.

 


[1] “Disaster Relief,” New World Encyclopedia, August 17, 2015, accessed May 17, 2016, http://www.newworldencyclopedia.org/entry/Disaster.

[2] Matthew Zook, Mark Graham, Taylor Shelton, and Sean Gorman, “Volunteered Geographic Information and Crowdsourcing Disaster Relief: A Case Study of the Haitian Earthquake,” World Medical & Health Policy, no. 2 (2010): 7–33.

[3] Muhammad Imran, Carlos Castillo, Ji Lucas, Patrick Meier, and Sarah Vieweg, “AIDR: Artificial Intelligence for Disaster Response,” In Proceedings of the International Conference on World Wide Web (2014): 159-162.

[4] Gang Wang, Christo Wilson, Xiaohan Zhao, Yibo Zhu, Manish Mohanlal, Haitao Zheng, and Ben Y. Zhao, “Serf and Turf: Crowdturfing for Fun and Profit,” In Proceedings of the International Conference on World Wide Web (2012): 679-688.

[5] Paul Heymann, Georgia Koutrika, and Hector Garcia-Molina, “Fighting Spam on Social Web Sites: A Survey of Approaches and Future Challenges,” Institute of Electrical and Electronics Engineers Internet Computing, vol. 11, no. 6 (2007): 36-45.

[6] Yoree Koh, “Only 11% of New Twitter Users in 2012 Are Still Tweeting,” March 21, 2014, accessed May 17, 2016, http://blogs.wsj.com/digits/2014/03/21/new-report-spotlights-twitters-re....

[7] Fangzhao Wu, Jinyun Shu, Yongfeng Huang, and Zhigang Yuan, “Social Spammer and Spam Message Co-detection in Microblogging with Social Context Regularization,” In Proceedings of the Association for Computing Machinery International Conference on Information and Knowledge Management (2015): 1601-1610.

[8] Reza Zafarani and Huan Liu, “10 Bits of Surprise: Detecting Malicious Users with Minimum Information,” In Proceedings of the Association for Computing Machinery International Conference on Information and Knowledge Management (2015): 423-431.

[9] Liang Wu, Fred Morstatter, Xia Hu, and Huan Liu, “Mining Misinformation in Social Media” in Big Data in Complex and Social Networks (CRC Press, Taylor & Francis Group, 2015), 1-35.

[10] Victor Luckerson, “Fear, Misinformation, and Social Media Complicate Ebola Fight,” October 8, 2014, accessed May 17, 2016, http://time.com/3479254/ebola-social-media/.

[11] Vahed Qazvinian, Emily Rosengren, Dragomir R. Radev, and Qiaozhu Mei, “Rumor Has It: Identifying Misinformation in Microblogs,” In Proceedings of the Conference on Empirical Methods in Natural Language Processing (2011): 1589-1599.

[12] Aditi Gupta, Hemank Lamba, Ponnurangam Kumaraguru, and Anupam Joshi, “Faking Sandy: Characterizing and Identifying Fake Images on Twitter during Hurricane Sandy,” In Proceedings of the International Conference on World Wide Web (2013): 729-736.

[13] Yazan Boshmaf, Ildar Muslukhov, Konstantin Beznosov, and Matei Ripeanu, “Design and Analysis of a Social Botnet,” The International Journal of Computer and Telecommunication Networking (2013): 556-578.

[14] Marion R. Just, Ann N. Crigler, Panagiotis Takis Metaxas, and Eni Mustafaraj, “It’s Trending on Twitter - An Analysis of the Twitter Manipulations in the Massachusetts 2010 Special Senate Election,” American Political Science Association Annual Meeting Paper (2012): 1:23.

[15] Kyumin Lee, James Caverlee, and Steve Webb, “The Social Honeypot Project: Protecting Online Communities from Spammers,” In Proceedings of the International Conference on World Wide Web (2010): 1139-1140.

[16] Sangho Lee and Jong Kim, “Early Filtering of Ephemeral Malicious Accounts on Twitter,” The International Journal for the Computer and Telecommunications Industry, no 54 (2014): 48-57.

[17] Vishnu Teja Kilari, Guoliang Xue, and Lingjun Li, “Host Based Detection of Advanced Miniduke Style Bots in Smartphones through User Profiling,” In Proceedings of the Institute of Electrical and Electronics Engineers Global Communications Conference (2015): 1-6.

[18] Becky Oskin, “Japan Earthquake and Tsunami of 2011: Facts and Information,” Live Science, May 7, 2015, accessed May 18, 2016, http://www.livescience.com/39110-japan-2011-earthquake-tsunami-facts.html.

[19] Sharon Gaudin, “Twitter, Facebook become lifeline after Japan Quake,” ComputerWorld, Mar 11, 2011, accessed May 18, 2016, http://www.computerworld.com/article/2506883/web-apps/twitter--facebook-....

[20] Shamanth Kumar, Fred Morstatter, Reza Zafarani, and Huan Liu, “Whom Should I Follow?: Identifying Relevant Users During Crises,” In Proceedings of Association for Computing Machinery Conference on Hypertext and Social Media (2013): 139-147.

[21] Shamanth Kumar, Xia Hu, and Huan Liu, “A Behavior Analytics Approach to Identifying Tweets from Crisis Regions,” In Proceedings of Association for Computing Machinery Conference on Hypertext and Social Media (2014): 255-260.