{
"cells": [
{
"cell_type": "markdown",
"id": "commercial-brain",
"metadata": {},
"source": [
"# SVEPM 2021 - WS5 : Extraction of medical terms from non-structured (textual) data from online news and social media\n",
"## [Wednesday 24th March, 14.00-17.00](https://www.svepm2021.org/upload/pdf/SVEPM2021_WS5.pdf)\n",
"\n",
"[UMR TÉTIS](https://www.umr-tetis.fr/)\n",
"\n",
"Introduction of Natural Language Processing (NLP) methods applied to health domain : an overview of terminology extraction from online news and social network (Twitter).\n",
"\n",
"1. Twitter\n",
"\n",
" 1.1 Data acquisition\n",
" 1.1.1 Data description\n",
" 1.1.2 Prerequisite\n",
" 1.1.3 Data collection \n",
" - with keywords and account names\n",
" - Data retrieval from existing corpora\n",
" 1.1.4 Get data for this workshop\n",
" 1.2 Pre-process\n",
" 1.2.1 Filtering\n",
" 1.2.2 Tweets cleaning\n",
" 1.3 Terminology extraction\n",
" 1.3.1 Statistical method : TF-IDF\n",
" 1.3.2 Application of TF-IDF\n",
" 1.4 Data visualization\n",
"2. Online news\n",
"\n",
" 2.1 Data acquisition : PADI-web\n",
" \n",
" 2.2 Pre-process : Formatting data\n",
" \n",
" 2.3 Terminology extraction : TF-IDF\n",
" \n",
" 2.4 Data visualization"
]
},
{
"cell_type": "markdown",
"id": "corrected-cliff",
"metadata": {},
"source": [
"## 1.1.1 Twitter data description\n",
"\n",
"![tweet_example]()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Tweet example from Twitter doc API : https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/overview\n",
"tweet_example = {\n",
" \"created_at\": \"Thu Apr 06 15:24:15 +0000 2017\",\n",
" \"id_str\": \"850006245121695744\",\n",
" \"text\": \"1\\/ Today we\\u2019re sharing our vision for the future of the Twitter API platform!\\nhttps:\\/\\/t.co\\/XweGngmxlP\",\n",
" \"user\": {\n",
" \"id\": 2244994945,\n",
" \"name\": \"Twitter Dev\",\n",
" \"screen_name\": \"TwitterDev\",\n",
" \"location\": \"Internet\",\n",
" \"url\": \"https:\\/\\/dev.twitter.com\\/\",\n",
" \"description\": \"Your official source for Twitter Platform news, updates & events. Need technical help? Visit https:\\/\\/twittercommunity.com\\/ \\u2328\\ufe0f #TapIntoTwitter\"\n",
" },\n",
" \"place\": { \n",
" },\n",
" \"entities\": {\n",
" \"hashtags\": [ \n",
" ],\n",
" \"urls\": [\n",
" {\n",
" \"url\": \"https:\\/\\/t.co\\/XweGngmxlP\",\n",
" \"unwound\": {\n",
" \"url\": \"https:\\/\\/cards.twitter.com\\/cards\\/18ce53wgo4h\\/3xo1c\",\n",
" \"title\": \"Building the Future of the Twitter API Platform\"\n",
" }\n",
" }\n",
" ],\n",
" \"user_mentions\": [ \n",
" ]\n",
" }\n",
"}"
]
},
{
"cell_type": "code",
"source": [
"# Print tweet content and user\n",
"tweet_content = tweet_example[\"text\"]\n",
"tweet_user = tweet_example[\"user\"][\"name\"]\n",
"print(\"Print raw data: \\n\"+ tweet_content + \" from \" + tweet_user)\n",
"print(\"\\n\")\n",
"# clean tweet content : remove \"/\"\n",
"tweet_content_cleaned = tweet_example[\"text\"].replace(\"\\\\\", \"\")\n",
"print(\"Pre-process tweet: \\n\"+ tweet_content_cleaned + \" from \" + tweet_user)"
]
},
{
"cell_type": "markdown",
"id": "terminal-jurisdiction",
"metadata": {},
"source": [
"## 1.1.2 Prerequisite\n",
"\n",
"Twitter data contain personal and sensible data. We have to be compliant with [GPDR](https://en.wikipedia.org/wiki/General_Data_Protection_Regulation) and [Twitter terms and uses](https://twitter.com/en/tos#intlTerms)\n",
"\n",
"Registred users of Twitter gave their consent to Twitter's policy. They consent their data may be used for **research works**. However as soon as they change their visibility/privacy (i.e. they withdraw their consent), we are not allowed to used their data anymore. Maintaining a dataset of tweets implies to **synchronise in real time with Twitter's API**.\n",
"\n",
"To retrieve tweets automatically, we have to apply for a [Twitter dev account](https://developer.twitter.com/en/apply-for-access). To do so, we have to explain why we want an access and describe applications that will used our credentials.\n",
"## 1.1.3 Data collection \n",
"### with keywords and account names\n",
"\n",
"You can find the [script that collects tweets for MOOD](https://gitlab.irstea.fr/umr-tetis/mood/mood-tetis-tweets-collect). You can clone this repository\n"
]
},
{
"cell_type": "code",
"source": [
"# clone MOOD repository:\n",
"!git clone https://gitlab.irstea.fr/umr-tetis/mood/mood-tetis-tweets-collect.git\n"
]
},
{
"cell_type": "code",
"source": [
"# Print MOOD keywords :\n",
"import pandas as pd\n",
"mood_keywords = pd.read_csv(\"mood-tetis-tweets-collect/params/keywordsFilter.csv\")\n",
"# Group by disease\n",
"mood_diseases = mood_keywords.groupby(\"syndrome\")\n",
"for disease, keywords in mood_diseases:\n",
" print(mood_diseases.get_group(disease)[\"hashtags\"].tolist(), \"\\n\")"
]
},
{
"cell_type": "markdown",
"id": "thrown-collective",
"metadata": {},
"source": [
"### Data retrieval from existing corpora\n",
"\n",
"To be compliant with Twitter terms and uses, we can't share tweet content nor data. \n",
"Only tweet IDs can be shared from which we can retrieve all tweet contents and metadata. It's called **hydrating tweet**.\n",
"To do so, you can use the command line tool [twarc](https://github.com/DocNow/twarc). You must set your credentials and then hydrate tweets : `twarc hydrate tweet-ids.txt > tweets.jsonl`\n",
"\n",
"## 1.1.4 Get data for this workshop\n",
"\n",
"For this workshop, we are going to use a corpus of tweets in Licence CC0 (Public Domain) from [kaggle platform](https://www.kaggle.com/gpreda/pfizer-vaccine-tweets).\n",
"**If you have already a kaggle account, you can download the dataset from the link below or you can download from this link [filesender](https://filesender.renater.fr/?s=download&token=1706766d-676e-4823-a1b4-665067e5fc81#), password will be given during the workshop**.\n",
"\n",
"Then, upload the downloaded file in the data directory of your environnement :\n",
""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "verified-defeat",
"metadata": {},
"outputs": [],
"source": [
"# import pandas library: facilitates the use of tables or matrix\n",
"import pandas as pd\n",
"# import data\n",
"tweets = pd.read_csv(\"data/vaccination_tweets.csv\")\n",
"# show metadata:\n",
"metadata_attributes = tweets.keys()\n",
"print(metadata_attributes)"
]
},
{
"cell_type": "markdown",
"id": "developing-night",
"metadata": {},
"source": [
" ## 1.2 Pre-process\n",
" ### 1.2.1 Filtering"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "activated-abortion",
"metadata": {},
"outputs": [],
"source": [
"# Dataset exploration:\n",
"# count nb of tweets \n",
"tweets_count = tweets.count()\n",
"print(tweets_count)\n",
"\n",
"# compute % of retweets in the dataset\n",
"rt_tweets = tweets[tweets[\"retweets\"] == True]\n",
"percent_of_rt = len(rt_tweets) * 100 / tweets_count[\"id\"]\n",
"print(\"\\nPercent of retweets in this corpus: \" + str(percent_of_rt))\n",
"print(\"\\n\")\n",
"\n",
"# print tweet contents\n",
"print(tweets[\"text\"].values)"
]
},
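{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A possible filtering step (illustrative sketch, not the workshop's official solution) :\n",
"# keep only tweets whose text mentions a keyword of interest, e.g. \"vaccine\".\n",
"keyword = \"vaccine\"\n",
"filtered_tweets = tweets[tweets[\"text\"].str.contains(keyword, case=False, na=False)]\n",
"print(\"Tweets mentioning '\" + keyword + \"': \" + str(len(filtered_tweets)))"
]
},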
{
"cell_type": "markdown",
"id": "whole-belle",
"metadata": {},
"source": [
"### 1.2.2 Tweets cleaning"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "convinced-influence",
"metadata": {},
"outputs": [],
"source": [
"# remove url"
]
},
"metadata": {},
"source": [
" ## 1.3 Terminology extraction\n",
"\n",
" ### 1.3.1 Statistical method : TF-IDF\n",
" \n",
" **TF-IDF (Term Frequency - Inverse Document Frequency)** is a stastical measure which reflects how important a word is to a document in a corpus.\n",
"\n",
"\n",
"with :\n",
"\n",
"+ t: a term\n",
"+ d: a document\n",
"+ D : Corpus or collection of documents : D = d1, d2, ..., dn\n",
"+ freq(t,d) : frequence of the term t in document d\n",
"+ |d| : the number of terms in d\n",
"+ |D| : the number of document in D\n",
"+ |{d|t ∈ d}| : the number of documents that contain the term t\n",
"\n",
"TF-IDF gives good scores to terms that are frequently used in a single document. If a term is widely used in the corpus, IDF will be very low, it will reduces TF-IDF score of a term. Indeed, IDG varies according to log(1/x) function : Here with |D| = 30 so IDF = log(30/x)\n",
""
]
},
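{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch (assumption : a corpus of |D| = 30 documents, as in the text above) :\n",
"# plot IDF = log(30/x), where x is the number of documents containing a term,\n",
"# to show that IDF drops quickly when a term is widely used in the corpus.\n",
"%matplotlib inline\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"x = np.arange(1, 31)\n",
"idf = np.log(30 / x)\n",
"\n",
"plt.figure(figsize=(8, 4))\n",
"plt.plot(x, idf)\n",
"plt.xlabel(\"number of documents containing the term\")\n",
"plt.ylabel(\"IDF = log(30/x)\")\n",
"plt.show()"
]
},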
{
"cell_type": "markdown",
"id": "ecological-clone",
"metadata": {},
"source": [
" ## 1.3.2 Application of TF-IDF"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "direct-compatibility",
"metadata": {},
"outputs": [],
"source": [
"# import sci-kit-learn library\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"\n",
"# initialise TF-IDF parameters\n",
"vectorizer = TfidfVectorizer(\n",
" stop_words=\"english\",\n",
" max_features=1000,\n",
" ngram_range=(1, 3),\n",
" token_pattern='[a-zA-Z0-9#]+',\n",
")\n",
"\n",
"# Apply TF-IDF:\n",
"vectors = vectorizer.fit_transform(tweets[\"text\"])\n",
"vocabulary = vectorizer.get_feature_names()\n",
"\n",
"# Uncompress TF-IDF matrix into a sparse matrix \n",
"dense = vectors.todense()\n",
"denselist = dense.tolist()\n",
"tf_idf_matrix = pd.DataFrame(denselist, columns=vocabulary)\n",
"# print(tf_idf_matrix)\n",
"\n",
"# Tranpose matrix and get frequencies per term (rather than term frequencies per document)\n",
"tfidf_score_per_terms = tf_idf_matrix.T.sum(axis=1)\n",
"print(tfidf_score_per_terms)"
]
},
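{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Quick sanity check (sketch) : list the 20 terms with the highest cumulated TF-IDF scores,\n",
"# i.e. the terms that the wordcloud below will emphasise the most.\n",
"print(tfidf_score_per_terms.sort_values(ascending=False).head(20))"
]
},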
{
"cell_type": "markdown",
"id": "separate-swaziland",
"metadata": {},
"source": [
"## 1.4 Data visualization"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "standing-concentration",
"metadata": {},
"outputs": [],
"source": [
"# import dataviz libraries:\n",
"%matplotlib inline \n",
"import matplotlib.pyplot as plt\n",
"from wordcloud import WordCloud\n",
"\n",
"# initiate a wordcloud\n",
"wordcloud = WordCloud(\n",
" background_color=\"white\", \n",
" width=1600, \n",
" height=800,\n",
" max_words=50, \n",
")\n",
"\n",
"# compute the wordcloud\n",
"wordcloud.generate_from_frequencies(tfidf_score_per_terms)\n",
"plt.figure(figsize=(20, 10))\n",
"plt.imshow(wordcloud)\n",
"plt.show()"
]
},
"metadata": {},
"source": [
"## 2. Online news\n",
"\n",
"Now, let's try to extract terms from media. To do so we are going to focus on African Swine Fewer (ASF) in news arcticles.\n",
"\n",
"### 2.1 Data collection : PADI-web\n",
"\n",
"**Platform for Automated extraction of Disease Information from the web.** [[1](http://agritrop.cirad.fr/588533/1/Arsevska_et_al_PlosOne.pdf)]. [Link to PADI-web website](https://padi-web.cirad.fr/). PADI-web automatically collects news, classifies them and extracts epidemiological information (diseases, dates, symptoms, hosts and locations).\n",
"\n",
"We are going to use a subset of PADI-web focused on ASF [[2](https://dataverse.cirad.fr/dataset.xhtml?persistentId=doi:10.18167/DVN1/POIZMA)]\n",
"\n",
"[1] : Web monitoring of emerging animal infectious diseases integrated in the French Animal Health Epidemic Intelligence System.\n",
"Arsevska Elena, Valentin Sarah, Rabatel Julien, De Goër de Hervé Jocelyn, Falala Sylvain, Lancelot Renaud, Roche Mathieu.\n",
"PloS One, 13 (8) e0199960, 25 p., 2018\n",
"[http://agritrop.cirad.fr/588533/]\n",
"\n",
"[2] : PADI-web: ASF corpora :\n",
"Both corpora (news articles) have been manually collected using the query \"african swine fever outbreak\" with Google. These corpora in English have been semi-automatically normalized. They can be used as (a) input of BioTex tool in order to extract terminology, (b) input of Weka tool for data-mining tasks. Description: (1) ASFcorpus_epidemio.txt: 69 news about epidemiology aspects. The news contain a principal information of suspicion or confirmation of ASF, unknown disease or unexplained clinical signs in animals of the pig species, with a description of the event, such as place, time, number and species affected and clinical signs place, time, number and species affected and clinical signs (period: 2012-2013). (2) ASFcorpus_eco.txt: 69 news about socio-economic impact of an ASF outbreak to a country or a region, and a secondary information about the event (period: 2012-2014). (3) ASF_corpus_weka_final.arff: corpus (epidemio + socio-economic data) based on Weka format (ARFF file) for data mining tasks, e.g. classification. (2018-08-20) "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Download PADI-web ASF Corpus :\n",
"!curl -o ./data/padiweb_asf_corpus.txt https://dataverse.cirad.fr/api/access/datafile/2349?gbrecs=true"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Print ASF file\n",
"file = open(\"./data/padiweb_asf_corpus.txt\", \"r\")\n",
"for line_count, line in enumerate(file):\n",
" print(\"l.\" + str(line_count) + \": \" + line)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.2 Pre-process\n",
"\n",
"PADI-web corpus is already clean and filtred. We just need to format the PADI-web corpus into a python table. Each row will be a article."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"asf_corpus_table = pd.DataFrame()\n",
"\n",
"file = open(\"./data/padiweb_asf_corpus.txt\", \"r\")\n",
"for line in file:\n",
" if line is not \"\" and \"##########END##########\" not in line and line is not '\\n':\n",
" asf_corpus_table = asf_corpus_table.append({\"articles\" : line}, ignore_index=True)\n",
"\n",
"print(asf_corpus_table)\n",
"print(\"Number of articles: \" + str(len(asf_corpus_table)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.3 Terminology extraction : TF-IDF\n",
"\n",
"As we did for the tweet corpus, we are going to apply TF-IDF to extract descriminant terms."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# import sci-kit-learn library\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"\n",
"# initialise TF-IDF parameters\n",
"vectorizer = TfidfVectorizer(\n",
" stop_words=\"english\",\n",
" max_features=1000,\n",
" ngram_range=(1, 3),\n",
")\n",
"\n",
"# Apply TF-IDF:\n",
"vectors = vectorizer.fit_transform(asf_corpus_table[\"articles\"])\n",
"vocabulary = vectorizer.get_feature_names()\n",
"\n",
"# Uncompress TF-IDF matrix into a sparse matrix \n",
"dense = vectors.todense()\n",
"denselist = dense.tolist()\n",
"asf_tf_idf_matrix = pd.DataFrame(denselist, columns=vocabulary)\n",
"# print(asf_tf_idf_matrix)\n",
"\n",
"# Tranpose matrix and get frequencies per term (rather than term frequencies per document)\n",
"asf_tfidf_score_per_terms = asf_tf_idf_matrix.T.sum(axis=1)\n",
"print(asf_tfidf_score_per_terms)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 2.4 Data visualization"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# import dataviz libraries:\n",
"%matplotlib inline \n",
"import matplotlib.pyplot as plt\n",
"from wordcloud import WordCloud\n",
"\n",
"# initiate a wordcloud\n",
"wordcloud = WordCloud(\n",
" background_color=\"white\", \n",
" width=1600, \n",
" height=800,\n",
" max_words=50, \n",
")\n",
"\n",
"# compute the wordcloud\n",
"wordcloud.generate_from_frequencies(asf_tfidf_score_per_terms)\n",
"plt.figure(figsize=(20, 10))\n",
"plt.imshow(wordcloud)\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}