%% Cell type:markdown id:commercial-brain tags:
# SVEPM 2021 - WS5 : Extraction of medical terms from non-structured (textual) data from online news and social media
## [Wednesday 24th March, 14.00-17.00](https://www.svepm2021.org/upload/pdf/SVEPM2021_WS5.pdf)
[UMR TÉTIS](https://www.umr-tetis.fr/)
Introduction to Natural Language Processing (NLP) methods applied to the health domain: an overview of terminology extraction from online news and a social network (Twitter).
1. Twitter
1.1 Data acquisition
1.1.1 Data description
1.1.2 Prerequisites
1.1.3 Data collection
- With keywords and account names
- Data retrieval from existing corpora
1.1.4 Get data for this workshop
1.2 Pre-process
1.2.1 Filtering
1.2.2 Tweet cleaning
1.3 Terminology extraction
1.3.1 Statistical method: TF-IDF
1.3.2 Application of TF-IDF
1.4 Data visualization
2. Online news
2.1 Data acquisition: PADI-web
2.2 Pre-process: Formatting data
2.3 Terminology extraction: TF-IDF
2.4 Data visualization
%% Cell type:markdown id:corrected-cliff tags:
## 1.1.1 Twitter data description
![tweet_example](readme_ressources/twitter_example.png)
%% Cell type:code id:complex-juice tags:
``` python
# Tweet example from Twitter doc API : https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/overview
tweet_example = {
"created_at": "Thu Apr 06 15:24:15 +0000 2017",
"id_str": "850006245121695744",
"text": "1\/ Today we\u2019re sharing our vision for the future of the Twitter API platform!\nhttps:\/\/t.co\/XweGngmxlP",
"user": {
"id": 2244994945,
"name": "Twitter Dev",
"screen_name": "TwitterDev",
"location": "Internet",
"url": "https:\/\/dev.twitter.com\/",
"description": "Your official source for Twitter Platform news, updates & events. Need technical help? Visit https:\/\/twittercommunity.com\/ \u2328\ufe0f #TapIntoTwitter"
},
"place": {
},
"entities": {
"hashtags": [
],
"urls": [
{
"url": "https:\/\/t.co\/XweGngmxlP",
"unwound": {
"url": "https:\/\/cards.twitter.com\/cards\/18ce53wgo4h\/3xo1c",
"title": "Building the Future of the Twitter API Platform"
}
}
],
"user_mentions": [
]
}
}
```
%% Cell type:code id:outer-lover tags:
``` python
# Print tweet content and user
tweet_content = tweet_example["text"]
tweet_user = tweet_example["user"]["name"]
print("Print raw data: \n"+ tweet_content + " from " + tweet_user)
print("\n")
# clean tweet content: remove the escaped slashes ("\")
tweet_content_cleaned = tweet_example["text"].replace("\\", "")
print("Pre-process tweet: \n"+ tweet_content_cleaned + " from " + tweet_user)
```
%% Cell type:markdown id:terminal-jurisdiction tags:
## 1.1.2 Prerequisites
Twitter data contain personal and sensitive data. We have to be compliant with the [GDPR](https://en.wikipedia.org/wiki/General_Data_Protection_Regulation) and [Twitter's terms of service](https://twitter.com/en/tos#intlTerms).
Registered users of Twitter gave their consent to Twitter's policy. They consent to their data being used for **research works**. However, as soon as they change their visibility/privacy settings (i.e. they withdraw their consent), we are no longer allowed to use their data. Maintaining a dataset of tweets therefore implies **synchronising in real time with Twitter's API**.
To retrieve tweets automatically, we have to apply for a [Twitter dev account](https://developer.twitter.com/en/apply-for-access). To do so, we have to explain why we want access and describe the applications that will use our credentials.
## 1.1.3 Data collection
### With keywords and account names
You can find the [script that collects tweets for MOOD](https://gitlab.irstea.fr/umr-tetis/mood/mood-tetis-tweets-collect) and clone its repository:
%% Cell type:code id:magnetic-arrival tags:
``` python
# clone MOOD repository:
!git clone https://gitlab.irstea.fr/umr-tetis/mood/mood-tetis-tweets-collect.git
```
%% Cell type:code id:exempt-distance tags:
``` python
# Print MOOD keywords:
import pandas as pd
mood_keywords = pd.read_csv("mood-tetis-tweets-collect/params/keywordsFilter.csv")
# Group keywords by disease
mood_diseases = mood_keywords.groupby("syndrome")
for disease, keywords in mood_diseases:
    print(keywords["hashtags"].tolist(), "\n")
```
%% Cell type:markdown id:thrown-collective tags:
### Data retrieval from existing corpora
To be compliant with Twitter's terms of service, we can't share tweet contents or data.
Only tweet IDs can be shared, from which we can retrieve all tweet contents and metadata. This is called **tweet hydration**.
To do so, you can use the command line tool [twarc](https://github.com/DocNow/twarc). You must set your credentials and then hydrate the tweets: `twarc hydrate tweet-ids.txt > tweets.jsonl`
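Hydration can also be scripted with twarc's Python API. Below is a minimal sketch (twarc v1); the four credential strings and the `tweet-ids.txt` file are placeholders you must supply:
%% Cell type:code id:hydration-sketch tags:
``` python
# a minimal sketch of tweet hydration with the twarc (v1) Python API
# the four credential strings are placeholders for your own Twitter dev account keys
from twarc import Twarc

t = Twarc("consumer_key", "consumer_secret", "access_token", "access_token_secret")
# hydrate() takes an iterable of tweet IDs and yields the full tweet objects
for tweet in t.hydrate(open("tweet-ids.txt")):
    print(tweet["full_text"])
```
%% Cell type:markdown id:workshop-data tags: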
## 1.1.4 Get data for this workshop
For this workshop, we are going to use a corpus of tweets under a CC0 (Public Domain) licence from the [kaggle platform](https://www.kaggle.com/gpreda/pfizer-vaccine-tweets).
**If you already have a kaggle account, you can download the dataset from the link above; otherwise you can download it from this [filesender](https://filesender.renater.fr/?s=download&token=1706766d-676e-4823-a1b4-665067e5fc81#) link. The password will be given during the workshop.**
Then, upload the downloaded file into the data directory of your environment:
![upload_tweets](readme_ressources/svepm_upload_dataset.gif)
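If you prefer the command line, the kaggle CLI can fetch the dataset directly; a sketch assuming the CLI is installed (`pip install kaggle`) and your API token is configured in `~/.kaggle/kaggle.json`:
%% Cell type:code id:kaggle-cli tags:
``` python
# a sketch: download and unzip the dataset into the data directory with the kaggle CLI
!kaggle datasets download -d gpreda/pfizer-vaccine-tweets -p data --unzip
```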
%% Cell type:code id:verified-defeat tags:
``` python
# import the pandas library: facilitates the use of tables and matrices
import pandas as pd
# import data
tweets = pd.read_csv("data/vaccination_tweets.csv")
# show metadata:
metadata_attributes = tweets.keys()
print(metadata_attributes)
```
%% Cell type:markdown id:developing-night tags:
## 1.2 Pre-process
### 1.2.1 Filtering
%% Cell type:code id:activated-abortion tags:
``` python
# Dataset exploration:
# count the number of non-null values per column
tweets_count = tweets.count()
print(tweets_count)
# compute the % of retweets in the dataset ("is_retweet" is the boolean column; "retweets" is a count)
rt_tweets = tweets[tweets["is_retweet"] == True]
percent_of_rt = len(rt_tweets) * 100 / tweets_count["id"]
print("\nPercent of retweets in this corpus: " + str(percent_of_rt))
print("\n")
# print tweet contents
print(tweets["text"].values)
```
%% Cell type:markdown id:whole-belle tags:
### 1.2.2 Tweet cleaning
%% Cell type:code id:convinced-influence tags:
``` python
# remove url
```
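%% Cell type:markdown id:url-removal-note tags:
A minimal sketch of URL removal with a regular expression; the `text_cleaned` column name is our own choice, not part of the dataset:
%% Cell type:code id:url-removal tags:
``` python
# a minimal sketch: strip http(s) URLs (including t.co short links) from each tweet
tweets["text_cleaned"] = tweets["text"].str.replace(r"https?://\S+", "", regex=True)
print(tweets["text_cleaned"].values)
```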
%% Cell type:markdown id:therapeutic-weekend tags:
## 1.3 Terminology extraction
### 1.3.1 Statistical method : TF-IDF
**TF-IDF (Term Frequency - Inverse Document Frequency)** is a statistical measure which reflects how important a word is to a document in a corpus.
![equation](readme_ressources/tf-idf.png)
with:
+ t: a term
+ d: a document
+ D: the corpus, i.e. a collection of documents: D = d1, d2, ..., dn
+ freq(t,d): frequency of the term t in document d
+ |d|: the number of terms in d
+ |D|: the number of documents in D
+ |{d | t ∈ d}|: the number of documents that contain the term t
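Written out with these notations (a standard formulation of TF-IDF, consistent with the equation image above):
$$\mathrm{tfidf}(t, d, D) = \underbrace{\frac{freq(t,d)}{|d|}}_{\mathrm{TF}} \times \underbrace{\log\left(\frac{|D|}{|\{d \mid t \in d\}|}\right)}_{\mathrm{IDF}}$$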
TF-IDF gives high scores to terms that are frequent within a single document. If a term is widely used across the corpus, its IDF will be very low, which reduces its TF-IDF score. Indeed, IDF varies as a log(1/x) function: here with |D| = 30, IDF = log(30/x).
![idf](readme_ressources/idf.png)
%% Cell type:markdown id:ecological-clone tags:
### 1.3.2 Application of TF-IDF
%% Cell type:code id:direct-compatibility tags:
``` python
# import sci-kit-learn library
from sklearn.feature_extraction.text import TfidfVectorizer
# initialise TF-IDF parameters
vectorizer = TfidfVectorizer(
stop_words="english",
max_features=1000,
ngram_range=(1, 3),
token_pattern='[a-zA-Z0-9#]+',
)
# Apply TF-IDF:
vectors = vectorizer.fit_transform(tweets["text"])
vocabulary = vectorizer.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0
# Convert the sparse TF-IDF matrix into a dense one
dense = vectors.todense()
denselist = dense.tolist()
tf_idf_matrix = pd.DataFrame(denselist, columns=vocabulary)
# print(tf_idf_matrix)
# Transpose the matrix and sum TF-IDF scores per term (rather than per document)
tfidf_score_per_terms = tf_idf_matrix.T.sum(axis=1)
print(tfidf_score_per_terms)
```
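%% Cell type:markdown id:top-terms-note tags:
Before the wordcloud, a quick sanity check: sort the summed TF-IDF scores to see which terms stand out.
%% Cell type:code id:top-terms tags:
``` python
# print the 20 terms with the highest cumulated TF-IDF scores
print(tfidf_score_per_terms.sort_values(ascending=False).head(20))
```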
%% Cell type:markdown id:separate-swaziland tags:
## 1.4 Data visualization
%% Cell type:code id:standing-concentration tags:
``` python
# import dataviz libraries:
%matplotlib inline
import matplotlib.pyplot as plt
from wordcloud import WordCloud
# initiate a wordcloud
wordcloud = WordCloud(
background_color="white",
width=1600,
height=800,
max_words=50,
)
# compute the wordcloud
wordcloud.generate_from_frequencies(tfidf_score_per_terms)
plt.figure(figsize=(20, 10))
plt.imshow(wordcloud)
plt.show()
```
%% Cell type:markdown id:brilliant-humidity tags:
## 2. Online news
Now, let's try to extract terms from the media. To do so, we are going to focus on African Swine Fever (ASF) in news articles.
### 2.1 Data acquisition: PADI-web
**Platform for Automated extraction of Disease Information from the web.** [[1](http://agritrop.cirad.fr/588533/1/Arsevska_et_al_PlosOne.pdf)]. [Link to PADI-web website](https://padi-web.cirad.fr/). PADI-web automatically collects news, classifies them and extracts epidemiological information (diseases, dates, symptoms, hosts and locations).
We are going to use a subset of PADI-web focused on ASF [[2](https://dataverse.cirad.fr/dataset.xhtml?persistentId=doi:10.18167/DVN1/POIZMA)].
[1] : Web monitoring of emerging animal infectious diseases integrated in the French Animal Health Epidemic Intelligence System.
Arsevska Elena, Valentin Sarah, Rabatel Julien, De Goër de Hervé Jocelyn, Falala Sylvain, Lancelot Renaud, Roche Mathieu.
PLoS One, 13 (8) e0199960, 25 p., 2018.
[http://agritrop.cirad.fr/588533/]
[2] : PADI-web: ASF corpora.
Both corpora (news articles) were manually collected with Google using the query "african swine fever outbreak". These corpora in English have been semi-automatically normalized. They can be used as (a) input of the BioTex tool for terminology extraction, or (b) input of the Weka tool for data-mining tasks. Description:
1. ASFcorpus_epidemio.txt: 69 news articles about epidemiological aspects. The news contain a principal information of suspicion or confirmation of ASF, unknown disease or unexplained clinical signs in animals of the pig species, with a description of the event, such as place, time, number and species affected, and clinical signs (period: 2012-2013).
2. ASFcorpus_eco.txt: 69 news articles about the socio-economic impact of an ASF outbreak on a country or region, with secondary information about the event (period: 2012-2014).
3. ASF_corpus_weka_final.arff: corpus (epidemiological + socio-economic data) in Weka format (ARFF file) for data-mining tasks, e.g. classification. (2018-08-20)
%% Cell type:code id:literary-buddy tags:
``` python
# Download the PADI-web ASF corpus:
!curl -o ./data/padiweb_asf_corpus.txt "https://dataverse.cirad.fr/api/access/datafile/2349?gbrecs=true"
```
%% Cell type:code id:lucky-party tags:
``` python
# Print the ASF file line by line
with open("./data/padiweb_asf_corpus.txt", "r") as file:
    for line_count, line in enumerate(file):
        print("l." + str(line_count) + ": " + line, end="")
```
%% Cell type:markdown id:developed-tower tags:
### 2.2 Pre-process
The PADI-web corpus is already cleaned and filtered. We just need to format it into a Python table, with one article per row.
%% Cell type:code id:present-recycling tags:
``` python
import pandas as pd
# collect non-empty lines, skipping the article separator
articles = []
with open("./data/padiweb_asf_corpus.txt", "r") as file:
    for line in file:
        line = line.strip()
        if line != "" and "##########END##########" not in line:
            articles.append(line)
asf_corpus_table = pd.DataFrame({"articles": articles})
print(asf_corpus_table)
print("Number of articles: " + str(len(asf_corpus_table)))
```
%% Cell type:markdown id:mexican-chicken tags:
### 2.3 Terminology extraction : TF-IDF
As we did for the tweet corpus, we are going to apply TF-IDF to extract discriminant terms.
%% Cell type:code id:basic-explosion tags:
``` python
# import sci-kit-learn library
from sklearn.feature_extraction.text import TfidfVectorizer
# initialise TF-IDF parameters
vectorizer = TfidfVectorizer(
stop_words="english",
max_features=1000,
ngram_range=(1, 3),
)
# Apply TF-IDF:
vectors = vectorizer.fit_transform(asf_corpus_table["articles"])
vocabulary = vectorizer.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0
# Convert the sparse TF-IDF matrix into a dense one
dense = vectors.todense()
denselist = dense.tolist()
asf_tf_idf_matrix = pd.DataFrame(denselist, columns=vocabulary)
# print(asf_tf_idf_matrix)
# Transpose the matrix and sum TF-IDF scores per term (rather than per document)
asf_tfidf_score_per_terms = asf_tf_idf_matrix.T.sum(axis=1)
print(asf_tfidf_score_per_terms)
```
%% Cell type:markdown id:domestic-significance tags:
### 2.4 Data visualization
%% Cell type:code id:identified-separation tags:
``` python
# import dataviz libraries:
%matplotlib inline
import matplotlib.pyplot as plt
from wordcloud import WordCloud
# initiate a wordcloud
wordcloud = WordCloud(
background_color="white",
width=1600,
height=800,
max_words=50,
)
# compute the wordcloud
wordcloud.generate_from_frequencies(asf_tfidf_score_per_terms)
plt.figure(figsize=(20, 10))
plt.imshow(wordcloud)
plt.show()
```