Commit f212a433 authored by Decoupes Remy's avatar Decoupes Remy

Explain how to upload data into a running environment

parent 6f7231b8
@@ -6,7 +6,7 @@ Introduction of Natural Language Processing (NLP) methods applied to health doma
### How to use this repository
You can launch a jupyter notebook thanks to [mybinder](https://mybinder.org/) : [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/git/https%3A%2F%2Fgitlab.irstea.fr%2Fremy.decoupes%2Fsvepm2021-ws5-textmining/HEAD)
![quick start](readme_ressources/svepm_lauch_binder.gif)
### Authors
[UMR TÉTIS](https://www.umr-tetis.fr/)
%% Cell type:markdown id:commercial-brain tags:
# SVEPM 2021 - WS5 : Extraction of medical terms from non-structured (textual) data from online news and social media
## [Wednesday 24th March, 14.00-17.00](https://www.svepm2021.org/upload/pdf/SVEPM2021_WS5.pdf)
[UMR TÉTIS](https://www.umr-tetis.fr/)
Introduction to Natural Language Processing (NLP) methods applied to the health domain : an overview of terminology extraction from online news and social networks (Twitter).
1. Twitter
1.1 Data acquisition
1.1.1 Data description
1.1.2 Prerequisite
1.1.3 Data collection
- with keywords and account names
- Data retrieval from existing corpora
1.1.4 Get data for this workshop
1.2 Pre-process
1.2.1 Filtering
1.2.2 Tweets cleaning
1.3 Terminology extraction
1.3.1 Statistical method : TF-IDF
1.3.2 Application of TF-IDF
1.4 Data visualization
2. Online news
2.1 Data acquisition
%% Cell type:markdown id:corrected-cliff tags:
## 1.1.1 Twitter data description
%% Cell type:code id:complex-juice tags:
``` python
# Tweet example from Twitter doc API : https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/overview
tweet_example = {
"created_at": "Thu Apr 06 15:24:15 +0000 2017",
"id_str": "850006245121695744",
"text": "1\/ Today we\u2019re sharing our vision for the future of the Twitter API platform!\nhttps:\/\/t.co\/XweGngmxlP",
"user": {
"id": 2244994945,
"name": "Twitter Dev",
"screen_name": "TwitterDev",
"location": "Internet",
"url": "https:\/\/dev.twitter.com\/",
"description": "Your official source for Twitter Platform news, updates & events. Need technical help? Visit https:\/\/twittercommunity.com\/ \u2328\ufe0f #TapIntoTwitter"
},
"place": {
},
"entities": {
"hashtags": [
],
"urls": [
{
"url": "https:\/\/t.co\/XweGngmxlP",
"unwound": {
"url": "https:\/\/cards.twitter.com\/cards\/18ce53wgo4h\/3xo1c",
"title": "Building the Future of the Twitter API Platform"
}
}
],
"user_mentions": [
]
}
}
```
%% Cell type:code id:outer-lover tags:
``` python
# Print tweet content and user
tweet_content = tweet_example["text"]
tweet_user = tweet_example["user"]["name"]
print("Print raw data: \n"+ tweet_content + " from " + tweet_user)
print("\n")
# clean tweet content : remove the escape backslashes ("\") left in the raw text
tweet_content_cleaned = tweet_example["text"].replace("\\", "")
print("Pre-process tweet: \n"+ tweet_content_cleaned + " from " + tweet_user)
```
%% Cell type:markdown id:terminal-jurisdiction tags:
## 1.1.2 Prerequisite
Twitter data contain personal and sensitive data. We have to comply with the [GDPR](https://en.wikipedia.org/wiki/General_Data_Protection_Regulation) and [Twitter's terms of service](https://twitter.com/en/tos#intlTerms)
Registered Twitter users gave their consent to Twitter's policy, which allows their data to be used for **research work**. However, as soon as they change their visibility/privacy settings (i.e. withdraw their consent), we are no longer allowed to use their data. Maintaining a dataset of tweets therefore implies **synchronising in real time with Twitter's API**.
To retrieve tweets automatically, we have to apply for a [Twitter dev account](https://developer.twitter.com/en/apply-for-access). To do so, we have to explain why we want access and describe the applications that will use our credentials.
## 1.1.3 Data collection
### with keywords and account names
You can find the [script that collects tweets for MOOD](https://gitlab.irstea.fr/umr-tetis/mood/mood-tetis-tweets-collect). You can clone this repository:
%% Cell type:code id:magnetic-arrival tags:
``` python
# clone MOOD repository:
!git clone https://gitlab.irstea.fr/umr-tetis/mood/mood-tetis-tweets-collect.git
```
%% Cell type:code id:exempt-distance tags:
``` python
# Print MOOD keywords :
import pandas as pd
mood_keywords = pd.read_csv("mood-tetis-tweets-collect/params/keywordsFilter.csv")
# Group by disease
mood_diseases = mood_keywords.groupby("syndrome")
for disease, keywords in mood_diseases:
    print(keywords["hashtags"].tolist(), "\n")
```
%% Cell type:markdown id:thrown-collective tags:
### Data retrieval from existing corpora
To comply with Twitter's terms of service, we cannot share tweet contents or data.
Only tweet IDs can be shared; from these IDs we can retrieve the full tweet contents and metadata. This is called **tweet hydration**.
To do so, you can use the command-line tool [twarc](https://github.com/DocNow/twarc). You must set your credentials and then hydrate the tweets : `twarc hydrate tweet-ids.txt > tweets.jsonl`
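Once hydrated, `tweets.jsonl` contains one JSON object per line (JSON Lines format). A minimal sketch for loading such a file with the standard library (the file name `tweets.jsonl` follows the twarc command above; the helper name is ours):

``` python
import json

def load_tweets(path):
    """Return a list of tweet dicts from a JSON-lines file (one object per line)."""
    tweets = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                tweets.append(json.loads(line))
    return tweets

# e.g. texts = [t["text"] for t in load_tweets("tweets.jsonl")]
```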
## 1.1.4 Get data for this workshop
For this workshop, we are going to use a tweet corpus under the CC0 (Public Domain) licence from the [kaggle platform](https://www.kaggle.com/gpreda/pfizer-vaccine-tweets).
**If you already have a kaggle account, you can download the dataset from the link above, or you can download it from this [filesender](https://filesender.renater.fr/?s=download&token=1706766d-676e-4823-a1b4-665067e5fc81#) link; the password will be given during the workshop.**
Then, upload the downloaded file into the `data` directory of your environment :
![upload_tweets](readme_ressources/svepm_upload_dataset.gif)
%% Cell type:code id:verified-defeat tags:
``` python
# import the pandas library: it eases working with tabular data (DataFrames)
import pandas as pd
# import data
tweets = pd.read_csv("data/vaccination_tweets.csv")
# show metadata:
metadata_attributes = tweets.keys()
print(metadata_attributes)
```
%% Cell type:markdown id:developing-night tags:
## 1.2 Pre-process
### 1.2.1 Filtering
%% Cell type:code id:activated-abortion tags:
``` python
# Dataset exploration:
# count the number of non-null values per column
tweets_count = tweets.count()
# print(tweets_count)
print(tweets["user_verified"].unique())
# to drop retweets (if the corpus had an "is_retweet" column):
# original_tweets = tweets[tweets["is_retweet"] != True]
```
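%% Cell type:markdown id:filtering-sketch tags:
Filtering boils down to indexing the DataFrame with a boolean condition. A sketch on a toy DataFrame (the column names `text` and `user_verified` mirror the workshop dataset; the rows are made up):
%% Cell type:code id:filtering-sketch-code tags:
``` python
import pandas as pd

# toy stand-in for the workshop dataset
toy = pd.DataFrame({
    "text": ["First dose done!", "RT @someone: vaccine news", "Feeling fine"],
    "user_verified": [True, False, True],
})

# keep only tweets posted by verified accounts
verified_only = toy[toy["user_verified"]]
print(len(verified_only))  # 2
```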
%% Cell type:markdown id:whole-belle tags:
### 1.2.2 Tweets cleaning
%% Cell type:code id:convinced-influence tags:
``` python
# remove url
```
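%% Cell type:markdown id:url-cleaning-sketch tags:
The cell above is left to fill in. One possible sketch using a simple regular expression (the pattern is our assumption, good enough for `t.co` links, not an official cleaning rule):
%% Cell type:code id:url-cleaning-sketch-code tags:
``` python
import re

# naive pattern: anything starting with http:// or https:// up to the next space
URL_PATTERN = re.compile(r"https?://\S+")

def remove_urls(text):
    """Strip URLs from a tweet and collapse leftover whitespace."""
    return " ".join(URL_PATTERN.sub("", text).split())

print(remove_urls("Vaccine news https://t.co/XweGngmxlP today"))  # Vaccine news today
```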
%% Cell type:markdown id:ecological-clone tags:
## 1.3.2 Application of TF-IDF
%% Cell type:code id:direct-compatibility tags:
``` python
# import the scikit-learn library
from sklearn.feature_extraction.text import TfidfVectorizer
# initialise TF-IDF parameters
vectorizer = TfidfVectorizer(
    stop_words="english",
    max_features=1000,
    ngram_range=(1, 3),
    token_pattern='[a-zA-Z0-9#]+',
)
# Apply TF-IDF:
vectors = vectorizer.fit_transform(tweets["text"])
vocabulary = vectorizer.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0
# Uncompress TF-IDF matrix into a sparse matrix
dense = vectors.todense()
denselist = dense.tolist()
tf_idf_matrix = pd.DataFrame(denselist, columns=vocabulary)
# print(tf_idf_matrix)
# Tranpose matrix and get frequencies per term (rather than term frequencies per document)
tfidf_score_per_terms = tf_idf_matrix.T.sum(axis=1)
print(tfidf_score_per_terms)
```
%% Cell type:markdown id:separate-swaziland tags:
## 1.4 Data visualization
%% Cell type:code id:standing-concentration tags:
``` python
# import dataviz libraries:
%matplotlib inline
import matplotlib.pyplot as plt
from wordcloud import WordCloud
# initiate a wordcloud
wordcloud = WordCloud(
    background_color="white",
    width=1600,
    height=800,
    max_words=50,
)
# compute the wordcloud
wordcloud.generate_from_frequencies(tfidf_score_per_terms)
plt.figure(figsize=(20, 10))
plt.imshow(wordcloud)
plt.show()
```
%% Cell type:code id:sound-methodology tags:
``` python
```