Commit 011c8ef0 authored by Decoupes Remy's avatar Decoupes Remy
Browse files

init repo

.gitignore 0 → 100644
# Twitter data :
data/
# MOOD git clone :
mood-tetis-tweets-collect/
# ipynb execution
.ipynb_checkpoints/
LICENSE 0 → 100644
README.md 0 → 100644
# SVEPM 2021 - WS5: Extraction of medical terms from non-structured (textual) data from online news and social media
## [Wednesday 24th March, 14.00-17.00](https://www.svepm2021.org/upload/pdf/SVEPM2021_WS5.pdf)
Introduction to Natural Language Processing (NLP) methods applied to the health domain: an overview of terminology extraction from online news and social networks (Twitter).
## Authors
[UMR TÉTIS](https://www.umr-tetis.fr/)
## License
This code is provided under the [CeCILL-B](https://cecill.info/licences/Licence_CeCILL-B_V1-en.html) free software license agreement.
%% Cell type:markdown id:commercial-brain tags:
# SVEPM 2021 - WS5: Extraction of medical terms from non-structured (textual) data from online news and social media
## [Wednesday 24th March, 14.00-17.00](https://www.svepm2021.org/upload/pdf/SVEPM2021_WS5.pdf)
[UMR TÉTIS](https://www.umr-tetis.fr/)
Introduction to Natural Language Processing (NLP) methods applied to the health domain: an overview of terminology extraction from online news and social networks (Twitter).
1. Twitter
1.1 Data acquisition
1.1.1 Data description
1.1.2 Prerequisite
1.1.3 Data collection with keywords and account names
1.1.4 Data retrieval from existing corpora
1.2 Pre-process
1.2.1 Filtering
1.2.2 Tweets cleaning
1.3 Terminology extraction
1.3.1 Statistical method : TF-IDF
1.3.2 Application of TF-IDF
1.4 Data visualization
2. Online news
2.1 Data acquisition
%% Cell type:markdown id:corrected-cliff tags:
## 1.1.1 Twitter data description
%% Cell type:code id:complex-juice tags:
``` python
# Tweet example from Twitter doc API : https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/overview
tweet_example = {
"created_at": "Thu Apr 06 15:24:15 +0000 2017",
"id_str": "850006245121695744",
"text": "1\/ Today we\u2019re sharing our vision for the future of the Twitter API platform!\nhttps:\/\/t.co\/XweGngmxlP",
"user": {
"id": 2244994945,
"name": "Twitter Dev",
"screen_name": "TwitterDev",
"location": "Internet",
"url": "https:\/\/dev.twitter.com\/",
"description": "Your official source for Twitter Platform news, updates & events. Need technical help? Visit https:\/\/twittercommunity.com\/ \u2328\ufe0f #TapIntoTwitter"
},
"place": {
},
"entities": {
"hashtags": [
],
"urls": [
{
"url": "https:\/\/t.co\/XweGngmxlP",
"unwound": {
"url": "https:\/\/cards.twitter.com\/cards\/18ce53wgo4h\/3xo1c",
"title": "Building the Future of the Twitter API Platform"
}
}
],
"user_mentions": [
]
}
}
```
%% Cell type:code id:outer-lover tags:
``` python
# Print tweet content and user
tweet_content = tweet_example["text"]
tweet_user = tweet_example["user"]["name"]
print("Print raw data: \n"+ tweet_content + " from " + tweet_user)
print("\n")
# Clean tweet content: the raw text escapes "/" as "\/", so drop the backslashes
tweet_content_cleaned = tweet_example["text"].replace("\\", "")
print("Pre-process tweet: \n"+ tweet_content_cleaned + " from " + tweet_user)
```
%% Output
Print raw data:
1\/ Today we’re sharing our vision for the future of the Twitter API platform!
https:\/\/t.co\/XweGngmxlP from Twitter Dev
Pre-process tweet:
1/ Today we’re sharing our vision for the future of the Twitter API platform!
https://t.co/XweGngmxlP from Twitter Dev
%% Cell type:markdown id:terminal-jurisdiction tags:
## 1.1.2 Prerequisite
Twitter data contain personal and sensitive data. We have to be compliant with the [GDPR](https://en.wikipedia.org/wiki/General_Data_Protection_Regulation) and [Twitter's Terms of Service](https://twitter.com/en/tos#intlTerms).
Registered Twitter users gave their consent to Twitter's policy. They consent to their data being used for **research works**. However, as soon as they change their visibility/privacy settings (i.e. they withdraw their consent), we are no longer allowed to use their data. Maintaining a dataset of tweets therefore implies **synchronising in real time with Twitter's API**.
To retrieve tweets automatically, we have to apply for a [Twitter dev account](https://developer.twitter.com/en/apply-for-access). To do so, we have to explain why we want access and describe the applications that will use our credentials.
## 1.1.3 Data collection with keywords and account names
You can find the [script that collects tweets for MOOD](https://gitlab.irstea.fr/umr-tetis/mood/mood-tetis-tweets-collect). You can clone this repository:
%% Cell type:code id:magnetic-arrival tags:
``` python
# clone MOOD repository:
!git clone https://gitlab.irstea.fr/umr-tetis/mood/mood-tetis-tweets-collect.git
```
%% Output
fatal: destination path 'mood-tetis-tweets-collect' already exists and is not an empty directory.
mood-tetis-tweets-collect/params/keywordsFilter.csv
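For context on what "collection with keywords and account names" means in practice: Twitter's v1.1 streaming filter endpoint takes comma-separated keyword phrases (`track`) and numeric user IDs (`follow`). A minimal sketch of assembling such parameters (the helper name and the keyword list are illustrative, not the MOOD script's actual code):

``` python
# Sketch: build "track" and "follow" parameters for Twitter's statuses/filter
# streaming endpoint. Function name and keyword choices are illustrative only.
def build_filter_params(keywords, account_ids):
    """Return the parameter dict expected by the v1.1 streaming filter API."""
    return {
        "track": ",".join(keywords),      # comma acts as OR between phrases
        "follow": ",".join(account_ids),  # numeric user IDs, as strings
    }

params = build_filter_params(
    ["BirdFlu", "AvianInfluenza", "H5N1"],
    ["2244994945"],  # e.g. @TwitterDev's numeric ID from the example above
)
print(params["track"])  # → BirdFlu,AvianInfluenza,H5N1
```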
%% Cell type:code id:exempt-distance tags:
``` python
# Print MOOD keywords
import pandas as pd

mood_keywords = pd.read_csv("mood-tetis-tweets-collect/params/keywordsFilter.csv")
# Group keywords by disease (the "syndrome" column)
mood_diseases = mood_keywords.groupby("syndrome")
for disease, keywords in mood_diseases:
    print(keywords["hashtags"].tolist(), "\n")
```
%% Output
['AMR', 'ATB', 'AntimicrobialResistance', 'resstance', 'resistance', 'resistance', 'AntibioticResistance']
['Avian', 'BirdFlu', 'Avian', 'Bird', 'Fowl', 'HPAI', 'bird', 'AvianInfluenza', 'Avianflu', 'BirdFlu', 'FowlPlague', 'avianInfluenza']
['Chikungunya', 'Chikungunya', 'ChikungunyaFever ', 'CHIKV', 'CHIKV', 'CHIKV', 'Chikungunya', 'Chikungunya', 'Chikungunyavirus', 'ChikungunyaVirus', 'Chikungunyafever']
['MassMortalities', 'Massdie', 'MassDie_off', 'FatalIllness', 'UnknownDeath', 'unknowndeath', 'FatalIllness']
['DENV', 'DENV', 'Dengue', 'Denguefever', 'DengueFever']
['MysteriousFever', 'HaemorrhagicFever', 'FebrileIllness', 'UnknownFever']
['MassFoodPoisoning']
['Flu', 'Influenza', 'influenzavirus', 'InfluenzaVirus', 'H1N1', 'H2N2', 'H3N2', 'H3N8', 'H5N1', 'H5N2', 'H7N7', 'H9N2', 'H1N2', 'H7N1', 'H7N2', 'H7N3', 'H10N7', 'H7N9', 'H10N8', 'H5N8']
['WeilDisease', 'WeilDisease', 'Leptospira', 'Leptospirosis']
['Borreliosis', 'Lymedisease', 'LymeBoreliosis', 'Lyme', 'Lymeneuroborreliosis', 'OphthalmicLymeborreliosis ', 'Lymecarditis', 'Lymearthritis ', 'Neuroborreliosis', 'LymeEncephalitis', 'LymeArthritis ', 'Borellia', 'BorelliaInfection']
['Myelitis', 'Myelitis', 'Meningoencephalitis', 'Meningoencephalitis', 'Encephalitis', 'Encephalitis', 'Meningitis', 'Meningitis', 'tick', 'Tickfever', 'tickfever']
['lungdisease', 'LungIllness', 'MysteriousLungDisease', 'AcuteRespiratoryFailure', 'vapingillness', 'RespiratoryIllness', 'respiratorydisease']
['2019-nCoV', 'SARS-CoV-2 ', 'COVID-19', 'COVID19', 'SARS-CoV-2']
['TBEV', 'tick', 'tickencephalitis', 'loupingill', 'Powassan', 'Powassan', 'Powassan', 'PowassanVirus', 'PowassanDisease', 'PowassanVirusDisease', 'PowassanEncephalitis']
['Tularemia', 'Tularaemia', 'Francisella', 'FrancisellaTularensis']
['UndiagnosedDisease', 'undiagnosedillness', 'unexplainedillness', 'UnidentifiedIllness', 'NewVirus', 'NewDisease', 'newillness', 'unknownbacteria', 'UnknownVirus', 'UnknownIllness', 'unknowninfection', 'IllnessOutbreak', 'unidentifieddisease', 'UnknownDisease', 'Mysteriousdisease', 'mysteriousillness', 'mysteryDisease', 'mysteryillness', 'UnknownSource', 'Unknown', 'unknownviralinfection', 'unknowninfection', 'unknownillness', 'unknownfever', 'unidentifieddisease']
['westnile', 'WestNile', 'westnile', 'westnile', 'westnile', 'WestNile', 'WestNile', 'WestNile', 'WNVInfection', 'WNV', 'WNV', 'westnilevirus', 'WestNileVirus', 'WestNileInfection', 'WestNileFever']
['Zika', 'Zikavirus', 'ZikaVirus', 'ZikaFever', 'Zikafever', 'Zikainfection', 'ZikaInfection', 'ZIKV', 'ZIKV', 'ZikaDisease', 'zikadisease']
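The lists above contain exact duplicates and case variants (e.g. `CHIKV` three times, `WestNile`/`westnile`). Before reusing them as filter terms, it can help to normalise and deduplicate per disease. A hedged sketch with pandas, on a toy frame that mimics the `syndrome`/`hashtags` columns of `keywordsFilter.csv`:

``` python
import pandas as pd

# Toy frame mimicking the keywordsFilter.csv columns used above
mood_keywords = pd.DataFrame({
    "syndrome": ["chikungunya"] * 4,
    "hashtags": ["CHIKV", "CHIKV", "Chikungunya", "chikungunya"],
})

# Normalise case, strip stray whitespace, then drop duplicates per disease
mood_keywords["hashtags"] = mood_keywords["hashtags"].str.strip().str.lower()
unique_keywords = (
    mood_keywords.drop_duplicates(subset=["syndrome", "hashtags"])
    .groupby("syndrome")["hashtags"]
    .apply(list)
)
print(unique_keywords["chikungunya"])  # → ['chikv', 'chikungunya']
```

Note that Twitter keyword matching is itself case-insensitive, so lower-casing loses nothing for collection purposes.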
%% Cell type:markdown id:thrown-collective tags:
## 1.1.4 Data retrieval from existing corpora
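Existing tweet corpora are commonly shared as line-delimited JSON (one tweet object per line). A minimal sketch of loading such a file with pandas (the file name and the two records are purely illustrative):

``` python
import json
import pandas as pd

# Write a tiny JSONL corpus; file name and records are illustrative only
records = [
    {"id_str": "1", "text": "BirdFlu outbreak reported", "lang": "en"},
    {"id_str": "2", "text": "Nothing to report", "lang": "fr"},
]
with open("corpus_example.jsonl", "w", encoding="utf-8") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")

# pandas reads line-delimited JSON directly; keep id_str as a string,
# since tweet IDs overflow float precision when parsed as numbers
tweets = pd.read_json("corpus_example.jsonl", lines=True, dtype={"id_str": str})
print(len(tweets), tweets.loc[0, "text"])  # → 2 BirdFlu outbreak reported
```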
pandas==1.2.3