"Only tweet IDs can be shared from which we can retrieve all tweet contents and metadata. It's called **hydrating tweet**.\n",
"To do so, you can use the command line tool [twarc](https://github.com/DocNow/twarc). You must set your credentials and then hydrate tweets : `twarc hydrate tweet-ids.txt > tweets.jsonl`\n",
"\n",
"For this workshop, we are going to use a tweets corpus in Licence CC0 (Public Domain) from [kaggle platform](https://www.kaggle.com/gpreda/pfizer-vaccine-tweets)."
"For this workshop, we are going to use a tweets corpus in Licence CC0 (Public Domain) from [kaggle platform](https://www.kaggle.com/gpreda/pfizer-vaccine-tweets).\n",
"**If you have already a kaggle account, you can download the dataset from the link below or you can download from this link [filesender](https://filesender.renater.fr/?s=download&token=1706766d-676e-4823-a1b4-665067e5fc81#), password will be given during the workshop**. Please, now upload this file in data directory"
]
},
{
...
...
%% Cell type:markdown id:commercial-brain tags:
# SVEPM 2021 - WS5 : Extraction of medical terms from non-structured (textual) data from online news and social media
Introduction to Natural Language Processing (NLP) methods applied to the health domain : an overview of terminology extraction from online news and a social network (Twitter).
1. Twitter
1.1 Data acquisition
1.1.1 Data description
1.1.2 Prerequisite
1.1.3 Data collection with keywords and account names
1.1.4 Data retrieval from existing corpora
1.2 Pre-process
1.2.1 Filtering
1.2.2 Tweets cleaning
1.3 Terminology extraction
1.3.1 Statistical method : TF-IDF
1.3.2 Application of TF-IDF
1.4 Data visualization
2. Online news
2.1 Data acquisition
%% Cell type:markdown id:corrected-cliff tags:
## 1.1.1 Twitter data description
%% Cell type:code id:complex-juice tags:
``` python
# Tweet example from the Twitter API documentation : https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/overview
tweet_example = {
    "created_at": "Thu Apr 06 15:24:15 +0000 2017",
    "id_str": "850006245121695744",
    "text": "1/ Today we’re sharing our vision for the future of the Twitter API platform!\nhttps://t.co/XweGngmxlP",
    "user": {
        "id": 2244994945,
        "name": "Twitter Dev",
        "screen_name": "TwitterDev",
        "location": "Internet",
        "url": "https://dev.twitter.com/",
        "description": "Your official source for Twitter Platform news, updates & events. Need technical help? Visit https://twittercommunity.com/ ⌨️ #TapIntoTwitter",
    },
}
```
Twitter data contain personal and sensitive data. We have to comply with the [GDPR](https://en.wikipedia.org/wiki/General_Data_Protection_Regulation) and [Twitter's terms of service](https://twitter.com/en/tos#intlTerms).
Registered Twitter users gave their consent to Twitter's policy, which allows their data to be used for **research work**. However, as soon as they change their visibility/privacy settings (i.e. they withdraw their consent), we are no longer allowed to use their data. Maintaining a dataset of tweets therefore requires **synchronising in real time with Twitter's API**.
To retrieve tweets automatically, we have to apply for a [Twitter developer account](https://developer.twitter.com/en/apply-for-access). To do so, we have to explain why we want access and describe the applications that will use our credentials.
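The tweet object shown in section 1.1.1 is nested JSON; once loaded in Python it behaves like a dictionary of dictionaries, so fields are reached by chaining keys. A minimal sketch (values abridged from that example):

``` python
# A tweet as a nested Python dict (abridged from the section 1.1.1 example)
tweet_example = {
    "created_at": "Thu Apr 06 15:24:15 +0000 2017",
    "id_str": "850006245121695744",
    "text": "1/ Today we're sharing our vision for the future of the Twitter API platform!",
    "user": {
        "id": 2244994945,
        "name": "Twitter Dev",
        "screen_name": "TwitterDev",
        "location": "Internet",
    },
}

# Nested fields are reached by chaining keys
print(tweet_example["user"]["screen_name"])  # → TwitterDev
print(tweet_example["created_at"])           # → Thu Apr 06 15:24:15 +0000 2017
```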
## 1.1.3 Data collection with keywords and account names
You can find the [script that collects tweets for MOOD](https://gitlab.irstea.fr/umr-tetis/mood/mood-tetis-tweets-collect). You can clone this repository.
To comply with Twitter's terms of use, we cannot share tweet contents or data.
Only tweet IDs can be shared; from these, all tweet contents and metadata can be retrieved. This is called **tweet hydration**.
To do so, you can use the command-line tool [twarc](https://github.com/DocNow/twarc). You must set your credentials and then hydrate the tweets : `twarc hydrate tweet-ids.txt > tweets.jsonl`
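Each line of the resulting `tweets.jsonl` file is one tweet serialised as JSON (the JSON-Lines format). A minimal loading sketch, assuming the file produced by the `twarc` command above; the sample line here is written only to make the example self-contained:

``` python
import json

def load_tweets(path):
    """Read a JSON-Lines file of hydrated tweets into a list of dicts."""
    tweets = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                tweets.append(json.loads(line))
    return tweets

# Write one abridged hydrated tweet so the sketch is runnable on its own
sample = {"id_str": "850006245121695744", "text": "1/ Today we're sharing our vision..."}
with open("tweets.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(sample) + "\n")

tweets = load_tweets("tweets.jsonl")
print(len(tweets), tweets[0]["id_str"])  # → 1 850006245121695744
```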
For this workshop, we are going to use a tweet corpus under the CC0 (Public Domain) licence from the [Kaggle platform](https://www.kaggle.com/gpreda/pfizer-vaccine-tweets).
**If you already have a Kaggle account, you can download the dataset directly from Kaggle; otherwise, you can download it from this [filesender](https://filesender.renater.fr/?s=download&token=1706766d-676e-4823-a1b4-665067e5fc81#) link (the password will be given during the workshop).** Please now upload this file into the `data` directory.
%% Cell type:code id:verified-defeat tags:
``` python
# import the pandas library : facilitates working with tables and matrices