Commit 011c8ef0 authored by Decoupes Remy's avatar Decoupes Remy
Browse files

init repo

.gitignore 0 → 100644
# Twitter data :
data/
# MOOD git clone :
mood-tetis-tweets-collect/
# ipynb execution
.ipynb_checkpoints/
LICENSE 0 → 100644
README.md 0 → 100644
# SVEPM 2021 - WS5: Extraction of medical terms from non-structured (textual) data from online news and social media
## [Wednesday 24th March, 14.00-17.00](https://www.svepm2021.org/upload/pdf/SVEPM2021_WS5.pdf)
Introduction to Natural Language Processing (NLP) methods applied to the health domain: an overview of terminology extraction from online news and social networks (Twitter).
## Authors
[UMR TÉTIS](https://www.umr-tetis.fr/)
## License
This code is provided under the [CeCILL-B](https://cecill.info/licences/Licence_CeCILL-B_V1-en.html) free software license agreement.
%% Cell type:markdown id:commercial-brain tags:
# SVEPM 2021 - WS5: Extraction of medical terms from non-structured (textual) data from online news and social media
## [Wednesday 24th March, 14.00-17.00](https://www.svepm2021.org/upload/pdf/SVEPM2021_WS5.pdf)
[UMR TÉTIS](https://www.umr-tetis.fr/)
Introduction to Natural Language Processing (NLP) methods applied to the health domain: an overview of terminology extraction from online news and social networks (Twitter).
1. Twitter
1.1 Data acquisition
1.1.1 Data description
1.1.2 Prerequisite
1.1.3 Data collection with keywords and account names
1.1.4 Data retrieval from existing corpora
1.2 Pre-process
1.2.1 Filtering
1.2.2 Tweets cleaning
1.3 Terminology extraction
1.3.1 Statistical method : TF-IDF
1.3.2 Application of TF-IDF
1.4 Data visualization
2. Online news
2.1 Data acquisition
%% Cell type:markdown id:corrected-cliff tags:
## 1.1.1 Twitter data description
%% Cell type:code id:complex-juice tags:
``` python
# Tweet example from Twitter doc API : https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/overview
tweet_example = {
"created_at": "Thu Apr 06 15:24:15 +0000 2017",
"id_str": "850006245121695744",
"text": "1\/ Today we\u2019re sharing our vision for the future of the Twitter API platform!\nhttps:\/\/t.co\/XweGngmxlP",
"user": {
"id": 2244994945,
"name": "Twitter Dev",
"screen_name": "TwitterDev",
"location": "Internet",
"url": "https:\/\/dev.twitter.com\/",
"description": "Your official source for Twitter Platform news, updates & events. Need technical help? Visit https:\/\/twittercommunity.com\/ \u2328\ufe0f #TapIntoTwitter"
},
"place": {
},
"entities": {
"hashtags": [
],
"urls": [
{
"url": "https:\/\/t.co\/XweGngmxlP",
"unwound": {
"url": "https:\/\/cards.twitter.com\/cards\/18ce53wgo4h\/3xo1c",
"title": "Building the Future of the Twitter API Platform"
}
}
],
"user_mentions": [
]
}
}
```
%% Cell type:code id:outer-lover tags:
``` python
# Print tweet content and user
tweet_content = tweet_example["text"]
tweet_user = tweet_example["user"]["name"]
print("Print raw data: \n"+ tweet_content + " from " + tweet_user)
print("\n")
# Clean tweet content: the raw text escapes "/" as "\/", so drop the backslashes
tweet_content_cleaned = tweet_example["text"].replace("\\", "")
print("Pre-process tweet: \n"+ tweet_content_cleaned + " from " + tweet_user)
```
%% Output
Print raw data:
1\/ Today we’re sharing our vision for the future of the Twitter API platform!
https:\/\/t.co\/XweGngmxlP from Twitter Dev
Pre-process tweet:
1/ Today we’re sharing our vision for the future of the Twitter API platform!
https://t.co/XweGngmxlP from Twitter Dev
%% Cell type:markdown id:terminal-jurisdiction tags:
## 1.1.2 Prerequisite
Twitter data contain personal and sensitive data. We have to be compliant with the [GDPR](https://en.wikipedia.org/wiki/General_Data_Protection_Regulation) and [Twitter's Terms of Service](https://twitter.com/en/tos#intlTerms).
Registered Twitter users gave their consent to Twitter's policy. They consent to their data being used for **research works**. However, as soon as they change their visibility/privacy settings (i.e. they withdraw their consent), we are no longer allowed to use their data. Maintaining a dataset of tweets therefore implies **synchronising in real time with Twitter's API**.
To retrieve tweets automatically, we have to apply for a [Twitter dev account](https://developer.twitter.com/en/apply-for-access). To do so, we have to explain why we want access and describe the applications that will use our credentials.
## 1.1.3 Data collection with keywords and account names
You can find the [script that collects tweets for MOOD](https://gitlab.irstea.fr/umr-tetis/mood/mood-tetis-tweets-collect). You can clone this repository:
%% Cell type:code id:magnetic-arrival tags:
``` python
# clone MOOD repository:
!git clone https://gitlab.irstea.fr/umr-tetis/mood/mood-tetis-tweets-collect.git
```
%% Output
fatal: destination path 'mood-tetis-tweets-collect' already exists and is not an empty directory.
mood-tetis-tweets-collect/params/keywordsFilter.csv
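For context on what "collection with keywords and account names" means in practice: Twitter's v1.1 streaming filter endpoint takes comma-separated keyword phrases (`track`) and numeric user IDs (`follow`). A minimal sketch of assembling such parameters (the helper name and the keyword list are illustrative, not the MOOD script's actual code):

``` python
# Sketch: build "track" and "follow" parameters for Twitter's statuses/filter
# streaming endpoint. Function name and keyword choices are illustrative only.
def build_filter_params(keywords, account_ids):
    """Return the parameter dict expected by the v1.1 streaming filter API."""
    return {
        "track": ",".join(keywords),      # comma acts as OR between phrases
        "follow": ",".join(account_ids),  # numeric user IDs, as strings
    }

params = build_filter_params(
    ["BirdFlu", "AvianInfluenza", "H5N1"],
    ["2244994945"],  # e.g. @TwitterDev's numeric ID from the example above
)
print(params["track"])  # → BirdFlu,AvianInfluenza,H5N1
```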
%% Cell type:code id:exempt-distance tags:
``` python
# Print MOOD keywords
import pandas as pd

mood_keywords = pd.read_csv("mood-tetis-tweets-collect/params/keywordsFilter.csv")
# Group keywords by disease (the "syndrome" column)
mood_diseases = mood_keywords.groupby("syndrome")
for disease, keywords in mood_diseases:
    print(keywords["hashtags"].tolist(), "\n")
```
%% Output
['AMR', 'ATB', 'AntimicrobialResistance', 'resstance', 'resistance', 'resistance', 'AntibioticResistance']
['Avian', 'BirdFlu', 'Avian', 'Bird', 'Fowl', 'HPAI', 'bird', 'AvianInfluenza', 'Avianflu', 'BirdFlu', 'FowlPlague', 'avianInfluenza']
['Chikungunya', 'Chikungunya', 'ChikungunyaFever ', 'CHIKV', 'CHIKV', 'CHIKV', 'Chikungunya', 'Chikungunya', 'Chikungunyavirus', 'ChikungunyaVirus', 'Chikungunyafever']
['MassMortalities', 'Massdie', 'MassDie_off', 'FatalIllness', 'UnknownDeath', 'unknowndeath', 'FatalIllness']
['DENV', 'DENV', 'Dengue', 'Denguefever', 'DengueFever']
['MysteriousFever', 'HaemorrhagicFever', 'FebrileIllness', 'UnknownFever']
['MassFoodPoisoning']
['Flu', 'Influenza', 'influenzavirus', 'InfluenzaVirus', 'H1N1', 'H2N2', 'H3N2', 'H3N8', 'H5N1', 'H5N2', 'H7N7', 'H9N2', 'H1N2', 'H7N1', 'H7N2', 'H7N3', 'H10N7', 'H7N9', 'H10N8', 'H5N8']
['WeilDisease', 'WeilDisease', 'Leptospira', 'Leptospirosis']
['Borreliosis', 'Lymedisease', 'LymeBoreliosis', 'Lyme', 'Lymeneuroborreliosis', 'OphthalmicLymeborreliosis ', 'Lymecarditis', 'Lymearthritis ', 'Neuroborreliosis', 'LymeEncephalitis', 'LymeArthritis ', 'Borellia', 'BorelliaInfection']
['Myelitis', 'Myelitis', 'Meningoencephalitis', 'Meningoencephalitis', 'Encephalitis', 'Encephalitis', 'Meningitis', 'Meningitis', 'tick', 'Tickfever', 'tickfever']
['lungdisease', 'LungIllness', 'MysteriousLungDisease', 'AcuteRespiratoryFailure', 'vapingillness', 'RespiratoryIllness', 'respiratorydisease']
['2019-nCoV', 'SARS-CoV-2 ', 'COVID-19', 'COVID19', 'SARS-CoV-2']
['TBEV', 'tick', 'tickencephalitis', 'loupingill', 'Powassan', 'Powassan', 'Powassan', 'PowassanVirus', 'PowassanDisease', 'PowassanVirusDisease', 'PowassanEncephalitis']
['Tularemia', 'Tularaemia', 'Francisella', 'FrancisellaTularensis']
['UndiagnosedDisease', 'undiagnosedillness', 'unexplainedillness', 'UnidentifiedIllness', 'NewVirus', 'NewDisease', 'newillness', 'unknownbacteria', 'UnknownVirus', 'UnknownIllness', 'unknowninfection', 'IllnessOutbreak', 'unidentifieddisease', 'UnknownDisease', 'Mysteriousdisease', 'mysteriousillness', 'mysteryDisease', 'mysteryillness', 'UnknownSource', 'Unknown', 'unknownviralinfection', 'unknowninfection', 'unknownillness', 'unknownfever', 'unidentifieddisease']
['westnile', 'WestNile', 'westnile', 'westnile', 'westnile', 'WestNile', 'WestNile', 'WestNile', 'WNVInfection', 'WNV', 'WNV', 'westnilevirus', 'WestNileVirus', 'WestNileInfection', 'WestNileFever']
['Zika', 'Zikavirus', 'ZikaVirus', 'ZikaFever', 'Zikafever', 'Zikainfection', 'ZikaInfection', 'ZIKV', 'ZIKV', 'ZikaDisease', 'zikadisease']
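The lists above contain exact duplicates and case variants (e.g. `CHIKV` three times, `WestNile`/`westnile`). Before reusing them as filter terms, it can help to normalise and deduplicate per disease. A hedged sketch with pandas, on a toy frame that mimics the `syndrome`/`hashtags` columns of `keywordsFilter.csv`:

``` python
import pandas as pd

# Toy frame mimicking the keywordsFilter.csv columns used above
mood_keywords = pd.DataFrame({
    "syndrome": ["chikungunya"] * 4,
    "hashtags": ["CHIKV", "CHIKV", "Chikungunya", "chikungunya"],
})

# Normalise case, strip stray whitespace, then drop duplicates per disease
mood_keywords["hashtags"] = mood_keywords["hashtags"].str.strip().str.lower()
unique_keywords = (
    mood_keywords.drop_duplicates(subset=["syndrome", "hashtags"])
    .groupby("syndrome")["hashtags"]
    .apply(list)
)
print(unique_keywords["chikungunya"])  # → ['chikv', 'chikungunya']
```

Note that Twitter keyword matching is itself case-insensitive, so lower-casing loses nothing for collection purposes.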
%% Cell type:markdown id:thrown-collective tags:
## 1.1.4 Data retrieval from existing corpora
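Existing tweet corpora are commonly shared as line-delimited JSON (one tweet object per line). A minimal sketch of loading such a file with pandas (the file name and the two records are purely illustrative):

``` python
import json
import pandas as pd

# Write a tiny JSONL corpus; file name and records are illustrative only
records = [
    {"id_str": "1", "text": "BirdFlu outbreak reported", "lang": "en"},
    {"id_str": "2", "text": "Nothing to report", "lang": "fr"},
]
with open("corpus_example.jsonl", "w", encoding="utf-8") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")

# pandas reads line-delimited JSON directly; keep id_str as a string,
# since tweet IDs overflow float precision when parsed as numbers
tweets = pd.read_json("corpus_example.jsonl", lines=True, dtype={"id_str": str})
print(len(tweets), tweets.loc[0, "text"])  # → 2 BirdFlu outbreak reported
```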
pandas==1.2.3