De la données non structurées en base de données structurées

Application aux données géographiques

31-03-2024

Format de données

Allégorie du gaz.
La donnée peut être représenté selon différentes formes :

  • l’usage qui est faite de la donnée conditionne sa forme.
  • la forme de la donnée conditionne son usage

Distinction Information/Donnée :

  • L’information est la mise en contexte d’une donnée.
  • La donnée est le support ou vecteur de l’information.

Ci-dessous 2 format pour 1 information :

  • JSON
{
                "civ": "M.",
                "prenom": "David",
                "nom": "Habib",
                "alpha": "HabibD",
                "trigramme": "DHA"
}
  • CSV
civ;prenom;nom;alpha;trigramme
M.;David;Habib;HabibD;DHA

Concept de Tidy Data


(Wickham 2014)

Tidy Data et Messy Data

(Lowndes and Horst 2020)

\(80\) à \(90\)% des données sont non structurées.
Liste d’exemple de Messy Data à partir de MongoDB

  • json dont geojson
  • xml
  • image
  • texte
  • ...

L’avantage du Tidy Data

(Lowndes and Horst 2020)

library(tidyverse)
library(dplyr)
library(ggplot2)

tidydata_format <- table1 %>%
  select(country,year,population) %>%
  filter(year==2000)
tidydata_format
# A tibble: 3 x 3
  country      year population
  <chr>       <dbl>      <dbl>
1 Afghanistan  2000   20595360
2 Brazil       2000  174504898
3 China        2000 1280428583
not_tidydata_format <- table4b %>% select(-`1999`) 
not_tidydata_format
# A tibble: 3 x 2
  country         `2000`
  <chr>            <dbl>
1 Afghanistan   20595360
2 Brazil       174504898
3 China       1280428583
also_not_tidydata_format <- table2 %>% filter(year==2000) 
also_not_tidydata_format
# A tibble: 6 x 4
  country      year type            count
  <chr>       <dbl> <chr>           <dbl>
1 Afghanistan  2000 cases            2666
2 Afghanistan  2000 population   20595360
3 Brazil       2000 cases           80488
4 Brazil       2000 population  174504898
5 China        2000 cases          213766
6 China        2000 population 1280428583

Quand un Messy Data ne peut se transformer en Tidy Data

Prenons l’exemple des circonscriptions françaises.

Si une circonscription se limitait à un point, on pourrait considérer la structure suivante :

tribble(
  ~ ID, ~ code_dpt, ~ nom_dpt, ~ nom_reg, ~ num_circ, ~ lattitude, ~ longitude,
   "33004","33", "GIRONDE", "AQUITAINE-LIMOUSIN-POITOU-CHARENTES","75",-0.454946,44.953424,
   "33004","68", "HAUT-RHIN", "ALSACE-CHAMPAGNE-ARDENNE-LORRAINE","44", 6.981539,48.009421)
# A tibble: 2 x 7
  ID    code_dpt nom_dpt   nom_reg                       num_c~1 latti~2 longi~3
  <chr> <chr>    <chr>     <chr>                         <chr>     <dbl>   <dbl>
1 33004 33       GIRONDE   AQUITAINE-LIMOUSIN-POITOU-CH~ 75       -0.455    45.0
2 33004 68       HAUT-RHIN ALSACE-CHAMPAGNE-ARDENNE-LOR~ 44        6.98     48.0
# ... with abbreviated variable names 1: num_circ, 2: lattitude, 3: longitude

Malheureusement c’est plus compliqué que cela.
(“Carte Des Circonscriptions Législatives 2012 Et 2017” 2017)

Première solution :
Multiplier les colonnes :

tribble(
  ~ ID, ~ code_dpt, ~ nom_dpt, ~ nom_reg, ~ num_circ, ~ lattitude_point1, ~ lattitude_point1,~ lattitude_point2, ~ lattitude_point2,~ lattitude_point3, ~ lattitude_point3,
   "33004","33", "GIRONDE", "AQUITAINE-LIMOUSIN-POITOU-CHARENTES","75",-0.454946,44.953424,-0.454946,45.953424,-0.454946,46.953424,
   "33004","68", "HAUT-RHIN", "ALSACE-CHAMPAGNE-ARDENNE-LORRAINE","44", 6.981539,48.009421, 6.981539,49.009421, 6.981539,50.009421)
# A tibble: 2 x 11
  ID    code_dpt nom_dpt nom_reg num_c~1 latti~2 latti~3 latti~4 latti~5 latti~6
  <chr> <chr>    <chr>   <chr>   <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1 33004 33       GIRONDE AQUITA~ 75       -0.455    45.0  -0.455    46.0  -0.455
2 33004 68       HAUT-R~ ALSACE~ 44        6.98     48.0   6.98     49.0   6.98 
# ... with 1 more variable: lattitude_point3 <dbl>, and abbreviated variable
#   names 1: num_circ, 2: lattitude_point1, 3: lattitude_point1,
#   4: lattitude_point2, 5: lattitude_point2, 6: lattitude_point3

Deuxième solution :
Multiplier les lignes :

tribble(
  ~ ID, ~ code_dpt, ~ nom_dpt, ~ nom_reg, ~ num_circ, ~ lattitude, ~ lattitude,
   "33004","33", "GIRONDE", "AQUITAINE-LIMOUSIN-POITOU-CHARENTES","75",-0.474946,44.953424,
   "33004","33", "GIRONDE", "AQUITAINE-LIMOUSIN-POITOU-CHARENTES","75",-0.254946,45.953424,
   "33004","33", "GIRONDE", "AQUITAINE-LIMOUSIN-POITOU-CHARENTES","75",-0.44946,46.953424,
   "33004","68", "HAUT-RHIN", "ALSACE-CHAMPAGNE-ARDENNE-LORRAINE","44", 6.981539,49.009421,
   "33004","68", "HAUT-RHIN", "ALSACE-CHAMPAGNE-ARDENNE-LORRAINE","44", 6.981539,50.009421,
   "33004","68", "HAUT-RHIN", "ALSACE-CHAMPAGNE-ARDENNE-LORRAINE","44", 6.981539,51.009421,
  )
# A tibble: 6 x 7
  ID    code_dpt nom_dpt   nom_reg                       num_c~1 latti~2 latti~3
  <chr> <chr>    <chr>     <chr>                         <chr>     <dbl>   <dbl>
1 33004 33       GIRONDE   AQUITAINE-LIMOUSIN-POITOU-CH~ 75       -0.475    45.0
2 33004 33       GIRONDE   AQUITAINE-LIMOUSIN-POITOU-CH~ 75       -0.255    46.0
3 33004 33       GIRONDE   AQUITAINE-LIMOUSIN-POITOU-CH~ 75       -0.449    47.0
4 33004 68       HAUT-RHIN ALSACE-CHAMPAGNE-ARDENNE-LOR~ 44        6.98     49.0
5 33004 68       HAUT-RHIN ALSACE-CHAMPAGNE-ARDENNE-LOR~ 44        6.98     50.0
6 33004 68       HAUT-RHIN ALSACE-CHAMPAGNE-ARDENNE-LOR~ 44        6.98     51.0
# ... with abbreviated variable names 1: num_circ, 2: lattitude, 3: lattitude

=> Nous somme contraint de considérer le type JSON

Le type geometry

Considérons le type de variable geometry dans R dont la valeur est une liste de point.

library(geojsonsf)
library(tidyverse)
library(sf)

france_metropolitaine = geojson_sf("data/france-circonscriptions-legislatives-2012.json") %>%
  select(num_circ,code_dpt,geometry)%>%
  rename(circo = num_circ,departementCode = code_dpt) %>%
  filter(departementCode!="ZD",
         departementCode!="ZN",
         departementCode!="ZS",
         departementCode!="ZX",
         departementCode!="ZP",
         departementCode!="ZW",
         departementCode!="2B",
         departementCode!="2A",
         departementCode!="ZA",
         departementCode!="ZM",
         departementCode!="ZC",
         departementCode!="ZB")

france_metropolitaine
Simple feature collection with 535 features and 2 fields
Geometry type: GEOMETRY
Dimension:     XY
Bounding box:  xmin: -5.115104 ymin: 42.33349 xmax: 8.197396 ymax: 51.0894
Geodetic CRS:  WGS 84
First 10 features:
   circo departementCode                       geometry
1      4              33 POLYGON ((-0.454946 44.9534...
2      1              38 POLYGON ((5.805288 45.2062,...
3     10              59 POLYGON ((3.058745 50.78071...
4      7              33 POLYGON ((-0.610649 44.8224...
5      1              01 POLYGON ((4.888208 46.40298...
6      2              01 POLYGON ((5.435634 45.82977...
7      3              01 POLYGON ((5.831226 45.93846...
8      4              01 POLYGON ((4.888208 46.40298...
9      5              01 POLYGON ((5.51137 46.26436,...
10     1              02 POLYGON ((3.320684 49.69935...
plot(france_metropolitaine)

Quid de PostgreSQL

Ci-dessous un script permettant de reproduire la même chose en SQL :

CREATE EXTENSION postgis;

CREATE TABLE assemblee_elective.circonscription(
  circo VARCHAR,
  departementCode VARCHAR,
  geometry geometry);
 
 WITH DATA AS (SELECT '{
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {
                "ID": "33004",
                "code_dpt": "33",
                "nom_dpt": "GIRONDE",
                "nom_reg": "AQUITAINE-LIMOUSIN-POITOU-CHARENTES",
                "num_circ": "4",
                "code_reg": "75"
            },
            "geometry": {
                "type": "Polygon",
                "coordinates": [
                    [
                        [
                            -0.454946,
                            44.953424
                        ],
                        [...
}'::json AS fc)

INSERT INTO assemblee_elective.circonscription(geometry,departementcode,circo)

SELECT

  ST_AsText(ST_GeomFromGeoJSON(feat->>'geometry')) AS geometry,
  CAST(REPLACE(CAST(feat->'properties'->'code_dpt' AS varchar),'"','') AS varchar) AS departementCode,
  CAST(REPLACE(CAST(feat->'properties'->'num_circ' AS varchar),'"','') AS varchar) AS circo
  
FROM (
  SELECT json_array_elements(fc->'features') AS feat
  FROM data 
) AS f;

On obtient un jeu de données qui est TIDY.

Application : Rattachement adresse à une circonscription

L’objectif étant à partir d’une adresse de déterminer la circonscription à laquelle appartient l’adresse en python :



import geopandas as gpd
import os
import psycopg2
from sqlalchemy import create_engine
import geopy as gp


db = 'bdd_democratie'  
host_db = '176.147.16.188'   
db_port = '5432'  
db_user = "postgres"  
db_password = 'postgres'


conn = psycopg2.connect(
    host=host_db,
    database=db,
    user=db_user,
    port = db_port,
    password=db_password)

engine = create_engine("postgresql://"+ db_user + ":" + db_password + "@" + host_db + ":" + db_port + "/" + db)  

adress = "24 parc de la vallée, 68530 Buhl"
locator = gp.Nominatim(user_agent="myGeocoder")

location = locator.geocode(adress)
location.longitude
7.179495
location.latitude
47.9343092
query_param = "ST_SetSRID(ST_MakePoint(" + str(location.longitude)  + " , " + str(location.latitude) + "),4326)"
query_param
'ST_SetSRID(ST_MakePoint(7.179495 , 47.9343092),4326)'
query_kernel = "SELECT nom_dpt, geometry as circonscription, " +  query_param  + " as point, ST_Contains(geometry," + query_param + ") as point_in_circo FROM assemblee_elective.circonscription;"
query_kernel
'SELECT nom_dpt, geometry as circonscription, ST_SetSRID(ST_MakePoint(7.179495 , 47.9343092),4326) as point, ST_Contains(geometry,ST_SetSRID(ST_MakePoint(7.179495 , 47.9343092),4326)) as point_in_circo FROM assemblee_elective.circonscription;'
cursor = conn.cursor()
cursor.execute(query_kernel)

print("Voci le résultat ci-dessous : ")
Voci le résultat ci-dessous : 
mobile_records = cursor.fetchall()
mobile_records[292] # Mon adresse appartient bien à la belle circonscription du HAUT-RHIN
('HAUT-RHIN', '0103000020E6100000010000001900000082FE428F18ED1B40A47213B534014840B1FD648C0F131C40DC2C5E2C0C074840865AD3BCE3341C40D9B5BDDD920A48400C2252D32E561C403CF6B3588A10484028B517D1763C1C40BF9CD9AED01148403F912749D76C1C40C328081EDF184840ECA694D74A981C40570740DCD5214840B98AC56F0ACB1C408194D8B5BD274840B2DAFCBFEA181D40B265F9BA0C274840AF5C6F9BA9401D402975C93846204840309B00C3F2671D40A81DFE9AAC1F48407A39ECBE63E81D4052431B800D1A48406F2A52616CE11D40B2D5E59480144840DBDE6E490EC81D405A492BBEA11648402461DF4E22921D402C4487C091164840AD18AE0E80781D40C51C041DAD104840E355D636C5531D40C6C210397D0F4840998235CEA6431D40A3CA30EE06094840D177B7B244671D40DB166536C80648401D1D5723BB621D401230BABC39024840AAB706B64A501D40952C27A1F4F74740D5EAABAB02C51C407B4ACE893DF247409D6340F67A471C401BD5E940D6F34740CF68AB92C8EE1B404AEB6F09C0FD474082FE428F18ED1B40A47213B534014840', '0101000020E61000009D2E8B89CDB71C400C2DA17197F74740', True)
mobile_records[132] # Mon adresse n'appartient pas à la moins belle circonscription du BAS-RHIN
('BAS-RHIN', '0103000020E6100000010000000A0000008A8F4FC8CEEB1E40B2F2CB608C46484029EACC3D24EC1E4009FD4CBD6E47484026FBE769C0F01E409DD9AED00749484092B442DA18271F40FB668163A34B48404CC11A67D3311F40FB20CB82894B484085ED27637C381F404DBB9866BA414840F5B9DA8AFD151F40912BF52C083F4840DEE522BE13D31E40EFA83121E64048403BE5D18DB0C81E401EFB592C454448408A8F4FC8CEEB1E40B2F2CB608C464840', '0101000020E61000009D2E8B89CDB71C400C2DA17197F74740', False)

Utilité dans l’application de prédiction de prix (Cf : BBL de Ramzi)

Application : Un ETL pour un flux ELT

Citations et discussion

“Carte Des Circonscriptions Législatives 2012 Et 2017.” 2017. https://www.data.gouv.fr/fr/datasets/r/efa8c2e6-b8f7-4594-ad01-10b46b06b56a.
Lowndes, Julia, and Allison Horst. 2020. “Illustrations from the Openscapes Blog Tidy Data for Reproducibility, Efficiency, and Collaboration by Julia Lowndes and Allison Horst.” https://allisonhorst.com/other-r-fun.
Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23. https://www.jstatsoft.org/index.php/jss/article/view/v059i10.