Folder DataCleaning renamed to DatasetMerging since it doesn't clean anything

and instead builds the dataset
GassiGiuseppe 2025-09-22 17:11:49 +02:00
parent edd01a2c83
commit ac1ed42c49
4 changed files with 45 additions and 28 deletions


@@ -1,28 +0,0 @@
"""
What we have now:
Wikipedia-summary : PageId / abstract
Movies : Movie URI
Dataset : Movie URI / Relationship / Object [RDF]
Movies-PageId : Movie URI / PageId (wiki)
Reverse : Subject / Relationship / Movie URI
What we want:
( we will generate MovieID)
Movies : MovieID [PK] / Movie URI
WikiPageIDs : MovieID [PK, FK]/ PageId [IDX] (wiki) (Not important for now)
Abstracts : MovieID [PK, FK]/ abstract
Subjects : SubjectID [PK] / RDF Subject ( both from either Dataset.csv or Reverse.csv) / OriginID [FK]
Relationships : RelationshipID [PK]/ RDF Relationship (not the actual relationship but the value)
Objects : ObjectID [PK]/ RDF Object / OriginID [FK]
Origins : OriginID [PK]/ Origin Name
RDFs : RDF_ID[PK] / MovieID [FK] / SubjectID [FK]/ RelationshipID [FK]/ ObjectID [FK]
What we will build for the model
we need RDF list for each movie together with abstract
: MovieID / RDF_set / abstract
"""


@@ -0,0 +1,45 @@
"""
What we have now:                                    Saved as:
Wikipedia-summary : PageId / abstract                subject,text
Movies : Movie URI                                   "subject"
Dataset : Movie URI / Relationship / Object [RDF]    subject,relationship,object
Movies-PageId : Movie URI / PageId (wiki)            "subject", "object"
Reverse : Subject / Relationship / Movie URI         "subject","relationship","object"
What we want:
( we will generate MovieID)
Movies : MovieID [PK] / Movie URI
WikiPageIDs : MovieID [PK, FK]/ PageId [IDX] (wiki) (Not important for now)
Abstracts : MovieID [PK, FK]/ abstract
Subjects : SubjectID [PK] / RDF Subject ( both from either Dataset.csv or Reverse.csv) / OriginID [FK]
Relationships : RelationshipID [PK]/ RDF Relationship (not the actual relationship but the value)
Objects : ObjectID [PK]/ RDF Object / OriginID [FK]
Origins : OriginID [PK]/ Origin Name
RDFs : RDF_ID[PK] / MovieID [FK] / SubjectID [FK]/ RelationshipID [FK]/ ObjectID [FK]
What we will build for the model
we need RDF list for each movie together with abstract
: MovieID / RDF_set / abstract
"""
import sqlite3

# Create a SQL connection to our SQLite database
con = sqlite3.connect("data/portal_mammals.sqlite")
cur = con.cursor()

# Return all results of the query. Values are bound as parameters instead of
# being double-quoted, since SQL reserves double quotes for identifiers.
cur.execute("SELECT plot_id FROM plots WHERE plot_type = ?", ("Control",))
control_plots = cur.fetchall()

# Return the first result of the query
cur.execute("SELECT species FROM species WHERE taxa = ?", ("Bird",))
first_bird = cur.fetchone()

# Be sure to close the connection
con.close()
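
The "What we want" tables and the final MovieID / RDF_set / abstract view from the docstring could be sketched in SQLite roughly as follows. This is only an illustration under stated assumptions, not the repository's implementation: the column names (`MovieURI`, `RDFSubject`, and so on), the `SCHEMA` constant, the `movie_rdf_sets` helper, the in-memory database, and every example.org row are hypothetical. The WikiPageIDs table is left out because the docstring marks it as not important for now.

```python
import sqlite3

# Hypothetical DDL for the tables listed under "What we want".
SCHEMA = """
CREATE TABLE Movies (
    MovieID  INTEGER PRIMARY KEY,
    MovieURI TEXT NOT NULL UNIQUE
);
CREATE TABLE Abstracts (
    MovieID  INTEGER PRIMARY KEY REFERENCES Movies(MovieID),
    Abstract TEXT NOT NULL
);
CREATE TABLE Origins (
    OriginID   INTEGER PRIMARY KEY,
    OriginName TEXT NOT NULL UNIQUE
);
CREATE TABLE Subjects (
    SubjectID  INTEGER PRIMARY KEY,
    RDFSubject TEXT NOT NULL,
    OriginID   INTEGER REFERENCES Origins(OriginID)
);
CREATE TABLE Relationships (
    RelationshipID  INTEGER PRIMARY KEY,
    RDFRelationship TEXT NOT NULL UNIQUE
);
CREATE TABLE Objects (
    ObjectID  INTEGER PRIMARY KEY,
    RDFObject TEXT NOT NULL,
    OriginID  INTEGER REFERENCES Origins(OriginID)
);
CREATE TABLE RDFs (
    RDF_ID         INTEGER PRIMARY KEY,
    MovieID        INTEGER NOT NULL REFERENCES Movies(MovieID),
    SubjectID      INTEGER NOT NULL REFERENCES Subjects(SubjectID),
    RelationshipID INTEGER NOT NULL REFERENCES Relationships(RelationshipID),
    ObjectID       INTEGER NOT NULL REFERENCES Objects(ObjectID)
);
"""


def movie_rdf_sets(con):
    """Group triples per movie: {MovieID: {"rdf_set": [...], "abstract": str}}."""
    rows = con.execute(
        """
        SELECT m.MovieID, s.RDFSubject, r.RDFRelationship, o.RDFObject, a.Abstract
        FROM RDFs AS t
        JOIN Movies        AS m ON m.MovieID = t.MovieID
        JOIN Subjects      AS s ON s.SubjectID = t.SubjectID
        JOIN Relationships AS r ON r.RelationshipID = t.RelationshipID
        JOIN Objects       AS o ON o.ObjectID = t.ObjectID
        JOIN Abstracts     AS a ON a.MovieID = m.MovieID
        ORDER BY m.MovieID, t.RDF_ID
        """
    )
    result = {}
    for movie_id, subj, rel, obj, abstract in rows:
        entry = result.setdefault(movie_id, {"rdf_set": [], "abstract": abstract})
        entry["rdf_set"].append((subj, rel, obj))
    return result


# Demo on an in-memory database with made-up placeholder rows.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
con.executescript(SCHEMA)

movie = "http://example.org/movie/A"    # placeholder URI
person = "http://example.org/person/B"  # placeholder URI
con.execute("INSERT INTO Movies VALUES (1, ?)", (movie,))
con.execute("INSERT INTO Abstracts VALUES (1, 'Placeholder abstract.')")
con.executemany("INSERT INTO Origins VALUES (?, ?)",
                [(1, "Dataset.csv"), (2, "Reverse.csv")])
con.executemany("INSERT INTO Subjects VALUES (?, ?, ?)",
                [(1, movie, 1), (2, person, 2)])
con.executemany("INSERT INTO Relationships VALUES (?, ?)",
                [(1, "ex:director"), (2, "ex:knownFor")])
con.executemany("INSERT INTO Objects VALUES (?, ?, ?)",
                [(1, person, 1), (2, movie, 2)])
con.executemany("INSERT INTO RDFs VALUES (?, ?, ?, ?, ?)",
                [(1, 1, 1, 1, 1), (2, 1, 2, 2, 2)])

result = movie_rdf_sets(con)
con.close()
```

Grouping in Python rather than with `GROUP_CONCAT` keeps each triple as a tuple and makes the order explicit through `ORDER BY`; whether the real pipeline wants a list, a set, or a serialized string for RDF_set is left open here.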