77 Commits

Author SHA1 Message Date
GassiGiuseppe
cf3e35121b Merge branch 'dev.etl' into dev 2025-10-17 22:15:21 +02:00
GassiGiuseppe
856c693650 Added possibility to whitelist relationships 2025-10-12 12:26:26 +02:00
GassiGiuseppe
e9d30b3cea add divide method to create hold out dataset 2025-10-11 16:49:36 +02:00
GassiGiuseppe
e353c200d7 updated special token 2025-10-08 11:02:18 +02:00
GassiGiuseppe
ee12f53f12 Added EOS token 2025-10-07 22:47:59 +02:00
GassiGiuseppe
a04f4c7cb7 changes to shorten the dataset 2025-10-07 15:49:25 +02:00
GassiGiuseppe
a93e61b8c1 Update ETL 2025-10-07 00:54:00 +02:00
GassiGiuseppe
0373460105 Movie filters updated 2025-10-06 10:57:50 +02:00
GassiGiuseppe
7307916891 update sql_endpoint to work with the new pipeline 2025-10-05 14:58:03 +02:00
GassiGiuseppe
acb43fc899 new faster pipeline 2025-10-05 14:57:45 +02:00
GassiGiuseppe
255d801a80 updated the mask rdf_mask_task.
however since the model will build the mask itself, it is deprecated
2025-10-05 14:56:33 +02:00
GassiGiuseppe
2bd24ec278 Created legacy folder for old pipeline
this pipeline still works but is slower then the new,
some ot its method can be used later
2025-10-05 14:54:32 +02:00
Christian Risi
ba3a718480 Merge branch 'dev.etl' into dev 2025-10-05 11:16:54 +02:00
GassiGiuseppe
69fba7c3e9 new utility to generate a csv debug file of the output of the pipeline 2025-10-04 21:33:09 +02:00
GassiGiuseppe
bbadd4c521 update cleaning pipeline with a new method to filter also by number of films,
also updated the signature of the pipeline
2025-10-04 19:00:05 +02:00
GassiGiuseppe
25f3a5d221 Logic to test BPE 2025-10-04 18:58:04 +02:00
Christian Risi
17d82f0a4e Added support to resume workload 2025-10-02 08:48:28 +02:00
Christian Risi
63baf29805 Added multithreaded training 2025-10-02 01:30:24 +02:00
Christian Risi
fbbe6226bb Finished uploading stubs for TokeNano 2025-10-01 18:56:53 +02:00
Christian Risi
b3d444979f Added flag to resume work correctly 2025-10-01 12:22:09 +02:00
Christian Risi
97bac464f3 Fixed JSON incompatibility 2025-10-01 00:32:43 +02:00
Christian Risi
7ab9b0358e Added script to run BPE 2025-09-30 23:59:09 +02:00
Christian Risi
30c2938d29 Fixed typing 2025-09-30 23:58:54 +02:00
GassiGiuseppe
64e355e80c Added regex to delete new lines and * from ObjectURI 2025-09-30 15:00:07 +02:00
GassiGiuseppe
007f1e9554 minor updates 2025-09-29 18:53:33 +02:00
GassiGiuseppe
c319398ca0 little update to UML pipeline 2025-09-29 17:03:31 +02:00
GassiGiuseppe
255d8a072d First implementation of the cleaning pipeline UML 2025-09-29 16:59:52 +02:00
GassiGiuseppe
8167c9d435 Added Toy Dataset entry point into the Pipeline class
Before it was forced into the sql_endpoint,
now all the pipeline can be managed in the Pipeline class
2025-09-29 16:03:49 +02:00
GassiGiuseppe
bd72ad3571 Added file to execute the complete cleaning pipeline 2025-09-29 15:21:26 +02:00
GassiGiuseppe
e521b0704e deleted TODO in path_splitter_tree, as it was already resolved 2025-09-25 19:19:11 +02:00
Christian Risi
0a698e9837 Added schema to extract from DB for BPE 2025-09-25 19:09:52 +02:00
GassiGiuseppe
9440a562f2 Merge branch 'dev.etl' of https://repositories.communitynotfound.work/PoliBa-DeepLearning/NanoSocrates into dev.etl 2025-09-25 18:33:51 +02:00
Christian Risi
5eda131aac Fixed creation query to be unique even with movieID in RDFs 2025-09-25 17:58:09 +02:00
GassiGiuseppe
57884eaf2e CSV support added to path_splitter_tree
Also resolved a minor bug to print also leaf nodes
2025-09-25 17:57:46 +02:00
GassiGiuseppe
3eec49ffa5 WIP: added test file: clean_relationship.jupyter
to create a first cleaning pipeline
2025-09-25 16:28:24 +02:00
Christian Risi
0bc7f4b227 Fixed Typos 2025-09-25 12:37:52 +02:00
Christian Risi
f28952b0a2 Added todo 2025-09-25 12:00:26 +02:00
Christian Risi
0b626a8e09 Modified query to take all data 2025-09-25 11:53:12 +02:00
Christian Risi
b254098532 Added views to count for subjects and objects 2025-09-25 11:40:44 +02:00
Christian Risi
ee88ffe4cf Added View to filter over relationship counts 2025-09-25 11:32:03 +02:00
Christian Risi
70b4bd8645 Added Complex query 2025-09-25 11:31:34 +02:00
Christian Risi
6316d2bfc4 Added queries to take data from SQL for dataset 2025-09-25 11:27:19 +02:00
Christian Risi
4315d70109 Merged abbreviation_datawarehouse into datawarehouse 2025-09-24 19:29:43 +02:00
Christian Risi
a6760cd52d Updated SQL Queries to support parsing in DB 2025-09-24 19:28:55 +02:00
GassiGiuseppe
a7eb92227d Moved all db queries file in their own folder 2025-09-24 16:44:55 +02:00
GassiGiuseppe
9f221e31cd Merge branch 'dev.etl' of https://repositories.communitynotfound.work/PoliBa-DeepLearning/NanoSocrates into dev.etl 2025-09-24 16:32:52 +02:00
GassiGiuseppe
47197194d5 WIP abbrevietion_datawarehouse to creat an abbreviation system 2025-09-24 16:32:09 +02:00
Christian Risi
0cdbf6f624 Added query to retrieve a dirty dataset from SQLite DB 2025-09-24 16:15:47 +02:00
Christian Risi
3e30489f86 Updated Queries for DB 2025-09-24 14:44:53 +02:00
Christian Risi
7feb4eb857 Fixed URI generation 2025-09-24 14:44:07 +02:00