130 Commits

Author SHA1 Message Date
Christian Risi
2fe1ce9e9a Updated Inits 2025-09-30 13:32:37 +02:00
Christian Risi
18fc2ba9d8 Added Exceptions 2025-09-30 13:32:24 +02:00
Christian Risi
5acee1d1a5 Merge branch 'dev' into dev.bpe 2025-09-30 11:35:27 +02:00
2e36753da4 Merge pull request 'dev.etl' (#5) from dev.etl into dev
Reviewed-on: #5
2025-09-30 11:28:57 +02:00
GassiGiuseppe
007f1e9554 minor updates 2025-09-29 18:53:33 +02:00
GassiGiuseppe
c319398ca0 little update to UML pipeline 2025-09-29 17:03:31 +02:00
GassiGiuseppe
255d8a072d First implementation of the cleaning pipeline UML 2025-09-29 16:59:52 +02:00
GassiGiuseppe
8167c9d435 Added Toy Dataset entry point into the Pipeline class
Before it was forced into the sql_endpoint,
now the whole pipeline can be managed in the Pipeline class
2025-09-29 16:03:49 +02:00
GassiGiuseppe
bd72ad3571 Added file to execute the complete cleaning pipeline 2025-09-29 15:21:26 +02:00
GassiGiuseppe
6ddb7de9da Added SQLAlchemy to requirements 2025-09-29 15:19:19 +02:00
Christian Risi
564b0d712e Modified UML diagram 2025-09-28 18:05:03 +02:00
Christian Risi
e433941405 Added BPE
TODO:
- complete the fit method
2025-09-28 18:04:44 +02:00
Christian Risi
b46df4f91a Added Special Encoder 2025-09-28 18:03:47 +02:00
Christian Risi
d179e01971 Added Splitter to divide tokens from text 2025-09-28 18:03:16 +02:00
Christian Risi
b071145f6e Added Chunker 2025-09-28 18:02:06 +02:00
Christian Risi
ed0255e99b Updated imports 2025-09-28 18:01:35 +02:00
Christian Risi
3e8b5c5579 Added test for chunker 2025-09-26 18:50:32 +02:00
Christian Risi
8db35732f9 Added Chunker to restrict our domains 2025-09-26 18:50:23 +02:00
Christian Risi
9552d61f8d Added Exception for when we don't find a delimiter 2025-09-26 18:49:56 +02:00
Christian Risi
be8a87ce01 Modified the architecture for BPE 2025-09-26 18:49:29 +02:00
Christian Risi
5801a819e9 Added vars to make it easier to work here 2025-09-26 18:49:06 +02:00
Christian Risi
3f48b5c428 Added text files to test a chunker 2025-09-26 18:48:44 +02:00
Christian Risi
9972ab8a51 Added imports 2025-09-26 18:48:23 +02:00
GassiGiuseppe
650b37c586 Added VS Code setting to execute Jupyter notebooks from root dir 2025-09-26 11:24:34 +02:00
Christian Risi
90012285b5 UML Diagram to explain bpe workflows 2025-09-25 20:18:21 +02:00
Christian Risi
1bbb4a0999 Added new paper 2025-09-25 20:17:48 +02:00
GassiGiuseppe
e521b0704e deleted TODO in path_splitter_tree, as it was already resolved 2025-09-25 19:19:11 +02:00
Christian Risi
ee0aa583d5 Added Docs for BPE research 2025-09-25 19:10:45 +02:00
Christian Risi
0a698e9837 Added schema to extract from DB for BPE 2025-09-25 19:09:52 +02:00
GassiGiuseppe
9440a562f2 Merge branch 'dev.etl' of https://repositories.communitynotfound.work/PoliBa-DeepLearning/NanoSocrates into dev.etl 2025-09-25 18:33:51 +02:00
Christian Risi
5eda131aac Fixed creation query to be unique even with movieID in RDFs 2025-09-25 17:58:09 +02:00
GassiGiuseppe
57884eaf2e CSV support added to path_splitter_tree
Also resolved a minor bug so that leaf nodes are printed too
2025-09-25 17:57:46 +02:00
Christian Risi
4548a683c2 Fixed DB 2025-09-25 17:57:45 +02:00
GassiGiuseppe
3eec49ffa5 WIP: added test file: clean_relationship.jupyter
to create a first cleaning pipeline
2025-09-25 16:28:24 +02:00
Christian Risi
0bc7f4b227 Fixed Typos 2025-09-25 12:37:52 +02:00
Christian Risi
f28952b0a2 Added todo 2025-09-25 12:00:26 +02:00
Christian Risi
0b626a8e09 Modified query to take all data 2025-09-25 11:53:12 +02:00
Christian Risi
b254098532 Added views to count for subjects and objects 2025-09-25 11:40:44 +02:00
Christian Risi
ee88ffe4cf Added View to filter over relationship counts 2025-09-25 11:32:03 +02:00
Christian Risi
70b4bd8645 Added Complex query 2025-09-25 11:31:34 +02:00
Christian Risi
6316d2bfc4 Added queries to take data from SQL for dataset 2025-09-25 11:27:19 +02:00
Christian Risi
87ca748f45 Updated DB to reflect new changes 2025-09-24 19:29:57 +02:00
Christian Risi
4315d70109 Merged abbreviation_datawarehouse into datawarehouse 2025-09-24 19:29:43 +02:00
Christian Risi
9a5d633b5e Fixed Typos 2025-09-24 19:29:07 +02:00
Christian Risi
a6760cd52d Updated SQL Queries to support parsing in DB 2025-09-24 19:28:55 +02:00
GassiGiuseppe
a7eb92227d Moved all db queries file in their own folder 2025-09-24 16:44:55 +02:00
GassiGiuseppe
9f221e31cd Merge branch 'dev.etl' of https://repositories.communitynotfound.work/PoliBa-DeepLearning/NanoSocrates into dev.etl 2025-09-24 16:32:52 +02:00
GassiGiuseppe
47197194d5 WIP abbreviation_datawarehouse to create an abbreviation system 2025-09-24 16:32:09 +02:00
Christian Risi
0cdbf6f624 Added query to retrieve a dirty dataset from SQLite DB 2025-09-24 16:15:47 +02:00
Christian Risi
3e30489f86 Updated Queries for DB 2025-09-24 14:44:53 +02:00