Commit Graph

  • f801afe0e4 Merge branch 'dev.embedder' of https://repositories.communitynotfound.work/PoliBa-DeepLearning/NanoSocrates into dev.embedder GassiGiuseppe 2025-10-07 17:42:21 +02:00
  • b4ee8362a2 WIP training Batching GassiGiuseppe 2025-10-07 17:41:53 +02:00
  • 3021a51961 Merge branch 'dev.embedder' of https://repositories.communitynotfound.work/PoliBa-DeepLearning/NanoSocrates into dev.embedder Christian Risi 2025-10-07 16:38:12 +02:00
  • 99b5198c9a WIP Christian Risi 2025-10-07 16:38:08 +02:00
  • b97282179d Fixed a bug about sequence normalizations Christian Risi 2025-10-07 16:37:43 +02:00
  • fdece42462 Made model Batch ready Christian Risi 2025-10-07 16:37:20 +02:00
  • 109ad9f36b Changed Imports Christian Risi 2025-10-07 16:36:59 +02:00
  • fef933da9d Added <PAD> and moved <END> Token Christian Risi 2025-10-07 16:36:45 +02:00
  • c65f5e66fe Uploaded all playgrounds Christian Risi 2025-10-07 16:36:26 +02:00
  • f9545aca1d Deleted MultiHeadAttention Christian Risi 2025-10-07 16:36:11 +02:00
  • 1d23b9cc8b little snippet to trim big dictionaries dev.bpe GassiGiuseppe 2025-10-07 16:05:32 +02:00
  • a04f4c7cb7 changes to shorten the dataset GassiGiuseppe 2025-10-07 15:49:25 +02:00
  • 490edcfd53 WIP Batcher GassiGiuseppe 2025-10-07 15:36:51 +02:00
  • 9b5bb6d5f8 Added support for batches Christian Risi 2025-10-07 12:15:03 +02:00
  • a93e61b8c1 Update ETL GassiGiuseppe 2025-10-07 00:54:00 +02:00
  • 14b810c451 WIP NanoSocratesEmbedder for batching GassiGiuseppe 2025-10-06 21:41:45 +02:00
  • 56d438f01a WIP NanoSocratesCore GassiGiuseppe 2025-10-06 18:21:27 +02:00
  • 745424a978 new special token for start sequence in decoder GassiGiuseppe 2025-10-06 18:21:10 +02:00
  • e1549d4458 Modified encoder and decoder for sequential architecture GassiGiuseppe 2025-10-06 18:20:46 +02:00
  • 456ce724fe Added capability of returning target after truncating Christian Risi 2025-10-06 17:43:01 +02:00
  • 44307cd917 Added util to create padding mask Christian Risi 2025-10-06 17:29:05 +02:00
  • ffdb312d58 Added a util to create truncated RDF lists Christian Risi 2025-10-06 17:22:13 +02:00
  • 0007c38212 Added a util to make masked inference Christian Risi 2025-10-06 17:02:06 +02:00
  • 9c1043e0ba Added post tokenization utils Christian Risi 2025-10-06 17:01:18 +02:00
  • ee8e56798c Added new utils Christian Risi 2025-10-06 17:00:55 +02:00
  • 1797571bb2 Added test to see if illegal tokens were included in target Christian Risi 2025-10-06 16:17:12 +02:00
  • e93710af08 Fixed illegal tokens being added in target output Christian Risi 2025-10-06 16:16:47 +02:00
  • d3bba9b944 Added actual test Christian Risi 2025-10-06 16:06:17 +02:00
  • b1e7af0607 Merge branch 'dev.embedder' of https://repositories.communitynotfound.work/PoliBa-DeepLearning/NanoSocrates into dev.embedder Christian Risi 2025-10-06 15:55:44 +02:00
  • d3b1f7da91 Added testing for spanned masking Christian Risi 2025-10-06 15:55:40 +02:00
  • c217f5dec9 Added 2 types of masking Christian Risi 2025-10-06 15:45:45 +02:00
  • 49f0beb6ea Updated imports Christian Risi 2025-10-06 15:45:28 +02:00
  • 05bb460999 file to test batch attention mask GassiGiuseppe 2025-10-06 13:03:20 +02:00
  • 948c3fd7ac update to batch attention mask GassiGiuseppe 2025-10-06 13:03:03 +02:00
  • 87409fecd5 added method for batched attention_mask GassiGiuseppe 2025-10-06 12:00:11 +02:00
  • 0373460105 Movie filters updated GassiGiuseppe 2025-10-06 10:57:50 +02:00
  • 7e40a36701 wip: NanoSocratesCore GassiGiuseppe 2025-10-05 22:58:06 +02:00
  • d48815cca2 added task_type and updated init GassiGiuseppe 2025-10-05 18:58:42 +02:00
  • 0f243eaac2 added padding_mask entry to decoder and encoder GassiGiuseppe 2025-10-05 18:46:06 +02:00
  • 9c83d9fa71 Merge branch 'dev.embedder' of https://repositories.communitynotfound.work/PoliBa-DeepLearning/NanoSocrates into dev.embedder GassiGiuseppe 2025-10-05 18:45:33 +02:00
  • a693cbb77e A set of utils for our pipeline Christian Risi 2025-10-05 18:37:43 +02:00
  • 6f219f634f Added attention_mask GassiGiuseppe 2025-10-05 17:49:01 +02:00
  • b303affd18 updated uml of the model GassiGiuseppe 2025-10-05 16:40:19 +02:00
  • 53c4decac7 Added playgrounds for the architecture Christian Risi 2025-10-05 16:30:23 +02:00
  • c60da8ba82 Refactoring Christian Risi 2025-10-05 15:40:29 +02:00
  • 7307916891 update sql_endpoint to work with the new pipeline GassiGiuseppe 2025-10-05 14:58:03 +02:00
  • acb43fc899 new faster pipeline GassiGiuseppe 2025-10-05 14:57:45 +02:00
  • 255d801a80 updated the rdf_mask_task mask; however, since the model will build the mask itself, it is deprecated GassiGiuseppe 2025-10-05 14:56:33 +02:00
  • 2bd24ec278 Created legacy folder for old pipeline; this pipeline still works but is slower than the new one, and some of its methods can be used later GassiGiuseppe 2025-10-05 14:54:32 +02:00
  • 3b5e6c099c Merge branch 'dev' into dev.embedder Christian Risi 2025-10-05 11:17:09 +02:00
  • ba3a718480 Merge branch 'dev.etl' into dev Christian Risi 2025-10-05 11:16:54 +02:00
  • 69fba7c3e9 new utility to generate a csv debug file of the output of the pipeline GassiGiuseppe 2025-10-04 21:33:09 +02:00
  • 76200d936d added first classes (Encoder, Decoder, Attention) for the model GassiGiuseppe 2025-10-04 21:07:58 +02:00
  • 9b656e7918 Added a playground to test the embedding phase Christian Risi 2025-10-04 19:43:42 +02:00
  • 9a797a0485 Added embedder code for "Attention is all you need" Christian Risi 2025-10-04 19:43:25 +02:00
  • 3b274ad807 Added a way to take the default special token list Christian Risi 2025-10-04 19:43:02 +02:00
  • 8f5e2f2f0d Modifications Christian Risi 2025-10-04 19:42:45 +02:00
  • da0bdf703b Added a way to see vocabulary size Christian Risi 2025-10-04 19:42:29 +02:00
  • 03cdca1f00 Modified imports for BPE Christian Risi 2025-10-04 19:42:02 +02:00
  • 7188c8678a Added imports for Embedder Christian Risi 2025-10-04 19:41:48 +02:00
  • 1eef25a697 Merge branch 'dev' into dev.embedder Christian Risi 2025-10-04 19:04:03 +02:00
  • 165290162c added tokenano to the init GassiGiuseppe 2025-10-04 19:03:56 +02:00
  • e9165fb146 Merge branch 'dev.bpe' into dev Christian Risi 2025-10-04 19:03:09 +02:00
  • 502016f843 a new exasperated way to train the bpe, just a wild experiment that could be useful later GassiGiuseppe 2025-10-04 19:03:07 +02:00
  • 845c63dbef updated tokenano to be easier to read GassiGiuseppe 2025-10-04 19:01:21 +02:00
  • bbadd4c521 update cleaning pipeline with a new method to filter also by number of films, also updated the signature of the pipeline GassiGiuseppe 2025-10-04 19:00:05 +02:00
  • c2f9344c82 little test file GassiGiuseppe 2025-10-04 18:58:20 +02:00
  • 25f3a5d221 Logic to test BPE GassiGiuseppe 2025-10-04 18:58:04 +02:00
  • e8ff82c40a Updated with tasks architectures Christian Risi 2025-10-04 10:57:12 +02:00
  • 23d1eaf99e Fixed a rare bug over training multiple times Christian Risi 2025-10-04 10:47:39 +02:00
  • 25a6ad1254 Added model high level architecture Christian Risi 2025-10-03 23:37:16 +02:00
  • 460d4f5188 Renamed directory to Playgrounds Christian Risi 2025-10-03 22:59:43 +02:00
  • c6ac6df2c2 Added stubs for other libraries Christian Risi 2025-10-03 20:28:23 +02:00
  • 15baba54ab Sanity check to autodetect Device Christian Risi 2025-10-03 20:16:01 +02:00
  • 87f24878f4 Added shims for utils on using Pytorch Christian Risi 2025-10-03 20:11:14 +02:00
  • 999141f886 Merge branch 'dev' into dev.embedder Christian Risi 2025-10-03 18:08:34 +02:00
  • 8e095ebb7a Added papers stub Christian Risi 2025-10-03 18:02:27 +02:00
  • 149deb407d added cache directories Christian Risi 2025-10-03 18:01:05 +02:00
  • 8a21cb1b73 added python analysis Christian Risi 2025-10-03 18:00:52 +02:00
  • d2a3dfe90f Fixed bug Christian Risi 2025-10-03 17:59:46 +02:00
  • 0f95aeb122 toy dictionary for bpe implemented GassiGiuseppe 2025-10-03 16:26:01 +02:00
  • 0ee6e48004 Fixed the same bug as before, but this time it is correct Christian Risi 2025-10-03 16:09:53 +02:00
  • 55e0d2ac23 Fixed an encoding bug Christian Risi 2025-10-03 16:08:11 +02:00
  • 9c5f42153f fixed typos Christian Risi 2025-10-03 15:17:44 +02:00
  • c74689d01d Fixed tests to reflect new version of tokenizer Christian Risi 2025-10-03 13:27:38 +02:00
  • 51f491d033 fixed typos Christian Risi 2025-10-03 13:27:17 +02:00
  • c5c0c61f79 Fix of bugs and semantics Christian Risi 2025-10-03 13:26:58 +02:00
  • 6b9cb7cd35 Modified imports Christian Risi 2025-10-03 13:26:42 +02:00
  • e8894504c6 Fixed a bug where a token (int) was yielded instead of a list of int Christian Risi 2025-10-03 11:44:44 +02:00
  • 845d645348 added some stubs on special_regex_maker GassiGiuseppe 2025-10-03 10:38:35 +02:00
  • 09f7b39512 test files updated GassiGiuseppe 2025-10-03 01:04:47 +02:00
  • 070dc1b744 implemented token nano for the BPE encoding/decoding GassiGiuseppe 2025-10-03 01:04:06 +02:00
  • 8121c75a09 Updated NanoSocratesSplitter to also split tokens in decode phase GassiGiuseppe 2025-10-03 01:00:36 +02:00
  • a5b8692a77 Updated NanoSocratesSpecial to work with TokeNano GassiGiuseppe 2025-10-03 00:59:15 +02:00
  • 7c935d2700 Update NanoSocratesBPE: corrected a minor bug about dictionary length, added some comments to make the code clearer GassiGiuseppe 2025-10-03 00:57:19 +02:00
  • a1d143187d corrected test to reflect changes in BPE trainer Christian Risi 2025-10-02 20:11:43 +02:00
  • 0eef2148a9 in NanoSocratesBPE: encode() method rewritten and tested GassiGiuseppe 2025-10-02 12:12:44 +02:00
  • 856bd8909c Added threshold Christian Risi 2025-10-02 11:02:03 +02:00
  • 2e595a3a23 Changed training phase to take directly data instead of its encode Christian Risi 2025-10-02 09:56:44 +02:00
  • 2194cc7b4f Changed test to use pool trainer Christian Risi 2025-10-02 09:56:05 +02:00