200 Commits

Author SHA1 Message Date
GassiGiuseppe
76200d936d added first classes (Encoder, Decoder, Attention) for the model 2025-10-04 21:07:58 +02:00
Christian Risi
9b656e7918 Added a playground to test the embedding phase 2025-10-04 19:43:42 +02:00
Christian Risi
9a797a0485 Added embedder code for "Attention is all you need" 2025-10-04 19:43:25 +02:00
Christian Risi
3b274ad807 Added a way to take the default special token list 2025-10-04 19:43:02 +02:00
Christian Risi
8f5e2f2f0d Modifications 2025-10-04 19:42:45 +02:00
Christian Risi
da0bdf703b Added a way to see vocabulary size 2025-10-04 19:42:29 +02:00
Christian Risi
03cdca1f00 Modified imports for BPE 2025-10-04 19:42:02 +02:00
Christian Risi
7188c8678a Added imports for Embedder 2025-10-04 19:41:48 +02:00
Christian Risi
1eef25a697 Merge branch 'dev' into dev.embedder 2025-10-04 19:04:03 +02:00
Christian Risi
e9165fb146 Merge branch 'dev.bpe' into dev 2025-10-04 19:03:09 +02:00
GassiGiuseppe
bbadd4c521 update cleaning pipeline with a new method to filter also by number of films, also updated the signature of the pipeline 2025-10-04 19:00:05 +02:00
GassiGiuseppe
c2f9344c82 little test file 2025-10-04 18:58:20 +02:00
GassiGiuseppe
25f3a5d221 Logic to test BPE 2025-10-04 18:58:04 +02:00
Christian Risi
e8ff82c40a Updated with tasks architectures 2025-10-04 10:57:12 +02:00
Christian Risi
23d1eaf99e Fixed a rare bug over training multiple times 2025-10-04 10:47:39 +02:00
Christian Risi
25a6ad1254 Added model high level architecture 2025-10-03 23:37:16 +02:00
Christian Risi
460d4f5188 Renamed directory to Playgrounds 2025-10-03 22:59:43 +02:00
Christian Risi
c6ac6df2c2 Added stubs for other libraries 2025-10-03 20:28:23 +02:00
Christian Risi
15baba54ab Sanity check to autodetect Device 2025-10-03 20:16:01 +02:00
Christian Risi
87f24878f4 Added shims for utils on using Pytorch 2025-10-03 20:11:14 +02:00
Christian Risi
999141f886 Merge branch 'dev' into dev.embedder 2025-10-03 18:08:34 +02:00
Christian Risi
8e095ebb7a Added papers stub 2025-10-03 18:02:27 +02:00
Christian Risi
149deb407d added cache directories 2025-10-03 18:01:05 +02:00
Christian Risi
8a21cb1b73 added python analysis 2025-10-03 18:00:52 +02:00
Christian Risi
d2a3dfe90f Fixed bug 2025-10-03 17:59:46 +02:00
GassiGiuseppe
0f95aeb122 toy dictionary for bpe implemented 2025-10-03 16:26:01 +02:00
Christian Risi
0ee6e48004 Fixed the same bug as before, but this time is correct 2025-10-03 16:09:53 +02:00
Christian Risi
55e0d2ac23 Fixed an encoding bug 2025-10-03 16:08:11 +02:00
Christian Risi
9c5f42153f fixed typos 2025-10-03 15:17:44 +02:00
Christian Risi
c74689d01d Fixed tests to reflect new version of tokenizer 2025-10-03 13:27:38 +02:00
Christian Risi
51f491d033 fixed typos 2025-10-03 13:27:17 +02:00
Christian Risi
c5c0c61f79 Fix of bugs and semantics 2025-10-03 13:26:58 +02:00
Christian Risi
6b9cb7cd35 Modified imports 2025-10-03 13:26:42 +02:00
Christian Risi
e8894504c6 Fixed a bug where a token (int) was yielded instead of a list of int 2025-10-03 11:44:44 +02:00
GassiGiuseppe
845d645348 added some stubs on special_regex_maker 2025-10-03 10:38:35 +02:00
GassiGiuseppe
09f7b39512 test files updated 2025-10-03 01:04:47 +02:00
GassiGiuseppe
070dc1b744 implemented token nano for the BPE encoding/decoding 2025-10-03 01:04:06 +02:00
GassiGiuseppe
8121c75a09 Updated NanoSocratesSplitter to split also token in decode phase 2025-10-03 01:00:36 +02:00
GassiGiuseppe
a5b8692a77 Updated NanoSocratesSpecial to work with TokeNano 2025-10-03 00:59:15 +02:00
GassiGiuseppe
7c935d2700 Update NanoSocratesBPE: corrected a minor bug about dictionary length, added some comments to make the code clearer 2025-10-03 00:57:19 +02:00
Christian Risi
a1d143187d corrected test to reflect changes in BPE trainer 2025-10-02 20:11:43 +02:00
GassiGiuseppe
0eef2148a9 in NanoSocratesBPE: encode() method rewritten and tested 2025-10-02 12:12:44 +02:00
Christian Risi
856bd8909c Added threshold 2025-10-02 11:02:03 +02:00
Christian Risi
2e595a3a23 Changed training phase to take directly data instead of its encode 2025-10-02 09:56:44 +02:00
Christian Risi
2194cc7b4f Changed test to use pool trainer 2025-10-02 09:56:05 +02:00
Christian Risi
1eae8582b2 Fixed decoding phase 2025-10-02 09:33:58 +02:00
Christian Risi
eadba1fb82 Corrected test to reflect changes in NanoSocratesBPE 2025-10-02 09:33:47 +02:00
Christian Risi
aa765b4555 Added time checking 2025-10-02 08:48:45 +02:00
Christian Risi
17d82f0a4e Added support to resume workload 2025-10-02 08:48:28 +02:00
Christian Risi
0975c19e69 added new method to encode from list of tokens 2025-10-02 08:48:13 +02:00