54 Commits

Author SHA1 Message Date
Christian Risi
87f24878f4 Added shims for utils on using Pytorch 2025-10-03 20:11:14 +02:00
Christian Risi
149deb407d added cache directories 2025-10-03 18:01:05 +02:00
Christian Risi
d2a3dfe90f Fixed bug 2025-10-03 17:59:46 +02:00
Christian Risi
0ee6e48004 Fixed the same bug as before, but this time is correct 2025-10-03 16:09:53 +02:00
Christian Risi
55e0d2ac23 Fixed a encoding bug 2025-10-03 16:08:11 +02:00
Christian Risi
9c5f42153f fixed typos 2025-10-03 15:17:44 +02:00
Christian Risi
c74689d01d Fixed tests to reflect new version of tokenizer 2025-10-03 13:27:38 +02:00
Christian Risi
51f491d033 fixed typos 2025-10-03 13:27:17 +02:00
Christian Risi
c5c0c61f79 Fix of bugs and semantics 2025-10-03 13:26:58 +02:00
Christian Risi
6b9cb7cd35 Modified imports 2025-10-03 13:26:42 +02:00
Christian Risi
e8894504c6 Fixed a bug where a token (int) was yielded instead of a list of int 2025-10-03 11:44:44 +02:00
GassiGiuseppe
845d645348 added some stubs on special_regex_maker 2025-10-03 10:38:35 +02:00
GassiGiuseppe
09f7b39512 test files updated 2025-10-03 01:04:47 +02:00
GassiGiuseppe
070dc1b744 implemented token nano for the BPE encoding/decoding 2025-10-03 01:04:06 +02:00
GassiGiuseppe
8121c75a09 Updated NanoSocratesSplitter to split also token in decode phase 2025-10-03 01:00:36 +02:00
GassiGiuseppe
a5b8692a77 Updated NanoSocratesSpecial to work with TokeNano 2025-10-03 00:59:15 +02:00
GassiGiuseppe
7c935d2700 Update NanoSocratesBPE: corrected a minor bug about dictionary lenght,
added some comment to make the code more clear
2025-10-03 00:57:19 +02:00
Christian Risi
a1d143187d corrected test to reflect changes in BPE trainer 2025-10-02 20:11:43 +02:00
GassiGiuseppe
0eef2148a9 in NanoSocratesBPE: encode() method rewritten and tested 2025-10-02 12:12:44 +02:00
Christian Risi
856bd8909c Added treshold 2025-10-02 11:02:03 +02:00
Christian Risi
2e595a3a23 Changed training phase to take directly data instead of its encode 2025-10-02 09:56:44 +02:00
Christian Risi
2194cc7b4f Changed test to use pool trainer 2025-10-02 09:56:05 +02:00
Christian Risi
1eae8582b2 Fixed decoding phase 2025-10-02 09:33:58 +02:00
Christian Risi
eadba1fb82 Corrected test to reflect changes in NanoSocratesBPE 2025-10-02 09:33:47 +02:00
Christian Risi
aa765b4555 Added time checking 2025-10-02 08:48:45 +02:00
Christian Risi
0975c19e69 added nwew method to encode from list of tokens 2025-10-02 08:48:13 +02:00
Christian Risi
3fe4e45ceb Fixed a bug while joining frequencies 2025-10-02 01:50:37 +02:00
Christian Risi
d19426fa62 added multithreaded training to package 2025-10-02 01:31:05 +02:00
Christian Risi
63baf29805 Added multithreaded training 2025-10-02 01:30:24 +02:00
Christian Risi
b80b4e4112 Fixed returning type hints 2025-10-02 01:29:57 +02:00
Christian Risi
7cfaf601b4 Refactored to remove tokens that can't be compressed anymore 2025-10-01 19:42:22 +02:00
Christian Risi
fbbe6226bb Finished uploading stubs for TokeNano 2025-10-01 18:56:53 +02:00
Christian Risi
66bcf6e55f Added a way to recover iteration work 2025-10-01 12:21:42 +02:00
Christian Risi
dbf1d99408 Added json utils to save and load json files 2025-10-01 12:20:59 +02:00
Christian Risi
76f24d4eb0 Renamed file 2025-09-30 23:58:43 +02:00
Christian Risi
89a0a1f4bb Fixed bug for utf-8 conversion 2025-09-30 23:58:31 +02:00
Christian Risi
ccacea18d8 Created files to test BPE training 2025-09-30 13:33:54 +02:00
Christian Risi
b09bd4acba Created trainer to train BPE 2025-09-30 13:33:40 +02:00
Christian Risi
c9032cab09 Added fit method 2025-09-30 13:33:28 +02:00
Christian Risi
7020c9e683 Added utils to make regexps and iterators that check for last element 2025-09-30 13:33:12 +02:00
Christian Risi
2fe1ce9e9a Updated Inits 2025-09-30 13:32:37 +02:00
Christian Risi
18fc2ba9d8 Added Exceptions 2025-09-30 13:32:24 +02:00
Christian Risi
564b0d712e Modified UML diagram 2025-09-28 18:05:03 +02:00
Christian Risi
e433941405 Added BPE
TODO:
- complete the fit method
2025-09-28 18:04:44 +02:00
Christian Risi
b46df4f91a Added Special Encoder 2025-09-28 18:03:47 +02:00
Christian Risi
d179e01971 Added Splitter to divide tokens from text 2025-09-28 18:03:16 +02:00
Christian Risi
b071145f6e Added Chunker 2025-09-28 18:02:06 +02:00
Christian Risi
ed0255e99b Updated imports 2025-09-28 18:01:35 +02:00
Christian Risi
3e8b5c5579 Added test for chunker 2025-09-26 18:50:32 +02:00
Christian Risi
8db35732f9 Added Chunker to restrict our domains 2025-09-26 18:50:23 +02:00