Christian Risi
|
c74689d01d
|
Fixed tests to reflect new version of tokenizer
|
2025-10-03 13:27:38 +02:00 |
|
Christian Risi
|
51f491d033
|
fixed typos
|
2025-10-03 13:27:17 +02:00 |
|
Christian Risi
|
c5c0c61f79
|
Fix of bugs and semantics
|
2025-10-03 13:26:58 +02:00 |
|
Christian Risi
|
6b9cb7cd35
|
Modified imports
|
2025-10-03 13:26:42 +02:00 |
|
Christian Risi
|
e8894504c6
|
Fixed a bug where a token (int) was yielded instead of a list of int
|
2025-10-03 11:44:44 +02:00 |
|
GassiGiuseppe
|
845d645348
|
added some stubs on special_regex_maker
|
2025-10-03 10:38:35 +02:00 |
|
GassiGiuseppe
|
09f7b39512
|
test files updated
|
2025-10-03 01:04:47 +02:00 |
|
GassiGiuseppe
|
070dc1b744
|
implemented token nano for the BPE encoding/decoding
|
2025-10-03 01:04:06 +02:00 |
|
GassiGiuseppe
|
8121c75a09
|
Updated NanoSocratesSplitter to also split tokens in decode phase
|
2025-10-03 01:00:36 +02:00 |
|
GassiGiuseppe
|
a5b8692a77
|
Updated NanoSocratesSpecial to work with TokeNano
|
2025-10-03 00:59:15 +02:00 |
|
GassiGiuseppe
|
7c935d2700
|
Update NanoSocratesBPE: corrected a minor bug about dictionary length,
added some comments to make the code clearer
|
2025-10-03 00:57:19 +02:00 |
|
Christian Risi
|
a1d143187d
|
corrected test to reflect changes in BPE trainer
|
2025-10-02 20:11:43 +02:00 |
|
GassiGiuseppe
|
0eef2148a9
|
in NanoSocratesBPE: encode() method rewritten and tested
|
2025-10-02 12:12:44 +02:00 |
|
Christian Risi
|
856bd8909c
|
Added threshold
|
2025-10-02 11:02:03 +02:00 |
|
Christian Risi
|
2e595a3a23
|
Changed training phase to take the data directly instead of its encoding
|
2025-10-02 09:56:44 +02:00 |
|
Christian Risi
|
2194cc7b4f
|
Changed test to use pool trainer
|
2025-10-02 09:56:05 +02:00 |
|
Christian Risi
|
1eae8582b2
|
Fixed decoding phase
|
2025-10-02 09:33:58 +02:00 |
|
Christian Risi
|
eadba1fb82
|
Corrected test to reflect changes in NanoSocratesBPE
|
2025-10-02 09:33:47 +02:00 |
|
Christian Risi
|
aa765b4555
|
Added time checking
|
2025-10-02 08:48:45 +02:00 |
|
Christian Risi
|
17d82f0a4e
|
Added support to resume workload
|
2025-10-02 08:48:28 +02:00 |
|
Christian Risi
|
0975c19e69
|
added new method to encode from list of tokens
|
2025-10-02 08:48:13 +02:00 |
|
Christian Risi
|
3fe4e45ceb
|
Fixed a bug while joining frequencies
|
2025-10-02 01:50:37 +02:00 |
|
Christian Risi
|
d19426fa62
|
added multithreaded training to package
|
2025-10-02 01:31:05 +02:00 |
|
Christian Risi
|
63baf29805
|
Added multithreaded training
|
2025-10-02 01:30:24 +02:00 |
|
Christian Risi
|
b80b4e4112
|
Fixed returning type hints
|
2025-10-02 01:29:57 +02:00 |
|
Christian Risi
|
7cfaf601b4
|
Refactored to remove tokens that can't be compressed anymore
|
2025-10-01 19:42:22 +02:00 |
|
Christian Risi
|
fbbe6226bb
|
Finished uploading stubs for TokeNano
|
2025-10-01 18:56:53 +02:00 |
|
Christian Risi
|
b3d444979f
|
Added flag to resume work correctly
|
2025-10-01 12:22:09 +02:00 |
|
Christian Risi
|
66bcf6e55f
|
Added a way to recover iteration work
|
2025-10-01 12:21:42 +02:00 |
|
Christian Risi
|
dbf1d99408
|
Added json utils to save and load json files
|
2025-10-01 12:20:59 +02:00 |
|
Christian Risi
|
97bac464f3
|
Fixed JSON incompatibility
|
2025-10-01 00:32:43 +02:00 |
|
Christian Risi
|
9a8e726d74
|
Added cdebug configuration
|
2025-10-01 00:22:22 +02:00 |
|
Christian Risi
|
7ab9b0358e
|
Added script to run BPE
|
2025-09-30 23:59:09 +02:00 |
|
Christian Risi
|
30c2938d29
|
Fixed typing
|
2025-09-30 23:58:54 +02:00 |
|
Christian Risi
|
76f24d4eb0
|
Renamed file
|
2025-09-30 23:58:43 +02:00 |
|
Christian Risi
|
89a0a1f4bb
|
Fixed bug for utf-8 conversion
|
2025-09-30 23:58:31 +02:00 |
|
Christian Risi
|
ccacea18d8
|
Created files to test BPE training
|
2025-09-30 13:33:54 +02:00 |
|
Christian Risi
|
b09bd4acba
|
Created trainer to train BPE
|
2025-09-30 13:33:40 +02:00 |
|
Christian Risi
|
c9032cab09
|
Added fit method
|
2025-09-30 13:33:28 +02:00 |
|
Christian Risi
|
7020c9e683
|
Added utils to make regexps and iterators that check for last element
|
2025-09-30 13:33:12 +02:00 |
|
Christian Risi
|
2fe1ce9e9a
|
Updated Inits
|
2025-09-30 13:32:37 +02:00 |
|
Christian Risi
|
18fc2ba9d8
|
Added Exceptions
|
2025-09-30 13:32:24 +02:00 |
|
Christian Risi
|
5acee1d1a5
|
Merge branch 'dev' into dev.bpe
|
2025-09-30 11:35:27 +02:00 |
|
|
|
2e36753da4
|
Merge pull request 'dev.etl' (#5) from dev.etl into dev
Reviewed-on: #5
|
2025-09-30 11:28:57 +02:00 |
|
GassiGiuseppe
|
007f1e9554
|
minor updates
|
2025-09-29 18:53:33 +02:00 |
|
GassiGiuseppe
|
c319398ca0
|
little update to UML pipeline
|
2025-09-29 17:03:31 +02:00 |
|
GassiGiuseppe
|
255d8a072d
|
First implementation of the cleaning pipeline UML
|
2025-09-29 16:59:52 +02:00 |
|
GassiGiuseppe
|
8167c9d435
|
Added Toy Dataset entry point into the Pipeline class
Previously it was forced into the sql_endpoint;
now the whole pipeline can be managed in the Pipeline class
|
2025-09-29 16:03:49 +02:00 |
|
GassiGiuseppe
|
bd72ad3571
|
Added file to execute the complete cleaning pipeline
|
2025-09-29 15:21:26 +02:00 |
|
GassiGiuseppe
|
6ddb7de9da
|
Added sqlAlchemy to requirements
|
2025-09-29 15:19:19 +02:00 |
|