Snapshot testing + current results snapshot by kosciej · Pull Request #5 · BaseModelAI/cleora

kosciej · 2020-11-15T21:34:34Z

Snapshot testing takes advantage of the deterministic nature of Cleora. Any discrepancy between the original snapshot results and the current ones can then be reviewed together with the code change that introduced it.
The test introduced by this PR runs a sample case and saves a snapshot file.
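
For readers unfamiliar with the technique, the idea described above can be sketched with the standard library alone. This is a hand-rolled illustration, not the test code added by this PR: `check_snapshot`, `SnapshotOutcome`, `sample_embedding`, and the snapshot path are hypothetical names introduced only for this example.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Outcome of comparing a value against its stored snapshot.
#[derive(Debug, PartialEq)]
pub enum SnapshotOutcome {
    /// No snapshot existed yet, so the current output was recorded.
    Created,
    /// The stored snapshot equals the current output.
    Matched,
    /// The stored snapshot differs; both versions are returned for review.
    Mismatch { expected: String, actual: String },
}

/// Compares `actual` with the snapshot stored at `path`.
/// On the first run the snapshot file is written instead of compared.
pub fn check_snapshot(path: &Path, actual: &str) -> io::Result<SnapshotOutcome> {
    if !path.exists() {
        fs::write(path, actual)?;
        return Ok(SnapshotOutcome::Created);
    }
    let expected = fs::read_to_string(path)?;
    if expected == actual {
        Ok(SnapshotOutcome::Matched)
    } else {
        Ok(SnapshotOutcome::Mismatch {
            expected,
            actual: actual.to_string(),
        })
    }
}

/// Hypothetical stand-in for a deterministic computation such as a sample run.
pub fn sample_embedding() -> Vec<f32> {
    vec![0.25, 0.5, 0.75]
}

#[test]
fn sample_case_matches_snapshot() {
    // Assumed location; a real project would keep the snapshot in the repository.
    let path = std::env::temp_dir().join("sample_case.snap");
    let actual = format!("{:?}", sample_embedding());
    match check_snapshot(&path, &actual).unwrap() {
        SnapshotOutcome::Mismatch { expected, actual } => {
            panic!("snapshot differs\nexpected: {}\nactual:   {}", expected, actual)
        }
        // Created and Matched both count as a pass in this sketch.
        _ => {}
    }
}
```

Because the computation is deterministic, a mismatch can only come from a code change, which is what makes the stored file useful as a review artifact.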

kosciej · 2020-11-16T07:53:55Z

It works correctly on:
stable-x86_64-apple-darwin
stable-x86_64-unknown-linux-gnu

On stable-armv7-unknown-linux-gnueabihf (ARM Raspberry Pi), the order of HashMap entries in the debug string is different, so the snapshot check fails for SparseMatrices. The snapshot of the final result still matches.

To prevent this kind of issue, I can either stop collecting snapshots for the intermediate SparseMatrices or work out another way to make the snapshots stable.
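
One possible shape for such a solution, shown only as a sketch: render the map through a key-sorted view before writing it into the snapshot, so the text no longer depends on hash iteration order. `stable_debug` is a hypothetical helper, not part of Cleora, and it assumes the keys implement `Ord`.

```rust
use std::collections::{BTreeMap, HashMap};
use std::fmt::Debug;

/// Renders a HashMap with its entries sorted by key, so the resulting text
/// does not depend on the map's internal iteration order.
pub fn stable_debug<K, V>(map: &HashMap<K, V>) -> String
where
    K: Ord + Debug,
    V: Debug,
{
    // BTreeMap iterates in key order; collecting references avoids cloning.
    let sorted: BTreeMap<&K, &V> = map.iter().collect();
    format!("{:?}", sorted)
}
```

The cost is an extra sort per snapshot, which is usually acceptable in tests. Nested maps would need the same treatment at every level, which is one reason dropping the intermediate snapshot can be the simpler choice.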

piobab · 2020-11-16T21:58:53Z

+}
+
+#[derive(Debug, Default)]
+pub struct InMemoryEmbeddingPersistor {


I think you can remove pub from InMemoryEmbeddingPersistor and InMemoryEntity structs.

kosciej · 2020-11-17T06:37:06Z

Removed the snapshot for sparse matrices, as it is an intermediate result that contains a hash table, which under some circumstances has a different ordering (this affects snapshot comparison).

Created two new modules for pycleora:

pycleora/search.py - ANNIndex class for approximate nearest neighbor search:
- ANNIndex(graph, embeddings, method="hnsw"|"brute") builds an ANN index
- .query(entity_id, top_k=10, exclude_self=True) returns results in same format as find_most_similar, with matching exclude_self parameter for API parity
- .query_vector(vector, top_k=10) for raw vector queries
- HNSW backend via optional hnswlib dependency (try/except ImportError pattern)
- Pure-numpy ball tree fallback when hnswlib is not installed
- Brute force method for exact baseline comparison
- Input validation for top_k parameter

pycleora/compress.py - Embedding compression utilities:
- pca_compress(embeddings, target_dim) - PCA via np.linalg.svd
- random_projection(embeddings, target_dim, seed=None) - fast alternative to PCA
- product_quantize(embeddings, num_subspaces, num_centroids) - standard PQ with k-means on subspaces, returns PQIndex with .reconstruct() and .search() methods
- Input validation for all parameters (target_dim, num_subspaces, num_centroids, max_iter, embeddings shape)

Updated pycleora/__init__.py to import both modules so users can access pycleora.search.* and pycleora.compress.*. Existing find_most_similar function is untouched for backward compatibility. Result format (entity_id, index, similarity keys) matches find_most_similar exactly.

Replit-Task-Id: a2415004-3c28-4c96-b8e1-ed1365533871

Snapshot testing + current results snapshot

f12fd92

piobab reviewed Nov 16, 2020


Removed snapshot for sparse matrices, removed pub on local test structs

73e80e6

piobab approved these changes Nov 17, 2020


piobab merged commit 8f12863 into BaseModelAI:master Nov 17, 2020

piobab merged 2 commits into BaseModelAI:master from kosciej:snapshot_testing
