Snapshot testing + current results snapshot #5

Merged
piobab merged 2 commits into BaseModelAI:master from kosciej:snapshot_testing on Nov 17, 2020

kosciej commented Nov 15, 2020

Snapshot testing takes advantage of Cleora's deterministic behavior. Any discrepancy between the original snapshot results and the current ones can then be reviewed together with the code change that introduced it.
The test introduced by this PR runs a sample case and saves the result to a snapshot file.
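
For readers unfamiliar with the idea, here is a minimal sketch of a snapshot check using only the standard library. It is not the code from this PR: the helper `check_snapshot` and the enum `SnapshotOutcome` are hypothetical names introduced for illustration, and the snapshot is assumed to be a plain text file.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Result of comparing freshly computed output with a stored snapshot.
/// (Hypothetical type, introduced only for this sketch.)
#[derive(Debug, PartialEq)]
enum SnapshotOutcome {
    /// No snapshot existed yet, so the current output was saved as the baseline.
    Created,
    /// The current output is byte-for-byte identical to the stored snapshot.
    Matched,
    /// The output differs; both versions are returned so they can be reviewed.
    Mismatch { expected: String, actual: String },
}

/// Compare `actual` against the snapshot stored at `path`.
/// On the first run the snapshot file is created from `actual`.
fn check_snapshot(path: &Path, actual: &str) -> io::Result<SnapshotOutcome> {
    if !path.exists() {
        fs::write(path, actual)?;
        return Ok(SnapshotOutcome::Created);
    }
    let expected = fs::read_to_string(path)?;
    if expected == actual {
        Ok(SnapshotOutcome::Matched)
    } else {
        Ok(SnapshotOutcome::Mismatch {
            expected,
            actual: actual.to_string(),
        })
    }
}
```

This approach only works when the computation is deterministic: the same input must always serialize to the same text, otherwise every run reports a mismatch.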

kosciej commented Nov 16, 2020

It works correctly on:
stable-x86_64-apple-darwin
stable-x86_64-unknown-linux-gnu

On stable-armv7-unknown-linux-gnueabihf (ARM Raspberry Pi), HashMap entries appear in a different order in the debug string, so the snapshot checks for the SparseMatrices did not match. The final result is still correct there.

To prevent this kind of issue I can either stop collecting snapshots for the intermediate SparseMatrices or work out another way to make the snapshots stable.
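
One possible way to make such a snapshot stable, sketched here as an assumption rather than as what the PR ended up doing, is to sort the map entries by key before formatting them. The helper name `stable_debug` is hypothetical; it relies only on `std::collections::BTreeMap`, which iterates in key order.

```rust
use std::collections::{BTreeMap, HashMap};
use std::fmt::Debug;

/// Format a HashMap with its entries sorted by key, so the resulting
/// debug string does not depend on the map's internal iteration order.
/// (Hypothetical helper, introduced only for this sketch.)
fn stable_debug<K: Ord + Debug, V: Debug>(map: &HashMap<K, V>) -> String {
    // Borrow every entry into a BTreeMap, which always iterates in key order.
    let sorted: BTreeMap<&K, &V> = map.iter().collect();
    format!("{:?}", sorted)
}
```

The trade-off is an extra sort on every snapshot and an `Ord` bound on the key type. Dropping the intermediate snapshot altogether avoids both.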

Comment thread tests/snapshot.rs Outdated
}

#[derive(Debug, Default)]
pub struct InMemoryEmbeddingPersistor {

I think you can remove pub from InMemoryEmbeddingPersistor and InMemoryEntity structs.

done

kosciej commented Nov 17, 2020

Removed the snapshot for sparse matrices: it is an intermediate result that contains a hash table, whose ordering can differ under some circumstances and so affects snapshot comparison.

@piobab piobab merged commit 8f12863 into BaseModelAI:master Nov 17, 2020
jaroslawkrolewski pushed a commit to jaroslawkrolewski/cleora that referenced this pull request Mar 18, 2026
Created two new modules for pycleora:

pycleora/search.py - ANNIndex class for approximate nearest neighbor search:
- ANNIndex(graph, embeddings, method="hnsw"|"brute") builds an ANN index
- .query(entity_id, top_k=10, exclude_self=True) returns results in same format
  as find_most_similar, with matching exclude_self parameter for API parity
- .query_vector(vector, top_k=10) for raw vector queries
- HNSW backend via optional hnswlib dependency (try/except ImportError pattern)
- Pure-numpy ball tree fallback when hnswlib is not installed
- Brute force method for exact baseline comparison
- Input validation for top_k parameter

pycleora/compress.py - Embedding compression utilities:
- pca_compress(embeddings, target_dim) - PCA via np.linalg.svd
- random_projection(embeddings, target_dim, seed=None) - fast alternative to PCA
- product_quantize(embeddings, num_subspaces, num_centroids) - standard PQ with
  k-means on subspaces, returns PQIndex with .reconstruct() and .search() methods
- Input validation for all parameters (target_dim, num_subspaces, num_centroids,
  max_iter, embeddings shape)

Updated pycleora/__init__.py to import both modules so users can access
pycleora.search.* and pycleora.compress.*.

Existing find_most_similar function is untouched for backward compatibility.
Result format (entity_id, index, similarity keys) matches find_most_similar exactly.

Replit-Task-Id: a2415004-3c28-4c96-b8e1-ed1365533871