public class SimilarityIndex extends Object
This structure can be used to compute an approximation of the similarity between two files. The index is used by SimilarityRenameDetector to compute scores between files.
To save space in memory, this index uses a space-efficient encoding that will not exceed 1 MiB per instance. The index starts out at a smaller size (closer to 2 KiB), but may grow as more distinct blocks within the scanned file are discovered.
Modifier and Type | Class and Description |
---|---|
static class | SimilarityIndex.TableFullException: Thrown by create() when file is too large. |
Modifier and Type | Field and Description |
---|---|
static SimilarityIndex.TableFullException | TABLE_FULL_OUT_OF_MEMORY: A special SimilarityIndex.TableFullException used in place of OutOfMemoryError. |
Modifier and Type | Method and Description |
---|---|
static SimilarityIndex | create(ObjectLoader obj): Create a new similarity index for the given object. |
int | score(SimilarityIndex dst, int maxScore): Compute the similarity score between this index and another. |
public static final SimilarityIndex.TableFullException TABLE_FULL_OUT_OF_MEMORY
A special SimilarityIndex.TableFullException used in place of OutOfMemoryError.

public static SimilarityIndex create(ObjectLoader obj) throws IOException, SimilarityIndex.TableFullException
obj - the object to hash
IOException - file contents cannot be read from the repository.
SimilarityIndex.TableFullException - object hashing overflowed the storage capacity of the SimilarityIndex.

public int score(SimilarityIndex dst, int maxScore)
A region of a file is defined as a line in a text file or a fixed-size block in a binary file. To prepare an index, each region in the file is hashed; the values and counts of hashes are retained in a sorted table. Define the similarity fraction F as the count of matching regions between the two files divided by the maximum count of regions in either file. The similarity score is F multiplied by the maxScore constant, yielding a range [0, maxScore]. It is defined as maxScore for the degenerate case of two empty files.
The similarity score is symmetrical; i.e. a.score(b) == b.score(a).
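The scoring rule described above can be illustrated with a small, standalone sketch. This is not the JGit implementation: the class name SimilaritySketch and its helper methods are hypothetical, it treats only newline-delimited lines as regions, and it keys a HashMap on the line text itself where the description above keeps hash values in a sorted table. It only shows how F and the final score follow from the region counts.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the scoring rule; not part of JGit.
final class SimilaritySketch {

    /** Counts occurrences of each line; a HashMap stands in for the sorted hash table. */
    static Map<String, Integer> countRegions(String text) {
        Map<String, Integer> counts = new HashMap<>();
        if (text.isEmpty()) {
            return counts; // an empty file has no regions
        }
        for (String line : text.split("\n")) {
            counts.merge(line, 1, Integer::sum);
        }
        return counts;
    }

    /** Total number of regions recorded in a table. */
    static int total(Map<String, Integer> counts) {
        int sum = 0;
        for (int c : counts.values()) {
            sum += c;
        }
        return sum;
    }

    /** Returns F * maxScore, where F = matching regions / max(regions in a, regions in b). */
    static int score(String a, String b, int maxScore) {
        Map<String, Integer> ca = countRegions(a);
        Map<String, Integer> cb = countRegions(b);
        int max = Math.max(total(ca), total(cb));
        if (max == 0) {
            return maxScore; // degenerate case: two empty files
        }
        int common = 0;
        for (Map.Entry<String, Integer> e : ca.entrySet()) {
            // A region matches as many times as it appears in both files.
            common += Math.min(e.getValue(), cb.getOrDefault(e.getKey(), 0));
        }
        return (int) ((long) common * maxScore / max);
    }

    public static void main(String[] args) {
        System.out.println(score("a\nb\nc\n", "a\nb\nd\n", 100)); // prints 66
        System.out.println(score("", "", 100));                   // prints 100
    }
}
```

Because the matching count takes the minimum of the two per-region counts and the denominator takes the maximum of the two totals, swapping the arguments gives the same result, which is the symmetry property stated above.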
dst - the other index
maxScore - the score representing a 100% match

Copyright © 2018 Eclipse JGit Project. All rights reserved.