README ¶
sdhash
sdhash is a tool that processes binary data and produces similarity digests using bloom filters. Two binary files that share common parts produce similar digests. sdhash can compare two similarity digests to produce a score: a score close to 0 means that the two files are very different, and a score of 100 means that the two files are equal.
Features
- calculate similarity digests of many files in a short time
- compare a large number of digests using precalculated indexes
- optionally perform the comparison during the digesting process
- same results as the original sdhash with similar performance, but rewritten entirely in Go
Getting started
The sdhash package is available as binaries and as a library.
Binaries
The binaries for all platforms are available on the Releases page.
Library
- Install the sdhash package with the command below
$ go get -u github.com/eciavatta/sdhash
- Import it in your code and start playing around
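package main

import (
	"fmt"
	"log"

	"github.com/eciavatta/sdhash"
)

func main() {
	// Each factory wraps one input file; report the error instead of discarding it.
	factoryA, err := sdhash.CreateSdbfFromFilename("a.bin")
	if err != nil {
		log.Fatal(err)
	}
	sdbfA := factoryA.Compute()

	factoryB, err := sdhash.CreateSdbfFromFilename("b.bin")
	if err != nil {
		log.Fatal(err)
	}
	sdbfB := factoryB.Compute()

	fmt.Println(sdbfA.String())       // digest of a.bin
	fmt.Println(sdbfB.String())       // digest of b.bin
	fmt.Println(sdbfA.Compare(sdbfB)) // similarity score between 0 and 100
}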
Documentation
The library documentation is published at pkg.go.dev/github.com/eciavatta/sdhash. How sdhash works is described in this paper, and here you can find a tutorial of the original version of sdhash.
License
sdhash was originally created by Vassil Roussev and Candice Quates and is licensed under the Apache-2.0 License. The Go implementation was written by Emiliano Ciavatta and is also licensed under the Apache-2.0 License.
Documentation ¶
Constants ¶
const (
	MinFileSize = 512 // Minimum file size for an Sdbf file.
)
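A caller may want to check the input length before trying to build a digest. The sketch below is standard-library only and rests on an assumption the text above does not confirm: that inputs shorter than MinFileSize cannot be digested. The names `minFileSize`, `errTooSmall`, and `checkSize` are hypothetical and are not part of the sdhash package.

```go
package main

import (
	"errors"
	"fmt"
)

// minFileSize mirrors the documented value of sdhash.MinFileSize.
const minFileSize = 512

// errTooSmall is a hypothetical error for this sketch, not one defined by the library.
var errTooSmall = errors.New("input is smaller than the minimum file size")

// checkSize returns errTooSmall when buf is shorter than minFileSize.
func checkSize(buf []byte) error {
	if len(buf) < minFileSize {
		return errTooSmall
	}
	return nil
}

func main() {
	for _, n := range []int{100, 512, 4096} {
		fmt.Printf("%d bytes: %v\n", n, checkSize(make([]byte, n)))
	}
}
```

Running a check like this first lets a batch job skip tiny files with a clear message instead of relying on whatever error the factory functions return.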
Variables ¶
var (
	BfSize     uint32 = 256 // BfSize is the size of each bloom filter.
	PopWinSize uint32 = 64  // PopWinSize is the size of the sliding window used to hash input.
	MaxElem    uint32 = 160 // MaxElem is the maximum number of elements in each bloom filter in stream mode.
	MaxElemDd  uint32 = 192 // MaxElemDd is the maximum number of elements in each bloom filter in block mode.
	Threshold  uint32 = 16  // Threshold is the minimum score above which chunks are considered.
	BlockSize      = 4 * kB // BlockSize is the block size used to generate chunk ranks.
	EntropyWinSize = 64     // EntropyWinSize is the entropy window size used to generate chunk ranks.
)
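To give a feel for what a 64-byte entropy window measures, here is a minimal standard-library sketch that computes the Shannon entropy of every full window over a buffer. It illustrates the general idea of a windowed entropy measure only; it is not the ranking algorithm the library uses, and `windowEntropy` and `slidingEntropies` are hypothetical names.

```go
package main

import (
	"fmt"
	"math"
)

// entropyWinSize mirrors the documented default of EntropyWinSize.
const entropyWinSize = 64

// windowEntropy returns the Shannon entropy of window in bits per byte.
func windowEntropy(window []byte) float64 {
	if len(window) == 0 {
		return 0
	}
	var counts [256]int
	for _, b := range window {
		counts[b]++
	}
	n := float64(len(window))
	entropy := 0.0
	for _, c := range counts {
		if c == 0 {
			continue
		}
		p := float64(c) / n
		entropy -= p * math.Log2(p)
	}
	return entropy
}

// slidingEntropies returns the entropy of every full window of size win in data.
func slidingEntropies(data []byte, win int) []float64 {
	if win <= 0 || len(data) < win {
		return nil
	}
	out := make([]float64, 0, len(data)-win+1)
	for i := 0; i+win <= len(data); i++ {
		out = append(out, windowEntropy(data[i:i+win]))
	}
	return out
}

func main() {
	// First half is constant (low entropy), second half is all distinct bytes.
	data := make([]byte, 128)
	for i := 64; i < 128; i++ {
		data[i] = byte(i)
	}
	e := slidingEntropies(data, entropyWinSize)
	fmt.Printf("windows=%d first=%.2f last=%.2f\n", len(e), e[0], e[len(e)-1])
}
```

A constant window scores 0 bits per byte and a window of 64 distinct bytes scores 6, so the measure separates padding-like regions from varied content.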
Functions ¶
This section is empty.
Types ¶
type BloomFilter ¶
type BloomFilter interface {
	// ElemCount returns the number of elements in the BloomFilter.
	ElemCount() uint64
	// MaxElem returns the maximum number of elements that can be present in the BloomFilter.
	MaxElem() uint64
	// BitsPerElem returns the number of bits for each element of the BloomFilter.
	BitsPerElem() float64
	// WriteToFile serializes the current BloomFilter to the file specified by filename.
	WriteToFile(filename string) error
	// String returns the serialized representation of the BloomFilter.
	String() string
	// contains filtered or unexported methods
}
BloomFilter represents a bloom filter and is used to calculate similarity digests.
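For readers unfamiliar with the data structure, the following is a toy bloom filter built on the standard library. It is a sketch of the general technique, not the package's BloomFilter: the type `toyFilter`, the FNV-based double hashing, and the sizes passed in `main` are all assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// toyFilter is a hypothetical, minimal bloom filter used only for illustration.
type toyFilter struct {
	bits  []uint64 // backing bit set
	nbits uint64   // total number of bits
	k     uint64   // number of bit positions set per element
	count uint64   // number of Add calls
}

// newToyFilter allocates a filter with nbits bits and k probes per element.
// Both arguments must be greater than zero.
func newToyFilter(nbits, k uint64) *toyFilter {
	return &toyFilter{bits: make([]uint64, (nbits+63)/64), nbits: nbits, k: k}
}

// positions derives k bit positions from two FNV hashes (double hashing).
func (f *toyFilter) positions(elem []byte) []uint64 {
	h1 := fnv.New64a()
	h1.Write(elem)
	h2 := fnv.New64()
	h2.Write(elem)
	a, b := h1.Sum64(), h2.Sum64()|1 // force an odd step
	out := make([]uint64, f.k)
	for i := uint64(0); i < f.k; i++ {
		out[i] = (a + i*b) % f.nbits
	}
	return out
}

// Add sets the bits for elem and increments the element count.
func (f *toyFilter) Add(elem []byte) {
	for _, p := range f.positions(elem) {
		f.bits[p/64] |= uint64(1) << (p % 64)
	}
	f.count++
}

// MayContain reports whether elem is possibly present. False positives can
// occur; false negatives cannot.
func (f *toyFilter) MayContain(elem []byte) bool {
	for _, p := range f.positions(elem) {
		if f.bits[p/64]&(uint64(1)<<(p%64)) == 0 {
			return false
		}
	}
	return true
}

// ElemCount returns how many elements were added.
func (f *toyFilter) ElemCount() uint64 { return f.count }

func main() {
	f := newToyFilter(2048, 5)
	f.Add([]byte("feature-1"))
	fmt.Println(f.MayContain([]byte("feature-1")), f.ElemCount()) // prints "true 1"
}
```

Two filters built from inputs that share many features end up with many set bits in common, which is the property a similarity digest relies on.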
func NewBloomFilter ¶
func NewBloomFilter() BloomFilter
NewBloomFilter returns a new BloomFilter with the default initial values.
func NewBloomFilterFromIndexFile ¶
func NewBloomFilterFromIndexFile(indexFileName string) (BloomFilter, error)
NewBloomFilterFromIndexFile reads a BloomFilter that was serialized to a file.
func NewBloomFilterFromString ¶
func NewBloomFilterFromString(filter string) (BloomFilter, error)
NewBloomFilterFromString creates a new BloomFilter from a serialized string.
type Sdbf ¶
type Sdbf interface {
	// Name returns the name of the file or data this Sdbf represents.
	Name() string
	// Size returns the size of the hash data for this Sdbf.
	Size() uint64
	// InputSize returns the size of the data that the hash was generated from.
	InputSize() uint64
	// FilterCount returns the number of bloom filters.
	FilterCount() uint32
	// Compare compares two Sdbf and returns a similarity score between 0 and 100.
	// A score of 0 means that the two files are very different; a score of 100 means that the two files are equal.
	Compare(other Sdbf) int
	// CompareSample compares two Sdbf with sampling and returns a similarity score between 0 and 100.
	// A score of 0 means that the two files are very different; a score of 100 means that the two files are equal.
	CompareSample(other Sdbf, sample uint32) int
	// String returns the encoded Sdbf as a string.
	String() string
	// GetIndex returns the BloomFilter index used during the digesting process.
	GetIndex() BloomFilter
	// GetSearchIndexesResults returns the search index results.
	// The return value is a slice of length len(searchIndexes), and each element is a slice of length bfCount.
	GetSearchIndexesResults() [][]uint32
	// Fast modifies the bloom filter buffer for faster comparison.
	// Warning: the operation overwrites the original buffer.
	Fast()
}
Sdbf represents the similarity digest of a file and can be compared for similarity with other Sdbf values.
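Because Compare returns a plain integer between 0 and 100, post-processing the scores needs nothing from the library. The sketch below filters and sorts a set of scores; the `match` type, the `rankMatches` helper, the sample scores, and the cut-off of 20 are all hypothetical and only show one way to consume the results.

```go
package main

import (
	"fmt"
	"sort"
)

// match pairs a candidate name with its similarity score.
type match struct {
	Name  string
	Score int
}

// rankMatches keeps candidates scoring at least minScore and orders them
// from highest to lowest score, breaking ties by name.
func rankMatches(scores map[string]int, minScore int) []match {
	out := make([]match, 0, len(scores))
	for name, s := range scores {
		if s >= minScore {
			out = append(out, match{Name: name, Score: s})
		}
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Score != out[j].Score {
			return out[i].Score > out[j].Score
		}
		return out[i].Name < out[j].Name
	})
	return out
}

func main() {
	// Hypothetical scores, as if collected by comparing one reference digest
	// against several candidates with Compare.
	scores := map[string]int{"b.bin": 87, "c.bin": 3, "d.bin": 42}
	for _, m := range rankMatches(scores, 20) {
		fmt.Printf("%s\t%d\n", m.Name, m.Score)
	}
}
```

The appropriate cut-off depends on the data set, so it is left as a parameter rather than fixed in the helper.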
func ParseSdbfFromString ¶
ParseSdbfFromString decodes an Sdbf from a digest string.
type SdbfFactory ¶
type SdbfFactory interface {
	// WithBlockSize sets the block size for block mode.
	// The default value of 0 produces an Sdbf generated in stream mode.
	WithBlockSize(blockSize uint32) SdbfFactory
	// WithInitialIndex sets the initial BloomFilter index.
	// If no initial index is set, the factory creates a new empty BloomFilter.
	WithInitialIndex(initialIndex BloomFilter) SdbfFactory
	// WithSearchIndexes sets a list of BloomFilter values that are checked for similarity during the digesting process.
	// If no value is set, searching during the digesting process is disabled.
	WithSearchIndexes(searchIndexes []BloomFilter) SdbfFactory
	// WithName sets the name of the Sdbf in the output.
	WithName(name string) SdbfFactory
	// Compute starts the digesting process and returns an Sdbf with the result.
	Compute() Sdbf
}
SdbfFactory can be used to create an Sdbf from a binary source.
func CreateSdbfFromBytes ¶
func CreateSdbfFromBytes(buffer []uint8) (SdbfFactory, error)
CreateSdbfFromBytes returns a factory that can produce an Sdbf from a byte buffer.
func CreateSdbfFromFilename ¶
func CreateSdbfFromFilename(filename string) (SdbfFactory, error)
CreateSdbfFromFilename returns a factory that can produce an Sdbf from a file.
func CreateSdbfFromReader ¶
func CreateSdbfFromReader(r io.Reader) (SdbfFactory, error)
CreateSdbfFromReader returns a factory that can produce an Sdbf from an io.Reader.
