CPSC 461: Copyright © 2002 Katrin Becker 1998-2002 Last Modified May 20, 2000 11:34 AM

Binary Signatures
 
We've seen bit strings used in Data Compression (to mark the locations of spaces, or to distinguish between value types as in the Relative Encoding example).
 
As systems evolve there is a tendency to move to higher-level interfaces, and for the most part this is a good thing. Even so, there remains a place for low-level representations.
 
Advantages of low-level representations:
- they require less storage
- they allow for faster searching (since we are dealing with hardware instructions for comparisons and searches)
 
Bit strings are very useful for representing the case where we are only interested in whether an object has a certain characteristic and not in its specific value. For example, we might want to find students with G.P.A.s of 3.0 or greater. This can be coded in a single bit and would make searching and matches far more efficient than it would be if we had to check the value of a floating point field in every record.
 
A one-bit flag like this is called a Binary Attribute.
 
Scanning through a set of binary attributes can easily be done using machine-level bit masks. Here's an example:
 
Let's say we have a document retrieval system that works using keywords. We can build a set of bit strings (one per document) where each bit represents a "hit" or a "miss" for a specific keyword (i.e. it does or doesn't relate to the document).
 
Doc  links  hashing  compression  indexing  rings  records  trees   bit string
D1     1       0          0           1       0       1       1     1001011
D2     0       1          1           0       0       1       1     0110011
D3     0       0          0           0       1       1       0     0000110
D4     0       1          1           0       0       0       1     0110001
D5     0       1          0           1       1       1       1     0101111
D6     1       0          1           0       1       0       1     1010101
D7     0       0          0           1       0       1       0     0001010
 
To retrieve documents that deal with hashing and compression (notice that fairly complicated queries could be done this way) we simply create a mask: 0110000
 
In just a few machine instructions we can scan through the associated bit strings for each document and build a list of matches without ever looking into the documents themselves or doing any lengthy text searches. (The result would be a match at D2 and D4).
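
The scan described above can be sketched in a few lines of Python. This is only an illustration: the names KEYWORDS, DOCS, make_mask and find_docs are made up for this sketch, and each bit string is stored as an ordinary integer with the leftmost keyword as the most significant bit.

```python
# Keyword order, left to right, as in the table above.
KEYWORDS = ["links", "hashing", "compression", "indexing",
            "rings", "records", "trees"]

# One bit string per document, stored as an integer.
DOCS = {
    "D1": 0b1001011,
    "D2": 0b0110011,
    "D3": 0b0000110,
    "D4": 0b0110001,
    "D5": 0b0101111,
    "D6": 0b1010101,
    "D7": 0b0001010,
}

def make_mask(*wanted):
    """Build a mask with a 1 in the column of each wanted keyword."""
    mask = 0
    for word in wanted:
        # The leftmost keyword is the most significant bit.
        mask |= 1 << (len(KEYWORDS) - 1 - KEYWORDS.index(word))
    return mask

def find_docs(mask):
    """Return the documents whose bit string has every mask bit set."""
    return [name for name, bits in DOCS.items() if (bits & mask) == mask]

mask = make_mask("hashing", "compression")
print(format(mask, "07b"))   # 0110000
print(find_docs(mask))       # ['D2', 'D4']
```

The test (bits & mask) == mask is one AND and one compare per document, which is the "few machine instructions" the notes refer to.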
 
As the number of keywords grows, so do the bit strings, and working with bit strings that are several hundred bits long starts to become more complicated.
 
So, how can we reduce the number of bits and still retain the search advantages?
Answer: Instead of giving each attribute its own bit position, we give each attribute a short code made of several bits; we still want the code for each attribute to be unique.
 
Superimposed Coding:
 
- the idea is to create an m-bit code for each attribute and superimpose these on top of each other (that is, OR them together)
 
- let's first say that m should be a multiple of 8 (keeping our machine instructions in mind) so m should be 8, 16, 32, 64, ...
- now we form the code by setting exactly k bits for each attribute
- we need enough bits to form a unique representation for each attribute
 
How to figure out what k needs to be?
 
mCk = m! / (k! * (m-k)!)

(The simplest way to get the values is to just play with m and k until we get what we need.) If we set m to 8 and try k at 2, we get 8! / (6! * 2!) = 40320 / (720 * 2) = 28, so we can represent 28 different attributes.
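
"Playing with m and k" is easy to do by machine. The sketch below counts the number of ways to choose k of m bit positions using Python's math.comb; the helper names capacity and smallest_k are made up for this illustration.

```python
from math import comb

def capacity(m, k):
    """Number of distinct m-bit codes with exactly k bits set."""
    return comb(m, k)

def smallest_k(m, needed):
    """Smallest k whose capacity covers `needed` attributes, or None."""
    for k in range(1, m + 1):
        if comb(m, k) >= needed:
            return k
    return None

print(capacity(8, 2))                           # 28
print([capacity(8, k) for k in range(1, 5)])    # [8, 28, 56, 70]
print(smallest_k(16, 100))                      # 2
```

For m = 16, k = 1 gives only 16 codes, while k = 2 gives 120, which is why smallest_k(16, 100) comes out as 2.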
 
The final code for each document is created by superimposing each of the attribute codes (via logical or).
 
Attribute codes (m = 8, k = 2):

links        10100000
hashing      01100000
compression  00101000
indexing     10001000
rings        00000110
records      00010100
trees        00010001

Doc  links  hashing  compression  indexing  rings  records  trees   final code
D1     1       0          0           1       0       1       1     10111101
D2     0       1          1           0       0       1       1     01111101
D3     0       0          0           0       1       1       0     00010110
D4     0       1          1           0       0       0       1     01111001
D5     0       1          0           1       1       1       1     11111111
D6     1       0          1           0       1       0       1     10111111
D7     0       0          0           1       0       1       0     10011100
 
This approach doesn't work so well for common attributes (those that appear in most or many of the documents) - they should be done differently.
 
Now if we're looking for "hashing" and "compression" we need the mask 01101000 (01100000 | 00101000). Running this against the bit strings for the documents, our match list is: D2, D4, and D5.
 
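- because the meanings of the individual bits are no longer unique it is possible to get matches that don't actually contain what we are looking for. This is called a false drop. In the example, D5 is a false drop: its code has every bit of the mask set, but D5 does not deal with compression.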
 
- it means we still need to double check matches to see if they are in fact hits but we can still use this approach effectively to eliminate a large number of documents from our search. We can store these bit strings in a separate file and search them first - this way we eliminate files without even opening them. We can often arrange to keep this information in memory.
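
The whole superimposed-coding example can be reproduced with a short sketch. The attribute codes and keyword lists are taken from the tables above; the names CODES, DOC_KEYWORDS, superimpose and search are made up here, and the "double check" is simulated by looking at the keyword list instead of opening a real document.

```python
# 8-bit codes with exactly 2 bits set, as in the example.
CODES = {
    "links":       0b10100000,
    "hashing":     0b01100000,
    "compression": 0b00101000,
    "indexing":    0b10001000,
    "rings":       0b00000110,
    "records":     0b00010100,
    "trees":       0b00010001,
}

DOC_KEYWORDS = {
    "D1": ["links", "indexing", "records", "trees"],
    "D2": ["hashing", "compression", "records", "trees"],
    "D3": ["rings", "records"],
    "D4": ["hashing", "compression", "trees"],
    "D5": ["hashing", "indexing", "rings", "records", "trees"],
    "D6": ["links", "compression", "rings", "trees"],
    "D7": ["indexing", "records"],
}

def superimpose(words):
    """OR together the codes of all the given attributes."""
    code = 0
    for word in words:
        code |= CODES[word]
    return code

# Final code for each document.
SIGNATURES = {doc: superimpose(words) for doc, words in DOC_KEYWORDS.items()}

def search(*wanted):
    """Return (candidates, real hits, false drops) for a query."""
    mask = superimpose(wanted)
    candidates = [d for d, sig in SIGNATURES.items() if (sig & mask) == mask]
    # The "double check": look at the real keyword list of each candidate.
    hits = [d for d in candidates
            if all(w in DOC_KEYWORDS[d] for w in wanted)]
    false_drops = [d for d in candidates if d not in hits]
    return candidates, hits, false_drops

print(format(SIGNATURES["D1"], "08b"))    # 10111101
print(search("hashing", "compression"))
# (['D2', 'D4', 'D5'], ['D2', 'D4'], ['D5'])
```

Only three of the seven documents survive the filter, and only those three need to be examined in detail.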
 
How can we reduce the number of false drops?
1. increase m
2. remove the common attributes
3. increase k? (nope, this won't help - why?)

Text Searching
 
- looking for a given pattern in a string
- the usual way is to check chars in string against the first char in pattern, moving 1 character at a time from left to right. When the first character matches check subsequent characters. When we fail, our 'current' pointer is moved one position to the right and we start again.
 
Here is another way:
 
[ref. Boyer & Moore, CACM, Vol. 20, 1977, p762-772]
 
1. 'line up' pattern at the start of string
2. check for match at right side of pattern
3. if no match:
   THEN check to see if current char of string appears in pattern
      A. if YES:
         A.1 line up rightmost occurrence of char in pattern with char in string
         A.2 go to 2
      B. if NO:
         B.1 advance current pointer along string by len(pattern)
         B.2 go to 2
   ELSE start symbol by symbol check starting at right
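
The steps above can be sketched as follows. Note the assumptions: this is the simplified variant usually credited to Horspool, which always shifts according to the string character under the right end of the pattern, not the full Boyer-Moore algorithm from the paper. The steps above do not say what to do when the symbol-by-symbol check fails part way; the sketch applies the same shift rule in that case. The names build_shift_table and find are made up for this illustration.

```python
def build_shift_table(pattern):
    """For each symbol, the distance from its rightmost occurrence
    (ignoring the last position) to the right end of the pattern."""
    last = len(pattern) - 1
    # Later occurrences overwrite earlier ones, so the rightmost wins.
    return {ch: last - i for i, ch in enumerate(pattern[:-1])}

def find(text, pattern):
    """Return the index of the first match of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    shift = build_shift_table(pattern)
    pos = 0                      # where the pattern's left edge is lined up
    while pos + m <= n:
        # Compare symbol by symbol, starting at the right side.
        i = m - 1
        while i >= 0 and text[pos + i] == pattern[i]:
            i -= 1
        if i < 0:
            return pos           # every symbol matched
        # Shift so the string char under the pattern's right end lines up
        # with its rightmost occurrence in the pattern; if it does not
        # occur in the pattern at all, skip a full pattern length.
        pos += shift.get(text[pos + m - 1], m)
    return -1

print(find("abcabd", "abd"))     # 3
```

In the usage line, the first comparison fails on 'c', which does not appear in the pattern, so the pattern jumps ahead by its full length and matches on the second try.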

Most pattern matching algorithms examine most if not all of the symbols at least once.
 
Can we reduce this?
 
What if we can devise a way of eliminating lines of text or substrings that are unlikely to result in a match?
 
When searching articles (web sites, etc.) we don't usually search the entire text contained within. We usually search some well-defined subset:
1. a summary
2. an abstract
3. a list of related keywords
4. an index
 
These are all condensed forms of the original text, but they are still rather cumbersome, especially if we are talking about 1000's (or in the case of a web search engine 100's of 1000's) of documents, each one of which is likely to be stored as a distinct file.
 
This sounds like a lot of work; can we do better?

Signatures
 
- involves adding something to the file
- the trade-off (there's ALWAYS a trade-off) is increased storage for increased search efficiency
- involves some pre-processing (store once, process often - same argument that was used for complicated collision resolution techniques in hashing)
 
- we can associate a distinct signature with each text segment and then just search the signatures
- signatures need not be unique - all we are trying to do is reduce the number of strings we have to include in our detailed search { like a FILTER } (make it faster by doing less)
- a signature is a bit string (usually a multiple of the machine word size) that describes the 'essence' of an object
 
Here's one approach:
Hash every contiguous k-symbol group to a number in [0,m) (i.e. from 0 to m not including m) where m is the size of the signature. The result of each hash tells us which bit to set (note: the signature needs to be updated each time the source text is changed)
 
When we want to find a pattern, we hash the pattern in the same way and use it as a mask against the text signatures
- those strings that match need to be examined in more detail; those that don't match can be safely ignored
 
To make it simple at the machine instruction level:
if (((~text_signature) & pattern_signature) == 0)
// testing for zero is faster than testing for a 'random' bit pattern
then it's a match
 
Example: we have 8 bytes (a 64-bit signature) to work with and each symbol pair is denoted as y1,y2
h(y1,y2) = numofclasses * T(y1) + T(y2), which gives a value in {0,...,63}
when k = 2, numofclasses is the largest integer n such that n^2 <= m, where m is the signature length
 
- so numofclasses for a 64-bit signature is 8
- T is a simple array (one element per symbol; each symbol is assigned a value from 0-7)
- the symbols should be divided among the classes so that each class is about equally likely to occur, based on the statistical frequencies of the symbols themselves
 
TABLE 'T':
class  symbols
0      (blank)
1      E B 6 : & ' " ?
2      T X Z W G 5 ; / * <
3      A F Y P 4 , ) ! > ^
4      O L C 3 . ( @ [ _
5      I K D M J Q 2 9 # ] |
6      N V S 1 8 - $ {
7      H U R 0 7 + % }
 
Signature Formation:
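while (not eof)
{
    text = input line;
    signature = all_zeros;              // m bits, all cleared
    for i <- 1 to (endoftext-1)         // every adjacent symbol pair
    {
        // look up the class of each symbol, not the index itself
        position = numofclasses * T[text[i]] + T[text[i+1]];
        signature[position] = 1;
    }
    output <- concatenate(signature, text);
}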
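
Here is a runnable sketch of signature formation and the filter test for the 64-bit, k = 2 example. The class table is copied from TABLE 'T'. Two things are assumptions of this sketch and not part of the notes: text is converted to upper case before lookup, and any symbol missing from the table is treated as class 0. The names signature and may_contain are made up for this illustration.

```python
NUM_CLASSES = 8        # largest n with n*n <= 64

# Index in this list = class number from TABLE 'T'.
CLASS_SYMBOLS = [
    " ",
    "EB6:&'\"?",
    "TXZWG5;/*<",
    "AFYP4,)!>^",
    "OLC3.(@[_",
    "IKDMJQ29#]|",
    "NVS18-${",
    "HUR07+%}",
]
T = {ch: cls for cls, symbols in enumerate(CLASS_SYMBOLS) for ch in symbols}

def signature(text):
    """Set one bit for every adjacent symbol pair in the text."""
    sig = 0
    text = text.upper()                  # assumption: case is ignored
    for y1, y2 in zip(text, text[1:]):
        # Assumption: symbols not in the table fall into class 0.
        position = NUM_CLASSES * T.get(y1, 0) + T.get(y2, 0)
        sig |= 1 << position
    return sig

def may_contain(text_signature, pattern_signature):
    """True if every bit of the pattern is also set in the text."""
    return (~text_signature & pattern_signature) == 0

lines = ["BINARY SIGNATURES", "TEXT SEARCHING"]
pattern_sig = signature("SIGN")
for line in lines:
    print(line, may_contain(signature(line), pattern_sig))
```

A line that really contains the pattern always passes the test; a line that passes still has to be searched in detail, since different symbol pairs can set the same bit. A one-symbol pattern has no pairs, so its signature is all zeros and it passes against every line.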

 

 

 
Signatures for Records
 
- can create a signature for each record
- record signatures usually use disjoint coding (each field gets its own portion of the signature)
- field values can be reduced by regular hashing, or they can be re-grouped into ranges and coded that way
 
eg. a name field could be reduced to 5 bits:
A-E = 10000; F-J = 01000; K-M = 00100; N-R = 00010; S-Z = 00001
 
we can even create a hash function that will generate the appropriate bit position (this increases speed because instead of searching for the correct sub-group, we can hash to it directly.)
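
A minimal sketch of hashing straight to the bit position for the name example. It assumes the group is chosen by the first letter of the name, which the notes imply but do not state; the names GROUPS, BIT_FOR_LETTER and name_code are made up for this illustration. A 26-entry table indexed by the letter stands in for the hash function.

```python
# Letter groups in the order given above: A-E, F-J, K-M, N-R, S-Z.
GROUPS = ["ABCDE", "FGHIJ", "KLM", "NOPQR", "STUVWXYZ"]

# Precompute the 5-bit code for each of the 26 letters.
BIT_FOR_LETTER = [0] * 26
for group, letters in enumerate(GROUPS):
    for ch in letters:
        # The first group gets the leftmost bit (10000).
        BIT_FOR_LETTER[ord(ch) - ord("A")] = 1 << (len(GROUPS) - 1 - group)

def name_code(name):
    """Return the 5-bit code for a name, chosen by its first letter."""
    first = name[:1].upper()
    if not ("A" <= first <= "Z"):
        raise ValueError("name must start with a letter A-Z")
    # Direct lookup: no searching through the groups at query time.
    return BIT_FOR_LETTER[ord(first) - ord("A")]

for name in ["Becker", "Moore", "Smith"]:
    print(name, format(name_code(name), "05b"))
# Becker 10000
# Moore 00100
# Smith 00001
```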
 
Partial Match Retrieval with Page Signatures
- this idea can be applied to groups of records (like in buckets)
- now we can group sets of signatures into a tree structure
 
Bloom Filters
 
Searching for records that are not in the file is often a very expensive operation (looking for a hashed record in a table created using open addressing often results in the entire table being searched).
 
Here's a way to make this more efficient:
 
Example:
We have a file containing a list of bad credit card numbers
- a long list though it represents a very small portion of all credit card numbers
- a list that must be searched even though we expect not to find the number we have
 
1. Create a loooooong bit string (m bits long)
2. As we insert records into the search file we apply a battery of hash functions to the primary keys of each record.
h1(key) -> a bit position in [0, m)
h2(key) -> a bit position in [0, m)
h3(key) -> a bit position in [0, m)
:
hn(key) -> a bit position in [0, m)
- the output of each determines which bit we should set
- these are all superimposed on top of each other to form one signature
 
It's like having a signature for the entire file.
 
- to retrieve a record we apply the same series of hash functions to its key and check the corresponding bits
- (here's the cool part) the first occurrence of a zero indicates for certain the record is not in the file
- this allows us to determine most unsuccessful matches without a single file access
 
What if all the bits check out?
1. could be a match
2. could be a false drop
so then we actually have to check the file
 
What we gain is that MOST searches don't need to access the file - hence filter
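
A minimal Bloom filter sketch for the bad-card example. The "battery of hash functions" is an assumption of this sketch: it is simulated by salting SHA-256 from Python's hashlib with the function number, which is convenient for illustration but is not what the notes prescribe. The class name BloomFilter, its methods, and the sample card numbers are all made up.

```python
import hashlib

class BloomFilter:
    def __init__(self, m, n_hashes):
        self.m = m                  # length of the bit string
        self.n_hashes = n_hashes    # how many hash functions to apply
        self.bits = 0               # the bit string, kept as one integer

    def _positions(self, key):
        """Yield h1(key), h2(key), ..., hn(key), each in [0, m)."""
        for i in range(self.n_hashes):
            # Assumption: salting one hash stands in for n functions.
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        """Superimpose the key's bits onto the file signature."""
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        """False means certainly absent; True means check the file."""
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

bad_cards = BloomFilter(m=1024, n_hashes=3)
for number in ["4111-0000-0000-0001", "4111-0000-0000-0002"]:
    bad_cards.add(number)

print(bad_cards.might_contain("4111-0000-0000-0001"))   # True
```

The all() call stops at the first zero bit, matching the observation above that the first zero settles the question. A True result still requires a look in the file, since it may be a false drop.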
 
 

