CPSC 461: Copyright © 2002 Katrin Becker 1998-2002 Last Modified May 20, 2000 11:34 AM
- Binary Signatures
-
- We've seen bit strings used in Data Compression
(to mark the locations of spaces, or to distinguish between value
types as in the Relative Encoding example).
-
- As systems evolve there is a tendency
to move to higher-level interfaces and, for the most part, this
is a good thing. Even so, there remains a place for low-level representations.
-
- Advantages of low-level representations:
- - they require less storage
- - they allow for faster searching (since
we are dealing with hardware instructions for comparisons and
searches)
-
- Bit strings are very useful for representing
the case where we are only interested in whether
an object has a certain characteristic and not in its specific
value. For example, we might want to find students with G.P.A.s
of 3.0 or greater. This can be coded in a single bit and would
make searching and matches far more efficient than it would be
if we had to check the value of a floating point field in every
record.
-
- This is called a Binary Attribute
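-
- A minimal Python sketch of a binary attribute, assuming a made-up list of student records (the names, GPAs and helper functions here are hypothetical). The floating point comparison is done once, when the bit string is built; later searches only test bits.

```python
# Hypothetical student records: (name, GPA). Values are made up for illustration.
students = [("Ann", 3.4), ("Bob", 2.7), ("Cho", 3.0), ("Dee", 1.9)]


def build_gpa_bits(records, threshold=3.0):
    """Pack one bit per record: bit i is 1 when record i meets the threshold."""
    bits = 0
    for i, (_, gpa) in enumerate(records):
        if gpa >= threshold:
            bits |= 1 << i
    return bits


def matching_indices(bits, count):
    """Return the positions of the records whose bit is set."""
    return [i for i in range(count) if (bits >> i) & 1]


gpa_bits = build_gpa_bits(students)
honour_roll = [students[i][0] for i in matching_indices(gpa_bits, len(students))]
```

- Here record i is stored in bit i counting from the least significant bit; that layout is a choice made for this sketch only.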
-
- Scanning through a set of binary attributes
can easily be done using machine-level bit masks. Here's an example:
-
- Let's say we have a document retrieval
system that works using keywords. We can build a set of bit strings
(one per document) where each bit represents a "hit"
or a "miss" for a specific keyword (i.e. the keyword does or
doesn't relate to the document).
-
| Doc | links | hashing | compression | indexing | rings | records | trees | bit string |
| D1  | 1     | 0       | 0           | 1        | 0     | 1       | 1     | 1001011    |
| D2  | 0     | 1       | 1           | 0        | 0     | 1       | 1     | 0110011    |
| D3  | 0     | 0       | 0           | 0        | 1     | 1       | 0     | 0000110    |
| D4  | 0     | 1       | 1           | 0        | 0     | 0       | 1     | 0110001    |
| D5  | 0     | 1       | 0           | 1        | 1     | 1       | 1     | 0101111    |
| D6  | 1     | 0       | 1           | 0        | 1     | 0       | 1     | 1010101    |
| D7  | 0     | 0       | 0           | 1        | 0     | 1       | 0     | 0001010    |
-
- To retrieve documents that deal with both hashing
and compression (notice that fairly complicated queries could
be done this way) we simply create a mask: 0110000. A document
matches when ANDing its bit string with the mask gives back the mask.
-
- In just a few machine instructions we
can scan through the associated bit strings for each document
and build a list of matches without ever looking into the
documents themselves or doing any lengthy text searches.
(The result would be a match at D2 and D4).
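-
- The mask search can be sketched in Python as follows. The bit strings are copied from the table above; the helper names make_mask and search are illustrative assumptions. A document is reported when every bit requested in the mask is also set in its bit string.

```python
KEYWORDS = ["links", "hashing", "compression", "indexing",
            "rings", "records", "trees"]

# Bit strings from the table; the leftmost bit belongs to the first keyword.
DOC_BITS = {
    "D1": "1001011",
    "D2": "0110011",
    "D3": "0000110",
    "D4": "0110001",
    "D5": "0101111",
    "D6": "1010101",
    "D7": "0001010",
}


def make_mask(wanted):
    """Build a mask with a 1 in the position of every wanted keyword."""
    return int("".join("1" if k in wanted else "0" for k in KEYWORDS), 2)


def search(mask):
    """Return the documents whose bit string has every mask bit set."""
    return [doc for doc, bits in DOC_BITS.items()
            if (int(bits, 2) & mask) == mask]


mask = make_mask({"hashing", "compression"})
matches = search(mask)  # D2 and D4, as in the text
```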
-
- As the number of keywords grows, so do the bit strings,
and working with bit strings that are several hundred
bits long starts to become more complicated.
-
- So, how can we reduce the number of bits
and still retain the search advantages?
- Answer: We need an encoding that uses fewer bits
but still generates a unique code for each
attribute
-
- Superimposed Coding:
-
- - the idea is to create an m-bit
code for each attribute and superimpose these on top of each
other (i.e. OR them together)
-
- - let's first say that m
should be a multiple of 8 (keeping our machine instructions in
mind) so m should be 8, 16, 32, 64, ...
- - now we form the code by setting exactly
k bits for each attribute
- - we need enough bits to form a unique
representation for each attribute
-
- How to figure out what k
needs to be?
-
- mCk = m! / (k! * (m-k)!) must be at least the number of attributes.
(The simplest way to get
the values is to just play with m and k
until we get what we need). If we set m to 8 and
try k at 2, we get 8! / (6! * 2!) = 40320 / (720 * 2) = 28 so we
can represent 28 different attributes.
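-
- Rather than trying values by hand, the search for k can be sketched with Python's math.comb. The helper smallest_k is a hypothetical name; it simply returns the first k for which mCk is large enough to give every attribute its own code.

```python
from math import comb


def smallest_k(m, attributes):
    """Smallest k with C(m, k) >= attributes, or None if m bits are too few."""
    # C(m, k) peaks at k = m // 2, so there is no point looking past that.
    for k in range(1, m // 2 + 1):
        if comb(m, k) >= attributes:
            return k
    return None


codes_available = comb(8, 2)      # 28, the value worked out in the text
k_for_28 = smallest_k(8, 28)      # 2
too_many = smallest_k(8, 100)     # None: C(8, 4) = 70 is the most 8 bits allow
```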
-
- The final code for each document is created
by superimposing the codes of all its attributes (via logical OR).
-
| Doc  | links    | hashing  | compression | indexing | rings    | records  | trees    | final code |
| code | 10100000 | 01100000 | 00101000    | 10001000 | 00000110 | 00010100 | 00010001 |            |
| D1   | 1        | 0        | 0           | 1        | 0        | 1        | 1        | 10111101   |
| D2   | 0        | 1        | 1           | 0        | 0        | 1        | 1        | 01111101   |
| D3   | 0        | 0        | 0           | 0        | 1        | 1        | 0        | 00010110   |
| D4   | 0        | 1        | 1           | 0        | 0        | 0        | 1        | 01111001   |
| D5   | 0        | 1        | 0           | 1        | 1        | 1        | 1        | 11111111   |
| D6   | 1        | 0        | 1           | 0        | 1        | 0        | 1        | 10111111   |
| D7   | 0        | 0        | 0           | 1        | 0        | 1        | 0        | 10011100   |
-
- This approach doesn't work so well for
common attributes (those that appear in most or many of the documents)
- they should be done differently.
-
- Now if we're looking for "hashing"
and "compression" we need the mask 01101000 (01100000
| 00101000). Running this mask against the bit strings for the
documents, our match list is: D2, D4, and D5, even though D5
is not about compression.
-
- - because the meanings of the individual
bits are no longer unique it is possible to get matches that
don't actually contain what we are looking for. This is called
a false drop.
-
- - it means we still need to double check
matches to see if they are in fact hits but we can still use
this approach effectively to eliminate a large number of documents
from our search. We can store these bit strings in a separate
file and search them first - this way we eliminate files without
even opening them. We can often arrange to keep this information
in memory.
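-
- The whole superimposed coding example can be sketched in Python. The codes and keyword sets are copied from the table above; the function names are illustrative assumptions. candidates() is the cheap signature scan and confirmed() is the double check that removes false drops.

```python
CODES = {
    "links":       0b10100000,
    "hashing":     0b01100000,
    "compression": 0b00101000,
    "indexing":    0b10001000,
    "rings":       0b00000110,
    "records":     0b00010100,
    "trees":       0b00010001,
}

DOCS = {
    "D1": {"links", "indexing", "records", "trees"},
    "D2": {"hashing", "compression", "records", "trees"},
    "D3": {"rings", "records"},
    "D4": {"hashing", "compression", "trees"},
    "D5": {"hashing", "indexing", "rings", "records", "trees"},
    "D6": {"links", "compression", "rings", "trees"},
    "D7": {"indexing", "records"},
}


def superimpose(keywords):
    """OR together the m-bit codes of the given keywords."""
    code = 0
    for keyword in keywords:
        code |= CODES[keyword]
    return code


SIGNATURES = {doc: superimpose(kws) for doc, kws in DOCS.items()}


def candidates(query):
    """Signature scan only: may include false drops."""
    mask = superimpose(query)
    return [doc for doc, sig in SIGNATURES.items() if (sig & mask) == mask]


def confirmed(query):
    """Double check each candidate against the real keyword list."""
    return [doc for doc in candidates(query) if set(query) <= DOCS[doc]]


query = ["hashing", "compression"]
maybe = candidates(query)   # D2, D4 and the false drop D5
real = confirmed(query)     # D2 and D4
```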
-
- How can we reduce the number of false
drops?
- 1. increase m
- 2. remove the common attributes
- 3. increase k? (nope, this
won't help - why?)
- Text Searching
-
- - looking for a given pattern
in a string
- - the usual way is to check chars in
string against the first char in pattern,
moving 1 character at a time from left to right. When the first
character matches check subsequent characters. When we fail,
our 'current' pointer is moved one position to the right and
we start again.
-
- Here is another way:
-
- [ref. Boyer & Moore, CACM, Vol.
20, 1977, p762-772]
-
- 1. 'line up' pattern at
the start of string
- 2. check for a match at the right side
of pattern
- 3. if no match:
- THEN check to see if the current char of string
appears in pattern
- A. if YES:
- A.1 line up the rightmost occurrence
of char in pattern with char
in string
- A.2 go to 2
- B. if NO:
- B.1 advance current pointer along string
by len(pattern)
- B.2 go to 2
- ELSE start a symbol by symbol check starting
at right
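-
- One possible reading of the steps above, as a Python sketch. This is the simplified "bad character" shift usually credited to Horspool, not the full algorithm from the Boyer & Moore paper. Where the outline leaves details open, the sketch assumes the shift is always taken from the string character under the right end of the pattern, and that a failed symbol by symbol check shifts the same way.

```python
def find(pattern, text):
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # Rightmost position of each character in the pattern. The last position
    # is left out so that every shift moves the pattern at least one place.
    last = {ch: i for i, ch in enumerate(pattern[:-1])}
    pos = 0  # index in text where the left end of the pattern is lined up
    while pos + m <= n:
        # Symbol by symbol check, starting at the right side of the pattern.
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            return pos
        # Line up the rightmost occurrence of the current text character,
        # or skip the whole pattern length if it does not occur at all.
        ch = text[pos + m - 1]
        pos += m - 1 - last.get(ch, -1)
    return -1
```

- When the character under the right end of the pattern does not occur in the pattern at all, the pattern moves a full len(pattern) positions, so most characters of the text are never examined.
-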
- Most pattern matching algorithms examine
most if not all of the symbols at least once.
-
- Can we reduce this?
-
- What if we can devise a way of eliminating
lines of text or substrings that are unlikely to result in a
match?
-
- When searching articles (web sites, etc.)
we don't usually search the entire text contained within. We
usually search some well-defined subset:
- 1. a summary
- 2. an abstract
- 3. a list of related keywords
- 4. an index
-
- These are all condensed forms of the original
text, but they are still rather cumbersome, especially if we
are talking about 1000's (or in the case of a web search engine
100's of 1000's) of documents, each one of which is likely to
be stored as a distinct file.
-
- This sounds like a lot of work; can we
do better?
- Signatures
-
- - involves adding something to the file
- - the trade-off (there's ALWAYS a trade-off)
is increased storage for increased search efficiency
- - involves some pre-processing (store
once, process often - same argument that was used for complicated
collision resolution techniques in hashing)
-
- - we can associate a distinct signature
with each text segment and then just search the signatures
- - signatures need not be unique - all
we are trying to do is reduce the number of strings we have to
include in our detailed search { like a FILTER }
(make it faster by doing less)
- - a signature is a bit string
(usually a multiple of the machine word size) that describes
the 'essence' of an object
-
- Here's one approach:
- Hash all contiguous k-symbol
groups to a number [0,m) (i.e. from 0 to m not
including m) where m is the size of the signature.
The result of the hash tells us which bit to set (note: the signature
needs to be updated each time the source text is changed)
-
- When we want to find a pattern, we hash
the pattern in the same way and use it as a mask against the
text signatures
- - those strings that match need to be
examined in more detail; those that don't match can be safely
ignored
-
- To make it simple at the machine instruction
level:
- if (((~text_signature) & pattern_signature) == 0)
- // the & is parenthesized because in C-like languages == binds tighter than &
- // testing for zero is faster than testing
for a 'random' bit pattern
- then it's a match
-
- Example: we have 8 bytes (64 bits) to work with
and each symbol pair is denoted as y1,y2
- h(y1,y2) = numofclasses * T(y1) + T(y2), a value in {0,...,63}
- when k = 2, numofclasses is the largest integer n
such that n^2 <= m, where m is the signature length
-
- - so numofclasses for a 64-bit
signature is 8
- - T is
a simple array (one element per symbol; each symbol is assigned
a value from 0-7)
- - the symbols must be assigned to classes so that,
according to the statistical frequencies of the symbols themselves,
each class is used about equally often
-
- TABLE 'T':
| class | symbols               |
| 0     | (blank)               |
| 1     | E B 6 : & ' " ?       |
| 2     | T X Z W G 5 ; / * <   |
| 3     | A F Y P 4 , ) ! > ^   |
| 4     | O L C 3 . ( @ [ _     |
| 5     | I K D M J Q 2 9 # ] | |
| 6     | N V S 1 8 - $ {       |
| 7     | H U R 0 7 + % }       |
-
- Signature Formation:
- while (not eof)
- {
- text = input line;
- signature = all_zeros; // m bits
- for i <- 1 to (length(text) - 1) // every adjacent pair of symbols
- {
- position = numofclasses * T[text[i]] + T[text[i+1]];
- signature [ position ] = 1;
- }
- output <- concatenate(signature, text);
- }
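-
- A runnable Python sketch of the signature formation and the match test, using table 'T' with numofclasses = 8 and m = 64. Three things here are assumptions: text is upper-cased first (the table lists only capitals), pairs containing a symbol that is not in the table are skipped, and bit "position" is counted from the least significant bit.

```python
# Table 'T': the index in this list is the class, the string holds its symbols.
CLASSES = [
    " ",
    "EB6:&'\"?",
    "TXZWG5;/*<",
    "AFYP4,)!>^",
    "OLC3.(@[_",
    "IKDMJQ29#]|",
    "NVS18-${",
    "HUR07+%}",
]
T = {ch: cls for cls, symbols in enumerate(CLASSES) for ch in symbols}
NUM_CLASSES = len(CLASSES)   # 8, so positions run from 0 to 63


def signature(text):
    """64-bit signature of a line: one bit per class pair that occurs in it."""
    text = text.upper()
    sig = 0
    for a, b in zip(text, text[1:]):      # every adjacent pair of symbols
        if a in T and b in T:             # assumption: skip unknown symbols
            sig |= 1 << (NUM_CLASSES * T[a] + T[b])
    return sig


def may_contain(text_signature, pattern_signature):
    """True when every bit set in the pattern is also set in the text."""
    return (~text_signature & pattern_signature) == 0


line_sig = signature("THE CAT SAT")
hit = may_contain(line_sig, signature("cat"))    # True: examine the line
miss = may_contain(line_sig, signature("DOG"))   # False: safely ignored
```

- A True result only means the line has to be examined in detail; different symbol pairs can share a bit, so it may still turn out to be a false drop.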
-
- Signatures for Records
-
- - can create a signature for each record
- - record signatures usually use disjoint
coding (each field gets its own portion of the signature)
- - field values can be reduced by regular
hashing or they can be re-grouped and coded that way
-
- e.g. a name field could be reduced to 5
bits:
- A-E = 10000; F-J = 01000; K-M = 00100;
N-R = 00010; S-Z = 00001
-
- we can even create a hash function that
will generate the appropriate bit position (this increases speed
because instead of searching for the correct sub-group, we can
hash to it directly.)
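-
- A small Python sketch of the name field example. The lookup table plays the role of the hash function that goes straight to the right bit. The record layout in record_signature (name code followed by one GPA bit) is a hypothetical illustration of disjoint coding, and names are assumed to start with a letter A-Z.

```python
# Letter groups from the example: A-E, F-J, K-M, N-R, S-Z.
GROUPS = ["ABCDE", "FGHIJ", "KLM", "NOPQR", "STUVWXYZ"]

# Direct lookup from first letter to its 5-bit code (leftmost bit = A-E).
BIT_FOR_LETTER = {
    letter: 1 << (len(GROUPS) - 1 - i)
    for i, group in enumerate(GROUPS)
    for letter in group
}


def name_code(name):
    """5-bit code for a name, chosen by its first letter."""
    return BIT_FOR_LETTER[name[0].upper()]


def record_signature(name, gpa):
    """Disjoint coding: 5 bits for the name field, then 1 bit for GPA >= 3.0."""
    return (name_code(name) << 1) | (1 if gpa >= 3.0 else 0)


code = name_code("Becker")               # 0b10000
sig = record_signature("Becker", 3.5)    # 0b100001
```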
-
- Partial Match Retrieval with Page Signatures
- - this idea can be applied to groups of
records (like in buckets)
- - now we can group sets of signatures
into a tree structure
-
- Bloom Filters
-
- Searching for records that are not
in the file is often a very expensive operation (looking for
a hashed record in a table created using open addressing often
results in the entire table being searched).
-
- Here's a way to make this more efficient:
-
- Example:
- We have a file containing a list of bad
credit card numbers
- - a long list though it represents a very
small portion of all credit card numbers
- - a list that must be searched even though
we expect not to find the number we have
-
- 1. Create a loooooong bit string (m
bits long)
- 2. As we insert records into the search
file we apply a battery of hash functions to the primary keys
of each record.
- h1(key) -> [0, m-1]
- h2(key) -> [0, m-1]
- h3(key) -> [0, m-1]
- :
- hn(key) -> [0, m-1]
- - the output of each determines which
bit we should set
- - these are all superimposed on top of
each other to form one signature
-
- It's like having a signature for the entire
file.
-
- - to retrieve a record we apply the same
series of hash functions to its key and check the corresponding bits
- - (here's
the cool part) the first occurrence
of a zero bit indicates for certain that the record is not
in the file
- - this allows us to reject most unsuccessful
searches without a single file access
-
- What if all the bits check out?
- 1. could be a match
- 2. could be a false drop
- so then we actually have to check the
file
-
- What we gain is that MOST searches don't
need to access the file at all - hence the name filter
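-
- A minimal Bloom filter sketch in Python. The battery h1..hn is imitated by salting SHA-256 with the function number, which is an assumption made to keep the example self-contained; the class name, the sizes and the card numbers are made up.

```python
import hashlib


class BloomFilter:
    def __init__(self, m, n_hashes):
        self.m = m                # length of the bit string
        self.n_hashes = n_hashes  # how many hash functions to apply
        self.bits = 0             # the m-bit signature for the whole file

    def _positions(self, key):
        """Yield h1(key) .. hn(key), each in the range 0 to m-1."""
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        """Set the bit chosen by every hash function (superimposed)."""
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        """False means certainly absent; True means check the file."""
        return all((self.bits >> p) & 1 for p in self._positions(key))


bad_cards = BloomFilter(m=1024, n_hashes=3)
for number in ("4111-0000-0000-0001", "4111-0000-0000-0002"):
    bad_cards.add(number)
```

- Calling bad_cards.might_contain(number) before opening the file gives the filter step: a False answer ends the search immediately, while a True answer is either a match or a false drop and still needs the file lookup.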
-