HBase performance with a large number of scans


I have a table with hundreds of millions of records. The table contains data about servers and the events generated on them. The row key of the table is:

rowkey = md5(serverid) + timestamp [32 hex characters + 10 digits = 42 characters]

One use case is to list all events between times t1 and t2. For this, a normal scan takes too long. To speed things up, I have done the following:

  1. Fetch the list of unique serverids from the table (this is fast).
  2. Divide that list into 256 buckets based on the first 2 hex characters of the md5 of each serverid.
  3. For each bucket, call a co-processor (with parallel requests), passing the list of serverids, the start time, and the end time.
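The bucketing in step 2 can be sketched in plain Java. The helper name bucketByMd5Prefix is illustrative only (the original post does not show this code); MessageDigest is from the JDK:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class BucketServers {
    // Group server IDs into up to 256 buckets keyed by the first two hex
    // characters of their MD5 hash, as described in step 2 above.
    static Map<String, List<String>> bucketByMd5Prefix(List<String> serverIds) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        Map<String, List<String>> buckets = new TreeMap<>();
        for (String id : serverIds) {
            byte[] digest = md5.digest(id.getBytes(StandardCharsets.UTF_8));
            // The first digest byte rendered as 2 hex chars: "00" .. "ff" (256 values).
            String prefix = String.format("%02x", digest[0]);
            buckets.computeIfAbsent(prefix, k -> new ArrayList<>()).add(id);
        }
        return buckets;
    }

    public static void main(String[] args) throws Exception {
        Map<String, List<String>> buckets =
                bucketByMd5Prefix(Arrays.asList("server-1", "server-2", "server-3"));
        System.out.println(buckets.size() + " buckets used");
    }
}
```

Each bucket can then be handed to one parallel co-processor request, as in step 3.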

The co-processor scans the table as follows:

for (String serverId : serverIds) {
    byte[] startKey = generateKey(serverId, startTime);
    byte[] endKey = generateKey(serverId, endTime);
    Scan scan = new Scan(startKey, endKey);
    InternalScanner scanner = env.getRegion().getScanner(scan);
    ...
}
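For reference, a generateKey consistent with the 42-character row-key layout above (32 hex characters of md5 plus a 10-digit timestamp) could look like the following. The exact encoding is an assumption, since the original helper is not shown:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class KeyGen {
    // Hypothetical sketch of generateKey:
    // rowkey = md5(serverid) as 32 hex chars + timestamp as 10 decimal digits.
    static byte[] generateKey(String serverId, long epochSeconds) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(serverId.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder(42);
        for (byte b : digest) {
            sb.append(String.format("%02x", b)); // 16 bytes -> 32 hex chars
        }
        sb.append(String.format("%010d", epochSeconds)); // zero-padded 10-digit timestamp
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        byte[] key = generateKey("server-1", 1444398443L);
        System.out.println(key.length); // 42
    }
}
```

Zero-padding the timestamp keeps keys for the same server in chronological byte order, which is what makes the per-server (startKey, endKey) range scan work.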

I am able to get results quickly with this approach. My concern is the large number of scans: if the table has 20,000 serverids, the code above makes 20,000 scans. Will this impact the overall performance and scalability of HBase?

Try using a timestamp filter. The following syntax was tested in the HBase shell:

import java.util.ArrayList
import org.apache.hadoop.hbase.filter.TimestampsFilter
list = ArrayList.new()
list.add(1444398443674)   # start timestamp
list.add(1444457737937)   # end timestamp
scan 'eventlogtable', {FILTER => TimestampsFilter.new(list)}

The same API exists in Java and other languages too.
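A minimal Java sketch of the same idea (an assumption, not code from the original answer). One caveat worth hedging: TimestampsFilter keeps only cells whose timestamps exactly equal the listed values, so for a contiguous [t1, t2) window, Scan.setTimeRange is usually the intended call:

```java
import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.TimestampsFilter;

public class TimestampScan {
    // Exact-match variant, mirroring the shell example above: keeps only
    // cells whose timestamps are exactly the two listed values.
    public static Scan exactScan(long t1, long t2) {
        Scan scan = new Scan();
        scan.setFilter(new TimestampsFilter(Arrays.asList(t1, t2)));
        return scan;
    }

    // Range variant: selects all cells with timestamps in [t1, t2),
    // which matches the "events between t1 and t2" use case.
    public static Scan rangedScan(long t1, long t2) throws IOException {
        Scan scan = new Scan();
        scan.setTimeRange(t1, t2);
        return scan;
    }
}
```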

