HBase Data Export, Import and Migration

CopyTable

CopyTable is a utility that can copy part or all of a table, either to the same cluster or to another cluster. The usage is as follows:

$ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable [--starttime=X] [--endtime=Y] [--new.name=NEW] [--peer.adr=ADR] tablename

Options:

  • starttime Beginning of the time range. Without endtime means starttime to forever.
  • endtime End of the time range. Ignored if no starttime is specified.
  • versions Number of cell versions to copy.
  • new.name New table’s name.
  • peer.adr Address of the peer cluster given in the format hbase.zookeeper.quorum:hbase.zookeeper.client.port:zookeeper.znode.parent
  • families Comma-separated list of ColumnFamilies to copy.
  • all.cells Also copy delete markers and uncollected deleted cells (advanced option).

Args:

  • tablename Name of table to copy.

 

Example of copying ‘TestTable’ to a cluster that uses replication for a 1 hour window:

$ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
--starttime=1265875194289 --endtime=1265878794289 \
--peer.adr=server1,server2,server3:2181:/hbase TestTable
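
The two timestamps in the example are milliseconds since the Unix epoch and differ by exactly 3,600,000 ms, which is one hour. The following is a minimal Python sketch of how such a window and the matching argument list could be computed. The helper names `to_millis` and `copytable_args` are hypothetical and not part of HBase; only the options listed above are used.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_millis(dt):
    """Milliseconds since the Unix epoch, using integer arithmetic only."""
    return (dt - EPOCH) // timedelta(milliseconds=1)


def copytable_args(table, start, window, peer_adr=None):
    """Hypothetical helper: build a CopyTable argument list for a time window."""
    args = [
        "bin/hbase",
        "org.apache.hadoop.hbase.mapreduce.CopyTable",
        f"--starttime={to_millis(start)}",
        f"--endtime={to_millis(start + window)}",
    ]
    if peer_adr is not None:
        args.append(f"--peer.adr={peer_adr}")
    args.append(table)  # the table name is the final positional argument
    return args


# Rebuild the one-hour example from the text.
start = EPOCH + timedelta(milliseconds=1265875194289)
cmd = copytable_args(
    "TestTable",
    start,
    timedelta(hours=1),
    peer_adr="server1,server2,server3:2181:/hbase",
)
print(" ".join(cmd))
```

Printing the joined list reproduces the arguments of the example command, so the sketch can serve as a check that a chosen window really has the intended length.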

Export

Export is a utility that will dump the contents of a table to HDFS in a sequence file. Invoke via:

$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]

 

Note: caching for the input Scan is configured via hbase.client.scanner.caching in the job configuration.
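
The optional Export arguments are positional and nested: a start time can only be given after a version count, and an end time only after a start time. A small Python sketch of that rule follows. The helper `export_args` and the output path used in the example are hypothetical illustrations, not part of HBase.

```python
def export_args(table, output_dir, versions=None, starttime=None, endtime=None):
    """Hypothetical helper: build an Export argument list.

    Mirrors the usage string
    <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]
    by refusing an inner optional argument without the ones before it.
    """
    if starttime is not None and versions is None:
        raise ValueError("starttime requires versions")
    if endtime is not None and starttime is None:
        raise ValueError("endtime requires starttime")

    args = [
        "bin/hbase",
        "org.apache.hadoop.hbase.mapreduce.Export",
        table,
        output_dir,
    ]
    # Append optional positionals in order, stopping at the first missing one.
    for value in (versions, starttime, endtime):
        if value is None:
            break
        args.append(str(value))
    return args


# Example with an assumed output directory.
print(" ".join(export_args("TestTable", "/export/TestTable", 1, 1265875194289)))
```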

Import

Import is a utility that will load data that has been exported back into HBase. Invoke via:

$ bin/hbase org.apache.hadoop.hbase.mapreduce.Import <tablename> <inputdir>

Please refer to http://hbase.apache.org/book/ops.backup.html for more details.

Understand Multilevel index and Blocking factor

I got the following lecture notes from http://computersciencecafe.blogspot.co.uk/2010/10/blocking-factor-in-multilevel-index.html

Why use a multilevel index?
In an ordered sequential file, the number of block accesses required is log(2,b), i.e. the base-2 logarithm of b, where b is the number of blocks needed to store the index file. Multilevel indexes were introduced to reduce the number of block accesses needed to reach a record.
The number of block accesses required for a multilevel index is log(bfr,b), so a multilevel index needs fewer accesses than an ordered index file whenever bfr is greater than 2.

Blocking factor

bfr  =  B/R =  (block size / record size)
The blocking factor, or fan-out, of a multilevel index is the number of records that fit in a single block (records per block).
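
The difference between log(2,b) and log(bfr,b) can be checked numerically. Below is a minimal Python sketch; the index size of b = 1024 blocks is an assumed figure for illustration and does not come from the lecture notes. The count is obtained by repeated ceiling division, which equals the logarithm rounded up to a whole number.

```python
def blocking_factor(block_size, record_size):
    """Records per block: bfr = B / R, rounded down to whole records."""
    return block_size // record_size


def accesses(b, fan_out):
    """Smallest t with fan_out**t >= b, i.e. log(fan_out, b) rounded up."""
    steps = 0
    while b > 1:
        b = -(-b // fan_out)  # ceiling division narrows the search by fan_out
        steps += 1
    return steps


b = 1024                          # assumed number of index blocks
bfr = blocking_factor(1024, 16)   # 1024-byte blocks, 16-byte index entries

print("binary search:", accesses(b, 2))      # 10 block accesses
print("multilevel   :", accesses(b, bfr))    # 2 block accesses
```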

Problem 1.1

Consider an ordered data file with the following parameters:

r (number of records) = 16348
R (record size) = 32 bytes
B (block size) = 1024 bytes

The index is stored as key + pointer pairs:

key value = 10 bytes
block pointer = 6 bytes

Find the number of first-level and second-level blocks required for a multilevel index on this file.

Solution:

Number of First level Blocks

First, find the number of blocks in the data file.

The number of records that fit in one block, i.e. the
blocking factor, is bfr = 1024/32 = 32 = 2^5,
so a block holds 32 records.

How many such blocks are required for 16348 records?

number of blocks required for data file = ceil(r/bfr)
= ceil(16348/32) = ceil(510.875) = 511

A primary index on an ordered file needs one entry per data block, so the first-level index holds 511 entries.


Next, find how many blocks are needed to store these 511 entries,
i.e. how many blocks the first level of the multilevel index requires when each entry is 16 bytes (key + pointer size):
R’ = 16
B = 1024
bfr’ = 1024/16 = 64 = 2^6

The blocking factor, or fan-out, is the same for the first level and all subsequent levels because every index entry has the same size.

So the number of blocks required for r’ = 511 entries is ceil(r’/bfr’)
= ceil(511/64) = ceil(7.98) = 8

Number of Second level Blocks

It's clear that a single second-level block is enough to store 8 entries,
but let's calculate it anyway.
Number of entries in second level = Number of blocks in the first level  = 8
Number of blocks in second level = ceil((number of first level blocks)/bfr’)
= ceil(r”/bfr’)

The blocking factor bfr’ is the same as for the first level, because the second level also stores key + pointer pairs.
The number of records is now r” = 8.

So, Number of blocks for second level = ceil(8/64) = 1
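
The arithmetic of Problem 1.1 can be reproduced with a few lines of Python. This is only a sketch of the calculation above; the variable names are chosen here for illustration, and every division of a count by a blocking factor is rounded up because a partly filled block still occupies a whole block.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


r = 16348            # number of records
R = 32               # record size in bytes
B = 1024             # block size in bytes
key_size = 10        # bytes
pointer_size = 6     # bytes

bfr = B // R                                     # 32 records per data block
data_blocks = ceil_div(r, bfr)                   # 511 data blocks
index_bfr = B // (key_size + pointer_size)       # 64 index entries per block

first_level = ceil_div(data_blocks, index_bfr)   # one entry per data block
second_level = ceil_div(first_level, index_bfr)  # one entry per first-level block

print("data blocks        :", data_blocks)    # 511
print("first-level blocks :", first_level)    # 8
print("second-level blocks:", second_level)   # 1
```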

Problem 1.2

Find the same quantities for a secondary index on an unordered key of the data file,
with the same parameters.

Solution:

In the case of a secondary index, one index entry is required for each data record in the data file.

Number of First level blocks

The first-level index stores an index entry for every record (16348) in the data file.

Number of blocks needed for first level index = ceil(r/bfr) = ceil(16348/64) = ceil(255.44) = 256
(bfr = 1024/(10+6) = 64)


Number of second level blocks

Number of entries in second level = Number of blocks in first level = 256
bfr = 64 is unchanged and r = 256,
so Number of second level blocks = 256/64 = 4
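
The same level-by-level rule can be written as a loop that keeps building levels until a single block remains. The Python sketch below uses a hypothetical helper, `index_levels`. Note that it continues one step past what the problem asks for: the third level with a single block is an extension of the notes, obtained by applying the same division once more.

```python
def index_levels(entries, fan_out):
    """Block count of each index level, bottom-up, until one block remains."""
    levels = []
    while True:
        blocks = -(-entries // fan_out)  # ceiling division
        levels.append(blocks)
        if blocks == 1:
            return levels
        entries = blocks  # the next level indexes the blocks of this one


fan_out = 1024 // (10 + 6)  # 64 key + pointer pairs per block

# Secondary index: one first-level entry per record.
print(index_levels(16348, fan_out))  # [256, 4, 1]

# Primary index from Problem 1.1: one first-level entry per data block.
print(index_levels(511, fan_out))    # [8, 1]
```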

Future Collaboration: Dropbox+Workflow+Social Network concept

How do we collaborate now?

—  Email attachments: the most common way. It is difficult to complete complex workflow tasks, and there is no data synchronization.

—  Microsoft SharePoint: static workflows on documents. It does not fit an open cloud environment: creating temporary groups and sharing data is difficult.

—  Google Docs: several people edit documents online at the same time. It provides basic permission controls over who can read or write the documents.

—  Huddle: assigns single tasks to users on a specific document, bounded by start and end dates.

 

SharePoint (Microsoft)

—  Document collaboration means several authors work on a document or collection of documents together. They could be simultaneously co-authoring a document or reviewing a specification as part of a structured workflow.

—  Semiformal co-authoring: Multiple authors edit simultaneously anywhere in the document.

—  Formal co-authoring: Multiple authors edit simultaneously in a controlled way by saving content when ready to be revealed. Examples include: business plans, newsletters, and legal briefs for Word; and marketing and conference presentations for PowerPoint.

—  Comment and review: A primary author solicits edits and comments (which can be threaded discussions) by routing the document in a workflow, but controls final document publishing. Examples include online help, white papers, and specifications.

—  Document sets: Multiple authors are assigned separate documents as part of a workflow, and then one master document set is published. Examples include: new product literature and a sales pitch book.

 

Typical story:

  1. A teacher sends homework to all the students after the class.
  2. All students receive the homework file.
  3. Students complete the homework.
  4. The homework file gets submitted to the not-graded folder.
  5. The teacher gets a message and starts marking it or sends the homework to teaching helpers.
  6. The homework gets marked and sent back to the student to redo the incorrect sections.
  7. The student updates the doc accordingly and completes it.
  8. The teacher grades the doc again and marks it complete.
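
The story above is effectively a small state machine over one homework document. The following Python sketch is one possible way to model it; the state and event names are invented here for illustration and do not come from any existing product.

```python
from enum import Enum, auto


class State(Enum):
    ASSIGNED = auto()    # homework sent to and received by the student
    SUBMITTED = auto()   # file sits in the not-graded folder
    MARKING = auto()     # teacher or teaching helper is marking
    RETURNED = auto()    # sent back for the incorrect sections
    COMPLETE = auto()    # graded and closed


# Allowed (state, event) pairs and the state each one leads to.
TRANSITIONS = {
    (State.ASSIGNED, "submit"): State.SUBMITTED,
    (State.SUBMITTED, "start_marking"): State.MARKING,
    (State.MARKING, "return_for_rework"): State.RETURNED,
    (State.MARKING, "accept"): State.COMPLETE,
    (State.RETURNED, "submit"): State.SUBMITTED,
}


def advance(state, event):
    """Apply one event, rejecting anything the workflow does not allow."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event!r} is not allowed in state {state.name}") from None


# Replay the typical story: one round of rework, then acceptance.
state = State.ASSIGNED
for event in ["submit", "start_marking", "return_for_rework",
              "submit", "start_marking", "accept"]:
    state = advance(state, event)
print(state.name)  # COMPLETE
```

Keeping the transitions in a table makes it easy to see which steps a dynamic workflow would need to add or relax, for example letting a teaching helper take over the marking step.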

Open State Scenario

For example, in an open research environment, researchers need to circulate documents to other appropriate people when needed. Often there are legal constraints on where the data can be accessed and how the documents can be circulated. In such a project, for instance, data cannot leak beyond the designated region.
