CloudxLab

Thursday, April 28, 2016

Falcon FEED needs certain features


This is based on experience with Falcon 0.8 and on operational feasibility

  • A feed should be readable from or writable to various data sources or clusters, and retention should be optional
    • Sources like the ones below manage their own retention policy, so Falcon retention is pointless for them; if accidentally scheduled, it kicks off an Oozie coordinator that sits forever in KILLED state (see the CLI sketch after this list)
      • S3 retention
      • Flume
      • HBase
      • Hive
      • Kafka


  • A feed can be replicated to any external store
    • needs support for adding a custom jar/path
      • S3
      • NFS
      • SAN
  • Feed replication is not complete until we have data from all colos
    • in this case we need the feed to be available for consumption from whichever colo has already replicated
    • add a promotion property as part of the feed replication spec (property=promoted) and specify a directory
    • this helps when a feed is replicated, since the promoted directory can then be tracked per instance
  • A feed should have a pipeline entity, so that the source of the feed is known
    • doing this via dependencies is too cumbersome; it leaves lots of discarded or junk process dependencies
    • feeds which are replicated, promoted or archived can be tied to pipelines, which helps with maintenance or backlog reprocessing
  • Feeds should have properties for replication and archival, such as
    • replication, archival (optional retention, since it is just a move out of the current source), promotion (it is a move, so no retention is required)
    • replication and archival should also support fetch/push operation and the data type
    • we should be able to replicate data from HDFS to a DB and vice versa
    • this can be helpful for bulk migration

  • Feeds which have an end time defined should be retired/deleted from the config store
    • falcon startup.properties can carry a retention setting for retired feeds


  • A feed should auto-update the pipelines field in its entity when a new process is added
    • This will help track whether multiple pipelines are using the same feed


  • A feed should validate that it has write permissions (already there) and can schedule the job in the queue.
    • This helps catch problems at submit time, before scheduling
  • A feed re-run for archival or retention should validate that the instance has not crossed the retention period.
    • Otherwise the jobs keep failing because the source may no longer have the data
  • Feeds should maintain stats for the activities they perform, rather than just logging
    • Amount of data transferred and the transfer speed, for replication/archival
    • Amount of data deleted and the time taken, for retention
  • Feed retention should not be based on the feed frequency, but run as a 30-minutely, hourly or daily job
    • Only the last instance should take care of all retention
    • If any instance fails, the next succeeded instance should take care of cleanup for the previous instances
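
For the retention point above, the behaviour is easy to see from the CLI. A minimal sketch, assuming a feed named my-feed (the feed name and file are placeholders) and the standard Falcon and Oozie CLIs:

## Submit and schedule the feed; Falcon materializes retention/replication as Oozie coordinators
falcon entity -type feed -submit -file my-feed.xml
falcon entity -type feed -schedule -name my-feed

## Look for the feed's coordinators; if the source manages its own retention,
## the retention coordinator is the one that ends up sitting in KILLED state
oozie jobs -jobtype coordinator | grep -i my-feed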


Saturday, December 12, 2015

Falcon 0.8 Released..

It's in production and looks promising

The Apache Falcon team would like to announce the release of Falcon 0.8. Falcon is a feed processing and data management system aimed at making it easier for end consumers to onboard their feed processing and feed management on Hadoop clusters. Falcon 0.8 aims to ensure backward compatibility with older versions of Falcon, improve usability, fix bugs, increase performance and optimize a few of the already implemented features.

http://falcon.apache.org/0.8/index.html

https://cwiki.apache.org/confluence/download/attachments/61318307/ReleaseNotes-ApacheFalcon-0.8.pdf

Tuesday, September 1, 2015

Vagrant config for cloud projects


For running cloud projects, I have set up Vagrant with folders shared from the local disk; I make changes in IntelliJ and run the application in the Ubuntu Vagrant VM

github://sanjeevtripurari/vagrant-vm 



ubuntu1.vm.network "private_network", ip: "192.168.156.10"
ubuntu1.vm.host_name="192.168.156.10"
config.ssh.username="sanjeevt"
config.vm.synced_folder "/Users/sanjeev.tripurari/Projects", "/home/sanjeevt/Projects"
config.vm.synced_folder "/Users/sanjeev.tripurari/Downloads", "/home/sanjeevt/Downloads"
config.vm.synced_folder "/Users/sanjeev.tripurari/.m2", "/home/sanjeevt/.m2"

Giving the hostname as the IP address helps a lot, since there is no need for DNS and the cluster nodes can be reached directly.
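
With that Vagrantfile in place, day-to-day use is just the standard Vagrant CLI (machine name ubuntu1 as in the config above):

vagrant up ubuntu1
vagrant ssh ubuntu1
## inside the guest, the synced folders should already be mounted
ls /home/sanjeevt/Projects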

Monday, August 31, 2015

Hadoop compile from source

Here is my Git repo, where I maintain the actual updates for Hadoop compilation


github://sanjeevtripurari/hadoop-configs



## On Local Directory
cd Projects/hadoop/
## my git config on local
cat .git/config

# [remote "origin"]
# url = ssh://git@github.com/sanjeevtripurari/hadoop.git
# [remote "upstream"]
# url = ssh://git@github.com/apache/hadoop.git
## Keeping the source updated
git pull upstream trunk
git push -u

### Development env setup
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get upgrade
sudo apt-get install build-essential git maven subversion
sudo apt-get install g++ autoconf automake libtool cmake zlib1g-dev pkg-config libssl-dev protobuf-compiler

### Java Install

## For oracle java auto install, we need to accept the license non-interactively
echo oracle-java8-installer shared/accepted-oracle-license-v1-1 select true | sudo /usr/bin/debconf-set-selections
sudo apt-get install oracle-java8-installer
## Set JAVA_HOME, and java8 as default
sudo apt-get install oracle-java8-set-default
sudo update-java-alternatives -s java-8-oracle
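
As a quick sanity check (not in the original steps), confirm Java 8 is now the default before moving on:

java -version
## JAVA_HOME should also be set by oracle-java8-set-default after a fresh login
echo $JAVA_HOME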

### We need Protobuf 2.5.0, required by hadoop source to compile
wget https://github.com/google/protobuf/archive/v2.5.0.tar.gz
tar -xvzf v2.5.0.tar.gz
cd protobuf-2.5.0/
./autogen.sh
./configure --prefix=/opt/protobuf-2.5.0
make
sudo make install

cd /opt/protobuf-2.5.0/

## Add to ldconfig, so hadoop compilation can pick the .so files
sudo ldconfig -v /opt/protobuf-2.5.0/lib |grep -i pro

## export PATH
export PATH=/opt/protobuf-2.5.0/bin:$PATH
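## (Optional sanity check) the hadoop build expects exactly protobuf 2.5.0
protoc --version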
## Here you go for the hadoop compilation; skip the javadoc, as java8 has issues parsing the docs (it works with java7)

mvn clean package -Pdist,native -Dmaven.javadoc.skip=true  -DskipTests -Dtar

Check the build target in
# target: hadoop-dist/target/
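
If you want to verify the freshly built distribution (tarball name as in the build output below), unpack it and ask it for its version:

cd hadoop-dist/target/
tar -xzf hadoop-3.0.0-SNAPSHOT.tar.gz
./hadoop-3.0.0-SNAPSHOT/bin/hadoop version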

Pull my cluster configs if you want to do a quick setup.


## For me the hadoop source compiled fine, and this is the output:

main:
     [exec] $ tar cf hadoop-3.0.0-SNAPSHOT.tar hadoop-3.0.0-SNAPSHOT
     [exec] $ gzip -f hadoop-3.0.0-SNAPSHOT.tar
     [exec]
     [exec] Hadoop dist tar available at: /home/sanjeevt/Projects/hadoop/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT.tar.gz
     [exec]
[INFO] Executed tasks
[INFO]
[INFO] --- maven-javadoc-plugin:2.8.1:jar (module-javadocs) @ hadoop-dist ---
[INFO] Skipping javadoc generation
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Apache Hadoop Main ................................ SUCCESS [4.008s]
[INFO] Apache Hadoop Build Tools ......................... SUCCESS [1.185s]
[INFO] Apache Hadoop Project POM ......................... SUCCESS [2.567s]
[INFO] Apache Hadoop Annotations ......................... SUCCESS [4.056s]
[INFO] Apache Hadoop Assemblies .......................... SUCCESS [1.178s]
[INFO] Apache Hadoop Project Dist POM .................... SUCCESS [6.116s]
[INFO] Apache Hadoop Maven Plugins ....................... SUCCESS [6.408s]
[INFO] Apache Hadoop MiniKDC ............................. SUCCESS [6.141s]
[INFO] Apache Hadoop Auth ................................ SUCCESS [14.335s]
[INFO] Apache Hadoop Auth Examples ....................... SUCCESS [7.447s]
[INFO] Apache Hadoop Common .............................. SUCCESS [8:23.207s]
[INFO] Apache Hadoop NFS ................................. SUCCESS [21.162s]
[INFO] Apache Hadoop KMS ................................. SUCCESS [57.151s]
[INFO] Apache Hadoop Common Project ...................... SUCCESS [0.573s]
[INFO] Apache Hadoop HDFS Client ......................... SUCCESS [3:50.178s]
[INFO] Apache Hadoop HDFS ................................ SUCCESS [9:44.831s]
[INFO] Apache Hadoop HttpFS .............................. SUCCESS [2:04.535s]
[INFO] Apache Hadoop HDFS BookKeeper Journal ............. SUCCESS [13.240s]
[INFO] Apache Hadoop HDFS-NFS ............................ SUCCESS [15.292s]
[INFO] Apache Hadoop HDFS Project ........................ SUCCESS [0.635s]
[INFO] Apache Hadoop YARN ................................ SUCCESS [0.539s]
[INFO] Apache Hadoop YARN API ............................ SUCCESS [1:41.401s]
[INFO] Apache Hadoop YARN Common ......................... SUCCESS [2:11.975s]
[INFO] Apache Hadoop YARN Server ......................... SUCCESS [0.768s]
[INFO] Apache Hadoop YARN Server Common .................. SUCCESS [34.561s]
[INFO] Apache Hadoop YARN NodeManager .................... SUCCESS [1:34.676s]
[INFO] Apache Hadoop YARN Web Proxy ...................... SUCCESS [11.593s]
[INFO] Apache Hadoop YARN ApplicationHistoryService ...... SUCCESS [25.655s]
[INFO] Apache Hadoop YARN ResourceManager ................ SUCCESS [1:50.939s]
[INFO] Apache Hadoop YARN Server Tests ................... SUCCESS [8.727s]
[INFO] Apache Hadoop YARN Client ......................... SUCCESS [17.878s]
[INFO] Apache Hadoop YARN SharedCacheManager ............. SUCCESS [10.647s]
[INFO] Apache Hadoop YARN Applications ................... SUCCESS [0.436s]
[INFO] Apache Hadoop YARN DistributedShell ............... SUCCESS [8.340s]
[INFO] Apache Hadoop YARN Unmanaged Am Launcher .......... SUCCESS [5.777s]
[INFO] Apache Hadoop YARN Site ........................... SUCCESS [0.515s]
[INFO] Apache Hadoop YARN Registry ....................... SUCCESS [19.661s]
[INFO] Apache Hadoop YARN Project ........................ SUCCESS [4:34.361s]
[INFO] Apache Hadoop MapReduce Client .................... SUCCESS [1.314s]
[INFO] Apache Hadoop MapReduce Core ...................... SUCCESS [1:23.951s]
[INFO] Apache Hadoop MapReduce Common .................... SUCCESS [34.491s]
[INFO] Apache Hadoop MapReduce Shuffle ................... SUCCESS [6.693s]
[INFO] Apache Hadoop MapReduce App ....................... SUCCESS [54.634s]
[INFO] Apache Hadoop MapReduce HistoryServer ............. SUCCESS [18.979s]
[INFO] Apache Hadoop MapReduce JobClient ................. SUCCESS [53.345s]
[INFO] Apache Hadoop MapReduce HistoryServer Plugins ..... SUCCESS [4.575s]
[INFO] Apache Hadoop MapReduce NativeTask ................ SUCCESS [2:53.471s]
[INFO] Apache Hadoop MapReduce Examples .................. SUCCESS [13.553s]
[INFO] Apache Hadoop MapReduce ........................... SUCCESS [2:31.103s]
[INFO] Apache Hadoop MapReduce Streaming ................. SUCCESS [15.238s]
[INFO] Apache Hadoop Distributed Copy .................... SUCCESS [22.616s]
[INFO] Apache Hadoop Archives ............................ SUCCESS [8.276s]
[INFO] Apache Hadoop Rumen ............................... SUCCESS [24.058s]
[INFO] Apache Hadoop Gridmix ............................. SUCCESS [17.028s]
[INFO] Apache Hadoop Data Join ........................... SUCCESS [7.206s]
[INFO] Apache Hadoop Ant Tasks ........................... SUCCESS [3.241s]
[INFO] Apache Hadoop Extras .............................. SUCCESS [6.438s]
[INFO] Apache Hadoop Pipes ............................... SUCCESS [12.511s]
[INFO] Apache Hadoop OpenStack support ................... SUCCESS [10.768s]
[INFO] Apache Hadoop Amazon Web Services support ......... SUCCESS [8.633s]
[INFO] Apache Hadoop Azure support ....................... SUCCESS [13.133s]
[INFO] Apache Hadoop Client .............................. SUCCESS [15.168s]
[INFO] Apache Hadoop Mini-Cluster ........................ SUCCESS [0.840s]
[INFO] Apache Hadoop Scheduler Load Simulator ............ SUCCESS [16.242s]
[INFO] Apache Hadoop Tools Dist .......................... SUCCESS [35.833s]
[INFO] Apache Hadoop Tools ............................... SUCCESS [0.465s]
[INFO] Apache Hadoop Distribution ........................ SUCCESS [1:01.293s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 55:44.365s
[INFO] Finished at: Tue Sep 01 07:29:57 UTC 2015
[INFO] Final Memory: 139M/468M
[INFO] ------------------------------------------------------------------------


Sunday, December 1, 2013

Falcon Onboarding

Here we will start with a first look at Falcon, as we have been using it for our actual production data pipeline operations.

We will go through how to configure and onboard a pipeline on Falcon..

http://falcon.incubator.apache.org/docs/InstallationSteps.html
http://falcon.incubator.apache.org/docs/FalconArchitecture.html
http://falcon.incubator.apache.org/docs/OnBoarding.html
http://falcon.incubator.apache.org/docs/EntitySpecification.html
http://falcon.incubator.apache.org/docs/FalconCLI.html

First get your Hadoop clusters ready, with Oozie and ActiveMQ. Yes, clusters: we will have two Hadoop setups for our activity..
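
As a rough sketch of where this is going (the hostname and entity file names below are just placeholders), once Oozie is reachable on both clusters, each cluster gets registered with Falcon as a cluster entity:

## Check Oozie is up on each cluster, then register the clusters with Falcon
oozie admin -oozie http://oozie-host:11000/oozie -status
falcon entity -type cluster -submit -file primary-cluster.xml
falcon entity -type cluster -submit -file backup-cluster.xml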

keep watching..

Tuesday, August 27, 2013

Cloud Technologies

..here I will be posting on Hadoop infrastructure and data pipeline engineering...

  • Get Docs
    • http://hadoop.apache.org/docs/stable/hdfs_design.html
    • http://falcon.incubator.apache.org/index.html
    • http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html

  • Get Hadoop installed
    • http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
    • http://www.opensourceclub.net/hadoop/cloudera-hadoop-single-node-cluster-pseudo-distributed-mode-on-mac-os-x-lion/ 
    • http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-Quick-Start/cdh4qs_topic_3_3.html