Wednesday, July 8, 2020
HDFS Tutorial
HDFS Tutorial: Introduction to HDFS and its Features
Last updated on May 22, 2019 | 59.6K Views | Ashish Bakshi | 13 Comments
Blog from Hadoop Distributed File System

Before moving ahead in this HDFS tutorial blog, let me take you through some of the insane statistics related to HDFS:
In 2010, Facebook claimed to have one of the largest HDFS clusters, storing 21 petabytes of data.
In 2012, Facebook declared that they had the largest single HDFS cluster, with more than 100 PB of data.
Yahoo! has more than 100,000 CPUs in over 40,000 servers running Hadoop, with its biggest Hadoop cluster running 4,500 nodes. All told, Yahoo! stores 455 petabytes of data in HDFS.
In fact, by 2013, most of the big names in the Fortune 50 had started using Hadoop.
Too hard to digest? Right. As discussed in the Hadoop Tutorial, Hadoop has two fundamental units: Storage and Processing.
When I say the storage part of Hadoop, I am referring to HDFS, which stands for Hadoop Distributed File System. So, in this blog, I will be introducing you to HDFS. Here, I will be talking about:
What is HDFS?
Advantages of HDFS
Features of HDFS

Before talking about HDFS, let me tell you, what is a Distributed File System?

DFS or Distributed File System:
A Distributed File System is about managing data, i.e. files or folders, across multiple computers or servers. In other words, a DFS is a file system that allows us to store data over multiple nodes or machines in a cluster and allows multiple users to access it. So basically, it serves the same purpose as the file system available on your own machine: on Windows you have NTFS (New Technology File System), and on Mac you have HFS (Hierarchical File System). The only difference is that, in the case of a Distributed File System, you store data on multiple machines rather than a single machine. Even though the files are stored across the network, the DFS organizes and displays the data in such a manner that a user sitting at one machine feels like all the data is stored on that very machine.

What is HDFS?
Hadoop Distributed File System, or HDFS, is a Java-based distributed file system that allows you to store large data across multiple nodes in a Hadoop cluster. So, if you install Hadoop, you get HDFS as the underlying storage system for storing data in the distributed environment. Let's take an example to understand it. Imagine that you have ten machines, each with a 1 TB hard drive. Now, if you install Hadoop as a platform on top of these ten machines, you get HDFS as a storage service. HDFS is distributed in such a way that every machine contributes its individual storage for storing any kind of data.

HDFS Tutorial: Advantages of HDFS
1. Distributed Storage:
When you access the Hadoop Distributed File System from any of the ten machines in the Hadoop cluster, you will feel as if you have logged into a single large machine with a storage capacity of 10 TB (the total storage over ten machines). What does it mean? It means that you can store a single large file of 10 TB, which will be distributed over the ten machines (1 TB each). So, you are not limited to the physical boundaries of each individual machine.

2. Distributed Parallel Computation:
Because the data is divided across the machines, it allows us to take advantage of distributed and parallel computation. Let's understand this concept with the above example. Suppose it takes 43 minutes to process a 1 TB file on a single machine. So, now tell me, how much time will it take to process the same 1 TB file when you have 10 machines of similar configuration in a Hadoop cluster: 43 minutes or 4.3 minutes? 4.3 minutes, right! What happened here? Each of the nodes works on a part of the 1 TB file in parallel. Therefore, the work which took 43 minutes before now finishes in just 4.3 minutes, as the work got divided over ten machines.

3. Horizontal Scalability:
Last but not least, let us talk about horizontal scaling, or scaling out, in Hadoop. There are two types of scaling: vertical and horizontal. In vertical scaling (scaling up), you increase the hardware capacity of your system. In other words, you procure more RAM or CPUs and add them to your existing system to make it more robust and powerful. But there are challenges associated with vertical scaling:
There is always a limit to which you can increase your hardware capacity. So, you can't keep on increasing the RAM or CPUs of a machine.
In vertical scaling, you stop your machine first. Then you increase the RAM or CPUs to make it a more robust hardware stack. After you have increased your hardware capacity, you restart the machine.
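The divide-and-conquer idea behind point 2 (Distributed Parallel Computation) can be mimicked on a single machine, with plain Python threads standing in for cluster nodes. This is only a toy sketch, not Hadoop code: each "node" counts words in its own slice of the input, and the partial results are merged at the end.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Toy stand-in for distributed parallel computation: each "node" (here, a
# thread) counts words in its own slice of the input, and the partial
# results are merged at the end. This is plain Python, not Hadoop code.
words = "hdfs stores big data and hdfs replicates big data blocks".split()
num_nodes = 2
slices = [words[i::num_nodes] for i in range(num_nodes)]  # split the work

with ThreadPoolExecutor(max_workers=num_nodes) as pool:
    partial_counts = list(pool.map(Counter, slices))

total = sum(partial_counts, Counter())
print(total["hdfs"], total["big"], total["data"])  # 2 2 2

# The arithmetic from the example above: 43 minutes of work split
# evenly over 10 similar machines takes about 43 / 10 = 4.3 minutes.
print(43 / 10)  # 4.3
```

The real speedup is rarely this ideal, of course; splitting, scheduling, and merging all add overhead, but the principle is the same one HDFS and MapReduce exploit.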
This downtime while your system is stopped becomes a challenge. In the case of horizontal scaling (scaling out), you add more nodes to the existing cluster instead of increasing the hardware capacity of individual machines. And most importantly, you can add more machines on the go, i.e. without stopping the system. Therefore, while scaling out, there is no downtime or maintenance window; at the end of the day, you simply have more machines working in parallel to meet your requirements.

HDFS Tutorial Video:
You may check out the video given below, where all the concepts related to HDFS are discussed in detail.

HDFS Tutorial: Features of HDFS
We will understand these features in detail when we explore the HDFS architecture in the next HDFS tutorial blog. But for now, let's have an overview of the features of HDFS:

Cost: HDFS is generally deployed on commodity hardware, like the desktop or laptop you use every day. So, it is very economical in terms of the cost of ownership of the project. Since we are using low-cost commodity hardware, you don't need to spend a huge amount of money to scale out your Hadoop cluster. In other words, adding more nodes to your HDFS is cost-effective.

Variety and Volume of Data: When we talk about HDFS, we talk about storing huge data, i.e. terabytes and petabytes of data, and different kinds of data. So, you can store any type of data in HDFS, be it structured, unstructured, or semi-structured.

Reliability and Fault Tolerance: When you store data on HDFS, it internally divides the given data into data blocks and stores them in a distributed fashion across your Hadoop cluster. The information regarding which data block is located on which DataNode is recorded in the metadata. The NameNode manages the metadata, and the DataNodes are responsible for storing the data. HDFS also replicates the data, i.e. maintains multiple copies of each block. This replication makes HDFS very reliable and fault-tolerant.
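The block-and-replica bookkeeping just described can be sketched in a few lines of Python. This is a toy illustration, not Hadoop code: the `plan_blocks` helper and its round-robin placement are made up for the sketch (real HDFS uses a rack-aware placement policy), while the 128 MB block size and replication factor of 3 are the HDFS defaults.

```python
import itertools

BLOCK_SIZE = 128 * 1024 * 1024   # HDFS default block size: 128 MB
REPLICATION = 3                  # HDFS default replication factor

def plan_blocks(file_size, datanodes):
    """Toy sketch: cut a file into blocks and replicate each block onto
    several DataNodes. Placement is simplified to round-robin; real HDFS
    uses a rack-aware placement policy."""
    num_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    nodes = itertools.cycle(datanodes)
    metadata = {}  # what the NameNode tracks: block id -> replica locations
    for b in range(num_blocks):
        metadata[b] = [next(nodes) for _ in range(REPLICATION)]
    return metadata

meta = plan_blocks(1 * 1024**3, ["dn1", "dn2", "dn3", "dn4"])  # a 1 GB file
print(len(meta))        # 8 blocks of 128 MB each
print(meta[0])          # ['dn1', 'dn2', 'dn3']
raw_bytes = len(meta) * BLOCK_SIZE * REPLICATION
print(raw_bytes / 1024**3)  # 3.0 -> a 1 GB file occupies ~3 GB of raw storage
```

The last print anticipates the point below: with the default replication factor of 3, every gigabyte of logical data costs roughly three gigabytes of raw disk across the cluster.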
So, even if one of the nodes fails, we can retrieve the data from the replicas residing on other DataNodes. By default, the replication factor is 3. Therefore, if you store a 1 GB file in HDFS, it will finally occupy 3 GB of space. The NameNode periodically updates the metadata and keeps the replication factor consistent.

Data Integrity: Data integrity is about whether the data stored in HDFS is correct or not. HDFS constantly checks the integrity of the stored data against its checksums. If it finds any fault, it reports it to the NameNode. Then, the NameNode creates additional new replicas and deletes the corrupted copies.

High Throughput: Throughput is the amount of work done per unit of time. It tells you how fast you can access data from the file system; basically, it gives you an insight into the system's performance. You have seen in the above example how we used ten machines collectively to enhance computation: we were able to reduce the processing time from 43 minutes to a mere 4.3 minutes, as all the machines were working in parallel. Therefore, by processing data in parallel, we decreased the processing time tremendously and thus achieved high throughput.

Data Locality: Data locality is about moving the processing unit to the data rather than the data to the processing unit. In a traditional system, we bring the data to the application layer and then process it. But now, because of the architecture and the huge volume of the data, bringing the data to the application layer would reduce network performance to a noticeable extent. So, in HDFS, we bring the computation to the DataNodes where the data resides. Hence, you are not moving the data; you are bringing the program or processing part to the data.

So now, you have a brief idea about HDFS and its features. But trust me, guys, this is just the tip of the iceberg.
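The checksum mechanism described under Data Integrity above can be sketched as follows. This is a toy illustration, not the actual HDFS implementation: HDFS computes and stores checksums in small chunks alongside each block and re-verifies them on every read; here, a single `zlib.crc32` per block stands in for that machinery.

```python
import zlib

# Toy version of HDFS's read-path integrity check: a checksum is stored
# when a block is written and re-verified on every read.
def write_block(data):
    return {"data": data, "checksum": zlib.crc32(data)}

def read_block(block):
    if zlib.crc32(block["data"]) != block["checksum"]:
        raise IOError("block corrupt: report to NameNode, read a replica")
    return block["data"]

block = write_block(b"some HDFS block contents")
print(read_block(block))  # verifies cleanly and returns the data

block["data"] = b"bit-flipped block contents"  # simulate on-disk corruption
try:
    read_block(block)
except IOError as e:
    print(e)  # corruption detected; HDFS would serve another replica
```

The key idea is that corruption is caught at read time, and because every block has replicas elsewhere, a bad copy can simply be discarded and re-replicated from a good one.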
In my next HDFS tutorial blog, I will dive deep into the HDFS architecture and unveil the secrets behind the success of HDFS. Together, we will answer all those questions pondering in your head, such as:
What happens behind the scenes when you read or write data in the Hadoop Distributed File System?
What are the algorithms, like rack awareness, that make HDFS so fault-tolerant?
How does the Hadoop Distributed File System manage and create replicas?
What are block operations?

Now that you have understood HDFS and its features, check out the Hadoop training by Edureka, a trusted online learning company with a network of more than 250,000 satisfied learners spread across the globe. The Edureka Big Data Hadoop Certification Training course helps learners become experts in HDFS, Yarn, MapReduce, Pig, Hive, HBase, Oozie, Flume, and Sqoop, using real-time use cases in the Retail, Social Media, Aviation, Tourism, and Finance domains. Got a question for us? Please mention it in the comments section and we will get back to you.