How to choose the best AWS region for your cloud-based application?

When it comes to deploying an application in the cloud, many customers choose a region based on proximity to their clients. However, proximity to your end users is not the only criterion for choosing an AWS region.

Here are some of the key factors that you need to consider when choosing an AWS region for your workload:

  • Cost
  • Performance
  • Features
  • High Availability

Cost

Costs for AWS services vary by region, so choosing an expensive region could mean paying a lot more. Thankfully, AWS provides a pricing calculator that gives a rough estimate of your monthly costs based on the services you need.

Performance

Multi-region deployment is critical for providing low latency to end users. However, keep in mind that latency and data transfer speeds differ from region to region.

A dynamically populated table of inter-region latencies (in milliseconds) between all AWS regions is available at cloudping.co.
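
If you want a rough, first-hand latency check from your own location, a quick (and admittedly crude) sketch is to time TCP connections to a few regional EC2 endpoints with curl. The region list and the ec2.<region>.amazonaws.com endpoint pattern below are illustrative assumptions; substitute the regions you are actually considering.

# Rough TCP-connect latency from this machine to a few AWS regions (requires curl)
for region in us-east-1 us-west-2 eu-west-1 ap-south-1; do
  t=$(curl -s -o /dev/null -w '%{time_connect}' "https://ec2.${region}.amazonaws.com/")
  echo "${region}: ${t}s"
done

A single connection is not representative, so repeat the measurement several times and at different times of day before drawing conclusions.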

Features

Not all AWS services are available in every region, and new services often take longer to reach some regions. So it is important to make sure that the region you choose offers all the services your workload requires.

AWS publishes a region table listing the services offered in each AWS region; consult it to confirm availability before you commit to a region.
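
If you prefer checking from the command line, AWS also exposes region and service metadata as public Systems Manager (SSM) parameters. The sketch below assumes the AWS CLI is configured and that the /aws/service/global-infrastructure parameter path is available; treat it as illustrative rather than authoritative.

# List the services AWS advertises as available in a given region (us-east-1 here)
aws ssm get-parameters-by-path \
  --path /aws/service/global-infrastructure/regions/us-east-1/services \
  --query 'Parameters[].Name' --output text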

High Availability

Availability Zones are collections of data centers within a region. They are connected to each other with fast, private fiber-optic networking, which lets you architect applications that automatically fail over between Availability Zones without interruption. Availability Zones are therefore key to high availability and fault tolerance.

Different regions have different numbers of Availability Zones (AZs). Some regions have only two AZs, while others have three or more. If you choose a region with only two AZs and one of them becomes unavailable, your application is at risk because the single remaining AZ can easily become overloaded. If your application demands high availability, choose a region that has at least three Availability Zones.
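
A quick way to check how many AZs a region exposes to your account is the AWS CLI; the region name below is only an example.

# Count the Availability Zones visible to your account in a region
aws ec2 describe-availability-zones --region us-west-1 \
  --query 'AvailabilityZones[].ZoneName' --output text

# List all regions you can choose from
aws ec2 describe-regions --query 'Regions[].RegionName' --output text

Note that the set of AZs returned can vary by account, since AWS maps AZ names to physical zones on a per-account basis.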

It’s clear that there are multiple factors to consider when deciding on a suitable AWS region. Choosing your AWS region carefully can make your applications highly performant, highly available and cost-effective.


Will Big Data Analytics and AI take my job away?

It’s that time of year again, when many of us begin to make resolutions for the year ahead. While I was preparing my laundry list of New Year’s resolutions, a report released by McKinsey Global Institute (MGI) in Dec 2017 caught my attention: “Jobs lost, Jobs gained: Workforce transitions in a time of automation”.

So, what exactly will be the impact of big data analytics and AI? Is automation going to eliminate most of our jobs? What we learn from the report is that by 2030, as many as 375 million workers, roughly 14 percent of the global workforce, will need to change careers or learn new skills to survive in the labor market. Looking closer to home, in the United States alone, anywhere from 39 million to 73 million workers will be displaced by automation by 2030, meaning that up to 33 percent of the workforce may need to switch jobs.

Now you may be thinking that automation will only affect low-skilled, blue-collar jobs, but they are not the only victims of this new wave of automation technologies. Stanford professor Andrew Ng, an industry expert in machine learning, says, “AI can now diagnose pneumonia from chest X-rays better than radiologists.” Many white-collar jobs, including those of financial advisors and analysts, insurance agents, tax preparers, sports reporters, online marketers and anesthesiologists, are already in the process of being automated. If you come from a software testing background, you may have noticed that test automation has grown exponentially in recent years. Wall Street, the media and even software developers are highly exposed to automation and are coping with artificial intelligence (AI) and automation changing the nature of their work. Hardly a single sector is left untouched by AI. That is why Prof. Ng calls AI the new electricity.

I have been thinking about an instance where such a change occurred. I still remember the rhythmic clackity-clackity-clack from the typist’s desk when I used to visit my dad’s office as a child. After personal computers mushroomed, word processing software made it easy for anyone to edit and retype documents, and the jobs of typists, secretaries and bookkeepers all but disappeared.

Automation is an age-old phenomenon that has existed for more than a century. Workers shifting from farms to factories, cars replacing horses, robots automating factory floors and computers automating business processes are all well-known examples of automation reshaping the global workforce. With the advent of self-driving cars, robots processing Amazon orders, chatbots providing online customer support, AI-based fraud detection and more, we are in for another huge wave of automation driven by technologies like big data analytics, AI and robotics.

Do you need to panic now? Stay with me here as we learn more. To estimate the impact of the new automation technologies, the McKinsey Global Institute conducted case studies of the job creation and destruction patterns of two previous waves of automation: automobiles and personal computers. The research revealed that more jobs were created than were destroyed. The personal computer enabled the creation of 15.8 million net new jobs since 1980, accounting for 10 percent of employment. Jobs in typewriter manufacturing, typing, secretarial work and bookkeeping were displaced, but many new jobs were created, including jobs in computer manufacturing and semiconductors, jobs enabled by computers such as programmers and IT system administrators, and jobs that use computers, like customer service call centers. Similarly, Ford Motor Company’s assembly line automation resulted in a 10% increase in employment in 1915.

Takeaway Message:

Automation destroys some old jobs and creates more new jobs, but it comes at the cost of transitioning the workforce!
People who equipped themselves with new skills had a smooth career transition, whereas those who didn’t were pushed out of the job market.

Based on the lessons of history, the long-term impact of automation on employment remains positive, but the rate at which AI-driven automation disrupts workers will be faster than in the past. This is driven by the ability of machines to perform work that requires cognitive capabilities and to teach themselves to improve at tasks with little human intervention. Since AI automation has the potential to rapidly reallocate jobs, the authors of the McKinsey report believe there may be a massive transition on a scale not seen since the early 1900s, when workers shifted from farms to factories. The transition will be painful for people who are hesitant to learn new skills and adapt to new roles. The biggest challenge will be retraining millions of workers mid-career, says Susan Lund, co-author of the McKinsey report.

“The model where people go to school for the first 20 years of life and work for the next 40 or 50 years is broken.”

– Susan Lund, McKinsey Global Institute

When I reflected upon this model, I realized that it worked for my dad, but I myself had to retrain mid-career, and for my kids it will be an absolute necessity to learn and retrain throughout the course of their careers.

Remember the Luddites, the group of highly skilled English textile workers in the 19th century who destroyed weaving machinery in protest? Their movement ultimately failed. We cannot simply shoo away rapidly advancing automation technologies like big data analytics, AI and robotics. Only a few job areas are less susceptible to automation: jobs that involve genuine creativity, such as artists and scientists, and jobs that involve building complex relationships with people, such as nurses or business roles that require interacting with customers and other stakeholders. For the many of us whose careers lie in other areas, the best way to survive the automation wave is to roll up our sleeves and start investing some time in learning new skills.

As I finished reading the report, I felt fortunate to be in a position where I can provide such transitional training. I finalized my New Year’s resolution: to create awareness and train as many people as I can in the fields of big data and machine learning, and help them thrive in a rapidly advancing technological world. If you have been thinking about your next move or a career in Big Data or AI, ping me @SivagamiRamiah via LinkedIn or reach me through ByteQuest.Net. I will be happy to help.

Happy 2018!


10 Ways to Contribute to Open Source

From small start-ups to tech giants like Facebook, Google, IBM and LinkedIn, companies of all sizes are embracing Open Source Software (OSS). The software industry is marching towards open source! It’s highly beneficial for anyone to contribute to Open Source and be part of the bigger community.

Whether you are a novice developer, an advanced developer, or someone with no coding background at all, you can contribute to Open Source!

Whether you have only a little time or plenty of free time, you can contribute to Open Source with whatever time you are willing to spare!

If you are stuck finding the right project to work on, take a look at the list of ideas below:

    1. Choose an open source project that you actively use in your daily life. Examples: the Firefox browser, the Android operating system, Wikipedia. The main advantage is that you will already be familiar with the software, so your learning curve will be small and you can start contributing immediately!
    2. Choose an open source project that you use at work. Examples: Linux, Eclipse, Docker. In addition to a smaller learning curve, it can help you with your work life!
    3. Choose an open source framework that you are interested in learning. Examples: Apache Open Source Big Data Frameworks, TensorFlow. (Apache open source projects can be classified by category or by programming language or by the number of source code committers.) The main advantage is that you can familiarize yourself with the framework easily by contributing to it!
    4. Choose an open source project that you love or choose a project that’s trending. Take a look at GitHub ShowCases.
    5. Choose beginner friendly open source projects. You can browse the following URLs to find one: UpForGrabs, Great for New Contributors, OpenHatch

If you are wondering how to get started, here are some ways to contribute:

    1. Report a Bug: You can make a contribution by reporting a bug that you spot in an open source software project. Make sure it doesn’t already exist in the bug tracking system. This is a very important contribution to the project because if bugs are not reported, chances are they won’t get fixed!
    2. Diagnose a Bug: Many times bugs are reported poorly, resulting in wasted time and effort for the development team. It’s hard to fix bugs that cannot be reproduced, so it’s very important to clearly report the exact steps to reproduce a bug, the operating environment in which it’s found, the expected results and what actually happened. You can save the development team time and effort by adding details that help narrow down the cause of an existing bug.
    3. Test a Beta: When a beta or release candidate is published for a project, it needs to be tested on many different platforms. You can help the project leaders to make sure that the software works on your platform.
    4. Write Documentation: You could edit project wiki/write documentation or improve the existing documentation by adding examples on how to use the project. New users of the software will definitely appreciate good documentation!
    5. Answer a Question in Community Forum: If you like helping other people, you can answer questions about the project in its forum/mailing list or Q&A forums like Stack Overflow or Reddit. By doing this, you’ll be helping to build the community.
    6. Translate the Software Documentation: You can help grow the community by starting a translation of the project documentation into a language you have a good command of, or by joining an existing translation team.
    7. Write a Blog Post: If you are into blogging, you can write about some cool/useful features of the software. You’ll be helping with the branding by doing so. Also, you could write about any problem that you have encountered using the software and how you solved it. This would help other users who run into the same problem.
    8. Enhance the Project’s Web Site: The project community might lack design skills. If you are good at web/graphic design you could lend your hand by enhancing the project’s website.
    9. Fix a Bug: If you are interested in committing code, fixing a bug is a good start. Look for an open issue and work on it. Comment your fix where necessary, and add a test to the test suite that covers it. (A minimal contribution workflow is sketched after this list.)
    10. Suggest/Write a new feature: If you love coding and have a good idea for a new feature you can suggest it. You can also ask the project leaders if you can help them to write the new feature!
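
If you decide to go the code route (items 9 and 10), here is a minimal, illustrative GitHub-style workflow. The repository URL, branch name and issue number are placeholders, and individual projects may prescribe their own process in a CONTRIBUTING file.

# Clone your fork of the project (placeholder URL)
git clone https://github.com/<your-username>/<project>.git
cd <project>
# Work on a topic branch named after the (hypothetical) issue you picked up
git checkout -b fix-issue-123
# ...edit the code and add a test covering your fix...
git commit -am "Fix <short description> (refs #123)"
git push origin fix-issue-123
# Finally, open a pull request against the upstream repository from your fork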

Some companies even let their employees contribute to open source. Check out your company’s open source contribution policy.

You can educate yourself further by reading GitHub’s Open Source Guide, “How to Contribute to Open Source”.

So, what are you waiting for? Contribute to open source and be part of the community!


Demand for Data Engineers and Data Scientists Remains High

“Our research shows that employers are very invested in expanding head count in areas such as analytics and data science, product development, and sales as they strive to stay competitive in B2B and B2C markets.” – Matt Ferguson, CEO of CareerBuilder

With data collection and storage now accessible to businesses of all scales, and machine learning tools available through providers such as Amazon, IBM’s Watson and Google Cloud Platform (GCP), a business of any size can harness the power of Big Data. To do that, businesses of all sizes need data engineers and data scientists.

According to Forbes, the top five industries hiring Big Data expertise are Professional, Scientific and Technical Services; IT; Manufacturing; Finance and Insurance; and Retail. The chart below shows the distribution of advertised positions across these industries.

Top 5 Industries Hiring Big Data

Bigger Paychecks

It’s not just the huge demand for Big Data jobs but also the lucrative salaries these jobs offer. According to Indeed, one of the most popular job search engines, the average salary for a Big Data professional is about 114,000 USD per annum. This is about 98% higher than the average salary for all job postings nationwide.

The average annual salary for professionals with Big Data expertise in the San Francisco, CA area looks attractive.

Big Data Salaries in SFO

Top 3 Big Data Employment Markets in the U.S.

As stated by Forbes, here are the top three U.S. big data employment markets:

  • San Jose – Sunnyvale – Santa Clara, CA,
  • San Francisco – Oakland – Fremont, CA and
  • New York-Northern New Jersey-Long Island
Future Predictions of Big Data Industry

According to forecasts from IDC, a premier market research firm, worldwide revenues for big data and business analytics will grow from $130.1 billion in 2016 to more than $203 billion in 2020, a compound annual growth rate (CAGR) of 11.7%.

Big Gap in Big Data Skills

In spite of the huge demand for Big Data skills, there is a significant gap in the availability of those skills, resulting in a large number of unfilled jobs across the globe.

Big Data Skills Gap

With companies hunting for professionals with Big Data expertise, now is the right time to add some Big Data skills to your toolbox and land one of the highest-paying IT jobs. As you start thinking about equipping yourself to become a Big Data Engineer, plenty of questions might pop up in your mind:

  • How steep will the learning curve be?
  • What are the pre-requisites to learn Hadoop and other big data technologies?
  • With a continuously evolving and expanding stack of technologies, what is the minimum set of tools to learn to get started?
  • What’s the best way to get trained in these technologies?

It is true that there is a plethora of books on big data, plenty of MOOCs and online courses, blog posts on big data and innumerable discussion forums. This information overload overwhelms newcomers and makes it hard to know how and where to get started.

While many online big data courses offer the flexibility to learn at one’s own time and pace, they provide only a brief overview of the subject matter, with no hands-on, real-world project. In addition, these courses are designed around university-style quizzes and assignments and don’t add much practical value.

At ByteQuest, we believe that working on real-world problems is the best way to get trained in Big Data. Our Hadoop and Big Data training is geared towards gaining the practical knowledge and depth required in the job market, so our students can grasp big data technologies without toiling through long hours. Our courses are face-to-face and led by friendly, experienced instructors who have the expertise and passion to teach.


Run a MapReduce job in Pseudo-Distributed Mode


In the Building a MapReduce Maven Project with Eclipse post, we learned how to run a Hadoop MapReduce job in standalone mode. In this post, we will figure out how to run a MapReduce job in pseudo-distributed mode. Before delving into the details, let’s understand the various modes in which a Hadoop job can be run.

MODES OF HADOOP

We can run Hadoop jobs in three different modes: standalone, pseudo-distributed and fully distributed. Each mode has a well-defined purpose. Let’s take a look at the characteristics and usage of each one of them.

Standalone or LocalJobRunner:

  • In standalone mode, Hadoop runs in a single Java Virtual Machine (JVM) and uses the local file system instead of the Hadoop Distributed File System (HDFS).
  • There are no daemons running in this mode.
  • The jobs will be run with one mapper and one reducer.
  • Standalone mode is primarily used to test the code with a small input during development since it is easy to debug in this mode.
  • Standalone mode is faster than pseudo distributed mode.

Pseudo-Distributed or Single Node Cluster:

  • Pseudo distributed mode simulates the behavior of a cluster by running Hadoop daemons in different JVM instances on a single machine.
  • This mode uses HDFS instead of the local file system.
  • There can be multiple mappers and multiple reducers to run the jobs.
  • In pseudo distributed mode, a single node acts as the master and runs the Name Node, Data Node, Job Tracker and Task Tracker daemons. Since there is only one Data Node, the replication factor is one. (A quick way to verify which daemons are running is shown just after this list.)
  • Configuration is required in the following Hadoop configuration files: mapred-site.xml, core-site.xml, hdfs-site.xml.
  • Pseudo distributed mode is primarily used by developers to test their code in a simulated cluster environment.
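
A quick way to confirm that the daemons are actually running in pseudo distributed mode is the JDK’s jps tool, which lists running JVM processes. In the Cloudera QuickStart VM the daemons run under separate service accounts, so running it with sudo (as sketched below) is an assumption that may vary with your setup.

# List the Hadoop-related JVMs; expect entries such as NameNode, DataNode,
# ResourceManager, NodeManager and JobHistoryServer (names vary by distribution)
sudo jps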

Fully Distributed or Multi Node Cluster:

  • Fully distributed mode offers the true power of Hadoop: distributed computing capability, scalability, reliability and fault tolerance.
  • In fully distributed mode, the code runs on a real Hadoop cluster (production cluster) with several nodes.
  • All daemons are run in separate nodes.
  • Data will be distributed across several nodes.

Now that we are aware of the various modes in which a Hadoop job can be run, let’s roll up our sleeves and work through running a Hadoop MapReduce job in pseudo distributed mode.

RUNNING A MAPREDUCE JOB IN PSEUDO DISTRIBUTED MODE

Step 1: We need to have Oracle VirtualBox and the Cloudera QuickStart VM installed on our computer. If they are not installed yet, let’s get started with How to Install Hadoop on Windows with Cloudera VM.

Step 2: We have to launch VirtualBox and start the Cloudera QuickStart virtual machine.

Before we can build our project, we need a copy of the wordcount MapReduce project in our QuickStart VM. The git command line interface (CLI), which comes pre-installed in the Cloudera QuickStart VM, is the tool we will use for this purpose. Note: Git is the most widely used open source version control system in the developer community.

Step 3: Let’s open a terminal window and issue the below git clone command to get a copy of the source code and input data for our MapReduce WordCount program from the following GitHub URL: https://github.com/bytequest/bigdataoncloud

git clone https://github.com/bytequest/bigdataoncloud /home/cloudera/workspace/projects/bigdataoncloud

Step 4: Let’s verify that the wordcount project has been copied to our file system successfully. We can run the Unix ls command to make sure the project directory has been created.

ls -l /home/cloudera/workspace/projects/bigdataoncloud

Let’s take a look at the contents of the wordcount directory.

ls -l /home/cloudera/workspace/projects/bigdataoncloud/wordcount

Step 5: Now that we have the source code ready for our wordcount project, we can go ahead and a) import it as an existing Maven project in Eclipse, b) run Maven Clean, and c) run Maven Install. Note: If you need help with importing and compiling an existing Maven project, please see this post: Building a MapReduce program with Maven and Eclipse.

The Maven Install step will produce the jar file “wordcount-0.0.1.jar” that we need to run our wordcount program. It will be located in the target directory under the wordcount directory. Let’s run the following command in a terminal window to make sure the jar file exists.

ls -l /home/cloudera/workspace/projects/bigdataoncloud/wordcount/target

Step 6: Now we have the jar file ready to run our wordcount MapReduce program. To run it in pseudo distributed mode, we submit it to the Hadoop cluster and read the input from HDFS instead of issuing a Maven Build command in Eclipse.

Before running the wordcount program, we have to copy the input text file to HDFS. Let’s first create an input directory “dataset” in HDFS by running the hadoop fs -mkdir command.

hadoop fs -mkdir dataset

Now we can copy our input file to the HDFS dataset directory by running the hadoop fs -put command.

hadoop fs -put ~/workspace/projects/bigdataoncloud/wordcount/dataset/WorksOfShakespeare.txt dataset/
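
Before switching to Hue in the next step, we can also confirm the upload straight from the terminal:

# List the HDFS dataset directory; WorksOfShakespeare.txt should appear here
hadoop fs -ls dataset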

Step 7: We can use the Hadoop UI tool “Hue” in the Cloudera QuickStart VM to verify that the input text file has been copied to HDFS. First, let’s open a browser window and select the “Hue” tab.

After Hue loads, we have to click on the “File Browser” icon.

We should be able to see the input file “WorksOfShakespeare.txt” residing under the “dataset” directory.

Step 8: Now that we have copied the input file to HDFS, we can go ahead and run the wordcount MapReduce program on the cluster using the hadoop jar command.

The hadoop jar command runs a program contained in a JAR file. Here’s its syntax:

hadoop jar <jar file> [<main class>] [<arguments>]

We will issue the below command in a terminal window.

hadoop jar /home/cloudera/workspace/projects/bigdataoncloud/wordcount/target/wordcount-0.0.1.jar \
  net.bytequest.bigdataoncloud.wordcount.WordCountDriver \
  dataset/WorksOfShakespeare.txt output

In the command that we issued, we passed the name of the wordcount jar file (that we created using Eclipse) followed by three arguments:

first argument ==> Main java class of wordcount project: net.bytequest.bigdataoncloud.wordcount.WordCountDriver

second argument ==> input text file: dataset/WorksOfShakespeare.txt

third argument ==> output directory: “output”

The two screenshots below show the console output from executing the hadoop jar command.

Step 9: The MapReduce job ran successfully. Now, let’s verify that the “output” directory was created in HDFS by running the following command.

hadoop fs -ls output

A MapReduce program produces one output file per reducer. Since there was only one reducer for this job, we should see a single part-* file.

Step 10: Let’s browse the contents (first few lines) of the output file “part-r-00000” by piping the output of the hadoop fs -cat command to the head command.

hadoop fs -cat output/part-r-00000 | head -n 35

We should see the same output as when we ran the MapReduce job in standalone mode, as described in the post Building a MapReduce program with Maven and Eclipse.

Step 11: To find out the total number of lines in the part-r-00000 output file, we can pipe the output of the hadoop fs -cat command to the wc command.

hadoop fs -cat output/part-r-00000 | wc -l

As we can see, there are 67,779 lines in the part-r-00000 file.

Step 12: We could also copy the output from HDFS to the local file system. To do that, let’s first create an “output” directory under the wordcount directory in the local file system and then issue the hadoop fs -get command to copy the part-r-00000 file from HDFS to our local file system. Let’s run the following commands in a terminal window.

mkdir ~/workspace/projects/bigdataoncloud/wordcount/output

hadoop fs -get output/part-r-00000 ~/workspace/projects/bigdataoncloud/wordcount/output/

Step 13: To make sure we have copied the part-r-00000 file to our local file system, let’s run the following command.

ls -l ~/workspace/projects/bigdataoncloud/wordcount/output/

JOB HISTORY AND LOGS

Step 14: Now that we have successfully run our MapReduce job in pseudo distributed mode, let’s spend some time viewing the list of completed jobs, their job history, the logs of map tasks, the logs of reduce tasks and so on by pointing our browser to http://localhost:8088. Let’s follow the links circled in orange in each screenshot.
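
If you prefer the command line over the web UI, the YARN CLI offers a similar view. The application id below is a placeholder, and fetching logs this way assumes log aggregation is enabled in your VM.

# List finished YARN applications (our MapReduce job should appear here)
yarn application -list -appStates FINISHED

# Fetch the aggregated logs for a specific application id taken from the list above
yarn logs -applicationId <application_id>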

Hurray! We have successfully run a MapReduce job in pseudo-distributed mode and were able to view the job summary and job logs.

In this post, we learned about the various modes of Hadoop, how to run a MapReduce job in pseudo-distributed mode, and how to access the job summary and job logs. If you have any questions or comments about this blog post, or would like to share your experience of running a Hadoop job in pseudo-distributed mode, please feel free to post them in the comments section below.


How to make sense of Bytes measured in Binary and Decimal?

In order to work with big data, we need a good understanding of bytes and their ever-expanding multiples, the storage units used to measure data. To add to the confusion, byte multiples can be defined in two different ways: binary and decimal. Let’s get a good grasp of how bytes and their multiples work in the binary and decimal systems.

Computer memory, hard disk capacity and file sizes are all measured in bytes, the basic storage unit of digital information, comprising eight bits. A byte can hold a single character of text. Most of us have heard the terms used to describe the storage capacity of a computer: MegaBytes, GigaBytes, TeraBytes. But in today’s big data era, computers can store massive amounts of data, and the need to coin new terms for the amounts they can store keeps growing.


Let’s meet our new friends from the Byte family who have the power to measure Big Data:

  1. PetaBytes
  2. ExaBytes
  3. ZettaBytes
  4. YottaBytes
  5. BrontoBytes and
  6. GeopByte

All the above storage units are multiples of the basic unit, the byte. Computer data is measured using the binary number system, based on powers of two. The following table shows the relationship between a byte and each of these storage units in the binary system.

Storage Unit Value Bytes
Byte (B) 1 1
Kilobyte (KB) 2^10 1,024
Megabyte (MB) 2^20 1,048,576
Gigabyte (GB) 2^30 1,073,741,824
Terabyte (TB) 2^40 1,099,511,627,776
Petabyte (PB) 2^50 1,125,899,906,842,624
Exabyte (EB) 2^60 1,152,921,504,606,846,976
Zettabyte (ZB) 2^70 1,180,591,620,717,411,303,424
Yottabyte (YB) 2^80 1,208,925,819,614,629,174,706,176
Brontobyte (BB) 2^90 1,237,940,039,285,380,274,899,124,224
GeopByte (GpB) 2^100 1,267,650,600,228,229,401,496,703,205,376

However, hard drive manufacturers use the decimal number system, based on powers of ten, to define storage space. So 1 KB is defined as 1,000 bytes in hard disk storage. Here is the byte conversion table that hard drive manufacturers use:

Storage Unit Value Bytes
Byte (B) 1 1
Kilobyte (KB) 10^3 1,000
Megabyte (MB) 10^6 1,000,000
Gigabyte (GB) 10^9 1,000,000,000
Terabyte (TB) 10^12 1,000,000,000,000
Petabyte (PB) 10^15 1,000,000,000,000,000
Exabyte (EB) 10^18 1,000,000,000,000,000,000
Zettabyte (ZB) 10^21 1,000,000,000,000,000,000,000
Yottabyte (YB) 10^24 1,000,000,000,000,000,000,000,000
Brontobyte (BB) 10^27 1,000,000,000,000,000,000,000,000,000
GeopByte (GpB) 10^30 1,000,000,000,000,000,000,000,000,000,000

As we can see from the above two tables, 1 KB of data in a computer system = 1,024 bytes, whereas 1 KB of hard disk storage = 1,000 bytes. Therefore, each kilobyte of advertised hard disk space gives us only 1000/1024 ≈ 0.977 KB in binary terms, instead of a full 1 KB.

Note: To convert bytes to kilobytes, we divide the total bytes by 1024. To convert bytes to gigabytes, we divide the total bytes by 1024 to get kilobytes, divide the result by 1024 again to get megabytes, and finally divide by 1024 once more to get gigabytes.

For 1 TB of hard drive space we only get 1,000,000,000,000 / (1024 * 1024 * 1024) ≈ 931 GB. That is why a 1 TB hard drive shows only 931 GB of space: a difference of 69 GB per TB. As storage capacities get bigger and bigger, this gap widens quickly.
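
A quick sanity check of this arithmetic from any Linux shell (assuming the bc calculator is installed):

# 1 TB as sold (10^12 bytes) expressed in binary gigabytes
echo "1000000000000 / (1024 * 1024 * 1024)" | bc    # prints 931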

To avoid the confusion between decimal (metric) storage units and binary storage units, the International Electrotechnical Commission (IEC) proposed a separate naming scheme for binary-based units in the year 2000. In the binary prefix naming scheme, a new name is formed by replacing the second syllable of the old metric prefix with ‘bi’ (to indicate ‘binary’). So a ‘kilobyte’ becomes a ‘kibibyte’, a ‘megabyte’ becomes a ‘mebibyte’, and so on.

Notation Symbol Value Bytes
Kibibyte KiB 2^10 1,024
Mebibyte MiB 2^20 1,048,576
Gibibyte GiB 2^30 1,073,741,824
Tebibyte TiB 2^40 1,099,511,627,776
Pebibyte PiB 2^50 1,125,899,906,842,624
Exbibyte EiB 2^60 1,152,921,504,606,846,976
Zebibyte ZiB 2^70 1,180,591,620,717,411,303,424
Yobibyte YiB 2^80 1,208,925,819,614,629,174,706,176

However, IEC prefixes (KiB, MiB, GiB etc…) for binary multiples are not widely adopted.

As of this writing, the Windows operating system uses binary units with the old metric prefix names instead of the IEC binary prefixes (KiB, MiB, etc.). So, 1 KB = 1024 bytes in Windows.

Since Mac OS X Snow Leopard (version 10.6), Apple has used the standard metric units, so 1 kB (note the lowercase k) = 1000 bytes in macOS. Similarly, the Ubuntu Linux distribution switched to base-10 file size units with its 10.10 release.

So, we need to be aware of which number system (binary or decimal) an operating system uses to report file sizes.

Another thing to keep in mind is that each operating system has its own way of displaying file sizes; the default file size unit differs from one operating system to another.

The Unix ls -l command displays file sizes without any unit.

The file size reported by the ls -l command is in plain bytes. It’s difficult to comprehend the size of a large file like “WorksOfShakespeare.txt” in raw bytes, so we need the file size displayed in a human-readable format.

The Unix ls -lh command displays file sizes in a human-readable format, using binary units.

If we want file sizes displayed in the decimal (metric) system, we can use the GNU ls -l --si command.
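
Putting the three variants side by side on the WorksOfShakespeare.txt file from the earlier post (assuming the file is in the current directory; the sizes in the comments are approximate, and --si requires GNU coreutils):

ls -l WorksOfShakespeare.txt       # size in plain bytes, e.g. 5589889
ls -lh WorksOfShakespeare.txt      # human readable, binary units, e.g. 5.4M
ls -l --si WorksOfShakespeare.txt  # human readable, decimal units, e.g. 5.6M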

The default file size unit displayed by the Windows operating system is KB.

However, if we right-click on a file and look at its properties, Windows will display the file size in a human-readable format.

Suppose we would like to express 5,589,889 bytes in a human-readable format. For a rough estimate we can go with the metric conversion and divide 5,589,889 by (1000 * 1000), which gives us approximately 5.59 MB. However, if we want the binary value, which is what Windows reports, we divide 5,589,889 by (1024 * 1024), which gives us 5.33 MB (strictly speaking, 5.33 MiB).
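
The same conversion can be reproduced in a shell (again assuming bc is installed):

echo "scale=3; 5589889 / (1000 * 1000)" | bc    # decimal: 5.589 MB
echo "scale=3; 5589889 / (1024 * 1024)" | bc    # binary:  5.330 MiB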

References:

  1. International System of Units
  2. Kilobyte
  3. File size reporting in Ubuntu 10.10
  4. Files size units: “KiB” vs “KB” vs “kB”

If you have any questions or comments about this blog post, or would like to share your experience with file size units in various operating systems, please feel free to post them in the comments section below.

