To work with big data, we need a good understanding of bytes and their ever-expanding multiples, the storage units that measure data. To add to the confusion, byte multiples can be defined in two different ways: binary and decimal. Let’s get a good grasp of how bytes and their multiples work in the binary and decimal systems.
Computer memory, hard disk capacity and file sizes are measured in bytes. A byte is the basic storage unit of digital information and comprises eight bits. A single byte can hold one character of plain (ASCII) text. Most of us have heard the terms used to describe a computer’s storage capacity: megabytes, gigabytes, terabytes. But in today’s big data era, computers store massive amounts of data, and the need for new terms to describe those amounts keeps growing.
Let’s meet our new friends from the Byte family who have the power to measure Big Data:
- BrontoBytes and
All of the above storage units are multiples of the byte. Computer data is measured using the binary number system, based on powers of two: each unit is 1024 (2^10) times the previous one, so 1 KB = 1024 bytes, 1 MB = 1024 KB, and so on.
However, hard drive manufacturers define storage space using the decimal number system, based on powers of ten. So, in hard disk storage, 1 KB is defined as 1000 bytes, 1 MB as 1000 KB, and so on.
To summarize, 1 KB of data in a computer system = 1024 bytes, whereas 1 KB of hard disk storage = 1000 bytes. Measured in binary units, a decimal kilobyte is therefore only 1000/1024 ≈ 0.98 KB.
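As a rough illustration of the two definitions, the Python sketch below lists each byte multiple under both systems. The function name byte_multiples and the prefix list are illustrative choices made for this example, not part of any library.

```python
# Illustrative sketch: the same prefix under the decimal and binary definitions.
# PREFIXES and byte_multiples are hypothetical names chosen for this example.
PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]

def byte_multiples():
    """Return (name, decimal_bytes, binary_bytes) for each prefix."""
    rows = []
    for n, prefix in enumerate(PREFIXES, start=1):
        # Decimal: powers of 1000 (10^3). Binary: powers of 1024 (2^10).
        rows.append((prefix + "byte", 1000 ** n, 1024 ** n))
    return rows

for name, decimal_bytes, binary_bytes in byte_multiples():
    print(f"1 {name:<10} decimal: {decimal_bytes:>33,}   binary: {binary_bytes:>33,}")
```

Python integers have arbitrary precision, so even the largest values print exactly.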
Note: To convert bytes to kilobytes in the binary system, divide the number of bytes by 1024. To convert bytes to gigabytes, divide by 1024 to get kilobytes, divide the result by 1024 again to get megabytes, and divide once more by 1024 to get gigabytes.
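The repeated division described in the note can be sketched in a few lines of Python. The function name bytes_to_unit and its parameters are assumptions made for this example.

```python
# Minimal sketch of the repeated-division conversion.
# bytes_to_unit is a hypothetical helper, not a standard library function.
def bytes_to_unit(num_bytes, steps, base=1024):
    """Divide by `base` once per step: 1 step -> KB, 2 -> MB, 3 -> GB."""
    value = float(num_bytes)
    for _ in range(steps):
        value /= base
    return value

print(bytes_to_unit(3_221_225_472, 3))             # bytes -> binary GB
print(bytes_to_unit(3_000_000_000, 3, base=1000))  # bytes -> decimal GB
```

Passing base=1000 gives the decimal conversion that hard drive manufacturers use.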
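A 1 TB hard drive holds 1,000,000,000,000 bytes, which is 1,000,000,000,000/(1024*1024*1024) ≈ 931 GB in binary units. That’s why a 1 TB hard drive shows only 931 GB of space. No bytes are missing: the displayed number is 69 lower than the advertised 1000 GB only because each binary GB is larger than a decimal GB. This gap widens as the units get bigger: it is about 2.3% at the kilobyte level, 6.9% at the gigabyte level and 9.1% at the terabyte level.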
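To check this arithmetic, here is a small Python sketch. The names advertised_bytes and shown_fraction are made up for this example.

```python
# Sketch: how a decimal-rated drive appears when reported in binary units.
advertised_bytes = 10 ** 12                  # "1 TB" as defined by the manufacturer
shown_gb = advertised_bytes / 1024 ** 3      # the same bytes in binary GB
print(f"{shown_gb:.2f} GB")

def shown_fraction(n):
    """Ratio of the n-th decimal multiple to the n-th binary multiple."""
    return 1000 ** n / 1024 ** n

# The shortfall grows with each step up: KB, MB, GB, TB.
for n, unit in enumerate(["KB", "MB", "GB", "TB"], start=1):
    print(f"{unit}: {(1 - shown_fraction(n)) * 100:.1f}% smaller")
```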
To avoid confusion between decimal (metric) and binary storage units, the International Electrotechnical Commission (IEC) introduced a separate naming scheme for the binary units in 1998. In the binary prefix naming scheme, each new name is formed by replacing the second syllable of the metric prefix with ‘bi’ (for ‘binary’). So, the binary ‘kilobyte’ becomes a ‘kibibyte’, the binary ‘megabyte’ becomes a ‘mebibyte’, and so on.
|1 kibibyte||1 KiB||2^10||1,024|
|1 mebibyte||1 MiB||2^20||1,048,576|
|1 gibibyte||1 GiB||2^30||1,073,741,824|
|1 tebibyte||1 TiB||2^40||1,099,511,627,776|
|1 pebibyte||1 PiB||2^50||1,125,899,906,842,624|
|1 exbibyte||1 EiB||2^60||1,152,921,504,606,846,976|
|1 zebibyte||1 ZiB||2^70||1,180,591,620,717,411,303,424|
|1 yobibyte||1 YiB||2^80||1,208,925,819,614,629,174,706,176|
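As an illustration of how the IEC names map onto a byte count, the Python sketch below picks the largest binary unit that fits and labels the result with its IEC symbol. The helper format_iec and the two-decimal output format are assumptions made for this example.

```python
# Sketch: label a byte count with the matching IEC binary prefix.
# format_iec is a hypothetical helper written for this post.
IEC_UNITS = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB", "ZiB", "YiB"]

def format_iec(num_bytes):
    """Divide by 1024 until the value drops below 1024, then attach the unit."""
    value = float(num_bytes)
    for unit in IEC_UNITS:
        if value < 1024 or unit == IEC_UNITS[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024

print(format_iec(1_048_576))   # exactly one mebibyte
print(format_iec(10 ** 12))    # a "1 TB" drive expressed in GiB
```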
However, the IEC prefixes (KiB, MiB, GiB, etc.) for binary multiples are not widely adopted.
As of this writing, the Windows operating system uses binary units with the old metric prefix names instead of the IEC binary prefixes (KiB, MiB, etc.). So, 1 KB = 1024 bytes in Windows.
Starting with Mac OS X 10.6 Snow Leopard, Apple switched from binary units to standard metric units. So, 1 kB (note the lowercase k) = 1000 bytes in Mac OS. Similarly, the Linux distribution Ubuntu switched to base-10 file size units with its 10.10 release.
So, we need to be aware of the number system used (binary or decimal) to measure file sizes in an operating system.
Another thing to keep in mind is that each operating system has its own way of displaying file sizes. The default file size unit differs from one operating system to another.
The Unix ls -l command displays file sizes without any unit.
The sizes shown by ls -l are plain byte counts. It’s difficult to comprehend the size of a large file like “WorksOfShakespeare.txt” in bytes, so we need the file size displayed in a human-readable format.
The Unix ls -lh command displays file sizes in a human-readable format, using binary multiples (powers of 1024).
If we want the file size displayed in the decimal (metric) system, we can use the ls -l --si command (available in GNU ls).
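For comparison, the same three views of a file size (raw bytes, binary multiples, decimal multiples) can be approximated in Python. This is only a sketch: human_size is a hypothetical helper, the temporary file exists just so the example runs anywhere, and the rounding and unit letters will not match the output of ls exactly.

```python
import os
import tempfile

def human_size(num_bytes, base=1024):
    """Scale a byte count by `base`: 1024 for binary units, 1000 for decimal."""
    units = ["B", "K", "M", "G", "T", "P"]
    value = float(num_bytes)
    for unit in units:
        if value < base or unit == units[-1]:
            return f"{value:.1f}{unit}"
        value /= base

# Create a throwaway 2,500,000-byte file to measure.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 2_500_000)
    path = tmp.name

size = os.path.getsize(path)         # raw byte count
print(size)
print(human_size(size))              # binary multiples (powers of 1024)
print(human_size(size, base=1000))   # decimal multiples (powers of 1000)
os.remove(path)
```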
By default, the Windows operating system displays file sizes in KB.
However, if we right-click on a file and look at its properties, Windows displays the file size in a human-readable format.
Suppose we would like to roughly estimate 5,589,889 bytes in a human-readable format. We could simply use the metric conversion and divide 5,589,889/(1000 * 1000), which gives approximately 5.6 MB in decimal units. However, if we would like to know the value that a binary-based system such as Windows reports, we have to do the binary conversion and divide 5,589,889/(1024 * 1024). This gives 5.33 MB (strictly speaking, 5.33 MiB).
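The two conversions for this example can be checked with a couple of lines of Python. The variable names are chosen for illustration only.

```python
# Quick check of the worked example: one byte count, two conventions.
size = 5_589_889
decimal_mb = size / (1000 * 1000)   # metric megabytes
binary_mb = size / (1024 * 1024)    # binary megabytes (mebibytes)

print(f"decimal: {decimal_mb:.2f} MB")
print(f"binary:  {binary_mb:.2f} MiB")
```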