Hello and welcome to another episode in the Godot basics tutorial series. In this episode we are going to take a quick look at the history of Unicode.
So before we can get into Unicode, we must first understand where Unicode came from and where long-distance communication started.
Now before the 1830s there was really no way of communicating long distance between people outside of sending letters. The problem of long-distance communication was solved, and we can thank Samuel Morse for this, because in the 1830s Samuel Morse, along with other amazing inventors, developed the telegraph. And even though the machine was quite complex in nature, the way it communicated between two different parties was quite simple.
It worked by transmitting signals over a wire between stations: one station to send the message and another station to receive it. Telegraphs only send electric current.
So what exactly do you do with that? Well, lucky for us, Samuel Morse, along with helping invent the telegraph machine, also came up with Morse code. In a sense, Morse code was a way you could encode messages. Now keep in mind you would have to press a transmission key on the telegraph to send your message. So the only way to send messages was either by pressing the key briefly, thus sending a dot, or by pressing the key longer, thus sending a dash. On the other end there is a device that receives the message and transcribes it by marking dots and dashes on paper. As you can see in this chart here, we have Morse code on the left: our English Latin characters, and to the right we have the dots and dashes.
Let's segue into Baudot code. Baudot code is an early character encoding that was invented around the 1870s. Now keep in mind that earlier systems, such as Morse code, sent characters by distinguishing signal lengths and short gaps; in the case of Morse code, that would be dots and dashes. Baudot code, however, sent characters together in a stream. That means each character code was exactly the same length and had the same number of elements.
In a sense, Baudot code recognized the value of sending data as a stream, and if we take a look here, everything is five bits long. You could think of this as the first successful digital code for the telegraph. And as a matter of fact, ASCII, which by the way became one of the most widely accepted codes for representing computer text, was inspired by and based on Baudot code.
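Here is a little Python sketch of what fixed-width encoding buys you: because every character is exactly five bits, a receiver can chop the stream back into characters with no gaps or timing needed. The code values below are illustrative only, not the real Baudot assignments:

```python
# Fixed-width encoding in the spirit of Baudot: every character is exactly
# 5 bits, so a stream can be split back into characters unambiguously.
# (These codes are illustrative, not the actual Baudot table.)
CODES = {"A": 0b00011, "B": 0b11001, "C": 0b01110}

def encode(text):
    """Concatenate each character's 5-bit code into one bit stream."""
    return "".join(format(CODES[ch], "05b") for ch in text)

def decode(bits):
    """Split the stream every 5 bits and look each code back up."""
    reverse = {v: k for k, v in CODES.items()}
    return "".join(
        reverse[int(bits[i:i + 5], 2)] for i in range(0, len(bits), 5)
    )

stream = encode("CAB")
print(stream)          # 15 bits: three 5-bit codes back to back
print(decode(stream))  # → CAB
```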
Now as the years continued, technology started to advance, and one of the technologies that advanced was the computer.
However, computers back then, around 1960 and before 1963, had a problem: there was really no standardized way of representing the English language in American computers. And so in 1963 the American Standard Code for Information Interchange, or ASCII, was developed.
Now ASCII includes a definition for 128 characters, 33 of which are non-printable control characters, and we went over that briefly in the last episode.
ASCII acts as both an encoding scheme and a set of code points, and on top of that, all 128 characters fit inside seven bits.
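You can check the seven-bit claim yourself in a couple of lines of Python, since every ASCII code point is below 128:

```python
# All 128 ASCII code points fit in seven bits: the highest is 127 (0b1111111).
for ch in "Hello, world!":
    assert ord(ch) < 128  # every ASCII character's code point fits in 7 bits

# The capital A is code point 65, which is 1000001 in seven bits.
print(ord("A"), format(ord("A"), "07b"))
```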
Now you may be wondering what exactly is a code point.
Well, the dictionary definition of a code point is a code position: any of the numerical values that make up a code space. An easier way of thinking about it is that a code point assigns a numerical value to a glyph. A glyph usually just means a character, but it could be anything really, any visual object.
In this case we have the uppercase A, we have two glyphs from different languages, and of course the infamous emoji. These are all examples of glyphs, and code points tie a numerical value to each glyph.
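In Python you can see this glyph-to-number mapping directly: ord() gives you a character's code point, and chr() goes the other way:

```python
# A code point assigns a number to a glyph, whatever the glyph looks like:
# a Latin letter, an accented character, or an emoji.
for glyph in ["A", "å", "😀"]:
    print(glyph, hex(ord(glyph)))

# chr() is the inverse mapping, from code point back to glyph.
print(chr(65))  # → A
```

The uppercase A sits at 0x41, while the emoji sits far outside ASCII's range.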
Now even though ASCII was the American standard (ASCII, again, stands for the American Standard Code for Information Interchange), the problem was that ASCII only worried about the English language. And so the problem that arose at the time was that other character sets came up to accommodate what ASCII left out.
Now you may already see the problem with that.
If a code point maps a glyph to a numerical value, and there are other incompatible code point sets out in existence, how are you able to support all the code points, all the languages? Well, that's where Unicode comes in. Unicode is a standard; it aimed to reduce the problems of incompatible binary text encodings by being the standard. Now since ASCII was the standard at the time for the English language, Unicode adopted ASCII's first 128 code points.
This allowed Unicode to be backwards compatible. That means if other schemes were to implement Unicode, there would be no issues in terms of binary text compatibility. Now one thing to keep in mind is that Unicode defines code points, not an encoding scheme.
Now ASCII was both a set of code points and an encoding scheme, whereas Unicode only cares about being a set of code points.
What this means is that Unicode does not dictate how an encoding scheme stores those values; it is just a standard for mapping glyphs to a numerical representation.
Now there are a few encoding schemes that support Unicode.
For example, we have UTF-8, UTF-16, and GB 18030, and they are built to follow the Unicode standard's code points. The one that is most commonly used is UTF-8. Now UTF stands for Unicode Transformation Format, and UTF-8 is a superset of ASCII. UTF-8 can use up to 4 bytes, thus covering the entire Unicode code space. In the previous episode we did a little math converting binary into decimal, so you can imagine the decimal values we are capable of having if we are able to use something that is 4 bytes long.
Now UTF-8 requires that each character be represented by at least one byte, and it is able to represent the first 128 characters of the Unicode set, which covers the English language and, again, basically copied and pasted the ASCII standard. Let's go ahead and take a quick look at Unicode. Now I grabbed this image straight from unicode.org.
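A quick Python sketch makes the variable-width idea concrete: ASCII characters take one byte in UTF-8, while characters with higher code points take two, three, or four bytes:

```python
# UTF-8 is variable-width: ASCII characters take one byte, while characters
# with higher code points take two, three, or four bytes.
for ch in ["A", "é", "€", "😀"]:
    print(ch, len(ch.encode("utf-8")), "byte(s)")
# A → 1, é → 2, € → 3, and the emoji → 4
```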
I'm going to leave links in the description down below on Unicode, along with the telegraph and ASCII as well. Now if you look down here you can see a glyph, in this case the asterisk, mapped to a numerical representation, in this case 002A. Now that is a little different than what we went over in the last episode. Why is that?
Well, Unicode is mapping a visual glyph to a hexadecimal value. Hexadecimal, also referred to as base 16, is another way of representing numbers: each digit position represents a power of 16, and each digit uses the values 0 through F. Hexadecimal is a good way of representing binary. Let me show you an example. In the first column we have decimal values, in the second column we have binary, and in the third column we have our hexadecimal. Now notice how each binary pattern matches a hexadecimal value. You can see here that when our hexadecimal value is 0, our binary digits are all 0, and when we have a hexadecimal value of F, notice how all our binary digits are on; basically, they're all ones. So as you can see, hexadecimal values are a great way of mapping to a specific binary pattern. I really want to drive home the point that each hexadecimal digit maps to a four-bit binary pattern. So keep that in mind, and we're going to go over a few examples to reinforce this.
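That table is easy to reproduce yourself in Python, which really drives home that every hex digit is exactly one four-bit pattern:

```python
# Each hexadecimal digit corresponds to exactly one four-bit binary pattern,
# which is why hex is such a convenient shorthand for binary.
for value in range(16):
    print(value, format(value, "x"), format(value, "04b"))
# e.g. decimal 0 → hex 0 → 0000, and decimal 15 → hex f → 1111
```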
One last thing before we move on to those examples: we will be dealing with UTF-8 encoding. Now UTF-8, as we mentioned before, supports up to 4 bytes, and 4 bytes means that a character can occupy 8 bits, 16 bits, 24 bits, or 32 bits.
However, there is a caveat: some bits are reserved, and we cannot use them. When dealing with one byte, one bit is reserved, meaning we have 7 bits that we are able to use; in this case the leading binary digit is reserved and set to zero. When dealing with two bytes, even though two bytes is sixteen bits, notice that we only have eleven bits available to be used.
And that's because five bits are reserved: in the first byte the leading three binary digits (110) are reserved, and in the second byte the leading two binary digits (10) are reserved. When dealing with three bytes, we only have 16 bits available to us. And notice again how certain binary digits are reserved; on top of that, they must be assigned the fixed values I'm showing you.
In this case the first byte has four bits reserved (1110), and the second and third bytes each have two bits reserved (10). Lastly, when dealing with four bytes, we only have twenty-one bits available to us, and that's because eleven bits are reserved: five in the first byte (11110) and two in each of the three continuation bytes (10). Let's look at a quick example.
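Those reserved prefix bits are easy to see from Python if you print each encoded byte in binary:

```python
# The reserved prefix bits of UTF-8 are visible when each encoded byte is
# printed in binary: a one-byte character starts with 0, while a four-byte
# sequence starts with 11110 followed by continuation bytes starting with 10.
def utf8_bits(ch):
    """Return the UTF-8 encoding of ch as a list of 8-bit binary strings."""
    return [format(b, "08b") for b in ch.encode("utf-8")]

print(utf8_bits("A"))   # one byte, leading bit 0
print(utf8_bits("😀"))  # four bytes: 11110..., then three 10... bytes
```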
And again we're dealing with UTF 8.
Now in this case we have the word "Hi" with an exclamation point.
And if we go to the Unicode website, you'll notice that the capital H has the hexadecimal value 48 as its code point, the lowercase i has the hexadecimal value 69, and the exclamation point has the hexadecimal value 21.
So let's go ahead and see how we would convert the hexadecimal values provided to us by unicode.org into binary, again binary under UTF-8. Now one thing to keep in mind is that all our characters are from the ASCII standard, basically the first 128 code points, which fit in seven bits. That means that we're going to use the one-byte format, where the leading binary digit is reserved with the value zero.
We cannot touch that digit, but lucky for us, because we're dealing with the English language,
we won't ever need to use it.
Now let's start with the letter H. Notice that the letter H is green, corresponding to a hexadecimal value in green.
In this case, let's grab the 4, because that's first.
So for 4, we look up the hexadecimal value and map it to its binary pattern, which is 0100, and that's what we get over here. We do the same thing for 8, and as you'll notice, 8 corresponds to 1000, or one thousand if we were just to read it in decimal form, and that's what appears here. Now if you were to check the UTF-8 binary for the capital H, you'll find that this is what is given to us. We can do the same thing for the lowercase i: grab the 6, which is this down here.
And when we move along, you'll notice that's exactly what we get. Go to the 9 and get that as well. And now we're dealing with the exclamation point, which is hexadecimal value 21. So if we grab the 2, you'll notice that's what appears here, and when we get the 1, that's what appears here. And this all checks out.
Notice how at the start of each byte the reserved zero is there, and everything after that we are able to use. Now that's pretty easy when dealing with the first 128 characters, which is just about anything on your keyboard.
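You can double-check this whole walkthrough in Python by encoding "Hi!" and printing the bytes in hex and in binary:

```python
# Encoding "Hi!" in UTF-8: every character is an ASCII code point, so each
# one becomes a single byte matching the hex values from unicode.org.
data = "Hi!".encode("utf-8")
print([hex(b) for b in data])            # → ['0x48', '0x69', '0x21']
print([format(b, "08b") for b in data])  # each byte's reserved leading bit is 0
```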
But what happens if we use something more complex? So let's go ahead and look at an emoji and try to convert that. Now first we need to find out how many bits we are working with, and to find that out it's really simple. Keep in mind that each hexadecimal digit is exactly four bits.
And so if we count how many hexadecimal digits are shown to us through the Unicode standard, we get one, two, three, four, five. Five times four is twenty, so we are under twenty-one bits.
So that means we need to use this specific four-byte format when assigning the hexadecimal values to binary.
Now keep in mind we are working with 20 bits. That means that there's going to be one bit left over unused, and that's where the zero-fill comes in: the format basically tells you that if you have anything left over, you just assign it the value 0, and that's what we'll do in this example. Now keep in mind that everything in blue is reserved; we're not going to touch that. We're only going to touch the x's, and we're going to fill them in based on what our hexadecimal value is.
So let's go ahead and take a look at that, starting with the first value, which is 0.
If we look at our table, 0 shows that the first four bits are zero, and so we replace our x's with those values.
We do the same thing for the second hexadecimal value, which is also zero.
And notice how, as we fill it in, when we reach a reserved portion of the binary format we skip over it and fill in the x's after it. Now we do the same thing with the hexadecimal value 6.
And as you can see, 6 is 0110, so that's what we fill in: 0110.
And basically we keep doing that for all the other values. That includes the F, which is 1111, and we do that for the hexadecimal value 1 as well; if we correlate from the table to the binary format underneath, you'll notice it's the same pattern. And notice how here we have the leftover bit, and because our format is telling us that everything left over must be zero, we're just going to put zero, and that's basically it. That's how you can grab hexadecimal values from unicode.org and translate them into the binary format that your computer will take in when reading the file.
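The same bit-packing can be done in a few lines of Python. I'm using U+1F600 (😀) here as a stand-in for the emoji on screen, so the exact bit values may differ from the video, but the procedure is identical:

```python
# Packing a 5-hex-digit code point into UTF-8's four-byte format, using
# U+1F600 (😀) as a stand-in for the emoji shown in the video.
cp = ord("😀")             # 0x1F600: five hex digits, fits in 21 bits
bits = format(cp, "021b")  # pad out to the format's 21 payload bits
payload = (bits[:3], bits[3:9], bits[9:15], bits[15:21])
encoded = [
    "11110" + payload[0],  # leading byte: 5 reserved bits
    "10" + payload[1],     # three continuation bytes: 2 reserved bits each
    "10" + payload[2],
    "10" + payload[3],
]
# Compare our hand-packed bytes against Python's own UTF-8 encoder.
actual = [format(b, "08b") for b in "😀".encode("utf-8")]
print(encoded)
print(encoded == actual)   # → True
```

The leftover high bit of the 21-bit payload is simply zero here, just like the zero-fill rule described above.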
And that's basically it.
This is the standardized format that all computers are able to read when encoding through UTF-8 using the code points provided to us by Unicode.
I want to thank the following amazing people who helped me, who basically guided me to a better understanding of the topic of ASCII in the comments section of the last episode, and I hope I'm pronouncing their YouTube names correctly: Any Key Zyban, Nick FERPA, and Great Collapsing Crunk. In the last episode they pointed out my mistakes in the comments section, and this episode is dedicated to not only correcting those mistakes but better explaining Unicode. I went over a little bit of history, so please feel free to leave any corrections down below.
Not only on telegraphs, but on other items I discussed in this episode as well. I love learning, and I love knowing when I've made a mistake, so please don't feel like you are attacking me
if you want to correct me, whether you have experience or you're just starting out. Check out the links down below for more information. Anyway, thank you so much for joining me, thank you for clicking the like button, and thank you for clicking the subscribe button. If you have any questions or comments, please feel free to leave them in the comments section down below, and I look forward to seeing you in the next episode.
Have an amazing day.