Introduction to Unicode
The American Standard Code for Information Interchange, better known by its acronym ASCII, is a standardized character encoding. ASCII defines numeric codes for various characters, with the numeric values running from 0 to 127.
Unicode has many benefits in programming: every character has the same code value no matter what the platform, what the language, or what the program.
UTF-8 is one of the most commonly used encodings. UTF stands for “Unicode Transformation Format”, and the ‘8’ means that 8-bit numbers are used in the encoding. (There’s also a UTF-16 encoding, but it’s less frequently used than UTF-8.) UTF-8 uses the following rules:
- If the code point is <128, it’s represented by the corresponding byte value.
- If the code point is between 128 and 0x7ff, it’s turned into two byte values between 128 and 255.
- Code points >0x7ff are turned into three- or four-byte sequences, where each byte of the sequence is between 128 and 255.
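The rules above can be sketched as a small hand-written encoder. This is only an illustration: `encode_utf8` is a hypothetical helper name, the bit masks come from the standard UTF-8 byte layout rather than from the text above, and the sketch does not reject surrogate code points the way Python's built-in encoder does.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one code point by hand (illustrative; use str.encode in real code)."""
    if code_point < 0x80:
        # Below 128: the byte value equals the code point.
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 128 to 0x7ff: two bytes, both in the range 128-255.
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # Above 0x7ff, up to 0xffff: three bytes.
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # Above 0xffff, up to 0x10ffff: four bytes.
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# "A", e with acute accent, euro sign, and an emoji: 1, 2, 3, and 4 bytes.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    encoded = encode_utf8(ord(ch))
    # The hand-written result should match Python's built-in encoder.
    assert encoded == ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", len(encoded), "byte(s)")
```

In practice `str.encode("utf-8")` does all of this; the manual version is only meant to make the byte ranges in the rules visible.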
UTF-8 has several convenient properties:
- It can handle any Unicode code point.
- A Unicode string is turned into a string of bytes that contains zero bytes only where they represent the null character (U+0000). This means UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that can’t handle zero bytes.
- UTF-8 is byte-oriented, so it has no byte-ordering issues.
- A string of ASCII text is also valid UTF-8 text.
- UTF-8 is fairly compact; most commonly used characters are turned into one or two bytes, and values less than 128 occupy only a single byte.
- If bytes are corrupted or lost, it’s possible to determine the start of the next UTF-8-encoded code point and resynchronize. It’s also unlikely that random 8-bit data will look like valid UTF-8.
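A few of these properties can be checked directly in Python. The sketch below is illustrative only: `next_code_point_start` is a hypothetical helper, and it relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx.

```python
def next_code_point_start(data: bytes, pos: int = 0) -> int:
    """Return the index of the first byte at or after pos that is not a continuation byte."""
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


# ASCII text produces exactly the same bytes in ASCII and in UTF-8.
assert "hello".encode("ascii") == "hello".encode("utf-8")

# "naive cafe" written with two accented letters, each encoded as two bytes.
data = "na\u00efve caf\u00e9".encode("utf-8")
# No zero bytes appear, because the text contains no U+0000 character.
assert 0 not in data

# Simulate lost bytes: drop the first three, so the data starts mid-character.
damaged = data[3:]
start = next_code_point_start(damaged)
recovered = damaged[start:].decode("utf-8")
assert recovered == "ve caf\u00e9"

# Arbitrary 8-bit data is usually rejected: the byte 0xFF never occurs in UTF-8.
try:
    bytes([0xFF, 0xFE]).decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")
```

Skipping continuation bytes is enough to find the next character boundary, which is why a decoder can recover after damaged input instead of misreading the rest of the stream.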
Unicode Properties
The Unicode specification includes a database of information about code points.
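import unicodedata
# Build a string from five code points, given in decimal and hexadecimal.
u = chr(233) + chr(0x0bf2) + chr(3972) + chr(6000) + chr(13231)
# Print each character's index, code point in hex, category, and name.
for i, c in enumerate(u):
    print(i, '%04x' % ord(c), unicodedata.category(c), end=" ")
    print(unicodedata.name(c))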
%04x is a format specifier used with the % operator. It specifies that the number to be formatted (in this case, ord(c)) should be formatted as hexadecimal with at least four digits, zero-padded on the left. So if c is "A", whose Unicode code point is 65, it will be printed as 0041.
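As a quick check of the padding behaviour, the sketch below formats the code point of "A" in three ways. The `format()` call and the f-string are not used in the example above; they are shown here only because they accept the same `04x` specification.

```python
code_point = ord("A")  # 65

old_style = "%04x" % code_point          # printf-style, as in the example above
with_format = format(code_point, "04x")  # built-in format() with the same spec
with_fstring = f"{code_point:04x}"       # f-string with the same spec

print(old_style, with_format, with_fstring)  # 0041 0041 0041
```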
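unicodedata.category(chr) returns the general category assigned to the given character as a two-letter string. The first letter gives the major class and the second the subclass: for example, ‘Lu’ means letter, uppercase, and ‘Ll’ means letter, lowercase.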
import unicodedata
print(unicodedata.category('A'))  # Lu: Letter, uppercase
print(unicodedata.category('b'))  # Ll: Letter, lowercase
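To see categories for more than one character at a time, here is a minimal sketch that tallies the general category of every character in a string. `category_counts` is a hypothetical helper name, and the sample string is arbitrary.

```python
import unicodedata
from collections import Counter


def category_counts(text: str) -> Counter:
    """Count how many characters of text fall into each general category."""
    return Counter(unicodedata.category(ch) for ch in text)


counts = category_counts("Hello, World 42!")
# Print the categories in alphabetical order with their counts.
for category, count in sorted(counts.items()):
    print(category, count)
```

Besides Lu and Ll, this sample also produces Nd (decimal digit), Po (other punctuation), and Zs (space separator).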
- Unicode provides a unique number (a code point) for every character; for example, the uppercase letter 'A' is assigned the code value 65.
- Unicode covers uppercase and lowercase letters, emoji, and many other characters, each with its own code value.
- Unicode text can be encoded with 8-bit, 16-bit, or 32-bit code units (UTF-8, UTF-16, or UTF-32).
- Let's see an example using ASCII values:
- ASCII codes 65 to 90:
for i in range(65,91):
    print(chr(i),end=" ")
#OUTPUT A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
- Similarly, we can print the lowercase letters by using ASCII code values 97 to 122.
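The lowercase case can be sketched the same way. Note that `range()` excludes its upper bound, so the stop value must be 123 to include 'z', whose code value is 122. The comparison against `string.ascii_lowercase` is only there as a sanity check.

```python
import string

# chr() maps a code value to its character; ord() goes the other way.
lowercase = " ".join(chr(i) for i in range(97, 123))
print(lowercase)

# Sanity checks: the range really covers 'a' through 'z'.
assert lowercase == " ".join(string.ascii_lowercase)
assert ord("a") == 97 and ord("z") == 122
```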