Standards
ASCII
- 48-57 - 0-9
- 65-90 - A-Z
- 97-122 - a-z
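A quick way to check these ranges is to rebuild them from their decimal codes. This is a minimal sketch; the variable names are illustrative only.

```python
import string

# Rebuild each range from its decimal code points.
digits = "".join(chr(code) for code in range(48, 58))      # 48-57
uppercase = "".join(chr(code) for code in range(65, 91))   # 65-90
lowercase = "".join(chr(code) for code in range(97, 123))  # 97-122

# Compare against the constants shipped with the standard library.
print(digits == string.digits)
print(uppercase == string.ascii_uppercase)
print(lowercase == string.ascii_lowercase)
```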
base64
In computer science, Base64 is a group of binary-to-text encoding schemes that represent binary data in an ASCII string format by translating it into a radix-64 representation. The term Base64 originates from a specific MIME content transfer encoding. Each Base64 digit represents exactly 6 bits of data. Three 8-bit bytes (i.e., a total of 24 bits) can therefore be represented by four 6-bit Base64 digits.
Like all binary-to-text encoding schemes, Base64 is designed to carry data stored in binary formats across channels that only reliably support text content. Base64 is particularly prevalent on the World Wide Web, where its uses include embedding image files or other binary assets inside textual assets such as HTML and CSS files.
The difference between Base64 and hex is really just how bytes are represented. Hex is another way of saying "Base16". Hex takes two characters for each byte, while Base64 takes 4 characters for every 3 bytes, so it's more efficient than hex. Assuming the encoded text is stored at one byte per character (as in ASCII or UTF-8), a 100K file takes 200K to encode in hex, or about 133K in Base64.
Base64 has a cost. It makes data about 33% larger, and every encode and decode adds processing work, so it is one of those little things that make software slower. Use it only when it's actually necessary.
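The size figures above can be checked with the standard library. This is a small sketch; the 100 KiB payload is arbitrary test data, not taken from any real file.

```python
import base64
import binascii

# 102,400 bytes (100 KiB) of arbitrary binary data.
payload = bytes(range(256)) * 400

hex_text = binascii.hexlify(payload)   # 2 characters per byte
b64_text = base64.b64encode(payload)   # 4 characters per 3 bytes, padded with '='

print(len(payload), len(hex_text), len(b64_text))
print(round(len(b64_text) / len(payload), 3))  # overhead ratio, about 1.333
```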
What is base64 Encoding and Why is it Necessary?
Example
bootstrap.servers=kafka.confluent.svc.cluster.local:9071
security.protocol=PLAINTEXT
vs
bootstrap.servers=kafka.confluent.svc.cluster.local:9071 security.protocol=PLAINTEXT
The two inputs above differ in a single character (a newline vs a space between the two properties), so their Base64 encodings also differ in a single character: K vs g.
Ym9vdHN0cmFwLnNlcnZlcnM9a2Fma2EuY29uZmx1ZW50LnN2Yy5jbHVzdGVyLmxvY2FsOjkwNzEKc2VjdXJpdHkucHJvdG9jb2w9UExBSU5URVhU
Ym9vdHN0cmFwLnNlcnZlcnM9a2Fma2EuY29uZmx1ZW50LnN2Yy5jbHVzdGVyLmxvY2FsOjkwNzEgc2VjdXJpdHkucHJvdG9jb2w9UExBSU5URVhU
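The comparison can be reproduced with a few lines of Python. This sketch assumes the first input uses a single line feed (`\n`) as its separator; the variable names are illustrative.

```python
import base64

prefix = b"bootstrap.servers=kafka.confluent.svc.cluster.local:9071"
suffix = b"security.protocol=PLAINTEXT"

with_newline = base64.b64encode(prefix + b"\n" + suffix).decode("ascii")
with_space = base64.b64encode(prefix + b" " + suffix).decode("ascii")

# Collect every position where the two encodings disagree.
diff = [
    (index, a, b)
    for index, (a, b) in enumerate(zip(with_newline, with_space))
    if a != b
]
print(diff)
```

The separator is the last byte of its 3-byte group, so only the fourth Base64 character of that group changes.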
How Base64 works
Base64 encoding converts binary data into a text string using 64 printable ASCII characters (A-Z, a-z, 0-9, +, /), with '=' used for padding. It takes three 8-bit bytes (24 bits) at a time and splits them into four 6-bit values, each mapped to one of the 64 characters. This makes binary data safe for text-based systems such as email, at the cost of increasing its size by about 33%. In URLs, where + and / have special meaning, a URL-safe variant of the alphabet that replaces them with - and _ is commonly used.
How Base64 Encoding Works Step-by-Step
- Input Data (Binary): Start with any binary data (text, images, etc.), which is essentially a stream of 8-bit bytes (e.g., 'H' is 01001000).
- Group into 3 Bytes: Take the binary data in groups of three 8-bit bytes (24 bits total).
- Split into 6-bit Chunks: Divide these 24 bits into four 6-bit chunks (4 x 6 = 24 bits).
- Map to Base64 Characters: Each 6-bit chunk (representing a value from 0 to 63) is mapped to a character in the Base64 index table (A-Z, a-z, 0-9, +, /).
- Padding: If the original data isn't a perfect multiple of 3 bytes, padding is added:
  - If one byte is left, two '=' are added.
  - If two bytes are left, one '=' is added.
- Output (Text String): The result is a string of readable ASCII characters, which can be safely sent over text-based channels.
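The steps above can be sketched as a small hand-written encoder. This is for illustration only: `encode_base64` and `ALPHABET` are hypothetical names, and real code should call `base64.b64encode` from the standard library.

```python
import base64

# Standard Base64 index table: values 0-63 map to these characters.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def encode_base64(data: bytes) -> str:
    """Illustrative encoder following the steps above (not for production use)."""
    out = []
    for start in range(0, len(data), 3):
        chunk = data[start:start + 3]
        missing = 3 - len(chunk)  # 0, 1, or 2 bytes short of a full group
        # Pack the group into a 24-bit integer, zero-filling any missing bytes.
        bits = int.from_bytes(chunk + b"\x00" * missing, "big")
        # Split the 24 bits into four 6-bit values, most significant first.
        sextets = [(bits >> shift) & 0x3F for shift in (18, 12, 6, 0)]
        chars = [ALPHABET[value] for value in sextets]
        # Characters that carry only filler bits are replaced with '='.
        if missing:
            chars[-missing:] = ["="] * missing
        out.extend(chars)
    return "".join(out)


print(encode_base64(b"H"))    # one byte left over: two '=' characters
print(encode_base64(b"Hi"))   # two bytes left over: one '=' character
print(encode_base64(b"Hi!"))  # full group: no padding
```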
How base64 encoding works - YouTube
Base64 vs UTF-8/UTF-16
UTF-8 and UTF-16 are methods to encode Unicode strings to byte sequences.
Base64 is a method to encode a byte sequence to a string.
Base64 is a way to encode binary data, while UTF8 and UTF16 are ways to encode Unicode text.
Things to keep in mind:
- Not every byte sequence represents a Unicode string encoded in UTF-8 or UTF-16
- Not every Unicode string represents a byte sequence encoded in Base64
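Both points can be demonstrated with the standard library. This is a minimal sketch; the sample bytes and sample string are arbitrary choices for illustration.

```python
import base64
import binascii

raw = bytes([0xFF, 0xFE, 0x00])  # arbitrary bytes that are not valid UTF-8

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Base64 can turn any byte sequence into ASCII text and back.
text = base64.b64encode(raw).decode("ascii")
round_trip = base64.b64decode(text)

# The reverse does not hold: an arbitrary string is usually not valid Base64.
try:
    base64.b64decode("not base64!", validate=True)
    b64_ok = True
except (binascii.Error, ValueError):
    b64_ok = False

print(utf8_ok, text, round_trip == raw, b64_ok)
```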
Unicode
A character is a minimal unit of text that has semantic value.
A character set is a collection of characters that might be used by multiple languages. For example, the Latin character set is used by English and most European languages, whereas the Greek character set is used only by the Greek language.
A coded character set is a character set where each character is assigned a unique number.
A code point is a value that can be used in a coded character set. In Java, a code point is held in a 32-bit int, where the lower 21 bits represent a valid code point value and the upper 11 bits are 0. A code point identifies a character and is represented by one or more code units, depending on the encoding.
Intro
Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems.
Unicode 10.0 (2017) contains a repertoire of 136,755 characters covering 139 modern and historic scripts, as well as multiple symbol sets. Later versions have added more.
A code unit is the fixed-size group of bits that an encoding uses as its building block: UTF-8 uses 8-bit code units and UTF-16 uses 16-bit code units. In Java, a code unit is a 16-bit char value. For example, imagine a String that contains the letters "abc" followed by the Deseret LONG I, which is represented with two char values. That string contains four characters, four code points, but five code units.
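The "abc" plus Deseret LONG I example can be counted directly. This sketch uses Python, where `len()` on a `str` counts code points, so code units have to be counted on the encoded bytes.

```python
# "abc" followed by U+10400, the Deseret LONG I.
text = "abc\U00010400"

code_points = len(text)                            # str is indexed by code point
utf16_units = len(text.encode("utf-16-le")) // 2   # 16-bit code units
utf8_units = len(text.encode("utf-8"))             # 8-bit code units

print(code_points, utf16_units, utf8_units)
```

The same string is four code points, five UTF-16 code units, and seven UTF-8 code units.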
To express a character in Unicode, the hexadecimal value is prefixed with the string U+. The valid code point range for the Unicode standard is U+0000 to U+10FFFF, inclusive. The code point value for the Latin character A is U+0041. The character € which represents the Euro currency, has the code point value U+20AC. The first letter in the Deseret alphabet, the LONG I, has the code point value U+10400.
The following table shows code point values for several characters:
| Character | Unicode Code Point | Glyph |
|---|---|---|
| Latin A | U+0041 | A |
| Latin sharp S | U+00DF | ß |
| Han for East | U+6771 | 東 |
| Deseret, LONG I | U+10400 | 𐐀 |
Characters in the range U+10000 to U+10FFFF are called supplementary characters. The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP).
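In UTF-16, a BMP character fits in one 16-bit code unit, while a supplementary character needs two (a surrogate pair). The sketch below shows the standard arithmetic; `to_utf16_units` is a hypothetical helper name, and it does not reject the reserved surrogate range U+D800 to U+DFFF.

```python
def to_utf16_units(code_point: int) -> list[int]:
    """Return the UTF-16 code units for one code point (illustrative helper)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code point range")
    if code_point <= 0xFFFF:
        return [code_point]  # BMP: a single code unit
    # Supplementary: subtract 0x10000, then split the remaining 20 bits in half.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return [high, low]


for cp in (0x0041, 0x20AC, 0x10400):
    print(f"U+{cp:04X}", [f"{unit:04X}" for unit in to_utf16_units(cp)])
```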
Control Characters
A control character or non-printing character (NPC) is a code point (a number) in a character set that does not represent a written symbol. They are used as in-band signaling to cause effects other than the addition of a symbol to the text. All other characters are mainly printing, printable, or graphic characters, except perhaps for the "space" character (see ASCII printable characters).
The control characters in ASCII still in common use include:
- 0 (null, NUL, \0, ^@), originally intended to be an ignored character, but now used by many programming languages including C to mark the end of a string.
- 7 (bell, BEL, \a, ^G), which may cause the device to emit a warning such as a bell or beep sound or the screen flashing.
- 8 (backspace, BS, \b, ^H), may overprint the previous character.
- 9 (horizontal tab, HT, \t, ^I), moves the printing position right to the next tab stop.
- 10 (line feed, LF, \n, ^J), moves the print head down one line, or to the left edge and down. Used as the end of line marker in most UNIX systems and variants.
- 11 (vertical tab, VT, \v, ^K), vertical tabulation.
- 12 (form feed, FF, \f, ^L), to cause a printer to eject paper to the top of the next page, or a video terminal to clear the screen.
- 13 (carriage return, CR, \r, ^M), moves the printing position to the start of the line, allowing overprinting. Used as the end of line marker in Classic Mac OS, OS-9, FLEX (and variants). A CR+LF pair is used by CP/M-80 and its derivatives including DOS and Windows, and by Application Layer protocols such as FTP, SMTP, and HTTP.
- 26 (Control-Z, SUB, EOF, ^Z). Acts as an end-of-file for Windows text-mode file I/O.
- 27 (escape, ESC, \e (GCC only), ^[). Introduces an escape sequence.
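The codes in the list above can be checked against Python's escape sequences. This is a small sketch: Python has no `\e` escape, so SUB and ESC are written as `\x1a` and `\x1b`, and `caret` is a hypothetical helper that derives the caret notation by flipping bit 0x40.

```python
# Abbreviation -> character, using Python escape sequences where they exist.
CONTROL_CHARS = {
    "NUL": "\0", "BEL": "\a", "BS": "\b", "HT": "\t", "LF": "\n",
    "VT": "\v", "FF": "\f", "CR": "\r", "SUB": "\x1a", "ESC": "\x1b",
}


def caret(ch: str) -> str:
    """Caret notation for an ASCII control character, e.g. '\\n' -> '^J'."""
    return "^" + chr(ord(ch) ^ 0x40)


for name, ch in CONTROL_CHARS.items():
    print(f"{ord(ch):>2} {name:<3} {caret(ch)}")
```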
It does not make sense to have a string without knowing what encoding it uses.
There's No Such Thing As Plain Text • Dylan Beattie • YOW! 2023 - YouTube
ISO/IEC 5218
- 0 = Not known;
- 1 = Male;
- 2 = Female;
- 9 = Not applicable.
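In code, these values map naturally onto an integer enumeration. This is a minimal sketch; the class name `Sex` and the member names are illustrative and not part of the standard.

```python
from enum import IntEnum


class Sex(IntEnum):
    """ISO/IEC 5218 codes (class and member names are illustrative)."""
    NOT_KNOWN = 0
    MALE = 1
    FEMALE = 2
    NOT_APPLICABLE = 9


print(Sex(9).name)      # look up a stored code
print(int(Sex.FEMALE))  # value to persist
```

Values outside the four defined codes, such as `Sex(3)`, raise `ValueError`.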
Licenses
- GPL: GNU General Public License
  - The GNU General Public License (GPL) is a widely used free software license that guarantees users the freedom to run, study, share, and modify the software. It is a "copyleft" license, meaning that any derivative works must also be released under the GPL, ensuring that the software and its modifications remain free. It's popular for projects that want to ensure their software remains open and free for all users.
- CDDL: Common Development and Distribution License
  - The Common Development and Distribution License (CDDL) is a free and open-source software license developed by Sun Microsystems. It is a "weak copyleft" license, meaning that modifications to files covered by the CDDL must also be released under the CDDL, but new files added to the project can be under a different license. It's often used for projects that want to allow for easier integration with proprietary software while still maintaining open-source principles for the core code.
- Apache License (APL)
  - The Apache License 2.0 is a popular, permissive open-source license that allows free use, modification, and distribution of software for any purpose, including commercial use. Its main requirements are attribution, preservation of notices, and a statement of changes made to files. It also provides explicit patent grants for contributors' work. It's popular because it balances open-source freedoms with clear legal terms, encouraging use in both open and proprietary projects.
- MIT License
  - The MIT License is one of the most permissive free software licenses. It allows users to do anything they want with the software, including using, copying, modifying, merging, publishing, distributing, sublicensing, and/or selling copies, provided that the original copyright notice and permission notice are included. It's popular for its simplicity and minimal restrictions, making it highly compatible with other licenses and suitable for both open-source and proprietary projects.
- BSD Licenses (e.g., 3-Clause BSD)
  - The BSD Licenses are a family of permissive free software licenses that place minimal restrictions on the use and distribution of software. They typically require only that the copyright notice and disclaimer of warranty be retained. There are several variants, such as the 2-Clause (FreeBSD) and 3-Clause (New BSD) licenses. They are known for their flexibility, allowing software to be easily incorporated into proprietary products.
- MPL: Mozilla Public License
  - The Mozilla Public License (MPL) is a "weak copyleft" license that aims to strike a balance between the strong copyleft of the GPL and the permissive nature of licenses like MIT or BSD. It requires modifications to files covered by the MPL to also be released under the MPL, but it allows for linking with proprietary code without requiring the entire combined work to be open source. It's often used for projects where a mix of open and proprietary components is desired.
| License | Category | Copyleft Strength | Key Requirements | Commercial Use |
|---|---|---|---|---|
| GPL (GNU General Public License) | Copyleft | Strong | Derivative works must also be licensed under GPL. Guarantees user freedoms (run, study, share, modify). | Yes, but derivative works must remain open source under GPL. |
| CDDL (Common Development and Distribution License) | Weak Copyleft | Weak | Modifications to files covered by CDDL must remain under CDDL. New files can be under different licenses. | Yes, allows easier integration with proprietary software for new files. |
| Apache License 2.0 (APL) | Permissive | None | Attribution (retain copyright and patent notices). State changes made to files. Explicit patent grants from contributors. | Yes, with minimal restrictions. |
| MIT License | Permissive | None | Include original copyright and permission notice. | Yes, with minimal restrictions. |
| BSD Licenses (e.g., 3-Clause BSD) | Permissive | None | Retain copyright notice and disclaimer of warranty. | Yes, with minimal restrictions. |
| MPL (Mozilla Public License) | Weak Copyleft | Weak | Modifications to files covered by MPL must remain under MPL. Can be combined with proprietary code. | Yes, allows for mixed open/proprietary projects. |
https://choosealicense.com/appendix
Common Programming Casing Styles
- camelCase: First word lowercase, others capitalized.
  - Example: firstName, myVariable
- PascalCase (Upper Camel Case): First letter of every word capitalized.
  - Example: MyClass, Person
- snake_case: Words separated by underscores.
  - Example: first_name, user_id
- SCREAMING_SNAKE_CASE (Macro Case): Uppercase words separated by underscores.
  - Example: MAX_COUNT, API_KEY
- kebab-case (Dash Case): Words separated by hyphens.
  - Example: my-class, background-color
- Train-Case (HTTP Header Case): Hyphen-separated with capitalized words.
  - Example: Content-Type
- flatcase: All lowercase, no separators.
  - Example: username, body
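Converting between these styles comes down to splitting a name into words and joining them with a different rule. The sketch below is a simplified illustration: all function names are hypothetical, and the splitter does not handle acronyms (for example, it keeps "HTTPServer" as one word).

```python
import re


def split_words(name: str) -> list[str]:
    """Split an identifier into lowercase words (simplified, no acronym handling)."""
    # Mark lower-to-upper boundaries (camelCase, PascalCase) with a space.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    # Then split on spaces, underscores, and hyphens.
    return [word.lower() for word in re.split(r"[\s_\-]+", spaced) if word]


def to_camel(name: str) -> str:
    words = split_words(name)
    if not words:
        return ""
    return words[0] + "".join(word.capitalize() for word in words[1:])


def to_pascal(name: str) -> str:
    return "".join(word.capitalize() for word in split_words(name))


def to_snake(name: str) -> str:
    return "_".join(split_words(name))


def to_screaming_snake(name: str) -> str:
    return "_".join(word.upper() for word in split_words(name))


def to_kebab(name: str) -> str:
    return "-".join(split_words(name))


def to_train(name: str) -> str:
    return "-".join(word.capitalize() for word in split_words(name))


def to_flat(name: str) -> str:
    return "".join(split_words(name))


print(to_camel("first_name"), to_pascal("my-class"), to_snake("firstName"))
print(to_screaming_snake("maxCount"), to_kebab("BackgroundColor"))
print(to_train("content_type"), to_flat("user_name"))
```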