Applied Programming/Strings
Overview
String
String Functions
String functions are used in computer programming languages to manipulate a string or query information about a string (some do both).
Most programming languages that have a string datatype will have some string functions, although there may be other low-level ways within each language to handle strings directly. In object-oriented languages, string functions are often implemented as properties and methods of string objects. In functional and list-based languages, a string is represented as a list (of character codes), so all list-manipulation procedures could be considered string functions. However, such languages may implement a subset of explicit string-specific functions as well.
For functions that manipulate strings, modern object-oriented languages such as C# and Java have immutable strings and return a copy (in newly allocated dynamic memory), while others, such as C, manipulate the original string unless the programmer copies the data to a new string. See, for example, Concatenation below.
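The copy-returning behaviour can be illustrated with a minimal Python sketch. Python is not one of the languages named above; it is used here only because its `str` type is also immutable, and the variable names are hypothetical.

```python
# Python strings are immutable: string methods return a new string
# and leave the original untouched.
original = "hello"
shouted = original.upper()   # a new string object

print(original)  # hello
print(shouted)   # HELLO

# Attempting to modify a string in place is rejected with a TypeError.
try:
    original[0] = "J"
except TypeError as error:
    print("cannot modify in place:", error)
```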
The most basic example of a string function is the length(string) function, which returns the number of characters in a string.
- e.g. length("hello world") would return 11.
Other languages may have string functions with similar or identical syntax, parameters, or results. For example, in many languages the length function is written len(string).
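Python is one language that spells the length function `len`. A short sketch (the variable name `text` is only illustrative):

```python
text = "hello world"

# len counts every character, including the space.
print(len(text))  # 11

# The empty string has length zero.
print(len(""))    # 0
```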
String datatypes
Literal strings
Non-text strings
String processing algorithms
Run-length encoding
Run-length encoding (RLE) is a form of lossless data compression in which runs of data (sequences in which the same data value occurs in many consecutive data elements) are stored as a single data value and count, rather than as the original run. This is most useful on data that contains many such runs. Consider, for example, simple graphic images such as icons, line drawings, Conway's Game of Life, and animations. It is not useful with files that don't have many runs as it could greatly increase the file size.
RLE may also refer to an early graphics file format supported by CompuServe for compressing black-and-white images, which was widely supplanted by their later Graphics Interchange Format (GIF). RLE also refers to a little-used image format in Windows 3.x, with the extension rle, which is a run-length-encoded bitmap used to compress the Windows 3.x startup screen.
Example
For example, consider a screen containing plain black text on a solid white background. There will be many long runs of white pixels in the blank space, and many short runs of black pixels within the text. A hypothetical scan line, with B representing a black pixel and W representing white, might read as follows:
WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW
With a run-length encoding (RLE) data compression algorithm applied to the above hypothetical scan line, it can be rendered as follows:
12W1B12W3B24W1B14W
This can be interpreted as a sequence of twelve Ws, one B, twelve Ws, three Bs, etc.
The run-length code represents the original 67 characters in only 18. While the actual format used for the storage of images is generally binary rather than ASCII characters like this, the principle remains the same. Even binary data files can be compressed with this method; file format specifications often dictate repeated bytes in files as padding space. However, newer compression methods such as DEFLATE often use LZ77-based algorithms, a generalization of run-length encoding that can take advantage of runs of strings of characters (such as BWWBWWBWWBWW).
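The count-then-symbol scheme used for the scan line can be sketched in Python as follows. This is one possible implementation, not a reference one: the function names `rle_encode` and `rle_decode` are hypothetical, and the decoder assumes that the symbols themselves are never digits.

```python
from itertools import groupby


def rle_encode(data):
    """Encode each run as <count><symbol>, e.g. 'WWWB' -> '3W1B'."""
    return "".join(f"{len(list(run))}{symbol}" for symbol, run in groupby(data))


def rle_decode(encoded):
    """Invert rle_encode; assumes symbols are not digits."""
    output = []
    count = ""
    for ch in encoded:
        if ch.isdigit():
            count += ch                      # accumulate multi-digit counts
        else:
            output.append(ch * int(count))   # expand the run
            count = ""
    return "".join(output)


# The hypothetical scan line from the example, built from its run lengths.
scan_line = "W" * 12 + "B" + "W" * 12 + "B" * 3 + "W" * 24 + "B" + "W" * 14
encoded = rle_encode(scan_line)

print(encoded)                          # 12W1B12W3B24W1B14W
print(len(scan_line), len(encoded))     # 67 18
print(rle_decode(encoded) == scan_line) # True
print(rle_encode("ABCD"))               # 1A1B1C1D
```

The last line shows the drawback of this scheme: input without runs doubles in size.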
Run-length encoding can be expressed in multiple ways to accommodate data properties as well as additional compression algorithms. For instance, one popular method encodes run lengths for runs of two or more characters only, using an "escape" symbol to identify runs, or using the character itself as the escape, so that any time a character appears twice it denotes a run. On the previous example, this would give the following:
WW12BWW12BB3WW24BWW14
This would be interpreted as a run of twelve Ws, a B, a run of twelve Ws, a run of three Bs, etc. In data where runs are less frequent, this can significantly improve the compression rate.
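A minimal sketch of the variant that uses the doubled character as the escape, assuming (as in the example) that the symbols are not digits. The function name `rle_encode_doubled` is hypothetical.

```python
from itertools import groupby


def rle_encode_doubled(data):
    """Emit single symbols as-is; emit runs of two or more as
    symbol, symbol, count (the doubled symbol marks the run)."""
    pieces = []
    for symbol, run in groupby(data):
        length = len(list(run))
        if length == 1:
            pieces.append(symbol)
        else:
            pieces.append(f"{symbol}{symbol}{length}")
    return "".join(pieces)


scan_line = "W" * 12 + "B" + "W" * 12 + "B" * 3 + "W" * 24 + "B" + "W" * 14

print(rle_encode_doubled(scan_line))  # WW12BWW12BB3WW24BWW14
print(rle_encode_doubled("ABCD"))     # ABCD
```

Input without runs, such as "ABCD", passes through unchanged instead of growing, which is where this variant helps.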
One other matter is the application of additional compression algorithms. Even with the runs extracted, the frequencies of different characters may be large, allowing for further compression; however, if the run lengths are written in the file in the locations where the runs occurred, the presence of these numbers interrupts the normal flow and makes it harder to compress. To overcome this, some run-length encoders separate the data and escape symbols from the run lengths, so that the two can be handled independently. For the example data, this would result in two outputs, the string "WWBWWBBWWBWW" and the numbers (12,12,3,24,14).
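Separating the symbols from the run lengths could look like the sketch below. The function name `rle_split` and its return shape (a string plus a list of integers) are assumptions made for illustration.

```python
from itertools import groupby


def rle_split(data):
    """Return (symbols, counts): runs of two or more contribute a doubled
    symbol to `symbols` and their length to `counts`; single symbols
    contribute only themselves."""
    symbols = []
    counts = []
    for symbol, run in groupby(data):
        length = len(list(run))
        if length == 1:
            symbols.append(symbol)
        else:
            symbols.append(symbol * 2)
            counts.append(length)
    return "".join(symbols), counts


scan_line = "W" * 12 + "B" + "W" * 12 + "B" * 3 + "W" * 24 + "B" + "W" * 14
symbols, counts = rle_split(scan_line)

print(symbols)  # WWBWWBBWWBWW
print(counts)   # [12, 12, 3, 24, 14]
```

The two outputs can then be passed to separate follow-up compressors.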
Escape sequence
What is it?
Programmers refer to the backslash (\) character as an escape character: it has a special meaning when used inside a string. As the name suggests, the escape character lets the character that follows it escape its usual interpretation. That is to say, a backslash signifies that the next character after it has a different meaning.[1]
Examples in Python
Single quotes
string = 'That\'s my bag.'
print(string)
Output:
That's my bag.
This example used (\') to print the single-quote in the string.
Double quotes
string = "\"Python\""
print(string)
Output:
"Python"
This example used (\") to include double quotes inside a double-quoted string; the backslashes themselves are not printed.
Newline character
string = 'applied \nprogramming'
print(string)
Output:
applied
programming
The newline character (\n) moves the text that follows it onto a new line.
Backslash
string = 'applied\\ programming'
print(string)
Output:
applied\ programming
This example prints a single backslash.
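Tab
string = 'applied\tprogramming'
print(string)
Output:
applied programming
This example used "\t" to insert a horizontal tab between the words. With tab stops every eight columns, the tab after the seven-letter word "applied" advances by only one column, so here it looks like a single space.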
Backspace
string = 'applied \bprogramming'
print(string)
Output:
appliedprogramming
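This example used "\b" (backspace). When the string is printed to a terminal, the cursor moves back one position, so the space is overwritten and no longer appears between the words.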
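The string itself still contains both the space and the backspace character; only the terminal display hides them. A short check using `repr` and `len`, which are standard Python built-ins:

```python
string = 'applied \bprogramming'

# repr shows the backspace as the escape \x08 instead of sending it
# to the terminal.
print(repr(string))    # 'applied \x08programming'

# 7 letters + space + backspace + 11 letters
print(len(string))     # 20
print(" " in string)   # True
```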
Hexadecimal value
string = "\x50\x59\x54\x48\x4f\x4E"
print(string)
Output:
PYTHON
This example used \xhh, where hh is a two-digit hexadecimal character code, to spell out the string.
string = "Nancy said \x22Hello World!\x22 to the crowd."
print(string)
Output:
Nancy said "Hello World!" to the crowd.
This example uses "\x" to indicate the following two characters are hexadecimal digits, "22" being the ASCII value for a double-quote in hexadecimal.
Octal value
string = "\120\131\124\110\117\116"
print(string)
Output:
PYTHON
This example used \ooo, where ooo is an octal character code of up to three digits, to spell out the same string.
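As a quick check of the examples above, the sketch below confirms that each escape sequence produces exactly one character, and that the hexadecimal and octal forms name the same character codes. It uses only the built-ins `len`, `chr`, `ord` and `hex`.

```python
# Each escape sequence becomes a single character in the string.
print(len("\n"), len("\t"), len("\\"), len("\x50"), len("\120"))  # 1 1 1 1 1

# Hexadecimal 50 and octal 120 are both decimal 80, the code for "P".
print("\x50" == "\120" == chr(80) == "P")  # True

# The double quote has character code 0x22.
print(hex(ord('"')))  # 0x22

# The hexadecimal and octal spellings produce the same string.
print("\x50\x59\x54\x48\x4f\x4E" == "\120\131\124\110\117\116")  # True
```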
Activities
Key terms
Concatenation - The joining of two character strings. Also referred to as ‘concat’.
Control Characters - Used to perform actions rather than to display a printable character on screen. Easily understood examples include 'Escape', 'Backspace' and 'Delete'. [4]
Escape Sequence - A combination of characters that has a meaning other than the literal characters contained therein.[5]
Fixed Length String - A string with a pre-determined, static length.
Iteration - The repetition of a process in order to generate an outcome.[6]
Prefix - A string A = a1, a2, … an has a prefix a1, a2, … am when m ≤ n. A proper prefix of the string A is not equal to A itself (0 ≤ m < n). [7]
Run-Length Encoding (RLE) - A form of data compression in which a stream of data is given as the input (e.g. "AAABBCCCC") and the output is a sequence of counts of consecutive data values in a row (e.g. "3A2B4C").[8]
String - An array of characters typically surrounded by quotation marks.
String Literal - A type of literal in programming for the representation of a string value within the source code of a computer program.[9]
Substring - A string that is a prefix of a suffix of an original string or, equivalently, a suffix of a prefix.[7]
Suffix - Any substring of an original string that includes the original string’s last letter, including the string itself. A proper suffix of a string is not equal to the original string.[7]
Variable Length String - A string where the length can vary and is often determined by user input.
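Several of the key terms above map directly onto Python string operations. The sketch below uses the standard `str.startswith` and `str.endswith` methods and the `in` and `+` operators; the example word is arbitrary.

```python
original = "programming"

# Concatenation: joining two strings with +.
print("pro" + "gramming" == original)   # True

# Prefix and suffix tests.
print(original.startswith("pro"))       # True
print(original.endswith("ming"))        # True

# Substring test.
print("gram" in original)               # True

# A proper prefix is a prefix that differs from the whole string.
prefix = "program"
print(original.startswith(prefix) and prefix != original)  # True
```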