I always say that a programming language is best understood through an analogy with natural language and mathematics. Consider English, which has an alphabet, a dictionary and a grammar for forming sentences. Similarly, Python has a character set, reserved words and a grammar that together define the language. The interpreter ensures that these rules are followed in a correct program. Much of what follows is drawn from the Python documentation, with added commentary and examples.
Python's character set includes all ASCII characters, i.e. a-z, A-Z, 0-9 and all the symbols found on a 104-key PC keyboard. Python 3 is also Unicode ready, i.e. source code may contain any Unicode character.
Python has the following keywords:
Table 2.1. Keywords of Python
False | await | else | import | pass |
None | break | except | in | raise |
True | class | is | finally | return |
and | continue | for | lambda | try |
as | def | from | nonlocal | while |
assert | del | global | not | with |
async | elif | if | or | yield |
A keyword is a reserved word with a specific meaning. Using a keyword for anything other than its intended purpose, for example as a variable name, causes a syntax error and the Python interpreter will not run the program.
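The reserved status of keywords can be checked from Python itself. The following is a small sketch using the standard `keyword` module; the helper `assigns_cleanly` is a hypothetical name introduced only for this illustration.

```python
import keyword

# keyword.iskeyword reports whether a word is reserved.
print(keyword.iskeyword("while"))   # True
print(keyword.iskeyword("print"))   # False, print is an ordinary identifier


def assigns_cleanly(name):
    """Return True if `name = 5` compiles, False if it is a syntax error."""
    try:
        compile(f"{name} = 5", "<example>", "exec")
    except SyntaxError:
        return False
    return True


print(assigns_cleanly("total"))  # True
print(assigns_cleanly("class"))  # False, class is a keyword
```

The full list for the running interpreter is available as `keyword.kwlist`; its exact contents can vary slightly between Python versions.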
If you have not programmed before then you may wonder what a program is. A program is a sequence of instructions which is executed by the computer. A program may have some input and it definitely has an output. The input may come from a keyboard, file, socket, database and so on, and the output can go to a terminal, printer, file, network, database and so on. A program without an output is meaningless. Consider the following Python program:
    #!/usr/bin/env python3
    a = 5
    b = 5
    print(a + b)
You can type these lines (except the first one) in Python's REPL (read-eval-print loop, also known as the Python shell) or save them in a file. Suppose you saved the file as first_program.py; you can then run it with python first_program.py
As you can see, this program has two variables, a and b, each with the same value 5. The output of the program is 10, which is the sum of a and b, printed on the terminal or STDOUT. We will learn about STDOUT later.
When you run python first_program.py you are invoking the Python interpreter on the program. The lexical analyzer of the interpreter first reads the source and generates tokens, and the parser then works on this stream of tokens. For example, a = 5 is split into three tokens: a, = and 5. We are going to learn how the Python interpreter breaks a program down into lexical tokens, which is the first step in understanding how a Python program works.
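You can watch the lexical analyzer at work with the standard `tokenize` module. The sketch below is only an illustration; `token_strings` is a hypothetical helper name, not something from the interpreter.

```python
import io
import tokenize


def token_strings(source):
    """Return (token type name, token text) pairs for a piece of source code."""
    readline = io.StringIO(source).readline
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(readline)
    ]


tokens = token_strings("a = 5\n")
for name, text in tokens:
    print(name, repr(text))
```

The first three pairs are the name `a`, the operator `=` and the number `5`; they are followed by the NEWLINE token that ends the logical line and an end marker.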
#define in C/C++ whitespace becomes important. The lines in a Python program can be divided into logical lines and physical lines.
The end of a logical line is represented by the token NEWLINE. A line ending is entered by pressing the Enter key on the keyboard. On Windows it is stored as \r\n and on Unix/GNU-Linux systems it is simply \n. Note that in a text editor or terminal these characters are invisible; you just see the line break. If you perform a hexdump you will see the encoded values. \r is called carriage return, a name from the age of typewriters, and \n is called linefeed. They are represented by CR and LF respectively in the ASCII character set.
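The values of these two characters are easy to inspect. The snippet below is a minimal sketch; the variable names and sample text are made up for the illustration.

```python
CR = "\r"  # carriage return
LF = "\n"  # linefeed

print(ord(CR), ord(LF))  # 13 10

windows_text = "a = 5\r\nb = 5\r\n"
unix_text = "a = 5\nb = 5\n"

# A poor man's hexdump: 0d is CR and 0a is LF.
print(windows_text.encode("ascii").hex(" "))

# splitlines() understands both conventions, so the lines are the same.
print(windows_text.splitlines() == unix_text.splitlines())  # True
```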
Python statements cannot cross logical line boundaries, which is one reason whitespace (the Enter key, tab and spacebar all produce whitespace) is significant. The exception is where NEWLINE is allowed by the syntax, for example between the statements of a compound statement. We will learn about these later. A logical line is formed from one or more physical lines by following the explicit or implicit line joining rules.
A physical line is what you see as a single line in an editor: a sequence of characters terminated by an end-of-line sequence. Often a logical line and a physical line are the same thing.
Comments are information usually written to explain some complex behavior. They are important for everyone who reads the code later on, including the original author of the program. As you gain experience as a programmer you will learn to appreciate the value of comments. They are ignored by the Python interpreter as far as the execution of a program's logic is concerned, i.e. they have no bearing on a program's output.
A comment in Python begins with a # (read as hash) that is not part of a string literal (you will soon see what I mean by this) and runs to the end of the physical line. The # may be the first non-whitespace character on a line or may follow code on the same line. You will also see multiple lines disabled by wrapping them in ''' or """. Strictly speaking these are string literals rather than comments, and the same quotes are used to form multiline docstrings. More on docstrings will come later when we see how to document a program. The help function, called on a Python object, prints these docstrings. A comment marks the end of the logical line unless the implicit line joining rules come into the picture.
Note that the line #!/usr/bin/env python3 is not an ordinary comment. It is known as the shebang line in the Unix/GNU-Linux world and must be the first line of the source file. Python itself skips it like any other comment, but the operating system reads it to find the interpreter. Similarly, encoding declarations (described later) are not ordinary comments even though they start with #. The use of the shebang line is that you can give execute permission to the file and then invoke it as if it were a binary executable. The Python interpreter in the environment will be picked up automatically to execute the program.
Sometimes a line of logic can be so long that it does not fit on the screen. In old times terminals were 80x24, and 80 characters has become a standard line length. However, since Python relies on indentation, which is usually 4 spaces per level, this limit is often extended beyond 80 characters. Also, if you are familiar with LaTeX, Leslie Lamport says in his book that there should not be more than 75 characters in a line. It is the same reason newspapers use multiple columns and I have used a fixed width for text rather than full width.
This creates the need for a mechanism to join several physical lines into one logical line. This is done with the backslash (\) character. The following example is from the Python docs:
    if 1900 < year < 2100 and 1 <= month <= 12 \
       and 1 <= day <= 31 and 0 <= hour < 24 \
       and 0 <= minute < 60 and 0 <= second < 60:   # Looks like a valid date
            return 1
Note that this \ cannot be part of a comment or string literal. A line ending in a backslash cannot carry a comment, and a backslash does not continue a comment. For example:
    if 1900 < year < 2100 and 1 <= month <= 12 \ # this is wrong
    # this does not continue the comment \
    wrong continuation
As such, no text can come after the line-continuation character, i.e. after the backslash. Even whitespace is not allowed after \, which means it must be the last character on the line.
Expressions in parentheses (), square brackets [] and curly braces {} can be split over more than one physical line without using backslashes. For example:
    month_names = ['Januari', 'Februari', 'Maart',      # These are the
                   'April',   'Mei',      'Juni',       # Dutch names
                   'Juli',    'Augustus', 'September',  # for the months
                   'Oktober', 'November', 'December']   # of the year
These lines can carry comments. The indentation of the continuation lines is insignificant. Blank continuation lines are also allowed.
A logical line that contains only spaces, tabs, formfeeds and possibly a comment is ignored. PEP 8 (a Python Enhancement Proposal) also says that there should be one or two blank lines between function and class definitions. We will study PEP 8 later. It is a general coding best-practice guideline. During interactive input of statements, handling of a blank line may differ depending on the implementation of the read-eval-print loop. In the standard interactive interpreter, an entirely blank logical line (i.e. one containing not even whitespace or a comment) terminates a multi-line statement.
Since Python supports Unicode, there can be non-ASCII text in a program, so it is important that the text editor and the interpreter agree on the encoding of the source file. In languages like C/C++ the text we are used to is typically English. There are a few ways to declare the encoding. If the first or second line of a program contains a comment which matches the regular expression coding[=:]\s*([-\w.]+) then this comment is processed as an encoding declaration. The first group of this expression names the encoding of the source file. Do not fret if you do not understand this regular expression; in due course you will be able to understand it when we study regular expressions. The comment which contains the encoding must appear on its own line. If it is on the second line then the first line must also be a comment-only line, which is typically a shebang line. The recommended way of writing this is given below:
# -*- coding: <encoding-name> -*-
Typically, we deal with utf-8, which is written as:
# -*- coding: utf-8 -*-
However, utf-8 is the default encoding in Python 3, so you can skip this. For the Vim editor you can use a Vim-specific line, which is:
# vim:fileencoding=<encoding-name>
If the first bytes of the file are the UTF-8 byte-order mark (b'\xef\xbb\xbf'), the declared file encoding is utf-8 (this is supported, among others, by Microsoft's Notepad).
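To see what the regular expression quoted above picks out, here is a small sketch using the standard `re` module. The helper `declared_encoding` is a hypothetical name, and the check is simplified compared with what the interpreter really does (it ignores the first-or-second-line rule and the byte-order mark).

```python
import re

# The regular expression quoted in the text for encoding declarations.
ENCODING_RE = re.compile(r"coding[=:]\s*([-\w.]+)")


def declared_encoding(comment_line):
    """Return the encoding named in a comment line, or None if there is none."""
    if not comment_line.lstrip().startswith("#"):
        return None
    match = ENCODING_RE.search(comment_line)
    return match.group(1) if match else None


print(declared_encoding("# -*- coding: utf-8 -*-"))     # utf-8
print(declared_encoding("# vim:fileencoding=latin-1"))  # latin-1
print(declared_encoding("# just a comment"))            # None
```

Both the Emacs-style and the Vim-style lines match because each contains the text `coding` followed by `:` or `=` and then the encoding name.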
In Python, indentation is significant and is used to determine how statements are grouped into blocks. You can use tabs or spaces for indentation. Usually, people use 4 spaces and replace tabs with 4 spaces. You can configure this in your IDE/editor easily. You should not mix tabs and spaces: if a file uses them in a way that makes the indentation inconsistent, Python raises a TabError exception while parsing the file and refuses to run the program. One reason is that text editors treat tabs differently across platforms, so a mixture of tabs and spaces can look aligned on screen without being so.
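The following sketch shows the error without needing a file on disk. It compiles a string in which one line is indented with a tab and the next with eight spaces; `indentation_error_name` is a hypothetical helper written for this illustration.

```python
# One line is indented with a tab, the next with eight spaces. They may look
# aligned in an editor, but Python cannot tell whether they are.
mixed_source = "if True:\n\tx = 1\n        y = 2\n"
clean_source = "if True:\n    x = 1\n    y = 2\n"


def indentation_error_name(code):
    """Compile code and return the name of the indentation error, if any."""
    try:
        compile(code, "<example>", "exec")
    except IndentationError as exc:  # TabError is a subclass of IndentationError
        return type(exc).__name__
    return None


print(indentation_error_name(mixed_source))  # TabError
print(indentation_error_name(clean_source))  # None
```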
A formfeed character may be present at the beginning of a line and is ignored for indentation calculations. Formfeed characters occurring elsewhere in the leading whitespace have an undefined effect. Usually, we do not care about this because we do not enter formfeed characters in our source code, but it is something to keep in mind if you do.
Except at the beginning of a line or in string literals, the space, tab and formfeed characters can be used interchangeably to separate tokens. Whitespace is needed between two tokens only if their concatenation could otherwise be interpreted as a different token. For example, ab is one token but a b is two tokens. Besides NEWLINE, INDENT and DEDENT, the following categories of tokens are understood by Python's parser: identifiers (names of variables, functions, classes etc.), keywords, literals, operators and delimiters. Whitespace characters are not tokens but they delimit tokens. Where there is an ambiguity, a token is the longest possible string that forms a legal token when read from left to right. This longest-match rule is common to the lexical analyzers that feed parsers such as LR and LALR parsers. BTW, LR parsers were invented by the legendary computer scientist Donald Knuth.
Identifiers are names used to identify variables, functions, classes and modules. For example, in the first program a, b and print are identifiers. They are described by the following lexical rules.
The syntax of identifiers in Python is based on the Unicode standard annex UAX-31, with changes defined below; see also PEP-3131 for further details.
Within the ASCII range (U+0001..U+007F), the characters which can be used in Python identifiers are a-z, A-Z, 0-9 and _. However, a digit cannot be the first character of an identifier. Identifiers are unlimited in length and are case-sensitive, i.e. a and A are two different identifiers. Similarly, my, My, mY and MY are four different identifiers.
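These rules can be tried out with the built-in `str.isidentifier` method. The sketch below combines it with the `keyword` module; `is_valid_name` is a hypothetical helper, and the candidate names are made up.

```python
import keyword


def is_valid_name(name):
    """True if name is a legal identifier that is not a reserved keyword."""
    return name.isidentifier() and not keyword.iskeyword(name)


for candidate in ["a", "my_var", "_private", "2fast", "my-var", "class"]:
    print(candidate, is_valid_name(candidate))

# Identifiers are case-sensitive, so these are four distinct names.
my, My, mY, MY = 1, 2, 3, 4
print(my, My, mY, MY)  # 1 2 3 4
```

`2fast` fails because it starts with a digit, `my-var` because `-` is an operator, and `class` because it is a keyword.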
Literals are notations for constant values of some built-in types. Almost all programs make use of literals in some or the other way. Without literals you cannot initialize a variable.
There are three types of numeric literals: integers, floating point numbers and imaginary numbers. Complex literals or complex numbers can be formed by combining a real number with an imaginary number.
Note that numeric literals do not include a sign; something like -1 is actually an expression composed of the unary operator '-' and the literal 1.
Integer literals are described by the following lexical definitions:
    integer      ::=  decinteger | bininteger | octinteger | hexinteger
    decinteger   ::=  nonzerodigit (["_"] digit)* | "0"+ (["_"] "0")*
    bininteger   ::=  "0" ("b" | "B") (["_"] bindigit)+
    octinteger   ::=  "0" ("o" | "O") (["_"] octdigit)+
    hexinteger   ::=  "0" ("x" | "X") (["_"] hexdigit)+
    nonzerodigit ::=  "1"..."9"
    digit        ::=  "0"..."9"
    bindigit     ::=  "0" | "1"
    octdigit     ::=  "0"..."7"
    hexdigit     ::=  digit | "a"..."f" | "A"..."F"
The way to read this is from bottom to top. If you start from the bottom you will see the definitions of digit, bindigit and octdigit, and then digit is used in hexdigit. The rules higher up are built from these, so reading from bottom to top is easier.
There is no limit on the length of integer literals. An integer in Python can be as big as memory can store. Underscores are ignored when determining the value of the literal; they are used purely for readability. Note that leading zeros in a non-zero decimal number are not allowed. This is done for disambiguation with C-style octal literals, which Python used before version 3.0.
Given below are some examples for integer literals:
    7     2147483647                        0o177    0b100110111
    3     79228162514264337593543950336     0o377    0xdeadbeef
          100_000_000_000                   0b_1110_0101
Note that is and is .
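To check what values these literals denote, you can simply print them. This is a small sketch using only literals from the examples above.

```python
print(0o177, 0o377)    # 127 255
print(0b100110111)     # 311
print(0xdeadbeef)      # 3735928559
print(0b_1110_0101)    # 229

# Underscores do not change the value.
print(100_000_000_000 == 100000000000)  # True

# Very large literals are still plain int objects.
big = 79228162514264337593543950336
print(type(big).__name__, big == 2 ** 96)  # int True
```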
Unlike statically typed languages like C/C++, we do not have short or long modifiers for integers. There are no separate classes for integers like long of C/C++. One class of integers covers all integers in Python.
If you assign such a literal to a variable then the type of that variable will be int, which is a class. You can find the type by calling the type function on the variable. For example,
    >>> x = 5
    >>> type(x)
    <class 'int'>
Floating point literals i.e. fractional numbers are described by the following lexical definitions:
    floatnumber   ::=  pointfloat | exponentfloat
    pointfloat    ::=  [digitpart] fraction | digitpart "."
    exponentfloat ::=  (digitpart | pointfloat) exponent
    digitpart     ::=  digit (["_"] digit)*
    fraction      ::=  "." digitpart
    exponent      ::=  ("e" | "E") ["+" | "-"] digitpart
As you can observe, the radix (base of the number system) is always 10 for these literals. For example, 077e010 is legal, and denotes the same number as 77e10. The allowed range of floating point literals is implementation-dependent. As in integer literals, underscores are supported for digit grouping.
Again unlike C/C++, we do not have two different precisions for floating-point numbers, i.e. we do not have separate float and double types in Python. There is a single float class, and in CPython it is usually implemented with the C double type, which on most platforms is an IEEE-754 double-precision number.
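The properties of the float type on your machine can be inspected through `sys.float_info`. The sketch below assumes a typical platform where float is an IEEE-754 double; the printed numbers may differ elsewhere.

```python
import sys

# Number of bits in the significand; 53 on IEEE-754 double platforms.
print(sys.float_info.mant_dig)

# Largest representable finite float.
print(sys.float_info.max)

# Binary floating point cannot represent 0.1 exactly, hence this surprise.
print(0.1 + 0.2 == 0.3)  # False

# The integer and exponent parts are always read in base 10.
print(float("077e010") == 77e10)  # True
```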
Some examples are given below:
3.14 10. .001 1e100 3.14e-10 0e0 3.14_15_93
If you assign such a literal to a variable then the type of that variable will be float, which is a class. You can find the type by calling the type function on the variable, as shown in the example for ints. For example,
    >>> x = 5.7
    >>> type(x)
    <class 'float'>
Imaginary literals are described by the following lexical definitions:
imagnumber ::= (floatnumber | digitpart) ("j" | "J")
In an imaginary literal the real part is 0.0. Complex numbers are represented as a pair of floating point numbers and have the same restrictions on their range. An example of a complex number is 6+8j. Some examples of imaginary literals are given below:
3.14j 10.j 10j .001j 1e100j 3.14e-10j 3.14_15_93j
If you assign such a literal to a variable then the type of that variable will be complex, which is a class. For example,
    >>> x = 4 + 4j
    >>> type(x)
    <class 'complex'>
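The real and imaginary parts of a complex number are available as attributes. A short sketch, using the complex number 6+8j mentioned above:

```python
z = 6 + 8j  # the real literal 6 added to the imaginary literal 8j

print(z.real, z.imag)  # 6.0 8.0, both parts are stored as floats
print(abs(z))          # 10.0, the magnitude sqrt(6**2 + 8**2)

# An imaginary literal on its own has a real part of 0.0.
print((8j).real)       # 0.0
print(type(8j).__name__)  # complex
```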
String and bytes literals are described by the following definitions:
    stringliteral   ::=  [stringprefix](shortstring | longstring)
    stringprefix    ::=  "r" | "u" | "R" | "U" | "f" | "F"
                         | "fr" | "Fr" | "fR" | "FR" | "rf" | "rF" | "Rf" | "RF"
    shortstring     ::=  "'" shortstringitem* "'" | '"' shortstringitem* '"'
    longstring      ::=  "'''" longstringitem* "'''" | '"""' longstringitem* '"""'
    shortstringitem ::=  shortstringchar | stringescapeseq
    longstringitem  ::=  longstringchar | stringescapeseq
    shortstringchar ::=  <any source character except "\" or newline or the quote>
    longstringchar  ::=  <any source character except "\">
    stringescapeseq ::=  "\" <any source character>
    bytesliteral   ::=  bytesprefix(shortbytes | longbytes)
    bytesprefix    ::=  "b" | "B" | "br" | "Br" | "bR" | "BR" | "rb" | "rB" | "Rb" | "RB"
    shortbytes     ::=  "'" shortbytesitem* "'" | '"' shortbytesitem* '"'
    longbytes      ::=  "'''" longbytesitem* "'''" | '"""' longbytesitem* '"""'
    shortbytesitem ::=  shortbyteschar | bytesescapeseq
    longbytesitem  ::=  longbyteschar | bytesescapeseq
    shortbyteschar ::=  <any ASCII character except "\" or newline or the quote>
    longbyteschar  ::=  <any ASCII character except "\">
    bytesescapeseq ::=  "\" <any ASCII character>
One syntactic restriction not indicated by these definitions is that whitespace is not allowed between the stringprefix or bytesprefix and the rest of the literal. The source character set is defined by the encoding declaration; it is UTF-8 if no encoding declaration is given in the source file; see section Encoding of Source Files.
Both types of literals can be enclosed in matching single quotes ('), double quotes (") or triple quotes (''' or """). The backslash (\) character is used to form escape sequences. Strings written with triple quotes are called triple-quoted strings.
Bytes literals are sequences of bytes, i.e. each item is a value from 0 to 255. They may only contain ASCII characters directly; any byte with a numeric value of 128 or greater must be written with an escape. Bytes literals must always have the prefix 'b' or 'B'.
Suppose we have a string s; we can convert it to bytes by calling s.encode('utf-8'), which means UTF-8 should be used to encode it. Similarly, a bytes object b can be converted to a string with b.decode('utf-8'), where UTF-8 is the encoding the bytes were written in.
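A round trip between the two types looks like this. The sample string is made up for the illustration; it contains one non-ASCII character so that the difference between characters and bytes is visible.

```python
s = "héllo"            # five characters, one of them non-ASCII
b = s.encode("utf-8")  # str -> bytes

print(b)               # b'h\xc3\xa9llo', the é takes two bytes in UTF-8
print(len(s), len(b))  # 5 6

print(b.decode("utf-8") == s)  # True, bytes -> str gives the original back

# Each item of a bytes object is an integer from 0 to 255.
print(list(b"\x00\xff"))  # [0, 255]
```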
Both string and bytes literals can be prefixed with 'r' or 'R'; such literals are called raw strings because they treat backslashes as literal characters. Raw strings are useful when writing regular expressions because you do not have to escape backslashes, which makes the regular expressions much more readable. We will use raw strings a lot when we study regular expressions. Because of this behavior, '\U' and '\u' escapes are not treated specially in raw strings.
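The difference between a normal and a raw string is easiest to see by comparing lengths. The following is a minimal sketch; the sample text passed to `re.findall` is made up.

```python
import re

# "\n" is one character (a linefeed); r"\n" is two: a backslash and an n.
print(len("\n"), len(r"\n"))  # 1 2

# A raw string saves doubling the backslashes.
print(r"\d+" == "\\d+")  # True

# Typical use: a regular expression that matches runs of digits.
digits = re.findall(r"\d+", "room 101, floor 7")
print(digits)  # ['101', '7']
```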
A string literal with 'f' or 'F' in its prefix is a formatted string literal; see Formatted String Literals. The 'f' may be combined with 'r', but not with 'b' or 'u'; therefore raw formatted strings are possible, but formatted bytes literals are not.
In triple-quoted literals, unescaped newlines and quotes are allowed and are retained, except that three unescaped quotes in a row terminate the literal.
Unless an 'r' or 'R' prefix is present, escape sequences in string and bytes literals are interpreted according to rules similar to those used by Standard C. The recognized escape sequences are:
Table 2.2. Escape Sequences
Escape Sequence | Meaning | Notes |
---|---|---|
\newline | Backslash and newline ignored | |
\\ | Backslash (\) | |
\' | Single quote (') | |
\" | Double quote (") | |
\a | ASCII Bell (BEL) | |
\b | ASCII Backspace (BS) | |
\f | ASCII Formfeed (FF) | |
\n | ASCII Linefeed (LF) | |
\r | ASCII Carriage Return (CR) | |
\t | ASCII Horizontal Tab (TAB) | |
\v | ASCII Vertical Tab (VT) | |
\ooo | Character with octal value ooo | (1, 3) |
\xhh | Character with hex value hh | (2, 3) |
Escape sequences only recognized in string literals are:
Table 2.3. Escape Sequences inside String Literals
Escape Sequence | Meaning | Notes |
---|---|---|
\N{name} |
Character named name in the unicode database | (4) |
\uxxxx | Character with 16-bit hex value xxxx | (5) |
\Uxxxxxxxx | Character with 32-bit hex value xxxxxxxx | (6) |
Notes:
Unlike Standard C, all unrecognized escape sequences are left in the string unchanged, i.e. the backslash is left in the result. Even in a raw literal, quotes can be escaped with a backslash, but the backslash remains in the result; for example, r"\"" is a valid string literal consisting of two characters: a backslash and a double quote. r"\" is not a valid string literal by itself (even a raw string cannot end in an odd number of backslashes). Note that in a raw literal a single backslash followed by a newline is interpreted as those two characters as part of the literal, not as a line continuation.
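A few of the escape sequences from the tables can be tried directly. This sketch uses the letter A, whose code is 65 in decimal, 101 in octal and 41 in hexadecimal.

```python
# Four ways of writing the same character in a string literal.
print("\x41", "\101", "\u0041", "\N{LATIN CAPITAL LETTER A}")  # A A A A

# In a bytes literal the hex and octal escapes denote a byte value.
print(b"\x41" == b"A", b"\101" == b"A")  # True True

# Common single-character escapes.
print(len("\t"), len("\\"), len("\'"))  # 1 1 1

# In a raw literal the backslash before the quote stays in the result.
raw = r"\""
print(len(raw), raw)  # 2 \"
```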
Multiple adjacent string or bytes literals (delimited by whitespace), possibly using different quoting conventions, are allowed, and their meaning is the same as their concatenation. Thus, "hello" 'world' is equivalent to "helloworld". This feature can be used to reduce the number of backslashes needed, to split long strings conveniently across long lines, or even to add comments to parts of strings, for example:
    re.compile("[A-Za-z_]"       # letter or underscore
               "[A-Za-z0-9_]*"   # letter, digit or underscore
              )
Note that while this feature is defined at the syntactical level, it is implemented at compile time. The '+' operator must be used to concatenate string expressions at run time. Literal concatenation even allows plain string literals to be combined with formatted string literals.
The typical way to concatenate two strings is the '+' operator, which works through operator overloading in the str class. For example, if a = 'hello' and b = 'world' then a + b will give 'helloworld'; note that no space is inserted between the two. We will study operator overloading (it is a feature of object-oriented programming) after we have studied classes.
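The two kinds of concatenation can be compared side by side. A minimal sketch, using the same values of a and b as in the text:

```python
a = 'hello'
b = 'world'

print(a + b)             # helloworld, + adds nothing between the operands
print(a + ' ' + b)       # hello world
print(' '.join([a, b]))  # hello world

# Adjacent literals are joined at compile time and give the same result.
print("hello" 'world' == a + b)  # True
```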
Suppose you have an integer i which you want to append to a string s. Before format strings (popularly known as f-strings) you would write s + str(i); with an f-string you can write s = f'{s}{i}' and the variables will be interpolated.
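Both approaches give the same result, as the following sketch shows. The values of s and i are made up for the illustration.

```python
s = "item"
i = 7

old_style = s + str(i)  # explicit conversion, then concatenation
new_style = f"{s}{i}"   # both names are interpolated by the f-string

print(old_style, new_style)  # item7 item7

# Without str() the + operator refuses to mix str and int.
try:
    s + i
except TypeError as exc:
    print("TypeError:", exc)
```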
A formatted string literal or f-string is a string literal that is prefixed with 'f' or 'F'. These may contain replacement fields (in the previous case both s and i are replacement fields), which are expressions delimited by curly braces {}. While other string literals always have a constant value, formatted strings are really expressions evaluated at run time.
Escape sequences are decoded like in ordinary string literals (except when a literal is also marked as a raw string). After decoding, the grammar for the contents of the string is:
    f_string          ::=  (literal_char | "{{" | "}}" | replacement_field)*
    replacement_field ::=  "{" f_expression ["="] ["!" conversion] [":" format_spec] "}"
    f_expression      ::=  (conditional_expression | "*" or_expr)
                             ("," conditional_expression | "," "*" or_expr)* [","]
                           | yield_expression
    conversion        ::=  "s" | "r" | "a"
    format_spec       ::=  (literal_char | NULL | replacement_field)*
    literal_char      ::=  <any code point except "{", "}" or NULL>
The parts of the string outside curly braces are treated literally, except that any doubled curly braces '{{' or '}}' are replaced with the corresponding single curly brace. A single opening curly brace '{' marks a replacement field, which starts with a Python expression. To display both the expression text and its value after evaluation, an equal sign may be added after the expression. A conversion field may be introduced by an exclamation point '!'. A format specifier may also be appended, introduced by a colon ':'. A replacement field ends with a closing curly brace '}'.
Expressions in formatted string literals are treated like Python expressions surrounded by parentheses, with a few exceptions. An empty expression is not allowed, and both lambda and assignment expressions := must be surrounded by explicit parentheses. Replacement expressions can contain line breaks, but they cannot contain comments. Each expression is evaluated in the context where the formatted string literal appears, in order from left to right.
When the equal sign '=' is provided, the output will have the expression text, the '=' and the evaluated value. Spaces after the opening brace '{', within the expression and after the '=' are all retained in the output. By default, the '=' causes the repr() of the expression to be provided, unless there is a format specified. When a format is specified it defaults to the str() of the expression unless a conversion '!r' is declared.
If a conversion is specified, the result of evaluating the expression is converted before formatting. Conversion '!s' calls str() on the result, '!r' calls repr(), and '!a' calls ascii().
The result is then formatted using the format() protocol. The format specifier is passed to the __format__() method of the expression or conversion result. An empty string is passed when the format specifier is omitted. The formatted result is then included in the final value of the whole string.
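The format protocol can be seen in action with a small class of our own. The class `Temperature` and its `"F"` format specifier are hypothetical, invented only for this sketch; they are not part of Python.

```python
class Temperature:
    """Hypothetical class used only to illustrate the format protocol."""

    def __init__(self, celsius):
        self.celsius = celsius

    def __format__(self, format_spec):
        # An empty spec is passed when the replacement field has none.
        if format_spec == "":
            return f"{self.celsius} C"
        if format_spec == "F":
            return f"{self.celsius * 9 / 5 + 32} F"
        raise ValueError(f"unknown format specifier: {format_spec!r}")

    def __repr__(self):
        return f"Temperature({self.celsius!r})"


t = Temperature(100)
print(f"{t}")    # 100 C, __format__ receives ""
print(f"{t:F}")  # 212.0 F, __format__ receives "F"
print(f"{t!r}")  # Temperature(100), converted with repr() first
print(f"{t=}")   # t=Temperature(100), '=' uses repr() by default
```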
Top-level format specifiers may include nested replacement fields. These nested fields may include their own conversion fields and format specifiers, but may not include more deeply-nested replacement fields. The format specifier mini-language is the same as that used by the str.format() method.
Formatted string literals may be concatenated, but replacement fields cannot be split across literals.
Given below are some examples of formatted string literals:
    >>> import decimal
    >>> from datetime import datetime
    >>> name = "Fred"
    >>> f"He said his name is {name!r}."
    "He said his name is 'Fred'."
    >>> f"He said his name is {repr(name)}."  # repr() is equivalent to !r
    "He said his name is 'Fred'."
    >>> width = 10
    >>> precision = 4
    >>> value = decimal.Decimal("12.34567")
    >>> f"result: {value:{width}.{precision}}"  # nested fields
    'result:      12.35'
    >>> today = datetime(year=2017, month=1, day=27)
    >>> f"{today:%B %d, %Y}"  # using date format specifier
    'January 27, 2017'
    >>> f"{today=:%B %d, %Y}"  # using date format specifier and debugging
    'today=January 27, 2017'
    >>> number = 1024
    >>> f"{number:#0x}"  # using integer format specifier
    '0x400'
    >>> foo = "bar"
    >>> f"{ foo = }"  # preserves whitespace
    " foo = 'bar'"
    >>> line = "The mill's closed"
    >>> f"{line = }"
    'line = "The mill\'s closed"'
    >>> f"{line = :20}"
    "line = The mill's closed   "
    >>> f"{line = !r:20}"
    'line = "The mill\'s closed" '
A consequence of sharing the same syntax as regular string literals is that, in Python versions before 3.12, characters in the replacement fields must not conflict with the quoting used in the outer formatted string literal:
    f"abc {a["x"]} def"    # error: outer string literal ended prematurely
    f"abc {a['x']} def"    # workaround: use different quoting
In Python versions before 3.12, backslashes are not allowed in format expressions and will raise an error:
f"newline: {ord('\n')}" # raises SyntaxError
To include a value in which a backslash escape is required, create a temporary variable.
    >>> newline = ord('\n')
    >>> f"newline: {newline}"
    'newline: 10'
Formatted string literals cannot be used as docstrings, even if they do not include expressions.
    >>> def foo():
    ...     f"Not a docstring"
    ...
    >>> foo.__doc__ is None
    True
See also PEP 498 for the proposal that added formatted string literals, and str.format(), which uses a related format string mechanism.
The following tokens are operators:
+ - * ** / // % @
<< >> & | ^ ~ :=
< > <= >= == !=
The following tokens serve as delimiters in the grammar:
( ) [ ] { }
, : . ; @ = ->
+= -= *= /= //= %= @=
&= |= ^= >>= <<= **=
The period can also occur in floating-point and imaginary literals. A sequence of three periods has a special meaning as an ellipsis literal. The second half of the list, the augmented assignment operators, serve lexically as delimiters, but also perform an operation.
The following printing ASCII characters have special meaning as part of other tokens or are otherwise significant to the lexical analyzer:
' " # \
The following printing ASCII characters are not used in Python. Their occurrence outside string literals and comments is an unconditional error:
$ ? `
Booleans are used to represent truth values in a program. They are used in if or while conditions or as operands of boolean operations, which you will see in the next chapter. The boolean values are True and False.
By default an object is considered true unless its class defines either a __bool__() method that returns False or a __len__() method that returns zero, when called with the object. Given below is the list of built-in objects which are considered false:
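As a sketch, the built-in objects that are commonly listed as false can be collected and checked in code. The list follows the Python documentation on truth value testing as far as I recall it, and the class `Basket` is a hypothetical example of the __len__() rule.

```python
from decimal import Decimal
from fractions import Fraction

# Constants, numeric zeros and empty sequences/collections.
falsy = [None, False, 0, 0.0, 0j, Decimal(0), Fraction(0, 1),
         "", b"", (), [], {}, set(), range(0)]

print(all(not value for value in falsy))  # True, every one tests as false


class Basket:
    """Hypothetical container whose truth value follows its length."""

    def __init__(self, items):
        self.items = list(items)

    def __len__(self):
        return len(self.items)


print(bool(Basket([])), bool(Basket(["apple"])))  # False True
```

Note that most of these objects test as false without being equal to False; for example `[] == False` is False.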
In this chapter we have covered a lot of theory. Before we can build meaningful programs we need to study a bit more, including operators, control-flow statements, lists, dictionaries and functions. In the next chapter we will study operators (which we have listed here) along with some basic programs involving them. These are foundational building blocks of programming and therefore need to be studied carefully and in detail.
© 2022 Shiv S. Dayal. www.ashtavakra.org. GNU FDL license v1.3 or later is applicable where not stated.