A few chapter-wise pickings from Dive Into Python by Mark Pilgrim.
Objects in Python
- Everything in Python is an object. Strings are objects. Lists are objects. Functions are objects. Even modules are objects. Almost everything has attributes and methods. All functions have a built-in attribute __doc__, which returns the doc string defined in the function’s source code.
- Different programming languages define “object” in different ways. In some, it means that all objects must have attributes and methods; in others, it means that all objects are subclassable. In Python, the definition is looser; some objects have neither attributes nor methods, and not all objects are subclassable. But everything is an object in the sense that it can be assigned to a variable or passed as an argument to a function.
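A minimal sketch of the points above, written for Python 3. The function greet is a made-up example, not one from the book; it shows a function carrying a __doc__ attribute, being bound to a second name, and being passed as an argument.

```python
def greet(name):
    """Return a friendly greeting for name."""
    return "Hello, " + name

# A function is an object: it has attributes...
doc = greet.__doc__

# ...it can be assigned to another variable...
alias = greet

# ...and it can be passed as an argument to another function.
results = list(map(alias, ["Ann", "Bob"]))
```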
Dictionaries in Python
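- Dictionaries are unordered in the Python versions the book covers. Since Python 3.7, a dictionary preserves insertion order, but it still has no positional index.
- Within a single dictionary, the values don’t all need to be the same type.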
- del lets you delete individual items from a dictionary by key.
- clear deletes all items from a dictionary.
- The set of empty curly braces signifies a dictionary without any items.
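The dictionary operations above can be sketched as follows. The keys and values are invented for illustration.

```python
# Values of different types can live in the same dictionary.
d = {"server": "db.example.org", "port": 5432, "retries": 3}

del d["retries"]           # delete one item by key
keys_left = sorted(d)      # the remaining keys

d.clear()                  # delete every item
is_empty = (d == {})       # an empty pair of braces is an empty dictionary
```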
List in Python
- A list is an ordered set of elements enclosed in square brackets.
- Negative list indices: li[-n] == li[len(li) - n].
- Reading the list from left to right, the first slice index specifies the first element you want, and the second slice index specifies the first element you don’t want. The return value is everything in between.
- li[:n] will always return the first n elements, and li[n:] will return the rest, regardless of the length of the list.
- li[:] is shorthand for making a complete copy of a list.
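A short sketch of the indexing and slicing rules above, using an invented list:

```python
li = ["a", "b", "mpilgrim", "z", "example"]

last = li[-1]                  # same element as li[len(li) - 1]
middle = li[1:3]               # elements 1 and 2; index 3 is the first one left out
head, tail = li[:3], li[3:]    # the first three elements, and the rest
copy = li[:]                   # a new list holding the same elements
```

Because li[:n] and li[n:] split the list at the same point, head + tail always rebuilds the original list.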
Index in Python
- index finds the first occurrence of a value in the list and returns its index.
- If the value is not found in the list, Python raises an exception. This is notably different from most languages, which will return some invalid index. While this may seem annoying, it is a good thing, because it means your program will crash at the source of the problem, rather than later on when you try to use the invalid index.
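A sketch of this behaviour, with an invented list. In current Python the exception raised is ValueError; the in operator is shown as one way to test for membership without triggering it.

```python
li = ["a", "b", "new", "mpilgrim", "new"]

first_new = li.index("new")    # only the first occurrence is reported

try:
    li.index("c")              # "c" is not in the list
except ValueError as exc:
    error = str(exc)           # the failure surfaces right here
else:
    error = None

found = "c" in li              # a membership test never raises
```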
+, *, += with lists in Python
>>> li = ['a', 'b', 'mpilgrim']
>>> li = li + ['example', 'new']
>>> li
['a', 'b', 'mpilgrim', 'example', 'new']
>>> li += ['two']
>>> li
['a', 'b', 'mpilgrim', 'example', 'new', 'two']
>>> li = [1, 2] * 3
>>> li
[1, 2, 1, 2, 1, 2]
+ and extend() in Python
- The + operator returns a new (concatenated) list as a value, whereas extend only alters an existing list. This means that extend is faster, especially for large lists.
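The difference is easiest to see when two names refer to the same list. This sketch only demonstrates the new-list versus in-place behaviour; it does not measure speed.

```python
a = [1, 2]
b = a               # b and a name the same list
a = a + [3]         # + builds a new list and rebinds a; b is untouched

c = [1, 2]
d = c               # d and c name the same list
c.extend([3])       # extend changes that list in place, so d sees it too

e = [1, 2]
f = e
e += [3]            # on a list, += also modifies in place, like extend
```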
Tuple in Python
- A tuple is an immutable list. A tuple cannot be changed in any way once it is created.
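A small sketch of what immutability means in practice, with an invented tuple: reading works as it does for a list, but assignment fails and the mutating list methods are absent.

```python
t = ("a", "b", "mpilgrim")

first = t[0]                  # indexing works as with a list
rest = t[1:]                  # slicing a tuple gives a new tuple

try:
    t[0] = "z"                # item assignment is not allowed
except TypeError:
    changed = False
else:
    changed = True

has_append = hasattr(t, "append")   # tuples have no append method
```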
Classes in Python
- Classes can (and should) have doc strings too, just like modules and functions.
- __init__ is called immediately after an instance of the class is created. It would be tempting but incorrect to call this the constructor of the class. It’s tempting, because it looks like a constructor (by convention, __init__ is the first method defined for the class), acts like one (it’s the first piece of code executed in a newly created instance of the class), and even sounds like one (“init” certainly suggests a constructor-ish nature). Incorrect, because the object has already been constructed by the time __init__ is called, and you already have a valid reference to the new instance of the class. But __init__ is the closest thing you’re going to get to a constructor in Python, and it fills much the same role.
- The first argument of every class method, including __init__, is always a reference to the current instance of the class. By convention, this argument is always named self. In the __init__ method, self refers to the newly created object; in other class methods, it refers to the instance whose method was called. Although you need to specify self explicitly when defining the method, you do not specify it when calling the method; Python will add it for you automatically.
- __init__ methods can take any number of arguments, and just like functions, the arguments can be defined with default values, making them optional to the caller. In the book’s example, the filename argument has a default value of None, which is the Python null value.
When to use self and __init__
- When defining your class methods, you must explicitly list self as the first argument for each method, including __init__. When you call a method of an ancestor class from within your class, you must include the self argument. But when you call your class method from outside, you do not specify anything for the self argument; you skip it entirely, and Python automatically adds the instance reference for you. I am aware that this is confusing at first; it’s not really inconsistent, but it may appear inconsistent because it relies on a distinction (between bound and unbound methods) that you don’t know about yet.
- If you forget everything else, remember this one thing, because I promise it will trip you up: __init__ methods are optional, but when you define one, you must remember to explicitly call the ancestor’s __init__ method (if it defines one). This is more generally true: whenever a descendant wants to extend the behavior of the ancestor, the descendant method must explicitly call the ancestor method at the proper time, with the proper arguments.
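The rules about self and __init__ can be sketched like this. Base and FileInfo here are simplified, hypothetical classes loosely modelled on the book's example, not its actual code.

```python
class Base:
    """Hypothetical ancestor class that owns a data dictionary."""
    def __init__(self):
        self.data = {}


class FileInfo(Base):
    """Hypothetical descendant that stores a file name."""
    def __init__(self, filename=None):      # filename is optional
        Base.__init__(self)                 # ancestor call: self is passed explicitly
        self.data["name"] = filename

    def name(self):                         # self is listed when defining the method
        return self.data["name"]


f = FileInfo("song.mp3")
g = FileInfo()                  # falls back to the default, None

bound = f.name()                # self is supplied automatically
unbound = FileInfo.name(f)      # the same call with self spelled out
```

In current Python the ancestor call is usually written super().__init__(), which avoids naming the ancestor class.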
Garbage Collection in Python
- If creating new instances is easy, destroying them is even easier. In general, there is no need to explicitly free instances, because they are freed automatically when the variables assigned to them go out of scope.
- Memory leaks are rare in Python.
- The technical term for this form of garbage collection is “reference counting”. Python keeps a list of references to every instance created.
- In previous versions of Python, there were situations where reference counting failed, and Python couldn’t clean up after you. If you created two instances that referenced each other (for instance, a doubly-linked list, where each node has a pointer to the previous and next node in the list), neither instance would ever be destroyed automatically because Python (correctly) believed that there is always a reference to each instance. Python 2.0 has an additional form of garbage collection called “mark-and-sweep” which is smart enough to notice this virtual gridlock and clean up circular references correctly.
I will write a separate blog post on this.
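A sketch of the circular-reference case, assuming CPython 3 and its standard gc and weakref modules. Node and make_cycle are invented names. Automatic collection is switched off for a moment so the effect of the explicit gc.collect() call is visible.

```python
import gc
import weakref


class Node:
    """Hypothetical node that can point at another node."""
    def __init__(self, name):
        self.name = name
        self.other = None


def make_cycle():
    a = Node("a")
    b = Node("b")
    a.other = b
    b.other = a              # a and b now reference each other
    return weakref.ref(a)    # a weak reference does not keep a alive


was_enabled = gc.isenabled()
gc.disable()                         # keep the cycle collector from running early
probe = make_cycle()                 # the local names a and b are gone now
alive_before = probe() is not None   # the cycle keeps both reference counts above zero
gc.collect()                         # the cycle collector finds and frees the pair
alive_after = probe() is not None
if was_enabled:
    gc.enable()
```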
Function Overloading in Python
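- Some languages support function overloading by argument list (several functions with the same name but a different number or type of arguments), and some support overloading by argument type. Python supports neither of these; it has no form of function overloading whatsoever.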
- Methods are defined solely by their name, and there can be only one method per class with a given name. So if a descendant class has an __init__ method, it always overrides the ancestor __init__ method, even if the descendant defines it with a different argument list. And the same rule applies to any other method.
- Guido, the original author of Python, explains method overriding this way: “Derived classes may override methods of their base classes. Because methods have no special privileges when calling other methods of the same object, a method of a base class that calls another method defined in the same base class, may in fact end up calling a method of a derived class that overrides it.”
- All methods in Python are effectively virtual.
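Both points can be sketched with a few invented classes: a second definition with the same name replaces the first, and a base-class method ends up calling the derived override.

```python
class Twice:
    def f(self, a):
        return "one-arg"

    def f(self, a, b):          # same name: this replaces the first f
        return "two-arg"


class Base:
    def describe(self):
        return "I am " + self.kind()    # looked up on the actual instance

    def kind(self):
        return "base"


class Derived(Base):
    def kind(self, shout=False):        # different argument list, still an override
        return "DERIVED" if shout else "derived"


try:
    Twice().f(1)                # the one-argument version no longer exists
except TypeError:
    one_arg_call_works = False
else:
    one_arg_call_works = True

two_arg = Twice().f(1, 2)
from_base = Base().describe()
from_derived = Derived().describe()     # Base.describe calls Derived.kind
```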
Special Class Methods in Python
- A special class method is one that you can call yourself, but that Python will also call for you when you use the right syntax.
- __repr__ is a special method that is called when you call repr(instance). The repr function is a built-in function that returns a string representation of an object. It works on any object, not just class instances. You’re already intimately familiar with repr and you don’t even know it. In the interactive window, when you type just a variable name and press the ENTER key, Python uses repr to display the variable’s value. Go create a dictionary d with some data and then print repr(d) to see for yourself.
- __cmp__ is called when you compare class instances. In general, you can compare any two Python objects, not just class instances, by using ==. There are rules that define when built-in datatypes are considered equal; for instance, dictionaries are equal when they have all the same keys and values, and strings are equal when they are the same length and contain the same sequence of characters. For class instances, you can define the __cmp__ method and code the comparison logic yourself, and then you can use == to compare instances of your class and Python will call your __cmp__ special method for you.
- __len__ is called when you call len(instance). The len function is a built-in function that returns the length of an object. It works on any object that could reasonably be thought of as having a length. The len of a string is its number of characters; the len of a dictionary is its number of keys; the len of a list or tuple is its number of elements. For class instances, define the __len__ method and code the length calculation yourself, and then call len(instance) and Python will call your __len__ special method for you.
- __delitem__ is called when you call del instance[key], which you may remember as the way to delete individual items from a dictionary. When you use del on a class instance, Python calls the __delitem__ special method for you.
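A sketch of these special methods on a hypothetical Playlist class, written for Python 3. One deliberate substitution: Python 3 removed __cmp__, so the comparison is implemented with __eq__, which is the method that == calls there.

```python
class Playlist:
    """Hypothetical container used only to illustrate special methods."""
    def __init__(self, tracks):
        self.tracks = dict(tracks)

    def __repr__(self):                 # called by repr(instance)
        return "Playlist(%r)" % (self.tracks,)

    def __eq__(self, other):            # called by instance == other
        return isinstance(other, Playlist) and self.tracks == other.tracks

    def __len__(self):                  # called by len(instance)
        return len(self.tracks)

    def __delitem__(self, key):         # called by del instance[key]
        del self.tracks[key]


p = Playlist({"a": 1, "b": 2})
q = Playlist({"a": 1, "b": 2})

same = (p == q)
size = len(p)
del p["a"]
text = repr(p)
```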
Class attributes in Python
- Class attributes can be used as class-level constants, but they are not really constants. You can also change them.
- There are no constants in Python. Everything can be changed if you try hard enough. This fits with one of the core principles of Python: bad behavior should be discouraged but not banned. If you really want to change the value of None, you can do it, but don’t come running to me when your code is impossible to debug.
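A minimal sketch of a class attribute that is shared by all instances and can still be reassigned. Counter is an invented class name.

```python
class Counter:
    count = 0                           # class attribute, shared by every instance

    def __init__(self):
        self.__class__.count += 1       # update the attribute on the class itself


a = Counter()
b = Counter()
total = Counter.count                   # both instances were counted

Counter.count = 100                     # nothing prevents reassigning the "constant"
seen_by_a = a.count                     # instances see the new class-level value
```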
Private, Public in Python
- Unlike in most languages, whether a Python function, method, or attribute is private or public is determined entirely by its name.
- If the name of a Python function, class method, or attribute starts with (but doesn’t end with) two underscores, it’s private; everything else is public. Python has no concept of protected class methods (accessible only in their own class and descendant classes). Class methods are either private (accessible only in their own class) or public (accessible from anywhere).
- __setitem__ is a special method; normally, you would call it indirectly by using the dictionary syntax on a class instance, but it is public, and you could call it directly if you had a really good reason. However, __parse is private, because it has two underscores at the beginning of its name.
- Strictly speaking, private methods are accessible outside their class, just not easily accessible. Nothing in Python is truly private; internally, the names of private methods and attributes are mangled and unmangled on the fly to make them seem inaccessible by their given names. You can access the __parse method of the ABC class by the name _ABC__parse.
- Acknowledge that this is interesting, but promise to never, ever do it in real code. Private methods are private for a reason, but like many other things in Python, their privateness is ultimately a matter of convention, not force.
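The name mangling described above can be observed directly. This sketch reuses the ABC and __parse names from the notes; the method bodies are invented.

```python
class ABC:
    def __parse(self):                  # two leading underscores: private
        return "parsed"

    def run(self):
        return self.__parse()           # inside the class the plain name works


obj = ABC()
inside = obj.run()

try:
    obj.__parse()                       # outside the class this name does not exist
except AttributeError:
    direct_ok = False
else:
    direct_ok = True

mangled = obj._ABC__parse()             # the mangled name is still reachable
```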
Exception Handling in Python
- Python uses try…except to handle exceptions and raise to generate them. Java and C++ use try…catch to handle exceptions, and throw to generate them.
- This is what a try…finally block is for: code in the finally block will always be executed, even if something in the try block raises an exception. Think of it as code that gets executed on the way out, regardless of what happened before.
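A sketch combining try, except, raise and finally in one place. The function read_setting and the log list are invented for illustration; the log records the order in which the blocks run.

```python
log = []


def read_setting(settings, key):
    """Hypothetical lookup that records what happens on the way through."""
    try:
        log.append("start")
        return settings[key]
    except KeyError:
        log.append("missing")
        raise ValueError("unknown setting: " + key)    # generate a new exception
    finally:
        log.append("cleanup")                          # runs on success and on failure


value = read_setting({"volume": 7}, "volume")

try:
    read_setting({}, "volume")
except ValueError as exc:
    message = str(exc)
```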
Modules in Python
- Modules, like everything else in Python, are objects. Once imported, you can always get a reference to a module through the global dictionary sys.modules.
- Every Python class has a built-in class attribute __module__, which is the name of the module in which the class is defined.
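A short sketch using the standard collections module as a stand-in for any imported module: sys.modules maps the module's name to the module object, and a class's __module__ attribute gives the name to look up.

```python
import collections
import sys
from collections import OrderedDict

# The entry in sys.modules is the very same object that import returned.
same_object = sys.modules["collections"] is collections

# A class remembers the name of the module that defines it...
module_name = OrderedDict.__module__

# ...so the defining module can be found again through sys.modules.
defining_module = sys.modules[module_name]
```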
Getattr and hasattr in Python
- getattr gets a reference to an object by name. hasattr is a complementary function that checks whether an object has a particular attribute; in the book’s example, whether a module has a particular class (although it works for any object and any attribute, just like getattr).
I will write a separate blog post on this.
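A minimal sketch of getattr and hasattr, using the standard math module as the object being inspected. The attribute name no_such_function is deliberately made up.

```python
import math

name = "sqrt"                           # the attribute name is only known as a string

if hasattr(math, name):                 # check before fetching
    func = getattr(math, name)          # same object as math.sqrt
    root = func(16)

# getattr also accepts a default that is returned when the attribute is missing.
missing = getattr(math, "no_such_function", None)
```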
Listdir function in Python
- The listdir function takes a pathname and returns a list of the contents of the directory.
- listdir returns both files and folders, with no indication of which is which. You can use list filtering and the isfile function of the os.path module to separate the files from the folders. isfile takes a pathname and returns 1 if the path represents a file, and 0 otherwise (True and False in current Python versions). The book’s example uses os.path.join to build a full pathname, but isfile also works with a partial path, relative to the current working directory. You can use os.getcwd() to get the current working directory.
- os.path also has an isdir function which returns 1 if the path represents a directory, and 0 otherwise. You can use this to get a list of the subdirectories within a directory.
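A sketch of separating files from folders, assuming Python 3. It builds a throwaway directory with tempfile so it can run anywhere; the names albums and notes.txt are invented.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "albums"))                  # one subdirectory
    with open(os.path.join(root, "notes.txt"), "w") as fh:  # one file
        fh.write("hello")

    names = sorted(os.listdir(root))    # files and folders, mixed together

    # os.path.join builds the full path that isfile and isdir should check.
    files = [n for n in names if os.path.isfile(os.path.join(root, n))]
    dirs = [n for n in names if os.path.isdir(os.path.join(root, n))]
```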
glob module in Python
- The glob module, on the other hand, takes a wildcard and returns the full path of all files and directories matching the wildcard. Here the wildcard is a directory path plus “*.mp3”, which will match all .mp3 files. Note that each element of the returned list already includes the full path of the file.
- You have a music directory, with several subdirectories within it, with .mp3 files within each subdirectory. You can get a list of all of those with a single call to glob, by using two wildcards at once. One wildcard is the “*.mp3” (to match .mp3 files), and one wildcard is within the directory path itself, to match any subdirectory within c:\music. That’s a crazy amount of power packed into one deceptively simple-looking function!
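The two-wildcard call can be sketched against a temporary directory standing in for c:\music. The subdirectory and file names are invented.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as music:
    layout = [("rock", "a.mp3"), ("jazz", "b.mp3"), ("jazz", "cover.jpg")]
    for sub, name in layout:
        os.makedirs(os.path.join(music, sub), exist_ok=True)
        open(os.path.join(music, sub, name), "w").close()

    # One wildcard for the subdirectory, one for the file name.
    pattern = os.path.join(music, "*", "*.mp3")
    matches = sorted(glob.glob(pattern))        # each match includes the full path

    # Strip the temporary prefix so the result is easy to read.
    found = [os.path.relpath(m, music) for m in matches]
```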
Namespaces in Python
- At any particular point in a Python program, there are several namespaces available. Each function has its own namespace, called the local namespace, which keeps track of the function’s variables, including function arguments and locally defined variables. Each module has its own namespace, called the global namespace, which keeps track of the module’s variables, including functions, classes, any other imported modules, and module-level variables and constants. And there is the built-in namespace, accessible from any module, which holds built-in functions and exceptions.
- When a line of code asks for the value of a variable x, Python will search for that variable in all the available namespaces, in order:
1. local namespace – specific to the current function or class method. If the function defines a local variable x, or has an argument x, Python will use this and stop searching.
2. global namespace – specific to the current module. If the module has defined a variable, function, or class called x, Python will use that and stop searching.
3. built-in namespace – global to all modules. As a last resort, Python will assume that x is the name of a built-in function or variable.
- If Python doesn’t find x in any of these namespaces, it gives up and raises a NameError.
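The search order can be sketched with one variable name defined at two levels. All function names here are invented, and no_such_name is deliberately undefined.

```python
x = "global"


def show_local():
    x = "local"             # found in the local namespace; the search stops here
    return x


def show_global():
    return x                # no local x, so the enclosing module-level x is used


def show_builtin():
    return len("abc")       # len is neither local nor global; it is a built-in


def show_missing():
    try:
        return no_such_name     # not found in any namespace
    except NameError:
        return "NameError"
```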
Difference between from module import and import module
- Remember the difference between from module import and import module? With import module, the module itself is imported, but it retains its own namespace, which is why you need to use the module name to access any of its functions or attributes: module.function. But with from module import, you’re actually importing specific functions and attributes from another module into your own namespace, which is why you access them directly without referencing the original module they came from.
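A minimal sketch of the two import styles, using the standard math module as the example:

```python
import math                 # the module keeps its own namespace
from math import sqrt       # sqrt is copied into the current namespace

via_module = math.sqrt(9)   # must be qualified with the module name
direct = sqrt(9)            # can be used directly

same_function = sqrt is math.sqrt   # both names refer to one function object
```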
Package in Python
- A package is a directory with the special __init__.py file in it. The __init__.py file defines the attributes and methods of the package. It doesn’t need to define anything; it can just be an empty file, but it has to exist. But if __init__.py doesn’t exist, the directory is just a directory, not a package, and it can’t be imported or contain modules or nested packages.
Unicode in Python
- Before unicode, there were separate character encoding systems for each language, each using the same numbers (0-255) to represent that language’s characters. Some languages (like Russian) have multiple conflicting standards about how to represent the same characters; other languages (like Japanese) have so many characters that they require multiple-byte character sets. Exchanging documents between systems was difficult because there was no way for a computer to tell for certain which character encoding scheme the document author had used; the computer only saw numbers, and the numbers could mean different things. Then think about trying to store these documents in the same place (like in the same database table); you would need to store the character encoding alongside each piece of text, and make sure to pass it around whenever you passed the text around. Then think about multilingual documents, with characters from multiple languages in the same document. (They typically used escape codes to switch modes; poof, you’re in Russian koi8-r mode, so character 241 means this; poof, now you’re in Mac Greek mode, so character 241 means something else. And so on.) These are the problems which unicode was designed to solve.
- To solve these problems, unicode represents each character as a 2-byte number, from 0 to 65535. Each 2-byte number represents a unique character used in at least one of the world’s languages. (Characters that are used in multiple languages have the same numeric code.) There is exactly 1 number per character, and exactly 1 character per number. Unicode data is never ambiguous. Of course, there is still the matter of all these legacy encoding systems. 7-bit ASCII, for instance, which stores English characters as numbers ranging from 0 to 127. (65 is capital “A”, 97 is lowercase “a”, and so forth.) English has a very simple alphabet, so it can be completely expressed in 7-bit ASCII. Western European languages like French, Spanish, and German all use an encoding system called ISO-8859-1 (also called “latin-1”), which uses the 7-bit ASCII characters for the numbers 0 through 127, but then extends into the 128-255 range for characters like n-with-a-tilde-over-it (241), and u-with-two-dots-over-it (252). And unicode uses the same characters as 7-bit ASCII for 0 through 127, and the same characters as ISO-8859-1 for 128 through 255, and then extends from there into characters for other languages with the remaining numbers, 256 through 65535.
- To create a unicode string instead of a regular ASCII string, add the letter “u” before the string. Note that this particular string doesn’t have any non-ASCII characters. That’s fine; unicode is a superset of ASCII (a very large superset at that), so any regular ASCII string can also be stored as unicode. When printing a string, Python will attempt to convert it to your default encoding, which is usually ASCII. (More on this in a minute.) Since this unicode string is made up of characters that are also ASCII characters, printing it has the same result as printing a normal ASCII string; the conversion is seamless, and if you didn’t know that s was a unicode string, you’d never notice the difference.
- The real advantage of unicode, of course, is its ability to store non-ASCII characters, like the Spanish “ñ” (n with a tilde over it). The unicode character code for the tilde-n is 0xf1 in hexadecimal (241 in decimal), which you can type like this: \xf1
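A sketch of the tilde-n example, written for Python 3, where every str is already a unicode string and the u prefix is optional. The sample text is invented. It shows the character code 241 and how the same string fares under three encodings.

```python
s = "La Pe\xf1a"                    # \xf1 is the n with a tilde over it

code = ord(s[5])                    # the numeric code of that character
latin1 = s.encode("latin-1")        # one byte per character; the tilde-n is byte 241

try:
    s.encode("ascii")               # ASCII stops at 127, so this cannot work
except UnicodeEncodeError:
    ascii_ok = False
else:
    ascii_ok = True

utf8 = s.encode("utf-8")            # UTF-8 needs two bytes for the tilde-n
```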
Command line arguments in Python
- The first thing to know about sys.argv is that it contains the name of the script you’re calling.
- Command-line arguments are separated by spaces, and each shows up as a separate element in the sys.argv list.
- Command-line flags, like --help, also show up as their own element in the sys.argv list.
- To make things even more interesting, some command-line flags themselves take arguments. For instance, here you have a flag (-m) which takes an argument (kant.xml). Both the flag itself and the flag’s argument are simply sequential elements in the sys.argv list. No attempt is made to associate one with the other; all you get is a list.
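Since sys.argv is only a flat list, pairing a flag with its argument is up to the program. The sketch below uses a hard-coded list as a stand-in for sys.argv so it can run without a command line; the script name argecho.py and the pairing loop are assumptions for illustration. The standard library's getopt module does this kind of parsing for you.

```python
# Stand-in for sys.argv as it might look after: argecho.py --help -m kant.xml
argv = ["argecho.py", "--help", "-m", "kant.xml"]

script = argv[0]            # the first element is the script name
args = argv[1:]             # everything after it came from the command line

options = {}
i = 0
while i < len(args):
    if args[i] == "-m" and i + 1 < len(args):
        options["-m"] = args[i + 1]     # -m consumes the element that follows it
        i += 2
    else:
        options[args[i]] = True         # a plain flag stands on its own
        i += 1
```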