Chapter 2. Lists and Dictionaries

728x90

Item 11: Sequence를 슬라이스 할 줄 알자

a = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']

a[:]     # ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
a[:5]    # ['a', 'b', 'c', 'd', 'e']
a[:-1]   # ['a', 'b', 'c', 'd', 'e', 'f', 'g']
a[4:]    # ['e', 'f', 'g', 'h']
a[-3:]   # ['f', 'g', 'h']
a[2:5]   # ['c', 'd', 'e']
a[2:-1]  # ['c', 'd', 'e', 'f', 'g']
a[-3:-1] # ['f', 'g']

인덱스를 생략하는 것은 처음과 끝을 의미하는 것이고, 마이너스(-)가 붙은 건 뒤에서부터 몇번째 원소인지를 의미함. 파이썬의 장점은 이런 기능을 활용해 실수를 줄일 수 있다는 것.

Slicing의 결과로 만들어진 list는 완전히 새로운 list이므로 original list 와는 별개임. 따라서 슬라이싱한 리스트를 수정해도 본래 리스트에는 영향을 주지 않는다.

Item6에서 봤던 multiple assignment에서처럼 a, b = c[:2] 도 가능하지만, list의 흥미로운 특징은 꼭 assign할 때 길이가 같지 않아도 된다는 것이다.

print('Before ', a)
a[2:7] = [99, 22, 14]
print('After ', a)

>>>
Before ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
After ['a', 'b', 99, 22, 14, 'h']

위처럼 2번 인덱스부터 6번 인덱스까지 5개의 인덱스를 [99, 22, 14]로 대체했으므로 0, 1, 7번 인덱스는 보존된다.

print('Before ', a)
a[2:3] = [47, 11]
print('After ', a)

>>>
Before ['a', 'b', 99, 22, 14, 'h']
After ['a', 'b', 47, 11, 22, 14, 'h']

반대로, 위의 예처럼 늘어날 수도 있다. 근데 위의 예를 보니 a[2:3]이 2번째 인덱스만을 가리키므로 a[2]을 써도 된다고 생각할 수도 있지만, 그렇게 되면 a[2]에 [47, 11] 리스트가 원소로 들어가는 것이므로 ['a', 'b', [47, 11], 22, 14, 'h']가 될 것이다. 따라서 a[2:3] 과 같이 표현을 해주어야 slice assignment가 이루어진다.

b = a[:]

와 같이 인덱스를 앞뒤 모두 생략하면 deep copy를 해주는 것이다. 따라서 b와 a는 완전히 다르다.

b = a
print('Before a', a)
print('Before b', b)
a[:] = [101, 102, 103]
assert a is b         # Still the same list object
print('After a ', a)  # Now has different contents
print('After b ', b)  # Same list, so same contents as a

>>>
Before a ['a', 'b', 47, 11, 22, 14, 'h']
Before b ['a', 'b', 47, 11, 22, 14, 'h']
After a [101, 102, 103]
After b [101, 102, 103]

근데 위처럼 대입연산자를 이용하게 되면 shallow copy 이므로 a 와 b는 같은 list object를 가리키게 된다. 따라서 둘 중 하나에 변화가 일어나면 둘 다 영향을 받게 된다.

Item 12: Single expression에서 stride와 slicing 사용은 피하자.

파이썬에서는 somelist[start:end:stride] 의 형식으로 stride를 제공하는데, stride 기능이 유용하게 사용될 때도 있음.

x = ['red', 'orange', 'yellow', 'green', 'blue', 'purple']
odds = x[::2]
evens = x[1::2]
print(odds)
print(evens)

>>>
['red', 'yellow', 'blue']
['orange', 'green', 'purple']

위처럼 홀수, 짝수 번의 index에 있는 원소를 추출해낼 때와 같은 경우 유용하게 사용되기도 한다.

대부분의 byte와 str 형식에서 stride가 -1인 경우 reverse한 결과값을 내므로 많이 사용을 하지만, UTF-8 byte string의 경우에는

위와 같이 뜻하지 않은 결과를 초래하기도 한다. 물론 -1을 stride로 사용하는 경우 유용하다고 할 수도 있지만, 2::2, -2::-2 -2:2:-2, 2:2:-2 와 같이 혼란을 줄 수 있는 경우도 있기 때문에 가급적 사용을 피하는 것이 좋다.

만약 꼭 stride를 써야겠다면 두 단계로 나눠서 쓰기를 필자는 권함.

y = x[::2]   # ['a', 'c', 'e', 'g']
z = y[1:-1]  # ['c', 'e']

하지만 위와 같이 striding 후에 slicing을 하는 것은 별도의 shallow copy를 만들어야 하기 때문에 가급적이면 이와 같은 방식보다는 intertools 라는 빌트인 모듈 사용을 권장.

Item 13: 슬라이싱에 Catch-All Unpacking을 선호하자.

Basic unpacking의 한계점은 unpacking할 때 sequence의 길이를 알고있어야 한다는 것이다.

car_ages = [0, 9, 4, 8, 7, 20, 19, 1, 6, 15]
car_ages_descending = sorted(car_ages, reverse=True)
oldest, second_oldest = car_ages_descending

>>>
Traceback ...
ValueError: too many values to unpack (expected 2)

위의 코드에서 oldest와 second_oldest에 첫번째, 두번째 값을 넣고 싶지만 뒤에 나머지 원소들을 처리할 수 없기 때문에 에러가 난다. 물론 indexing을 이용해서 [0], [1], [2:]로 슬라이싱을 할 수도 있겠지만 좀 어수선해 보이고, 파이썬에서는 더 좋은 기능을 제공한다. 바로 starred expression 이라는 기능이다.

oldest, second_oldest, *others = car_ages_descending
print(oldest, second_oldest, others)

>>>
20 19 [15, 9, 8, 7, 6, 4, 1, 0]

위처럼 하면 others에 나머지 값들이 대입된다.

oldest, *others, youngest = car_ages_descending
print(oldest, youngest, others)

*others, second_youngest, youngest = car_ages_descending
print(youngest, second_youngest, others)

>>>
20 0 [19, 15, 9, 8, 7, 6, 4, 1]
0 1 [20, 19, 15, 9, 8, 7, 6, 4]

위처럼 순서를 바꿀 수도 있음.

*others = car_ages_descending

>>>
Traceback ...
SyntaxError: starred assignment target must be in a list or tuple

하지만 위처럼 starred 변수가 아닌 일반 변수가 하나도 없으면 에러가 나고,

first, *middle, *second_middle, last = [1, 2, 3, 4]

>>>
Traceback ...
SyntaxError: two starred expressions in assignment

두 개 이상의 starred 변수가 있어도 에러가 난다. 대부분 두 개 이상의 starred 변수가 있으면 안되고, 또 필자는 추천하지 않지만 아래와 같이

car_inventory = {
 'Downtown': ('Silver Shadow', 'Pinto', 'DMC'),
 'Airport': ('Skyline', 'Viper', 'Gremlin', 'Nova'),
}

((loc1, (best1, *rest1)),
 (loc2, (best2, *rest2))) = car_inventory.items()
 
print(f'Best at {loc1} is {best1}, {len(rest1)} others')
print(f'Best at {loc2} is {best2}, {len(rest2)} others')

>>>
Best at Downtown is Silver Shadow, 2 others
Best at Airport is Skyline, 3 others

다른 part에 있으면 적용이 되기도 한다.

csv파일 값을 추출할 때도 유용하게 사용되기도 한다. 예를 들어

def generate_csv():
 yield ('Date', 'Make' , 'Model', 'Year', 'Price')
 ...

all_csv_rows = list(generate_csv())
header = all_csv_rows[0]
rows = all_csv_rows[1:]
print('CSV Header:', header)
print('Row count: ', len(rows))

>>>
CSV Header: ('Date', 'Make', 'Model', 'Year', 'Price')
Row count: 200

위처럼 쓸 수도 있지만, 인덱스를 쓰면 불필요하게 여러 줄을 써야 하고 어수선해보이므로 아래와 같이 쓰는 것이 좋다.

it = generate_csv()
header, *rows = it
print('CSV Header:', header)
print('Row count: ', len(rows))

>>>
CSV Header: ('Date', 'Make', 'Model', 'Year', 'Price')
Row count: 200

다만 필자는 starred expression의 결과는 항상 list가 되기 때문에 메모리적 관점에서 위험할 수 있으므로 데이터가 메모리에 맞을 때만 사용할 것을 권장하고 있다

Item 14: key 파라미터를 이용해서 복잡한 기준에 대해서도 정렬을 할 수 있다.

기본 sort는 오름차순 정렬이며, 거의 대부분의 built-in type에 대해서는 상응하는 기준에 따라 정렬이 가능하다.

하지만 아래의 Tool 이라는 두 가지의 변수를 가진 class를 생각해보자.

class Tool:
    def __init__(self, name, weight):
        self.name = name
        self.weight = weight
 
    def __repr__(self):
        return f'Tool({self.name!r}, {self.weight})'
        
tools = [
    Tool('level', 3.5),
    Tool('hammer', 1.25),
    Tool('screwdriver', 0.5),
    Tool('chisel', 0.25),
]

위의 경우 Tool 이라는 class로 이루어진 tools 리스트를 정렬하려고 하면,

tools.sort()

>>>
Traceback ...
TypeError: '<' not supported between instances of 'Tool' and 'Tool'

위와 같이 에러가 발생한다. 그 이유는 서로 다른 Tool 인스턴스 간에 무엇이 크고 작은지 비교하는 기준이 없기 때문이다.

이 때, 정렬을 할 수 있도록 기준을 넘겨주게 되는데 그 인자가 바로 key parameter이다. key 인자로 넘겨주는 function으로 크고 작음을 결정할 수 있는데, 이 때 lambda 를 사용한다. 아래 예를 살펴보자.

print('Unsorted:', repr(tools))
tools.sort(key=lambda x: x.name)
print('\nSorted: ', tools)

>>>
Unsorted: [Tool('level', 3.5),
	   Tool('hammer', 1.25),
	   Tool('screwdriver', 0.5),
	   Tool('chisel', 0.25)]
           
Sorted: [Tool('chisel', 0.25),
 	 Tool('hammer', 1.25),
 	 Tool('level', 3.5),
	 Tool('screwdriver', 0.5)]

이처럼 key function으로 lambda x: x.name을 넘겨주게 되면 name을 기준으로 sorting을 진행하게 되고,

tools.sort(key=lambda x: x.weight)
print('By weight:', tools)

>>>
By weight: [Tool('chisel', 0.25),
  	    Tool('screwdriver', 0.5),
 	    Tool('hammer', 1.25),
	    Tool('level', 3.5)]

lambda x: x.weight 은 weight을 기준으로 sorting을 진행한다.

뿐만 아니라 변수가 아닌 함수를 집어넣을 수도 있다.

places = ['home', 'work', 'New York', 'Paris']
places.sort()
print('Case sensitive: ', places)
places.sort(key=lambda x: x.lower())
print('Case insensitive:', places)

>>>
Case sensitive: ['New York', 'Paris', 'home', 'work']
Case insensitive: ['home', 'New York', 'Paris', 'work']

위처럼 문자열을 소문자로 바꾼 후, 정렬을 하도록 만들 수도 있다.

앞에서는 기준이 하나인 경우만을 살펴보았는데, 만약에 여러 개의 기준을 통해서 정렬을 하고 싶을 때는 어떻게 해야할지 궁금할 것이다. 이 때, tuple을 이용할 수가 있다.

power_tools.sort(key=lambda x: (x.weight, x.name))
print(power_tools)

>>>
[Tool('drill', 4),
 Tool('sander', 4),
 Tool('circular saw', 5),
 Tool('jackhammer', 40)]

위의 코드를 보면, weight를 기준으로 먼저 정렬이 되고, 만약 같은 weight이라면 name을 기준으로 정렬이 되도록 한 코드이다. 다만 이렇게 tuple을 이용할 경우에는 weight, name의 두 변수에 대해서 모두 같은 방향으로 크기를 정해야한다는 것이다. 즉, weight도 오름차순, name도 오름차순으로 해야하고, 반대로 reverse=True 를 넣는다고 하면 둘 다 내림차순으로 정렬이 된다는 것이다.

Numerical한 type의 경우에는 마이너스를 붙여서 해당 변수는 다른 방향으로 sorting이 되도록 할 수가 있다.

power_tools.sort(key=lambda x: (-x.weight, x.name))
print(power_tools)

>>>
[Tool('jackhammer', 40),
 Tool('circular saw', 5),
 Tool('drill', 4),
 Tool('sander', 4)]

위의 코드의 예를 보면, weight에 대해서는 내림차순으로 정렬이 된 것을 확인할 수 있다. 다만 위의 방식이 모두 적용되는 것은 아니다.

power_tools.sort(key=lambda x: (x.weight, -x.name), reverse=True)
 
>>>
Traceback ...
TypeError: bad operand type for unary -: 'str'

위처럼 str의 경우에는 마이너스(negation)가 적용이 되지 않는다.

따라서 그냥 원하는 기준 순서대로 차례대로 sorting 하는 것이 그 해결책이다.

power_tools.sort(key=lambda x: x.name) # Name ascending
power_tools.sort(key=lambda x: x.weight, reverse=True) # Weight descending
print(power_tools)

>>>
[Tool('jackhammer', 40),
 Tool('circular saw', 5),
 Tool('drill', 4),
 Tool('sander', 4)]

Item 15: dict 삽입 순서에 의존할 때는 조심하자

필자는 dictionary의 iteration 순서는 삽입순서랑 다르다고 말하는데, 이는 파이썬 3.6 버전 이후부터는 삽입 순서대로 iterate 하도록 바꼈음.

def my_func(**kwargs):
    for key, value in kwargs.items():
    print(f'{key} = {value}')
    
my_func(goose='gosling', kangaroo='joey')

>>>
goose = gosling
kangaroo = joey

뿐만 아니라 parameter로 **kwargs(모든 paramter를 키워드=값 형태로 전달하고 이는 딕셔너리 {키워드:값} 형태로 전달이 됨)를 사용해도 이제는 삽입 순서대로 출력이 됨.

Class의 경우도 마찬가지다. 아래의 예를 보자.

# Python 3.5
class MyClass:
    def __init__(self):
        self.alligator = 'hatchling'
        self.elephant = 'calf'
        
a = MyClass()
for key, value in a.__dict__.items():
    print('%s = %s' % (key, value))
    
>>>
elephant = calf
alligator = hatchling

이전의 파이썬 버전에서는 instance field가 alligator, elephant 순서대로 선

언이 되었는데, 출력은 랜덤으로 되었다.

>>>
alligator = hatchling
elephant = calf

하지만 지금은 위와 같은 결과가 나온다.

하지만, 딕셔너리를 다룰 때는 꼭 삽입 순서가 지켜진다고 생각해서는 안된다.

아래와 같은 예를 보자.

votes = {
    'otter': 1281,
    'polar bear': 587,
    'fox': 863,
}

def populate_ranks(votes, ranks):
    names = list(votes.keys())
    names.sort(key=votes.get, reverse=True)
    for i, name in enumerate(names, 1):
        ranks[name] = i

def get_winner(ranks):
    return next(iter(ranks))
    
ranks = {}
populate_ranks(votes, ranks)
print(ranks)
winner = get_winner(ranks)
print(winner)

>>>
{'otter': 1, 'fox': 2, 'polar bear': 3}
otter

위에서는 투표의 표 순서대로(내림차순) sort를 해서 출력을 하는 것이 목적이었다.

만약, rank order가 아니라 alphabetical order로 출력을 해야한다고 가정해보자.

이를 위해서는 alphabetical order로 contents를 interate할 수 있는 dictionary-like class를 정의하기 위해 collections.abc라는 빌트인 모듈을 사용해야 한다.

from collections.abc import MutableMapping

class SortedDict(MutableMapping):
    def __init__(self):
        self.data = {}
        
    def __getitem__(self, key):
        return self.data[key]
        
    def __setitem__(self, key, value):
        self.data[key] = value
        
    def __delitem__(self, key):
        del self.data[key]
        
    def __iter__(self):
        keys = list(self.data.keys())
        keys.sort()
        for key in keys:
            yield key
            
    def __len__(self):
        return len(self.data)

이렇게 선언한 SortedDict 클래스의 instance를 이용하면 에러없이 사용을 할 수 있다. 하지만 결과는 우리가 예상했던 것과 다르다.

sorted_ranks = SortedDict()
populate_ranks(votes, sorted_ranks)
print(sorted_ranks.data)
winner = get_winner(sorted_ranks)
print(winner)

>>>
{'otter': 1, 'fox': 2, 'polar bear': 3}
fox

원래대로라면 {'fox': 1, 'otter': 2, 'plar bear': 3}이 되어야 하지만, 여기서는 get_winner 의 구현 때문에 문제가 발생한다. get_winner 는populate_ranks와 매칭하기 위해서 dictionary의 iteration을 삽입 순서로 진행하기 때문이다.

이 문제를 해결하기 위해서는 3가지 방법이 존재한다.

첫 번째, get_winner 함수가 더 이상 ranks 딕셔너리에 어떠한 iteration order를 가진다고 assume 하지 않게하는 것이다.

def get_winner(ranks):
    for name, rank in ranks.items():
        if rank == 1:
            return name
            
winner = get_winner(sorted_ranks)
print(winner)

>>>
otter

두 번째, 함수의 최상단에 explicit check을 넣어서 ranks의 type을 ensure 하는 것이다. (아니면 exception 발생)

def get_winner(ranks):
    if not isinstance(ranks, dict):
        raise TypeError('must provide a dict instance')
    return next(iter(ranks))
    
get_winner(sorted_ranks)

>>>
Traceback ...
TypeError: must provide a dict instance

세 번째, get_winner에 전달되는 value가 dict instance이도록 type annotation을 사용하는 것이다.

from typing import Dict, MutableMapping

def populate_ranks(votes: Dict[str, int], ranks: Dict[str, int]) -> None:
    names = list(votes.keys())
    names.sort(key=votes.get, reverse=True)
    for i, name in enumerate(names, 1):
        ranks[name] = i
        
def get_winner(ranks: Dict[str, int]) -> str:
    return next(iter(ranks))
    
class SortedDict(MutableMapping[str, int]):
    ...
    
votes = {
    'otter': 1281,
    'polar bear': 587,
    'fox': 863,
}

sorted_ranks = SortedDict()
populate_ranks(votes, sorted_ranks)
print(sorted_ranks.data)
winner = get_winner(sorted_ranks)
print(winner)

$ python3 -m mypy --strict example.py
.../example.py:48: error: Argument 2 to "populate_ranks" has 
➥incompatible type "SortedDict"; expected "Dict[str, int]"
.../example.py:50: error: Argument 1 to "get_winner" has 
➥incompatible type "SortedDict"; expected "Dict[str, int]"

Item 16: Missing Dictionary Keys를 다룰 때 in과 keyError 보다 get을 선호할 것

딕셔너리의 가장 기본적인 operation은 key와 value를 accessing, assigning, 그리고 deleting 하는 것이다. 하지만 딕셔너리의 contents는 굉장히 dynamic 하기 때문에 이런 operation을 하려고 할 때, key가 존재하지 않을 수도 있다.

예를 들어, 샌드위치 샵에서 메뉴를 고안하기 위해 사람들이 가장 좋아하는 빵이 뭔지 알고싶다고 하자.

counters = {
    'pumpernickel': 2,
    'sourdough': 1,
}

새로운 vote에 대해서 counter의 값을 늘리고 싶을 때 먼저 해당 key가 존재하는지를 알아야 하고, 만약 없다면 default값을 0으로 하고 key를 삽입한다. 이 과정에서 key를 accessing 하는 과정이 2번, assigning 과정이 1번 일어난다.

만약 in과 keyError 를 쓰는 경우를 생각해보자.

key = 'wheat'

if key in counters:
    count = counters[key]
else:
    count = 0
    
counters[key] = count + 1

try:
    count = counters[key]
except KeyError:
    count = 0
    
counters[key] = count + 1

하지만 위와 같은 예보다 get을 쓰는 것이 훨씬 좋은 방법이다.

count = counters.get(key, 0)
counters[key] = count + 1

보기에도 simple하고 또 access와 assignment 과정이 1번 밖에 일어나지 않는다. 또, dictionary의 value가 list와 같이 복잡한 type일 경우에도 마찬가지로 get을 쓰는 것이 좋다는 게 필자의 의견이다.

dict 타입은 뿐만 아니라 setdefault라는 method를 지원을 한다. 근데 필자는 이 setdefault 또한 문제가 있다고 설명한다. setdefault에 전달된 default value는 dictionary에 directly하게 assign 되기 때문에 performance 적으로 overhead가 발생할 수 있기 때문이다.

Item 17: Internal state 내의 missing item을 다룰 때는 defaultdict를 setdefault 보다 선호할 것

get이 가장 좋은 option이 될 때가 많지만 간혹 setdefault가 필요할 수도 있다. 다음의 예를 보자.

visits = {
    'Mexico': {'Tulum', 'Puerto Vallarta'},
    'Japan': {'Hakone'},
}

도시의 이름이 이미 dictionary에 있어도 새로운 도시를 set에 추가하기 위해서 setdefault를 사용할 수 있다. 이 예에서는 get을 쓰는 것보다 setdefault를 쓰는 게 더 간결할 것이다.

visits.setdefault('France', set()).add('Arles') # Short

if (japan := visits.get('Japan')) is None: # Long
    visits['Japan'] = japan = set()
japan.add('Kyoto')

print(visits)

>>>
{'Mexico': {'Tulum', 'Puerto Vallarta'},
 'Japan': {'Kyoto', 'Hakone'},
 'France': {'Arles'}}

그렇다면, 만약 dictionary의 creation을 컨트롤하고 싶을 때는 어떻게 해야 할까?

class Visits:
    def __init__(self):
        self.data = {}
        
    def add(self, country, city):
        city_set = self.data.setdefault(country, set())
        city_set.add(city)

위의 class는 setdefault를 호출할 때의 복잡함을 감추고 프로그래머에게 좀 더 좋은 interface를 제공한다.

visits = Visits()
visits.add('Russia', 'Yekaterinburg')
visits.add('Tanzania', 'Zanzibar')
print(visits.data)

>>>
{'Russia': {'Yekaterinburg'}, 'Tanzania': {'Zanzibar'}}

하지만 이 add 메소드는 ideal 하지 않다. 여전히 sedefault 메소드의 이름은 confusing하고, new reader 들에게 코드를 즉각적으로 이해하는 데 어려움을 주기 때문이다. 뿐만 아니라 구현 또한 그닥 효율적이지 않다. 주어진 나라가 이미 dictionary의 data에 존재한다면, 매번 호출 때마다 새로운 set instance를 생성해야하기 때문이다.

collections 빌트인 모듈의 defaultdict class는 이런 문제점을 해결해준다.

from collections import defaultdict

class Visits:
    def __init__(self):
        self.data = defaultdict(set)
        
    def add(self, country, city):
        self.data[country].add(city)
        
visits = Visits()
visits.add('England', 'Bath')
visits.add('England', 'London')
print(visits.data)

>>>
defaultdict(<class 'set'>, {'England': {'London', 'Bath'}})

Item 18: missing 를 사용해 Key-Dependent default value를 생성할 줄 알 것

setdefault와 defaultdict가 둘 다 적절하지 않은 때가 있다.

예를 들어, 만약 내가 파일 시스템에 social network profile 사진을 관리하는 프로그램을 짠다고 생각해보자. 그렇다면 file handle을 열어서 image를 읽고 쓸 수 있도록 프로필 사진의 경로명을 mapping 하는 딕셔너리가 필요하다.

pictures = {}
path = 'profile_1234.png'

if (handle := pictures.get(path)) is None:
    try:
        handle = open(path, 'a+b')
    except OSError:
        print(f'Failed to open path {path}')
        raise
    else:
        pictures[path] = handle
        
handle.seek(0)
image_data = handle.read()

만약 file handle 이 딕셔너리에 이미 존재한다면, 위의 코드는 한 번의 dictionary access만 일어난다. 존재하지 않는다면, get을 써서 access가 한 번만 일어나도록 할 수 있다.

in과 keyError를 쓸 수도 있지만 여러 문제점을 야기할 수 있다고 필자는 주장한다.

__missing__ 이라는 special method를 사용하면 이런 문제를 해결할 수 있다.

class Pictures(dict):
    def __missing__(self, key):
        value = open_picture(key)
        self[key] = value
            return value
            
pictures = Pictures()
handle = pictures[path]
handle.seek(0)
image_data = handle.read()

/ /문제제기 및 피드백 언제든지 감사히 받겠습니다.