{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"provenance":[],"mount_file_id":"1kutDdvLwMbalcFlH8yapowLGb6D2G-Ph","authorship_tag":"ABX9TyMmdSTimz63aCvmfe6P18ln"},"kernelspec":{"name":"python3","display_name":"Python 3"}},"cells":[{"cell_type":"markdown","metadata":{"id":"7XkqASKKOgcp"},"source":["# приклад"]},{"cell_type":"markdown","metadata":{"id":"xu6_5u4IOnd8"},"source":["Дані цього блокноту дані отримані з веб-сайту: http://www.ssa.gov/oact/babynames/limits.html\n","\n","Дані завантажено та збережено у файлі **yob2010.txt**.\n","\n","Набір даних містить кортежі з такою структурою: (ім’я, стать, кількість народжених).\n","\n","Файл даних yob2010.txt слід зберігати в тому ж каталозі, щоб працював наступний код. Якщо у вас немає цього файлу даних, ви можете завантажити його з порталу\n"]},{"cell_type":"markdown","metadata":{"id":"Hau5-ELqOysv"},"source":["# Import numpy library\n","\n","Наступна команда імпортує бібліотеку numpy або встановлює, якщо вона ще не встановлена."]},{"cell_type":"code","source":[" import numpy as np\n"," import pandas as pd\n"],"metadata":{"id":"jdZwjqqJS3wV","executionInfo":{"status":"ok","timestamp":1707718008002,"user_tz":-120,"elapsed":454,"user":{"displayName":"Galina Marchuk","userId":"08159968800729403674"}}},"execution_count":1,"outputs":[]},{"cell_type":"code","metadata":{"id":"ryM-ikNFOdYL"},"source":["try:\n"," import numpy as np\n","except:\n"," !pip install numpy\n"," import numpy as np"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"9l463PMtO67_"},"source":["# Import pandas library"]},{"cell_type":"code","metadata":{"id":"YfccnTJYO-zJ"},"source":["try:\n"," import pandas as pd\n","except:\n"," !pip install pandas\n"," import pandas as pd"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"94M995Z2PDYm"},"source":["# Load data\n","\n","Використовуємо **read_csv pandas**, щоб прочитати файл даних у фреймі даних pandas. Аргумент names визначає імена стовпців для імпортованих даних. Функція **head()** покаже перші 5 рядків кадру даних."]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":206},"id":"AAi-gpdfPRrs","executionInfo":{"status":"ok","timestamp":1707718269214,"user_tz":-120,"elapsed":749,"user":{"displayName":"Galina Marchuk","userId":"08159968800729403674"}},"outputId":"107a8f44-03c1-469a-c7d7-e173dc8d662a"},"source":["names2010DF = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Лекції 2022-2023/yob2010.txt', names=['name','sex','births'])\n","names2010DF.head()"],"execution_count":2,"outputs":[{"output_type":"execute_result","data":{"text/plain":[" name sex births\n","0 Isabella F 22925\n","1 Sophia F 20648\n","2 Emma F 17354\n","3 Olivia F 17030\n","4 Ava F 15436"],"text/html":["\n","
\n","
\n","\n","
\n"," \n","
\n","
\n","
name
\n","
sex
\n","
births
\n","
\n"," \n"," \n","
\n","
0
\n","
Isabella
\n","
F
\n","
22925
\n","
\n","
\n","
1
\n","
Sophia
\n","
F
\n","
20648
\n","
\n","
\n","
2
\n","
Emma
\n","
F
\n","
17354
\n","
\n","
\n","
3
\n","
Olivia
\n","
F
\n","
17030
\n","
\n","
\n","
4
\n","
Ava
\n","
F
\n","
15436
\n","
\n"," \n","
\n","
\n","
\n","\n","
\n"," \n","\n"," \n","\n"," \n","
\n","\n","\n","
\n"," \n","\n","\n","\n"," \n","
\n","
\n","
\n"]},"metadata":{},"execution_count":2}]},{"cell_type":"markdown","metadata":{"id":"AZUcZ5urPY-n"},"source":["# Створити новий стовпець\n","\n","Змінимо існуючий фрейм даних, додавши стовпці з пропорцією. Значення в цьому стовпчику буде означати частку немовлят, народжених з іменем у рядку, щодо загальної кількості народжених.\n"," Створимо функцію, яка визначатиме частку загальної кількості народжених.\n","\n","Функція **.astype(float)** передає значення у стовпець народження і додає значення до змінної births. Потім створюється новий стовпець з назвою **prop**."]},{"cell_type":"code","metadata":{"id":"21UQEukHPtSR","executionInfo":{"status":"ok","timestamp":1707718273203,"user_tz":-120,"elapsed":305,"user":{"displayName":"Galina Marchuk","userId":"08159968800729403674"}}},"source":["def add_prop(group):\n"," births = group.births.astype(float)\n"," group['prop'] = births / births.sum()\n"," return group"],"execution_count":3,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"fx8_6fqgPwTJ"},"source":["Застосувати функцію **add_prop** до фрейму даних. Згрупуємо за статтю за допомогою функції **groupby**, а потім застосуємо функцію **apply**"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":387},"id":"_YzWGZ-MP1P1","executionInfo":{"status":"ok","timestamp":1707718276500,"user_tz":-120,"elapsed":286,"user":{"displayName":"Galina Marchuk","userId":"08159968800729403674"}},"outputId":"edcf6a27-7442-4aeb-f49d-318bf3b28264"},"source":["names2010DF = names2010DF.groupby(['sex']).apply(add_prop)\n","names2010DF.head()"],"execution_count":4,"outputs":[{"output_type":"stream","name":"stderr","text":[":1: FutureWarning: Not prepending group keys to the result index of transform-like apply. In the future, the group keys will be included in the index, regardless of whether the applied function returns a like-indexed object.\n","To preserve the previous behavior, use\n","\n","\t>>> .groupby(..., group_keys=False)\n","\n","To adopt the future behavior and silence this warning, use \n","\n","\t>>> .groupby(..., group_keys=True)\n"," names2010DF = names2010DF.groupby(['sex']).apply(add_prop)\n"]},{"output_type":"execute_result","data":{"text/plain":[" name sex births prop\n","0 Isabella F 22925 0.012907\n","1 Sophia F 20648 0.011625\n","2 Emma F 17354 0.009770\n","3 Olivia F 17030 0.009588\n","4 Ava F 15436 0.008690"],"text/html":["\n","
\n","
\n","\n","
\n"," \n","
\n","
\n","
name
\n","
sex
\n","
births
\n","
prop
\n","
\n"," \n"," \n","
\n","
0
\n","
Isabella
\n","
F
\n","
22925
\n","
0.012907
\n","
\n","
\n","
1
\n","
Sophia
\n","
F
\n","
20648
\n","
0.011625
\n","
\n","
\n","
2
\n","
Emma
\n","
F
\n","
17354
\n","
0.009770
\n","
\n","
\n","
3
\n","
Olivia
\n","
F
\n","
17030
\n","
0.009588
\n","
\n","
\n","
4
\n","
Ava
\n","
F
\n","
15436
\n","
0.008690
\n","
\n"," \n","
\n","
\n","
\n","\n","
\n"," \n","\n"," \n","\n"," \n","
\n","\n","\n","
\n"," \n","\n","\n","\n"," \n","
\n","
\n","
\n"]},"metadata":{},"execution_count":4}]},{"cell_type":"markdown","metadata":{"id":"uGWpwqx9P3tI"},"source":["Check prop column\n","\n","Перевіримо стовпець prop.Переконаємося, що значення в стовпці prop складають 1 у всіх групах за допомогою бібліотеки numpy."]},{"cell_type":"code","metadata":{"id":"mbQs8E7TQAUQ","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1707718301659,"user_tz":-120,"elapsed":273,"user":{"displayName":"Galina Marchuk","userId":"08159968800729403674"}},"outputId":"be2f6199-c125-46c4-d8ba-efe1043e8c21"},"source":["np.allclose(names2010DF.groupby('sex').prop.sum(),1)"],"execution_count":5,"outputs":[{"output_type":"execute_result","data":{"text/plain":["True"]},"metadata":{},"execution_count":5}]},{"cell_type":"markdown","metadata":{"id":"dKxy2J65P9Jr"},"source":["З набору даних виберемо 10 найпоширеніших імен для кожної статі. Створимо функцію, яка вибирае 10 найпоширеніших імен для кожної статі."]},{"cell_type":"code","metadata":{"id":"JeIW7kmVQHc7","executionInfo":{"status":"ok","timestamp":1707718304978,"user_tz":-120,"elapsed":287,"user":{"displayName":"Galina Marchuk","userId":"08159968800729403674"}}},"source":["def get_top10(group):\n"," return group.sort_values(by='births', ascending=False)[:10]"],"execution_count":6,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"ufAfwusjQMB8"},"source":["Застосуємо функцію до фрейму даних, щоб створити новий фрейм даних під назвою **top10**, що містить 10 найменувань для кожної статі."]},{"cell_type":"code","metadata":{"id":"y_di2gBRQLq6","colab":{"base_uri":"https://localhost:8080/","height":708},"executionInfo":{"status":"ok","timestamp":1707718307728,"user_tz":-120,"elapsed":288,"user":{"displayName":"Galina Marchuk","userId":"08159968800729403674"}},"outputId":"b9ebf1d3-d6c0-44d4-a238-5932c2935caf"},"source":["top10 = names2010DF.groupby(['sex']).apply(get_top10)\n","top10"],"execution_count":7,"outputs":[{"output_type":"execute_result","data":{"text/plain":[" name sex births prop\n","sex \n","F 0 Isabella F 22925 0.012907\n"," 1 Sophia F 20648 0.011625\n"," 2 Emma F 17354 0.009770\n"," 3 Olivia F 17030 0.009588\n"," 4 Ava F 15436 0.008690\n"," 5 Emily F 14278 0.008038\n"," 6 Abigail F 14250 0.008023\n"," 7 Madison F 13189 0.007425\n"," 8 Chloe F 11757 0.006619\n"," 9 Mia F 10644 0.005992\n","M 19823 Jacob M 22139 0.011546\n"," 19824 Ethan M 18008 0.009392\n"," 19825 Michael M 17366 0.009057\n"," 19826 Jayden M 17191 0.008966\n"," 19827 William M 17060 0.008897\n"," 19828 Alexander M 16759 0.008740\n"," 19829 Noah M 16463 0.008586\n"," 19830 Daniel M 15853 0.008268\n"," 19831 Aiden M 15544 0.008107\n"," 19832 Anthony M 15501 0.008084"],"text/html":["\n","
\n","
\n","\n","
\n"," \n","
\n","
\n","
\n","
name
\n","
sex
\n","
births
\n","
prop
\n","
\n","
\n","
sex
\n","
\n","
\n","
\n","
\n","
\n","
\n"," \n"," \n","
\n","
F
\n","
0
\n","
Isabella
\n","
F
\n","
22925
\n","
0.012907
\n","
\n","
\n","
1
\n","
Sophia
\n","
F
\n","
20648
\n","
0.011625
\n","
\n","
\n","
2
\n","
Emma
\n","
F
\n","
17354
\n","
0.009770
\n","
\n","
\n","
3
\n","
Olivia
\n","
F
\n","
17030
\n","
0.009588
\n","
\n","
\n","
4
\n","
Ava
\n","
F
\n","
15436
\n","
0.008690
\n","
\n","
\n","
5
\n","
Emily
\n","
F
\n","
14278
\n","
0.008038
\n","
\n","
\n","
6
\n","
Abigail
\n","
F
\n","
14250
\n","
0.008023
\n","
\n","
\n","
7
\n","
Madison
\n","
F
\n","
13189
\n","
0.007425
\n","
\n","
\n","
8
\n","
Chloe
\n","
F
\n","
11757
\n","
0.006619
\n","
\n","
\n","
9
\n","
Mia
\n","
F
\n","
10644
\n","
0.005992
\n","
\n","
\n","
M
\n","
19823
\n","
Jacob
\n","
M
\n","
22139
\n","
0.011546
\n","
\n","
\n","
19824
\n","
Ethan
\n","
M
\n","
18008
\n","
0.009392
\n","
\n","
\n","
19825
\n","
Michael
\n","
M
\n","
17366
\n","
0.009057
\n","
\n","
\n","
19826
\n","
Jayden
\n","
M
\n","
17191
\n","
0.008966
\n","
\n","
\n","
19827
\n","
William
\n","
M
\n","
17060
\n","
0.008897
\n","
\n","
\n","
19828
\n","
Alexander
\n","
M
\n","
16759
\n","
0.008740
\n","
\n","
\n","
19829
\n","
Noah
\n","
M
\n","
16463
\n","
0.008586
\n","
\n","
\n","
19830
\n","
Daniel
\n","
M
\n","
15853
\n","
0.008268
\n","
\n","
\n","
19831
\n","
Aiden
\n","
M
\n","
15544
\n","
0.008107
\n","
\n","
\n","
19832
\n","
Anthony
\n","
M
\n","
15501
\n","
0.008084
\n","
\n"," \n","
\n","
\n","
\n","\n","
\n"," \n","\n"," \n","\n"," \n","
\n","\n","\n","
\n"," \n","\n","\n","\n"," \n","
\n","
\n","
\n"]},"metadata":{},"execution_count":7}]},{"cell_type":"markdown","metadata":{"id":"u8VmD1lVQTcg"},"source":["Знайдемо кількість народжень із першою літерою кожного імені для кожної статі.\n","\n","*Спочатку створюємо просту вбудовану лямбда-функцію з назвою get_first_letter, яка витягує першу літеру з заданого рядка.*\n","\n","Потім прив'язуємо це до фрейму даних за допомогою функції **map** і зберігаємо в новому стовпці з назвою first_letter.\n"]},{"cell_type":"code","metadata":{"id":"5Z8eiSrhQg-f","colab":{"base_uri":"https://localhost:8080/","height":363},"executionInfo":{"status":"ok","timestamp":1707718313773,"user_tz":-120,"elapsed":284,"user":{"displayName":"Galina Marchuk","userId":"08159968800729403674"}},"outputId":"05b506c9-3162-4644-ceb5-d71ba4bae588"},"source":["get_first_letter = lambda x: x[0]\n","names2010DF['first_letter'] = names2010DF.name.map(get_first_letter)\n","names2010DF.head(10)"],"execution_count":8,"outputs":[{"output_type":"execute_result","data":{"text/plain":[" name sex births prop first_letter\n","0 Isabella F 22925 0.012907 I\n","1 Sophia F 20648 0.011625 S\n","2 Emma F 17354 0.009770 E\n","3 Olivia F 17030 0.009588 O\n","4 Ava F 15436 0.008690 A\n","5 Emily F 14278 0.008038 E\n","6 Abigail F 14250 0.008023 A\n","7 Madison F 13189 0.007425 M\n","8 Chloe F 11757 0.006619 C\n","9 Mia F 10644 0.005992 M"],"text/html":["\n","
\n","
\n","\n","
\n"," \n","
\n","
\n","
name
\n","
sex
\n","
births
\n","
prop
\n","
first_letter
\n","
\n"," \n"," \n","
\n","
0
\n","
Isabella
\n","
F
\n","
22925
\n","
0.012907
\n","
I
\n","
\n","
\n","
1
\n","
Sophia
\n","
F
\n","
20648
\n","
0.011625
\n","
S
\n","
\n","
\n","
2
\n","
Emma
\n","
F
\n","
17354
\n","
0.009770
\n","
E
\n","
\n","
\n","
3
\n","
Olivia
\n","
F
\n","
17030
\n","
0.009588
\n","
O
\n","
\n","
\n","
4
\n","
Ava
\n","
F
\n","
15436
\n","
0.008690
\n","
A
\n","
\n","
\n","
5
\n","
Emily
\n","
F
\n","
14278
\n","
0.008038
\n","
E
\n","
\n","
\n","
6
\n","
Abigail
\n","
F
\n","
14250
\n","
0.008023
\n","
A
\n","
\n","
\n","
7
\n","
Madison
\n","
F
\n","
13189
\n","
0.007425
\n","
M
\n","
\n","
\n","
8
\n","
Chloe
\n","
F
\n","
11757
\n","
0.006619
\n","
C
\n","
\n","
\n","
9
\n","
Mia
\n","
F
\n","
10644
\n","
0.005992
\n","
M
\n","
\n"," \n","
\n","
\n","
\n","\n","
\n"," \n","\n"," \n","\n"," \n","
\n","\n","\n","
\n"," \n","\n","\n","\n"," \n","
\n","
\n","
\n"]},"metadata":{},"execution_count":8}]},{"cell_type":"markdown","metadata":{"id":"NFCHujh-Qkn-"},"source":["# Зведена таблиця\n","\n","Cтворимо зведену таблицю, яка показує кількість народжень для кожної першої літери щодо статі.\n"]},{"cell_type":"code","metadata":{"id":"ukBP43oIQo0i","colab":{"base_uri":"https://localhost:8080/","height":394},"executionInfo":{"status":"ok","timestamp":1707718318716,"user_tz":-120,"elapsed":322,"user":{"displayName":"Galina Marchuk","userId":"08159968800729403674"}},"outputId":"7c58f676-4426-41f7-d61f-2d8ed0a423e1"},"source":["pTable = names2010DF.pivot_table('births', index=['first_letter'],\n"," columns=['sex'], aggfunc=sum)\n","pTable.head(10)"],"execution_count":9,"outputs":[{"output_type":"execute_result","data":{"text/plain":["sex F M\n","first_letter \n","A 312344 200701\n","B 64837 109456\n","C 97693 169966\n","D 47651 124502\n","E 120000 103530\n","F 14109 16325\n","G 49890 62442\n","H 47470 36727\n","I 47585 38124\n","J 119391 302367"],"text/html":["\n","
\n","
\n","\n","
\n"," \n","
\n","
sex
\n","
F
\n","
M
\n","
\n","
\n","
first_letter
\n","
\n","
\n","
\n"," \n"," \n","
\n","
A
\n","
312344
\n","
200701
\n","
\n","
\n","
B
\n","
64837
\n","
109456
\n","
\n","
\n","
C
\n","
97693
\n","
169966
\n","
\n","
\n","
D
\n","
47651
\n","
124502
\n","
\n","
\n","
E
\n","
120000
\n","
103530
\n","
\n","
\n","
F
\n","
14109
\n","
16325
\n","
\n","
\n","
G
\n","
49890
\n","
62442
\n","
\n","
\n","
H
\n","
47470
\n","
36727
\n","
\n","
\n","
I
\n","
47585
\n","
38124
\n","
\n","
\n","
J
\n","
119391
\n","
302367
\n","
\n"," \n","
\n","
\n","
\n","\n","
\n"," \n","\n"," \n","\n"," \n","
\n","\n","\n","
\n"," \n","\n","\n","\n"," \n","
\n","
\n","
\n"]},"metadata":{},"execution_count":9}]},{"cell_type":"markdown","metadata":{"id":"r-efjV3WQr7p"},"source":["Знайдемо загальну кількість народжених для кожної статі"]},{"cell_type":"code","metadata":{"id":"0eiFrfafQzCY","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1707718322957,"user_tz":-120,"elapsed":296,"user":{"displayName":"Galina Marchuk","userId":"08159968800729403674"}},"outputId":"cd189d95-47e4-45e1-8b74-1db84053f878"},"source":["pTable.sum()"],"execution_count":10,"outputs":[{"output_type":"execute_result","data":{"text/plain":["sex\n","F 1776223\n","M 1917416\n","dtype: int64"]},"metadata":{},"execution_count":10}]},{"cell_type":"markdown","metadata":{"id":"7N433ZyZQ1l6"},"source":["**Нормалізуэмо** результати. Результати нормуються шляхом ділення на загальну кількість імен."]},{"cell_type":"code","metadata":{"id":"ny86ZQiXQ6fN","colab":{"base_uri":"https://localhost:8080/","height":238},"executionInfo":{"status":"ok","timestamp":1707718331948,"user_tz":-120,"elapsed":317,"user":{"displayName":"Galina Marchuk","userId":"08159968800729403674"}},"outputId":"bba13dcb-7ea0-4800-9cf2-efd6772b7122"},"source":["pTable = pTable / pTable.sum().astype('float')\n","pTable.head()"],"execution_count":11,"outputs":[{"output_type":"execute_result","data":{"text/plain":["sex F M\n","first_letter \n","A 0.175847 0.104673\n","B 0.036503 0.057085\n","C 0.055000 0.088643\n","D 0.026827 0.064932\n","E 0.067559 0.053995"],"text/html":["\n","