import numpy as np
import pandas as pd
"display.max_columns", None) pd.set_option(
Producing a final-year project allocation
Assign a cohort of students their dissertations according to their grade
In this tutorial we will be solving an instance of SA using raw, tabular data.
This tutorial covers all aspects of the project allocation process, from wrangling the data, all the way through to analysing the results. In fact, this tutorial has been used to train staff at Cardiff University in performing automatic project allocation.
The data for this tutorial have been archived on Zenodo, and the source code used to generate them is here.
= pd.read_csv(
raw_students "https://zenodo.org/record/3514287/files/students.csv"
)= pd.read_csv(
raw_projects "https://zenodo.org/record/3514287/files/projects.csv"
)= pd.read_csv(
raw_supervisors "https://zenodo.org/record/3514287/files/supervisors.csv"
)
Cleaning the data
In order to use Matching’s implementation of SA, we require the data to be “clean”. This process will require us to inspect the data and remove any erroneous rows from each of our datasets.
Students
We begin with the students. The first thing we have to do is find out how many choices students were allowed to make:
raw_students.columns
Index(['name', 'rank', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10',
'11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22',
'23', '24'],
dtype='object')
We can see that students could make up to 25 selections, and the choice columns are numbered 0 through 24.
= 25
n_choices = map(str, range(n_choices)) choices
Now we want to get rid of any students that didn’t make any choices.
= (
students
raw_students.copy()=choices, how="all")
.dropna(subset=True)
.reset_index(drop
)
students.head()
name | rank | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 190000 | 3 | G2 | P1 | V2 | S0 | A0 | O0 | L0 | D2 | K1 | V1 | R2 | G2 | Y2 | G2 | W0 | K0 | X0 | O1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 190001 | 56 | Q0 | P1 | P0 | M1 | N0 | P1 | T2 | N1 | I1 | K0 | P3 | X1 | F0 | P1 | S0 | C0 | Z0 | L0 | H2 | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 190002 | 60 | D0 | U0 | Y1 | E2 | R0 | P0 | V1 | R2 | X1 | Y0 | R1 | I1 | T1 | V1 | Y2 | A1 | I2 | N1 | A0 | J2 | A0 | NaN | NaN | NaN | NaN |
3 | 190003 | 67 | C0 | J1 | I2 | N1 | W1 | E1 | M1 | Z1 | Q0 | B1 | O0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 190004 | 19 | I0 | M1 | C2 | D0 | W0 | F1 | B2 | O1 | T0 | L0 | W1 | X2 | W0 | N0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
We can see that, in fact, some students won’t be included in this game:
len(raw_students) - len(students)
3
Projects
Now that students are done, we can move onto the projects.
Each project needs a code, a supervisor and a non-zero capacity. We drop any project that is missing any of these items.
= raw_projects.copy()
projects = projects.dropna()
projects = projects[projects["capacity"] > 0]
projects
projects.head()
code | capacity | supervisor | |
---|---|---|---|
0 | A0 | 2 | A |
1 | A1 | 3 | A |
2 | A2 | 2 | A |
3 | B0 | 2 | B |
4 | B1 | 2 | B |
Verify that no projects were lost:
len(raw_projects) == len(projects)
True
Supervisors
Each supervisor requires a unique name and a non-zero capacity. We drop any supervisor without either of these things.
= raw_supervisors.copy()
supervisors = supervisors.dropna()
supervisors = supervisors[supervisors["capacity"] > 0]
supervisors
supervisors.head()
name | capacity | |
---|---|---|
0 | A | 3 |
1 | B | 1 |
2 | C | 8 |
3 | D | 5 |
4 | E | 2 |
Verify no supervisors were lost:
len(supervisors) == len(raw_supervisors)
True
Creating the dictionaries
With the now clean data, Python dictionaries must be created that we can pass to Matching. These dictionaries detail the following relationships:
- Project supervisor affiliations
- Project capacities
- Supervisor capacities
- Student preference lists
- Supervisor preference lists
Capacities and affiliations
We’ll begin with the capacity and affiliation dictionaries for projects. With both dictionaries, a project is only added if both their capacity and the capacity of their supervisor is non-zero.
= supervisors["name"].values
supervisor_names = projects["code"].values project_codes
= {}, {}
project_to_capacity, project_to_supervisor for _, (project, capacity, supervisor) in projects.iterrows():
if project in project_codes and supervisor in supervisor_names:
= supervisor
project_to_supervisor[project] = capacity project_to_capacity[project]
Now we can create the supervisor capacity dictionary. In this case, we only include a supervisor if they have at least one valid project.
= {}
supervisor_to_capacity for _, (supervisor, capacity) in supervisors.iterrows():
if supervisor in project_to_supervisor.values():
= capacity supervisor_to_capacity[supervisor]
Preference lists (dictionaries)
The only remaining dictionaries are for the preference lists of the students and the supervisors.
Let’s begin with the students.
As is discussed here, students must have a strict, unique preference of some non-empty subset of the projects. This means that if a student ranks no projects, they are excluded from the game.
.. note:: In the case where a student ranks the same project more than once, we take their highest ranking of it and shift everything else up.
= {}
student_to_preferences for _, (student, _, *prefs) in students.iterrows():
= []
student_preferences for project in prefs:
if project in project_codes and project not in student_preferences:
student_preferences.append(project)
if student_preferences:
= student_preferences student_to_preferences[student]
With that done, we’re only left with the supervisors’ preferences.
As is often the case in real-world instances of SA, each student is assigned a unique rank in their cohort. This ranking is typically done independently of the supervisors to make the game as fair as possible. In cases where “backroom” deals are allowed between supervisors and students, the overall effect is that everyone is worse off.
= students.sort_values("rank", ascending=True)["name"].values sorted_students
This complete ranking of the students allows the supervisors to draw their preference list from the same source.
As is discussed here, each supervisor must rank all (and only) those students who have ranked at least one of their projects.
If a student has ranked multiple projects from the same supervisor, they still only appear once in the supervisor’s preference list. As with students, if a supervisor’s preference list is empty (i.e. none of their projects were ranked by any student) then they are excluded from the game.
= {}
supervisor_to_preferences for supervisor in supervisor_names:
= []
supervisor_preferences = [
supervisor_projects for p, s in project_to_supervisor.items() if s == supervisor
p
]
for student in sorted_students:
= student_to_preferences[student]
student_preferences if set(student_preferences).intersection(supervisor_projects):
supervisor_preferences.append(student)
if supervisor_preferences:
= supervisor_preferences supervisor_to_preferences[supervisor]
Final clean up
We’re almost there but now we need to adjust the entries in our dictionaries that don’t make sense anymore.
Removing extra players from the game
To begin, we have to go back through supervisor_to_capacity
to remove any unranked supervisors. Likewise for unranked projects but we must also remove them from project_to_supervisor
.
It should follow that if a supervisor is unranked then none of their projects are either.
= set(supervisor_names).difference(
unranked_supervisors
supervisor_to_preferences.keys()
)
= set(project_codes).difference(
unranked_projects for prefs in student_to_preferences.values() for project in prefs)
(project
)
unranked_supervisors, unranked_projects
(set(), {'L1'})
for supervisor in unranked_supervisors:
del supervisor_to_capacity[supervisor]
for project in unranked_projects:
del project_to_capacity[project]
del project_to_supervisor[project]
Checking and adjusting capacities
The final step is to adjust the capacities. There are some stipulations on the capacities of projects and their supervisors:
Each project’s capacity must be no larger than that of its supervisor.
Each supervisor’s capacity must be:
- At least as large as its largest project’s capacity.
- No larger than the sum of its projects’ capacities.
We begin by reducing the capacities of too-large projects. This means that no supervisor will have to increase their capacity for the sake of a single project, and we don’t need to worry about the first point in (2).
for project, project_capacity in project_to_capacity.items():
= project_to_supervisor[project]
supervisor = supervisor_to_capacity[supervisor]
supervisor_capacity
if project_capacity > supervisor_capacity:
print(
f"{project} has a capacity of {project_capacity} but",
f"{supervisor} has capacity {supervisor_capacity}.",
)= supervisor_capacity project_to_capacity[project]
B0 has a capacity of 2 but B has capacity 1.
B1 has a capacity of 2 but B has capacity 1.
B2 has a capacity of 2 but B has capacity 1.
E1 has a capacity of 3 but E has capacity 2.
E2 has a capacity of 3 but E has capacity 2.
F1 has a capacity of 3 but F has capacity 2.
I1 has a capacity of 2 but I has capacity 1.
I2 has a capacity of 2 but I has capacity 1.
M0 has a capacity of 8 but M has capacity 7.
O1 has a capacity of 2 but O has capacity 1.
Q0 has a capacity of 2 but Q has capacity 1.
R0 has a capacity of 2 but R has capacity 1.
R2 has a capacity of 2 but R has capacity 1.
S1 has a capacity of 7 but S has capacity 6.
T2 has a capacity of 2 but T has capacity 1.
U0 has a capacity of 6 but U has capacity 5.
V1 has a capacity of 3 but V has capacity 2.
V2 has a capacity of 3 but V has capacity 2.
X0 has a capacity of 8 but X has capacity 7.
Now we can make sure that no supervisor is providing more spaces than they’re offering through their projects.
for supervisor, supervisor_capacity in supervisor_to_capacity.items():
= [
supervisor_projects for p, s in project_to_supervisor.items() if s == supervisor
p
]= [
supervisor_project_capacities for project in supervisor_projects
project_to_capacity[project]
]
if supervisor_capacity > sum(supervisor_project_capacities):
print(
f"{supervisor} has capacity {supervisor_capacity} but their projects",
f"{', '.join(supervisor_projects)} have a total capacity of",
f"{sum(supervisor_project_capacities)}.",
)= sum(supervisor_project_capacities) supervisor_to_capacity[supervisor]
K has capacity 8 but their projects K0, K1, K2 have a total capacity of 6.
L has capacity 2 but their projects L0 have a total capacity of 1.
W has capacity 5 but their projects W0, W1 have a total capacity of 2.
Playing the game
Now that we have formulated our players according to the rules of the game, we are ready to play it.
Matching has an implementation of the SA algorithm which takes the dictionaries we’ve built as parameters. From there, the game can be solved to be student- or supervisor-optimal. We’ll be using the former.
from matching.games import StudentAllocation
= StudentAllocation.create_from_dictionaries(
game
student_to_preferences,
supervisor_to_preferences,
project_to_supervisor,
project_to_capacity,
supervisor_to_capacity,
)
= game.solve(optimal="student")
matching assert game.check_validity()
assert game.check_stability()
matching
{A0: [], A1: [190019, 190034], A2: [190017], B0: [], B1: [190091], B2: [], C0: [190003, 190068], C1: [190070], C2: [190079, 190062], D0: [190015, 190039, 190009], D1: [190008, 190056], D2: [], E0: [], E1: [190063, 190022], E2: [], F0: [190011], F1: [190094], G0: [190096], G1: [], G2: [190000, 190045, 190077], H0: [], H1: [190078], H2: [190023, 190098], I0: [190004], I1: [], I2: [], J0: [190021], J1: [190097, 190089, 190061, 190014], J2: [190085, 190090], K0: [190074, 190095], K1: [190024, 190013], K2: [190052, 190006], L0: [190072], M0: [190086, 190053], M1: [190054, 190081, 190075], N0: [190049, 190032], N1: [190046, 190044, 190037, 190057], N2: [190031, 190060], O0: [190016], O1: [], P0: [190047], P1: [190076, 190027], P2: [190093], P3: [190071], Q0: [190001], R0: [190048], R1: [], R2: [], S0: [190069], S1: [190092, 190042, 190084, 190066, 190067], T0: [], T1: [], T2: [190036], U0: [190010, 190002, 190059, 190030], U1: [], U2: [190055], V0: [190073], V1: [], V2: [190018], W0: [190026], W1: [190058], X0: [190088, 190080, 190012], X1: [190028], X2: [190043, 190064, 190020], Y0: [190083, 190007, 190099], Y1: [190065, 190050], Y2: [190033], Z0: [190051], Z1: [190005, 190038]}
Analysis
The raw matching provided above is less than easy to read, let alone interpret. So, let’s observe the data in a different form using some visualisation tools.
from collections import Counter, defaultdict
import matplotlib.pyplot as plt
"tableau-colorblind10")
plt.style.use(%matplotlib inline
Supervisors
Using the matching.Game
instance we’ve created, we can extract the utilisation of our supervisors.
= {
supervisor_free_spaces - len(supervisor.matching)
supervisor: supervisor.capacity for supervisor in game.supervisors
}
= {
supervisor_utilisation len(supervisor.matching) / supervisor.capacity
supervisor: for supervisor in game.supervisors
}
= plt.subplots(figsize=(8, 4), dpi=300)
fig, ax
= Counter(supervisor_free_spaces.values())
data
ax.bar(data.keys(), data.values())
"No. free spaces")
ax.set_xlabel("Frequency")
ax.set_ylabel(range(max(data.keys()) + 1))
ax.set_xticks("Supervisor free spaces") ax.set_title(
Text(0.5, 1.0, 'Supervisor free spaces')
= plt.subplots(figsize=(8, 4), dpi=300)
fig, ax
= supervisor_utilisation.values()
values
ax.hist(values)
= ax.get_ylim()
ylims
ax.vlines(list(values)), *ylims, "tab:orange", "dashed", label="Mean", lw=3
np.mean(
)*ylims)
ax.set_ylim(
"Utilisation")
ax.set_xlabel("Frequency")
ax.set_ylabel("Supervisor utilisation")
ax.set_title( ax.legend()
<matplotlib.legend.Legend at 0x7fda4126fca0>
So we can see that the supervisors are almost all working at capacity. This is a good thing so long as they accurately estimated their own workloads.
Projects
We can do the same visualisation with the projects that were included in the game. This is a nice way to observe the quantity of excess projects offered to students.
= {
project_free_spaces - len(project.matching)
project.name: project.capacity for project in game.projects
}
= {
project_utilisation len(project.matching) / project.capacity
project.name: for project in game.projects
}
= plt.subplots(figsize=(8, 4), dpi=300)
fig, ax
= Counter(project_free_spaces.values())
data
ax.bar(data.keys(), data.values())
"No. free spaces")
ax.set_xlabel("Frequency")
ax.set_ylabel(range(max(data.keys()) + 1))
ax.set_xticks("Project free spaces") ax.set_title(
Text(0.5, 1.0, 'Project free spaces')
= plt.subplots(figsize=(8, 4), dpi=300)
fig, ax
= project_utilisation.values()
values
ax.hist(values)
= ax.get_ylim()
ylims
ax.vlines(list(values)), *ylims, "tab:orange", "dashed", label="Mean", lw=3
np.mean(
)*ylims)
ax.set_ylim(
"Utilisation")
ax.set_xlabel("Frequency")
ax.set_ylabel("Project utilisation")
ax.set_title( ax.legend()
<matplotlib.legend.Legend at 0x7fda715f64f0>
The conclusion to draw from this is that the majority of projects seem to be split between two groups:
- very popular (and busy) projects
- very unpopular (and are excessive) projects
Results like this would help indicate where more projects should be offered in later years.
Students
Arguably, the most valuable analysis is of the students and their matchings. To do this, we will need to “invert” the matching found by Matching.
= {}
inverted_matching = []
student_preference_of_matching for project, project_students in matching.items():
for student in project_students:
= project.name
inverted_matching[student.name]
student_preference_of_matching.append(
student._pref_names.index(project.name) )
This inverted matching can now be converted to a pandas.DataFrame
. In this form, it is easy to link students between the matching data and the original as well as being able to extract those students who remain unmatched.
= pd.DataFrame(
df_matching
{"name": list(inverted_matching.keys()),
"project_code": list(inverted_matching.values()),
"preference": student_preference_of_matching,
}
)
= df_matching.sort_values(by="name").reset_index(drop=True)
df_matching
= df_matching.set_index("name")
name_indexed_df_matching = raw_students.set_index("name")
name_indexed_raw_students
= pd.concat(
df_matching "rank"]), axis=1
(name_indexed_df_matching, name_indexed_raw_students[ ).reset_index()
= df_matching[df_matching["preference"].isnull()]
unassigned_students
unassigned_students
name | project_code | preference | rank | |
---|---|---|---|---|
93 | 190025 | NaN | NaN | 89 |
94 | 190029 | NaN | NaN | 97 |
95 | 190035 | NaN | NaN | 99 |
96 | 190040 | NaN | NaN | 98 |
97 | 190041 | NaN | NaN | 72 |
98 | 190082 | NaN | NaN | 82 |
99 | 190087 | NaN | NaN | 65 |
= df_matching[df_matching["preference"].notnull()]
assigned_students = assigned_students.astype({"preference": int})
assigned_students assigned_students.head()
name | project_code | preference | rank | |
---|---|---|---|---|
0 | 190000 | G2 | 0 | 3 |
1 | 190001 | Q0 | 0 | 56 |
2 | 190002 | U0 | 1 | 60 |
3 | 190003 | C0 | 0 | 67 |
4 | 190004 | I0 | 0 | 19 |
Now the data from the matching we’ve found is more manageable, we can try to understand how “good” it is.
Below is a bar chart showing the frequency of the students’ preference over their match.
= plt.subplots(figsize=(8, 4), dpi=300)
fig, ax
= Counter(assigned_students["preference"])
values
ax.bar(values.keys(), values.values())-1.5, len(unassigned_students))
ax.bar(
-1.5] + list(range(0, 10, 2)))
ax.set_xticks(["Unmatched"] + list(range(0, 10, 2)))
ax.set_xticklabels(["Preference")
ax.set_xlabel("Frequency")
ax.set_ylabel("Student matching preferences") ax.set_title(
Text(0.5, 1.0, 'Student matching preferences')
We can see that the vast majority of students got their first or second choice which is great news.
Another thing we should consider is how appropriate the ranking is. Ideally, a student who ranks amongst the best in the cohort should get one of their first choices, and as the rank of the students go down, as should their preference of their match (on average, anyway).
Below is a scatter plot of student’s rank and their preference of the project they were allocated.
= plt.subplots(figsize=(8, 4), dpi=300)
fig, ax
ax.scatter("rank"],
assigned_students["preference"],
assigned_students[=".",
marker="matched students",
label
)
ax.scatter("rank"],
unassigned_students[-0.5] * len(unassigned_students),
[="|",
marker=3,
lw="unmatched students",
label
)
"Student rank")
ax.set_xlabel("Preference of matching")
ax.set_ylabel("Student rank vs. their preference of their matching")
ax.set_title( ax.legend()
<matplotlib.legend.Legend at 0x7fda20a271c0>
This is the kind of behaviour we want to see. There is a strong line along the bottom, as expected. Then, the cloud of less-preferable matches only grows towards the bottom of the ranking.
Figuring out the remaining spaces
Alas, this is only a starting point. From here, adjustments can be made to the allocation – although this is not recommended as this could break its mathematical soundness.
One avenue to take would be to offer the unassigned students a project with space left over. To figure out where there are spaces, we can use the StudentAllocation
object and the data we cleaned earlier.
import warnings
"ignore") warnings.filterwarnings(
= [
project_with_space_names
project.namefor project in game.projects
if len(project.matching) < project.capacity
+ list(unranked_projects)
]
= [
supervisor_with_space_names
supervisor.namefor supervisor in game.supervisors
if len(supervisor.matching) < supervisor.capacity
+ list(unranked_supervisors) ]
def get_number_of_matches(name, party, game):
for player in vars(game)[party]:
if player.name == name:
return len(player.matching)
return 0
def get_capacity(data, party, name):
if party == "project":
= "code"
column else:
= "name"
column
return data[data[column] == name]["capacity"].iloc[0]
= projects[
projects_with_space "code"].isin(project_with_space_names))
(projects[& (projects["supervisor"].isin(supervisor_with_space_names))
]
"supervisor_capacity"] = projects_with_space[
projects_with_space["supervisor"
apply(lambda x: get_capacity(supervisors, "supervisor", x))
].
"project_matches"] = projects_with_space["code"].apply(
projects_with_space[lambda x: get_number_of_matches(x, "projects", game)
)
"supervisor_matches"] = projects_with_space[
projects_with_space["supervisor"
apply(lambda x: get_number_of_matches(x, "supervisors", game))
].
= projects_with_space[
projects_with_space
["code",
"capacity",
"project_matches",
"supervisor",
"supervisor_capacity",
"supervisor_matches",
] ]
= projects_with_space.set_index(
projects_with_space "supervisor", "code"]
[
).sort_index()
projects_with_space
capacity | project_matches | supervisor_capacity | supervisor_matches | ||
---|---|---|---|---|---|
supervisor | code | ||||
C | C0 | 8 | 2 | 8 | 5 |
C2 | 4 | 2 | 8 | 5 | |
M | M0 | 8 | 2 | 7 | 5 |
Z | Z0 | 5 | 1 | 5 | 3 |
So, with that we have completed this tutorial. We have taken raw data and created a strong allocation of projects to students. Not only that, but we have offered some analysis of this data to demonstrate its effectiveness.