2022: Week 49 - Cleaning Status History (HR month)
Challenge by: Kelly Gilbert
Welcome to HR month (well five weeks actually but one of those will come in 2023!). Long term participant Kelly Gilbert has supplied the Dr Preppers some challenges to finish the year. Firstly, thank you to Kelly for these challenges and we're all looking forward to learning more about some of the challenges you face in your day-to-day role. Secondly, we love receiving real world challenges so we hope you enjoy them as much as we do.
The challenge
The applicant tracking system records the statuses and timestamps as a candidate moves through in the hiring process.
Each morning, any candidates/positions that changed during the previous day are loaded to the raw table. However, each time a candidate’s status changes, we receive all of their previous statuses again, in addition to the new one. We call this dataset “The 12 Days of Christmas,” because, like the song, each time a status is added, we get all of the history again. As a result, our dataset has a lot of duplicates!
In this challenge, we want to clean up the data by keeping only the most recent set of statuses for each candidate/position.
Example for candidate 36, position 1:
Candidate 36 created their profile on 2021-01-12, and we received that status in the 2021-01-13 file.
On 2021-02-11, candidate 36 completed the application. We received the Application Completed status, plus the Profile Created status (again).
On 2021-02-12, candidate 36’s application for position 1 was reviewed. We received that status (plus all of the previous statuses, again) in the 2021-02-13 file.
In this example, for candidate 36/position 1, we would only want to keep the records from the 2021-02-28 file, because that is the most recent file received for candidate 36, position 1.
Input
Requirements
For each combination of candidate and position, find the most recent file.
Watch out - the vendor has not been consistent with file naming. However, the file timestamp is always the 19 characters before “.csv”
Filter out any records that are not from the most recent file
- Output the data
Output
Output #1: output the cleaned status history to a CSV file (5 columns, 1513 rows, 1514 with header)
Output #2: Using the cleaned dataset (output #1), find the most recent status for each candidate/position combination (3 columns, 225 rows, 226 with header)
For each candidate/position, find the max timestamp
Keep the record that has the max timestamp
Keep only the candidate_id, position_id, and status columns
Rename the status column to current_status
Output to a CSV file
What was the most recent status for candidate 12 and position 4?
You can download the full outputs here.