2022: Week 49 - Cleaning Status History (HR month)

Challenge by: Kelly Gilbert

Welcome to HR month (well five weeks actually but one of those will come in 2023!). Long term participant Kelly Gilbert has supplied the Dr Preppers some challenges to finish the year. Firstly, thank you to Kelly for these challenges and we're all looking forward to learning more about some of the challenges you face in your day-to-day role. Secondly, we love receiving real world challenges so we hope you enjoy them as much as we do.

The challenge

The applicant tracking system records the statuses and timestamps as a candidate moves through in the hiring process. 

Each morning, any candidates/positions that changed during the previous day are loaded to the raw table. However, each time a candidate’s status changes, we receive all of their previous statuses again, in addition to the new one. We call this dataset “The 12 Days of Christmas,” because, like the song, each time a status is added, we get all of the history again. As a result, our dataset has a lot of duplicates!

In this challenge, we want to clean up the data by keeping only the most recent set of statuses for each candidate/position.

Example for candidate 36, position 1:

  • Candidate 36 created their profile on 2021-01-12, and we received that status in the 2021-01-13 file.

  • On 2021-02-11, candidate 36 completed the application. We received the Application Completed status, plus the Profile Created status (again).

  • On 2021-02-12, candidate 36’s application for position 1 was reviewed. We received that status (plus all of the previous statuses, again) in the 2021-02-13 file.

  • In this example, for candidate 36/position 1, we would only want to keep the records from the 2021-02-28 file, because that is the most recent file received for candidate 36, position 1.



  • Input the data

  • For each combination of candidate and position, find the most recent file. 

    • Watch out - the vendor has not been consistent with file naming. However, the file timestamp is always the 19 characters before “.csv”

    • Filter out any records that are not from the most recent file

  • Output the data


  • Output #1: output the cleaned status history to a CSV file (5 columns, 1513 rows, 1514 with header)

  • Output #2: Using the cleaned dataset (output #1), find the most recent status for each candidate/position combination (3 columns, 225 rows, 226 with header)

    • For each candidate/position, find the max timestamp

    • Keep the record that has the max timestamp

    • Keep only the candidate_id, position_id, and status columns

    • Rename the status column to current_status

    • Output to a CSV file

What was the most recent status for candidate 12 and position 4?

 You can download the full outputs here

After you finish the challenge make sure to fill in the participation tracker, then share your solution on Twitter using #PreppinData and tagging @Datajedininja@JennyMartinDS14@TomProwse1 & @kelly_gilbert

You can also post your solution on the Tableau Forum where we have a Preppin' Data community page. Post your solutions and ask questions if you need any help! 

Popular posts from this blog

2023: Week 1 The Data Source Bank

2023: Week 2 - International Bank Account Numbers

How to...Handle Free Text