Data Cleaning – Aaron Tyrell Mitchell

In a recent project, I ran a series of calculations based on employee work history for a large national retailer. The calculations themselves were not complicated. However, the real challenge was organizing the dataset in such a way that allowed for the calculations to be run for all employees over the entire time period.

Task:

1. Convert very irregular shift data for 173 employees into an easily readable format

2. Perform the following calculations for each shift: total length of shift, length of lunch break, amount of time worked prior to lunch, amount of time worked following lunch

3. Determine the following characteristics of each shift: lunch breaks that were exactly 30 minutes in length, shifts long enough to qualify for lunch but no lunch was taken, lunch breaks that occurred very early in the shift, and lunch breaks that occurred very late in the shift.

Primary challenge: A very irregular dataset

This sample dataset runs from August-December 2017 and is arranged by employee. For example, the dataset lists all shifts worked by Employee #1 from August-December 2017 and then lists all shifts for Employee #2, then Employee #3, and so on. However, the employee number is only listed once at the very start of that employee’s data.

Employees of this company often worked split shifts, e.g. an employee with an 8-hour shift working 3 hours in the morning and then returning to work several hours later that day to complete the remaining 5 hours. In the above example, Employee #163 worked a split shift on 8/1/2017, but a regular shift with a 28-minute lunch break on 8/3/2017.

Additionally, shifts are sometimes split between two lines at the lunch break or shift split, but at other times, they are all listed on a single line

Python Script

Given the tasks and challenges, I decided that Python was best suited for this analysis.

The first task was to fill in the missing employee numbers so that each shift was matched to an employee. Also, as you can see in the examples above, there are a lot of empty lines and lines with dates but no shift data.

Here is how the raw dataset appears in a pandas dataframe:

My first step was to fill in the missing ID numbers and delete all blank spaces. The script below deletes all rows that have no data for the first clock in time at the start of the shift (“In”):

Next, I transformed the basic time data into datetime by combining the date column and each punch of the clock to form four new columns: Punch 1-4

In order to make this easily readable for the client and easier to perform subsequent calculations, we wanted the shifts that were separated on several lines for the same day to all be on the same line. We grouped the data by ID Number and Date and then created two new columns: Punch 3a and 4a.

You’ll notice on line 102, that I pulled the information from Punch 1 and 2 from line 104 and applied it to Punch 3a and 4a. This transformed each workday from split between two lines to just one. This brought to light a new problem that will present itself once we begin calculating the length of each shift. Notice on line 102 that the final clock out (Punch 4a) occurred at 2:24am on 8/2/17 but the date is listed as 8/1/17, the date that the shift began. Leaving the date that way will make it impossible to accurately calculate the full shift length in Python. Therefore, I wrote the following code to check whether the times in Punches 2 through 4 were less than that of Punch 1 and if so, add 1 day onto the date. I then created 4 new columns to include any date adjustments: Punch 1Adj-4Adj

Compared to the previous example, the final clock out on line 102 is now 8/2/2017, with Punches 1-3 occurring on 8/1/2017. At this point, there are still two lines for this shift: one with the data as it originally was (the last line in the example above) and one with the data from Punch 1 and 2 moved to Punch 3 and 4. I used the follow code to clean this up:

At this point, the hard part (Task 1) is over: the employee numbers have been filled in, blank lines have been deleted, each shift is presented on a single line, and dates for shifts that extend past midnight have been adjusted as necessary. We are now working with a clean, well-organized dataset with no empty spaces and no duplicates. What comes next is series of fairly simple calculations to satisfy Tasks 2 and 3, as listed at the top of the post.

There you have it. With the above Python script, I was able to take the raw Excel files furnished by the client and transform them into a single, easily read dataframe. Please see my Github page for the raw data sample and the finished product.

End Notes

While I used the structure of the dataset in its original form, I anonymized the data, including employee ID numbers.

The full Python code and sample dataset can be found on Github. The above represents the bulk of the analysis that I ran, but I made some additional clean up and adjustments that are not reflected here or on Github.