2020: Week 40

Challenge by: Jenny Martin

I often see dashboards and wonder about the data prep behind them. Sometimes the most beautiful of dashboards can be hiding the most horrendous of data preparation. Let's take this Viz of the Day from dataschooler Matthew Armstrong. The visualisation itself is fairly simple, but how did the data start off? 

Explore Matthew's viz here

Inputs

There are three inputs this week:
  1. The poems, scraped from everypoet.com

  2. The Scrabble scores for each letter

  3. (Optional) Scaffolding list

Requirements

  • Input the data
  • Lines of the poem will not contain any HTML, CSS or JS, e.g. <head>, e9=new Object() etc. Filter out any rows which are not lines of the poem
  • Wordsworth is very original, so there shouldn't be any duplicate lines in our data set. Filter out any repeated rows
  • The first line of each poem is also the title of the poem. Ensure this is the case and number the lines of each poem
  • Split the data out so there is a line for each word and assign a word number for each line
  • Split the data into individual letters and combine with the associated Scrabble score
  • Aggregate so each word has a Scrabble score 
  • Create a flag for the highest scoring word in each poem
  • Output the data
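
If you would like to sanity-check the scoring steps outside of Tableau Prep, here is a minimal Python sketch of the de-duplicate, split, score and flag logic. It is only an illustration, not the official solution: the function names (word_score, score_poem) are hypothetical, the letter values are the standard English Scrabble tile values rather than the supplied input file, punctuation is assumed to score 0, and ties for the top score are all flagged. It does not cover filtering out the HTML rows.

```python
# Standard English Scrabble tile values (an assumption; the challenge
# supplies its own scores input, which may be laid out differently).
SCRABBLE_SCORES = {
    **dict.fromkeys("AEIOULNSTR", 1),
    **dict.fromkeys("DG", 2),
    **dict.fromkeys("BCMP", 3),
    **dict.fromkeys("FHVWY", 4),
    "K": 5,
    **dict.fromkeys("JX", 8),
    **dict.fromkeys("QZ", 10),
}


def word_score(word):
    """Sum the letter scores of a word; non-letters count as 0."""
    return sum(SCRABBLE_SCORES.get(ch, 0) for ch in word.upper())


def score_poem(lines):
    """Return one row per word for a single poem (hypothetical helper).

    `lines` is the list of poem lines in order; the first line is
    treated as the poem's title, as the requirements describe.
    """
    # Drop repeated lines, keeping the first occurrence and the order.
    unique_lines = list(dict.fromkeys(lines))
    poem = unique_lines[0] if unique_lines else ""

    rows = []
    for line_no, line in enumerate(unique_lines, start=1):
        # Split on whitespace so each word gets its own row and number.
        for word_no, word in enumerate(line.split(), start=1):
            rows.append({
                "Poem": poem,
                "Line #": line_no,
                "Line": line,
                "Word #": word_no,
                "Word": word,
                "Score": word_score(word),
            })

    # Flag every word that matches the poem's top score (ties included).
    best = max((row["Score"] for row in rows), default=0)
    for row in rows:
        row["Highest Scoring Word?"] = row["Score"] == best
    return rows


if __name__ == "__main__":
    sample = [
        "I wandered lonely as a cloud",
        "That floats on high",
        "That floats on high",  # repeated row, should be dropped
    ]
    for row in score_poem(sample):
        print(row)
```

Each returned row carries the same seven fields as the expected output, so the result can be compared against the Output file column by column.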

Output


  • 7 fields
    • Poem
    • Line #
    • Line
    • Word #
    • Word
    • Score
    • Highest Scoring Word?
  • 633 rows (634 including headers)
Here is the Output file to let you check your structure.

After you finish the challenge, make sure to fill in the participation tracker, then share your solution on Twitter using #PreppinData and tagging @Datajedininja, @JennyMartinDS14, @JonathanAllenby & @TomProwse1

You can also post your solution on the brand new Tableau Forum where we have a Preppin' Data community page. Post your solutions and ask questions if you need any help! 
