RedJamJar Software Services

Full-Stack Data Engineering Consultant

  • The AI Ethics Playbook: Building Trust While Breaking Ground

    Because “move fast and break things” shouldn’t include your clients’ trust or your reputation


    Remember last week when we talked about keeping client data safe while using AI? Well, today we’re diving into the bigger picture – the ethical minefield that comes with wielding these powerful tools. And before you click away thinking “ethics = boring compliance talk,” stick with me. This is about protecting your professional reputation, your clients’ businesses, and maybe even sleeping better at night.

    The urgency is real: According to McKinsey’s latest survey, 78% of organizations now use AI in at least one business function, up from just 55% a year ago. We’re all part of this wave, and getting the ethics right isn’t optional anymore.

    (more…)
  • AI at Work: The Critical Don’ts Every Consultant Needs to Know

    How to harness AI’s power while keeping client data safe and your career intact

    We’re all riding the AI wave – and rightfully so! From ChatGPT helping us brainstorm solutions to Copilot speeding up our code reviews, AI tools are transforming how we work as consultants. But as we embrace these powerful capabilities, there’s a critical balance we need to strike: maximizing AI’s potential while protecting the sensitive information that our clients trust us with.

    Here are the essential don’ts that every consultant should keep in mind:

    (more…)
  • Apache Airflow has become a cornerstone tool in the data engineering world, offering robust workflow orchestration capabilities. As someone who has used Airflow extensively, I can attest to its flexibility and power in automating ETL processes. In this guide, I’ll walk you through setting up Airflow using Docker, creating and scheduling Directed Acyclic Graphs (DAGs), and applying best practices, including integration with dbt.

    (more…)

  • As a seasoned software engineer, I’m always on the lookout for tools that can streamline my workflow and enhance productivity. Over the past month, I’ve been using dbt (Data Build Tool), and it has transformed the way I handle data operations. In this blog post, I’ll share why I made the switch and how dbt can benefit fellow data engineers.

    Streamlining Data Transformation

    Before adopting dbt, my data workflow was a patchwork of processes and manual operations: some Scala applications, bash scripts, Airflow and Python glue. I spent a considerable amount of time munging data to fit standardized schemas and creating reports. This often involved writing complex scripts and dealing with ad-hoc mutations. The process was not only time-consuming but also prone to errors and inconsistencies.

    dbt-core simplifies data transformation by allowing you to write modular SQL queries (called models in dbt speak) that can be tested, version controlled in Git and documented easily. With dbt, I can define models that represent the desired state of my data. These models are built incrementally, which means I can create a clear and maintainable pipeline of transformations. The result is a more organized and efficient workflow that reduces the chances of errors.

    Automating Data Quality Checks

    One of the biggest challenges I faced was ensuring the quality and consistency of the data. A large amount of my work is concerned with receiving, normalising and persisting data. It can arrive late, in different schemas and sometimes with missing information. For example, duplicate records and upstream inconsistencies were common issues that required manual intervention to fix. dbt addresses this problem through its testing framework. I can write tests for my models to ensure they meet certain criteria, such as uniqueness or non-null constraints.

    By automating these checks, dbt significantly reduces the time spent on manual quality assurance. It provides immediate feedback if something goes wrong, allowing me to catch and fix issues early in the process. This has greatly improved the reliability of my data and the confidence in the reports generated from it.
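
    In dbt these checks are declared against a model rather than hand-coded, so the following is not dbt code. It is only a minimal Scala sketch of what a uniqueness check and a non-null check assert about a set of rows. The `Customer` record, its fields and the helper names are all hypothetical, invented for this illustration.

```scala
object DataQualitySketch {
  // Hypothetical record shape, invented for this illustration
  final case class Customer(id: Int, email: Option[String])

  // Idea behind a uniqueness check: report every key that appears more than once
  def duplicateKeys[A, K](rows: Seq[A])(key: A => K): Set[K] =
    rows.groupBy(key).collect { case (k, group) if group.size > 1 => k }.toSet

  // Idea behind a non-null check: report every row where the field is missing
  def missingValues[A, V](rows: Seq[A])(field: A => Option[V]): Seq[A] =
    rows.filter(row => field(row).isEmpty)

  // Made-up sample data containing one duplicate id and one missing email
  val rows: Seq[Customer] = Seq(
    Customer(1, Some("a@example.com")),
    Customer(2, None),
    Customer(2, Some("b@example.com"))
  )

  def main(args: Array[String]): Unit = {
    println(s"duplicate ids: ${duplicateKeys(rows)(_.id)}")
    println(s"rows missing an email: ${missingValues(rows)(_.email)}")
  }
}
```

    A check passes when the returned collection is empty. Here both return one offending entry, which is the kind of early feedback described above.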

    Version Control and Collaboration

    Another major advantage of dbt is its integration with version control systems like Git. Unlike many other ETL tools, this makes collaboration and keeping a history of changes straightforward. I can track every modification made to my data models, understand why changes were made, and roll back if necessary. This is particularly useful when working in a team, as it ensures that everyone is on the same page and changes are well-documented.

    The ability to version control my SQL scripts has brought a level of discipline and transparency to my workflow that was previously missing. It’s now easier to collaborate with colleagues, conduct code reviews, and manage deployments.

    Documentation and Transparency

    dbt automatically generates documentation for your data models, making it easy to understand the structure and lineage of your data. This documentation includes information about the source of each model, the transformations applied, and the tests performed. This level of transparency is invaluable for onboarding new team members and for auditing purposes.

    Having up-to-date documentation readily available has saved me countless hours that would have otherwise been spent explaining the intricacies of our data pipeline to others. It ensures that everyone has access to the same information and can easily navigate the data landscape.

    Conclusion

    Switching to dbt has been good for our data workflows. It has replaced various manual processes and ad-hoc operations with a more structured, automated, and reliable approach to data transformation. The benefits of modular SQL queries, automated testing, version control, and comprehensive documentation have significantly enhanced my productivity and the quality of my data.

    For fellow data engineers who haven’t yet explored dbt, I highly recommend giving it a try. It’s a powerful tool that can bring order to the chaos of data transformation and help you achieve more with less effort. If you’re looking to streamline your data processes, improve data quality, and foster better collaboration within your team, dbt might just be the solution you need.

  • What is a minimum viable product (MVP)?

    An MVP is the minimum amount of product you need to build to get something in front of your user (or learner). The intention is to shorten the time it takes to work out what your customer (in our case the learner) really wants through an iterative approach.

    The approach was originally described by Eric Ries in The Lean Startup. Begin with a hypothesis which can be built quickly and cheaply. This minimum build includes the data gathering tools to extract valuable learnings to determine if your hypothesis was correct or not. Over time you refine the process with the intent that you are orientating closer and closer to meeting your goal. Failing that you identify quickly that the hypothesis was wrong and you should start over.

    This process is often described as Build – Measure – Learn as shown in the cycle below.

    Build Measure Learn Cycle

    The technology startup and innovation worlds are awash with this approach, with many adopting it exclusively over more traditional research and focus groups. Within HR and Learning & Development, the approach has been discussed previously by KAnthony in “the minimum viable practice”, by Teachable in “How to run your online course like a lean startup” and by Andrea May in “minimum viable products development”.

    The focus in this post is to explore the measurement and learning aspect and how it can bring about an opportunity to measure bottom line performance improvement and bring strong cohesion between the learner and the learning designer.

    Do you know what your learner wants and needs?

    The focus is to take a different perspective to the usual approach and really understand the needs of your workforce and the problems they are facing.

    Once you change focus towards identifying the problems and looking for repeating trends across the organisation then you can start to see what you might offer.

    This provides an opportunity to move learning and development towards performance improvement:

    You’ve identified a problem which impacts 30% of your workforce. Through your MVP learning cycle (above), you find that doing X the current way costs 17 minutes per person per day. Gross this up across the workforce and a strong rationale for action can be formed.
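
    To make the grossing-up concrete, here is a rough Scala sketch of the arithmetic. The 30% and 17 minutes come from the example above. The headcount of 1,000 and the 220 working days per year are assumptions added purely for illustration, so substitute your own figures.

```scala
object CostOfProblemSketch {
  // Figures taken from the example in the text
  val affectedPercent: Int = 30          // share of the workforce affected, in percent
  val minutesPerPersonPerDay: Int = 17   // time spent doing X the current way

  // Assumptions made for this illustration only
  val headcount: Int = 1000
  val workingDaysPerYear: Int = 220

  val affectedPeople: Int = headcount * affectedPercent / 100
  val minutesPerDay: Int = affectedPeople * minutesPerPersonPerDay
  val hoursPerDay: Double = minutesPerDay / 60.0
  val hoursPerYear: Double = hoursPerDay * workingDaysPerYear

  def main(args: Array[String]): Unit = {
    println(s"People affected: $affectedPeople")
    println(s"Hours spent per day: $hoursPerDay")
    println(s"Hours spent per year: $hoursPerYear")
  }
}
```

    Under those assumptions, 300 people each spending 17 minutes comes to 5,100 minutes, or 85 hours, every working day, which is 18,700 hours a year.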

    Although this is not the main aim, identifying user problems often yields insights which can be used in a myriad of ways. Improving the organisation’s performance and developing a valuable business case are two examples.

    How do your learners want to receive information?

    Understanding how to reach a learner is critical to engaging them. We are often told there isn’t enough time. However, if it really matters and the channel is effective, you can be surprised by how much time people can find. It’s about choice, not time: is your learner motivated to do something about their development?

    This requires effort on the learning designer’s part to understand the learner’s motivation and align it with the objectives of the programme. For example, a new employee going through their induction programme may be highly motivated to ‘fit in’. This is quite different to the motivation levels associated with someone stepping up to managing or heading up a business unit.

    This is a fantastic creative opportunity; disrupting someone with content which is appealing, relevant and timely. Using an MVP approach you might begin by curating resources from Vimeo or Youtube and sharing that through your social network.

    One important consideration, however, is how you can learn from this trial.

    Example
    
    Let’s say you find a short video clip on how to rebuild an engine component.
    
    At this point, our understanding is that this is a complex process and many technicians don’t feel skilled enough to carry out the task in the manufacturer’s stated repair time.
    
    Looking on Youtube we come across:
    
    But before we provide this we want to be able to know if it's working so let’s include some analytics. Using the link to the video page we shorten it with bit.ly. Then we share and use this resource in our target group or emails we send. This way we can see who engages with the content.
    
    http://bit.ly/2a2uAml
    
    Using this link I can now track and monitor engagement in bit.ly. This is very rough and ready. A better solution would be to embed the video into a blog post or portal so that I can use Google Analytics to get improved data on how it’s being used, and include a poll to learn more.

    The value at this stage is the opportunity to try content in front of real users and get some valuable feedback. Optimising for the best channel and mix of resources is a great way to avoid overtraining or delivering information which doesn’t hit the spot.

    Experiment

    Discovering what the learner needs is only part of the puzzle. If you don’t know why, you can’t expect to apply this to a viable product and replicate the same results at scale.

    So far you’ve identified a problem which a number of people have. For that problem, you’ve shared some existing resources (no cost) and tracked each resource to see where there was uptake.

    During this exercise, you are dealing with very low volumes of users. This is one of the key principles of a minimum viable product. It provides you with the chance to be personal, to speak to each user and go deep on why something is or isn’t working.

    This experimental approach doesn’t scale but neither does it need to. At the end of this, you will have a great understanding of what the problem is and how your learners would like to see it solved.

    Your discussions should look to extract information around:

    1. What were the triggers that gave them the initial spur?
    2. What were the channels they used to find a solution?
    3. What were their expectations of what they would find vs what was provided?
    4. Were there appropriate supplementary web pages or resources which allowed them to go further?
    5. Can they recall the key learnings and apply these in their practice?
    6. Were they motivated to contribute either a rating, comment or question for the next person?

    Try new ways and see how best you can meet your goals. This process can be rapid and low cost by curating something which is available already to develop an effective learning solution.

    Q&A — I was asked for some further suggestions to test ideas with low-cost curated resources …

    • If you identify a tool or calculator: Try Google Docs to provide a read-only version that each person can copy and adapt.
    • Looking for curated infographics: Try Pinterest.
    • Presentations: Slideshare
    • Videos: Vimeo or YouTube or TED
    • Industry leaders: Twitter
    • Surveys: Google Forms or Wufoo
    • Ask and understand how they solve the problem today. The feedback can be surprising; in my experience you can get responses like:

    “Oh I’ve created this little tool”

    or

    “Bill in Customer Service keeps a report on each process”.

    What next?

    Through the MVP process, you gain an understanding of the elements required to address a problem. The Build->Measure->Learn loop removes barriers, recognises the role of motivation in engaging learners and drives collaboration between the learning designer and the learner to co-create an effective, proven set of insights. These insights are the ingredients of your next programme.

    It’s now time to develop a solution which scales to meet the size of your audience and offers the resources and support your users need.

  • Notes on Variance

    I’ve recently been working through the Rock the JVM Cats course, having previously completed the Scala Advanced course and been very impressed with the explanations and training materials.

    Here are my takeaway notes on variance in Scala, based on the session by Daniel Ciocîrlan.

    package RockTheJvm.cats.chapter1
    
    object TCVariance {
    
      import cats.Eq
      import cats.instances.int._           // Eq[Int] TC Instance
      import cats.instances.option._        // construct an Eq[Option[Int]] TC Instance
      import cats.syntax.eq._
    
      val aCompare = Option(2) === Option(3)
      val anInvCompare = Option(2) === None
    
      // variance
      class Animal
      class Cat extends Animal
    
      // covariant type: subtyping is propagated to the generic type
      class Cage[+T]
      val cage: Cage[Animal] = new Cage[Cat]  // Cat <: Animal, so Cage[Cat] <: Cage[Animal]
    
      // contravariant type: subtyping is propagated BACKWARDS to the generic type
      class Vet[-T]
      val vet: Vet[Cat] = new Vet[Animal]     // Cat <: Animal then Vet[Animal] <: Vet[Cat]
    
      // rule of thumb: "Has a T" = covariant, "Acts on T" = contravariant
      // variance affects how TC instances are being fetched
    
      // contravariant
      trait SoundMaker[-T]
      implicit object AnimalSoundMaker extends SoundMaker[Animal]
      def makeSound[T](implicit soundMaker: SoundMaker[T]) = println("Wow")
    
      makeSound[Animal] // OK - TC instance defined above
      makeSound[Cat]    // OK - TC instance for Animal is also applicable to Cats - like the vet example above
      // rule 1: contravariant TCs can use the superclass instances if nothing is available strictly for that type
    
      // has implications for subtypes
      implicit object OptionSoundMaker extends SoundMaker[Option[Int]]
      makeSound[Option[Int]]
      makeSound[Some[Int]]
    
      // covariant
      trait AnimalShow[+T] {
        def show: String
      }
    
      implicit object GeneralAnimalShow extends AnimalShow[Animal] {
        override def show: String = "Animals Everywhere"
      }
    
      implicit object CatShow extends AnimalShow[Cat] {
        override def show: String = "Cats everywhere"
      }
    
      def organiseShow[T](implicit event: AnimalShow[T]): String = event.show
      // rule 2: covariant TCs will always use the more specific TC instance for that type
      // but may confuse the compiler if the general TC is also present
    
      // rule 3: you can't have both benefits
      // Cats uses Invariant type classes
      Option(2) === Option.empty[Int]
    
      def main(args: Array[String]): Unit = {
        println(organiseShow[Cat])           // ok - the compiler will inject CatShow as implicit
    //    println(organiseShow[Animal])   // will not compile - ambiguous values
      }
    }
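
    Rule 1 is easier to see when the type class returns a value instead of printing. The following is my own minimal sketch rather than code from the course. It needs no Cats import, and the `sound` member and the object name are additions for illustration only.

```scala
object ContravariantLookupSketch {
  class Animal
  class Cat extends Animal

  // Contravariant type class as in the notes, but returning a value
  // so the instance the compiler picked is visible to the caller
  trait SoundMaker[-T] {
    def sound: String
  }

  implicit object AnimalSoundMaker extends SoundMaker[Animal] {
    override def sound: String = "generic animal sound"
  }

  def makeSound[T](implicit soundMaker: SoundMaker[T]): String = soundMaker.sound

  val animalSound: String = makeSound[Animal]
  // No SoundMaker[Cat] is defined, so the compiler falls back to AnimalSoundMaker,
  // because SoundMaker[Animal] <: SoundMaker[Cat] under contravariance
  val catSound: String = makeSound[Cat]

  def main(args: Array[String]): Unit = {
    println(animalSound)
    println(catSound)
  }
}
```

    Both calls return the same string because only the `Animal` instance exists. The sketch deliberately avoids defining a second, `Cat`-specific instance, since which one wins in that case depends on the compiler's implicit resolution rules.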
    
  • Thanks for visiting

    There are no bells or whistles here, just a simple site with some software engineering content which you probably won’t find useful but I do. I finally decided to put together this very simple site.

    What to expect?

    Short notes on programming, particularly with Scala