OnData.blog

Data Lake: simple usage example

Data Lakes vary from each other. Standards are only emerging. The Lagoon Data Lake we have built at Sopra Steria (introduced in the previous post) is an internal IaaS Data Lake solution, built mostly from open source components (Spark,

Pawel Plaszczak | March 26, 2020 (updated April 1, 2020) | Articles | No Comments

Techniques for comparing populations of IT data

I have recently been working a lot with IT Infrastructure Management data. At Sopra Steria, we manage sizeable ecosystems of our corporate clients that include thousands of apps and infrastructure elements. We handle events, incidents, alarms, and support tickets. We process thousands

Pawel Plaszczak | March 2, 2020 (updated March 11, 2020) | Articles | No Comments

The Data Lake at Sopra Steria

Following up on my previous post, we have spent some time building an internal Data Lake at Sopra Steria. The infrastructure is now functional and admitting its first users. Much has been said on building successful Data Science teams. Multidisciplinary

Pawel Plaszczak | February 17, 2020 (updated February 18, 2020) | Articles | 4 Comments

Synchronizing an SQL Database to a Data Lake (Change Data Capture at ingest)

The considerations below result from some recent projects at Sopra Steria. The goal: having built a Data Lake, we want to deliver (ingest) into the Raw Zone the data from various sources, including several instances of an Oracle Database. We want

Pawel Plaszczak | November 29, 2019 | Articles | 1 Comment

Why Analysts need Data Lakes?

With substantial analytical needs at Sopra Steria Apps, we are looking to expand our Data Science environment. My thoughts go towards a Data Lake architecture, from a concrete angle, having practical requirements and knowing quite precisely what we want. I’ve

Pawel Plaszczak | October 14, 2019 | Articles | No Comments

Simple hack to improve data clustering visualizations

Here is how to make your data clusters look pretty in no time (with Python and matplotlib), using a one-line code hack. I wanted to visualize, in Python and matplotlib, the data clusters returned by clustering algorithms such as K-means (sklearn.cluster.KMeans)

Pawel Plaszczak | October 8, 2019 (updated October 10, 2019) | Articles | No Comments

How to isolate data that constitutes a spike in histogram?

We would all love to spot business problems early on, to react before they become painful. You can learn a lot by looking at past problems. Hence, understanding the nature of anomalies in data can bring substantial operational benefits and

Pawel Plaszczak | October 1, 2019 (updated November 20, 2021) | Articles | 2 Comments

Getting started: Apache Spark, PySpark and Jupyter in a Docker container

Apache Spark is a popular distributed computation environment. It is written in Scala, but you can also use it from Python. For those who want to learn Spark with Python (including students of these BigData classes), here’s an intro to

Pawel Plaszczak | May 17, 2019 | Articles | 1 Comment

Building End-to-End Production Machine Learning pipelines

Much of the public discourse in Data Science focuses on model optimization (selection of regressors/classifiers, hyperparameter tuning, model training, and improvement of prediction accuracy). Less material is available on using and deploying these trained Machine Learning models in production.

Pawel Plaszczak | April 21, 2019 (updated May 13, 2019) | Articles | No Comments

Audit logging for Amazon Redshift

I recently spent a while configuring and analysing the logs for the Amazon Redshift warehouse. I am summarizing the experience here so others can achieve the same faster. Amazon Redshift is the analytical data warehouse platform on the AWS cloud, with

Pawel Plaszczak | April 3, 2019 | Articles | No Comments

Recent Posts

  • Data Literacy: Six examples of bad data interpretation April 29, 2024
  • Porting PyTorch neural network to Amazon AWS June 30, 2022
  • Porting pyTorch cloud detection model to Amazon AWS S3 June 17, 2022
  • pushing data to AWS. SageMaker sucks. So does Anaconda June 14, 2022
  • Linear Regression: Killer App with 19-century maths January 19, 2022
  • Democratization of statistics: Chi2 for non-experts January 12, 2022
  • An approach to categorize multi-lingual phrases December 15, 2021
  • The implications of Scikit-learn bug #21455 November 29, 2021
  • Your model may be inaccurate November 25, 2021
  • Answering Why (with Chi-Square) November 19, 2021
Copyright © 2025 OnData.blog. All rights reserved.