A simple introduction to the basics of Git and GitHub.
Since I'm new to Git, please let me know if anything is wrong.
Hope you'll enjoy it.
[PyCon US 2025] Scaling the Mountain: A Framework for Tackling Large-Scale Tech Debt (Jimmy Lai)
Managing tech debt in large legacy codebases isn't just a challenge; it's an ongoing battle that can drain developer productivity and morale. In this talk, I'll introduce a Python-powered Tech Debt Framework designed to help teams tackle even the most daunting tech debt problems with 100,000+ violations. This open-source framework empowers developers and engineering leaders by:
- Tracking Progress: Measure and visualize the state of tech debt and trends over time.
- Recognizing Contributions: Celebrate developer efforts and foster accountability with contribution leaderboards and automated shoutouts.
- Automating Fixes: Save countless hours with codemods that address repetitive debt patterns, allowing developers to focus on higher-priority work.
Through real-world case studies, I'll showcase how we:
- Reduced 70,000+ pyright-ignore annotations to boost type-checking coverage from 60% to 99.5%.
- Converted a monolithic sync codebase to async, addressing blocking IO issues and adopting asyncio effectively.
Attendees will gain actionable strategies for scaling Python automation, fostering team buy-in, and systematically reducing tech debt across massive codebases. Whether you’re dealing with type errors, legacy dependencies, or async transitions, this talk provides a roadmap for creating cleaner, more maintainable code at scale.
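To make the progress-tracking idea concrete, here is a minimal sketch (not the framework's actual code) that counts pyright-ignore annotations per top-level package so the totals can be charted over time; the src root and the regex are assumptions:

    import re
    from collections import Counter
    from pathlib import Path

    # Hypothetical violation marker; the framework tracks many patterns.
    VIOLATION = re.compile(r"#\s*pyright:\s*ignore")

    def count_violations(root: str) -> Counter:
        """Count pyright-ignore annotations per top-level package."""
        counts: Counter = Counter()
        for path in Path(root).rglob("*.py"):
            text = path.read_text(encoding="utf-8", errors="ignore")
            hits = len(VIOLATION.findall(text))
            if hits:
                # Bucket by first path component to surface debt hot spots.
                counts[path.relative_to(root).parts[0]] += hits
        return counts

    if __name__ == "__main__":
        for package, n in count_violations("src").most_common():
            print(f"{package}: {n} violations")

Emitting these counts on every CI run yields the trend data behind the leaderboards and shoutouts described above.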
PyCon JP 2024 - Streamlining Testing in a Large Python Codebase (Jimmy Lai)
Maintaining code quality in a growing codebase is challenging. We faced issues like increased test suite execution time, slow test startups, and coverage reporting overhead. By leveraging open-source tools, we significantly enhanced testing efficiency. We utilized pytest-xdist for parallel test execution, reducing test times and accelerating development. Optimizing test startup, using Docker and Kubernetes for CI and pytest-hot-reloading for local development, further improved productivity. Customizing coverage tools to target only updated files minimized overhead. The result: an 8,000-case increase in test volume, 85% test coverage, and CI tests completing in under 15 minutes.
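As a rough illustration of combining pytest-xdist with coverage limited to updated files, here is a sketch; the diff base (origin/main) and the use of top-level directories as coverage targets are assumptions, not our exact setup:

    import subprocess
    import sys

    def changed_python_files(base: str = "origin/main") -> list[str]:
        # Ask git which files this branch touched relative to the base.
        out = subprocess.run(
            ["git", "diff", "--name-only", base],
            capture_output=True, text=True, check=True,
        ).stdout
        return [f for f in out.splitlines() if f.endswith(".py")]

    def run_tests(files: list[str]) -> int:
        # -n auto (pytest-xdist) runs one worker per CPU core; --cov
        # (pytest-cov) restricts coverage measurement to the packages
        # that actually changed, which keeps reporting overhead low.
        packages = sorted({f.split("/")[0] for f in files})
        cov_args = [f"--cov={p}" for p in packages]
        return subprocess.run(["pytest", "-n", "auto", *cov_args]).returncode

    if __name__ == "__main__":
        sys.exit(run_tests(changed_python_files()))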
EuroPython 2024 - Streamlining Testing in a Large Python Codebase (Jimmy Lai)
Maintaining code quality through effective testing becomes increasingly challenging as codebases expand and developer teams grow. In our rapidly expanding codebase, we encountered common obstacles such as increasing test suite execution time, slow test coverage reporting and delayed test startup. By leveraging innovative strategies using open-source tools, we achieved remarkable enhancements in testing efficiency and code quality.
As a result, in the past year, our test case volume increased by 8,000, test coverage was elevated to 85%, and Continuous Integration (CI) test duration was kept under 15 minutes.
Black, Flake8, isort, and Mypy are useful Python linters, but it's challenging to use them effectively at scale: across multiple codebases, in a large codebase, or with many developers. Manually keeping linter versions and configurations consistent across codebases requires endless effort. Linter analysis on large codebases is slow. Linters may slow down developers by asking them to fix trivial issues. And running linters in distributed CI jobs makes it hard to understand the overall developer experience.
To handle these scale challenges, we developed a reusable linter framework that releases new linter updates automatically, reuses consistent configurations, runs linters only on updated code to speed up runtime, collects logs and metrics to provide observability, and builds autofixes for common linter issues. Our linter runs are fast and scalable. Every week, they run 10k times on multiple millions of lines of code in over 25 codebases, generating 25k suggestions for more than 200 developers. The framework's autofixes also save 20 hours of developer time every week.
In this talk, we’ll walk you through popular Python linters and configuration recommendations, and we will discuss common issues and solutions when scaling them out. Using linters more effectively will make it much easier for you to apply best practices and more quickly write better code.
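As one concrete pattern for the "run linters on only updated code" idea, here is a minimal sketch; the linter list and diff base are illustrative assumptions, not the framework's configuration:

    import subprocess
    import sys

    LINTERS = [
        ["black", "--check"],
        ["isort", "--check-only"],
        ["flake8"],
    ]

    def changed_files(base: str = "origin/main") -> list[str]:
        # --diff-filter=AM keeps added/modified files only, so linters
        # are never pointed at paths that the branch deleted.
        out = subprocess.run(
            ["git", "diff", "--name-only", "--diff-filter=AM", base],
            capture_output=True, text=True, check=True,
        ).stdout
        return [f for f in out.splitlines() if f.endswith(".py")]

    def main() -> int:
        files = changed_files()
        if not files:
            return 0
        # Each linter sees only the files touched on this branch, so
        # runtime scales with the diff instead of the whole codebase.
        return max(subprocess.run(cmd + files).returncode for cmd in LINTERS)

    if __name__ == "__main__":
        sys.exit(main())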
EuroPython 2022 - Automated Refactoring Large Python Codebases (Jimmy Lai)
Like many companies with multi-million-line Python codebases, Carta has struggled to adopt best practices like Black formatting and type annotation. The extra work needed to do the right thing competes with the almost overwhelming need for new development, and unclear code ownership and lack of insight into the size and scope of type problems add to the burden. We’ve greatly mitigated these problems by building an automated refactoring pipeline that applies Black formatting and backfills missing types via incremental Github pull requests. Our refactor applications use LibCST and MonkeyType to modify the Python syntax tree and use GitPython/PyGithub to create and manage pull requests. It divides changes into small, easily reviewed pull requests and assigns appropriate code owners to review them. After creating and merging more than 3,000 pull requests, we have fully converted our large codebase to Black format and have added type annotations to more than 50,000 functions. In this talk, you’ll learn to use LibCST to build automated refactoring tools that fix general Python code quality issues at scale and how to use GitPython/PyGithub to automate the code review process.
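To give a flavor of the LibCST side of such a pipeline, here is a toy codemod, not Carta's actual one: it adds a -> None return annotation to __init__ methods that lack one:

    import libcst as cst

    class AnnotateInit(cst.CSTTransformer):
        """Add "-> None" to __init__ methods missing a return annotation."""

        def leave_FunctionDef(
            self, original_node: cst.FunctionDef, updated_node: cst.FunctionDef
        ) -> cst.FunctionDef:
            if updated_node.name.value == "__init__" and updated_node.returns is None:
                return updated_node.with_changes(
                    returns=cst.Annotation(annotation=cst.Name("None"))
                )
            return updated_node

    source = """
    class Point:
        def __init__(self, x, y):
            self.x, self.y = x, y
    """
    module = cst.parse_module(source)
    print(module.visit(AnnotateInit()).code)

Because LibCST preserves whitespace and comments, the rewritten file differs only in the annotation, which keeps the generated pull requests easy to review.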
Annotate types in large codebase with automated refactoring (Jimmy Lai)
Adding missing type annotations to a large Python codebase is not easy. The major challenges include limited developer time, tons of missing types, code ownership, and active development. We solved the problem by building an automated refactoring pipeline that runs CircleCI jobs to create incremental GitHub pull requests that backfill missing types using heuristic rules and MonkeyType. The refactoring apps use LibCST to modify the Python syntax tree. Changes are split into small, reviewable pull requests and assigned to code owners for review. So far, this work has added type annotations to more than 45,000 Python functions and saved a great deal of engineering effort.
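For the pull request half of the pipeline, a condensed sketch using GitPython and PyGithub might look like the following; the repo slug, token handling, branch naming, and reviewer are all hypothetical:

    from git import Repo        # GitPython
    from github import Github   # PyGithub

    def open_backfill_pr(repo_path: str, token: str, batch: int) -> None:
        branch = f"typing-backfill-{batch}"
        repo = Repo(repo_path)
        repo.git.checkout("-b", branch)
        repo.git.add(all=True)  # stage the edits the codemod produced
        repo.index.commit(f"Backfill type annotations (batch {batch})")
        repo.git.push("origin", branch)

        gh = Github(token)
        gh_repo = gh.get_repo("example-org/example-repo")  # hypothetical slug
        pr = gh_repo.create_pull(
            title=f"Backfill type annotations (batch {batch})",
            body="Automated annotations from the refactoring pipeline.",
            head=branch,
            base="main",
        )
        pr.create_review_request(reviewers=["some-code-owner"])  # hypothetical

Splitting the codemod output into many such batches is what keeps each pull request small enough for an owner to review quickly.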
The journey of asyncio adoption in Instagram (Jimmy Lai)
In this talk, we share our strategy for adopting asyncio and the tools we built, including a common helper library for asyncio testing, debugging, and profiling; static analysis and profiling tools for identifying call stacks; bug fixes and optimizations for the asyncio module; and design patterns for asyncio. These lessons come from a large-scale project: the Instagram Django Service.
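One cheap trick in this space, shown here as a standalone sketch rather than as Instagram's helper library, is asyncio's built-in debug mode, which logs a warning whenever a callback blocks the event loop for too long:

    import asyncio
    import time

    async def blocking_handler() -> None:
        time.sleep(0.5)  # stands in for accidental blocking IO

    async def main() -> None:
        loop = asyncio.get_running_loop()
        loop.set_debug(True)               # enable asyncio debug checks
        loop.slow_callback_duration = 0.1  # warn when a step takes >100 ms
        await asyncio.gather(blocking_handler(), asyncio.sleep(0))

    asyncio.run(main())

Running this prints a warning naming the task that took 0.5 seconds, exactly the kind of signal needed when hunting blocking IO in a converted codebase.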
Hung-Che Lai successfully completed the Data Analyst Nanodegree program from Udacity on October 19, 2016. The certificate, signed by Udacity CEO Sebastian Thrun, verifies that Hung-Che Lai learned data analysis skills and how to discover insights from data.
Distributed system coordination by zookeeper and introduction to kazoo python library (Jimmy Lai)
ZooKeeper is a coordination service that makes it easier to build distributed systems. In these slides, the author summarizes how ZooKeeper is used and presents the Kazoo Python library as an example.
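A minimal Kazoo sketch of one common recipe, ephemeral membership nodes, is shown below; the ZooKeeper host and paths are assumptions:

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()

    zk.ensure_path("/app/workers")
    # Ephemeral + sequence: the znode disappears when this session dies,
    # and ZooKeeper appends a unique, increasing suffix to the name.
    zk.create("/app/workers/worker-", b"ready", ephemeral=True, sequence=True)

    @zk.ChildrenWatch("/app/workers")
    def on_membership_change(children):
        print("live workers:", children)

    # ... do work; zk.stop() closes the session and removes the node.

Peers watching /app/workers see the membership list update automatically as workers join or crash, which is the building block for service discovery and leader election.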
In this talk, the speaker will demonstrate how to build a searchable knowledge base from scratch. The process includes data wrangling, entity indexing and full text search.
In these slides, we introduce the mechanism of Solr as used in the Search Engine Back End API Solution for Fast Prototyping (LDSP). You will learn how to create a new core, update the schema, and query and sort in Solr.
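As a small usage sketch, querying a core through Solr's HTTP select endpoint from Python could look like this; the core name ldsp_core, the field names, and the host are assumptions:

    import requests

    params = {
        "q": "title:python",   # search the title field
        "sort": "score desc",  # best matches first
        "rows": 10,            # page size
        "wt": "json",          # response format
    }
    resp = requests.get(
        "http://localhost:8983/solr/ldsp_core/select", params=params
    )
    for doc in resp.json()["response"]["docs"]:
        print(doc.get("id"), doc.get("title"))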
[LDSP] Search Engine Back End API Solution for Fast Prototyping (Jimmy Lai)
In these slides, I propose a solution for fast prototyping of a search engine back-end API. It consists of Linux + Django + Solr + Python (LDSP), all open-source software. The solution also provides a code repository with automation scripts, so anyone can build a search engine back-end API in seconds using LDSP.
This document provides an overview of text classification in Scikit-learn. It discusses setting up necessary packages in Ubuntu, loading and preprocessing text data from the 20 newsgroups dataset, extracting features from text using CountVectorizer and TfidfVectorizer, performing feature selection, training classification models, evaluating performance through cross-validation, and visualizing results. The goal is to classify newsgroup posts by topic using machine learning techniques in Scikit-learn.
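A condensed version of that flow, with illustrative parameter choices (k=5000 selected features, 5-fold cross-validation) rather than the slides' exact settings, might look like:

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    data = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))

    pipeline = make_pipeline(
        TfidfVectorizer(stop_words="english"),  # text -> tf-idf features
        SelectKBest(chi2, k=5000),              # keep the most informative terms
        MultinomialNB(),                        # fast text-classification baseline
    )
    scores = cross_val_score(pipeline, data.data, data.target, cv=5)
    print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")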
Big data analysis in python @ PyCon.tw 2013 (Jimmy Lai)
Big data analysis involves several processes: collection, storage, computation, analysis, and visualization. In these slides, the author demonstrates these processes by using Python tools to build a data product. The example is based on text analysis of an online forum.
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib (Jimmy Lai)
Big data analysis relies on handy tools to gain insight from data easily. In this talk, the speaker demonstrates a data mining flow for text classification using many Python tools. The flow consists of feature extraction/selection, model training/tuning, and evaluation. Various tools are used in the flow, including Pandas for feature processing, scikit-learn for classification, IPython Notebook for fast sketching, and matplotlib for visualization.
In these slides, the author demonstrates many software development practices in Python, including runtime environment setup, source code management, version control, unit testing, coding conventions, code duplication, documentation, and automation.
Fast data mining flow prototyping using IPython Notebook (Jimmy Lai)
Big data analysis requires fast prototyping of the data mining process to gain insight into data. In these slides, the author introduces how to use IPython Notebook to sketch code for the data mining stages and make observations quickly.
Apache Thrift - RPC service cross languages (Jimmy Lai)
These slides illustrate how to use Apache Thrift to build an RPC service, with example code in Python. The example scenario: we have a trained machine learning model, and we'd like to load it in advance and serve it for prediction requests.
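A bare-bones version of that server, assuming a hypothetical predictor.thrift IDL compiled with thrift --gen py, could look like this; the Predictor service, its generated module, and the pickled model path are all assumptions:

    # predictor.thrift (hypothetical):
    #     service Predictor {
    #         double predict(1: list<double> features)
    #     }
    import pickle

    from thrift.protocol import TBinaryProtocol
    from thrift.server import TServer
    from thrift.transport import TSocket, TTransport

    from predictor import Predictor  # generated by `thrift --gen py`

    class PredictorHandler:
        def __init__(self, model_path: str):
            # Load the trained model once, before serving any requests.
            with open(model_path, "rb") as f:
                self.model = pickle.load(f)

        def predict(self, features):
            return float(self.model.predict([features])[0])

    server = TServer.TSimpleServer(
        Predictor.Processor(PredictorHandler("model.pkl")),
        TSocket.TServerSocket(port=9090),
        TTransport.TBufferedTransportFactory(),
        TBinaryProtocol.TBinaryProtocolFactory(),
    )
    server.serve()  # blocks, handling one prediction request at a time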
NetworkX - python graph analysis and visualization @ PyHug (Jimmy Lai)
NetworkX is a Python package for analyzing and visualizing graphs and networks. It allows users to construct graphs from data, model network topology and examine properties like centrality and connectivity. The document provides instructions on installing NetworkX and links to tutorials, demonstrates analyzing a social network from a PTT bulletin board, and lists the top users by PageRank centrality.
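A toy version of that analysis, with a made-up "replied-to" graph instead of real PTT data, shows the PageRank step:

    import networkx as nx

    # Directed edges: an edge u -> v means user u replied to user v.
    G = nx.DiGraph()
    G.add_edges_from([
        ("alice", "bob"), ("carol", "bob"),
        ("bob", "dave"), ("alice", "dave"), ("carol", "dave"),
    ])

    rank = nx.pagerank(G, alpha=0.85)  # 0.85 is the conventional damping factor
    for user, score in sorted(rank.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{user}: {score:.3f}")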