Artwork
iconShare
 
Manage episode 514675686 series 2475293
Content provided by CCC media team. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by CCC media team or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://staging.podcastplayer.com/legal.
Data is the fossil fuel of the machine learning world, essential for developing high quality models but in limited supply. Yet institutions handling sensitive documents — such as financial, medical, or legal records often cannot fully leverage their own data due to stringent privacy, compliance, and security requirements, making training high quality models difficult. A promising solution is to replace the personally identifiable information (PII) with realistic synthetic stand-ins, whilst leaving the rest of the document in tact. In this talk, we will discuss the use of open source tools and models that can be self hosted to anonymize documents. We will go over the various approaches for Named Entity Recognition (NER) to identify sensitive entities and the use of diffusion models to inpaint anonymized content. about this event: https://talks.python-summit.ch/sps25/talk/EWMBRM/
  continue reading

2014 episodes