At Mistral AI, we are a tight-knit, nimble team dedicated to bringing our cutting-edge AI technology to the world
Our mission is to make AI ubiquitous and open
We are creative, low-ego, team-spirited, and have been passionate about AI for years
We hire people who flourish in competitive environments, because they find them more fun to work in
We hire passionate women and men from all over the world
Our teams are distributed across France, the UK, and the USA
Role Summary
We are seeking a skilled and motivated Web Crawling and Data Indexing Engineer to join our dynamic engineering team
The ideal candidate will have a strong background in web scraping, data extraction and indexing, with a focus on leveraging advanced tools and technologies to gather and process large-scale data from various web sources
The role is based in Paris or London
Key Responsibilities
Develop and maintain web crawlers using Python libraries such as Beautiful Soup to extract data from target websites (a minimal sketch follows this list)
Utilize headless browsing techniques, such as headless Chrome controlled via the Chrome DevTools Protocol, to automate and optimize data collection processes
Collaborate with cross-functional teams to identify, scrape, and integrate data from APIs to support business objectives
Create and implement efficient parsing patterns using regular expressions, XPaths, and CSS selectors to ensure accurate data extraction
Design and manage distributed job queues using technologies such as Redis, Kubernetes, and Postgres to handle large-scale data processing tasks (a queue sketch also follows this list)
Develop strategies to monitor and ensure data quality, accuracy, and integrity throughout the crawling and indexing process
Continuously improve and optimize existing web crawling infrastructure to maximize efficiency and adapt to new challenges
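To give a flavor of the crawling and parsing work described above, here is a minimal sketch of a crawler that fetches a page and extracts data with a CSS selector. The URL and the `a.title` selector are illustrative assumptions, not a real target site; a production crawler would add rate limiting, retries, and robots.txt handling

```python
import requests
from bs4 import BeautifulSoup

def crawl_titles(url: str) -> list[str]:
    """Fetch a page and return the text of every element matching a CSS selector."""
    # A production crawler would add retries, rate limiting, robots.txt
    # handling, and a polite User-Agent header
    response = requests.get(url, timeout=10)
    response.raise_for_status()

    # Parse the HTML and extract data with a CSS selector
    soup = BeautifulSoup(response.text, "html.parser")
    return [a.get_text(strip=True) for a in soup.select("a.title")]

if __name__ == "__main__":
    # example.com and the a.title selector are illustrative placeholders
    for title in crawl_titles("https://example.com/articles"):
        print(title)
```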
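Likewise, the distributed job queue responsibility could look like the following minimal sketch, assuming a local Redis instance; the queue name `crawl:jobs` and the job payload shape are hypothetical choices for illustration

```python
import json
import redis

# Assumes a Redis server on localhost; the queue name is an illustrative choice
QUEUE = "crawl:jobs"
r = redis.Redis(host="localhost", port=6379, db=0)

def enqueue(url: str) -> None:
    # Producers push crawl jobs onto a Redis list used as a FIFO queue
    r.rpush(QUEUE, json.dumps({"url": url}))

def worker() -> None:
    # Each worker blocks until a job arrives; in production many such
    # workers would run as separate processes or Kubernetes pods
    while True:
        _key, raw = r.blpop(QUEUE)
        job = json.loads(raw)
        print(f"crawling {job['url']}")  # placeholder for the actual fetch
```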
Qualifications & Profile
Bachelor’s or Master’s degree in computer science, information systems, or information technology
Strong understanding of web technologies, data structures, and algorithms
Knowledge of database management systems and data warehousing
Programming Languages: Proficiency in programming languages such as Python, Java, or C++ is essential
Mastery of Web Technologies: Understanding of HTML, CSS, and JavaScript is crucial for navigating and scraping websites
Knowledge of HTTP and HTTPS protocols
A good understanding of data structures (like queues, stacks, and hash maps) and algorithms is necessary
Knowledge of databases (SQL or NoSQL) is important to store and manage the crawled data
Understanding of distributed systems and technologies like Hadoop or Spark
Experience using web scraping libraries and frameworks like Scrapy, BeautifulSoup, Selenium, or MechanicalSoup
Understanding how search engines work and how to optimize web crawling
Experience applying machine learning to improve the efficiency and accuracy of web crawling
Familiarity with tools such as Pandas, NumPy, and Matplotlib to analyze and visualize data